longest common substring


Henry Townsend

Is there a standard algorithm or module which finds the N longest common
substrings in a set of text files?

Here's the use case: I'm trying to clean up a very old, very large, and
very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
and do a global replace. Then repeat until satisfied.
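(For the final replace step I'm picturing something like this
one-liner, run over the Makefiles; the phrase, the variable name, and
the file argument here are only placeholder examples:)

perl -pi.bak -e 's/\Q-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket\E/\$(SYS_LIBS)/g' path/to/Makefile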

I've done some searching but haven't found anything close to the above.

Thanks,
HT
 

xhoster

Henry Townsend said:
Is there a standard algorithm or module which finds the N longest common
substrings in a set of text files?

I would probably just go implement it. It would probably be faster
than looking for a pre-existing module and learning how to use it.

Here's the use case: I'm trying to clean up a very old, very large, and
very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
and do a global replace. Then repeat until satisfied.

I would probably do this with a series of perl one-liners and GNU
text utils, rather than all in Perl, but it would follow about the
same pattern as below. For memory/performance reasons, I would pick a
"reasonable" upper bound on the length of the common substrings. Once
you have a list of candidates which are 50 characters long (say), it
should be easy enough to go back by hand and see which one is the
absolute longest, but for present purposes that probably isn't even
necessary.


use strict;
use warnings;

my $max = 50;    # the longest string which is "long enough"
my $hash;

# count every window of (up to) $max characters starting at each
# position of each input line
while (my $line = <>) {
    chomp $line;
    foreach (0 .. (length $line) - 1) {
        $hash->{ substr $line, $_, $max }++;
    }
}


my $count = 0;
while (1) {
    # report substrings of the current length that occur more than once
    my @common = grep { $hash->{$_} > 1 and length == $max } keys %$hash;
    $count += @common;
    print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
    last if $count > 10 or $max == 0;
    # shorten every key by one character and try the next length down
    my %hash2;
    $max--;
    $hash2{ substr $_, 0, $max }++ foreach keys %$hash;
    $hash = \%hash2;
}
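
To run that over the whole tree, something along these lines should
work, assuming the script above is saved as, say, common.pl (the
script name and the find pattern are just placeholders):

find . -name 'Makefile*' -exec cat {} + | perl common.pl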


Xho
 

Ted Zlatanov

Is there a standard algorithm or module which finds the N longest
common substrings in a set of text files?

Here's the use case: I'm trying to clean up a very old, very large,
and very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for
semantic meaning, make up reasonable names like $(SYS_LIBS) and
$(UNIX_DEFINES), and do a global replace. Then repeat until satisfied.

Seems like what you really want is to find what words commonly follow
each other. In other words, it would be good to know that -lcurses
usually follows -ltermlib. That will let you build sets of associated
words (so, for instance, "-lcurses -ldl -lnsl" and "-lnsl -lcurses
-ldl" will be noticed).

Doing that is fairly easy. Below is a script that reads from standard
input or a set of files and prints the contents of %h at the end (a
rough sketch of the filtering step follows the script). From %h you
should be able to filter by how often one word follows another. You
can also build sets of words that are strongly associated, meaning
they are likely to appear together. That will be much more useful, I
think, than common substrings, which would break on whitespace and on
the order of the options.

I figured that order doesn't matter so I always sort the list of the
two words. That way, "a b" and "b a" will count the same.

Ted

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;

# slurp stdin or the named files and split everything into words
my @words = split ' ', join('', <>);

my %h; # this will hold the frequencies
while (exists $words[1])
{
    # count each adjacent pair, sorted so "a b" and "b a" share a bucket
    my ($w1, $w2) = sort @words[0, 1];

    $h{$w1}->{$w2}++;

    shift @words;
}

print Dumper \%h;
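
A rough sketch of the filtering step mentioned above, which you could
drop in at the end in place of the Dumper line (the cutoff of 5 is an
arbitrary placeholder):

my $min = 5;    # only report pairs seen at least this often
for my $w1 (sort keys %h) {
    for my $w2 (sort keys %{ $h{$w1} }) {
        printf "%d\t%s %s\n", $h{$w1}{$w2}, $w1, $w2
            if $h{$w1}{$w2} >= $min;
    }
}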
 

xhoster

my $count = 0;
while (1) {
    my @common = grep { $hash->{$_} > 1 and length == $max } keys %$hash;
    $count += @common;
    print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
    last if $count > 10 or $max == 0;
    my %hash2;
    $max--;
    $hash2{ substr $_, 0, $max }++ foreach keys %$hash;

But that last line should be:

$hash2{ substr $_, 0, $max } += $hash->{$_} foreach keys %$hash;

Otherwise each shortened key only counts how many longer keys it came
from, instead of carrying over their occurrence counts.
 
