longest common substring


Henry Townsend

Is there a standard algorithm or module which finds the N longest common
substrings in a set of text files?

Here's the use case: I'm trying to clean up a very old, very large, and
very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
and do a global replace. Then repeat until satisfied.
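(For the final replace step I'm picturing something like this
one-liner, run over the Makefiles; the phrase, the variable name, and
the file argument here are only placeholder examples:)

perl -pi.bak -e 's/\Q-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket\E/\$(SYS_LIBS)/g' path/to/Makefile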

I've done some searching but haven't found anything close to the above.

Thanks,
HT
 

xhoster

Henry Townsend said:
Is there a standard algorithm or module which finds the N longest common
substrings in a set of text files?

I would probably just go implement it. It would probably be faster
than looking for a pre-existing module and learning how to use it.

Here's the use case: I'm trying to clean up a very old, very large, and
very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
and do a global replace. Then repeat until satisfied.

I would probably do this with a series of perl one-liners and GNU
text utils, rather than all in Perl, but it would follow about the
same pattern as below. For memory/performance reasons, I would pick a
"reasonable" upper bound on the length of the common substrings. Once
you have a list of candidates which are 50 characters long (say), it
should be easy enough to go back by hand and see which one is the
absolute longest, but for present purposes that probably isn't even
necessary.


use strict;
use warnings;

my $max = 50;    # the longest string which is "long enough"
my $hash;

# count every window of (up to) $max characters starting at each
# position of each input line
while (my $line = <>) {
    chomp $line;
    foreach (0 .. (length $line) - 1) {
        $hash->{ substr $line, $_, $max }++;
    }
}


my $count = 0;
while (1) {
    # report substrings of the current length that occur more than once
    my @common = grep { $hash->{$_} > 1 and length == $max } keys %$hash;
    $count += @common;
    print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
    last if $count > 10 or $max == 0;
    # shorten every key by one character and try the next length down
    my %hash2;
    $max--;
    $hash2{ substr $_, 0, $max }++ foreach keys %$hash;
    $hash = \%hash2;
}
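
To run that over the whole tree, something along these lines should
work, assuming the script above is saved as, say, common.pl (the
script name and the find pattern are just placeholders):

find . -name 'Makefile*' -exec cat {} + | perl common.pl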


Xho
 

Ted Zlatanov

Is there a standard algorithm or module which finds the N longest
common substrings in a set of text files?

Here's the use case: I'm trying to clean up a very old, very large,
and very ugly build system which has thousands of unparameterized
compile/link commands in hundreds of Makefiles. I want to search them
for frequently-occurring long substrings. Hopefully this will turn up
phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
"-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for
semantic meaning, make up reasonable names like $(SYS_LIBS) and
$(UNIX_DEFINES), and do a global replace. Then repeat until satisfied.

Seems like what you really want is to find what words commonly follow
each other. In other words, it would be good to know that -lcurses
usually follows -ltermlib. That will let you build sets of associated
words (so, for instance, "-lcurses -ldl -lnsl" and "-lnsl -lcurses
-ldl" will be noticed).

Doing that is fairly easy. Below is a script that reads from standard
input or a set of files and prints the contents of %h at the end (a
rough sketch of the filtering step follows the script). From %h you
should be able to filter by how often one word follows another. You
can also build sets of words that are strongly associated, meaning
they are likely to appear together. That will be much more useful, I
think, than common substrings, which would break on whitespace and on
the order of the options.

I figured that order doesn't matter so I always sort the list of the
two words. That way, "a b" and "b a" will count the same.

Ted

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;

# slurp stdin or the named files and split everything into words
my @words = split ' ', join('', <>);

my %h; # this will hold the frequencies
while (exists $words[1])
{
    # count each adjacent pair, sorted so "a b" and "b a" share a bucket
    my ($w1, $w2) = sort @words[0, 1];

    $h{$w1}->{$w2}++;

    shift @words;
}

print Dumper \%h;
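
A rough sketch of the filtering step mentioned above, which you could
drop in at the end in place of the Dumper line (the cutoff of 5 is an
arbitrary placeholder):

my $min = 5;    # only report pairs seen at least this often
for my $w1 (sort keys %h) {
    for my $w2 (sort keys %{ $h{$w1} }) {
        printf "%d\t%s %s\n", $h{$w1}{$w2}, $w1, $w2
            if $h{$w1}{$w2} >= $min;
    }
}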
 

xhoster

my $count = 0;
while (1) {
    my @common = grep { $hash->{$_} > 1 and length == $max } keys %$hash;
    $count += @common;
    print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
    last if $count > 10 or $max == 0;
    my %hash2;
    $max--;
    $hash2{ substr $_, 0, $max }++ foreach keys %$hash;

But that last line should be:

$hash2{ substr $_, 0, $max } += $hash->{$_} foreach keys %$hash;

Otherwise each shortened key only counts how many longer keys it came
from, instead of carrying over their occurrence counts.
 
