longest common substring

Discussion in 'Perl Misc' started by Henry Townsend, Nov 1, 2006.

  1. Is there a standard algorithm or module which finds the N longest common
    substrings in a set of text files?

    Here's the use case: I'm trying to clean up a very old, very large, and
    very ugly build system which has thousands of unparameterized
    compile/link commands in hundreds of Makefiles. I want to search them
    for frequently-occurring long substrings. Hopefully this will turn up
    phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
    "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
    meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
    and do a global replace. Then repeat until satisfied.

    I've done some searching but haven't found anything close to the above.

    Thanks,
    HT
     
    Henry Townsend, Nov 1, 2006
    #1
    1. Advertising

  2. Henry Townsend

    Guest

    Henry Townsend <> wrote:
    > Is there a standard algorithm or module which finds the N longest common
    > substrings in a set of text files?


    I would probably just go implement it. It would be probably be faster than
    looking for a pre-existing module and learning how to use it.


    > Here's the use case: I'm trying to clean up a very old, very large, and
    > very ugly build system which has thousands of unparameterized
    > compile/link commands in hundreds of Makefiles. I want to search them
    > for frequently-occurring long substrings. Hopefully this will turn up
    > phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
    > "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for semantic
    > meaning, make up reasonable names like $(SYS_LIBS) and $(UNIX_DEFINES),
    > and do a global replace. Then repeat until satisfied.


    I would probably do this with a series of perl one liners and gnu text
    utils, rather than all in Perl, but it would follow about the same pattern
    as below. For memory/performance reasons, I would pick a "reasonable" upper
    bound on the length of the common substrings. Once you have a list of
    candidates which are over 50 (say), then it should be easy enough to go
    back by hand and see which one is the absolute longest, but for present
    purposes that probably isn't even necessary.


    use strict;
    use warnings;
    my $max=50; # the longest string which is "long enough"
    my $hash;
    while (my $line=<>) { chomp $line;
    foreach (0..(length $line)-1) {
    $hash->{substr $line, $_, $max}++;
    }
    };


    my $count=0;
    while(1) {
    my @common= grep {$hash->{$_}>1 and length == $max} keys %$hash;
    $count+=@common;
    print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
    last if $count>10 or $max==0;
    my %hash2;
    $max--;
    $hash2{substr $_, 0, $max}++ foreach keys %$hash;
    $hash=\%hash2;
    }


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Nov 1, 2006
    #2
    1. Advertising

  3. Henry Townsend

    Ted Zlatanov Guest

    On 1 Nov 2006, wrote:

    > Is there a standard algorithm or module which finds the N longest
    > common substrings in a set of text files?
    >
    > Here's the use case: I'm trying to clean up a very old, very large,
    > and very ugly build system which has thousands of unparameterized
    > compile/link commands in hundreds of Makefiles. I want to search them
    > for frequently-occurring long substrings. Hopefully this will turn up
    > phrases like "-lrpcsvc -ltermlib -lcurses -ldl -lnsl -lsocket" or
    > "-DUNIX -DANSI -DUSE_SOCKETS". I would then evaluate these for
    > semantic meaning, make up reasonable names like $(SYS_LIBS) and
    > $(UNIX_DEFINES), and do a global replace. Then repeat until satisfied.


    Seems like what you really want is to find what words commonly follow
    each other. In other words, it would be good to know that -lcurses
    usually follows -ltermlib. That will let you build sets of associated
    words (so, for instance, "-lcurses -ldl -lnsl" and "-lnsl -lcurses
    -ldl" will be noticed).

    To do that is fairly easy. Here I will show you with a script that
    reads from standard input or a set of files, and prints the contents
    of %h at the end. So you should be able to filter this by number of
    words that follow each other. You can also make sets of words that
    are strongly associated, meaning that they are likely to appear
    together. This will be much more useful, I think, than common
    substrings, which would break on space and order of options.

    I figured that order doesn't matter so I always sort the list of the
    two words. That way, "a b" and "b a" will count the same.

    Ted

    #!/usr/bin/perl

    use warnings;
    use strict;
    use Data::Dumper;

    my @words = split ' ', join('', <>);

    my %h; # this will hold the frequencies
    while (exists $words[1])
    {
    my ($w1, $w2) = sort @words[0,1];

    $h{$w1}->{$w2}++;

    shift @words;
    }

    print Dumper \%h;
     
    Ted Zlatanov, Nov 1, 2006
    #3
  4. Henry Townsend

    Guest

    wrote:
    >
    > my $count=0;
    > while(1) {
    > my @common= grep {$hash->{$_}>1 and length == $max} keys %$hash;
    > $count+=@common;
    > print length $_, "\t$hash->{$_}\t$_\n" foreach @common;
    > last if $count>10 or $max==0;
    > my %hash2;
    > $max--;
    > $hash2{substr $_, 0, $max}++ foreach keys %$hash;


    But this should be:

    $hash2{substr $_, 0, $max}+=$hash->{$_} foreach keys %$hash;

    > $hash=\%hash2;
    > }
    >
    > Xho


    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Nov 1, 2006
    #4
  5. Henry Townsend

    Peter Scott Guest

    Peter Scott, Nov 2, 2006
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Denny
    Replies:
    1
    Views:
    809
  2. Ruby Quiz
    Replies:
    52
    Views:
    694
  3. Ruby Quiz
    Replies:
    0
    Views:
    189
    Ruby Quiz
    Jan 24, 2008
  4. Replies:
    3
    Views:
    224
    Sherm Pendley
    Aug 3, 2005
  5. John Reye
    Replies:
    28
    Views:
    1,406
    Tim Rentsch
    May 8, 2012
Loading...

Share This Page