Semi OT: Uniquely Identifying Substrings for an Elem in a Set: substr, Sets and Complexity

Discussion in 'Perl Misc' started by Veli-Pekka Tätilä, Aug 21, 2005.

  1. Hi,
    Here's a text processing problem that came to mind based on finding a
    certain set of sub strings. As I came up with this one on my own
    incidentally, I don't know any real good words for Googling useful
    information or solutions to the problem and thought I'd ask here then,
    having seen a number of basic string processing threads during my short-time
    presence.

    The Problem:

    Suppose we have a set S of character strings. Each string in the set is
    unique (well sets are), may be of arbitrary length and is composed of some
    known alphabet (e.g. a to z). The task is to write a function that takes an
    element E in S and returns all sub strings of E that uniquely identify E in
    S. Put another way, all the returned strings are sub strings of E but they
    may *not* be sub strings of any other element in S.

    Some Thoughts:

    Upon closer inspection, I also found out that the set of sub strings may
    contain an element that is equal to E if E differs from some other string in
    S by only one character. In that case, E as a whole is the only substring
    identifying E.

    Now, I must confess the problem is not inherently related to Perl in any way
    thus the Semi OT tag in the subject-line. Perl would be the language of
    choice for implementation personally, however, and thus some questions about
    that, too.

    There are bound to be better methods than the obvious brute-force tac of
    trying everything. Howabout the string matching, is substr the way to go or
    am I better off trying to hack regular expressions to do the job? If I'm
    aiming at low temporal complexity, what would be the worst case complexity
    for this problem? Surely not O(n). Performance is not an issue but I'm just
    curious and not guru enough in math or computer science to figure it out
    myself.

    I'm certain the solution is relatively simple. I didn't find anything that
    useful in the Perl Cookbook. The problem is not exactly a straight
    intersection, difference etc... of two arrays but rather the set will
    include elements that are not directly in the array being processed, though
    they are based on an element of the array.

    Motivation:

    By the way, the thing that got me into pondering this problem in the first
    place is somewhat related. I was writing a real crude mini-shell for a
    textmode app of mine and in one command the user had to choose a very long
    element (voice name) from a long list. To make the process easier, I allowed
    selection based on a partial match provided that only one element contained
    the sub string in question. Otherwise the matching elements would be listed.
    Then I started to think if I could generate all the strings that select a
    given element programatically, read lazily, and here's this post you're
    reading.

    PS: I had a hard time titling this post appropriately. Feel free to change
    the subject to something more descriptive.

    --
    With kind regards Veli-Pekka Tätilä ()
    Accessibility, game music, synthesizers and programming:
    http://www.student.oulu.fi/~vtatila/
     
    Veli-Pekka Tätilä, Aug 21, 2005
    #1
    1. Advertising

  2. Re: Semi OT: Correction about Perl Funcs: ment index not substr

    Hi again,
    First mistake spotted. When talking about the perl functions I naturally
    ment index rather than substr. I think this does come down to my English
    skills and the semantics. Sub string means any part of a larger original
    string. yet the substr function won't tell if a given substring is found but
    rather extracts a substring based on character indeces. Maybe the names
    cutString, piece or contains would be more descriptive. Well, my bad as
    usual.

    PS: Please reply to my original message in most cases. If someone wants to
    discuss Perl's naming further, do at least change the subject.

    With kind regards Veli-Pekka Tätilä ()
    Accessibility, game music, synthesizers and programming:
    http://www.student.oulu.fi/~vtatila/
     
    Veli-Pekka Tätilä, Aug 21, 2005
    #2
    1. Advertising

  3. Veli-Pekka Tätilä

    Guest

    "Veli-Pekka Tätilä" <> wrote:
    > Hi,
    > Here's a text processing problem that came to mind based on finding a
    > certain set of sub strings. As I came up with this one on my own
    > incidentally, I don't know any real good words for Googling useful
    > information or solutions to the problem and thought I'd ask here then,
    > having seen a number of basic string processing threads during my
    > short-time presence.
    >
    > The Problem:
    >
    > Suppose we have a set S of character strings. Each string in the set is
    > unique (well sets are), may be of arbitrary length and is composed of
    > some known alphabet (e.g. a to z). The task is to write a function that
    > takes an element E in S and returns all sub strings of E that uniquely
    > identify E in S. Put another way, all the returned strings are sub
    > strings of E but they may *not* be sub strings of any other element in S.


    This was discussed here two months ago, and someone claimed Text::Ngrams
    was the/a solution. I don't whether it actually is or not, but you may
    want to look up the original thread.


    > Upon closer inspection, I also found out that the set of sub strings may
    > contain an element that is equal to E if E differs from some other string
    > in S by only one character. In that case, E as a whole is the only
    > substring identifying E.


    What if E itself is a substring of some other element of S? Then not even
    E as a whole identifies E!

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Aug 21, 2005
    #3
  4. Also sprach Veli-Pekka Tätilä:

    > Hi,
    > Here's a text processing problem that came to mind based on finding a
    > certain set of sub strings. As I came up with this one on my own
    > incidentally, I don't know any real good words for Googling useful
    > information or solutions to the problem and thought I'd ask here then,
    > having seen a number of basic string processing threads during my short-time
    > presence.
    >
    > The Problem:
    >
    > Suppose we have a set S of character strings. Each string in the set is
    > unique (well sets are), may be of arbitrary length and is composed of some
    > known alphabet (e.g. a to z). The task is to write a function that takes an
    > element E in S and returns all sub strings of E that uniquely identify E in
    > S. Put another way, all the returned strings are sub strings of E but they
    > may *not* be sub strings of any other element in S.
    >
    > Some Thoughts:
    >
    > Upon closer inspection, I also found out that the set of sub strings may
    > contain an element that is equal to E if E differs from some other string in
    > S by only one character. In that case, E as a whole is the only substring
    > identifying E.
    >
    > Now, I must confess the problem is not inherently related to Perl in any way
    > thus the Semi OT tag in the subject-line. Perl would be the language of
    > choice for implementation personally, however, and thus some questions about
    > that, too.
    >
    > There are bound to be better methods than the obvious brute-force tac of
    > trying everything. Howabout the string matching, is substr the way to go or
    > am I better off trying to hack regular expressions to do the job? If I'm
    > aiming at low temporal complexity, what would be the worst case complexity
    > for this problem? Surely not O(n). Performance is not an issue but I'm just
    > curious and not guru enough in math or computer science to figure it out
    > myself.


    With a set of $n words, I would assume the complexity of a brute-force
    approach to be

    O( $n * ($n-1) * ($avg * ($avg-1) / 2) )

    where $avg is the average word length, the term ($avg * ($avg-1) / 2)
    being the number of substrings for an average word and ($n-1) the number
    of words to test each substring against. As the average number of
    substrings doesn't depend on the size of the set, it can be assumed to
    be some constant in which case the algorithm is in O($n**2) for all
    words and subsequently O($n) for retrieving the unique substrings for
    one word.

    Note that the question whether to use a pattern match or index() or
    whatever doesn't really change the algorithm. In the one below I used
    index() but could have done it just as easily using a pattern match.

    Here's an implementation of such a brute-force algorithm which gave me
    the opportunity to make use of Eric J. Roode's Iterator module recently
    uploaded to the CPAN:

    #!/usr/bin/perl -w

    use strict;
    use Iterator;
    use List::MoreUtils qw/all/;

    chomp( my @words = <DATA> );

    for (@words) {
    print "Unique ids for $_:\n ";
    print join ", ", uniq_id($_, \@words);
    print "\n";
    }

    sub uniq_id {
    my ($word, $set) = @_;
    my $iter = do {
    my ($len, $off) = (1, 0);
    my $i = Iterator->new(
    sub {
    Iterator::is_done() if $len > length $word;
    my $sub = substr $word, $off++, $len;
    ($len, $off) = ($len + 1, 0)
    if $off + $len > length $word;
    return $sub;
    }
    );
    $i;
    };
    my @ids;
    while ($iter->isnt_exhausted) {
    my $pat = $iter->value;
    push @ids, $pat
    if all { index($_, $pat) == -1 } grep { $_ ne $word } @$set;
    }
    return @ids;
    }

    __DATA__
    abc
    bcd
    foobar
    foo
    bar
    pearl
    perl
    modperl

    One optimization immediately springs to my mind (unfortunately after
    I've written the above):

    The iterator returns the substrings in ascending order length-wise.
    However, it's always so that if there is no unique substring of length
    $m, there cannot possibly be one of $m-1. Therefore, have the iterator
    return the substrings in descending order and rewrite the 'all { } grep
    { } @$set' condition into a real loop that exits early as soon as none
    of the substrings of a given length are unique.

    This does not really change the complexity of the algorithm as it only
    makes the constant ($avg * ($avg-1) / 2) smaller. For small sets though
    this constant is the dominant parameter.

    > Motivation:
    >
    > By the way, the thing that got me into pondering this problem in the first
    > place is somewhat related. I was writing a real crude mini-shell for a
    > textmode app of mine and in one command the user had to choose a very long
    > element (voice name) from a long list. To make the process easier, I allowed
    > selection based on a partial match provided that only one element contained
    > the sub string in question. Otherwise the matching elements would be listed.
    > Then I started to think if I could generate all the strings that select a
    > given element programatically, read lazily, and here's this post you're
    > reading.


    Something like TAB-completion offered by many shells? That would
    decrease the complexity somewhat as you'd only have to test substrings
    with offset zero (that is, at the beginning of the word).

    > PS: I had a hard time titling this post appropriately. Feel free to change
    > the subject to something more descriptive.


    Your Subject was fine. You can't normally squeeze the whole content of a
    posting into the Subject. If that was possible, postings wouldn't need a
    body and we'd all communicate here by exchanging Subjects. :)

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
     
    Tassilo v. Parseval, Aug 21, 2005
    #4
  5. Veli-Pekka Tätilä

    Anno Siegel Guest

    Veli-Pekka Tätilä <> wrote in comp.lang.perl.misc:
    > Hi,
    > Here's a text processing problem that came to mind based on finding a
    > certain set of sub strings. As I came up with this one on my own
    > incidentally, I don't know any real good words for Googling useful
    > information or solutions to the problem and thought I'd ask here then,
    > having seen a number of basic string processing threads during my short-time
    > presence.
    >
    > The Problem:
    >
    > Suppose we have a set S of character strings. Each string in the set is
    > unique (well sets are), may be of arbitrary length and is composed of some
    > known alphabet (e.g. a to z). The task is to write a function that takes an
    > element E in S and returns all sub strings of E that uniquely identify E in
    > S. Put another way, all the returned strings are sub strings of E but they
    > may *not* be sub strings of any other element in S.


    Here is a brute-force solution. Generate all substrings of all strings
    in the set and remember the string each substring was generated from.
    (That happens in the hash %coll.) Then the substrings that were generated
    from exactly one string are the ones you're looking for.

    If you have to find solutions to your problem repeatedly (with the same
    basic set of strings), my solution may even be practical. Its main
    virtue is that the algortithm is so simple, it is almost certainly correct.

    With a bow to Tassilo for providing a good example set:

    #!/usr/bin/perl
    use strict; use warnings; $| = 1;

    my %coll;
    for my $str ( <DATA> ) {
    chomp $str;
    push @{ $coll{ $_}}, $str for substrings( $str);
    }

    # postprocessing for nice output
    my %abbrev;
    @{ $coll{ $_}} == 1 and push @{ $abbrev{ $coll{ $_}->[ 0]}}, $_ for
    keys %coll;

    print "$_ -> @{ $abbrev{ $_}}\n" for sort keys %abbrev;

    #################################################################

    sub substrings {
    my $str = shift;
    map substr_n( $str, $_), 0 .. length $str;
    }

    sub substr_n {
    my ($str, $l) = @_;
    map substr( $str, $_, $l), 0 .. length( $str) - $l;
    }

    __DATA__
    abc
    bcd
    foobar
    foo
    bar
    pearl
    perl
    modperl

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
     
    Anno Siegel, Aug 22, 2005
    #5
  6. Veli-Pekka Tätilä

    Guest

    -berlin.de (Anno Siegel) wrote:
    > >
    > > The Problem:
    > >
    > > Suppose we have a set S of character strings. Each string in the set is
    > > unique (well sets are), may be of arbitrary length and is composed of
    > > some known alphabet (e.g. a to z). The task is to write a function that
    > > takes an element E in S and returns all sub strings of E that uniquely
    > > identify E in S. Put another way, all the returned strings are sub
    > > strings of E but they may *not* be sub strings of any other element in
    > > S.

    >
    > Here is a brute-force solution. Generate all substrings of all strings
    > in the set and remember the string each substring was generated from.
    > (That happens in the hash %coll.) Then the substrings that were
    > generated from exactly one string are the ones you're looking for.
    >
    > If you have to find solutions to your problem repeatedly (with the same
    > basic set of strings), my solution may even be practical. Its main
    > virtue is that the algortithm is so simple, it is almost certainly
    > correct.


    :)

    If you add 'mississippi' to the list, it misses 'iss' as a identifying
    substring (and I suppose others), due to the internal duplication.

    > With a bow to Tassilo for providing a good example set:
    >
    > #!/usr/bin/perl
    > use strict; use warnings; $| = 1;
    >
    > my %coll;
    > for my $str ( <DATA> ) {
    > chomp $str;
    > push @{ $coll{ $_}}, $str for substrings( $str);
    > }


    You should either use a hash of hash rather than a hash of array, (which
    seems to me like it would be a more natural anyway) or change substrings so
    that it only returns one copy of each substring for any given invocation.


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Aug 22, 2005
    #6
  7. Veli-Pekka Tätilä

    Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    > -berlin.de (Anno Siegel) wrote:
    > > >
    > > > The Problem:
    > > >
    > > > Suppose we have a set S of character strings. Each string in the set is
    > > > unique (well sets are), may be of arbitrary length and is composed of
    > > > some known alphabet (e.g. a to z). The task is to write a function that
    > > > takes an element E in S and returns all sub strings of E that uniquely
    > > > identify E in S. Put another way, all the returned strings are sub
    > > > strings of E but they may *not* be sub strings of any other element in
    > > > S.

    > >
    > > Here is a brute-force solution. Generate all substrings of all strings
    > > in the set and remember the string each substring was generated from.
    > > (That happens in the hash %coll.) Then the substrings that were
    > > generated from exactly one string are the ones you're looking for.
    > >
    > > If you have to find solutions to your problem repeatedly (with the same
    > > basic set of strings), my solution may even be practical. Its main
    > > virtue is that the algortithm is so simple, it is almost certainly
    > > correct.

    >
    > :)
    >
    > If you add 'mississippi' to the list, it misses 'iss' as a identifying
    > substring (and I suppose others), due to the internal duplication.


    Oh dear! Good thing I said "almost certainly".

    [Advice snipped but incorporated below]

    Anno

    #!/usr/bin/perl
    use strict; use warnings; $| = 1;

    my %coll;
    for my $str ( <DATA> ) {
    chomp $str;
    $coll{ $_}->{ $str} = 1 for substrings( $str);
    }

    # postprocessing for nice output
    my %abbrev;

    for ( keys %coll ) {
    if ( ( my ( $str) = keys %{ $coll{ $_}}) == 1 ) {
    push @{ $abbrev{ $str}}, $_;
    }
    }

    @$_ = sort { length( $a) <=> length( $b) || $a cmp $b } @$_ for values %abbrev;

    print "$_ -> @{ $abbrev{ $_}}\n" for sort keys %abbrev;

    #################################################################

    sub substrings {
    my $str = shift;
    map substr_n( $str, $_), 0 .. length $str;
    }

    sub substr_n {
    my ($str, $l) = @_;
    map substr( $str, $_, $l), 0 .. length( $str) - $l;
    }

    __DATA__
    abc
    bcd
    foobar
    foo
    bar
    pearl
    perl
    modperl
    mississippi
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
     
    Anno Siegel, Aug 23, 2005
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    2
    Views:
    638
    Curt_C [MVP]
    Jul 27, 2004
  2. Alan Silver

    How do I uniquely identify a control?

    Alan Silver, Feb 24, 2005, in forum: ASP .Net
    Replies:
    6
    Views:
    474
    Alan Silver
    Feb 24, 2005
  3. James Stroud

    py2app semi-standalone semi-works

    James Stroud, Oct 4, 2006, in forum: Python
    Replies:
    2
    Views:
    736
    James Stroud
    Oct 4, 2006
  4. Walter Roberson

    Time Complexity for substr() function

    Walter Roberson, May 13, 2004, in forum: Perl Misc
    Replies:
    1
    Views:
    151
    Edward Wijaya
    May 13, 2004
  5. Ferrous Cranus
    Replies:
    58
    Views:
    507
    Dave Angel
    Jan 24, 2013
Loading...

Share This Page