Pattern Searching

Discussion in 'Perl Misc' started by Shiraz, Jun 22, 2005.

  1. Shiraz

    Shiraz Guest

    This is my goal, if i have an input file:

    BEGIN----------
    Smith
    Johnathan
    Smiths
    Mona
    Moynaham
    END------------

    i want to generate a minimum match so that with the minimum match
    string, each entry can be identified. the output would be:

    BEGIN----------
    Smith - Smith
    Smiths - Smiths or hs
    Mona - Mon or ona
    Moynaham - Moy or nah or aha or ham
    END------------

    Is there any built-in function to do such analysis in perl?
     
    Shiraz, Jun 22, 2005
    #1
    1. Advertising

  2. "Shiraz" <> wrote in news:1119479503.935902.303370
    @z14g2000cwz.googlegroups.com:

    > This is my goal, if i have an input file:
    >
    > BEGIN----------
    > Smith
    > Johnathan
    > Smiths
    > Mona
    > Moynaham
    > END------------
    >
    > i want to generate a minimum match so that with the minimum match
    > string, each entry can be identified. the output would be:
    >
    > BEGIN----------
    > Smith - Smith
    > Smiths - Smiths or hs
    > Mona - Mon or ona
    > Moynaham - Moy or nah or aha or ham
    > END------------
    >
    > Is there any built-in function to do such analysis in perl?


    No.

    But you can always browse through CPAN to see if someone has written a
    module to help with this task.

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Jun 23, 2005
    #2
    1. Advertising

  3. Shiraz

    Arne Ruhnau Guest

    Shiraz wrote:
    > This is my goal, if i have an input file:
    >
    > BEGIN----------
    > Smith
    > Johnathan
    > Smiths
    > Mona
    > Moynaham
    > END------------
    >
    > i want to generate a minimum match so that with the minimum match
    > string, each entry can be identified. the output would be:
    >
    > BEGIN----------
    > Smith - Smith
    > Smiths - Smiths or hs
    > Mona - Mon or ona
    > Moynaham - Moy or nah or aha or ham
    > END------------
    >
    > Is there any built-in function to do such analysis in perl?


    Hmm... my first idea would be to, for every entry, extract every possible
    2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)
    and count them.
    For Smith, this would become
    Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
    and for Smiths
    Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths

    (thus, there are more unique-to-the-entry substrings than you listed above)

    Of course, one would have to remember which entry generated the ngram just
    counted. Afterwards, you can grep all those from the resulting tree whose
    terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
    associated word.

    hth,

    Arne Ruhnau
     
    Arne Ruhnau, Jun 23, 2005
    #3
  4. Shiraz

    Anno Siegel Guest

    Arne Ruhnau <> wrote in comp.lang.perl.misc:
    > Shiraz wrote:
    > > This is my goal, if i have an input file:
    > >
    > > BEGIN----------
    > > Smith
    > > Johnathan
    > > Smiths
    > > Mona
    > > Moynaham
    > > END------------
    > >
    > > i want to generate a minimum match so that with the minimum match
    > > string, each entry can be identified. the output would be:
    > >
    > > BEGIN----------
    > > Smith - Smith
    > > Smiths - Smiths or hs
    > > Mona - Mon or ona
    > > Moynaham - Moy or nah or aha or ham
    > > END------------
    > >
    > > Is there any built-in function to do such analysis in perl?

    >
    > Hmm... my first idea would be to, for every entry, extract every possible
    > 2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)


    Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
    in the list above.

    > and count them.


    > For Smith, this would become
    > Sm, mi, it, th, Smi, mit, ith, Smit, mith, Smith
    > and for Smiths
    > Sm, mi, it, th, hs, Smi, mit, ith, ths, Smit, mith, iths, Smith, miths, Smiths
    >
    > (thus, there are more unique-to-the-entry substrings than you listed above)
    >
    > Of course, one would have to remember which entry generated the ngram just
    > counted. Afterwards, you can grep all those from the resulting tree whose
    > terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
    > associated word.


    If you know the substring is unique (i.e. is a substring of only one of
    the names) you don't *have* to remember which name generated which
    substring. You can find it by looking which of the names contains the
    substring.

    Anno
     
    Anno Siegel, Jun 23, 2005
    #4
  5. Shiraz

    Arne Ruhnau Guest

    Anno Siegel wrote:
    > Arne Ruhnau <> wrote in comp.lang.perl.misc:
    >
    >>Shiraz wrote:
    >>
    >>>This is my goal, if i have an input file:
    >>>
    >>>BEGIN----------
    >>>Smith
    >>>Johnathan
    >>>Smiths
    >>>Mona
    >>>Moynaham
    >>>END------------
    >>>
    >>>i want to generate a minimum match so that with the minimum match
    >>>string, each entry can be identified. the output would be:
    >>>
    >>>BEGIN----------
    >>>Smith - Smith
    >>>Smiths - Smiths or hs
    >>>Mona - Mon or ona
    >>>Moynaham - Moy or nah or aha or ham
    >>>END------------
    >>>
    >>>Is there any built-in function to do such analysis in perl?

    >>
    >>Hmm... my first idea would be to, for every entry, extract every possible
    >>2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)

    >
    > Why only strings of length 2 or more? "y" uniquely identifies "Moynaham"
    > in the list above.


    Well, I thought unique unigrams could be too sparse / non-existent, so that
    there is no point in looking at them anyway. But for completeness/maximally
    short unique substrings, they should be included.

    <snip>
    >>Of course, one would have to remember which entry generated the ngram just
    >>counted. Afterwards, you can grep all those from the resulting tree whose
    >>terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
    >>associated word.

    >
    > If you know the substring is unique (i.e. is a substring of only one of
    > the names) you don't *have* to remember which name generated which
    > substring. You can find it by looking which of the names contains the
    > substring.


    Correct, hadn't thought of that... But if you want to make some kind of
    look-up based on the unique substring, returning the string identified by
    it, you have to know it (to prevent yourself from scanning the original
    list). Or am I getting it wrong?

    Arne
     
    Arne Ruhnau, Jun 23, 2005
    #5
  6. Shiraz

    Anno Siegel Guest

    Arne Ruhnau <> wrote in comp.lang.perl.misc:
    > Anno Siegel wrote:
    > > Arne Ruhnau <> wrote in

    > comp.lang.perl.misc:
    > >
    > >>Shiraz wrote:
    > >>
    > >>>This is my goal, if i have an input file:
    > >>>
    > >>>BEGIN----------
    > >>>Smith
    > >>>Johnathan
    > >>>Smiths
    > >>>Mona
    > >>>Moynaham
    > >>>END------------
    > >>>
    > >>>i want to generate a minimum match so that with the minimum match
    > >>>string, each entry can be identified. the output would be:
    > >>>
    > >>>BEGIN----------
    > >>>Smith - Smith
    > >>>Smiths - Smiths or hs
    > >>>Mona - Mon or ona
    > >>>Moynaham - Moy or nah or aha or ham
    > >>>END------------
    > >>>
    > >>>Is there any built-in function to do such analysis in perl?
    > >>
    > >>Hmm... my first idea would be to, for every entry, extract every possible
    > >>2..length($entry) - Gram (i.e. each sequence of 2..length($entry) letters)


    [...]

    > >>Of course, one would have to remember which entry generated the ngram just
    > >>counted. Afterwards, you can grep all those from the resulting tree whose
    > >>terminal nodes (i.e. counts) are exactly one, i.e. which were unique to the
    > >>associated word.

    > >
    > > If you know the substring is unique (i.e. is a substring of only one of
    > > the names) you don't *have* to remember which name generated which
    > > substring. You can find it by looking which of the names contains the
    > > substring.

    >
    > Correct, hadn't thought of that... But if you want to make some kind of
    > look-up based on the unique substring, returning the string identified by
    > it, you have to know it (to prevent yourself from scanning the original
    > list). Or am I getting it wrong?


    Sure, you probably will eventually build a lookup table for the unique
    substrings, but you don't have to keep track of each substring's origin
    while you find them.

    Anno
     
    Anno Siegel, Jun 23, 2005
    #6
  7. Shiraz

    Shiraz Guest

    Arne, i appreciate the input.... i'll see what kind of tuning can be
    done to accomplish that fast since the input file maybe as big as 50000
    lines at a time.....

    i'll put up the code in a couple of days.

    Thanks all
     
    Shiraz, Jun 23, 2005
    #7
  8. Shiraz wrote:

    > Arne, i appreciate the input....


    What input was that? Please quote some context when posting replies.

    > done to accomplish that fast since the input file maybe as big as 50000
    > lines at a time.....


    50000 is really not that huge a number.

    > i'll put up the code in a couple of days.


    Please read the posting guidelines for this group by then.

    Sinan
     
    A. Sinan Unur, Jun 23, 2005
    #8
  9. Shiraz

    Guest

    "Shiraz" <> wrote:
    > This is my goal, if i have an input file:
    >
    > BEGIN----------
    > Smith
    > Johnathan
    > Smiths
    > Mona
    > Moynaham
    > END------------
    >
    > i want to generate a minimum match so that with the minimum match
    > string, each entry can be identified. the output would be:
    >
    > BEGIN----------
    > Smith - Smith
    > Smiths - Smiths or hs
    > Mona - Mon or ona
    > Moynaham - Moy or nah or aha or ham
    > END------------


    Smith has *no* unambiguous match string!

    What do you mean by "minimum"? "hs" is shorter than "Smiths" so why
    is "Smiths" included if you only want the minimum?

    >
    > Is there any built-in function to do such analysis in perl?


    No.

    sub foo {
    my %h2;
    foreach (@_) {
    my %h;
    foreach my $p1 (0..length($_)-1) {
    foreach my $p2 ($p1 .. length($_)-1) {
    $h{substr $_, $p1, $p2-$p1+1}=1;
    }
    };
    $h2{$_}++ foreach keys %h;
    };
    return map { my $t=$_; $t, grep -1!=index($_,$t),@_}
    grep $h2{$_}==1, keys %h2;
    };

    print join "\t", foo(qw/Smith Smiths Mona Moynaham/);

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jun 23, 2005
    #9
  10. Shiraz

    Shiraz Guest

    Arne,
    the input about how to analyze the problem, look for strings of various
    lengths and see which one is unique to an entry

    xhos
    if in the list that i get contains, both smith and smiths, there is
    nothing i can do about it, so to identify them separately, i need a
    total match of smith, to identify 'smith' and i need 'hs' to identify
    'Smiths';
    you were correct about the hs;
     
    Shiraz, Jun 23, 2005
    #10
  11. Shiraz

    Shiraz Guest

    Hey guys there is a package called Text-Ngrams which solved my
    problems.....
     
    Shiraz, Jun 25, 2005
    #11
  12. Shiraz

    Shiraz Guest

    Hey guys there is a package called Text-Ngrams which solved my
    problems.....
     
    Shiraz, Jun 25, 2005
    #12
  13. Shiraz

    Shiraz Guest

    Hey guys there is a package called Text-Ngrams which solved my
    problems.....
     
    Shiraz, Jun 25, 2005
    #13
  14. Shiraz <> wrote:

    > Hey guys there is a package called Text-Ngrams which solved my
    > problems.....



    What were your problems that the module solved?


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jun 25, 2005
    #14
  15. Shiraz

    Shiraz Guest

    i appologize, i should be more verbose... any ways i was looking for a
    way to come up with a minimum match string, (given a list of stings) by
    which each entry can be uniquely identified.


    Smith - Smith
    Smiths - Smiths or hs
    Mona - Mon or ona
    Moynaham - Moy or nah or aha or ham

    just using
    use Text::Ngrams;
    my $ng3 = Text::Ngrams->new( windowsize => 6 );

    $ng3->process_files('rates/atsi-BF.csv');
    print F_OUT_LOG $ng3->to_string( out);
     
    Shiraz, Jun 26, 2005
    #15
  16. Shiraz wrote:
    > Tad McClellan wrote:
    >> Shiraz <> wrote:
    >>> Hey guys there is a package called Text-Ngrams which solved my
    >>> problems.....

    >>
    >> What were your problems that the module solved?

    >
    > i appologize, i should be more verbose...


    No, you should rather have provided context by quoting appropriate part
    of previous messages. ;-)

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jun 26, 2005
    #16
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. gonas

    Searching a pattern

    gonas, Nov 20, 2004, in forum: Java
    Replies:
    2
    Views:
    349
    Oscar kind
    Nov 21, 2004
  2. Brian Andrus

    searching for null in pattern

    Brian Andrus, Nov 7, 2003, in forum: Perl Misc
    Replies:
    6
    Views:
    73
    Brad Baxter
    Nov 8, 2003
  3. jhu
    Replies:
    6
    Views:
    118
    Dave Weaver
    Nov 26, 2007
  4. Peter Stacy
    Replies:
    1
    Views:
    105
    John W. Krahn
    Nov 8, 2009
  5. stumblng.tumblr
    Replies:
    1
    Views:
    207
    stumblng.tumblr
    Feb 4, 2008
Loading...

Share This Page