Pattern matching for "best match"

Discussion in 'Perl Misc' started by Yash, Jan 28, 2004.

  1. Yash

    Yash Guest

    We have a list of regular expressions such as:
    vodafone.*horoscope
    vodafone.*pxtsend
    vodafone
    yahoo

    The total number of such reg expressions is few hundreds.
    Our program reads a large file with millions of URLs. For each URL, it
    has to find the best match in the list of regular expressions.
    For example www.vodafome.com will match with vodafone.
    www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    Basically we have to find the regular expression that matches best for
    a given URL.

    Do you have any suggestions/pointers for doing this in a very
    efficient way? Suggestions regarding data structures to use,
    algorithm, etc. would be helpful.

    Thanks
    Yash, Jan 28, 2004
    #1
    1. Advertising

  2. Yash

    Ben Morrow Guest

    (Yash) wrote:
    > We have a list of regular expressions such as:
    > vodafone.*horoscope
    > vodafone.*pxtsend
    > vodafone
    > yahoo
    >
    > The total number of such reg expressions is few hundreds.
    > Our program reads a large file with millions of URLs. For each URL, it
    > has to find the best match in the list of regular expressions.
    > For example www.vodafome.com will match with vodafone.
    > www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    > Basically we have to find the regular expression that matches best for
    > a given URL.
    >
    > Do you have any suggestions/pointers for doing this in a very
    > efficient way? Suggestions regarding data structures to use,
    > algorithm, etc. would be helpful.


    Define what you mean by the 'best' match. The one that matches the
    most text? The intuitive (at least to me) meaning would involve
    working out that (e.g.) /vodafone.*horoscope/ will match a subset of
    the strings that /vodafone/ will match, and therefore the first is a
    more specific ('better') match than the second. I would imagine this
    would be pretty hard to work out just starting from the regexes.

    Ben

    --
    Musica Dei donum optimi, trahit homines, trahit deos. |
    Musica truces mollit animos, tristesque mentes erigit. |
    Musica vel ipsas arbores et horridas movet feras. |
    Ben Morrow, Jan 28, 2004
    #2
    1. Advertising

  3. Yash

    Tore Aursand Guest

    On Wed, 28 Jan 2004 04:00:05 -0800, Yash wrote:
    > We have a list of regular expressions such as: vodafone.*horoscope
    > vodafone.*pxtsend
    > vodafone
    > yahoo
    >
    > The total number of such reg expressions is few hundreds. Our program
    > reads a large file with millions of URLs. For each URL, it has to find
    > the best match in the list of regular expressions.


    You never define "best" in your posting. What do _you_ require to be a
    perfect match? Almost perfect match? No match at all? Etc.

    > For example www.vodafome.com will match with vodafone.


    It will? Why?

    > www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    > Basically we have to find the regular expression that matches best for a
    > given URL.


    You could always try to treat the strings you match against a regular
    expressions and in addition try to compare the strings letter by letter,
    and in turn the distance between each letter in each string.

    There are modules out there for comparing strings, as well. I suggest you
    check out CPAN;

    http://www.cpan.org/


    --
    Tore Aursand <>
    "Life is pleasant. Death is peaceful. It's the transition that's
    troublesome." -- Isaac Asimov
    Tore Aursand, Jan 28, 2004
    #3
  4. On Wed, 28 Jan 2004 04:00:05 -0800, Yash wrote:

    > We have a list of regular expressions such as:
    > vodafone.*horoscope
    > vodafone.*pxtsend
    > vodafone
    > yahoo
    >
    > The total number of such reg expressions is few hundreds.
    > Our program reads a large file with millions of URLs. For each URL, it
    > has to find the best match in the list of regular expressions.
    > For example www.vodafome.com will match with vodafone.
    > www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    > Basically we have to find the regular expression that matches best for
    > a given URL.
    >
    > Do you have any suggestions/pointers for doing this in a very
    > efficient way? Suggestions regarding data structures to use,
    > algorithm, etc. would be helpful.


    Are you trying to do web filtering or redirection? Fuzzy logic searches?
    What?

    And .... where's the code? :) Or is this where you're stuck?

    NEI (Not Enoguh Information)

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    If all the world's a stage, I want to operate the trap door. --
    Paul Beatty
    James Willmore, Jan 28, 2004
    #4
  5. Yash

    Jim Keenan Guest

    (Yash) wrote in message news:<>...
    > We have a list of regular expressions such as:
    > vodafone.*horoscope
    > vodafone.*pxtsend
    > vodafone
    > yahoo
    >
    > The total number of such reg expressions is few hundreds.
    > Our program reads a large file with millions of URLs. For each URL, it
    > has to find the best match in the list of regular expressions.
    > For example www.vodafome.com will match with vodafone.
    > www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    > Basically we have to find the regular expression that matches best for
    > a given URL.
    >

    Define "best".
    Jim Keenan, Jan 28, 2004
    #5
  6. Yash

    Steve Guest

    (Yash) wrote in message news:<>...
    > We have a list of regular expressions such as:
    > vodafone.*horoscope
    > vodafone.*pxtsend
    > vodafone
    > yahoo
    >
    > The total number of such reg expressions is few hundreds.
    > Our program reads a large file with millions of URLs. For each URL, it
    > has to find the best match in the list of regular expressions.
    > For example www.vodafome.com will match with vodafone.
    > www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    > Basically we have to find the regular expression that matches best for
    > a given URL.
    >
    > Do you have any suggestions/pointers for doing this in a very
    > efficient way? Suggestions regarding data structures to use,
    > algorithm, etc. would be helpful.
    >
    > Thanks


    If you define 'the best match' as the string with the longest
    continual stretch of characters identical to a target, then molecular
    biology has developed some very efficient ways to do this to align DNA
    (and protein) sequences. BLAST is the most commonly used and you can
    get a description by Googling - essentially it takes a 'seed' of a few
    characters, attempts to match and when it finds a match it looks
    either side to extend the alignment and then ranks by length of
    overlap. Molecular biologists need to worry about allowing
    discrepancies and gaps within the match, but I suspect that for your
    application a simple exact match is good enough.

    I need not point out that the scale of your problem is tiny compared
    with aligning genomes (but I did).

    HTH

    Steve
    Steve, Jan 28, 2004
    #6
  7. Hi Yash,

    Yash wrote:
    > We have a list of regular expressions such as:
    > vodafone.*horoscope
    > vodafone.*pxtsend
    > vodafone
    > yahoo
    >
    > The total number of such reg expressions is few hundreds.
    > Our program reads a large file with millions of URLs. For each URL, it
    > has to find the best match in the list of regular expressions.
    > For example www.vodafome.com will match with vodafone.
    > www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    > Basically we have to find the regular expression that matches best for
    > a given URL.
    >
    > Do you have any suggestions/pointers for doing this in a very
    > efficient way? Suggestions regarding data structures to use,
    > algorithm, etc. would be helpful.
    >
    > Thanks


    I assume you have put all patterns in "priority" order
    so that the first hit is the "best". My very nice Perl
    teacher suggested to generate an anonymous sub if I was
    going to do what you are doing:

    my $sub = 'sub test_for_match { $_ = $_[0]; '.
    join( '', map { "/$_/o && return $_; " } @patterns ) .
    'return "No match"; }';
    eval { $sub };
    test_for_match( 'www.vodafone.com' );

    Haven't tested the code, but I hope I conveyed the message.
    The subroutine gets compiled once so it has little overhead
    and there are probably other optimization which can be done.

    Regards,

    /Gunnar
    Gunnar Strand, Jan 29, 2004
    #7
  8. Yash

    Sam Holden Guest

    On Thu, 29 Jan 2004 06:21:08 +0100,
    Gunnar Strand <> wrote:
    >
    > I assume you have put all patterns in "priority" order
    > so that the first hit is the "best". My very nice Perl
    > teacher suggested to generate an anonymous sub if I was
    > going to do what you are doing:
    >
    > my $sub = 'sub test_for_match { $_ = $_[0]; '.
    > join( '', map { "/$_/o && return $_; " } @patterns ) .
    > 'return "No match"; }';
    > eval { $sub };
    > test_for_match( 'www.vodafone.com' );


    There is no anonymous sub in that code.

    And the test_for_match subroutine never exists since you are using block
    eval which doesn't do what you think it does. Anyway using an eval
    correctly would just lead to more problems because the code is
    completely wrong.

    If your perl teacher suggested such code, find another perl teacher.
    More likely you misinterpreted them.


    > Haven't tested the code, but I hope I conveyed the message.
    > The subroutine gets compiled once so it has little overhead
    > and there are probably other optimization which can be done.


    5 seconds of testing would show that it doesn't work, thanks for letting
    lots of people do so instead of you doing so before posting it.

    That style of code generating a subroutine isn't necessary since perl
    5.005 (ie. since mid 1998) due to the introduction of qr//.

    See the obvious FAQ:

    perlfaq6: How do I efficiently match many regular expressions at once?

    --
    Sam Holden
    Sam Holden, Jan 29, 2004
    #8
  9. Yash

    Guest

    (Yash) wrote:
    > We have a list of regular expressions such as:
    > vodafone.*horoscope
    > vodafone.*pxtsend
    > vodafone
    > yahoo
    >
    > The total number of such reg expressions is few hundreds.
    > Our program reads a large file with millions of URLs. For each URL, it
    > has to find the best match in the list of regular expressions.
    > For example www.vodafome.com will match with vodafone.
    > www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
    > Basically we have to find the regular expression that matches best for
    > a given URL.


    I assume the regular expressions are in order of "goodness".

    warning "Untested code";
    my @qr_rigo= map { qr/$_/ } @regex_in_goodness_order;
    sub get_best {
    study $_[0]; ## Will this help? Try and see.
    foreach (@qr_rigo) {
    return $_ if $_[0] =~ /$_/;
    };
    return;
    };

    while (my $url=<>) {
    chomp $url;
    my $b=get_best($url);
    ##
    };


    >
    > Do you have any suggestions/pointers for doing this in a very
    > efficient way? Suggestions regarding data structures to use,
    > algorithm, etc. would be helpful.


    If the above is not efficient enough (or if you have no way to order the
    regexes by goodness by hand), then further optimization probably relies
    minutes details and trade-offs which we don't know about. (What you mean
    by "best match", how many of the 1000s of regex are substrings of each
    other, the skew in the actual data set, etc.)

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service New Rate! $9.95/Month 50GB
    , Jan 29, 2004
    #9
  10. Sam,

    You are quite right, that was really, really bad
    suggestion in more than one way. Sorry about that.

    /Gunnar

    Sam Holden wrote:
    > On Thu, 29 Jan 2004 06:21:08 +0100,
    > Gunnar Strand <> wrote:
    >
    >>I assume you have put all patterns in "priority" order
    >>so that the first hit is the "best". My very nice Perl
    >>teacher suggested to generate an anonymous sub if I was
    >>going to do what you are doing:
    >>
    >>my $sub = 'sub test_for_match { $_ = $_[0]; '.
    >> join( '', map { "/$_/o && return $_; " } @patterns ) .
    >> 'return "No match"; }';
    >>eval { $sub };
    >>test_for_match( 'www.vodafone.com' );

    >
    >
    > There is no anonymous sub in that code.
    >
    > And the test_for_match subroutine never exists since you are using block
    > eval which doesn't do what you think it does. Anyway using an eval
    > correctly would just lead to more problems because the code is
    > completely wrong.
    >
    > If your perl teacher suggested such code, find another perl teacher.
    > More likely you misinterpreted them.
    >
    >
    >
    >>Haven't tested the code, but I hope I conveyed the message.
    >>The subroutine gets compiled once so it has little overhead
    >>and there are probably other optimization which can be done.

    >
    >
    > 5 seconds of testing would show that it doesn't work, thanks for letting
    > lots of people do so instead of you doing so before posting it.
    >
    > That style of code generating a subroutine isn't necessary since perl
    > 5.005 (ie. since mid 1998) due to the introduction of qr//.
    >
    > See the obvious FAQ:
    >
    > perlfaq6: How do I efficiently match many regular expressions at once?
    >
    Gunnar Strand, Jan 29, 2004
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. DelphiDude
    Replies:
    3
    Views:
    1,168
  2. danpres2k
    Replies:
    3
    Views:
    7,471
    danpres2k
    Aug 25, 2003
  3. CV
    Replies:
    2
    Views:
    592
    Charles DeRykus
    Aug 31, 2004
  4. Marc Bissonnette

    Pattern matching : not matching problem

    Marc Bissonnette, Jan 8, 2004, in forum: Perl Misc
    Replies:
    9
    Views:
    234
    Marc Bissonnette
    Jan 13, 2004
  5. Bobby Chamness
    Replies:
    2
    Views:
    227
    Xicheng Jia
    May 3, 2007
Loading...

Share This Page