Pattern matching for "best match"

Yash · Jan 28, 2004

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks

Ben Morrow · Jan 28, 2004

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Define what you mean by the 'best' match. The one that matches the
most text? The intuitive (at least to me) meaning would involve
working out that (e.g.) /vodafone.*horoscope/ will match a subset of
the strings that /vodafone/ will match, and therefore the first is a
more specific ('better') match than the second. I would imagine this
would be pretty hard to work out just starting from the regexes.

Ben

Tore Aursand · Jan 28, 2004

We have a list of regular expressions such as: vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds. Our program
reads a large file with millions of URLs. For each URL, it has to find
the best match in the list of regular expressions.

You never define "best" in your posting. What do _you_ require to be a
perfect match? Almost perfect match? No match at all? Etc.

For example www.vodafome.com will match with vodafone.

It will? Why?

www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for a
given URL.

You could always try to treat the strings you match against a regular
expressions and in addition try to compare the strings letter by letter,
and in turn the distance between each letter in each string.

There are modules out there for comparing strings, as well. I suggest you
check out CPAN;

http://www.cpan.org/

James Willmore · Jan 28, 2004

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Are you trying to do web filtering or redirection? Fuzzy logic searches?
What?

And .... where's the code?

Or is this where you're stuck?

NEI (Not Enoguh Information)

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
If all the world's a stage, I want to operate the trap door. --
Paul Beatty

Jim Keenan · Jan 28, 2004

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Define "best".

Steve · Jan 28, 2004

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks

If you define 'the best match' as the string with the longest
continual stretch of characters identical to a target, then molecular
biology has developed some very efficient ways to do this to align DNA
(and protein) sequences. BLAST is the most commonly used and you can
get a description by Googling - essentially it takes a 'seed' of a few
characters, attempts to match and when it finds a match it looks
either side to extend the alignment and then ranks by length of
overlap. Molecular biologists need to worry about allowing
discrepancies and gaps within the match, but I suspect that for your
application a simple exact match is good enough.

I need not point out that the scale of your problem is tiny compared
with aligning genomes (but I did).

HTH

Steve

Gunnar Strand · Jan 29, 2004

Hi Yash,

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks

I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
join( '', map { "/$_/o && return $_; " } @patterns ) .
'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );

Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.

Regards,

/Gunnar

Sam Holden · Jan 29, 2004

I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
join( '', map { "/$_/o && return $_; " } @patterns ) .
'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );

There is no anonymous sub in that code.

And the test_for_match subroutine never exists since you are using block
eval which doesn't do what you think it does. Anyway using an eval
correctly would just lead to more problems because the code is
completely wrong.

If your perl teacher suggested such code, find another perl teacher.
More likely you misinterpreted them.

Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.

5 seconds of testing would show that it doesn't work, thanks for letting
lots of people do so instead of you doing so before posting it.

That style of code generating a subroutine isn't necessary since perl
5.005 (ie. since mid 1998) due to the introduction of qr//.

See the obvious FAQ:

perlfaq6: How do I efficiently match many regular expressions at once?

ctcgag · Jan 29, 2004

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

I assume the regular expressions are in order of "goodness".

warning "Untested code";
my @qr_rigo= map { qr/$_/ } @regex_in_goodness_order;
sub get_best {
study $_[0]; ## Will this help? Try and see.
foreach (@qr_rigo) {
return $_ if $_[0] =~ /$_/;
};
return;
};

while (my $url=<>) {
chomp $url;
my $b=get_best($url);
##
};

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

If the above is not efficient enough (or if you have no way to order the
regexes by goodness by hand), then further optimization probably relies
minutes details and trade-offs which we don't know about. (What you mean
by "best match", how many of the 1000s of regex are substrings of each
other, the skew in the actual data set, etc.)

Xho

Gunnar Strand · Jan 29, 2004

Sam,

You are quite right, that was really, really bad
suggestion in more than one way. Sorry about that.

/Gunnar

Sam said:
I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
join( '', map { "/$_/o && return $_; " } @patterns ) .
'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );

Click to expand...

There is no anonymous sub in that code.

And the test_for_match subroutine never exists since you are using block
eval which doesn't do what you think it does. Anyway using an eval
correctly would just lead to more problems because the code is
completely wrong.

If your perl teacher suggested such code, find another perl teacher.
More likely you misinterpreted them.

Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.

Click to expand...

5 seconds of testing would show that it doesn't work, thanks for letting
lots of people do so instead of you doing so before posting it.

That style of code generating a subroutine isn't necessary since perl
5.005 (ie. since mid 1998) due to the introduction of qr//.

See the obvious FAQ:

perlfaq6: How do I efficiently match many regular expressions at once?

[PATTERN MATCHING]	0	Dec 20, 2004
Trying to parse/match a C string literal	12	Sep 24, 2009
Mixed Content XML pattern matching	2	Aug 25, 2006
Help needed with tough regular expression matching	11	Oct 12, 2009
pattern matching question	2	Apr 13, 2005
Pattern Matching	3	Jul 19, 2004
[perl-python] string pattern matching	9	Feb 1, 2005
Best Practice for Multiline Regexps	8	Sep 1, 2009

Pattern matching for "best match"

Yash

Ben Morrow

Tore Aursand

James Willmore

Jim Keenan

Steve

Gunnar Strand

Sam Holden

ctcgag

Gunnar Strand

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads