Pattern matching for "best match"

Y

Yash

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks
 
B

Ben Morrow

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Define what you mean by the 'best' match. The one that matches the
most text? The intuitive (at least to me) meaning would involve
working out that (e.g.) /vodafone.*horoscope/ will match a subset of
the strings that /vodafone/ will match, and therefore the first is a
more specific ('better') match than the second. I would imagine this
would be pretty hard to work out just starting from the regexes.

Ben
 
T

Tore Aursand

We have a list of regular expressions such as: vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds. Our program
reads a large file with millions of URLs. For each URL, it has to find
the best match in the list of regular expressions.

You never define "best" in your posting. What do _you_ require to be a
perfect match? Almost perfect match? No match at all? Etc.
For example www.vodafome.com will match with vodafone.

It will? Why?
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for a
given URL.

You could always try to treat the strings you match against a regular
expressions and in addition try to compare the strings letter by letter,
and in turn the distance between each letter in each string.

There are modules out there for comparing strings, as well. I suggest you
check out CPAN;

http://www.cpan.org/
 
J

James Willmore

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Are you trying to do web filtering or redirection? Fuzzy logic searches?
What?

And .... where's the code? :) Or is this where you're stuck?

NEI (Not Enoguh Information)

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
If all the world's a stage, I want to operate the trap door. --
Paul Beatty
 
J

Jim Keenan

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.
Define "best".
 
S

Steve

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks

If you define 'the best match' as the string with the longest
continual stretch of characters identical to a target, then molecular
biology has developed some very efficient ways to do this to align DNA
(and protein) sequences. BLAST is the most commonly used and you can
get a description by Googling - essentially it takes a 'seed' of a few
characters, attempts to match and when it finds a match it looks
either side to extend the alignment and then ranks by length of
overlap. Molecular biologists need to worry about allowing
discrepancies and gaps within the match, but I suspect that for your
application a simple exact match is good enough.

I need not point out that the scale of your problem is tiny compared
with aligning genomes (but I did).

HTH

Steve
 
G

Gunnar Strand

Hi Yash,
We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

Thanks

I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
join( '', map { "/$_/o && return $_; " } @patterns ) .
'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );

Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.

Regards,

/Gunnar
 
S

Sam Holden

I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
join( '', map { "/$_/o && return $_; " } @patterns ) .
'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );

There is no anonymous sub in that code.

And the test_for_match subroutine never exists since you are using block
eval which doesn't do what you think it does. Anyway using an eval
correctly would just lead to more problems because the code is
completely wrong.

If your perl teacher suggested such code, find another perl teacher.
More likely you misinterpreted them.

Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.

5 seconds of testing would show that it doesn't work, thanks for letting
lots of people do so instead of you doing so before posting it.

That style of code generating a subroutine isn't necessary since perl
5.005 (ie. since mid 1998) due to the introduction of qr//.

See the obvious FAQ:

perlfaq6: How do I efficiently match many regular expressions at once?
 
C

ctcgag

We have a list of regular expressions such as:
vodafone.*horoscope
vodafone.*pxtsend
vodafone
yahoo

The total number of such reg expressions is few hundreds.
Our program reads a large file with millions of URLs. For each URL, it
has to find the best match in the list of regular expressions.
For example www.vodafome.com will match with vodafone.
www.vodafone.abc.horoscope will match with vodafone.*horoscope, etc.
Basically we have to find the regular expression that matches best for
a given URL.

I assume the regular expressions are in order of "goodness".

warning "Untested code";
my @qr_rigo= map { qr/$_/ } @regex_in_goodness_order;
sub get_best {
study $_[0]; ## Will this help? Try and see.
foreach (@qr_rigo) {
return $_ if $_[0] =~ /$_/;
};
return;
};

while (my $url=<>) {
chomp $url;
my $b=get_best($url);
##
};

Do you have any suggestions/pointers for doing this in a very
efficient way? Suggestions regarding data structures to use,
algorithm, etc. would be helpful.

If the above is not efficient enough (or if you have no way to order the
regexes by goodness by hand), then further optimization probably relies
minutes details and trade-offs which we don't know about. (What you mean
by "best match", how many of the 1000s of regex are substrings of each
other, the skew in the actual data set, etc.)

Xho
 
G

Gunnar Strand

Sam,

You are quite right, that was really, really bad
suggestion in more than one way. Sorry about that.

/Gunnar

Sam said:
I assume you have put all patterns in "priority" order
so that the first hit is the "best". My very nice Perl
teacher suggested to generate an anonymous sub if I was
going to do what you are doing:

my $sub = 'sub test_for_match { $_ = $_[0]; '.
join( '', map { "/$_/o && return $_; " } @patterns ) .
'return "No match"; }';
eval { $sub };
test_for_match( 'www.vodafone.com' );


There is no anonymous sub in that code.

And the test_for_match subroutine never exists since you are using block
eval which doesn't do what you think it does. Anyway using an eval
correctly would just lead to more problems because the code is
completely wrong.

If your perl teacher suggested such code, find another perl teacher.
More likely you misinterpreted them.


Haven't tested the code, but I hope I conveyed the message.
The subroutine gets compiled once so it has little overhead
and there are probably other optimization which can be done.


5 seconds of testing would show that it doesn't work, thanks for letting
lots of people do so instead of you doing so before posting it.

That style of code generating a subroutine isn't necessary since perl
5.005 (ie. since mid 1998) due to the introduction of qr//.

See the obvious FAQ:

perlfaq6: How do I efficiently match many regular expressions at once?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,578
Members
45,052
Latest member
LucyCarper

Latest Threads

Top