Fastest way to find a match?

B

bukzor

Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

A = foo foo1 foo2 foo3 foo45 ....
B = INCASE_foo2_YOUWANT
is a match

B = INCASE_YOURDONOTWANT
is not a match.

what would be the fastest way to check the 2704 possible values of
"A" ?

Thanks,


so far, I'm using

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {
print "MATCH"
}
}
 
J

Joost Diepenmaat

bukzor said:
Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {
print "MATCH"
}
}

Did you notice that doesn't work? $v = s/../../; assignes the number of
replacements of $_ into $v.

For an unordered list of strings of lengths >= 1, the fastest check
would probably be to use the index() function, though it may not be that
much more or even less fast than a regex match (NOT a replace!).

See perldoc -f index and perldoc perlretut.
 
B

Ben Morrow

Quoth Joost Diepenmaat said:
Did you notice that doesn't work? $v = s/../../; assignes the number of
replacements of $_ into $v.

For an unordered list of strings of lengths >= 1, the fastest check
would probably be to use the index() function, though it may not be that
much more or even less fast than a regex match (NOT a replace!).

For finding one name, index is best. For finding many, probably best is
something like

# this makes things considerably faster, but make sure your strings
# are all in the same Unicode state first. If necessary, Encode them
# all to UTF8.
use bytes;

for my $B (@B) {
study $B;
for my $A (@A) {
$B =~ /$A/ and print "MATCH";
}
}

as 'study' only remembers the last string studied. An alternative would
be to build a big regex from all the search strings joined with '|', but
that would probably end up slower due to having such a large compiled
regex.

Ben
 
J

John Bokma

bukzor said:
Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

A = foo foo1 foo2 foo3 foo45 ....
B = INCASE_foo2_YOUWANT

As usual, it's hard to say without seeing some real examples of "A" and
"B". Are the parts in B always separated by _ ?
 
J

Jürgen Exner

bukzor said:
Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

A = foo foo1 foo2 foo3 foo45 ....
B = INCASE_foo2_YOUWANT
is a match

B = INCASE_YOURDONOTWANT
is not a match.

what would be the fastest way to check the 2704 possible values of
"A" ?

Thanks,


so far, I'm using

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {

What does the string value of the number of matches have to do with the
original text of $B? This condition will always succeed unless $t and $B are
both '1'.

Maybe you meant to test the result of s/// directly?
if ($v =~ s/$t/) {

However, why do a s/// and awkwardly restore $v for each iteration in the
first place? A simple
if ($B =~ m/$_/) {
will do without all the temporary assignments, which cost time!


Having said that I strongly believe your code isn't doing what you think
it's doing in the first place. Initially you wrote
"if any of A is contained in B"
But your code is testing if B is a regular expression that matches any of A.
That is something very different.

I would imagine a simple index() is what you are looking for

foreach (keys %A) {
if (index($B, $_) > -1) {
print "FOUND";
}
}

This is probably also the fastest method, but you may want to run some
benchmarks.

jue
 
M

Michele Dondi

I'm trying to find the fastest way in perl to see if a name contains
another. [...]
so far, I'm using

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {

Funniest way I've seen thus far to check for a match!
print "MATCH"

How 'bout

use List::Util 'first';

# ...

my @rx=map qr/\Q$_\E/ => @A;
say "match!" if first { $B ~~ $_ } @rx;

?

L::U is XS and should be pretty fast.


Michele
 
B

bukzor

Thanks everyone for your thoughtful posts. The OP was psuedocode and
not to be taken literally.

The piece to check is indeed always surrounded by underscores. Here is
the solution from a coworker that I've implemented:


What about using a hash table? You could put the 2704 "A" names in
hash table. Then, break done "B" into all the possible parts that
could match something in "A" and see if there is something in the hash
table that matches.

$hash{"foo"} = 1;
$hash{"foo1"} = 1;
....

For "INCASE_foo2_YOUWANT", there is only one possible matcher after
you remove everything before the first underscore and after the last
underscore, so just check the hash table for the existence of "foo2".

For "INCASE_foo2_SOMETHING_YOUWANT", you could check both "foo2" and
"SOMETHING".

Wouldn't that be near instant?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

before:
51.670u 0.100s 0:57.76 89.6% 0+0k 0+0io 1033pf+0w
after:
0.560u 0.040s 0:00.58 103.4% 0+0k 0+0io 1033pf+0w
I'm a happy camper...
 
J

jm

Michele Dondi a écrit :
I'm trying to find the fastest way in perl to see if a name contains
another. [...]
so far, I'm using

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {

Funniest way I've seen thus far to check for a match!
print "MATCH"

How 'bout

use List::Util 'first';

# ...

my @rx=map qr/\Q$_\E/ => @A;
say "match!" if first { $B ~~ $_ } @rx;

?

L::U is XS and should be pretty fast.


Michele

I didnot find ~~ in man perlop.

Is this a perl operator?
or a L::U operator?
 
B

Ben Morrow

Quoth jm said:
Michele Dondi a écrit :

I didnot find ~~ in man perlop.

Is this a perl operator?
or a L::U operator?

It's the new 'smart match' operator in perl 5.10. In the case above,
since $_ is always a regular expression, it is equivalent to =~; so

print "match!\n" if first { $B =~ $_ } @rx;

Ben
 
D

Dr.Ruud

bukzor schreef:
I'm trying to find the fastest way in perl to see if a name
contains another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

Considered fgrep?
 
M

Michele Dondi

I didnot find ~~ in man perlop.

Is this a perl operator?
or a L::U operator?

A

use 5.010;

operator. I wanted to stay 5.10ish. Of course you can use =~ instead.


Michele
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top