Fastest way to find a match?

bukzor · Mar 12, 2008

Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

A = foo foo1 foo2 foo3 foo45 ....
B = INCASE_foo2_YOUWANT
is a match

B = INCASE_YOURDONOTWANT
is not a match.

what would be the fastest way to check the 2704 possible values of
"A" ?

Thanks,

so far, I'm using

foreach $t (keys %A) {
    $v = $B;
    $v = s/$t//;
    if ($v ne $B) {
        print "MATCH"
    }
}

Joost Diepenmaat · Mar 12, 2008

bukzor said:
Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

foreach $t (keys %A) {
    $v = $B;
    $v = s/$t//;
    if ($v ne $B) {
        print "MATCH"
    }
}

Did you notice that doesn't work? $v = s/../../; assigns the number of
replacements made in $_ to $v.

For an unordered list of strings of lengths >= 1, the fastest check
would probably be to use the index() function, though it may not be much
faster, and could even be slower, than a regex match (NOT a replace!).

See perldoc -f index and perldoc perlretut.
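To make the difference concrete, here is a minimal sketch using the sample value of B from the original post. The variable names and the dummy content of $_ are assumptions for illustration only. It shows what "=" and "=~" each do with s///, and how index() reports a substring.

```perl
use strict;
use warnings;

my $B = 'INCASE_foo2_YOUWANT';
my $t = 'foo2';

# "=" assigns the return value of s///, which runs against $_ (not $v).
local $_ = 'unrelated text';          # hypothetical content of $_
my $v = $B;
$v = s/$t//;                          # nothing replaced in $_, so $v becomes ''
print "after '=':  '$v'\n";

# "=~" binds s/// to $v and edits $v in place.
$v = $B;
my $count = ($v =~ s/$t//);           # $count is 1, $v is 'INCASE__YOUWANT'
print "after '=~': '$v' ($count replacement)\n";

# index() returns the 0-based position of the substring, or -1 if absent.
my $pos = index($B, $t);              # 7
print "index: $pos\n";
```

A plain containment test needs neither the copy nor the substitution: index($B, $t) > -1 is enough.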

Ben Morrow · Mar 13, 2008

Quoth Joost Diepenmaat said:
Did you notice that doesn't work? $v = s/../../; assigns the number of
replacements made in $_ to $v.

For an unordered list of strings of lengths >= 1, the fastest check
would probably be to use the index() function, though it may not be much
faster, and could even be slower, than a regex match (NOT a replace!).

For finding one name, index is best. For finding many, probably best is
something like

# this makes things considerably faster, but make sure your strings
# are all in the same Unicode state first. If necessary, Encode them
# all to UTF8.
use bytes;

for my $B (@B) {
    study $B;
    for my $A (@A) {
        $B =~ /$A/ and print "MATCH";
    }
}

as 'study' only remembers the last string studied. An alternative would
be to build a big regex from all the search strings joined with '|', but
that would probably end up slower due to having such a large compiled
regex.

Ben
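For comparison, the "big regex" alternative mentioned above could be sketched as follows. The short @A list is a stand-in for the 2704 names, and $any_name is a hypothetical variable name. Two caveats for later readers: since perl 5.10 the regex engine can compile an alternation of literal strings into a trie, so this form may not be as slow as feared, and since perl 5.16 study is a no-op. Benchmarking on the real data is the only way to know.

```perl
use strict;
use warnings;

my @A = qw(foo foo1 foo2 foo3 foo45);   # stand-in for the full name list

# Escape each name so it is matched literally, then join with '|'.
my $alternation = join '|', map { quotemeta($_) } @A;
my $any_name    = qr/(?:$alternation)/;  # compiled once, reused for every B

for my $B (qw(INCASE_foo2_YOUWANT INCASE_YOURDONOTWANT)) {
    printf "%-22s %s\n", $B, ($B =~ $any_name ? 'MATCH' : 'no match');
}
```

The pattern is built once outside the loop, so each B costs a single match rather than one match per name.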

John Bokma · Mar 13, 2008

bukzor said:
Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

A = foo foo1 foo2 foo3 foo45 ....
B = INCASE_foo2_YOUWANT

As usual, it's hard to say without seeing some real examples of "A" and
"B". Are the parts in B always separated by _ ?

Jürgen Exner · Mar 13, 2008

bukzor said:
Hi,

I'm trying to find the fastest way in perl to see if a name contains
another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

A = foo foo1 foo2 foo3 foo45 ....
B = INCASE_foo2_YOUWANT
is a match

B = INCASE_YOURDONOTWANT
is not a match.

what would be the fastest way to check the 2704 possible values of
"A" ?

Thanks,

so far, I'm using

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {

What does the return value of s/// (the number of substitutions) have to do
with the original text of $B? This condition will always succeed unless $B
itself happens to equal that return value, e.g. '1'.

Maybe you meant to test the result of s/// directly?
if ($v =~ s/$t//) {

However, why do a s/// and awkwardly restore $v for each iteration in the
first place? A simple
if ($B =~ m/$_/) {
will do without all the temporary assignments, which cost time!

Having said that, I strongly believe your code isn't doing what you think
it's doing in the first place. Initially you wrote
"if any of A is contained in B"
But your code treats each element of A as a regular expression and matches
it against a string. That is something very different from a literal
substring test as soon as a name contains a regex metacharacter.

I would imagine a simple index() is what you are looking for

foreach (keys %A) {
    if (index($B, $_) > -1) {
        print "FOUND";
    }
}

This is probably also the fastest method, but you may want to run some
benchmarks.

jue
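A small sketch of why a literal test and an interpolated pattern can disagree. The name 'foo.1' and the value of $B are made up for illustration; nothing in the thread says the real names contain metacharacters.

```perl
use strict;
use warnings;

my $name = 'foo.1';                  # hypothetical name containing "."
my $B    = 'INCASE_fooX1_YOUWANT';   # does not contain "foo.1" literally

# Interpolated as a pattern, "." matches any character, including "X".
my $as_regex  = ($B =~ /$name/)        ? 1 : 0;   # 1

# \Q...\E escapes metacharacters, so the name is matched literally.
my $as_quoted = ($B =~ /\Q$name\E/)    ? 1 : 0;   # 0

# index() always compares literally.
my $as_index  = (index($B, $name) > -1) ? 1 : 0;  # 0

print "regex=$as_regex quoted=$as_quoted index=$as_index\n";
```

For timing the candidates against each other, the core Benchmark module (cmpthese) is the usual tool.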

Michele Dondi · Mar 13, 2008

I'm trying to find the fastest way in perl to see if a name contains
another. [...]
so far, I'm using

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {

Funniest way I've seen thus far to check for a match!

print "MATCH"

How 'bout

use List::Util 'first';

# ...

my @rx=map qr/\Q$_\E/ => @A;
say "match!" if first { $B ~~ $_ } @rx;

?

L::U is XS and should be pretty fast.

Michele

bukzor · Mar 13, 2008

Thanks everyone for your thoughtful posts. The OP was pseudocode and
not to be taken literally.

The piece to check is indeed always surrounded by underscores. Here is
the solution from a coworker that I've implemented:

What about using a hash table? You could put the 2704 "A" names in a
hash table. Then, break down "B" into all the possible parts that
could match something in "A" and see if there is something in the hash
table that matches.

$hash{"foo"} = 1;
$hash{"foo1"} = 1;
....

For "INCASE_foo2_YOUWANT", there is only one possible candidate after
you remove everything before the first underscore and after the last
underscore, so just check the hash table for the existence of "foo2".

For "INCASE_foo2_SOMETHING_YOUWANT", you could check both "foo2" and
"SOMETHING".

Wouldn't that be near instant?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

before:
51.670u 0.100s 0:57.76 89.6% 0+0k 0+0io 1033pf+0w
after:
0.560u 0.040s 0:00.58 103.4% 0+0k 0+0io 1033pf+0w
I'm a happy camper...
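A minimal sketch of the hash approach described above, assuming the candidate pieces are exactly the underscore-separated fields between the first and last field of B. The sub name find_name and the short name list are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for the 2704 "A" names; the values are irrelevant, only the keys matter.
my %A = map { $_ => 1 } qw(foo foo1 foo2 foo3 foo45);

# Return the first interior field of $B that is a key of %A, or undef if none.
sub find_name {
    my ($B) = @_;
    my @parts = split /_/, $B;
    return unless @parts > 2;                     # no interior field at all
    for my $part (@parts[1 .. $#parts - 1]) {     # skip first and last field
        return $part if exists $A{$part};         # one hash lookup per field
    }
    return;
}

for my $B (qw(INCASE_foo2_YOUWANT INCASE_foo2_SOMETHING_YOUWANT INCASE_YOURDONOTWANT)) {
    my $hit = find_name($B);
    printf "%-30s %s\n", $B, (defined $hit ? "MATCH ($hit)" : 'no match');
}
```

Each B now costs a handful of hash lookups instead of 2704 pattern matches. Note that under this assumption a name that itself contains an underscore would never be found, because split breaks it apart.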

jm · Mar 15, 2008

Michele Dondi wrote:

I'm trying to find the fastest way in perl to see if a name contains
another. [...]
so far, I'm using

foreach $t (keys %A) {
$v = $B;
$v = s/$t//;
if ($v ne $B) {


Funniest way I've seen thus far to check for a match!

print "MATCH"


How 'bout

use List::Util 'first';

# ...

my @rx=map qr/\Q$_\E/ => @A;
say "match!" if first { $B ~~ $_ } @rx;

?

L::U is XS and should be pretty fast.

Michele

I did not find ~~ in man perlop.

Is this a perl operator?
or a L::U operator?

Ben Morrow · Mar 15, 2008

Quoth jm said:
Michele Dondi wrote:

I did not find ~~ in man perlop.

Is this a perl operator?
or a L::U operator?

It's the new 'smart match' operator in perl 5.10. In the case above,
since $_ is always a regular expression, it is equivalent to =~; so

print "match!\n" if first { $B =~ $_ } @rx;

Ben
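Put together as a self-contained script, the =~ form might look like the sketch below. The sample names and $B come from the original post; the variable $hit is an assumption. Later perl releases (from 5.18) flag smartmatch as experimental, so =~ is the more portable spelling here.

```perl
use strict;
use warnings;
use List::Util 'first';

my @A  = qw(foo foo1 foo2 foo3 foo45);
my @rx = map { qr/\Q$_\E/ } @A;       # one precompiled, literal pattern per name

my $B = 'INCASE_foo2_YOUWANT';

# first() stops at the first pattern that matches and returns it (undef if none).
my $hit = first { $B =~ $_ } @rx;
print defined $hit ? "match!\n" : "no match\n";
```

Because this is a substring test, the pattern that hits first here is the one for "foo", not "foo2". That is fine for a yes/no answer but worth knowing if the matching name itself matters.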

Dr.Ruud · Mar 16, 2008

bukzor wrote:

I'm trying to find the fastest way in perl to see if a name
contains another.

I've a list of 2704 names (aka "A")

I've another name (aka "B")

I need to know if any of A is contained in B.

Considered fgrep?

Michele Dondi · Mar 16, 2008

I did not find ~~ in man perlop.

Is this a perl operator?
or a L::U operator?

A

use 5.010;

operator. I wanted to stay 5.10ish. Of course you can use =~ instead.

Michele
