why is pattern matching using '|' slower than 2 separate ones?

M

Mark Seger

I constructed a very simply step to compare two ways to do pattern
matching and found using the '|' to check for two patterns to be a lot
slower than 2 separate tests of one pattern each. Most surprising!

Consider the following:

#!/usr/bin/perl
$z='ZZZZZZZZZZZZZZZZZZZZZ';
for ($i=0; $i<1000000; $i++)
{
$a=1 if $z=~/xxxxx|yyyyy/;
# $a=1 if $z=~/xxxxx/ or $z=~/yyyyy/;
}

[root@dl380-1 collectl]# time ./test.pl
real 0m4.526s
user 0m4.320s
sys 0m0.000s

if I comment out the first test and uncomment the second one I get

[root@dl380-1 collectl]# time ./test.pl
real 0m0.702s
user 0m0.690s
sys 0m0.010s

As one might expect, when the strings get bigger or the number of tests
increase, the difference in test runs get even bigger. Comments?

-mark
 
J

James Willmore

On Tue, 09 Nov 2004 08:53:12 -0500, Mark Seger wrote:

Consider the following:

#!/usr/bin/perl
$z='ZZZZZZZZZZZZZZZZZZZZZ';
for ($i=0; $i<1000000; $i++)
{
$a=1 if $z=~/xxxxx|yyyyy/;
# $a=1 if $z=~/xxxxx/ or $z=~/yyyyy/;
}

[root@dl380-1 collectl]# time ./test.pl
real 0m4.526s
user 0m4.320s
sys 0m0.000s

if I comment out the first test and uncomment the second one I get

[root@dl380-1 collectl]# time ./test.pl
real 0m0.702s
user 0m0.690s
sys 0m0.010s

As one might expect, when the strings get bigger or the number of tests
increase, the difference in test runs get even bigger. Comments?

You might want to throw in the 'o' modifier to the regexs if the only
thing you're doing is matching. This might yield different results.
For example:

#!/usr/bin/perl
$z='ZZZZZZZZZZZZZZZZZZZZZ';
for ($i=0; $i<1000000; $i++)
{
$a=1 if $z=~/xxxxx|yyyyy/o;
# $a=1 if $z=~/xxxxx/o or $z=~/yyyyy/o;
}

The 'o' modifier compiles the regex once, so it doesn't need to be
compiled latter on. Running the same tests on my system showed some
improvement, but not much.

Check out perlretut for more information.

HTH

Jim
 
A

A. Sinan Unur

On Tue, 09 Nov 2004 08:53:12 -0500, Mark Seger wrote:


....

You might want to throw in the 'o' modifier to the regexs

Why? There is no interpolation happening here.
 
U

Uri Guttman

JW> $a=1 if $z=~/xxxxx|yyyyy/o;
JW> # $a=1 if $z=~/xxxxx/o or $z=~/yyyyy/o;

JW> The 'o' modifier compiles the regex once, so it doesn't need to be
JW> compiled latter on. Running the same tests on my system showed some
JW> improvement, but not much.

/o only doesn't recompile when there are variables in the regex. a fixed
string regex like those won't get any help from /o. and in recent perls
the regex is not recompiled if the variables haven't changed so much of
the need for /o is gone. also with qr// to handle compiling a regex once
and allowing its reuse, the need for /o is almost totally gone.

uri
 
U

Uri Guttman

JW> $a=1 if $z=~/xxxxx|yyyyy/o;
JW> # $a=1 if $z=~/xxxxx/o or $z=~/yyyyy/o;

JW> The 'o' modifier compiles the regex once, so it doesn't need to be
JW> compiled latter on. Running the same tests on my system showed some
JW> improvement, but not much.

/o only doesn't recompile when there are variables in the regex. a fixed
string regex like those won't get any help from /o. and in recent perls
the regex is not recompiled if the variables haven't changed so much of
the need for /o is gone. also with qr// to handle compiling a regex once
and allowing its reuse, the need for /o is almost totally gone.

uri
 
J

James Willmore

JW> $a=1 if $z=~/xxxxx|yyyyy/o;
JW> # $a=1 if $z=~/xxxxx/o or $z=~/yyyyy/o;

JW> The 'o' modifier compiles the regex once, so it doesn't need to be
JW> compiled latter on. Running the same tests on my system showed some
JW> improvement, but not much.

/o only doesn't recompile when there are variables in the regex. a fixed
string regex like those won't get any help from /o. and in recent perls
the regex is not recompiled if the variables haven't changed so much of
the need for /o is gone. also with qr// to handle compiling a regex once
and allowing its reuse, the need for /o is almost totally gone.

I didn't realize that new versions of Perl reduced the need to use /o.

Thanks.
 
A

Anno Siegel

James Willmore said:
What harm is there in compiling a regex that doesn't do interpolation?

Huh? Are you saying that /o *compiles* a regex?

Every regex must be compiled before it is used. When there is no
interpolation, that happens once at compile time. When there is,
the regex is recompiled before each use, in case the interpolating
string has changed. /o suppresses this re-compilation. As has been
noted, newer Perls only do the re-compilation when the string has
actually changed, but /o still suppresses it.

So /o plainly has no effect at all when there is no interpolation.

Anno
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Dave

$a = 'ZZZZZZZZZ'; # Or some such
For the answer to this and more (if you are interested) have a look at
Mastering Regular Expressions by Jeremy Friedl

Did not see the newer edition. Does it describe the operation of REx
optimizer?
The short answer is that the behaviour is entirely expected. The first
version caused the Regex engine to do lots of switching at each position in
the string to swap between looking for one then the other. It also hinders
the engine from doing certain optimisations which are easy with the simple
literal string.

This has little relation to what actually happens. The REx engine
proper is not even entered with these patterns. It is the optimizer
who rejects the match. And with the first version it tries to find
'x' or 'y' inside the string - which is much slower that looking for
'x' at each 5th position - as the second version does.

IMO, it is the "each 5th position" which helps - not switching between
two possibilities.

Run with use re 'debugcolor' for details,
Ilya
 
C

ctcgag

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to
Dave
<[email protected]>:

$a = 'ZZZZZZZZZ'; # Or some such
For the answer to this and more (if you are interested) have a look at
Mastering Regular Expressions by Jeremy Friedl

Did not see the newer edition. Does it describe the operation of REx
optimizer?
The short answer is that the behaviour is entirely expected. The first
version caused the Regex engine to do lots of switching at each
position in the string to swap between looking for one then the other.
It also hinders the engine from doing certain optimisations which are
easy with the simple literal string.

This has little relation to what actually happens. The REx engine
proper is not even entered with these patterns. It is the optimizer
who rejects the match. And with the first version it tries to find
'x' or 'y' inside the string - which is much slower that looking for
'x' at each 5th position - as the second version does.

IMO, it is the "each 5th position" which helps - not switching between
two possibilities.

Run with use re 'debugcolor' for details,

It looks like the optimizer is not doing the rejections on the first
version, it is only doing rejections on the 2nd versions. The 1st versions
goes through all the gory details of the regex engine, including a bunch of
Eval scopes.

(5.8.0)

Xho
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to


Wishfull thinking...
It looks like the optimizer is not doing the rejections on the first
version, it is only doing rejections on the 2nd versions.

Right, it should do something similar to
perl -Mre=debugcolor -wle "$z = 'ZZZZZZZZZ'; $z=~/x*[y]/;"
Compiling REx `x*[y]'
size 15 Got 124 bytes for offset annotations.
first at 1
synthetic stclass `ANYOF[xy]'.
1: STAR(4)
2: EXACT <x>(0)
4: ANYOF[y](15)
15: END(0)
stclass `ANYOF[xy]' minlen 1
Offsets: [15]
2[1] 1[1] 0[0] 3[3] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0]
0[0] 0[0] 6
Matching REx `x*[y]' against `ZZZZZZZZZ'
Matching stclass `ANYOF[xy]' against `ZZZZZZZZZ'
Contradicts stclass...
Match failed

Note the "synthetic stclass"; it should be the same for x|y; but it is
not. A bug in optimizer...

Thanks for bringing it to my attention, but better report it to p5p too,
Ilya
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,573
Members
45,046
Latest member
Gavizuho

Latest Threads

Top