Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???

G

Gunnar Hjalmarsson

I've got a problem with the perl regex compiler. It seems that
compliation of combined regexes ( or alternation whatever you call it
) is not optimized.

Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another. I've written a testprogramm
and looked at the perl source to find out why. Now I know. It seems
that DFA won't get optimised for the alternation.

As I have no time and knowledge and skill for optimising the perlregex
compiler from scratch, what can I do. Programming such foreach loops
gives me headaches - it such 'awk'ward.

Do you need to optimise the program in this respect? If not, why would
you have a problem? ;-)
 
J

jolly

I've got a problem with the perl regex compiler. It seems that
compliation of combined regexes ( or alternation whatever you call it
) is not optimized.

Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another. I've written a testprogramm
and looked at the perl source to find out why. Now I know. It seems
that DFA won't get optimised for the alternation.


As I have no time and knowledge and skill for optimising the perlregex
compiler from scratch, what can I do. Programming such foreach loops
gives me headaches - it such 'awk'ward.


Here's the testprogram for those of you that don't think it's true:

#!/bin/perl
use strict;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Time::HiRes qw(time );
#use re 'debug' ;


foreach my $regexcount (1,5,10)
{
foreach my $regexlength (2,5,10,20)
{
my @items = map{ createRandomTextWithLength($regexlength); }
(1..$regexcount);
my $regexstr = join('|',@items);
my $regex = qr /(?:$regexstr)/;

foreach my $stringlength (100,1000,10000,100000)
{
print localtime()." Stringlength: $stringlength Number of
Regexes:$regexcount Length of each Regex:$regexlength\n";

my $teststring = createRandomTextWithLength($stringlength);
my $timer;
{
my $test=$teststring;
$timer =time;
$test =~ s/$regex/foobar/g;
printf("ElapsedTime:%5.4f %20s
%20s\n",time-$timer,md5_hex($test),$regex);
}

{
my $test=$teststring;
$timer =time;
foreach my $oneregex (@items)
{
$test =~ s/$oneregex/foobar/g;
}
printf("ElapsedTime:%5.4f %20s
%20s\n",time-$timer,md5_hex($test),' for loop over '.join(',',@items));
}
print "\n";
}
}
}

sub createRandomTextWithLength($)
{
my($count) = (@_);
my $string;
for (1.. $count)
{
$string.=chr(ord('a')+rand(20));
}
return $string;
}
 
J

jolly

Yep, it's an optimisation issue. I always thought that using the
/(foo|bar)/ would be the quickest way. So lot's of code has been
already written with that in mind.
It has become a problem lately as the strings I do regexes on tend to
get larger ( e.q. xml-files ) and the performance penalty is HUGE.

I thought' that maybe someone has a solution for the problem. I haven't
taken a look how parrot works with regexes. I'm thinking of writing a
Module with an optimised parser for such regexes but that would be my
last resort.

Jolly
 
G

Gunnar Hjalmarsson

[ Please provide some context when replying to a message. ]

Yep, it's an optimisation issue. I always thought that using the
/(foo|bar)/ would be the quickest way. So lot's of code has been
already written with that in mind.
It has become a problem lately as the strings I do regexes on tend to
get larger ( e.q. xml-files ) and the performance penalty is HUGE.

XML files are seldom very large.

Anyway, are "foo" and "bar" plain strings? If they are, you may want to
try using the index() function instead of the regex engine for better
efficiency.
 
B

Brian McCauley

Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another.

Yes, this is even mentioned in the FAQ, albeit in an oblique way.

The solution to FAQ "How do I efficiently match many regular expressions
at once?" does not mention joining them with '|'. It would IMNSHO be
better if it were to explicitly mention that joining was not efficient.
 
J

JollyJinx

A pity that Foo and bar are NOT plain strings the real thing looks more
like:

$self->{MATCHCACHE}= '([^[:alnum:]\xc0-\xff]('.join('|', map{
RockBottom::buildMatchFromString($_) }(sort
{length($b)<=>length($a)}(@matcharray)) ).')[^[:alnum:]\xc0-\xff])';

buildMatchFromString builds multiple versions of a string and it builds
regexes not just plain strings .

The array contains thousands of matches and replacement is into a hash
of strings ( actually a function ). But anyways the XML files are not
small ( > 100 kBytes ).


index isn't an option here ;-(
 
X

xhoster

I've got a problem with the perl regex compiler. It seems that
compliation of combined regexes ( or alternation whatever you call it
) is not optimized.

Maybe, but I don't think you've demonstrated that.
Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another.

Well, they also aren't doing the same thing. So that makes any comparison
rather meaningless.
I've written a testprogramm
and looked at the perl source to find out why. Now I know. It seems
that DFA won't get optimised for the alternation.

What part in the perl source tipped you off that they aren't optimized?

As I have no time and knowledge and skill for optimising the perlregex
compiler from scratch, what can I do. Programming such foreach loops
gives me headaches - it such 'awk'ward.

You can write one subroutine or module, and then use it over and over.
That way you only have to program the foreach loop once.
Here's the testprogram for those of you that don't think it's true:

Try adding an assertion to your code to check that the md5_hex of each
$test are actually equal.

Xho
 
D

Darren Dunham

(e-mail address removed) wrote:
Yes, this is even mentioned in the FAQ, albeit in an oblique way.

Really? I've transitioned some code from looping to a single regex
alternation and got over an order of magnitude performance increase. It
may simply depend on the complexity and size of the items. I was
dealing with over a thousand items, so the penalty for running perl
opcodes on each one was significant. The regex was probably about a
megabyte when stuffed together, but was much faster.
The solution to FAQ "How do I efficiently match many regular expressions
at once?" does not mention joining them with '|'. It would IMNSHO be
better if it were to explicitly mention that joining was not efficient.

As a general case, I wouldn't agree with such an assertion.
 
G

Glenn Jackman

At 2005-02-18 06:15AM said:
Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another.

The book "Mastering Regular Expressions" has a section titled "Perl
Efficiency Issues" that talks about just this issue. You might want to
look there.
 
B

Brian McCauley

Darren said:
Really? I've transitioned some code from looping to a single regex
alternation and got over an order of magnitude performance increase. It
may simply depend on the complexity and size of the items. I was
dealing with over a thousand items, so the penalty for running perl
opcodes on each one was significant. The regex was probably about a
megabyte when stuffed together, but was much faster.

This is not my experience, but as you sat it probably depends on the
complexity. Here's a quick benchmark I threw together to demonstrate
that one regex is slower than many.

use strict;
use warnings;
use Benchmark;

my @many = map { "Match$_" } 1 .. 1000;
my $one = join '|', @many;

$_ = qr/$_/ for $one, @many;

my $data = ('random stuff' x 100) . 'Match999';

timethese 1000 => {
one => sub { $data =~ $one },
many => sub { grep { $data =~ $_} @many },
};

__END__
Benchmark: timing 1000 iterations of many, one...
many: 2 wallclock secs ( 2.76 usr + 0.00 sys = 2.76 CPU) @
361.79/s (n=1000)
one: 33 wallclock secs (32.81 usr + 0.00 sys = 32.81 CPU) @
30.48/s (n=1000)
As a general case, I wouldn't agree with such an assertion.

I think it _is_ true in the general case - although it is possible what
there exist special cases where it is not.

Perhaps you can throw one together to illustrate a case where the single
regex is faster and then make the case for why that case should not be
considered exceptional.
 
T

Tassilo v. Parseval

Also sprach Brian McCauley:
Darren Dunham wrote:

This is not my experience, but as you sat it probably depends on the
complexity. Here's a quick benchmark I threw together to demonstrate
that one regex is slower than many.

So far, this is indeed the case (mostly). Right now though some attempts
are made by the perl-porters to rectify this. A few days ago a patch was
proposed that turned regexes containing alternatives (ideally with some
common prefix) into some sort of trie-structure that will speed up
matching and that should even beat the multiple-regex approach.

Tassilo
 
D

Darren Dunham

Brian McCauley said:
complexity. Here's a quick benchmark I threw together to demonstrate
that one regex is slower than many.
use strict;
use warnings;
use Benchmark;
my @many = map { "Match$_" } 1 .. 1000;
my $one = join '|', @many;
$_ = qr/$_/ for $one, @many;
my $data = ('random stuff' x 100) . 'Match999';
timethese 1000 => {
one => sub { $data =~ $one },
many => sub { grep { $data =~ $_} @many },
};
__END__
Benchmark: timing 1000 iterations of many, one...
many: 2 wallclock secs ( 2.76 usr + 0.00 sys = 2.76 CPU) @
361.79/s (n=1000)
one: 33 wallclock secs (32.81 usr + 0.00 sys = 32.81 CPU) @
30.48/s (n=1000)
I think it _is_ true in the general case - although it is possible what
there exist special cases where it is not.
Perhaps you can throw one together to illustrate a case where the single
regex is faster and then make the case for why that case should not be
considered exceptional.

I went back and found the code, and I guess it was a bit of a special
case, but I didn't think it was at the time. It turns out that my
matches were anchored on the line, and that appears to make a big
difference.

Change
$_ = qr/$_/ for $one, @many;

to

$one = qr/^($one)/;
$_ = qr/^$_/ for $one, @many;

Now this benchmark isn't quite what my task was doing, but it does show
the big improvement if the match is anchored. I'll have to keep in mind
that the anchor is helping me out in this case. That wasn't something I
noticed when I was working on it initially.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,050
Latest member
AngelS122

Latest Threads

Top