Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???

Gunnar Hjalmarsson · Feb 18, 2005

I've got a problem with the perl regex compiler. It seems that
compliation of combined regexes ( or alternation whatever you call it
) is not optimized.

Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another. I've written a testprogramm
and looked at the perl source to find out why. Now I know. It seems
that DFA won't get optimised for the alternation.

As I have no time and knowledge and skill for optimising the perlregex
compiler from scratch, what can I do. Programming such foreach loops
gives me headaches - it such 'awk'ward.

Do you need to optimise the program in this respect? If not, why would
you have a problem? ;-)

jolly · Feb 18, 2005

I've got a problem with the perl regex compiler. It seems that
compliation of combined regexes ( or alternation whatever you call it
) is not optimized.

Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another. I've written a testprogramm
and looked at the perl source to find out why. Now I know. It seems
that DFA won't get optimised for the alternation.

As I have no time and knowledge and skill for optimising the perlregex
compiler from scratch, what can I do. Programming such foreach loops
gives me headaches - it such 'awk'ward.

Here's the testprogram for those of you that don't think it's true:

#!/bin/perl
use strict;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Time::HiRes qw(time );
#use re 'debug' ;

foreach my $regexcount (1,5,10)
{
foreach my $regexlength (2,5,10,20)
{
my @items = map{ createRandomTextWithLength($regexlength); }
(1..$regexcount);
my $regexstr = join('|',@items);
my $regex = qr /(?:$regexstr)/;

foreach my $stringlength (100,1000,10000,100000)
{
print localtime()." Stringlength: $stringlength Number of
Regexes:$regexcount Length of each Regex:$regexlength\n";

my $teststring = createRandomTextWithLength($stringlength);
my $timer;
{
my $test=$teststring;
$timer =time;
$test =~ s/$regex/foobar/g;
printf("ElapsedTime:%5.4f %20s
%20s\n",time-$timer,md5_hex($test),$regex);
}

{
my $test=$teststring;
$timer =time;
foreach my $oneregex (@items)
{
$test =~ s/$oneregex/foobar/g;
}
printf("ElapsedTime:%5.4f %20s
%20s\n",time-$timer,md5_hex($test),' for loop over '.join(',',@items));
}
print "\n";
}
}
}

sub createRandomTextWithLength($)
{
my($count) = (@_);
my $string;
for (1.. $count)
{
$string.=chr(ord('a')+rand(20));
}
return $string;
}

jolly · Feb 18, 2005

Yep, it's an optimisation issue. I always thought that using the
/(foo|bar)/ would be the quickest way. So lot's of code has been
already written with that in mind.
It has become a problem lately as the strings I do regexes on tend to
get larger ( e.q. xml-files ) and the performance penalty is HUGE.

I thought' that maybe someone has a solution for the problem. I haven't
taken a look how parrot works with regexes. I'm thinking of writing a
Module with an optimised parser for such regexes but that would be my
last resort.

Jolly

Gunnar Hjalmarsson · Feb 18, 2005

[ Please provide some context when replying to a message. ]

Yep, it's an optimisation issue. I always thought that using the
/(foo|bar)/ would be the quickest way. So lot's of code has been
already written with that in mind.
It has become a problem lately as the strings I do regexes on tend to
get larger ( e.q. xml-files ) and the performance penalty is HUGE.

XML files are seldom very large.

Anyway, are "foo" and "bar" plain strings? If they are, you may want to
try using the index() function instead of the regex engine for better
efficiency.

Brian McCauley · Feb 18, 2005

Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another.

Yes, this is even mentioned in the FAQ, albeit in an oblique way.

The solution to FAQ "How do I efficiently match many regular expressions
at once?" does not mention joining them with '|'. It would IMNSHO be
better if it were to explicitly mention that joining was not efficient.

JollyJinx · Feb 18, 2005

A pity that Foo and bar are NOT plain strings the real thing looks more
like:

$self->{MATCHCACHE}= '([^[:alnum:]\xc0-\xff]('.join('|', map{
RockBottom::buildMatchFromString($_) }(sort
{length($b)<=>length($a)}(@matcharray)) ).')[^[:alnum:]\xc0-\xff])';

buildMatchFromString builds multiple versions of a string and it builds
regexes not just plain strings .

The array contains thousands of matches and replacement is into a hash
of strings ( actually a function ). But anyways the XML files are not
small ( > 100 kBytes ).

index isn't an option here ;-(

xhoster · Feb 18, 2005

I've got a problem with the perl regex compiler. It seems that
compliation of combined regexes ( or alternation whatever you call it
) is not optimized.

Maybe, but I don't think you've demonstrated that.

Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another.

Well, they also aren't doing the same thing. So that makes any comparison
rather meaningless.

I've written a testprogramm
and looked at the perl source to find out why. Now I know. It seems
that DFA won't get optimised for the alternation.

What part in the perl source tipped you off that they aren't optimized?

As I have no time and knowledge and skill for optimising the perlregex
compiler from scratch, what can I do. Programming such foreach loops
gives me headaches - it such 'awk'ward.

You can write one subroutine or module, and then use it over and over.
That way you only have to program the foreach loop once.

Here's the testprogram for those of you that don't think it's true:

Try adding an assertion to your code to check that the md5_hex of each
$test are actually equal.

Xho

Darren Dunham · Feb 18, 2005

(e-mail address removed) wrote:

Yes, this is even mentioned in the FAQ, albeit in an oblique way.

Really? I've transitioned some code from looping to a single regex
alternation and got over an order of magnitude performance increase. It
may simply depend on the complexity and size of the items. I was
dealing with over a thousand items, so the penalty for running perl
opcodes on each one was significant. The regex was probably about a
megabyte when stuffed together, but was much faster.

The solution to FAQ "How do I efficiently match many regular expressions
at once?" does not mention joining them with '|'. It would IMNSHO be
better if it were to explicitly mention that joining was not efficient.

As a general case, I wouldn't agree with such an assertion.

Glenn Jackman · Feb 18, 2005

At 2005-02-18 06:15AM said:
Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop doing the matching one after another.

The book "Mastering Regular Expressions" has a section titled "Perl
Efficiency Issues" that talks about just this issue. You might want to
look there.

Brian McCauley · Feb 19, 2005

Darren said:
Really? I've transitioned some code from looping to a single regex
alternation and got over an order of magnitude performance increase. It
may simply depend on the complexity and size of the items. I was
dealing with over a thousand items, so the penalty for running perl
opcodes on each one was significant. The regex was probably about a
megabyte when stuffed together, but was much faster.

This is not my experience, but as you sat it probably depends on the
complexity. Here's a quick benchmark I threw together to demonstrate
that one regex is slower than many.

use strict;
use warnings;
use Benchmark;

my @many = map { "Match$_" } 1 .. 1000;
my $one = join '|', @many;

$_ = qr/$_/ for $one, @many;

my $data = ('random stuff' x 100) . 'Match999';

timethese 1000 => {
one => sub { $data =~ $one },
many => sub { grep { $data =~ $_} @many },
};

__END__
Benchmark: timing 1000 iterations of many, one...
many: 2 wallclock secs ( 2.76 usr + 0.00 sys = 2.76 CPU) @
361.79/s (n=1000)
one: 33 wallclock secs (32.81 usr + 0.00 sys = 32.81 CPU) @
30.48/s (n=1000)

As a general case, I wouldn't agree with such an assertion.

I think it _is_ true in the general case - although it is possible what
there exist special cases where it is not.

Perhaps you can throw one together to illustrate a case where the single
regex is faster and then make the case for why that case should not be
considered exceptional.

Tassilo v. Parseval · Feb 19, 2005

Also sprach Brian McCauley:

Darren Dunham wrote:

This is not my experience, but as you sat it probably depends on the
complexity. Here's a quick benchmark I threw together to demonstrate
that one regex is slower than many.

So far, this is indeed the case (mostly). Right now though some attempts
are made by the perl-porters to rectify this. A few days ago a patch was
proposed that turned regexes containing alternatives (ideally with some
common prefix) into some sort of trie-structure that will speed up
matching and that should even beat the multiple-regex approach.

Tassilo

brian d foy · Feb 20, 2005

I missed this discussion before, but the blead version of the FAQ
does have an alternation example, but punts to Mastering Regular
Expressions for the full details on why alternation can be really
slow. If anyone has a better answer, they can post a patch.

http://faq.perl.org/perlfaq6.html#How_do_I_efficiently

Darren Dunham · Feb 24, 2005

Brian McCauley said:
complexity. Here's a quick benchmark I threw together to demonstrate
that one regex is slower than many.

use strict;
use warnings;
use Benchmark;

my @many = map { "Match$_" } 1 .. 1000;
my $one = join '|', @many;

$_ = qr/$_/ for $one, @many;

my $data = ('random stuff' x 100) . 'Match999';

timethese 1000 => {
one => sub { $data =~ $one },
many => sub { grep { $data =~ $_} @many },
};

__END__
Benchmark: timing 1000 iterations of many, one...
many: 2 wallclock secs ( 2.76 usr + 0.00 sys = 2.76 CPU) @
361.79/s (n=1000)
one: 33 wallclock secs (32.81 usr + 0.00 sys = 32.81 CPU) @
30.48/s (n=1000)

I think it _is_ true in the general case - although it is possible what
there exist special cases where it is not.

Perhaps you can throw one together to illustrate a case where the single
regex is faster and then make the case for why that case should not be
considered exceptional.

I went back and found the code, and I guess it was a bit of a special
case, but I didn't think it was at the time. It turns out that my
matches were anchored on the line, and that appears to make a big
difference.

Change

$_ = qr/$_/ for $one, @many;

to

$one = qr/^($one)/;
$_ = qr/^$_/ for $one, @many;

Now this benchmark isn't quite what my task was doing, but it does show
the big improvement if the match is anchored. I'll have to keep in mind
that the anchor is helping me out in this case. That wasn't something I
noticed when I was working on it initially.

python-dev Summary for 2004-08-01 through 2004-08-15	17	Aug 24, 2004
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	15	Apr 1, 2006
comp.lang.vhdl FAQ part 1 of 4: general	0	Jul 8, 2003
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	1	Feb 1, 2004

Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???

Gunnar Hjalmarsson

jolly

jolly

Gunnar Hjalmarsson

Brian McCauley

JollyJinx

xhoster

Darren Dunham

Glenn Jackman

Brian McCauley

Tassilo v. Parseval

brian d foy

Darren Dunham

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads