I've got a problem with the Perl regex compiler. It seems that
compilation of combined regexes (alternation, or whatever you call it)
is not optimized.
Using a /(foo|bar)/ regex on strings is slower than using a foreach
loop that applies each alternative one after another. I've written a
test program and looked at the Perl source to find out why. Now I know:
it seems that the DFA doesn't get optimised for the alternation.
As I have neither the time nor the knowledge and skill to optimise the
Perl regex compiler from scratch, what can I do? Programming such
foreach loops gives me headaches; it's so 'awk'ward.
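One way to make the foreach workaround less awkward is to hide the loop behind a small helper. The following is only a minimal sketch, assuming the alternatives are literal strings (as in the test program); the sub names replace_each and replace_alternation are made up for illustration and are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: run one substitution per literal item, in order.
sub replace_each {
    my ($text, $replacement, @items) = @_;
    for my $item (@items) {
        my $re = qr/\Q$item\E/;    # \Q...\E treats the item as a literal
        $text =~ s/$re/$replacement/g;
    }
    return $text;
}

# Hypothetical counterpart: one substitution with a combined alternation.
sub replace_alternation {
    my ($text, $replacement, @items) = @_;
    return $text unless @items;    # an empty alternation would match everywhere
    my $alt = join '|', map { quotemeta } @items;
    $text =~ s/(?:$alt)/$replacement/g;
    return $text;
}

print replace_each('xxfooyybarzz', '-', 'foo', 'bar'), "\n";
print replace_alternation('xxfooyybarzz', '-', 'foo', 'bar'), "\n";
```

Note that the two forms are not always interchangeable: the loop rescans text that earlier passes have already replaced, so the results can differ when the replacement text contains one of the items or when items overlap.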
Here's the test program for those of you who don't think it's true:
#!/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Time::HiRes qw(time);
#use re 'debug';

foreach my $regexcount (1, 5, 10)
{
    foreach my $regexlength (2, 5, 10, 20)
    {
        # Build $regexcount random literals and one combined alternation.
        my @items = map { createRandomTextWithLength($regexlength) } (1 .. $regexcount);
        my $regexstr = join('|', @items);
        my $regex = qr/(?:$regexstr)/;
        foreach my $stringlength (100, 1000, 10000, 100000)
        {
            print localtime() . " Stringlength: $stringlength Number of Regexes:$regexcount Length of each Regex:$regexlength\n";
            my $teststring = createRandomTextWithLength($stringlength);
            my $timer;
            {
                # Variant 1: a single substitution using the alternation.
                my $test = $teststring;
                $timer = time;
                $test =~ s/$regex/foobar/g;
                printf("ElapsedTime:%5.4f %20s %20s\n", time - $timer, md5_hex($test), $regex);
            }
            {
                # Variant 2: one substitution per item, one after another.
                my $test = $teststring;
                $timer = time;
                foreach my $oneregex (@items)
                {
                    $test =~ s/$oneregex/foobar/g;
                }
                printf("ElapsedTime:%5.4f %20s %20s\n", time - $timer, md5_hex($test), ' for loop over ' . join(',', @items));
            }
            print "\n";
        }
    }
}

# Return a random string of $count characters drawn from 'a' .. 't'.
sub createRandomTextWithLength
{
    my ($count) = @_;
    my $string = '';
    for (1 .. $count)
    {
        $string .= chr(ord('a') + int(rand(20)));
    }
    return $string;
}
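
For a cross-check that leaves pattern compilation out of the measurement, the same comparison can be sketched with the core Benchmark module. This is only a rough sketch under assumptions: the literals, the sample string and the iteration count are made-up values, and the sub names with_alternation and with_loop are hypothetical. All patterns are precompiled with qr// so that only matching and substitution are timed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: three fixed literals and a repeated sample string.
my @items  = qw(foo bar baz);
my $string = 'abcfoodefbarghibazjkl' x 1000;

# Precompile everything up front so compile time is not measured.
my $alt      = join '|', map { quotemeta } @items;
my $combined = qr/(?:$alt)/;
my @single   = map { qr/\Q$_\E/ } @items;

# One substitution using the combined alternation.
sub with_alternation {
    my $copy = $string;
    $copy =~ s/$combined/X/g;
    return $copy;
}

# One substitution per precompiled literal, one after another.
sub with_loop {
    my $copy = $string;
    $copy =~ s/$_/X/g for @single;
    return $copy;
}

# Run each variant 200 times and print a comparison table.
cmpthese(200, {
    alternation => \&with_alternation,
    loop        => \&with_loop,
});
```

The replacement 'X' does not occur in any of the items, so both variants produce the same string here and the timings compare like with like.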