Anything to be done about utf8 regexp performance?

  • Thread starter Jochen Lehmeier

Jochen Lehmeier

Hello,
perl -V|head
Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.22-3-k7, archname=i486-linux-gnu-thread-multi
    uname='linux k 2.6.22-3-k7 #1 smp mon oct 22 22:51:54 utc 2007 i686 gnulinux
cat test.pl
#!/usr/local/bin/perl

use strict;
use warnings;

my $a = "a".("x" x 1000);
my $b = "\x{1234}".("x" x 1000);

for (0..1000)
{
    $a =~ s/r/xxx/;
    $a =~ s/r/xxx/i;
    $b =~ s/r/xxx/;
    $b =~ s/r/xxx/i;
}
perl -d:SmallProf test.pl


================ SmallProf version 2.02 ================
Profile of test.pl                                Page 94
=================================================================
count  wall tm  cpu time  line
    0  0.00000   0.00000   1:#!/usr/local/bin/perl
    0  0.00000   0.00000   2:
    0  0.00000   0.00000   3:use strict;
    0  0.00000   0.00000   4:use warnings;
    0  0.00000   0.00000   5:
    1  0.00005   0.00000   6:my $a = "a".("x" x 1000);
    1  0.00006   0.00000   7:my $b = "\x{1234}".("x" x 1000);
    0  0.00000   0.00000   8:
    1  0.00000   0.00000   9:for (0..1000)
    0  0.00000   0.00000  10:{
 1001  0.00596   0.07000  11:    $a =~ s/r/xxx/;
 1001  0.01276   0.03000  12:    $a =~ s/r/xxx/i;
 1001  0.04787   0.14000  13:    $b =~ s/r/xxx/;
 1004  2.05547   2.10000  14:    $b =~ s/r/xxx/i;
    0  0.00000   0.00000  15:}

I can live with line 13, but line 14 is not funny anymore. 344 times
slower than a latin1 regexp... or 161 times slower than a
case-insensitive latin1 one.

I understand that case calculations are much more complex in utf8 than
latin1. Is there anything that can be done, anyway?
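
A minimal sketch of what separates the two test strings, assuming (as the timings suggest) that the slow path is tied to Perl's internal UTF-8 representation: the "\x{1234}" prefix forces a string into that representation, while the plain ASCII string stays a byte string. The names $latin and $wide are made up for this illustration; utf8::is_utf8 is a core function.

```perl
use strict;
use warnings;

# Same construction as $a and $b in test.pl, under illustrative names.
my $latin = "a" . ("x" x 1000);
my $wide  = "\x{1234}" . ("x" x 1000);

# utf8::is_utf8 reports whether a string is stored internally as UTF-8.
printf "latin: %s\n", utf8::is_utf8($latin) ? "utf8" : "bytes";
printf "wide:  %s\n", utf8::is_utf8($wide)  ? "utf8" : "bytes";

# Both strings contain the same number of characters.
printf "lengths: %d and %d\n", length($latin), length($wide);
```

Only the first character differs between the two strings, yet it decides how the whole string is stored and therefore which matching code the regexp engine has to use.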
 

Eric Pozharski

On 2009-11-03, Jochen Lehmeier said:
#!/usr/local/bin/perl

use strict;
use warnings;

my $a = "a".("x" x 1000);
my $b = "\x{1234}".("x" x 1000);

for (0..1000)
{
    $a =~ s/r/xxx/;
    $a =~ s/r/xxx/i;
    $b =~ s/r/xxx/;
    $b =~ s/r/xxx/i;
}
*SKIP*
I can live with line 13, but line 14 is not funny anymore. 344 times
slower than a latin1 regexp... or 161 times slower than a
case-insensitive latin1 one.

I understand that case calculations are much more complex in utf8 than
latin1. Is there anything that can be done, anyway?

HTH (as you can see, the idea of spelling out both cases in a character
class instead of using /i has its limitations):

#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw{ cmpthese timethese };

my $a = "a" . ("x" x 1000);
my $b = "\x{1234}" . ("x" x 1000);

cmpthese timethese -3, {
    code00 => sub { $a =~ s/r/xxx/i; },
    code01 => sub { $b =~ s/r/xxx/i; },
    code02 => sub { $b =~ s/[rR]/xxx/; },
};

__END__
Benchmark: running code00, code01, code02 for at least 3 CPU seconds...
    code00:  2 wallclock secs ( 3.02 usr +  0.00 sys =  3.02 CPU) @ 316342.72/s (n=955355)
    code01:  4 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 4509.38/s (n=14430)
    code02:  2 wallclock secs ( 3.13 usr +  0.00 sys =  3.13 CPU) @ 57964.86/s (n=181430)
           Rate code01 code02 code00
code01   4509/s     --   -92%   -99%
code02  57965/s  1185%     --   -82%
code00 316343/s  6915%   446%     --
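
The [rR] trick in code02 can be generalized to longer literals. What follows is only a sketch under stated assumptions: ci_pattern is a hypothetical helper (not from this thread or any module), and it only spells out characters whose lc and uc forms are each a single character. Anything with a multi-character case mapping is matched literally, so this does not reproduce full Unicode case folding. Whether it is faster than /i on a given perl has to be measured, for example with the Benchmark script above.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: turn a literal string into a regexp that matches
# it regardless of case without using the /i flag.
sub ci_pattern {
    my ($literal) = @_;
    my $re = '';
    for my $ch (split //, $literal) {
        my ($lo, $up) = (lc $ch, uc $ch);
        if (length($lo) == 1 && length($up) == 1 && $lo ne $up) {
            # Two distinct single-character case forms: list both.
            $re .= '[' . quotemeta($lo) . quotemeta($up) . ']';
        }
        else {
            # No simple case pair: match the character literally.
            $re .= quotemeta($ch);
        }
    }
    return qr/$re/;
}

my $target = "\x{1234}" . ("x" x 1000) . "R";
my $re     = ci_pattern("r");          # matches the same text as [rR]

(my $copy = $target) =~ s/$re/xxx/;
printf "length before: %d, after: %d\n", length($target), length($copy);
```

The pattern is compiled once with qr// and reused, so the cost of building the character classes is paid outside the loop that does the substitutions.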
 
