Regexp discovery - using ^ with /m is a time sink

Koszalek Opalek · Feb 14, 2009

#!/usr/bin/perl

=pod

The code below benchmarks two regexps that look for lines
starting with the equals sign in a multiline string. The
regexps differ only in how the line break is detected.
The first regexp uses the ^ metacharacter and the /m flag:
qr{\G^=[^\n]*}ism
The other relies on a positive look-behind assertion:
qr{\G(?<=\n)=[^\n]*}ism

One difference between the two regexps is that the ^
version matches '=' at the beginning of the string,
whereas the other does not. But there is something
else as well. The second version is at least 50 times
faster!

Note that both regexps use the \G assertion (match only at
pos()) -- and the position is set to a random number in each
loop iteration. I assumed both regexps would be very fast
(because they have to be checked at only one position in the
string) -- apparently not so.

Could someone explain what's going on behind the scenes in
the regexp engine? Is it scanning the complete string for
line breaks if I use ^, even though it has to match only
at pos() ?

K.

=cut

use strict;
use Time::HiRes qw( time );
$| = 1;

# Build 1000 lines of random printable characters, 0 to 49 per line.
my $gibberish;
for ( 1 .. 1000 ) {
    for ( 1 .. int( rand 50 ) ) {
        $gibberish .= chr( int( rand 60 ) + 32 );
    }
    $gibberish .= "\n";
}

my $l   = length $gibberish;
my $cnt = 100_000;

# Random offsets at which each regexp will be tried.
my @positions;
for ( 1 .. $cnt ) { push @positions, int( rand $l ) }

print "String length: $l.\n\n";
for my $re (
    qr{\G(?<=\n)=[^\n]*}ism,
    qr{\G^=[^\n]*}ism,
) {
    my $succ  = 0;
    my $start = time;
    foreach ( @positions ) {
        pos( $gibberish ) = $_;    # \G anchors the match at this offset
        $succ++ if ( $gibberish =~ m/$re/g );
    }
    print "Regexp: $re.\n";
    print "Successful matches $succ.\n";
    printf "Time = %f.\n\n", time - $start;
}

print "$cnt matches for each regexp.\n";
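
The semantic difference mentioned in the post (the ^ version also matches at the very start of the string, the look-behind version does not) can be checked directly. This is only a minimal sketch: the helper name match_at and the sample string are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# The two patterns from the benchmark (the /i and /s flags do not matter here).
my $caret      = qr{\G^=[^\n]*}m;
my $lookbehind = qr{\G(?<=\n)=[^\n]*};

# Hypothetical helper: try a pattern at exactly one offset and return the
# matched text, or undef if the pattern does not match there.
sub match_at {
    my ($str, $pos, $re) = @_;
    pos($str) = $pos;                    # \G anchors the match at this offset
    return undef unless $str =~ /$re/g;  # scalar-context /g honours pos()
    return substr($str, $-[0], $+[0] - $-[0]);
}

my $sample = "=a\n=b\n";                 # '=' at offset 0 and at offset 3
for my $pos (0, 3) {
    my $c = match_at($sample, $pos, $caret);
    my $l = match_at($sample, $pos, $lookbehind);
    printf "pos %d: caret=%s look-behind=%s\n", $pos,
        defined $c ? "'$c'" : 'no match',
        defined $l ? "'$l'" : 'no match';
}
```

At offset 0 only the ^ version matches, because there is no preceding newline for the look-behind to see; at offset 3 both match.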

Ilya Zakharevich · Feb 15, 2009

Note that both regexps use the \G assertion (match only at
pos()) -- and the position is set to a random number in each
loop iteration. I assumed both regexps would be very fast
(because they have to be checked at only one position in the
string) -- apparently not so.

Could someone explain what's going on behind the scenes in
the regexp engine? Is it scanning the complete string for
line breaks if I use ^, even though it has to match only
at pos() ?

There is something fishy with how the optimizer treats \G. I might have
skipped some case(s), and it has not been corrected in the years since...

I also found it by benchmarking. Had no time to look into the
sources...

Yours,
Ilya

Eric Pozharski · Feb 15, 2009

#!/usr/bin/perl

=pod

The code below benchmarks two regexps that look for lines
starting with the equals sign in a multiline string. The
regexps differ only in how the line break is detected.
The first regexp uses the ^ metacharacter and the /m flag:
qr{\G^=[^\n]*}ism
The other relies on a positive look-behind assertion:
qr{\G(?<=\n)=[^\n]*}ism
*SKIP*
Could someone explain what's going on behind the scenes in
the regexp engine? Is it scanning the complete string for
line breaks if I use ^, even though it has to match only
at pos() ?

You can do it yourself. Your distro is supposed to provide B<perl>
compiled with I<-DDEBUGGING> enabled. Then use the I<-D512> option
(the numeric form of I<-Dr>); F<perldebguts> has more.

=cut

use strict;
use Time::HiRes qw( time );
$| = 1;

my $gibberish;
for ( 1 .. 1000 ) {
    for ( 1 .. int( rand 50 ) ) {
        $gibberish .= chr( int( rand 60 ) + 32 );
    }
    $gibberish .= "\n";
}

Please don't. (If I retranslate it back to English correctly) "Random
random number generators cycle after random number of cycles"
(attributed to Knuth). Once you use random patterns you'll get random
results, and if your results are random then no one (you first of all)
can trust those results.

I've tried your REs, and for me the successful look-behind is slightly
faster than anything else. (I should admit I've never used C<m//gism>
or C<qr/\G/>, so it's possible I've messed something up.)

*CUT*
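
If a debugging build of perl is not at hand, the core re pragma gives similar output: use re 'debug' prints the compiled regexp program and the execution trace to STDERR on an ordinary perl. A minimal sketch follows; the sample string is made up for illustration.

```perl
use strict;
use warnings;

my $text = "abc\n=def\nghi\n";   # made-up sample; '=' sits at offset 4
pos($text) = 4;

my $matched;
{
    use re 'debug';              # lexically scoped: only this block is traced
    $matched = $text =~ /\G^=[^\n]*/mg;
}

print "matched: ", ($matched ? "yes" : "no"), ", pos now ", pos($text), "\n";
```

The part of the trace worth reading for this thread is where the engine reports the offsets at which it attempts the match.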

Koszalek Opalek · Feb 16, 2009

I also found it by benchmarking. Had no time to look into the
sources...

Should I report this to (e-mail address removed) ?

K.

Koszalek Opalek · Feb 16, 2009

Once you use random patterns you'll get random
results, and if your results are random then no one (you first of all)
can trust those results.

The test result is the time ratio (regexp1/regexp2).
You can hardly call it random:
http://en.wikipedia.org/wiki/Law_of_large_numbers

I've tried your REs, and for me the successful look-behind is slightly
faster than anything else. (I should admit I've never used C<m//gism>
or C<qr/\G/>, so it's possible I've messed something up.)

Have you tried just the REs, or have you run the code that
I posted? I used 5.8.8 for the benchmark, but I'm pretty
sure I would have noticed if it ran any faster in 5.10.

Anyway, I'm compiling 5.10 (with the -DDEBUGGING that
you mentioned elsethread) and will try to investigate
further.

K.
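
One way to sidestep the objection about random input is to seed the generator, so that every run benchmarks the same text. The sketch below assumes that approach; build_text, time_regex, the seed 42 and the smaller offset count are illustrative choices, not part of the original script.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Build the same kind of random text as the original script; seeding the
# generator first makes the text identical from run to run.
sub build_text {
    my ($seed, $lines) = @_;
    srand($seed);
    my $text = '';
    for (1 .. $lines) {
        $text .= chr(int(rand 60) + 32) for 1 .. int(rand 50);
        $text .= "\n";
    }
    return $text;
}

# Time one pattern over a fixed list of offsets; returns (seconds, hits).
sub time_regex {
    my ($re, $text, $positions) = @_;
    my $hits  = 0;
    my $start = time();
    for my $p (@$positions) {
        pos($text) = $p;             # \G anchors the match at this offset
        $hits++ if $text =~ /$re/g;
    }
    return (time() - $start, $hits);
}

my $text      = build_text(42, 1000);
my @positions = map { int(rand(length $text)) } 1 .. 10_000;

my %patterns = (
    'look-behind' => qr{\G(?<=\n)=[^\n]*}ism,
    'caret'       => qr{\G^=[^\n]*}ism,
);
for my $name ('look-behind', 'caret') {
    my ($secs, $hits) = time_regex($patterns{$name}, $text, \@positions);
    printf "%-11s %.4f s, %d hits\n", $name, $secs, $hits;
}
```

Both patterns are timed over the same offsets, so the ratio of the two times is the figure to compare between perl versions.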

Ilya Zakharevich · Feb 16, 2009

Should I report this to (e-mail address removed) ?

You'd better. I discovered it profiling edits to FreezeThaw;

the REx is /\G\$(\d+)\|/
the string is a concatenation of 2N copies of $1000|;

one matches with pos() set at 6N (so it should match immediately: the
offset is known, the length is bounded, and even if it looks for
"floating anchor" [which is '|'], it is located very close, at offset
5).

time perl -wle "($n,$c) = @ARGV; $s = q($1000|) x (2*$n); pos($s) = 6 * $n; $s =~ /\G\$(\d+)\|/ for 1..$c" 1e6 15

Also run with 1e6 5, and 1e2 15.

It finishes in time linear in the second argument (as expected); but
the increment is much quicker with 1e2 than with 1e6, which I do not
think is correct behaviour.

Thanks,
Ilya
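
The one-liner above can also be written as a script that reports the time per match for several string sizes. This is a sketch under assumptions: the helper name time_matches and the set of sizes are chosen for illustration, while the string, the pos() setting and the regexp are taken from the post.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Build 2*$n copies of '$1000|', set pos() to the middle (offset 6*$n)
# and run the \G match $count times, as the one-liner does.
sub time_matches {
    my ($n, $count) = @_;
    my $s = q($1000|) x (2 * $n);
    pos($s) = 6 * $n;
    my $hits  = 0;
    my $start = time();
    for (1 .. $count) {
        $hits++ if $s =~ /\G\$(\d+)\|/;   # no /g, as in the one-liner
    }
    return (time() - $start, $hits);
}

for my $n (1e2, 1e4, 1e6) {
    my ($secs, $hits) = time_matches($n, 15);
    printf "n=%-8d %d/%d matched, %.6f s per match\n",
        $n, $hits, 15, $secs / 15;
}
```

If the match really is attempted only at pos(), the time per match should stay roughly flat as n grows; a per-match time that grows with n is the behaviour described in the post.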
