Regexp discovery - using ^ with /m is a time sink


K

Koszalek Opalek

#!/usr/bin/perl

=pod

The code below benchmarks two regexp's that look for lines
starting with the equals sign in a multiline string. The
regexps differ only by how the line break is detected.
The first regexp uses the ^ metacharacter and the /m flag:
qr{\G^=[^\n]*}ism
The other relies on a negative look-behind assertion:
qr{\G(?<=\n)=[^\n]*}ism

One difference between the two regexps is that the ^
version matches '=' at the beginning of the string,
whereas the other does not. But there is something
else as well. The second version is at least 50 times
faster!

Note that both regexps use the \G assertion (match only at
pos()) -- and the position is set to a random number in each
loop iteration. I assumed both regexp's will be very fast
(because the have to be checked only at one pos in string)
-- apparently not so.

Could someone explain what's going behind the scenes in
the regexp engine? Is it scanning the complete string for
line breaks if I use ^, even though it has to match only
at pos() ?

K.

=cut


use strict;
use Time::HiRes qw( time );
$| = 1;

my $gibberish;
for( 1 .. 1000 ) {
for( 1 .. int(rand 50) ) {
$gibberish .= chr( int( rand 60) + 32 );
};
$gibberish .= "\n";
}

my $l = length $gibberish;
my $cnt = 100_000;


my @positions;
for( 1 .. $cnt ) { push @positions, int( rand $l) };

print "String length: $l.\n\n";
for my $re (
qr{\G(?<=\n)=[^\n]*}ism,
qr{\G^=[^\n]*}ism,
) {
my $succ = 0;
my $start = time;
foreach ( @positions ) {
pos $gibberish = $_;
$succ++ if( $gibberish =~ m/$re/g );
};
print "Regexp: $re.\n";
print "Successful matches $succ.\n";
printf "Time = %f.\n\n", time - $start;
};

print "$cnt matches for each regexp.\n";
 
Ad

Advertisements

I

Ilya Zakharevich

Note that both regexps use the \G assertion (match only at
pos()) -- and the position is set to a random number in each
loop iteration. I assumed both regexp's will be very fast
(because the have to be checked only at one pos in string)
-- apparently not so.

Could someone explain what's going behind the scenes in
the regexp engine? Is it scanning the complete string for
line breaks if I use ^, even though it has to match only
at pos() ?

There is something fishy with how optimizer treats \G. I might have
skipped some case(s), and it was not corrected in the years passed...

I also found it by benchmarking. Had no time to look into the
sources...

Yours,
Ilya
 
E

Eric Pozharski

#!/usr/bin/perl

=pod

The code below benchmarks two regexp's that look for lines
starting with the equals sign in a multiline string. The
regexps differ only by how the line break is detected.
The first regexp uses the ^ metacharacter and the /m flag:
qr{\G^=[^\n]*}ism
The other relies on a negative look-behind assertion:
qr{\G(?<=\n)=[^\n]*}ism
*SKIP*
Could someone explain what's going behind the scenes in
the regexp engine? Is it scanning the complete string for
line breaks if I use ^, even though it has to match only
at pos() ?

You can do it yourself. Your distro is supposed to provide B<perl>
compiled with I<-DDEBUGGING> enabled. Than use I<-D512> option,
F<perldebguts> has more.

=cut


use strict;
use Time::HiRes qw( time );
$| = 1;

my $gibberish;
for( 1 .. 1000 ) {
for( 1 .. int(rand 50) ) {
$gibberish .= chr( int( rand 60) + 32 );
};
$gibberish .= "\n";
}

Please don't. (if I retranslate it back to English correctly) "Random
random number generators cycle after random number of cycles"
(attributed to Knuth). Once you use random patterns you'll get random
results -- if your results are random then no-one (you -- first) can't
trust those results.

I've tried your REs -- and for me successful look-behind is slightly
faster than anything else. (I should admit I've never used C<m//gism>,
and C<qr/\G/> so that's possible I've messed something up.)

*CUT*
 
K

Koszalek Opalek

Once you use random patterns you'll get random
results -- if your results are random then no-one (you -- first) can't
trust those results.

The test result is the time ratio (regexp1/regexp2).
You can hardly call it random:
http://en.wikipedia.org/wiki/Law_of_large_numbers

I've tried your REs -- and for me successful look-behind is slightly
faster than anything else.  (I should admit I've never used C<m//gism>,
and C<qr/\G/> so that's possible I've messed something up.)

Have you tried just REs or have you run the code that
I posted? I used 5.8.8 for the benchmark but I'm pretty
sure I would have noticed if it ran any faster in 5.10.

Anyway, I'm compiling 5.10 (with the -DDEBUGGING) that
you mentioned elsethread and will try to investigate
further.

K.
 
Ad

Advertisements

I

Ilya Zakharevich

Should I report this to (e-mail address removed) ?

You better do. I discovered it profiling edits to FreezeThaw;

the REx is /\G\$(\d+)\|/
the string is a concatenation of 2N copies of $1000|;

one matches with pos() set at 6N (so it should match immediately: the
offset is known, the length is bounded, and even if it looks for
"floating anchor" [which is '|'], it is located very close, at offset
5).

time perl -wle "($n,$c) = @ARGV; $s = q($1000|) x (2*$n); pos($s) = 6 * $n; $s =~ /\G\$(\d+)\|/ for 1..$c" 1e6 15

also run with 1e6 5, and 1e2 15.

It finishes with linear time in the second argument (as expected); but
the increment is much quickier with 1e2 than with 1e6, which I do not
think is a correct behaviour.

Thanks,
Ilya
 
Ad

Advertisements


Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top