Regular expression 'c' modifier


gamo

Recently I saw a script of mine crushed in a speed comparison with
the C version. The two main causes were using bigint, which I don't
really need since 'use integer' could do the job, and not using
the /gc modifier in a regex instead of a plain /g. I think
that the documentation of the c modifier is not very clear
about its importance. Here is a comparison:

#!/usr/bin/perl -W

use strict;
use Benchmark qw(cmpthese);

my $string = "aabc" x 8192;
my $i;

cmpthese(-3, {
    g  => sub { while ($string =~ /(a+)/g)  { $i = $1; } },
    gc => sub { while ($string =~ /(a+)/gc) { $i = $1; } },
});

__END__

         Rate        g       gc
g       379/s       --    -100%
gc 12105006/s 3192107%       --

Peter J. Holzer

Yes, matching 0 times in a string of length 2 is much faster than
matching 8192 times in a string of length 32768.

It is always suspicious if you get such a huge speedup and it is a good
idea to check that the new code is really equivalent to the old one.
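
A minimal sketch of such a check, assuming the same test string as in the benchmark above. The helper names count_g and count_gc are made up for illustration. Each one counts how often the loop body runs in a single pass, and each is called twice in a row, the way a benchmark harness calls its subs repeatedly:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "aabc" x 8192;

# Count how many times the loop body runs in one pass over $string.
# (count_g and count_gc are hypothetical helper names.)
sub count_g  { my $n = 0; $n++ while $string =~ /(a+)/g;  return $n; }
sub count_gc { my $n = 0; $n++ while $string =~ /(a+)/gc; return $n; }

# Call each version twice in a row, as a benchmark harness would.
my @g  = (count_g(),  count_g());
my @gc = (count_gc(), count_gc());

print "g:  @g\n";
print "gc: @gc\n";
```

If the two versions were equivalent, both lines would show the same counts. Here the /g version reports 8192 matches on both passes, while the /gc version reports 8192 on the first pass and 0 on the second.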

hp
 
Bjoern Hoehrmann

* Peter J. Holzer wrote in comp.lang.perl.misc:
> Yes, matching 0 times in a string of length 2 is much faster than
> matching 8192 times in a string of length 32768.

To elaborate on that, the pos() of a string is a property of the string,
and ordinarily the position would be reset on a match failure. With 'c'
the position is not reset, so after the first round through the loop the
`substr $string, pos $string` string would just be 'bc' which does not
match /(a+)/ so regardless of how many times `cmpthese` calls the `gc`
version, the loop body is executed only the first time if nothing resets
the string position.
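
A small sketch of that behaviour on a single "aabc" string. The helper show_pos and the variable names are made up for illustration; the only assumption is that nothing else touches pos() of the string between the loops:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "aabc";

# pos() returns undef when no position is set; make that printable.
# (show_pos is a hypothetical helper name.)
sub show_pos { my $p = pos($string); return defined $p ? $p : 'undef'; }

my ($n_g, $n_gc) = (0, 0);

$n_g++ while $string =~ /(a+)/g;      # the final failed match resets pos()
my $after_g = show_pos();

$n_gc++ while $string =~ /(a+)/gc;    # the final failed match keeps pos()
my $after_gc = show_pos();

# What a further /gc pass gets to look at, and how often it matches.
my $rest = substr($string, pos($string));
my $second_pass = 0;
$second_pass++ while $string =~ /(a+)/gc;

# Resetting the position by hand lets the next pass start from the front.
pos($string) = 0;
my $third_pass = 0;
$third_pass++ while $string =~ /(a+)/gc;

print "pos after /g loop:   $after_g\n";
print "pos after /gc loop:  $after_gc\n";
print "text left for /gc:   $rest\n";
print "matches, 2nd pass:   $second_pass\n";
print "matches after reset: $third_pass\n";
```

After the /g loop pos() is undefined again; after the /gc loop it stays at 2, so the second /gc pass only sees "bc" and matches nothing. A benchmark of /gc would need something like the explicit `pos($string) = 0` before each pass to measure the same work as the /g version.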
 
gamo

On 24/11/13 22:19, Bjoern Hoehrmann wrote:
> To elaborate on that, the pos() of a string is a property of the string,
> and ordinarily the position would be reset on a match failure. With 'c'
> the position is not reset, so after the first round through the loop the
> `substr $string, pos $string` string would just be 'bc' which does not
> match /(a+)/ so regardless of how many times `cmpthese` calls the `gc`
> version, the loop body is executed only the first time if nothing resets
> the string position.

My fault. Taking a longer string and doing only one pass, Time::HiRes
says that /gc is only slightly better than /g. The number of matches
no longer changes between runs, as it does with a sub inside cmpthese.

Fortunately I don't have to change anything in the original code, as
the results are as expected with or without the regex. Anyway, I
want to know now the difference between a regex like
while ($string =~ /(\d+)/gc){ $i=$1; #... }
and the index/substr equivalent when the $string contains digits and
only one character \n between numbers. I have to look at m and s
modifiers, too.

Thanks
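
On the m and s modifiers: /m only changes where ^ and $ can match, and /s only changes whether a dot matches "\n", so a pattern like /(\d+)/ that uses none of them should behave the same with or without either modifier. A small sketch, using a made-up sample string in the shape described above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: numbers separated by a single "\n".
my $string = "1123\n42\n7\n";

# In list context, /g returns every capture in one go.
my @plain = $string =~ /(\d+)/g;
my @m     = $string =~ /(\d+)/gm;
my @s     = $string =~ /(\d+)/gs;

print "plain: @plain\n";
print "/m:    @m\n";
print "/s:    @s\n";

# The modifiers only start to matter once the pattern is anchored.
my @anchored   = $string =~ /^(\d+)$/g;    # ^ and $ refer to the whole string
my @anchored_m = $string =~ /^(\d+)$/gm;   # ^ and $ refer to every line

print "anchored:    ", scalar(@anchored), " matches\n";
print "anchored /m: @anchored_m\n";
```

All three unanchored variants return 1123, 42 and 7. The anchored pattern finds nothing without /m, because the whole string is not a single number, and finds all three numbers with /m.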
 
gamo

On 24/11/13 22:46, gamo wrote:
> want to know now the difference between a regex like
> while ($string =~ /(\d+)/gc){ $i=$1; #... }
> and the index/substr equivalent when the $string contains digits and
> only one character \n between numbers. I have to look at m and s
> modifiers, too.
>
> Thanks

I was able to get a slightly better result with index/substr/length:

Time /gc = 3.466519 s.
Time ind = 3.106548 s.
Counters: 8388608, 8388608

with this code:


#!/usr/bin/perl -W

use strict;

my $string = "1123\n" x (8192 * 1024);
my $i;
my $n = chr(ord("\n"));
my ($c1, $c2);

use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
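# Regex version: walk the string with /gc and count the matches.
while ($string =~ /(\d+)/gc) {
    $i = $1;
    $c1++ if ($i == 1123);
}
my $t1 = [gettimeofday];

# index/substr version: find each "\n" and cut out the field before it.
my $j;
my $k = 0;
while ($k < length($string)) {
    $j = index($string, $n, $k + 1);
    $i = substr($string, $k, $j - $k);
    $k += length($i) + 1;
    $c2++ if ($i == 1123);
}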
my $t2 = [gettimeofday];

print "Time /gc = ", tv_interval($t0,$t1), " s.\n";
print "Time ind = ", tv_interval($t1,$t2), " s.\n";
print "Counters: $c1, $c2\n";

__END__


But it's rather strange to use; it's not simple.

Best regards
 
