Regexp discovery - using ^ with /m is a time sink

Discussion in 'Perl Misc' started by Koszalek Opalek, Feb 14, 2009.

  1. #!/usr/bin/perl

    =pod

    The code below benchmarks two regexp's that look for lines
    starting with the equals sign in a multiline string. The
    regexps differ only by how the line break is detected.
    The first regexp uses the ^ metacharacter and the /m flag:
    qr{\G^=[^\n]*}ism
    The other relies on a negative look-behind assertion:
    qr{\G(?<=\n)=[^\n]*}ism

    One difference between the two regexps is that the ^
    version matches '=' at the beginning of the string,
    whereas the other does not. But there is something
    else as well. The second version is at least 50 times
    faster!

    Note that both regexps use the \G assertion (match only at
    pos()) -- and the position is set to a random number in each
    loop iteration. I assumed both regexp's will be very fast
    (because the have to be checked only at one pos in string)
    -- apparently not so.

    Could someone explain what's going behind the scenes in
    the regexp engine? Is it scanning the complete string for
    line breaks if I use ^, even though it has to match only
    at pos() ?

    K.

    =cut


    use strict;
    use Time::HiRes qw( time );
    $| = 1;

    my $gibberish;
    for( 1 .. 1000 ) {
    for( 1 .. int(rand 50) ) {
    $gibberish .= chr( int( rand 60) + 32 );
    };
    $gibberish .= "\n";
    }

    my $l = length $gibberish;
    my $cnt = 100_000;


    my @positions;
    for( 1 .. $cnt ) { push @positions, int( rand $l) };

    print "String length: $l.\n\n";
    for my $re (
    qr{\G(?<=\n)=[^\n]*}ism,
    qr{\G^=[^\n]*}ism,
    ) {
    my $succ = 0;
    my $start = time;
    foreach ( @positions ) {
    pos $gibberish = $_;
    $succ++ if( $gibberish =~ m/$re/g );
    };
    print "Regexp: $re.\n";
    print "Successful matches $succ.\n";
    printf "Time = %f.\n\n", time - $start;
    };

    print "$cnt matches for each regexp.\n";
     
    Koszalek Opalek, Feb 14, 2009
    #1
    1. Advertising

  2. On 2009-02-14, Koszalek Opalek <> wrote:
    > Note that both regexps use the \G assertion (match only at
    > pos()) -- and the position is set to a random number in each
    > loop iteration. I assumed both regexp's will be very fast
    > (because the have to be checked only at one pos in string)
    > -- apparently not so.
    >
    > Could someone explain what's going behind the scenes in
    > the regexp engine? Is it scanning the complete string for
    > line breaks if I use ^, even though it has to match only
    > at pos() ?


    There is something fishy with how optimizer treats \G. I might have
    skipped some case(s), and it was not corrected in the years passed...

    I also found it by benchmarking. Had no time to look into the
    sources...

    Yours,
    Ilya
     
    Ilya Zakharevich, Feb 15, 2009
    #2
    1. Advertising

  3. On 2009-02-14, Koszalek Opalek <> wrote:
    > #!/usr/bin/perl
    >
    >=pod
    >
    > The code below benchmarks two regexp's that look for lines
    > starting with the equals sign in a multiline string. The
    > regexps differ only by how the line break is detected.
    > The first regexp uses the ^ metacharacter and the /m flag:
    > qr{\G^=[^\n]*}ism
    > The other relies on a negative look-behind assertion:
    > qr{\G(?<=\n)=[^\n]*}ism
    >

    *SKIP*
    > Could someone explain what's going behind the scenes in
    > the regexp engine? Is it scanning the complete string for
    > line breaks if I use ^, even though it has to match only
    > at pos() ?


    You can do it yourself. Your distro is supposed to provide B<perl>
    compiled with I<-DDEBUGGING> enabled. Than use I<-D512> option,
    F<perldebguts> has more.

    If my understanding of I<-D512> output is correct, then answer is: no,
    perl doesn't.

    >
    >=cut
    >
    >
    > use strict;
    > use Time::HiRes qw( time );
    > $| = 1;
    >
    > my $gibberish;
    > for( 1 .. 1000 ) {
    > for( 1 .. int(rand 50) ) {
    > $gibberish .= chr( int( rand 60) + 32 );
    > };
    > $gibberish .= "\n";
    > }


    Please don't. (if I retranslate it back to English correctly) "Random
    random number generators cycle after random number of cycles"
    (attributed to Knuth). Once you use random patterns you'll get random
    results -- if your results are random then no-one (you -- first) can't
    trust those results.

    I've tried your REs -- and for me successful look-behind is slightly
    faster than anything else. (I should admit I've never used C<m//gism>,
    and C<qr/\G/> so that's possible I've messed something up.)

    *CUT*

    --
    Torvalds' goal for Linux is very simple: World Domination
    Stallman's goal for GNU is even simpler: Freedom
     
    Eric Pozharski, Feb 15, 2009
    #3
  4. On Feb 15, 2:20 pm, Ilya Zakharevich <> wrote:

    > I also found it by benchmarking.  Had no time to look into the
    > sources...


    Should I report this to ?

    K.
     
    Koszalek Opalek, Feb 16, 2009
    #4
  5. On Feb 16, 1:03 am, Eric Pozharski <> wrote:

    > Once you use random patterns you'll get random
    > results -- if your results are random then no-one (you -- first) can't
    > trust those results.


    The test result is the time ratio (regexp1/regexp2).
    You can hardly call it random:
    http://en.wikipedia.org/wiki/Law_of_large_numbers


    > I've tried your REs -- and for me successful look-behind is slightly
    > faster than anything else.  (I should admit I've never used C<m//gism>,
    > and C<qr/\G/> so that's possible I've messed something up.)


    Have you tried just REs or have you run the code that
    I posted? I used 5.8.8 for the benchmark but I'm pretty
    sure I would have noticed if it ran any faster in 5.10.

    Anyway, I'm compiling 5.10 (with the -DDEBUGGING) that
    you mentioned elsethread and will try to investigate
    further.

    K.
     
    Koszalek Opalek, Feb 16, 2009
    #5
  6. On 2009-02-16, Koszalek Opalek <> wrote:
    > On Feb 15, 2:20 pm, Ilya Zakharevich <> wrote:
    >
    >> I also found it by benchmarking.  Had no time to look into the
    >> sources...

    >
    > Should I report this to ?


    You better do. I discovered it profiling edits to FreezeThaw;

    the REx is /\G\$(\d+)\|/
    the string is a concatenation of 2N copies of $1000|;

    one matches with pos() set at 6N (so it should match immediately: the
    offset is known, the length is bounded, and even if it looks for
    "floating anchor" [which is '|'], it is located very close, at offset
    5).

    time perl -wle "($n,$c) = @ARGV; $s = q($1000|) x (2*$n); pos($s) = 6 * $n; $s =~ /\G\$(\d+)\|/ for 1..$c" 1e6 15

    also run with 1e6 5, and 1e2 15.

    It finishes with linear time in the second argument (as expected); but
    the increment is much quickier with 1e2 than with 1e6, which I do not
    think is a correct behaviour.

    Thanks,
    Ilya
     
    Ilya Zakharevich, Feb 16, 2009
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. =?Utf-8?B?b3R0bw==?=

    Using custom sink and cookies from ASP.NET application

    =?Utf-8?B?b3R0bw==?=, Dec 21, 2004, in forum: ASP .Net
    Replies:
    1
    Views:
    454
    Scott Allen
    Dec 21, 2004
  2. Marcin Vorbrodt

    Re: auto_ptr and sink function

    Marcin Vorbrodt, Sep 22, 2003, in forum: C++
    Replies:
    0
    Views:
    448
    Marcin Vorbrodt
    Sep 22, 2003
  3. Marcin Vorbrodt

    auto_ptr and sink function

    Marcin Vorbrodt, Sep 22, 2003, in forum: C++
    Replies:
    4
    Views:
    471
    tom_usenet
    Sep 22, 2003
  4. Chinedu

    Building an ATL ActiveX Sink

    Chinedu, Mar 9, 2005, in forum: C++
    Replies:
    1
    Views:
    520
    Thomas Matthews
    Mar 9, 2005
  5. Joao Silva
    Replies:
    16
    Views:
    366
    7stud --
    Aug 21, 2009
Loading...

Share This Page