Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???

Discussion in 'Perl Misc' started by Gunnar Hjalmarsson, Feb 18, 2005.

  1. Do you need to optimise the program in this respect? If not, why would
    you have a problem? ;-)
     
    Gunnar Hjalmarsson, Feb 18, 2005
    #1
    1. Advertisements

  2. Gunnar Hjalmarsson

    jolly Guest

    I've got a problem with the perl regex compiler. It seems that
    compliation of combined regexes ( or alternation whatever you call it
    ) is not optimized.

    Using a /(foo|bar)/ regex on strings is slower than using a foreach
    loop doing the matching one after another. I've written a testprogramm
    and looked at the perl source to find out why. Now I know. It seems
    that DFA won't get optimised for the alternation.


    As I have no time and knowledge and skill for optimising the perlregex
    compiler from scratch, what can I do. Programming such foreach loops
    gives me headaches - it such 'awk'ward.


    Here's the testprogram for those of you that don't think it's true:

    #!/bin/perl
    use strict;
    use Digest::MD5 qw(md5 md5_hex md5_base64);
    use Time::HiRes qw(time );
    #use re 'debug' ;


    foreach my $regexcount (1,5,10)
    {
    foreach my $regexlength (2,5,10,20)
    {
    my @items = map{ createRandomTextWithLength($regexlength); }
    (1..$regexcount);
    my $regexstr = join('|',@items);
    my $regex = qr /(?:$regexstr)/;

    foreach my $stringlength (100,1000,10000,100000)
    {
    print localtime()." Stringlength: $stringlength Number of
    Regexes:$regexcount Length of each Regex:$regexlength\n";

    my $teststring = createRandomTextWithLength($stringlength);
    my $timer;
    {
    my $test=$teststring;
    $timer =time;
    $test =~ s/$regex/foobar/g;
    printf("ElapsedTime:%5.4f %20s
    %20s\n",time-$timer,md5_hex($test),$regex);
    }

    {
    my $test=$teststring;
    $timer =time;
    foreach my $oneregex (@items)
    {
    $test =~ s/$oneregex/foobar/g;
    }
    printf("ElapsedTime:%5.4f %20s
    %20s\n",time-$timer,md5_hex($test),' for loop over '.join(',',@items));
    }
    print "\n";
    }
    }
    }

    sub createRandomTextWithLength($)
    {
    my($count) = (@_);
    my $string;
    for (1.. $count)
    {
    $string.=chr(ord('a')+rand(20));
    }
    return $string;
    }
     
    jolly, Feb 18, 2005
    #2
    1. Advertisements

  3. Gunnar Hjalmarsson

    jolly Guest

    Yep, it's an optimisation issue. I always thought that using the
    /(foo|bar)/ would be the quickest way. So lot's of code has been
    already written with that in mind.
    It has become a problem lately as the strings I do regexes on tend to
    get larger ( e.q. xml-files ) and the performance penalty is HUGE.

    I thought' that maybe someone has a solution for the problem. I haven't
    taken a look how parrot works with regexes. I'm thinking of writing a
    Module with an optimised parser for such regexes but that would be my
    last resort.

    Jolly
     
    jolly, Feb 18, 2005
    #3
  4. [ Please provide some context when replying to a message. ]

    XML files are seldom very large.

    Anyway, are "foo" and "bar" plain strings? If they are, you may want to
    try using the index() function instead of the regex engine for better
    efficiency.
     
    Gunnar Hjalmarsson, Feb 18, 2005
    #4
  5. Yes, this is even mentioned in the FAQ, albeit in an oblique way.

    The solution to FAQ "How do I efficiently match many regular expressions
    at once?" does not mention joining them with '|'. It would IMNSHO be
    better if it were to explicitly mention that joining was not efficient.
     
    Brian McCauley, Feb 18, 2005
    #5
  6. Gunnar Hjalmarsson

    JollyJinx Guest

    A pity that Foo and bar are NOT plain strings the real thing looks more
    like:

    $self->{MATCHCACHE}= '([^[:alnum:]\xc0-\xff]('.join('|', map{
    RockBottom::buildMatchFromString($_) }(sort
    {length($b)<=>length($a)}(@matcharray)) ).')[^[:alnum:]\xc0-\xff])';

    buildMatchFromString builds multiple versions of a string and it builds
    regexes not just plain strings .

    The array contains thousands of matches and replacement is into a hash
    of strings ( actually a function ). But anyways the XML files are not
    small ( > 100 kBytes ).


    index isn't an option here ;-(
     
    JollyJinx, Feb 18, 2005
    #6
  7. Gunnar Hjalmarsson

    xhoster Guest

    Maybe, but I don't think you've demonstrated that.
    Well, they also aren't doing the same thing. So that makes any comparison
    rather meaningless.
    What part in the perl source tipped you off that they aren't optimized?

    You can write one subroutine or module, and then use it over and over.
    That way you only have to program the foreach loop once.
    Try adding an assertion to your code to check that the md5_hex of each
    $test are actually equal.

    Xho
     
    xhoster, Feb 18, 2005
    #7
  8. Really? I've transitioned some code from looping to a single regex
    alternation and got over an order of magnitude performance increase. It
    may simply depend on the complexity and size of the items. I was
    dealing with over a thousand items, so the penalty for running perl
    opcodes on each one was significant. The regex was probably about a
    megabyte when stuffed together, but was much faster.
    As a general case, I wouldn't agree with such an assertion.
     
    Darren Dunham, Feb 18, 2005
    #8
  9. The book "Mastering Regular Expressions" has a section titled "Perl
    Efficiency Issues" that talks about just this issue. You might want to
    look there.
     
    Glenn Jackman, Feb 19, 2005
    #9
  10. This is not my experience, but as you sat it probably depends on the
    complexity. Here's a quick benchmark I threw together to demonstrate
    that one regex is slower than many.

    use strict;
    use warnings;
    use Benchmark;

    my @many = map { "Match$_" } 1 .. 1000;
    my $one = join '|', @many;

    $_ = qr/$_/ for $one, @many;

    my $data = ('random stuff' x 100) . 'Match999';

    timethese 1000 => {
    one => sub { $data =~ $one },
    many => sub { grep { $data =~ $_} @many },
    };

    __END__
    Benchmark: timing 1000 iterations of many, one...
    many: 2 wallclock secs ( 2.76 usr + 0.00 sys = 2.76 CPU) @
    361.79/s (n=1000)
    one: 33 wallclock secs (32.81 usr + 0.00 sys = 32.81 CPU) @
    30.48/s (n=1000)
    I think it _is_ true in the general case - although it is possible what
    there exist special cases where it is not.

    Perhaps you can throw one together to illustrate a case where the single
    regex is faster and then make the case for why that case should not be
    considered exceptional.
     
    Brian McCauley, Feb 19, 2005
    #10
  11. Also sprach Brian McCauley:
    So far, this is indeed the case (mostly). Right now though some attempts
    are made by the perl-porters to rectify this. A few days ago a patch was
    proposed that turned regexes containing alternatives (ideally with some
    common prefix) into some sort of trie-structure that will speed up
    matching and that should even beat the multiple-regex approach.

    Tassilo
     
    Tassilo v. Parseval, Feb 19, 2005
    #11
  12. Gunnar Hjalmarsson

    brian d foy Guest

    I missed this discussion before, but the blead version of the FAQ
    does have an alternation example, but punts to Mastering Regular
    Expressions for the full details on why alternation can be really
    slow. If anyone has a better answer, they can post a patch. :)

    http://faq.perl.org/perlfaq6.html#How_do_I_efficiently
     
    brian d foy, Feb 20, 2005
    #12
  13. I went back and found the code, and I guess it was a bit of a special
    case, but I didn't think it was at the time. It turns out that my
    matches were anchored on the line, and that appears to make a big
    difference.

    Change
    to

    $one = qr/^($one)/;
    $_ = qr/^$_/ for $one, @many;

    Now this benchmark isn't quite what my task was doing, but it does show
    the big improvement if the match is anchored. I'll have to keep in mind
    that the anchor is helping me out in this case. That wasn't something I
    noticed when I was working on it initially.
     
    Darren Dunham, Feb 24, 2005
    #13
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.