There's no LIMIT to my SPLITting headache.

Discussion in 'Perl Misc' started by use63net@yahoo.com, Jun 11, 2005.

  1. Guest

    split /PATTERN/, EXPR, LIMIT

    The use of a LIMIT is quite helpful if one is searching for fields
    closer to the beginning of the EXPR string, but what if your
    target is nearer the end of the string? In this case, for very long
    strings, using split, even with a LIMIT, will cause noticeable delays
    in time-intensive applications. Does anyone have a workaround or
    alternative that will be faster?
     
    , Jun 11, 2005
    #1

  2. Brian Wakem Guest

    wrote:

    > split /PATTERN/, EXPR, LIMIT
    >
    > The use of a LIMIT is quite helpful if one is searching for fields
    > closer to the beginning of the EXPR string, but what if your
    > target is nearer the end of the string? In this case, for very long
    > strings, using split, even with a LIMIT, will cause noticeable delays
    > in time-intensive applications. Does anyone have a workaround or
    > alternative that will be faster?



    If you *know* the target is near the end you could 'reverse' the string
    first.


    --
    Brian Wakem
     
    Brian Wakem, Jun 11, 2005
    #2

  3. Brian McCauley

    wrote:

    > split /PATTERN/, EXPR, LIMIT
    >
    > The use of a LIMIT is quite helpful if one is searching for fields
    > closer to the beginning of the EXPR string, but what if your
    > target is nearer the end of the string? In this case, for very long
    > strings, using split, even with a LIMIT, will cause noticeable delays
    > in time-intensive applications. Does anyone have a workaround or
    > alternative that will be faster?


    If you need only a single piece (say 42) and there are more than 42
    pieces...

    my ( $piece42 ) = /^(?:(.*?)PATTERN){42}/;
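
    A small worked sketch of this idea, assuming a comma as the delimiter
    and a made-up ten-field line (the variable names are illustrative, not
    from the post). A capture group inside a repeated group keeps whatever
    it matched on the last repetition, so after three repetitions $1 holds
    the third field. As noted above, this only works when at least one more
    delimiter follows the wanted field:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample data: "f1,f2,...,f10"
    my $line = join ',', map { "f$_" } 1 .. 10;

    # (.*?) is overwritten on every repetition of the outer group,
    # so after {3} the capture holds the third field.
    my ($third) = $line =~ /^(?:(.*?),){3}/;

    print "$third\n";   # prints "f3"
    ```

    The match stops right after the third delimiter, so the rest of the
    line is never scanned.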
     
    Brian McCauley, Jun 11, 2005
    #3
  4. Guest

    wrote:
    > split /PATTERN/, EXPR, LIMIT
    >
    > The use of a LIMIT is quite helpful if one is searching for fields
    > closer to the beginning of the EXPR string, but what if your
    > target is nearer the end of the string? In this case, for very long
    > strings, using split, even with a LIMIT, will cause noticeable delays
    > in time-intensive applications. Does anyone have a workaround or
    > alternative that will be faster?


    Personally, my policy is not to precede the real data with very long
    strings of unnecessary crap. At least not in highly time sensitive
    applications.

    If you can't fix whatever it is that is generating this poorly thought out
    data, then maybe you could reverse the string and split that. Or use
    substring to just grab the end of the string. Or maybe use a regex to
    capture just what you want. Or use a database.
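
    As one illustration of the substring route, here is a minimal sketch
    that pulls the last two fields off a line with rindex and substr,
    without ever touching the front of the string. It assumes the delimiter
    is a single literal comma and that the line contains at least two of
    them; the sample data and variable names are invented for the example:

    ```perl
    use strict;
    use warnings;

    # 1000 leading fields we do not care about, then the two we want.
    my $line = join ',', (1 .. 1000), 'Foo', 'Bar';

    # Locate the last two commas, searching backwards from the end.
    my $last_comma = rindex $line, ',';
    my $prev_comma = rindex $line, ',', $last_comma - 1;

    my $last        = substr $line, $last_comma + 1;
    my $second_last = substr $line, $prev_comma + 1,
                             $last_comma - $prev_comma - 1;

    print "$second_last $last\n";   # prints "Foo Bar"
    ```

    This does not handle a delimiter pattern such as /\s*,\s*/; for that,
    the reversed split is probably the simpler option.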

    Xho

     
    , Jun 11, 2005
    #4
  5. Mark Seger Guest


    >>The use of a LIMIT is quite helpful if one is searching for fields
    >>closer to the beginning of the EXPR string, but what if your
    >>target is nearer the end of the string? In this case, for very long
    >>strings, using split, even with a LIMIT, will cause noticeable delays
    >>in time-intensive applications. Does anyone have a workaround or
    >>alternative that will be faster?


    This may be a little off target, but can you provide some additional
    details? Just how long is the string you're trying to split? How many
    pieces are you trying to split it into?

    re: reversing the string - part of me says it may take just as long to
    reverse it as the time you'd save, but the real answer is to do a few
    timing tests if it's really all that important.

    If you know it will always be at least n-chars from the start of the
    string the suggestion about doing a substr() first also has merit, but
    again you can't beat timing tests.

    The most enlightening experience I've had with perl was discovering that:

    statement if $a=~/aaa|bbb|ccc/

    is a LOT slower than

    statement if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/

    by a factor of about 7, at least when I checked it out on both a 3.5GHz
    Xeon and a 1.5GHz Itanium 2. The moral of the story is that nothing
    beats a few simple tests.

    Of course, I still write most of my code using the first form for
    readability, as long as performance isn't critical, which it rarely
    is in most apps [my opinion]. When it DOES count and you use the second
    form, just be sure to comment your code so nobody decides to optimize it
    for you. :cool:

    -mark
     
    Mark Seger, Jun 11, 2005
    #5
  6. Guest

    Mark Seger <> wrote:
    > >>The use of a LIMIT is quite helpful if one is searching for fields
    > >>closer to the beginning of the EXPR string, but what if your
    > >>target is nearer the end of the string? In this case, for very long
    > >>strings, using split, even with a LIMIT, will cause noticeable delays
    > >>in time-intensive applications. Does anyone have a workaround or
    > >>alternative that will be faster?

    >
    > This may be a little off target, but can you provide some additional
    > details? Just how long is the string you're trying to split? How many
    > pieces are you trying to split it into?
    >
    > re: reversing the string - part of me says it may take just as long to
    > reverse it as the time you'd save, but the real answer is to do a few
    > timing tests if it's really all that important.


    It is very important. On my machine, for getting the last two fields of a
    string, reversing becomes faster at only 12 fields (187 bytes total line
    length) for a very simple regex (/,/). For a more complex regex
    (/\s*,\s*/), it was just 6 fields.

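    use strict;
    use Benchmark qw(:all);

    foreach (0,1,2,3,4,5,10,20,40,100,1000,10000) {
        # $_ random leading fields, followed by the two fields we want
        my $x = join ",", (map rand(), 1..$_), 'Foo', 'Bar';

        print "$_\t" , length $x, "\t";

        cmpthese( -3, {
            # split the whole string, keep the last two fields
            full => sub { my @x = (split /\s*,\s*/, $x)[-2,-1]; assert(@x); },
            # split the reversed string with a LIMIT, then un-reverse
            rev  => sub {
                my @x = (split /\s*,\s*/, reverse($x), 3)[0,1];
                @x = map scalar reverse($_), reverse @x;
                assert(@x);
            },
        });
    }

    # die unless the two extracted fields are 'Foo' and 'Bar'
    sub assert {
        die $_[0] unless $_[0] eq 'Foo';
        die $_[1] unless $_[1] eq 'Bar';
    }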


    Xho

     
    , Jun 11, 2005
    #6
  7. Dave Guest

    Mark Seger wrote:

    > The most enlightening experience I've had with perl was discovering that:
    >
    > statement if $a=~/aaa|bbb|ccc/
    >
    > is a LOT slower than
    >
    > statement if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/
    >
    > by a factor of about 7, at least when I checked it out on both a 3.5GHz
    > Xeon and a 1.5GHz Itanium 2. The moral to the story is nothing beats a
    > few simple tests.


    Not by a long shot on my machine, running Perl 5.8.7.

    =code

    use Benchmark qw(:hireswallclock cmpthese);

    my $pat = 'A Perl Paaattern';
    my $match = 0;

    cmpthese(0, {
    'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    });

    =cut

    Time::HiRes is compiled on my machine.

    Granted, that's a *very* simple pattern, but the shorter code came out
    ahead 5/5 times by a wide margin.

    ___________________________
    Rate Code1 Code2
    ___________________________
    Code1 2775928/s -- -62%
    Code2 7368521/s 165% --
    ___________________________
    Code1 2888905/s -- -63%
    Code2 7777375/s 169% --
    ___________________________
    Code2 2647141/s -- -64%
    Code1 7289831/s 175% --
    ___________________________
    Code1 2742926/s -- -62%
    Code2 7313196/s 167% --
    ___________________________
    Code1 2803761/s -- -61%
    Code2 7211258/s 157% --
    ---------------------------

    Or if I set $pat to a string of the output from 'perldoc Time::HiRes'

    Code1 2657898/s -- -70%
    Code2 8771367/s 230% --


    Dave
     
    Dave, Jun 12, 2005
    #7
  8. Sisyphus Guest

    "Dave" <> wrote in message
    news:...

    >
    > use Benchmark qw(:hireswallclock cmpthese);
    >
    > my $pat = 'A Perl Paaattern';
    > my $match = 0;
    >
    > cmpthese(0, {
    > 'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    > 'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    > });
    >


    use Benchmark qw(:hireswallclock cmpthese);

    $pat = 'A Perl Paaattern';
    $match = 0;

    cmpthese(0, {
    'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    });

    __END__

    All I've done is delete both instances of "my", and that produces something
    completely different:

    Rate Code2 Code1
    Code2 193959/s -- -80%
    Code1 980959/s 406% --

    For your code, I get a result very similar to the one you got:

    Rate Code1 Code2
    Code1 1153893/s -- -63%
    Code2 3114356/s 170% --

    Seems that declaring with 'my' has little effect (slight slowing down) on
    the result reported for Code1, but a significant effect (marked speeding up)
    on the result reported for Code2:

    use warnings;
    use Benchmark;

    $pat = 'A Perl Paaattern';
    $match = 0;

    my $p = 'A Perl Paaattern';
    my $m = 0;

    timethese(200000, {
    'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    'Code1A' => '$m = 1 if $p =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    'Code2A' => '$m = 1 if $p =~/aaa|bbb|ccc/',
    });

    __END__

    Name "main::match" used only once: possible typo at try.pl line 5.
    Name "main::pat" used only once: possible typo at try.pl line 4.
    Benchmark: timing 200000 iterations of Code1, Code1A, Code2, Code2A...
    Code1: 0 wallclock secs ( 0.20 usr + 0.00 sys = 0.20 CPU) @
    1000000.00/s (n=200000)
    (warning: too few iterations for a reliable count)
    Code1A: 1 wallclock secs ( 0.29 usr + 0.00 sys = 0.29 CPU) @
    687285.22/s (n=200000)
    (warning: too few iterations for a reliable count)
    Code2: 1 wallclock secs ( 1.03 usr + 0.00 sys = 1.03 CPU) @
    193798.45/s (n=200000)
    Code2A: 0 wallclock secs ( 0.07 usr + 0.00 sys = 0.07 CPU) @
    2857142.86/s (n=200000)
    (warning: too few iterations for a reliable count)

    Something dodgy going on, methinks :)

    Cheers,
    Rob
     
    Sisyphus, Jun 12, 2005
    #8
  9. Guest


    > use Benchmark qw(:hireswallclock cmpthese);
    >
    > my $pat = 'A Perl Paaattern';
    > my $match = 0;
    >
    > cmpthese(0, {
    > 'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    > 'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    > });
    >
    > =cut



    This is getting away from the original question, but this code causes
    "Use of uninitialized value in pattern match ..." warnings on my PC under
    Perl 5.8.4.
     
    , Jun 12, 2005
    #9
  10. Brian Wakem Guest

    Sisyphus wrote:

    >
    > "Dave" <> wrote in message
    > news:...
    >
    >>
    >> use Benchmark qw(:hireswallclock cmpthese);
    >>
    >> my $pat = 'A Perl Paaattern';
    >> my $match = 0;
    >>
    >> cmpthese(0, {
    >> 'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ ||
    >> $pat=~/ccc/', 'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    >> });
    >>

    >
    > use Benchmark qw(:hireswallclock cmpthese);
    >
    > $pat = 'A Perl Paaattern';
    > $match = 0;
    >
    > cmpthese(0, {
    > 'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    > 'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    > });
    >
    > __END__
    >
    > All I've done is delete both instances of "my", and that produces
    > something completely different:
    >
    > Rate Code2 Code1
    > Code2 193959/s -- -80%
    > Code1 980959/s 406% --
    >
    > For your code, I get a result very similar to the one you got:
    >
    > Rate Code1 Code2
    > Code1 1153893/s -- -63%
    > Code2 3114356/s 170% --
    >
    > Seems that declaring with 'my' has little effect (slight slowing down) on
    > the result reported for Code1, but a significant effect (marked speeding
    > up) on the result reported for Code2:
    >
    > use warnings;
    > use Benchmark;
    >
    > $pat = 'A Perl Paaattern';
    > $match = 0;
    >
    > my $p = 'A Perl Paaattern';
    > my $m = 0;
    >
    > timethese(200000, {
    > 'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    > 'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    > 'Code1A' => '$m = 1 if $p =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    > 'Code2A' => '$m = 1 if $p =~/aaa|bbb|ccc/',
    > });



    I get negative time elapsed for Code2A.

    Code2A: 0 wallclock secs (-0.01 usr + 0.00 sys = -0.01 CPU) @
    -20000000.00/s (n=200000)


    And it takes less time with more iterations.

    Code2A: 0 wallclock secs (-0.24 usr + 0.00 sys = -0.24 CPU) @
    -41666666.67/s (n=10000000)


    What is going on here?


    --
    Brian Wakem
     
    Brian Wakem, Jun 12, 2005
    #10
  11. Sisyphus Guest

    "Brian Wakem" <> wrote in message

    >
    > I get negative time elapsed for Code2A.
    >
    > Code2A: 0 wallclock secs (-0.01 usr + 0.00 sys = -0.01 CPU) @
    > -20000000.00/s (n=200000)
    >
    >
    > And it takes less time with more iterations.
    >
    > Code2A: 0 wallclock secs (-0.24 usr + 0.00 sys = -0.24 CPU) @
    > -41666666.67/s (n=10000000)
    >


    Wow ... that's one helluva argument in support of always declaring your
    variables with 'my' :)))

    Seriously, I don't know what's happening there. I do know that I'm never
    comfortable with 'use strict;' and/or 'my'/'our' when it comes to scripts
    that also 'use Benchmark;'. Whenever I want to time things, I make sure that
    the variables involved in the code being timed are global variables ....
    dunno whether that's a justifiable *basis* for feeling comfortable .... but
    it makes me *feel* more comfortable, nonetheless :)

    Actually, I know for a fact that lexical scoping can make Benchmark results
    useless and misleading - and one way to avoid that is, of course, to use
    only global variables.

    Cheers,
    Rob
     
    Sisyphus, Jun 12, 2005
    #11
  12. Brian Wakem Guest

    Sisyphus wrote:

    >
    > "Brian Wakem" <> wrote in message
    >
    >>
    >> I get negative time elapsed for Code2A.
    >>
    >> Code2A: 0 wallclock secs (-0.01 usr + 0.00 sys = -0.01 CPU) @
    >> -20000000.00/s (n=200000)
    >>
    >>
    >> And it takes less time with more iterations.
    >>
    >> Code2A: 0 wallclock secs (-0.24 usr + 0.00 sys = -0.24 CPU) @
    >> -41666666.67/s (n=10000000)
    >>

    >
    > Wow ... that's one helluva argument in support of always declaring your
    > variables with 'my' :)))
    >
    > Seriously, I don't know what's happening there. I do know that I'm never
    > comfortable with 'use strict;' and/or 'my'/'our' when it comes to scripts
    > that also 'use Benchmark;'. Whenever I want to time things, I make sure
    > that the variables involved in the code being timed are global variables
    > .... dunno whether that's a justifiable *basis* for feeling comfortable
    > .... but it makes me *feel* more comfortable, nonetheless :)
    >
    > Actually, I know for a fact that lexical scoping can make Benchmark
    > results useless and misleading - and one way to avoid that is, of course,
    > to use only global variables.
    >
    > Cheers,
    > Rob



    If I drop the my from my $m = 0 I get some proper results.

    Code2A: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @
    10000000.00/s (n=200000)
    (warning: too few iterations for a reliable count)


    So is this a bug or a feature?


    --
    Brian Wakem
     
    Brian Wakem, Jun 12, 2005
    #12
  13. Tassilo v. Parseval

    Also sprach Brian Wakem:

    > Sisyphus wrote:


    >> Actually, I know for a fact that lexical scoping can make Benchmark
    >> results useless and misleading - and one way to avoid that is, of course,
    >> to use only global variables.
    >>
    >> Cheers,
    >> Rob

    >
    >
    > If I drop the my from my $m = 0 I get some proper results.
    >
    > Code2A: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @
    > 10000000.00/s (n=200000)
    > (warning: too few iterations for a reliable count)
    >
    >
    > So is this a bug or a feature?


    It's expected behaviour:

    my $var = "value";

    cmpthese(0, {
    code1 => 'do_something_with($var)',
    ...
    });

    The code to be benchmarked is passed as string and run via STRING-eval. In
    the scope of this eval, the lexical '$var' does not exist. Lexicals
    cannot be accessed across different lexical scopes.

    That's why Benchmark can also deal with code-references, more correctly
    called a closure here. There you get the expected behaviour:

    my $var = "value";

    cmpthese(0, {
    code1 => sub { do_something_with($var) },
    ...
    });

    I never use Benchmark with code strings to be evaled. The fact that they
    don't adhere to lexical scoping is one of my reasons why.
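
    The effect can be reproduced without Benchmark. In this sketch,
    run_string is a hypothetical stand-in for what happens to a code
    string: it is compiled by eval inside a sub that was defined before the
    caller's 'my' variable, so that variable is not in scope there and the
    name quietly falls back to an (unset) package variable. The 'no strict'
    line is an assumption made so the string compiles at all. The closure
    sees the lexical as expected:

    ```perl
    use strict;
    use warnings;

    # Hypothetical stand-in for a module that evals a code string.
    # It is compiled before $lexical below is declared.
    sub run_string {
        my ($code) = @_;
        no strict 'vars';                 # let unknown names mean globals
        return eval "package main; $code";
    }

    our $global  = 'package variable';
    my  $lexical = 'lexical variable';

    my $from_string_global  = run_string('$global');
    my $from_string_lexical = run_string('$lexical');  # really $main::lexical
    my $from_closure        = sub { $lexical }->();

    for my $result ( [ 'string, global ', $from_string_global  ],
                     [ 'string, lexical', $from_string_lexical ],
                     [ 'closure        ', $from_closure        ] ) {
        my ($label, $value) = @$result;
        print "$label: ", ( defined $value ? $value : 'undef' ), "\n";
    }
    ```

    The middle line prints undef, which matches the "uninitialized value"
    warnings reported earlier in the thread.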

    Tassilo
    --
    use bigint;
    $n=71423350343770280161397026330337371139054411854220053437565440;
    $m=-8,;;$_=$n&(0xff)<<$m,,$_>>=$m,,print+chr,,while(($m+=8)<=200);
     
    Tassilo v. Parseval, Jun 12, 2005
    #13
  14. Mark Seger Guest

    Dave wrote:

    > Mark Seger wrote:
    >
    >
    >>The most enlightening experience I've had with perl was discovering that:
    >>
    >>statement if $a=~/aaa|bbb|ccc/
    >>
    >>is a LOT slower than
    >>
    >>statement if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/
    >>
    >>by a factor of about 7, at least when I checked it out on both a 3.5GHz
    >>Xeon and a 1.5GHz Itanium 2. The moral to the story is nothing beats a
    >> few simple tests.

    >
    >
    > Not by a long shot on my machine, running Perl 5.8.7.
    >
    > =code
    >
    > use Benchmark qw(:hireswallclock cmpthese);
    >
    > my $pat = 'A Perl Paaattern';
    > my $match = 0;
    >
    > cmpthese(0, {
    > 'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    > 'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    > });


    I've never used benchmark, but it sounds neat. But how could it get
    such a different number from my test? Here's what I did:

    test1.pl
    #!/usr/bin/perl -w

    $a="abcdefghijklkmbnopqrstuvwxyz";
    for ($i=0; $i<1000000; $i++)
    {
    my $z=1 if $a=~/aaa|bbb|ccc/;
    }

    test2.pl
    #!/usr/bin/perl -w

    $a="abcdefghijklkmbnopqrstuvwxyz";
    for ($i=0; $i<1000000; $i++)
    {
    my $z=1 if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/;
    }

    [root@cag-dl380-01 test]# time ./test1.pl
    real 0m7.492s
    user 0m7.420s
    sys 0m0.000s

    [root@cag-dl380-01 test]# time ./test2.pl
    real 0m0.956s
    user 0m0.950s
    sys 0m0.000s

    > Rate Code1 Code2
    > ___________________________
    > Code1 2775928/s -- -62%
    > Code2 7368521/s 165% --
    > ___________________________
    > Code1 2888905/s -- -63%
    > Code2 7777375/s 169% --
    > ___________________________
    > Code2 2647141/s -- -64%
    > Code1 7289831/s 175% --
    > ___________________________
    > Code1 2742926/s -- -62%
    > Code2 7313196/s 167% --
    > ___________________________
    > Code1 2803761/s -- -61%
    > Code2 7211258/s 157% --
    > ---------------------------


    I also tried converting things to use Benchmark and got numbers similar
    to the above, but I'm not sure how to interpret them, as there was
    nothing in the manpage I read; I'll continue to look. Is this saying
    that the first code fragment executed about 3M times/sec and the second
    7M? I'm not sure what the percentages represent, but if they're relative
    to something, it feels like 'code1' is almost 3 times faster than 'code2',
    which seems to correlate with the numbers/second. If so, then there IS a
    difference, though not as much as I'm seeing.

    I also tried running my code with Benchmark but am getting some warnings
    I'm not sure about. If I try "my $a" I get a zillion uninit values that
    make no sense, not to mention the subroutine redefinition message.
    Maybe I need a different version of Benchmark? Anyhows, here's my code:

    #!/usr/bin/perl -w

    use Benchmark qw(:hireswallclock cmpthese);

    $a="abcdefghijklkmbnopqrstuvwxyz";
    cmpthese(0,
    {
    'Code1' => '$z=1 if $a=~/aaa|bbb|ccc/',
    'Code2' => '$z=1 if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/'
    });

    and the results:

    [root@cag-dl380-01 test]# ./bench1.pl
    Subroutine Benchmark::mytime redefined at
    /usr/lib/perl5/5.8.0/Benchmark.pm line 450.
    Name "main::a" used only once: possible typo at ./bench1.pl line 5.
    Rate Code1 Code2
    Code1 144273/s -- -90%
    Code2 1494766/s 936% --

    Is the implication that I ran 10 times more iterations using the second
    form?
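
    The figures in that table are consistent with reading the Rate column
    as runs per second and each percentage as how much faster (or slower)
    the row's rate is than the column's. A quick check of the arithmetic,
    using the two rates printed above (the hash name is just for
    illustration):

    ```perl
    use strict;
    use warnings;

    # Runs per second, copied from the table above.
    my %rate = ( Code1 => 144273, Code2 => 1494766 );

    # Row Code2 against column Code1, and the other way round.
    my $code2_vs_code1 = sprintf '%.0f%%', 100 * ( $rate{Code2} / $rate{Code1} - 1 );
    my $code1_vs_code2 = sprintf '%.0f%%', 100 * ( $rate{Code1} / $rate{Code2} - 1 );

    print "Code2 vs Code1: $code2_vs_code1\n";   # 936%
    print "Code1 vs Code2: $code1_vs_code2\n";   # -90%
    ```

    So the second form managed roughly ten times as many runs per second.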

    I also tried using the 'timethese' function in Benchmark, simply
    replacing the references to 'cmpthese' so I won't bore you with the
    code. Here's what that did and as you can see I'm still getting
    warnings and errors so I'm not sure how trustworthy the results are

    [root@cag-dl380-01 test]# ./bench2.pl
    Subroutine Benchmark::mytime redefined at
    /usr/lib/perl5/5.8.0/Benchmark.pm line 450.
    Name "main::a" used only once: possible typo at ./bench2.pl line 5.
    Benchmark: running Code1, Code2 for at least 3 CPU seconds...
    Code1: 3.28681 wallclock secs ( 3.26 usr + 0.00 sys = 3.26 CPU)
    @ 146238.34/s (n=476737)
    Code2: 3.20847 wallclock secs ( 3.21 usr + 0.00 sys = 3.21 CPU)
    @ 1401934.58/s (n=4500210)

    -mark
     
    Mark Seger, Jun 12, 2005
    #14
  15. Mark Seger Guest


    > Not by a long shot on my machine, running Perl 5.8.7.


    snip

    > ___________________________
    > Rate Code1 Code2
    > ___________________________
    > Code1 2775928/s -- -62%
    > Code2 7368521/s 165% --


    OK, I finally read more on Benchmark, and I should probably have done so
    before my last posting. Anyhow, I was indeed correct in my last note,
    and that means your numbers differ by a factor of 3, which I wouldn't
    characterize as 'not by a long shot' when compared to 7, though I am
    curious why, as I'd think a faster or slower CPU would perform at
    roughly the same ratio.

    One other thing - what o/s are you running your tests on? All my above
    have been on a 3.5GHz Xeon. I just tried it on my small machine, a
    1.4GHz running XP and got very similar numbers.

    -mark
     
    Mark Seger, Jun 12, 2005
    #15
  16. Mark Seger Guest

    > Not by a long shot on my machine, running Perl 5.8.7.
    >
    > =code
    >
    > use Benchmark qw(:hireswallclock cmpthese);
    >
    > my $pat = 'A Perl Paaattern';
    > my $match = 0;
    >
    > cmpthese(0, {
    > 'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
    > 'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
    > });


    If it hasn't been obvious, this is really bothering me. On closer
    inspection of your code, I see that the string you're matching against
    not only contains one of the patterns being tested for, it contains the
    first pattern, so of course the difference will be less. In the first
    case the second and third comparisons will not even take place. I don't
    know if perl does anything fancy with | in a regex, but I'm guessing
    they don't take place there either.

    What I should have done was post my test string - I'm just using the
    alphabet, "abc..z", which is both longer and doesn't contain any of
    the test patterns.

    Finally, I tried it again with a string 208 chars long (the alphabet
    repeated 8 times) and got a difference of a factor of almost 25, and
    this time I'll post the code:

    #!/usr/bin/perl

    use Benchmark qw(:hireswallclock cmpthese);

    $a="abcdefghijklmnopqrstuvwxyz";
    for ($i=0; $i<3; $i++) { $a.=$a; }

    cmpthese(0,
    {
    'Code1' => '$z=1 if $a=~/aaa|bbb|ccc/',
    'Code2' => '$z=1 if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/'
    });

    [root@cag-dl380-01 test]# ./bench1.pl
    Rate Code1 Code2
    Code1 23519/s -- -96%
    Code2 532528/s 2164% --

    So back to my original assertion: one form of pattern matching can make
    a big difference over lots of comparisons when timing does matter.
    Curiously enough, if I extend the number of comparisons to 6 different
    tests, the difference in times drops! The separate compares run about
    twice as slow, as I would have expected, but the regex is only about
    25% slower.

    [root@cag-dl380-01 test]# ./bench1.pl
    Rate Code1 Code2
    Code1 17329/s -- -94%
    Code2 273835/s 1480% --


    -mark
     
    Mark Seger, Jun 12, 2005
    #16
  17. Guest

    Back to the original issue, I think that this little program shows that
    using "reverse" can substantially increase performance.

    --------------------

    my @x;
    $#x = 5000;
    my $string;
    my $start;
    my $field;

    $string= join ",", (map rand(10), 1..$#x);

    $start = times;

    foreach (1 .. 100)
    {
    @x=split /,/, $string;
    $field = $x[4000];
    $field = $x[4500];
    };

    print "regular split time = ", times() - $start , "\n";


    $start = times;

    foreach (1 .. 100)
    {
    @x=split /,/, reverse($string), 1002;
    $field = reverse $x[500];
    $field = reverse $x[1000];
    };

    print "reverse split time = ", times() - $start , "\n\n";

    ----------------------
    On my PC, I get something like:

    regular split time = 2.25
    reverse split time = 0.6
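
    For anyone adapting this, the index arithmetic is easy to get off by
    one: with n fields, the field at index i from the front sits at index
    n-1-i in the reversed string, and a LIMIT of that index plus 2 is
    enough to split it off. A small self-checking sketch, assuming the
    field count is known in advance (the data and names are made up for
    the example):

    ```perl
    use strict;
    use warnings;

    # Hypothetical data: 5000 fields "v0" .. "v4999".
    my @fields = map { "v$_" } 0 .. 4999;
    my $string = join ',', @fields;

    # Field 4500 from the front is field 499 from the back.
    my $want     = 4500;
    my $from_end = $#fields - $want;

    # The LIMIT leaves everything past the target unsplit in the last element.
    my @rev   = split /,/, reverse($string), $from_end + 2;
    my $field = reverse $rev[$from_end];    # scalar context: un-reverse the text

    print "$field\n";   # prints "v4500"
    ```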
     
    , Jun 12, 2005
    #17
  18. Dave Guest

    Sisyphus wrote:


    > Name "main::match" used only once: possible typo at try.pl line 5.
    > Name "main::pat" used only once: possible typo at try.pl line 4.
    > Benchmark: timing 200000 iterations of Code1, Code1A, Code2, Code2A...


    Oops! That's what I get for not turning on warnings.

    I should have considered that "$pat" (and "$match", for that matter)
    might not be visible inside the scope in which the code is executed,
    and indeed it's not.

    To lexically scope "$pat", that code snippet would have to be written as

    ....

    cmpthese($count, {
    Code1 => q[my $pat = 'A Perl Paaattern'; my $match = 1 if
    $pat =~ /aaa/ || $pat=~/bbb/ || $pat=~/ccc/],
    Code2 => q[my $pat = 'A Perl Paaattern'; my $match = 1 if
    $pat =~ /aaa|bbb|ccc/],
    });

    Of course, this was all done purposely by me to emphasize the value of
    enabling the 'warnings' pragma. ;-)

    Dave
     
    Dave, Jun 13, 2005
    #18
  19. Dave Guest

    Tassilo v. Parseval wrote:

    > Also sprach Brian Wakem:



    >>So is this a bug or a feature?

    >
    >
    > It's expected behaviour:


    <snip>

    > The code to be benchmarked is passed as string and run via STRING-eval. In
    > the scope of this eval, the lexical '$var' does not exist. Lexicals
    > cannot be accessed across different lexical scopes.


    Right. I posted a followup to the thread pointing out the flaw in my
    sample benchmark code. Quite an embarrassing oversight, as this is a
    relatively common "gotcha" in Perl. It's times like these that one
    wishes one had inserted 'X-No-Archive' into the original message
    header. :)

    Dave
     
    Dave, Jun 13, 2005
    #19
  20. Dave Guest

    Mark Seger wrote:

    > If it hasn't been obvious, this is really bothering me and on closer
    > inspection of your code I see the string you're doing the compare
    > against not only contains one of the patterns being tested for, it's the
    > first pattern, so of course the difference will be less. The second


    Actually, the problem was that my example was attempting to match
    against an uninitialized value.

    Please don't make the mistake of thinking that using globals has an
    inherent speed advantage in pattern matches. I really should've
    inspected my example more closely prior to posting it. See Tassilo's
    post for an explanation.

    ----- code snippet -----

    use warnings;
    use strict;

    use Benchmark qw(:hireswallclock cmpthese);

    my $pat = <<'END_OF_PATTERN';
    A lexical pattern that doesn't go out of scope.
    See *closure* in perlref.
    END_OF_PATTERN

    my $count = -5;

    cmpthese($count, {
    Code1 => \&Code1,
    Code2 => \&Code2,
    });

    sub Code1 {
    $pat =~ /aaa/ || $pat=~/bbb/ || $pat=~/ccc/;
    }
    sub Code2 {
    $pat =~ /aaa|bbb|ccc/;
    }

    ----- end of code -----

    Dave
     
    Dave, Jun 13, 2005
    #20
