There's no LIMIT to my SPLITting headache.

use63net

split /PATTERN/, EXPR, LIMIT

The use of a LIMIT is quite helpful if one is searching for fields
closer to the beginning of the EXPR string, but what if your target is
nearer the end of the string? In this case, for very long strings,
using split, even with a LIMIT, will cause noticeable delays in
time-intensive applications. Does anyone have a workaround or
alternative that will be faster?
 
Brian Wakem

If you *know* the target is near the end you could 'reverse' the string
first.
 
Brian McCauley

If you need only a single piece (say 42) and there are more than 42
pieces...

my ( $piece42 ) = /^(?:(.*?)PATTERN){42}/;
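
A minimal sketch of the same idea wrapped in a function, in case it helps to try it out. The name nth_piece and the comma-separated sample string are made up for illustration. As noted above, it only works when a separator follows the wanted piece, i.e. when there are more than N pieces.

```perl
use strict;
use warnings;

# Hypothetical helper: return piece number $n (1-based) of $str, where the
# pieces are separated by the regex $sep. A separator must follow the piece.
sub nth_piece {
    my ($str, $sep, $n) = @_;
    # The capture group is overwritten on every repetition, so after
    # {$n} repetitions it holds the $n-th piece.
    my ($piece) = $str =~ /^(?:(.*?)$sep){$n}/s;
    return $piece;    # undef when there are not enough separators
}

print nth_piece('a,b,c,d,e', qr/,/, 3), "\n";
```

The regex stops scanning as soon as the N-th separator has been found, so nothing after that point is examined.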
 
xhoster

Personally, my policy is not to precede the real data with very long
strings of unnecessary crap. At least not in highly time-sensitive
applications.

If you can't fix whatever it is that is generating this poorly thought out
data, then maybe you could reverse the string and split that. Or use
substring to just grab the end of the string. Or maybe use a regex to
capture just what you want. Or use a database.

Xho
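
For the "grab the end of the string" suggestion, here is a rough sketch, assuming the separator is a fixed literal string rather than a regex. The helper name last_field is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: last field of $str for a fixed, literal separator.
sub last_field {
    my ($str, $sep) = @_;
    my $pos = rindex($str, $sep);      # searches backwards; -1 if not found
    return $str if $pos < 0;           # no separator: the whole string
    return substr($str, $pos + length $sep);
}

print last_field('alpha,beta,gamma', ','), "\n";
```

rindex starts at the end of the string, so the cost depends on the length of the last field rather than on the length of the whole line.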
 
Mark Seger

This may be a little off target, but can you provide some additional
details? Just how long is the string you're trying to split? How many
pieces are you trying to split it into?

re: reversing the string - part of me says it may take just as long to
reverse it as the time you'd save, but the real answer is to do a few
timing tests if it's really all that important.

If you know it will always be at least n-chars from the start of the
string the suggestion about doing a substr() first also has merit, but
again you can't beat timing tests.

The most enlightening experience I've had with perl was discovering that:

statement if $a=~/aaa|bbb|ccc/

is a LOT slower than

statement if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/

by a factor of about 7, at least when I checked it out on both a 3.5GHz
Xeon and a 1.5GHz Itanium 2. The moral of the story is that nothing
beats a few simple tests.

Of course I still write most of my code using the first form for
readability, as long as performance isn't critical, which it rarely
is on most apps [my opinion]. When it DOES count and you use the 2nd
form, just be sure to comment your code so nobody decides to optimize it
for you. :cool:

-mark
 
xhoster

Mark Seger said:
This may be a little off target, but can you provide some additional
details? Just how long is the string you're trying to split? How many
pieces are you trying to split it into?

re: reversing the string - part of me says it may take just as long to
reverse it as the time you'd save, but the real answer is to do a few
timing tests if it's really all that important.

It is very important. On my machine, for getting the last two fields of a
string, reversing becomes faster at only 12 fields (187 bytes total line
length) for a very simple regex (/,/). For a more complex regex
(/\s*,\s*/), it was just 6 fields.

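use strict;
use Benchmark qw(:all);

foreach (0,1,2,3,4,5,10,20,40,100,1000,10000) {
    # $_ random fields followed by the two fields we actually want
    my $x = join ",", (map rand(), 1..$_), 'Foo', 'Bar';

    print "$_\t", length $x, "\t";

    cmpthese( -3, {
        # split the whole line, keep the last two fields
        full => sub { my @x = (split /\s*,\s*/, $x)[-2,-1]; assert(@x); },
        # split the reversed line with a LIMIT, then un-reverse the fields
        rev  => sub {
            my @x = (split /\s*,\s*/, reverse($x), 3)[0,1];
            @x = map scalar reverse($_), reverse @x;
            assert(@x);
        },
    });
}

# die unless both approaches return the expected fields
sub assert {
    die $_[0] unless $_[0] eq 'Foo';
    die $_[1] unless $_[1] eq 'Bar';
}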


Xho
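
Outside of a benchmark, the reverse-and-split trick could be packaged roughly like this. It is only a sketch: last_fields is a hypothetical name, and it assumes a plain comma separator (a separator regex that is not symmetric would have to be reversed as well).

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $n comma-separated fields of $str,
# in their original order, without splitting the whole string.
sub last_fields {
    my ($str, $n) = @_;
    # A LIMIT of $n + 1 leaves everything before the wanted fields unsplit.
    my @rev = split /,/, scalar reverse($str), $n + 1;
    $#rev = $n - 1 if @rev > $n;          # drop the unsplit remainder
    # Restore both the field order and the characters within each field.
    return map { scalar reverse($_) } reverse @rev;
}

my @tail = last_fields('1,2,3,Foo,Bar', 2);
print "@tail\n";
```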
 
Dave

Mark said:
The most enlightening experience I've had with perl was discovering that:

statement if $a=~/aaa|bbb|ccc/

is a LOT slower than

statement if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/

by a factor of about 7, at least when I checked it out on both a 3.5GHz
Xeon and a 1.5GHz Itanium 2. The moral to the story is nothing beats a
few simple tests.

Not by a long shot on my machine, running Perl 5.8.7.

=code

use Benchmark qw(:hireswallclock cmpthese);

my $pat = 'A Perl Paaattern';
my $match = 0;

cmpthese(0, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
});

=cut

Time::HiRes is compiled on my machine.

Granted, that's a *very* simple pattern, but the shorter code came out
ahead 5/5 times by a wide margin.

___________________________
          Rate Code1 Code2
___________________________
Code1 2775928/s    --  -62%
Code2 7368521/s  165%    --
___________________________
Code1 2888905/s    --  -63%
Code2 7777375/s  169%    --
___________________________
Code2 2647141/s    --  -64%
Code1 7289831/s  175%    --
___________________________
Code1 2742926/s    --  -62%
Code2 7313196/s  167%    --
___________________________
Code1 2803761/s    --  -61%
Code2 7211258/s  157%    --
---------------------------

Or if I set $pat to a string of the output from 'perldoc Time::HiRes'

Code1 2657898/s -- -70%
Code2 8771367/s 230% --


Dave
 
Sisyphus

use Benchmark qw(:hireswallclock cmpthese);

my $pat = 'A Perl Paaattern';
my $match = 0;

cmpthese(0, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
});

use Benchmark qw(:hireswallclock cmpthese);

$pat = 'A Perl Paaattern';
$match = 0;

cmpthese(0, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
});

__END__

All I've done is delete both instances of "my", and that produces something
completely different:

Rate Code2 Code1
Code2 193959/s -- -80%
Code1 980959/s 406% --

For your code, I get a result very similar to the one you got:

Rate Code1 Code2
Code1 1153893/s -- -63%
Code2 3114356/s 170% --

Seems that declaring with 'my' has little effect (slight slowing down) on
the result reported for Code1, but a significant effect (marked speeding up)
on the result reported for Code2:

use warnings;
use Benchmark;

$pat = 'A Perl Paaattern';
$match = 0;

my $p = 'A Perl Paaattern';
my $m = 0;

timethese(200000, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
'Code1A' => '$m = 1 if $p =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2A' => '$m = 1 if $p =~/aaa|bbb|ccc/',
});

__END__

Name "main::match" used only once: possible typo at try.pl line 5.
Name "main::pat" used only once: possible typo at try.pl line 4.
Benchmark: timing 200000 iterations of Code1, Code1A, Code2, Code2A...
Code1: 0 wallclock secs ( 0.20 usr + 0.00 sys = 0.20 CPU) @
1000000.00/s (n=200000)
(warning: too few iterations for a reliable count)
Code1A: 1 wallclock secs ( 0.29 usr + 0.00 sys = 0.29 CPU) @
687285.22/s (n=200000)
(warning: too few iterations for a reliable count)
Code2: 1 wallclock secs ( 1.03 usr + 0.00 sys = 1.03 CPU) @
193798.45/s (n=200000)
Code2A: 0 wallclock secs ( 0.07 usr + 0.00 sys = 0.07 CPU) @
2857142.86/s (n=200000)
(warning: too few iterations for a reliable count)

Something dodgy going on, methinks :)

Cheers,
Rob
 
use63net

use Benchmark qw(:hireswallclock cmpthese);

my $pat = 'A Perl Paaattern';
my $match = 0;

cmpthese(0, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
});

=cut


This is getting away from the original question, but this code causes
"Use of uninitialized value in pattern match ..." errors on my PC under
Perl 5.8.4.
 
Brian Wakem

Sisyphus said:
use Benchmark qw(:hireswallclock cmpthese);

$pat = 'A Perl Paaattern';
$match = 0;

cmpthese(0, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
});

__END__

All I've done is delete both instances of "my", and that produces
something completely different:

Rate Code2 Code1
Code2 193959/s -- -80%
Code1 980959/s 406% --

For your code, I get a result very similar to the one you got:

Rate Code1 Code2
Code1 1153893/s -- -63%
Code2 3114356/s 170% --

Seems that declaring with 'my' has little effect (slight slowing down) on
the result reported for Code1, but a significant effect (marked speeding
up) on the result reported for Code2:

use warnings;
use Benchmark;

$pat = 'A Perl Paaattern';
$match = 0;

my $p = 'A Perl Paaattern';
my $m = 0;

timethese(200000, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
'Code1A' => '$m = 1 if $p =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2A' => '$m = 1 if $p =~/aaa|bbb|ccc/',
});


I get negative time elapsed for Code2A.

Code2A: 0 wallclock secs (-0.01 usr + 0.00 sys = -0.01 CPU) @
-20000000.00/s (n=200000)


And it takes less time with more iterations.

Code2A: 0 wallclock secs (-0.24 usr + 0.00 sys = -0.24 CPU) @
-41666666.67/s (n=10000000)


What is going on here?
 
Sisyphus

Brian Wakem said:
I get negative time elapsed for Code2A.

Code2A: 0 wallclock secs (-0.01 usr + 0.00 sys = -0.01 CPU) @
-20000000.00/s (n=200000)


And it takes less time with more iterations.

Code2A: 0 wallclock secs (-0.24 usr + 0.00 sys = -0.24 CPU) @
-41666666.67/s (n=10000000)

Wow ... that's one helluva argument in support of always declaring your
variables with 'my' :)))

Seriously, I don't know what's happening there. I do know that I'm never
comfortable with 'use strict;' and/or 'my'/'our' when it comes to scripts
that also 'use Benchmark;'. Whenever I want to time things, I make sure that
the variables involved in the code being timed are global variables ....
dunno whether that's a justifiable *basis* for feeling comfortable .... but
it makes me *feel* more comfortable, nonetheless :)

Actually, I know for a fact that lexical scoping can make Benchmark results
useless and misleading - and one way to avoid that is, of course, to use
only global variables.

Cheers,
Rob
 
Brian Wakem

Sisyphus said:
Wow ... that's one helluva argument in support of always declaring your
variables with 'my' :)))

Seriously, I don't know what's happening there. I do know that I'm never
comfortable with 'use strict;' and/or 'my'/'our' when it comes to scripts
that also 'use Benchmark;'. Whenever I want to time things, I make sure
that the variables involved in the code being timed are global variables
.... dunno whether that's a justifiable *basis* for feeling comfortable
.... but it makes me *feel* more comfortable, nonetheless :)

Actually, I know for a fact that lexical scoping can make Benchmark
results useless and misleading - and one way to avoid that is, of course,
to use only global variables.

Cheers,
Rob


If I drop the my from my $m = 0 I get some proper results.

Code2A: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @
10000000.00/s (n=200000)
(warning: too few iterations for a reliable count)


So is this a bug or a feature?
 
Tassilo v. Parseval

Brian Wakem said:

If I drop the my from my $m = 0 I get some proper results.

Code2A: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU) @
10000000.00/s (n=200000)
(warning: too few iterations for a reliable count)


So is this a bug or a feature?

It's expected behaviour:

my $var = "value";

cmpthese(0, {
code1 => 'do_something_with($var)',
...
});

The code to be benchmarked is passed as string and run via STRING-eval. In
the scope of this eval, the lexical '$var' does not exist. Lexicals
cannot be accessed across different lexical scopes.

That's why Benchmark can also deal with code-references, more correctly
called a closure here. There you get the expected behaviour:

my $var = "value";

cmpthese(0, {
code1 => sub { do_something_with($var) },
...
});

I never use Benchmark with code strings to be evaled. The fact that they
don't adhere to lexical scoping is one of my reasons why.
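
The scoping effect can be reproduced without Benchmark at all. The following is only a sketch: the Runner package is a made-up stand-in for a module that evals code strings in its own scope, and strict 'vars' is switched off inside the eval so that the missing lexical silently becomes an undefined package variable instead of a compile error.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a benchmarking module: its subs are compiled
# in their own scope, so they cannot see lexicals declared by the caller.
{
    package Runner;

    # Run a string of code, roughly the way a string-eval based timer would.
    sub run_string {
        my ($code) = @_;
        return eval "package main; no strict 'vars'; no warnings; $code";
    }

    # Run a code reference; a closure carries its lexicals along with it.
    sub run_sub {
        my ($code) = @_;
        return $code->();
    }
}

my $pat = 'A Perl Paaattern';

# Inside the string eval, $pat resolves to the (undefined) global $main::pat.
my $from_string  = Runner::run_string('defined $pat ? 1 : 0');
# The closure refers to the lexical $pat declared just above.
my $from_closure = Runner::run_sub(sub { defined $pat ? 1 : 0 });

print "string eval sees \$pat: $from_string\n";
print "closure sees \$pat:     $from_closure\n";
```

With the string form, the pattern match runs against an undefined value, which would explain both the odd timings and the "uninitialized value" warnings reported earlier in the thread.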

Tassilo
 
Mark Seger

Dave said:
Not by a long shot on my machine, running Perl 5.8.7.

=code

use Benchmark qw(:hireswallclock cmpthese);

my $pat = 'A Perl Paaattern';
my $match = 0;

cmpthese(0, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
});

I've never used benchmark, but it sounds neat. But how could it get
such a different number from my test? Here's what I did:

test1.pl
#!/usr/bin/perl -w

$a="abcdefghijklkmbnopqrstuvwxyz";
for ($i=0; $i<1000000; $i++)
{
my $z=1 if $a=~/aaa|bbb|ccc/;
}

test2.pl
#!/usr/bin/perl -w

$a="abcdefghijklkmbnopqrstuvwxyz";
for ($i=0; $i<1000000; $i++)
{
my $z=1 if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/;
}

[root@cag-dl380-01 test]# time ./test1.pl
real 0m7.492s
user 0m7.420s
sys 0m0.000s

[root@cag-dl380-01 test]# time ./test2.pl
real 0m0.956s
user 0m0.950s
sys 0m0.000s

Dave said:
          Rate Code1 Code2
___________________________
Code1 2775928/s    --  -62%
Code2 7368521/s  165%    --
___________________________
Code1 2888905/s    --  -63%
Code2 7777375/s  169%    --
___________________________
Code2 2647141/s    --  -64%
Code1 7289831/s  175%    --
___________________________
Code1 2742926/s    --  -62%
Code2 7313196/s  167%    --
___________________________
Code1 2803761/s    --  -61%
Code2 7211258/s  157%    --
---------------------------

I also tried converting things to use Benchmark and got similar numbers
as above but am not sure how to interpret this as there was nothing in
the manpage I read, but I'll continue to look. Is this saying that the
first code fragment executed about 3M times/sec and the second 7M? I'm
not sure what the percentages represent, but if they're relative to
something, it feels like 'code1' is almost 3 times faster than 'code2'
which seems to correlate with the numbers/second. If so then there IS a
difference though not as much as I'm seeing.
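
On reading the table: as far as I can tell, the Rate column is iterations per second, and each percentage compares the row's rate with the column's rate. A small sketch of that arithmetic, using the rates from the first table quoted above (pct_vs is a made-up helper, not part of Benchmark):

```perl
use strict;
use warnings;

# Rates (iterations per second) copied from the first table in the thread.
my %rate = (Code1 => 2775928, Code2 => 7368521);

# Percentage by which the row entry is faster (+) or slower (-) than the
# column entry, rounded to a whole number.
sub pct_vs {
    my ($row, $col) = @_;
    return sprintf '%.0f', 100 * ($rate{$row} / $rate{$col} - 1);
}

printf "Code1 vs Code2: %s%%\n", pct_vs('Code1', 'Code2');
printf "Code2 vs Code1: %s%%\n", pct_vs('Code2', 'Code1');
```

So -62% in the Code1 row means Code1 ran 62% slower than Code2 in that run, and 165% in the Code2 row means Code2 ran about 2.65 times as fast as Code1.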

I also tried running my code with Benchmark but am getting some warnings
I'm not sure about. If I try "my $a" I get a zillion uninit values that
make no sense, not to mention the subroutine redefinition message.
Maybe I need a different version of Benchmark? Anyhow, here's my code:

#!/usr/bin/perl -w

use Benchmark qw(:hireswallclock cmpthese);

$a="abcdefghijklkmbnopqrstuvwxyz";
cmpthese(0,
{
'Code1' => '$z=1 if $a=~/aaa|bbb|ccc/',
'Code2' => '$z=1 if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/'
});

and the results:

[root@cag-dl380-01 test]# ./bench1.pl
Subroutine Benchmark::mytime redefined at
/usr/lib/perl5/5.8.0/Benchmark.pm line 450.
Name "main::a" used only once: possible typo at ./bench1.pl line 5.
Rate Code1 Code2
Code1 144273/s -- -90%
Code2 1494766/s 936% --

Is the implication that I ran 10 times more iterations using the second
form?

I also tried using the 'timethese' function in Benchmark, simply
replacing the references to 'cmpthese', so I won't bore you with the
code. Here's what that did; as you can see, I'm still getting
warnings and errors, so I'm not sure how trustworthy the results are:

[root@cag-dl380-01 test]# ./bench2.pl
Subroutine Benchmark::mytime redefined at
/usr/lib/perl5/5.8.0/Benchmark.pm line 450.
Name "main::a" used only once: possible typo at ./bench2.pl line 5.
Benchmark: running Code1, Code2 for at least 3 CPU seconds...
Code1: 3.28681 wallclock secs ( 3.26 usr + 0.00 sys = 3.26 CPU)
@ 146238.34/s (n=476737)
Code2: 3.20847 wallclock secs ( 3.21 usr + 0.00 sys = 3.21 CPU)
@ 1401934.58/s (n=4500210)

-mark
 
Mark Seger

Not by a long shot on my machine, running Perl 5.8.7.
snip

___________________________
Rate Code1 Code2
___________________________
Code1 2775928/s -- -62%
Code2 7368521/s 165% --

OK, I finally read more on Benchmark, and I should probably have done so
before my last posting. Anyhow, I was indeed correct in my last note,
and that means your numbers differ by a factor of 3, which I wouldn't
characterize as 'not by a long shot' when compared to 7, though I am
curious why, as I'd think a faster or slower CPU would perform at
relatively the same ratio.

One other thing - what o/s are you running your tests on? All my above
have been on a 3.5GHz Xeon. I just tried it on my small machine, a
1.4GHz running XP and got very similar numbers.

-mark


 
Mark Seger

Not by a long shot on my machine, running Perl 5.8.7.
=code

use Benchmark qw(:hireswallclock cmpthese);

my $pat = 'A Perl Paaattern';
my $match = 0;

cmpthese(0, {
'Code1' => '$match = 1 if $pat =~/aaa/ || $pat=~/bbb/ || $pat=~/ccc/',
'Code2' => '$match = 1 if $pat =~/aaa|bbb|ccc/',
});

If it hasn't been obvious, this is really bothering me, and on closer
inspection of your code I see that the string you're comparing against
not only contains one of the patterns being tested for, it's the first
pattern, so of course the difference will be less. The second
comparisons will not even take place in the first case. I don't know
whether perl does anything fancy with | in a regex, but I'm guessing
they don't take place there either.

What I should have done was post my test string - I'm just using the
alphabet "abc..z", which is longer and doesn't contain any of the test
patterns.

Finally, I tried it again with a string 208 chars long (the alphabet
repeated 8 times) and got a difference of a factor of almost 25, and
this time I'll post the code:

#!/usr/bin/perl

use Benchmark qw(:hireswallclock cmpthese);

$a="abcdefghijklmnopqrstuvwxyz";
for ($i=0; $i<3; $i++) { $a.=$a; }

cmpthese(0,
{
'Code1' => '$z=1 if $a=~/aaa|bbb|ccc/',
'Code2' => '$z=1 if $a=~/aaa/ || $a=~/bbb/ || $a=~/ccc/'
});

[root@cag-dl380-01 test]# ./bench1.pl
Rate Code1 Code2
Code1 23519/s -- -96%
Code2 532528/s 2164% --

So back to my original assertion: one form of pattern matching can make
a big difference for lots of comparisons when timing does matter.
Curiously enough, if I extend the number of comparisons to 6 different
tests, the difference in times drops! Clearly the separate compares run
about twice as slow, as I would have expected, but the regex is only
about 25% slower.

[root@cag-dl380-01 test]# ./bench1.pl
Rate Code1 Code2
Code1 17329/s -- -94%
Code2 273835/s 1480% --


-mark
 
use63net

Back to the original issue, I think that this little program shows that
using "reverse" can substantially increase performance.

--------------------

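my @x;
$#x = 5000;
my $string;
my $start;
my $field;

# 5000 random fields, indices 0 .. 4999
$string = join ",", (map rand(10), 1..$#x);

$start = times;    # user CPU time so far (times in scalar context)

foreach (1 .. 100)
{
    @x = split /,/, $string;
    $field = $x[4000];
    $field = $x[4500];
};

print "regular split time = ", times() - $start , "\n";


$start = times;

foreach (1 .. 100)
{
    # Field i of 5000 sits at index 4999 - i in the reversed string, so
    # fields 4500 and 4000 become 499 and 999; a LIMIT of 1001 is enough.
    @x = split /,/, reverse($string), 1001;
    $field = reverse $x[499];
    $field = reverse $x[999];
};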

print "reverse split time = ", times() - $start , "\n\n";
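A sanity check on the index arithmetic may be useful here, since it is easy to be off by one when counting from the other end. This is a sketch only: field_from_end is a hypothetical helper, and it assumes a plain comma separator so that tr can count the fields cheaply.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch field $i (0-based, counted from the start) by
# splitting the reversed string, which is cheap when $i is near the end.
sub field_from_end {
    my ($str, $i) = @_;
    my $n = ($str =~ tr/,//) + 1;          # total number of fields
    my $j = $n - 1 - $i;                   # same field, counted from the end
    # LIMIT of $j + 2: fields 0 .. $j are split off, the rest stays unsplit.
    my @rev = split /,/, scalar reverse($str), $j + 2;
    return scalar reverse $rev[$j];
}

my $string = join ',', 0 .. 4999;          # 5000 fields; field $i holds "$i"
print field_from_end($string, 4500), "\n";
```

Because each field holds its own index in this test string, a wrong mapping shows up immediately in the output.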
 
Dave

Sisyphus wrote:

Name "main::match" used only once: possible typo at try.pl line 5.
Name "main::pat" used only once: possible typo at try.pl line 4.
Benchmark: timing 200000 iterations of Code1, Code1A, Code2, Code2A...

Oops! That's what I get for not turning on warnings.

I should have considered that "$pat" (and "$match", for that matter)
might not be visible inside the scope in which the code is being
executed, which they're not.

To lexically scope "$pat", that code snippet would have to be written as

....

cmpthese($count, {
Code1 => q[my $pat = 'A Perl Paaattern'; my $match = 1 if
$pat =~ /aaa/ || $pat=~/bbb/ || $pat=~/ccc/],
Code2 => q[my $pat = 'A Perl Paaattern'; my $match = 1 if
$pat =~ /aaa|bbb|ccc/],
});

Of course, this was all done purposely by me to emphasize the value of
enabling the 'warnings' pragma. ;-)

Dave
 
Dave

Tassilo said:
It's expected behaviour:

The code to be benchmarked is passed as string and run via STRING-eval. In
the scope of this eval, the lexical '$var' does not exist. Lexicals
cannot be accessed across different lexical scopes.

Right. I posted a followup to the thread pointing out the flaw in my
sample benchmark code. Quite an embarrassing oversight as this is a
relatively common "gotcha" in Perl. It's times like these that one
wishes they had inserted 'X-No-Archive' into their original message
header. :)

Dave
 
Dave

Mark said:
If it hasn't been obvious, this is really bothering me and on closer
inspection of your code I see the string you're doing the compare
against not only contains one of the patterns being tested for, it's the
first pattern, so of course the difference will be less. The second

Actually, the problem was that my example was attempting to match
against an uninitialized value.

Please don't make the mistake of thinking that using globals has an
inherent speed advantage in pattern matches. I really should've
inspected my example more closely prior to posting it. See Tassilo's
post for an explanation.

----- code snippet -----

use warnings;
use strict;

use Benchmark qw(:hireswallclock cmpthese);

my $pat = <<'END_OF_PATTERN';
A lexical pattern that doesn't go out of scope.
See *closure* in perlref.
END_OF_PATTERN

my $count = -5;

cmpthese($count, {
    Code1 => \&Code1,
    Code2 => \&Code2,
});

sub Code1 {
    $pat =~ /aaa/ || $pat=~/bbb/ || $pat=~/ccc/;
}
sub Code2 {
    $pat =~ /aaa|bbb|ccc/;
}

----- end of code -----

Dave
 
