J. Romano
... try to actually isolate the issue in a proper
benchmark. spend some deep time in thought in how
to do it. post your benchmark for review. then talk
about $' with some confidence.
Okay, I'll take you up on that offer.
(But be warned! Expect a "tome".)
First, allow me to define some terminology:
* A "candidate line" is a pattern match that can be easily
written to use $`, $&, or $'. They don't have to use
them, as they can be written with explicit grabs. But
nevertheless, they are pattern matches that could benefit
from the use of the $MATCH variables.
* A "taint line" is a line of code that uses a $MATCH
variable (such as $`, $&, or $'). The reason that it is
called a "taint line" is because it "taints" all regular
expressions with a performance penalty, whether they use
the match variables or not. Note that in this post,
the usage of the word "taint" has nothing to do with
Perl's "taint mode", a mode that is turned on with the
"-T" switch.
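To make the terminology concrete, here is a tiny (hypothetical) script containing one candidate line written both ways; the second version is also a taint line, since it uses $':

```perl
use strict;
use warnings;

my $string = "name=Jean-Luc";

# The candidate line, written with an explicit grab:
my ($value1) = $string =~ m/=(.*)$/s;

# The same candidate line, written as a taint line (it uses $'):
my $value2 = $string =~ m/=/ ? $' : undef;

print "$value1\n";   # prints "Jean-Luc"
print "$value2\n";   # prints "Jean-Luc"
```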
Also, allow me to clear up a common mistake. The following
two lines are NOT functionally identical:
$a = $1 if $string =~ m/=(.*)$/;
$a = $' if $string =~ m/=/;
This is because the '.' matches all characters EXCEPT a
newline. To make the two lines identical, either put an "s"
modifier on the first line or add a chomp($a) call to the
second. In other words, these two lines are functionally
equivalent:
$a = $1 if $string =~ m/=(.*)$/s;
$a = $' if $string =~ m/=/;
as are these:
$a = $1 if $string =~ m/=(.*)$/;
$a = $' if $string =~ m/=/; chomp($a);
Now to the benchmarking.
The first benchmark program I came up with was this one:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e7;
my $prefix = q!$a = $' if "a" =~ m/a/;!;
my $line = '$a = 1 if "abc=xyz" =~ m/=/;';
my $badProgram = "$prefix $line";
timethese($count, {good => $line});
timethese($count, {bad => $badProgram, prefix => $prefix});
__END__
The main program being tested is stored in the $line
variable:
$a = 1 if "abc=xyz" =~ m/=/;
All it does is check to see if the string "abc=xyz" has an
equal sign (which of course it does). If it finds one, it
sets $a to 1.
The first call to timethese() runs this line ten million
times. Note that this line is not tainted, since the $'
variable is never used until it is eval'ed in the second
call to timethese().
The second call to timethese() runs that line, together with
the $prefix code, which is just a "taint line". I could
just run that "bad" program and compare its time to the "good"
program, but then the "bad" program would have twice as many
regular expressions to run, unfairly making it look like it
is much worse than it really is.
That is why I decided to run the $prefix line as well.
Therefore, its time can be subtracted from the "bad" code's
time to see how code identical to the "good" code (but
tainted) suffers from the performance penalty.
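Incidentally, Benchmark.pm can do this subtraction itself: timethese() returns a hash reference of Benchmark objects, and timediff() subtracts one from another. Here is a sketch of that variation (with a smaller count, so it finishes quickly):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timethese timediff timestr);

my $count  = 1e5;
my $prefix = q!$a = $' if "a" =~ m/a/;!;
my $line   = '$a = 1 if "abc=xyz" =~ m/=/;';

my $results = timethese($count, {
    bad    => "$prefix $line",
    prefix => $prefix,
});

# Let Benchmark subtract the prefix's time from the "bad" code's time:
my $diff = timediff($results->{bad}, $results->{prefix});
print "penalty portion: ", timestr($diff), "\n";
```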
By running the code, I get the following results (note
that I reformatted the lines, without changing the
numbers, so that they fit nicely in limited space):
good: ( 3.50 usr + 0.00 sys = 3.50 CPU)
bad: (31.89 usr + 0.00 sys = 31.89 CPU)
prefix: (18.25 usr + 0.00 sys = 18.25 CPU)
Subtracting the times, I get:
31.89 - 18.25 - 3.50 = 10.14 seconds
It appears that the performance penalty for pattern
matches is around 10.14 seconds per 10 million pattern
matches. That's about one microsecond for every pattern
match.
Looking back at my program, I realized that someone might
accuse the program of being invalid, claiming that whatever
overhead timethese() uses to run code is subtracted out of
the "bad" code (when we subtract out $prefix's time) but
never subtracted from the "good" code, making the
performance penalty look smaller than it really is.
I'm not sure if there is an overhead, but just in case, I
decided to make a new variation of the above code to account
for this:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e7;
my $goodPrefix = q!$a = "" if "a" =~ m/a/;!;
my $badPrefix = q!$a = $' if "a" =~ m/a/;!;
my $line = '$a = 1 if "abc=xyz" =~ m/=/;';
my $goodProgram = "$goodPrefix $line";
my $badProgram = "$badPrefix $line";
timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
__END__
This code now has a $goodPrefix (which is untainted) and a
$badPrefix (which is the taint line). The "good" and the
"bad" code both will have their prefixes' time subtracted
out, putting them on more equal footing. These were my
results:
good: ( 7.72 usr + 0.02 sys = 7.73 CPU)
goodPrefix: ( 3.83 usr + 0.00 sys = 3.83 CPU)
bad: (30.56 usr + 0.01 sys = 30.58 CPU)
badPrefix: (16.09 usr + 0.02 sys = 16.11 CPU)
The time difference between the tainted and untainted
pattern matches is:
(30.56 - 16.09) - (7.72 - 3.83) = 10.58
That's a penalty of 10.58 seconds for ten million pattern
matches (or about one microsecond for every pattern match).
I thought about it a bit more, and I realized that the
string "abc=xyz" is a rather small string, quite possibly
much smaller than the average string used in a pattern
match. It stands to reason that if the string-to-be-matched
is much larger than seven characters, the penalty incurred
from copying the string to $`, $&, and $' would be greater.
So I composed the following program:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e6;
my $goodPrefix = q!$a = "" if "a" =~ m/a/;!;
my $badPrefix = q!$a = $' if "a" =~ m/a/;!;
my $line = '$a = 1 if "abc=xyz" =~ m/=/;';
# Replace "abc=xyz" with longer string:
my $string = join('', 'A'..'Z','a'..'z','0'..'9');
$string = "$string=$string";
$string x= 1; # change multiplier to make longer string
$line =~ s/abc=xyz/$string/;
my $goodProgram = "$goodPrefix $line";
my $badProgram = "$badPrefix $line";
timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
__END__
The key line is this one:
$string x= 1; # change multiplier to make longer string
With the multiplier set to one, the main code line compares
the string:
ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789
which is 125 characters long. That's 56% larger than the
standard screen width of 80 characters.
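A quick sanity check of that length (the construction is the same as in the benchmark program):

```perl
use strict;
use warnings;

my $string = join('', 'A'..'Z', 'a'..'z', '0'..'9');  # 26 + 26 + 10 = 62 chars
$string = "$string=$string";                          # 62 + 1 + 62 = 125 chars
print length($string), "\n";                          # prints "125"
```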
I noticed something interesting: as I doubled the
multiplier from 1 to 2, then to 4, then 8, and so on,
the penalty did indeed increase, but not linearly. I
would have expected the penalty to at least double each
time, but instead it increased very slowly at first
before it started to double. I suspect this slowness in
doubling was due to some fixed per-match overhead.
Anyway, for each multiplier, here is the measured
penalty (in microseconds) for a pattern match with that
long string:
1: 1.08
2: 1.14
4: 1.19
8: 1.67
16: 1.84
32: 2.73
64: 3.78
128: 5.72
256: 9.89
512: 19.13
(Note that in the 512 case, 64 thousand characters were used
in every pattern match, with a penalty of about 20
microseconds (or 0.02 milliseconds) for every pattern
match.)
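Dividing each measured penalty by the string length supports the fixed-overhead theory: the per-character cost falls toward a constant (roughly 0.0003 microseconds per character on my machine) as the strings grow. A small script to tabulate this from the numbers above:

```perl
use strict;
use warnings;

# Measured penalties (microseconds per match) from the table above;
# each multiplier corresponds to a string of 125 * multiplier characters.
my @data = (
    [1, 1.08],  [2, 1.14],  [4, 1.19],   [8, 1.67],   [16, 1.84],
    [32, 2.73], [64, 3.78], [128, 5.72], [256, 9.89], [512, 19.13],
);

for my $row (@data) {
    my ($mult, $penalty) = @$row;
    my $chars = 125 * $mult;
    printf "%4d (%6d chars): %.5f us/char\n", $mult, $chars, $penalty / $chars;
}
```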
Although the doubling behavior of the penalty starts
late, it does appear that it eventually settles into
doubling. Of course, as the multiplier increases to
ever-greater numbers, the Perl interpreter may run out
of its allotted RAM and resort to other resources, such
as virtual memory and disk space, to finish its task.
Once that happens, the swapping will likely cause a
performance drop significantly greater than mere
doubling.
However, as we see from my results, a pattern match on a
string of 64,000 characters still only has a penalty of
about 20 microseconds.
Once I did this, I got to thinking that all of my
programs thus far use no candidate lines whatsoever.
Obviously, the $MATCH variables are used in programs
where some candidate lines are present (otherwise
they're never used, or at least shouldn't be). So
assuming that none of the regular expressions are in
candidate lines wouldn't necessarily be fair to the code
that uses the $MATCH variables.
So I decided to write this program that tests the
performance of a program where ALL its regular expressions
are in candidate lines:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e7;
my $bad = q!$string = $' if "abc=xyz" =~ m/=/; chomp($string)!;
my $good = q!$string = $1 if "abc=xyz" =~ m/=(.*)$/!;
timethese($count, {good => $good});
timethese($count, {bad => $bad});
__END__
The results:
good: (19.92 usr + 0.00 sys = 19.92 CPU)
bad: (17.38 usr + 0.00 sys = 17.38 CPU)
It looks like, for once, the "bad" code fares better
than the "good" code (by an average of about a quarter
of a microsecond per regular expression)!
I also tried replacing the "abc=xyz" string with the
much larger string of:
ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789
repeated 256 times (for a total of 32,000 characters).
My results were:
good: (84.55 usr + 0.00 sys = 84.55 CPU)
bad: (39.06 usr + 0.00 sys = 39.06 CPU)
Surprisingly, the "bad" code ran more than twice as fast as
the "good" code! Apparently, the larger the string that has
to get captured, the better the "bad" code performs (that
is, when compared to the non-tainted code).
Therefore, I have to make the case that, in the event that
someone creates a Perl program that has candidate lines
for most of its regular expressions, it may actually be
faster to use the $MATCH variables. (Of course, it may or
may not be faster to use the $MATCH variables, but if no
super-long strings are used, the difference will be tiny).
At any rate, unless super-long strings are used in
pattern-matches, the difference (for better or for worse) of
using the $MATCH variables on each regular expression is on
the order of microseconds per match.
Okay, so I went through all this trouble to measure the
speed of tainted lines vs. non-tainted lines. However, all
the programs I've tested so far have no input or output,
which is pretty rare for useful Perl scripts. Therefore,
unless you plan to write Perl programs that consist of only
pattern-matching and have absolutely no I/O involved, these
tests really aren't that useful in determining just how bad
a performance penalty from using the $MATCH variables can
be.
So I decided that my next logical step would be to benchmark
a program that did a task that Perl would conceivably be
used for: Searching every file in the current directory
and printing out the number of lines that contain a certain
string ("Perl", in this case):
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e2;
my $prefix = q!$a = $' if "a" =~ m/a/;!;
my $program =<<'END_OF_PROGRAM';
@ARGV = grep -f, <*>;
$b = 0;
m/\bPerl\b/ and $b++ while <>;
END_OF_PROGRAM
timethese($count, {good => $program});
print "Count: $b\n";
timethese($count, {bad => "$prefix $program", prefix => $prefix});
print "Count: $b\n";
__END__
The results:
good: (11.66 usr + 12.08 sys = 23.73 CPU)
Count: 342
bad: (11.88 usr + 11.72 sys = 23.59 CPU)
prefix: ( 0.00 usr + 0.00 sys = 0.00 CPU)
Count: 342
We see that the "good" code uses less usr time (at about 2.2
milliseconds per run), but for some reason it managed to use
more sys time (about 3.6 milliseconds per run). I'm not
sure why this is so, but I'm inclined to think that because
the $MATCH penalty is so small compared to disk access, the
penalty doesn't really show up when run with code that uses
disk access. In other words, the variation in time for
code that uses disk I/O is so great that it more than makes
up for just about any penalty introduced by the $MATCH
variables.
Okay, so at this point I realize that almost all performance
penalties incurred by the $MATCH variables are quite small,
and that when used in programs that use I/O processes, they
seem to become negligible. But I wanted to perform one more
benchmark test: How would the $' variable affect REAL Perl
programs, the heavy-duty scripts that we've written and are
proud of?
I have one such Perl program I wrote at work. We would get
a special type of binary data files and we would need to
peer into them to see what data they held. This is quite a
complicated process, as it would require a full-fledged
decoder to handle the special data types and
representations. In order to peek into a new file that was
new to us, we would have to take our normal decoder (written
in C) and use the debugger to change the flow of execution
so that the contents of the file would be written out to the
screen for human eyes to read.
This technique was cumbersome and slow, so I wrote a Perl
program to parse through a binary file of that format and
spit out its contents in human-readable form. It worked
great. (And to think that the very first time I decoded a
file in such a format I was using a paper, pencil, and hex
dump output.)
Because that Perl script I wrote belongs to the company I
work for, I won't post it here, but I will give some
statistics about it: In order to parse any binary file it
must first parse through three ASCII configuration files
(with 173, 558, and 2060 lines with each line going through
at least one regular expression). For about every 1K of
input, the program spits out about 4K of output. There are
quite a few regular expressions sprinkled throughout the
program; several of which are operating on the binary data
(one of which operates on an entire binary file). For the
record, this program does NOT use the $`, $&, or $'
variables, nor the English module (or any other module, for
that matter).
I made two copies of this program: one named "good.pl" and
one named "bad.pl". I inserted the following line at the
top of both files:
@ARGV = <*.bin>;
and inserted the following taint line at the top of
"bad.pl":
$a = $' if "a" =~ m/a/;
Then I ran the following script:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
my $count = 1e3;
timethese($count, {good => "do 'good.pl'"});
timethese($count, {good => "do 'good.pl'"});
timethese($count, {bad => "do 'bad.pl'"});
__END__
Let me explain why I time "good.pl" twice:
Because I have several functions defined in the script,
timethese() complains that those functions keep being
redefined over and over. Obviously, it won't complain
the very first time the program is run, which gives the
first timethese() call an advantage over any subsequent
calls that test the same code. Therefore, by calling
timethese() three times, I ensure that the second and
third calls to timethese() are on equal footing (as far
as redefining functions goes).
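An alternative (which I did not use above, so take it as a hedged suggestion) is to lexically disable just the "redefine" warning class. A self-contained sketch of the idea, using a string eval in place of the script being re-run; note that the pragma is lexical, so for do 'good.pl' it would have to appear inside good.pl itself:

```perl
use strict;
use warnings;

sub load_script {
    # "no warnings 'redefine'" is lexical, and it also applies to the
    # string eval below (which stands in for the script's sub definitions).
    no warnings 'redefine';
    eval 'sub helper { return 42 }';
}

load_script() for 1 .. 2;   # the second call would normally warn
print helper(), "\n";       # prints "42"
```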
I ran this program so that the code it called would have to
decode a somewhat large load of 55 files (around 5K each).
Here are the results:
Benchmark: timing 1000 iterations of good...
good: (45.09 usr + 1.78 sys = 46.88 CPU)
Benchmark: timing 1000 iterations of good...
good: (45.61 usr + 1.47 sys = 47.08 CPU)
Benchmark: timing 1000 iterations of bad...
bad: (45.64 usr + 1.55 sys = 47.19 CPU)
It appears that, when comparing the CPU time, 1000 runs of
the program takes 0.11 seconds longer when I use the $MATCH
variables. That means that I'd save only around 0.11
milliseconds (per single run) by avoiding them.
Just for fun, I decided to concatenate all those files (and a few
more) into one huge binary file (my program was made to handle
this), and then to "cat" that file several times into an even
bigger binary file. Then the "good.pl" and "bad.pl" files were
changed to read that file as input. For the record, that file
was 5,938,445 bytes. I wrote my program so that it would do a
(successful) pattern match over the entire binary file contents
one time for each sub-file inside it. That means that for every
time this program is run with this >5MB file as input (which is,
for this particular data format, unrealistically large), almost
six megabytes of data are copied into $`, $&, and $' three
hundred times.
And here are the results of running this program with an
abnormally large data set:
Benchmark: timing 1000 iterations of good...
good: (129.78 usr + 2.56 sys = 132.34 CPU)
Benchmark: timing 1000 iterations of good...
good: (129.39 usr + 2.39 sys = 131.78 CPU)
Benchmark: timing 1000 iterations of bad...
bad: (130.36 usr + 2.94 sys = 133.30 CPU)
It looks like the performance penalty of using the $MATCH
variables is 133.30 - 131.78 = 1.52 seconds (for 1000
iterations). For each run, less than 2 milliseconds is the
penalty for needlessly using the $MATCH variables. Keep in
mind, for each run almost 6 megabytes of data were being
copied into $`, $&, and $' 300 times. And the penalty was
not even two milliseconds.
And here is where others can help me out. I'm curious to
see how your own favorite Perl programs fare. If you'd like
to post your own results, take a good, robust Perl script
(that doesn't use the $MATCH variables), copy it to
"good.pl" and "bad.pl", add an @ARGV line to both files
(if necessary), and then add a "taint" line to the "bad.pl"
script. Then run the above Perl code and compare the last
two benchmark readings (you may want to pipe the output to
"grep wallclock" to find the benchmark output if you normally
have a lot of output).
At this point, I have to say that these Perl programs I
wrote seem to show that the performance penalty of using $`,
$&, and $' are usually small and often negligible.
CONCLUSION
I happen to subscribe to Paul Graham's ideas on
optimization. On page 213 of his book "ANSI Common Lisp"
(yes, Lisp), he writes:
Three points can be made about optimization,
regardless of the implementation: it should be
focused on bottlenecks, it should not begin too
early, and it should begin with algorithms.
Probably the most important thing to understand
about optimization is that programs tend to have a
few bottlenecks that account for a great part of
the execution time. ... Optimizing these parts of
the program will make it run noticeably faster;
optimizing the rest of the program will be a
waste of time in comparison.
...
A corollary of the bottleneck rule is that one
should not put too much effort into optimizing
early in a program's life. Knuth puts the point
even more strongly: "Premature optimization is the
root of all evil (or at least most of it) in
programming."
In other words, optimizing code that is not a bottleneck may save
some time in the end, but the time it saves becomes negligible
when run together with a bottleneck, or at least with code that
has worse run-time behavior (measured in Big-O notation). The
time spent in optimizing non-bottleneck code is usually much
greater than the few milliseconds it saves.
And since the penalty to using $`, $&, and $' seems to behave
at O(N) or possibly O(N*log(N)), the penalty will become
insignificant when used in programs that employ algorithms of
worse Big-O behavior (or that even use disk access).
So why does the penalty disclaimer seem to follow $'
wherever it is taught? Personally, I think it's because the
penalty (however small) does exist, and programmers think
they have a responsibility to inform other programmers
(since one piece of code can affect another piece of
seemingly unrelated code).
But I still wondered about that, so I looked it up in the book
where I first learned about it: "Learning Perl" by Randal L.
Schwartz & Tom Phoenix. On page 122, it says,
Now, we said earlier that these three are "free."
Well, freedom has its price. In this case, the
price is that once you use any one of these
automatic match variables anywhere in your entire
program, other regular expressions will run a
little more slowly. Now this isn't a giant
slowdown, but it's enough of a worry that many
Perl programmers will simply never use these
automatic match variables.[*] Instead, they'll
use a workaround.
And there's more: A footnote is included:
[*] Most of these folks haven't actually
benchmarked their programs to see whether their
workarounds actually save time, though; it's as
though these variables were poisonous or
something. But we can't blame them for not
benchmarking -- many programs that could benefit
from these three variables take up only a few
minutes of CPU time in a week, so benchmarking and
optimizing would be a waste of time. But in that
case, why fear a possible extra millisecond? By
the way, Perl developers are working on this
problem, but there will probably be no solution
before Perl 6.
Apparently, Randal L. Schwartz and Tom Phoenix don't think it
would be such a big deal if some programs used $'. I agree
with them, of course, though I acknowledge that some situations
are better suited to $' than others.
So I made a list of times it is acceptable to use $`, $&, and $':
1. when most of the pattern matches in your program are
candidate lines
2. when none of the text strings being pattern-matched is longer
than a thousand characters
3. when there are few pattern matches when compared with the
rest of the program
4. when used alongside algorithms that have worse run-time
behaviors
But it's also important to point out some times when it is not
acceptable to use those variables:
1. when a pattern-matched string is millions of characters
long (or longer)
2. when you are writing a Perl module for others to use
Point 1 is important for programmers who have to handle large
amounts of binary data, or even long text strings, like DNA
sequences. A search engine couldn't tell me just how much
memory you would need to hold a complete human DNA sequence,
but I seem to remember reading that it was something on the
order of fifteen gigabytes. Reading something of that size
takes up enough time as it is; any needless copying will be
definitely noticeable and unwelcome.
Point 2 is important because your module might be used by people
who use super-long strings (such as DNA sequences). So if you
feel the need to violate this point, at least have the decency to
document your use of these variables, so that others are aware of
it (and be prepared to watch the usage of your module plummet).
And here is another list, enumerating reasons why it's okay to
use those variables in a script for your own use:
1. If most of your pattern matches are candidate lines, there's
a good chance that using $' (instead of m/=(.*)$/) will
actually speed up your program.
2. Even if it doesn't speed up your program, the time wasted
will probably be negligible and won't ever be noticed.
3. Even in the rare case where the time wasted isn't
negligible, it should be a trivial matter to eliminate
the penalty by editing the code and removing the match
variables.
(I am reminded of the story I have heard about an enthusiastic
programmer who was determined to make his program run faster by
making it more "streamlined." He replaced various parts of code
with presumably faster code, only to find out that the resulting
speed-up was negligible (or worse -- his new program ran slower).
The moral of this story is that it's not always obvious in what
parts of code most of the processor time is being used. If a
section of code that uses very little processor time is
optimized, it might run faster, but the time it saves is
practically non-existent compared to the rest of the code.)
Bear in mind that using a tainted pattern match of:
if (m/match/)
{
...
}
is somewhat equivalent to the untainted:
if (m/^(.*?)(match)(.*)$/s)
{
my $prematch = $1;
my $match = $2;
my $postmatch = $3;
...
}
in that the copying into $prematch, $match, and $postmatch is
done whether you want it or not. That amounts to the entire
string being copied over (possibly needlessly) for every
successful pattern match. If the string is small, there's a good
chance that this penalty won't even be measurable. However, in
the cases where the programmer has to make a conscious effort not
to needlessly copy a large string (possibly because he/she is
handling a large data stream), then the $MATCH variable should
probably best be avoided -- but ONLY if these large strings are
being matched with regular expressions (as opposed to, for
instance, the index() function).
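That last parenthetical deserves an example: for a fixed substring, index() and substr() can recover the same three pieces as $`, $&, and $' without ever running a regular expression, and therefore without ever triggering the penalty. A sketch:

```perl
use strict;
use warnings;

my $string = "abc=xyz";

my $pos = index($string, '=');   # -1 if the substring is not found
if ($pos >= 0) {
    my $pre   = substr($string, 0, $pos);   # what $` would hold
    my $match = '=';                        # what $& would hold
    my $post  = substr($string, $pos + 1);  # what $' would hold
    print "$pre|$match|$post\n";            # prints "abc|=|xyz"
}
```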
Of course, if all your regular expressions with large strings
require you to capture the pre-match and post-match with
parentheses, it would probably be faster to go ahead and use the
$MATCH variables.
So it's your call as the programmer to make an educated decision
whether or not to use the $MATCH variables. If regular
expressions are used only on small strings, the penalty most
likely will make no real difference. But if large strings are
used, give it some thought.
Ironically, I never personally used $`, $&, and $' in my scripts
before I researched this (exactly because of the always-mentioned
penalty). But now that I have made all these benchmarked tests
and examined the results, I just might start using them in my own
small programs.
(Constructive criticism is welcome. Since usage of these match
variables is a touchy topic for some Perl programmers, please
don't flame me if I made a statement that doesn't agree with your
beliefs. After all, I am only human, so if I said something that
you feel is incorrect or invalid, feel free to point it out and
comment on it, without resorting to needlessly harsh comments
about my programming skills. Thank you.)
Happy Perling!
-- Jean-Luc Romano