How bad is $'? (Was: "Get substring of line")

J. Romano

> ... try to actually isolate the issue in a proper
> benchmark. spend some deep time in thought in how
> to do it. post your benchmark for review. then talk
> about $' with some confidence.


Okay, I'll take you up on that offer.

(But be warned! Expect a "tome". :)


First, allow me to define some terminology:

* A "candidate line" is a pattern match that can easily be
written to use $`, $&, or $'. It doesn't have to use them,
as it can be written with explicit captures; but it is,
nevertheless, a pattern match that could benefit from the
use of the $MATCH variables.

* A "taint line" is a line of code that uses a $MATCH
variable (such as $`, $&, or $'). It is called a "taint
line" because it "taints" all regular expressions in the
program with a performance penalty, whether they use the
match variables or not. Note that in this post, the word
"taint" has nothing to do with Perl's "taint mode," the
mode that is turned on with the "-T" switch.
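To make these definitions concrete, here is a minimal illustration
of what the three $MATCH variables hold after a successful match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# After a successful match, Perl fills in the three match variables:
#   $`  the text before the match (prematch)
#   $&  the matched text itself
#   $'  the text after the match (postmatch)
"hello=world" =~ m/=/;

print "prematch:  $`\n";   # prints "hello"
print "match:     $&\n";   # prints "="
print "postmatch: $'\n";   # prints "world"
```

Note that merely using $' in this script makes its last three lines
"taint lines": every other pattern match in the same program would
now pay the copying penalty.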


Also, allow me to clear up a common mistake. The following
two lines are NOT functionally identical:

$a = $1 if $string =~ m/=(.*)$/;
$a = $' if $string =~ m/=/;

This is because the '.' matches all characters EXCEPT a
newline. To make the two lines identical, either put an "s"
modifier on the first line or add a chomp($a) call to the
second. In other words, these two lines are functionally
equivalent:

$a = $1 if $string =~ m/=(.*)$/s;
$a = $' if $string =~ m/=/;

as are these:

$a = $1 if $string =~ m/=(.*)$/;
$a = $' if $string =~ m/=/; chomp($a);
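A quick sanity check (a small sketch of mine, not part of the
benchmarks below) confirms that the /s pair really is equivalent,
even when the string contains a trailing newline:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "key=value\n";   # note the trailing newline

# Capture with /s, so that '.' also matches the newline:
my $capture;
$capture = $1 if $string =~ m/=(.*)$/s;

# Postmatch variable:
my $postmatch;
$postmatch = $' if $string =~ m/=/;

print $capture eq $postmatch ? "identical\n" : "different\n";
# prints "identical"
```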


Now to the benchmarking.

The first benchmark program I came up with was this one:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e7;
my $prefix = q!$a = $' if "a" =~ m/a/;!;
my $line = '$a = 1 if "abc=xyz" =~ m/=/;';
my $badProgram = "$prefix $line";

timethese($count, {good => $line});
timethese($count, {bad => $badProgram, prefix => $prefix});
__END__


The main program being tested is stored in the $line
variable:

$a = 1 if "abc=xyz" =~ m/=/;

All it does is check to see if the string "abc=xyz" has an
equal sign (which of course it does). If it finds one, it
sets $a to 1.

The first call to timethese() runs this line ten million
times. Note that this line is not tainted, since the $'
variable is never used until it is eval'ed in the second
call to timethese().

The second call to timethese() runs that line, together with
the $prefix code, which is just a "taint line". I could
just run that "bad" program and compare its time to the "good"
program, but then the "bad" program would have twice as many
regular expressions to run, unfairly making it look like it
is much worse than it really is.

That is why I decided to run the $prefix line as well.
Therefore, its time can be subtracted from the "bad" code's
time to see how code identical to the "good" code (but
tainted) suffers from the performance penalty.

By running the code, I get the following results (note that
I reformatted the lines, without changing the numbers, to
make them fit nicely in limited space):

good: ( 3.50 usr + 0.00 sys = 3.50 CPU)
bad: (31.89 usr + 0.00 sys = 31.89 CPU)
prefix: (18.25 usr + 0.00 sys = 18.25 CPU)


Subtracting the times, I get:

31.89 - 18.25 - 3.50 = 10.14 seconds

It appears that the performance penalty for pattern
matches is around 10.14 seconds per 10 million pattern
matches. That's about one microsecond for every pattern
match.


Looking back at my program, I realized that someone might
accuse the program of being invalid, claiming that whatever
overhead timethese() uses to run code is subtracted out of
the "bad" code (when we subtract out $prefix's time) but
never subtracted from the "good" code, making the
performance penalty look smaller than it really is.

I'm not sure if there is an overhead, but just in case, I
decided to make a new variation of the above code to account
for this:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e7;
my $goodPrefix = q!$a = "" if "a" =~ m/a/;!;
my $badPrefix = q!$a = $' if "a" =~ m/a/;!;
my $line = '$a = 1 if "abc=xyz" =~ m/=/;';
my $goodProgram = "$goodPrefix $line";
my $badProgram = "$badPrefix $line";

timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
__END__


This code now has a $goodPrefix (which is untainted) and a
$badPrefix (which is the taint line). The "good" and the
"bad" code both will have their prefixes' time subtracted
out, putting them on more equal footing. These were my
results:

good: ( 7.72 usr + 0.02 sys = 7.73 CPU)
goodPrefix: ( 3.83 usr + 0.00 sys = 3.83 CPU)
bad: (30.56 usr + 0.01 sys = 30.58 CPU)
badPrefix: (16.09 usr + 0.02 sys = 16.11 CPU)


The time difference between the tainted and untainted
pattern matches is:

(30.56 - 16.09) - (7.72 - 3.83) = 10.58

That's a penalty of 10.58 seconds for ten million pattern
matches (or about one microsecond for every pattern match).
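That subtraction can be wrapped up in a small helper so the same
computation can be reused for later runs (the numbers below are the
measurements just shown):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-match penalty (in microseconds) from the four CPU times:
# subtract each prefix's time from its full program, then compare
# the two differences and divide by the iteration count.
sub penalty_us {
    my ($bad, $bad_prefix, $good, $good_prefix, $count) = @_;
    return (($bad - $bad_prefix) - ($good - $good_prefix)) / $count * 1e6;
}

# The measurements above, with 10 million iterations:
printf "%.2f microseconds per match\n",
    penalty_us(30.56, 16.09, 7.72, 3.83, 1e7);   # prints "1.06 ..."
```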

I thought about it a bit more, and I realized that the
string "abc=xyz" is a rather small string, quite possibly
much smaller than the average string used in a pattern
match. It stands to reason that if the string-to-be-matched
is much larger than seven characters, the penalty incurred
from copying the string to $`, $&, and $' would be greater.

So I composed the following program:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e6;
my $goodPrefix = q!$a = "" if "a" =~ m/a/;!;
my $badPrefix = q!$a = $' if "a" =~ m/a/;!;
my $line = '$a = 1 if "abc=xyz" =~ m/=/;';

# Replace "abc=xyz" with longer string:
my $string = join('', 'A'..'Z','a'..'z','0'..'9');
$string = "$string=$string";
$string x= 1; # change multiplier to make longer string
$line =~ s/abc=xyz/$string/;

my $goodProgram = "$goodPrefix $line";
my $badProgram = "$badPrefix $line";

timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
__END__


The key line is this one:

$string x= 1; # change multiplier to make longer string

With the multiplier set to one, the main code line compares
the string:

ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789

which is 125 characters long. That's 56% larger than the
standard screen width of 80 characters.

I discovered something interesting: as I doubled the
multiplier from 1 to 2, then to 4, then 8, and so on, the
penalty did indeed increase, but not linearly. I would have
expected the penalty to at least double each time, but
instead it grew very slowly at first before it started to
double. I suspect its tardiness in doubling was due to some
fixed per-match overhead.

Anyway, for each multiplier, here is the measured
penalty (in microseconds) for a pattern match with that
long string:

1: 1.08
2: 1.14
4: 1.19
8: 1.67
16: 1.84
32: 2.73
64: 3.78
128: 5.72
256: 9.89
512: 19.13

(Note that in the 512 case, 64 thousand characters were used
in every pattern match, with a penalty of about 20
microseconds (or 0.02 milliseconds) for every pattern
match.)

Although the doubling behavior of the penalty sets in late, it
does appear to settle into doubling eventually. Of course, as
the multiplier increased to ever-greater numbers, the Perl
interpreter may run out of its allotted RAM, resorting to
other resources such as virtual memory & disk space to
finish its task. Once this happens, the processes used to
supply "substitute RAM" will likely create a large
performance drop (significantly greater than just doubling).
However, as we see from my results, a pattern match on a
string of 64,000 characters still only has a penalty of
about 20 microseconds.


Once I did this I got to thinking that all of my programs
thus far use no candidate lines whatsoever. Obviously, the
$MATCH variables are used in programs where some candidate
lines are present (otherwise, they're never used (or shouldn't
be, at any rate)). So assuming that none of the regular
expressions are in candidate lines wouldn't necessarily be
fair to the code that uses the $MATCH variables.

So I decided to write this program that tests the
performance of a program where ALL its regular expressions
are in candidate lines:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e7;
my $bad = q!$string = $' if "abc=xyz" =~ m/=/; chomp($string)!;
my $good = q!$string = $1 if "abc=xyz" =~ m/=(.*)$/!;

timethese($count, {good => $good});
timethese($count, {bad => $bad});
__END__


The results:

good: (19.92 usr + 0.00 sys = 19.92 CPU)
bad: (17.38 usr + 0.00 sys = 17.38 CPU)

It looks like, for once, the "bad" code fares better than
the "good" code (by an average of about a quarter of a
microsecond per regular expression)!

I also tried replacing the "abc=xyz" string with the much
larger string of:

ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789

repeated 256 times (for a total of 32,000 characters).
My results were:

good: (84.55 usr + 0.00 sys = 84.55 CPU)
bad: (39.06 usr + 0.00 sys = 39.06 CPU)

Surprisingly, the "bad" code ran more than twice as fast as
the "good" code! Apparently, the larger the string that has
to get captured, the better the "bad" code performs (that
is, when compared to the non-tainted code).

Therefore, I have to make the case that if someone creates a
Perl program where most of its regular expressions are in
candidate lines, it may actually be faster to use the $MATCH
variables. (It may or may not be faster, but if no
super-long strings are used, the difference will be tiny.)

At any rate, unless super-long strings are used in
pattern-matches, the difference (for better or for worse) of
using the $MATCH variables on each regular expression is on
the order of microseconds per match.


Okay, so I went through all this trouble to measure the
speed of tainted lines vs. non-tainted lines. However, all
the programs I've tested so far have no input or output,
which is pretty rare for useful Perl scripts. Therefore,
unless you plan to write Perl programs that consist of only
pattern-matching and have absolutely no I/O involved, these
tests really aren't that useful in determining just how bad
a performance penalty from using the $MATCH variables can
be.

So I decided that my next logical step would be to benchmark
a program that did a task that Perl would conceivably be
used for: Searching every file in the current directory
and printing out the number of lines that contain a certain
string ("Perl", in this case):


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e2;
my $prefix = q!$a = $' if "a" =~ m/a/;!;
my $program = <<'END_OF_PROGRAM';
@ARGV = grep -f, <*>;
$b = 0;
m/\bPerl\b/ and $b++ while <>;
END_OF_PROGRAM

timethese($count, {good => $program});
print "Count: $b\n";
timethese($count, {bad => "$prefix $program", prefix => $prefix});
print "Count: $b\n";
__END__


The results:

good: (11.66 usr + 12.08 sys = 23.73 CPU)
Count: 342
bad: (11.88 usr + 11.72 sys = 23.59 CPU)
prefix: ( 0.00 usr + 0.00 sys = 0.00 CPU)
Count: 342

We see that the "good" code uses less usr time (by about 2.2
milliseconds per run), but for some reason it managed to use
more sys time (by about 3.6 milliseconds per run). I'm not
sure why this is so, but I'm inclined to think that because
the $MATCH penalty is so small compared to disk access, the
penalty doesn't really show up when run with code that uses
disk access. In other words, the variation in time for
code that uses disk I/O is so great that it more than makes
up for just about any penalty introduced by the $MATCH
variables.


Okay, so at this point I realize that almost all performance
penalties incurred by the $MATCH variables are quite small,
and that when used in programs that use I/O processes, they
seem to become negligible. But I wanted to perform one more
benchmark test: How would the $' variable affect REAL Perl
programs, the heavy-duty scripts that we've written and are
proud of?

I have one such Perl program I wrote at work. We would get a
special type of binary data file, and we would need to peer
into it to see what data it held. This is quite a
complicated process, as it requires a full-fledged decoder
to handle the special data types and representations. To
peek into a file whose contents were new to us, we would
have to take our normal decoder (written in C) and use the
debugger to change the flow of execution so that the
contents of the file would be written out to the screen for
human eyes to read.

This technique was cumbersome and slow, so I wrote a Perl
program to parse through a binary file of that format and
spit out its contents in human-readable form. It worked
great. (And to think that the very first time I decoded a
file in such a format, I used paper, a pencil, and a hex
dump printout.)

Because that Perl script I wrote belongs to the company I
work for, I won't post it here, but I will give some
statistics about it: In order to parse any binary file, it
must first parse three ASCII configuration files (of 173,
558, and 2060 lines, with each line going through at least
one regular expression). For about every 1K of input, the
program spits out about 4K of output. There are quite a few
regular expressions sprinkled throughout the program,
several of which operate on the binary data (one of which
operates on an entire binary file). For the
record, this program does NOT use the $`, $&, or $'
variables, nor the English module (or any other module, for
that matter).

I made two copies of this program: one named "good.pl" and
one named "bad.pl". I inserted the following line at the
top of both files:

@ARGV = <*.bin>;

and inserted the following taint line at the top of
"bad.pl":

$a = $' if "a" =~ m/a/;

Then I ran the following script:


#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;

my $count = 1e3;
timethese($count, {good => "do 'good.pl'"});
timethese($count, {good => "do 'good.pl'"});
timethese($count, {bad => "do 'bad.pl'"});
__END__


Let me explain why I time "good.pl" twice: Because I have
several functions defined in the script, timethese() keeps
complaining that the functions are being redefined over and
over. Obviously, it won't complain about this the very first
time the program is run, giving the first timethese() call
an advantage over any subsequent calls that test the same
code. Therefore, by calling timethese() three times, I
ensure that the second and third calls to timethese() are on
equal footing (as far as redefining functions goes).

I ran this program so that the code it called would have to
decode a somewhat large load of 55 files (around 5K each).

Here are the results:

Benchmark: timing 1000 iterations of good...
good: (45.09 usr + 1.78 sys = 46.88 CPU)
Benchmark: timing 1000 iterations of good...
good: (45.61 usr + 1.47 sys = 47.08 CPU)
Benchmark: timing 1000 iterations of bad...
bad: (45.64 usr + 1.55 sys = 47.19 CPU)


It appears that, when comparing the CPU time, 1000 runs of
the program takes 0.11 seconds longer when I use the $MATCH
variables. That means that I'd save only around 0.11
milliseconds (per single run) by avoiding them.

Just for fun, I decided to concatenate all those files (and a few
more) into one huge binary file (my program was made to handle
this), and then to "cat" that file several times into an even
bigger binary file. Then the "good.pl" and "bad.pl" files were
changed to read that file as input. For the record, that file
was 5,938,445 bytes. I wrote my program so that it would do a
(successful) pattern match over the entire binary file contents
one time for each sub-file inside it. That means that for every
time this program is run with this >5MB file as input (which is,
for this particular data format, unrealistically large), almost
six megabytes of data are copied into $`, $&, and $' three
hundred times.

And here are the results of running this program with an
abnormally large data set:

Benchmark: timing 1000 iterations of good...
good: (129.78 usr + 2.56 sys = 132.34 CPU)
Benchmark: timing 1000 iterations of good...
good: (129.39 usr + 2.39 sys = 131.78 CPU)
Benchmark: timing 1000 iterations of bad...
bad: (130.36 usr + 2.94 sys = 133.30 CPU)


It looks like the performance penalty of using the $MATCH
variables is 133.30 - 131.78 = 1.52 seconds (for 1000
iterations). For each run, less than 2 milliseconds is the
penalty for needlessly using the $MATCH variables. Keep in
mind, for each run almost 6 megabytes of data were being
copied into $`, $&, and $' 300 times. And the penalty was
not even two milliseconds.

And here is where others can help me out. I'm curious to
see how your own favorite Perl programs fare. If you'd like
to post your own results, take a good, robust Perl script
(that doesn't use the $MATCH variables), copy it to
"good.pl" and "bad.pl", add an @ARGV line to both files
(if necessary), and then add a "taint" line to the "bad.pl"
script. Then run the above Perl code and compare the last
two benchmark readings (you may want to pipe the output to
"grep wallclock" to find the benchmark output if you normally
have a lot of output).


At this point, I have to say that these Perl programs I
wrote seem to show that the performance penalty of using $`,
$&, and $' is usually small and often negligible.



CONCLUSION


I happen to subscribe to Paul Graham's ideas on
optimization. On page 213 of his book "ANSI Common Lisp"
(yes, Lisp), he writes:


Three points can be made about optimization,
regardless of the implementation: it should be
focused on bottlenecks, it should not begin too
early, and it should begin with algorithms.

Probably the most important thing to understand
about optimization is that programs tend to have a
few bottlenecks that account for a great part of
the execution time. ... Optimizing these parts of
the program will make it run noticeably faster;
optimizing the rest of the program will be a
waste of time in comparison.

...

A corollary of the bottleneck rule is that one
should not put too much effort into optimizing
early in a program's life. Knuth puts the point
even more strongly: "Premature optimization is the
root of all evil (or at least most of it) in
programming."


In other words, optimizing code that is not a bottleneck may save
some time in the end, but the time it saves becomes negligible
when run together with a bottleneck, or at least with code that
has worse run-time behavior (measured in Big-O notation). The
time spent in optimizing non-bottleneck code is usually much
greater than the few milliseconds it saves.

And since the penalty of using $`, $&, and $' seems to grow
as O(N) (or possibly O(N*log(N))), the penalty will become
insignificant in programs that employ algorithms with worse
Big-O behavior (or that even use disk access).

So why does the penalty disclaimer seem to follow $'
wherever it is taught? Personally, I think it's because the
penalty (however small) does exist, and programmers think
they have a responsibility to inform other programmers
(since one piece of code can affect another piece of
seemingly unrelated code).

But I still wondered about that, so I looked it up in the book
where I first learned about it: "Learning Perl" by Randal L.
Schwartz & Tom Phoenix. On page 122, it says,


Now, we said earlier that these three are "free."
Well, freedom has its price. In this case, the
price is that once you use any one of these
automatic match variables anywhere in your entire
program, other regular expressions will run a
little more slowly. Now this isn't a giant
slowdown, but it's enough of a worry that many
Perl programmers will simply never use these
automatic match variables.[*] Instead, they'll
use a workaround.


And there's more: A footnote is included:


[*] Most of these folks haven't actually
benchmarked their programs to see whether their
workarounds actually save time, though; it's as
though these variables were poisonous or
something. But we can't blame them for not
benchmarking -- many programs that could benefit
from these three variables take up only a few
minutes of CPU time in a week, so benchmarking and
optimizing would be a waste of time. But in that
case, why fear a possible extra millisecond? By
the way, Perl developers are working on this
problem, but there will probably be no solution
before Perl 6.


Apparently, Randal L. Schwartz and Tom Phoenix don't think
it would be such a big deal if some programs used $'. I agree
with them, of course, though I acknowledge that some
situations suit $' better than others.

So I made a list of times it is acceptable to use $`, $&, and $':

1. when most of the pattern matches in your program are
candidate lines
2. when none of the text strings being pattern-matched is longer
than a thousand characters
3. when there are few pattern matches when compared with the
rest of the program
4. when used alongside algorithms that have worse run-time
behaviors

But it's also important to point out some times when it is not
acceptable to use those variables:

1. when a pattern-matched string has a length greater than
millions of characters
2. when you are writing a Perl module for others to use

Point 1 is important for programmers who have to handle large
amounts of binary data, or even long text strings, like DNA
sequences. A quick web search didn't turn up just how much
memory you would need to hold a complete human DNA sequence,
but I seem to remember reading that it was something on the
order of fifteen gigabytes. Reading something of that size takes up
enough time as it is; any needless copying will be definitely
noticeable and unwelcome.

Point 2 is important because your module might be used by people
who use super-long strings (such as DNA sequences). So if you
feel the need to violate this point, at least have the decency
to document your usage of these variables, so that others are
aware of it (and be prepared to watch usage of your module
plummet).

And here is another list, enumerating reasons why it's okay to
use those variables in a script for your own use:

1. If most of your pattern matches are candidate lines, there's
a good chance that using $' (instead of m/=(.*)$/) will
actually speed up your program.
2. Even if it doesn't speed up your program, the time wasted
will probably be negligible and won't ever be noticed.
3. Even in the rare case where the time wasted isn't
negligible, it should be a trivial matter to eliminate
the penalty by editing the code and removing the match
variables.

(I am reminded of the story I have heard about an enthusiastic
programmer who was determined to make his program run faster by
making it more "streamlined." He replaced various parts of code
with presumably faster code, only to find out that the resulting
speed-up was negligible (or worse -- his new program ran slower).
The moral of this story is that it's not always obvious in what
parts of code most of the processor time is being used. If a
section of code that uses very little processor time is
optimized, it might run faster, but the time it saves is
practically non-existent compared to the rest of the code.)

Bear in mind that using a tainted pattern match of:

if (m/match/)
{
    ...
}

is somewhat equivalent to the untainted:

if (m/^(.*?)(match)(.*)$/s)
{
    my $prematch  = $1;
    my $match     = $2;
    my $postmatch = $3;

    ...
}

in that the copying into $prematch, $match, and $postmatch is
done whether you want it or not. That amounts to the entire
string being copied over (possibly needlessly) for every
successful pattern match. If the string is small, there's a good
chance that this penalty won't even be measurable. However, in
the cases where the programmer has to make a conscious effort not
to needlessly copy a large string (possibly because he or she is
handling a large data stream), the $MATCH variables are probably
best avoided -- but ONLY if these large strings are
being matched with regular expressions (as opposed to, for
instance, the index() function).
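For instance, extracting the text after a delimiter with index() and
substr() avoids the regex engine entirely, so the $MATCH copying can
never be triggered (a small sketch of my own):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abc=xyz";

# index() locates the '=' without running a regular expression,
# so it can never trigger the $`/$&/$' copying penalty:
my $pos = index($string, '=');

my $after;
$after = substr($string, $pos + 1) if $pos >= 0;

print "$after\n";   # prints "xyz"
```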

Of course, if all your regular expressions with large strings
require you to capture the pre-match and post-match with
parentheses, it would probably be faster to go ahead and use the
$MATCH variables.

So it's your call as the programmer to make an educated decision
whether or not to use the $MATCH variables. If regular
expressions are used only on small strings, the penalty most
likely will make no real difference. But if large strings are
used, give it some thought.


Ironically, I never personally used $`, $&, and $' in my scripts
before I researched this (exactly because of the always-mentioned
penalty). But now that I have made all these benchmarked tests
and examined the results, I just might start using them in my own
small programs.


(Constructive criticism is welcome. Since usage of these match
variables is a touchy topic for some Perl programmers, please
don't flame me if I made a statement that doesn't agree with your
beliefs. After all, I am only human, so if I said something that
you feel is incorrect or invalid, feel free to point it out and
comment on it, without resorting to needlessly harsh comments
about my programming skills. Thank you.)

Happy Perling!

-- Jean-Luc Romano
 
jl_post

(Okay, so I know that Michele will probably not read this post, but
I've decided to answer his reply anyway, for the benefit of anybody who
might be interested in this thread.)


Michele Dondi replied:
> Indeed: as a personal cmt to you, I have the impression you
> tend to be overly verbose. I mean verboseness is not so bad
> per se, but the actual impression, gathered from other posts
> of yours, is that the actual content in them that may have
> been of some interest could have been expressed much more
> concisely, thus making it easier for it to reach a wider
> audience.

Could be, could be... But that's the way I tend to be. One reason for
that is that I like to explain things.

Another reason for that is that the people in this newsgroup tend to
point out things that I missed mentioning, whether I'm aware of them or
not. (That's not necessarily a bad thing.) For instance, if I mention
$' in my code without assigning it to a variable, someone may point out
that the Perl compiler may optimize $' out, making some think that my
benchmark is invalid (whether it really is or not). They may have a
valid point, which is why I often include extra explanations to cover
concerns that I anticipate that others may have.

> In this particular case I couldn't go beyond the first
> few paragraphs and I'm not willing to.

Well, if time is an issue, you might consider reading the "CONCLUSION"
section of this thread's original post. It discusses when it's best to
avoid using the MATCH variables, as well as when it's acceptable to use
them.

> Why are you (running under strict and) using as a
> generic variable the predefined global variable $a?

I used strict because I make it a habit of using it in my own code (and
I see no reason not to use it here). I assigned to $a because, well,
it's there to use and I don't have to declare it. I mean, I suppose I
could declare a variable, but I didn't see any reason to put a
declaration in my benchmark when it doesn't have to be there. I don't
think it really makes a difference either way.

> I mean, this post of yours has some flavour of a tutorial: so why
> exposing to potential newbies something we usually warn them against?

Good question. I'll give three answers:

First, I didn't mean this post to be a tutorial. I meant it to be easy
to follow and understand. If that makes it easy for potential newbies
to understand, then that's just a side-effect that I didn't plan.

Second, I don't think there's much harm in teaching potential newbies
about $', $&, and $`, as long as the obligatory warning is given (as it
almost always is).

And third, if we're not supposed to expose potential newbies to these
match variables, then I have to point out that Randal Schwartz and Tom
Phoenix already broke this rule in their book "Learning Perl," as they
use the match variables rather liberally in their examples. (I even
included a quote from their book in the CONCLUSION section of my
original post, in case you want to read some of their thoughts about
the match variables.)

> What is the point of benchmarking two completely
> different snippets, one of which just does something
> more than the other?

They're not two completely different snippets; they both have the code
in the $line variable in common. The "bad" program has an extra line
of code, which is used to "taint" the rest of the program (due to the
$' variable). Of course, it's not fair to compare the times of "good"
with "bad," as the "bad" code runs an extra line of code (which happens
to be the taint line). Therefore, I also benchmark that line of code
so that it can be subtracted from the "bad" code time. This difference
is what is compared to the "good" code time.

> Also, indeed lines of code benchmarked as strings are
> (string-)eval()ed which means that they are parsed and
> executed as stand-alone perl programs, but (as of the
> docs => see!) _in the lexical context_ of the current
> program, which means that the interpreter they're executed
> with is not a total stranger to the current one.

I'm not quite sure if I understand you perfectly here. It sounds like
you're saying that even if I use $' once in my code, it "spoils" all
the regular expressions (even the ones that have no need for $'). If
this is what you mean, I have to point out that I'm eval()ing the "bad"
code (the code with $') AFTER the "good" code is eval()ed. In other
words, the "good" code finishes executing before the Perl interpreter
ever gets a chance to eval() the "bad" code and discover that the $'
variable is being used at all. As a result, the $' taint in the "bad"
code shouldn't get a chance to spoil the "good" code.

> PS: FWIW I _partly_ agree with you that the general habit
> of frowning upon the use of $`, $&, $' a priori is to some
> extent exaggerated.

Thanks for the comment. Of course, there are times when it's bad to
use them, and other times when it's really no big deal. By making
benchmarks on several real programs I've found that the penalty of
using the match variables is almost always negligible. I wouldn't mind
feedback from other people who have benchmarked real, useful programs
(with and without the match variables) to see if they discover a
significant penalty. To be honest, I had to go out of my way and
use unrealistic data just to produce a non-negligible penalty.
(That doesn't mean the penalty will never be measurable in real
code; it just means that only under some conditions does the
penalty become measurable.)

And thanks for your feedback, Michele. It was appreciated.
Good luck in your studies!

-- Jean-Luc
 

bik.mido

> (Okay, so I know that Michele will probably not read this post, but
> I've decided to answer his reply anyway, for the benefit of anybody who
> might be interested in this thread.)

Well, every now and again I give a peek into google-groups...
>> my $prefix = q!$a = $' if "a" =~ m/a/;!;
>> my $line = '$a = 1 if "abc=xyz" =~ m/=/;';

>> Why are you (running under strict and) using as a
>> generic variable the predefined global variable $a? [snip]
>> I mean, this post of yours has some flavour of a tutorial: so why
>> exposing to potential newbies something we usually warn them
>> against?

> Good question. I'll give three answers:
>
> First, I didn't mean this post to be a tutorial. I meant it to be
> easy

I didn't claim it is. I said it "has some flavour of", which reflects
the impression I get out of it.

> Second, I don't think there's much harm in teaching potential newbies
> about $', $&, and $`, as long as the obligatory warning is given (as it
> almost always is).

I was not talking about this, but about using $a and $b as general
purpose variables.
And third, if we're not supposed to expose potential newbies to these
match variables, then I have to point out that Randal Schwartz and
Tom

Ditto as above!
I'm not quite sure if I understand you perfectly here. It sounds like
you're saying that even if I use $' once in my code, it "spoils" all
the regular expressions (even the ones that have no need for $'). If

This is what's written in the docs.
this is what you mean, I have to point out that I'm eval()ing the "bad"
code (the code with $') AFTER the "good" code is eval()ed. In other

Well, but then I _think_ (although I'm not really _sure_) that even if
you're running the code through a string eval() the interpreters run to
execute the code _are_ still affected by the side effect of using $',
or, as you say, 'taintedness'.

Why do I think so? Because, as the docs say, and as already hinted
above, eval()ed strings are parsed and run in _the lexical context_
of the current program. But let's check for ourselves; consider this
script:

| #!/usr/bin/perl
|
| use strict;
| use warnings;
| use Benchmark qw/:all :hireswallclock/;
|
| # 'aaaa' =~ /a/ and $';
|
| timethis -30, q{ 'aaaa' =~ /a/ }, 'all';
|
| __END__

Running it I get

| all: 36.9275 wallclock secs (35.77 usr + 0.14 sys = 35.91 CPU)
| @ 1343370.29/s (n=48240427)

whereas if I uncomment the commented line I get

| all: 32.1861 wallclock secs (30.69 usr + 0.18 sys = 30.87 CPU)
| @ 417959.18/s (n=12902400)

(Output slightly edited for clarity in both cases, but then please try
this yourself.)

I think that the figures are significant.
words, the "good" code finishes executing before the Perl interpreter
ever gets a chance to eval() the "bad" code and discover that the $'
variable is being used at all. As a result, the $' taint in the "bad"
code shouldn't get a chance to spoil the "good" code.
Are you sure? What about the test above then?!?


HTH,
Michele
 
J

jl_post

Well, every now and again I give a peek into google-groups...

You're welcome every time, then!
Well, but then I _think_ (although I'm not really _sure_)
that even if you're running the code through a string eval()
the interpreters run to execute the code _are_ still affected
by the side effect of using $', or, as you say, 'taintedness'.

Why do I think so? Because, still as of the docs and as already
hinted above eval()ed strings are parsed and run in _the lexical
context_ of the current program. But let's check for ourselves;
consider this script:

I agree with you completely when you say that code running through a
string eval() is still tainted, but I think you're missing one key
point: The code is not tainted if the string containing the $' has not
been eval()ed yet. Take your example code:
| #!/usr/bin/perl
|
| use strict;
| use warnings;
| use Benchmark qw/:all :hireswallclock/;
|
| # 'aaaa' =~ /a/ and $';
|
| timethis -30, q{ 'aaaa' =~ /a/ }, 'all';
|
| __END__

Uncommenting the line with $' will cause the Benchmark code to be
tainted, exactly as you say. But, if the $' line is eval()ed AFTER the
Benchmark code, then it cannot taint the Benchmark code, as the Perl
interpreter does not know yet that the $' variable exists.

To illustrate, consider the following code (which is just your code
with a few modifications):


#!/usr/bin/perl

use strict;
use warnings;
use Benchmark qw/:all :hireswallclock/;

timethis -30, q{ 'aaaa' =~ /a/ }, 'before';
timethis 1, q{ 'aaaa' =~ /a/ and $' }, 'ignore';
timethis -30, q{ 'aaaa' =~ /a/ }, 'after';

__END__


This modified code differs from your original code in that the "good"
code is benchmarked twice: Once before the $' is eval()ed and once
after (according to "perldoc Benchmark", a string passed into
timethis() is eval()ed). The output I get (slightly edited for
clarity) is:

| before: 32.0156 wallclock secs @ 2853800.97/s (n=90562520)
| after: 30.4844 wallclock secs @ 589883.69/s (n=17724825)

(I left out the output of the second benchmark, since its only purpose
is to "taint" the code.)

As you can see, if the "tainted" code tainted ALL the code, then both
"before" and "after" benchmarks would have roughly the same output.
But we see from the output that the "before" benchmark performed much
faster than the "after" benchmark.

Therefore, since the only thing that happened between the two
benchmarks was an eval/benchmark of the "tainter" code, it stands to
reason that any eval()ed code with $' only taints code that is eval()ed
after, and not before.

This supports the benchmark code findings that you wrote about. In
your benchmark, the $' variable was essentially eval()ed right before
run time, while the benchmarked code was eval()ed during run time (and
was therefore affected by the $' variable).

So if you wanted to modify your benchmark program, you could change the
taint line from:

| # 'aaaa' =~ /a/ and $';

to:

| # eval( q{ 'aaaa' =~ /a/ and $' } );

And place it AFTER the call to timethis(). Then you'll see that it
makes no difference whether it is commented or not. In case you're
interested, the output I got was:

COMMENTED: ... @ 3028027.55/s (n=93959695)
UNCOMMENTED: ... @ 3085748.57/s (n=95321859)

The code that eval()ed the $' actually ran slightly faster than the
code with the $' line commented out. That doesn't mean $' made any
code faster; it just means that, since the tainting eval() happened
only after the benchmark had already run, the $' made no difference
in the end, and the small speed gap between the two runs is
negligible.

(In all my benchmarks that I included in this thread's original post, I
made sure that the code without $' was always eval()ed or benchmarked
before the code that used it.)
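
(Another editorial aside, not from the original thread: Perl 5.10 and
later offer a per-pattern escape hatch, the /p modifier, which makes
${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} available for that one
match only, so no other regular expression in the program is taxed.
A minimal sketch:)

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abc=xyz";

# The /p modifier (Perl 5.10+) requests the pre/post-match copies for
# this one pattern only; ${^POSTMATCH} then plays the role of $'
# without tainting every other match in the program.
if ($string =~ m/=/p) {
    print ${^POSTMATCH}, "\n";    # prints "xyz"
}
```

(And, if memory serves, Perl 5.20's copy-on-write strings reduced the
classic $`, $&, $' penalty considerably, though that postdates this
thread.)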
Anyway, thanks again for your comments, Michele.

-- Jean-Luc
 
B

bik.mido

Well, but then I _think_ (although I'm not really _sure_)
that even if you're running the code through a string eval()
the interpreters run to execute the code _are_ still affected
by the side effect of using $', or, as you say, 'taintedness'.
[snip]
I agree with you completely when you say that code running through a
string eval() is still tainted, but I think you're missing one key
point: The code is not tainted if the string containing the $' has not
been eval()ed yet. Take your example code:
[snip]

I haven't tried your modified benchmarks myself, but I'll take your
word for it. It seems I somewhat misunderstood what you meant...
Michele
 
