How bad is $'? (Was: "Get substring of line")

Discussion in 'Perl Misc' started by J. Romano, Jan 18, 2005.

  1. J. Romano

    On Wednesday, January 5, 2005, Uri Guttman said:
    >
    > ... try to actually isolate the issue in a proper
    > benchmark. spend some deep time in thought in how
    > to do it. post your benchmark for review. then talk
    > about $' with some confidence.



    Okay, I'll take you up on that offer.

    (But be warned! Expect a "tome". :)


    First, allow me to define some terminology:

    * A "candidate line" is a pattern match that can easily be
    written to use $`, $&, or $'. Such lines don't have to use
    them, since they can be written with explicit captures, but
    nevertheless they are pattern matches that could benefit
    from the use of the $MATCH variables.

    * A "taint line" is a line of code that uses a $MATCH
    variable (such as $`, $&, or $'). The reason that it is
    called a "taint line" is because it "taints" all regular
    expressions with a performance penalty, whether they use
    the match variables or not. Note that in this post,
    the usage of the word "taint" has nothing to do with
    Perl's "taint mode", a mode that is turned on with the
    "-T" switch.


    Also, allow me to clear up a common mistake. The following
    two lines are NOT functionally identical:

    $a = $1 if $string =~ m/=(.*)$/;
    $a = $' if $string =~ m/=/;

    This is because the '.' matches all characters EXCEPT a
    newline. To make the two lines identical, either put an "s"
    modifier on the first line or add a chomp($a) call to the
    second. In other words, these two lines are functionally
    equivalent:

    $a = $1 if $string =~ m/=(.*)$/s;
    $a = $' if $string =~ m/=/;

    as are these (provided $string contains no embedded newlines):

    $a = $1 if $string =~ m/=(.*)$/;
    $a = $' if $string =~ m/=/; chomp($a);

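    To see the newline difference concretely, here is a small
    sketch (the string and variable names are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "key=value\n";

# Without /s, '.' stops at the newline, so the capture
# omits the trailing "\n":
my ($capture) = $string =~ m/=(.*)$/;    # "value"

# $' holds everything after the match, newline included:
my $postmatch;
$postmatch = $' if $string =~ m/=/;      # "value\n"

print "capture:   [$capture]\n";
print "postmatch: [$postmatch]\n";
```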

    Now to the benchmarking.

    The first benchmark program I came up with was this one:


    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark;

    my $count = 1e7;
    my $prefix = q!$a = $' if "a" =~ m/a/;!;
    my $line = '$a = 1 if "abc=xyz" =~ m/=/;';
    my $badProgram = "$prefix $line";

    timethese($count, {good => $line});
    timethese($count, {bad => $badProgram, prefix => $prefix});
    __END__


    The main program being tested is stored in the $line
    variable:

    $a = 1 if "abc=xyz" =~ m/=/;

    All it does is check to see if the string "abc=xyz" has an
    equal sign (which of course it does). If it finds one, it
    sets $a to 1.

    The first call to timethese() runs this line ten million
    times. Note that this line is not tainted, since the $'
    variable is never used until it is eval'ed in the second
    call to timethese().

    The second call to timethese() runs that line, together with
    the $prefix code, which is just a "taint line". I could
    just run that "bad" program and compare its time to the "good"
    program, but then the "bad" program would have twice as many
    regular expressions to run, unfairly making it look like it
    is much worse than it really is.

    That is why I decided to run the $prefix line as well.
    Therefore, its time can be subtracted from the "bad" code's
    time to see how code identical to the "good" code (but
    tainted) suffers from the performance penalty.

    By running the code, I get the following results (note that
    I reformatted the output, without changing the numbers, to
    make it fit nicely in limited space):

    good: ( 3.50 usr + 0.00 sys = 3.50 CPU)
    bad: (31.89 usr + 0.00 sys = 31.89 CPU)
    prefix: (18.25 usr + 0.00 sys = 18.25 CPU)


    Subtracting the times, I get:

    31.89 - 18.25 - 3.50 = 10.14 seconds

    It appears that the performance penalty for pattern
    matches is around 10.14 seconds per 10 million pattern
    matches. That's about one microsecond for every pattern
    match.


    Looking back at my program, I realized that someone might
    accuse the program of being invalid, claiming that whatever
    overhead timethese() uses to run code is subtracted out of
    the "bad" code (when we subtract out $prefix's time) but
    never subtracted from the "good" code, making the
    performance penalty look smaller than it really is.

    I'm not sure if there is an overhead, but just in case, I
    decided to make a new variation of the above code to account
    for this:


    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark;

    my $count = 1e7;
    my $goodPrefix = q!$a = "" if "a" =~ m/a/;!;
    my $badPrefix = q!$a = $' if "a" =~ m/a/;!;
    my $line = '$a = 1 if "abc=xyz" =~ m/=/;';
    my $goodProgram = "$goodPrefix $line";
    my $badProgram = "$badPrefix $line";

    timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
    timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
    __END__


    This code now has a $goodPrefix (which is untainted) and a
    $badPrefix (which is the taint line). The "good" and the
    "bad" code both will have their prefixes' time subtracted
    out, putting them on more equal footing. These were my
    results:

    good: ( 7.72 usr + 0.02 sys = 7.73 CPU)
    goodPrefix: ( 3.83 usr + 0.00 sys = 3.83 CPU)
    bad: (30.56 usr + 0.01 sys = 30.58 CPU)
    badPrefix: (16.09 usr + 0.02 sys = 16.11 CPU)


    The time difference between the tainted and untainted
    pattern matches is:

    (30.56 - 16.09) - (7.72 - 3.83) = 10.58

    That's a penalty of 10.58 seconds for ten million pattern
    matches (or about one microsecond for every pattern match).

    I thought about it a bit more, and I realized that the
    string "abc=xyz" is a rather small string, quite possibly
    much smaller than the average string used in a pattern
    match. It stands to reason that if the string-to-be-matched
    is much larger than seven characters, the penalty incurred
    from copying the string to $`, $&, and $' would be greater.

    So I composed the following program:


    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark;

    my $count = 1e6;
    my $goodPrefix = q!$a = "" if "a" =~ m/a/;!;
    my $badPrefix = q!$a = $' if "a" =~ m/a/;!;
    my $line = '$a = 1 if "abc=xyz" =~ m/=/;';

    # Replace "abc=xyz" with longer string:
    my $string = join('', 'A'..'Z','a'..'z','0'..'9');
    $string = "$string=$string";
    $string x= 1; # change multiplier to make longer string
    $line =~ s/abc=xyz/$string/;

    my $goodProgram = "$goodPrefix $line";
    my $badProgram = "$badPrefix $line";

    timethese($count, {good => $goodProgram, goodPrefix => $goodPrefix});
    timethese($count, {bad => $badProgram, badPrefix => $badPrefix});
    __END__


    The key line is this one:

    $string x= 1; # change multiplier to make longer string

    With the multiplier set to one, the main code line compares
    the string:

    ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789

    which is 125 characters long. That's 56% larger than the
    standard screen width of 80 characters.
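    The length is quick to verify (a throwaway check, not part
    of the benchmark itself):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = join('', 'A'..'Z', 'a'..'z', '0'..'9');  # 62 characters
$string = "$string=$string";                          # 62 + 1 + 62
print length($string), "\n";                          # prints 125
```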

    I discovered something interesting: as I doubled the
    multiplier from 1 to 2, then to 4, then 8, and so on, I
    discovered that the penalty did indeed increase, but not
    linearly. I would have expected the penalty to at least
    double each time, but instead it increased very slowly at
    first before it started to double. I suspect this delay in
    doubling was due to some fixed overhead.

    Anyway, for each multiplier, here is the measured
    penalty (in microseconds) for a pattern match with that
    long string:

    1: 1.08
    2: 1.14
    4: 1.19
    8: 1.67
    16: 1.84
    32: 2.73
    64: 3.78
    128: 5.72
    256: 9.89
    512: 19.13

    (Note that in the 512 case, 64 thousand characters were used
    in every pattern match, with a penalty of about 20
    microseconds (or 0.02 milliseconds) for every pattern
    match.)

    Although the doubling behavior of the penalty is slow to
    appear, it does eventually set in. Of course, as
    the multiplier increased to ever-greater numbers, the Perl
    interpreter may run out of its allotted RAM, resorting to
    other resources such as virtual memory & disk space to
    finish its task. Once this happens, the processes used to
    supply "substitute RAM" will likely create a large
    performance drop (significantly greater than just doubling).
    However, as we see from my results, a pattern match on a
    string of 64,000 characters still only has a penalty of
    about 20 microseconds.
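    For the record, each figure in the table above comes from
    the same arithmetic used earlier; a small helper sub
    (hypothetical, not part of the benchmark scripts) makes the
    formula explicit:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-match penalty in microseconds, given the four CPU times (in
# seconds) and the iteration count, using the subtraction scheme
# described earlier: (bad - badPrefix) - (good - goodPrefix).
sub penalty_us {
    my ($good, $good_prefix, $bad, $bad_prefix, $count) = @_;
    my $seconds = ($bad - $bad_prefix) - ($good - $good_prefix);
    return 1e6 * $seconds / $count;
}

# The earlier run with the short "abc=xyz" string:
printf "%.2f microseconds per match\n",
    penalty_us(7.72, 3.83, 30.56, 16.09, 1e7);   # prints 1.06
```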


    Once I did this, I got to thinking that all of my programs
    thus far use no candidate lines whatsoever. Obviously, the
    $MATCH variables are used in programs where some candidate
    lines are present; otherwise they're never used (or
    shouldn't be, at any rate). So assuming that none of the
    regular expressions are in candidate lines wouldn't
    necessarily be fair to the code that uses the $MATCH
    variables.

    So I decided to write this program that tests the
    performance of a program where ALL its regular expressions
    are in candidate lines:


    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark;

    my $count = 1e7;
    my $bad = q!$string = $' if "abc=xyz" =~ m/=/; chomp($string)!;
    my $good = q!$string = $1 if "abc=xyz" =~ m/=(.*)$/!;

    timethese($count, {good => $good});
    timethese($count, {bad => $bad});
    __END__


    The results:

    good: (19.92 usr + 0.00 sys = 19.92 CPU)
    bad: (17.38 usr + 0.00 sys = 17.38 CPU)

    It looks like for once that the "bad" code fares better than
    the "good" code (by an average of a quarter of a microsecond
    per regular expression)!

    I also tried replacing the "abc=xyz" string with the much
    larger string of:

    ABC...XYZabc...xyz012...789=ABC...XYZabc...xyz012...789

    repeated 256 times (for a total of 32,000 characters).
    My results were:

    good: (84.55 usr + 0.00 sys = 84.55 CPU)
    bad: (39.06 usr + 0.00 sys = 39.06 CPU)

    Surprisingly, the "bad" code ran more than twice as fast as
    the "good" code! Apparently, the larger the string that has
    to get captured, the better the "bad" code performs (that
    is, when compared to the non-tainted code).

    Therefore, I have to make the case that, in the event that
    someone creates a Perl program that has candidate lines
    for most of its regular expressions, it may actually be
    faster to use the $MATCH variables. (Of course, it may or
    may not be faster to use the $MATCH variables, but if no
    super-long strings are used, the difference will be tiny).

    At any rate, unless super-long strings are used in
    pattern-matches, the difference (for better or for worse) of
    using the $MATCH variables on each regular expression is on
    the order of microseconds per match.


    Okay, so I went through all this trouble to measure the
    speed of tainted lines vs. non-tainted lines. However, all
    the programs I've tested so far have no input or output,
    which is pretty rare for useful Perl scripts. Therefore,
    unless you plan to write Perl programs that consist of only
    pattern-matching and have absolutely no I/O involved, these
    tests really aren't that useful in determining just how bad
    a performance penalty from using the $MATCH variables can
    be.

    So I decided that my next logical step would be to benchmark
    a program that did a task that Perl would conceivably be
    used for: Searching every file in the current directory
    and printing out the number of lines that contain a certain
    string ("Perl", in this case):


    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark;

    my $count = 1e2;
    my $prefix = q!$a = $' if "a" =~ m/a/;!;
    my $program =<<'END_OF_PROGRAM';
    @ARGV = grep -f, <*>;
    $b = 0;
    m/\bPerl\b/ and $b++ while <>;
    END_OF_PROGRAM

    timethese($count, {good => $program});
    print "Count: $b\n";
    timethese($count, {bad => "$prefix $program", prefix => $prefix});
    print "Count: $b\n";
    __END__


    The results:

    good: (11.66 usr + 12.08 sys = 23.73 CPU)
    Count: 342
    bad: (11.88 usr + 11.72 sys = 23.59 CPU)
    prefix: ( 0.00 usr + 0.00 sys = 0.00 CPU)
    Count: 342

    We see that the "good" code uses less usr time (at about 2.2
    milliseconds per run), but for some reason it managed to use
    more sys time (about 3.6 milliseconds per run). I'm not
    sure why this is so, but I'm inclined to think that because
    the $MATCH penalty is so small compared to disk access, the
    penalty doesn't really show up when run with code that uses
    disk access. In other words, the variation in time for
    code that uses disk I/O is so great that it more than makes
    up for just about any penalty introduced by the $MATCH
    variables.


    Okay, so at this point I realize that almost all performance
    penalties incurred by the $MATCH variables are quite small,
    and that when used in programs that use I/O processes, they
    seem to become negligible. But I wanted to perform one more
    benchmark test: How would the $' variable affect REAL Perl
    programs, the heavy-duty scripts that we've written and are
    proud of?

    I have one such Perl program I wrote at work. We would get
    a special type of binary data files and we would need to
    peer into them to see what data they held. This is quite a
    complicated process, as it would require a full-fledged
    decoder to handle the special data types and
    representations. In order to peek into a new file that was
    new to us, we would have to take our normal decoder (written
    in C) and use the debugger to change the flow of execution
    so that the contents of the file would be written out to the
    screen for human eyes to read.

    This technique was cumbersome and slow, so I wrote a Perl
    program to parse through a binary file of that format and
    spit out its contents in human-readable form. It worked
    great. (And to think that the very first time I decoded a
    file in such a format I was using a paper, pencil, and hex
    dump output.)

    Because that Perl script I wrote belongs to the company I
    work for, I won't post it here, but I will give some
    statistics about it: In order to parse any binary file, it
    must first parse through three ASCII configuration files
    (of 173, 558, and 2060 lines, each line going through
    at least one regular expression). For about every 1K of
    input, the program spits out about 4K of output. There are
    quite a few regular expressions sprinkled throughout the
    program; several of which are operating on the binary data
    (one of which operates on an entire binary file). For the
    record, this program does NOT use the $`, $&, or $'
    variables, nor the English module (or any other module, for
    that matter).

    I made two copies of this program: one named "good.pl" and
    one named "bad.pl". I inserted the following line at the
    top of both files:

    @ARGV = <*.bin>;

    and inserted the following taint line at the top of
    "bad.pl":

    $a = $' if "a" =~ m/a/;

    Then I ran the following script:


    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark;

    my $count = 1e3;
    timethese($count, {good => "do 'good.pl'"});
    timethese($count, {good => "do 'good.pl'"});
    timethese($count, {bad => "do 'bad.pl'"});
    __END__


    Let me explain why I time "good.pl" twice:
    Because I have several functions defined in the script,
    timethese() keeps complaining that the functions keep
    being redefined over and over. Obviously, it won't
    complain about this the very first time the program is run,
    giving the first timethese() an advantage over any
    subsequent calls that test the same code. Therefore, by
    calling timethese() three times, I ensure that the second
    and third calls to timethese() are on equal footing (as far
    as redefining functions goes).

    I ran this program so that the code it called would have to
    decode a somewhat large load of 55 files (around 5K each).

    Here are the results:

    Benchmark: timing 1000 iterations of good...
    good: (45.09 usr + 1.78 sys = 46.88 CPU)
    Benchmark: timing 1000 iterations of good...
    good: (45.61 usr + 1.47 sys = 47.08 CPU)
    Benchmark: timing 1000 iterations of bad...
    bad: (45.64 usr + 1.55 sys = 47.19 CPU)


    It appears that, when comparing the CPU time, 1000 runs of
    the program takes 0.11 seconds longer when I use the $MATCH
    variables. That means that I'd save only around 0.11
    milliseconds (per single run) by avoiding them.

    Just for fun, I decided to concatenate all those files (and a few
    more) into one huge binary file (my program was made to handle
    this), and then to "cat" that file several times into an even
    bigger binary file. Then the "good.pl" and "bad.pl" files were
    changed to read that file as input. For the record, that file
    was 5,938,445 bytes. I wrote my program so that it would do a
    (successful) pattern match over the entire binary file contents
    one time for each sub-file inside it. That means that for every
    time this program is run with this >5MB file as input (which is,
    for this particular data format, unrealistically large), almost
    six megabytes of data are copied into $`, $&, and $' three
    hundred times.

    And here are the results of running this program with an
    abnormally large data set:

    Benchmark: timing 1000 iterations of good...
    good: (129.78 usr + 2.56 sys = 132.34 CPU)
    Benchmark: timing 1000 iterations of good...
    good: (129.39 usr + 2.39 sys = 131.78 CPU)
    Benchmark: timing 1000 iterations of bad...
    bad: (130.36 usr + 2.94 sys = 133.30 CPU)


    It looks like the performance penalty of using the $MATCH
    variables is 133.30 - 131.78 = 1.52 seconds (for 1000
    iterations). For each run, less than 2 milliseconds is the
    penalty for needlessly using the $MATCH variables. Keep in
    mind, for each run almost 6 megabytes of data were being
    copied into $`, $&, and $' 300 times. And the penalty was
    not even two milliseconds.

    And here is where others can help me out. I'm curious to
    see how your own favorite Perl programs fare. If you'd like
    to post your own results, take a good, robust Perl script
    (that doesn't use the $MATCH variables), copy it to
    "good.pl" and "bad.pl", add an @ARGV line to both files
    (if necessary), and then add a "taint" line to the "bad.pl"
    script. Then run the above Perl code and compare the last
    two benchmark readings (you may want to pipe the output to
    "grep wallclock" to find the benchmark output if you normally
    have a lot of output).


    At this point, I have to say that these Perl programs I
    wrote seem to show that the performance penalty of using $`,
    $&, and $' is usually small and often negligible.



    CONCLUSION


    I happen to subscribe to Paul Graham's ideas on
    optimization. On page 213 of his book "ANSI Common Lisp"
    (yes, Lisp), he writes:


    Three points can be made about optimization,
    regardless of the implementation: it should be
    focused on bottlenecks, it should not begin too
    early, and it should begin with algorithms.

    Probably the most important thing to understand
    about optimization is that programs tend to have a
    few bottlenecks that account for a great part of
    the execution time. ... Optimizing these parts of
    the program will make it run noticeably faster;
    optimizing the rest of the program will be a
    waste of time in comparison.

    ...

    A corollary of the bottleneck rule is that one
    should not put too much effort into optimizing
    early in a program's life. Knuth puts the point
    even more strongly: "Premature optimization is the
    root of all evil (or at least most of it) in
    programming."


    In other words, optimizing code that is not a bottleneck may save
    some time in the end, but the time it saves becomes negligible
    when run together with a bottleneck, or at least with code that
    has worse run-time behavior (measured in Big-O notation). The
    time spent in optimizing non-bottleneck code is usually much
    greater than the few milliseconds it saves.

    And since the penalty of using $`, $&, and $' seems to grow
    as O(N) or possibly O(N*log(N)), it will become
    insignificant in programs that employ algorithms with worse
    Big-O behavior (or that even use disk access).

    So why does the penalty disclaimer seem to follow $'
    wherever it is taught? Personally, I think it's because the
    penalty (however small) does exist, and programmers think
    they have a responsibility to inform other programmers
    (since one piece of code can affect another piece of
    seemingly unrelated code).

    But I still wondered about that, so I looked it up in the book
    where I first learned about it: "Learning Perl" by Randal L.
    Schwartz & Tom Phoenix. On page 122, it says,


    Now, we said earlier that these three are "free."
    Well, freedom has its price. In this case, the
    price is that once you use any one of these
    automatic match variables anywhere in your entire
    program, other regular expressions will run a
    little more slowly. Now this isn't a giant
    slowdown, but it's enough of a worry that many
    Perl programmers will simply never use these
    automatic match variables.[*] Instead, they'll
    use a workaround.


    And there's more: A footnote is included:


    [*] Most of these folks haven't actually
    benchmarked their programs to see whether their
    workarounds actually save time, though; it's as
    though these variables were poisonous or
    something. But we can't blame them for not
    benchmarking -- many programs that could benefit
    from these three variables take up only a few
    minutes of CPU time in a week, so benchmarking and
    optimizing would be a waste of time. But in that
    case, why fear a possible extra millisecond? By
    the way, Perl developers are working on this
    problem, but there will probably be no solution
    before Perl 6.


    Apparently Randal L. Schwartz and Tom Phoenix don't think it
    would be such a big deal if some programs used $'. I agree
    with them, of course, but I do acknowledge that some
    situations suit $' better than others.

    So I made a list of times it is acceptable to use $`, $&, and $':

    1. when most of the pattern matches in your program are
    candidate lines
    2. when none of the text strings being pattern-matched is longer
    than a thousand characters
    3. when there are few pattern matches when compared with the
    rest of the program
    4. when used alongside algorithms that have worse run-time
    behaviors

    But it's also important to point out some times when it is not
    acceptable to use those variables:

    1. when a pattern-matched string has a length greater than
    millions of characters
    2. when you are writing a Perl module for others to use

    Point 1 is important for programmers who have to handle large
    amounts of binary data, or even long text strings, like DNA
    sequences. I tried a search engine but couldn't find just how
    much memory you would need to hold a complete human DNA
    sequence, though I think I remember reading that it was
    something on the order of fifteen gigabytes. Reading something
    of that size takes up
    enough time as it is; any needless copying will be definitely
    noticeable and unwelcome.

    Point 2 is important because your module might be used by people
    who use super-long strings (such as DNA sequences). So if you
    feel the need to violate this point, at least have the decency
    to document your use of these variables, so that others are
    aware of it (and be prepared to watch the usage of your module
    plummet).

    And here is another list, enumerating reasons why it's okay to
    use those variables in a script for your own use:

    1. If most of your pattern matches are candidate lines, there's
    a good chance that using $' (instead of m/=(.*)$/) will
    actually speed up your program.
    2. Even if it doesn't speed up your program, the time wasted
    will probably be negligible and won't ever be noticed.
    3. Even in the rare case where the time wasted isn't
    negligible, it should be a trivial matter to eliminate
    the penalty by editing the code and removing the match
    variables.

    (I am reminded of the story I have heard about an enthusiastic
    programmer who was determined to make his program run faster by
    making it more "streamlined." He replaced various parts of code
    with presumably faster code, only to find out that the resulting
    speed-up was negligible (or worse -- his new program ran slower).
    The moral of this story is that it's not always obvious in what
    parts of code most of the processor time is being used. If a
    section of code that uses very little processor time is
    optimized, it might run faster, but the time it saves is
    practically non-existent compared to the rest of the code.)

    Bear in mind that using a tainted pattern match of:

    if (m/match/)
    {
    ...
    }

    is somewhat equivalent to the untainted:

    if (m/^(.*?)(match)(.*)$/s)
    {
    my $prematch = $1;
    my $match = $2;
    my $postmatch = $3;

    ...
    }

    in that the copying into $prematch, $match, and $postmatch is
    done whether you want it or not. That amounts to the entire
    string being copied over (possibly needlessly) for every
    successful pattern match. If the string is small, there's a good
    chance that this penalty won't even be measurable. However, in
    the cases where the programmer has to make a conscious effort
    not to needlessly copy a large string (perhaps because he or
    she is handling a large data stream), the $MATCH variables are
    probably best avoided -- but ONLY if these large strings are
    being matched with regular expressions (as opposed to being
    searched with, for instance, the index() function).
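    For instance, when the "pattern" is a fixed substring, index()
    and substr() can extract the postmatch without involving the
    regex engine at all (a minimal sketch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abc=xyz";
my $pos = index($string, '=');                 # -1 if not found
if ($pos >= 0) {
    my $postmatch = substr($string, $pos + 1); # "xyz"
    print "$postmatch\n";
}
```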

    Of course, if all your regular expressions with large strings
    require you to capture the pre-match and post-match with
    parentheses, it would probably be faster to go ahead and use the
    $MATCH variables.
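    It's also worth noting a workaround documented in perlvar: the
    @- and @+ match-offset arrays (available since Perl 5.6) give
    you the same information as $`, $&, and $' without imposing
    the global penalty. A sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "abc=xyz";
if ($string =~ m/=/) {
    # $-[0] and $+[0] are the offsets of the whole match:
    my $prematch  = substr($string, 0, $-[0]);             # like $`  ("abc")
    my $match     = substr($string, $-[0], $+[0] - $-[0]); # like $&  ("=")
    my $postmatch = substr($string, $+[0]);                # like $'  ("xyz")
    print "[$prematch][$match][$postmatch]\n";
}
```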

    So it's your call as the programmer to make an educated decision
    whether or not to use the $MATCH variables. If regular
    expressions are used only on small strings, the penalty most
    likely will make no real difference. But if large strings are
    used, give it some thought.


    Ironically, I never personally used $`, $&, and $' in my scripts
    before I researched this (exactly because of the always-mentioned
    penalty). But now that I have made all these benchmarked tests
    and examined the results, I just might start using them in my own
    small programs.


    (Constructive criticism is welcome. Since usage of these match
    variables is a touchy topic for some Perl programmers, please
    don't flame me if I made a statement that doesn't agree with your
    beliefs. After all, I am only human, so if I said something that
    you feel is incorrect or invalid, feel free to point it out and
    comment on it, without resorting to needlessly harsh comments
    about my programming skills. Thank you.)

    Happy Perling!

    -- Jean-Luc Romano

  2. J. Romano

    (Okay, so I know that Michele will probably not read this post, but
    I've decided to answer his reply anyway, for the benefit of anybody who
    might be interested in this thread.)


    > On 18 Jan 2005 07:29:11 -0800, (J. Romano) wrote:
    >
    > > Okay, I'll take you up on that offer.
    > >
    > > (But be warned! Expect a "tome". :)


    Michele Dondi replied:
    >
    > Indeed: as a personal cmt to you, I have the impression you
    > tend to be overly verbose. I mean verboseness is not so bad
    > per se, but the actual impression, gathered from other posts
    > of yours, is that the actual content in them that may have
    > been of some interest could have been expressed much more
    > concisely, thus making it easier for it to reach a wider
    > audience.


    Could be, could be... But that's the way I tend to be. One reason for
    that is that I like to explain things.

    Another reason for that is that the people in this newsgroup tend to
    point out things that I missed mentioning, whether I'm aware of them or
    not. (That's not necessarily a bad thing.) For instance, if I mention
    $' in my code without assigning it to a variable, someone may point out
    that the Perl compiler may optimize $' out, making some think that my
    benchmark is invalid (whether it really is or not). They may have a
    valid point, which is why I often include extra explanations to cover
    concerns that I anticipate that others may have.

    > In this particular case I couldn't go beyond the first
    > few paragraphs and I'm not willing to.


    Well, if time is an issue, you might consider reading the "CONCLUSION"
    section of this thread's original post. It discusses when it's best to
    avoid using the MATCH variables, as well as when it's acceptable to use
    them.


    > >The first benchmark program I came up with was this one:
    > >
    > >#!/usr/bin/perl
    > >use strict;
    > >use warnings;
    > >use Benchmark;
    > >
    > >my $count = 1e7;
    > >my $prefix = q!$a = $' if "a" =~ m/a/;!;
    > >my $line = '$a = 1 if "abc=xyz" =~ m/=/;';

    >
    > Why are you (running under strict and) using as a
    > generic variable the predefined global variable $a?


    I used strict because I make it a habit of using it in my own code (and
    I see no reason not to use it here). I assigned to $a because, well,
    it's there to use and I don't have to declare it. I mean, I suppose I
    could declare a variable, but I didn't see any reason to put a
    declaration in my benchmark when it doesn't have to be there. I don't
    think it really makes a difference either way.

    > I mean, this post of yours has some flavour of a tutorial: so why
    > exposing to potential newbies something we usually warn them against?


    Good question. I'll give three answers:

    First, I didn't mean this post to be a tutorial. I meant it to be easy
    to follow and understand. If that makes it easy for potential newbies
    to understand, then that's just a side-effect that I didn't plan.

    Second, I don't think there's much harm in teaching potential newbies
    about $', $&, and $`, as long as the obligatory warning is given (as it
    almost always is).

    And third, if we're not supposed to expose potential newbies to these
    match variables, then I have to point out that Randal Schwartz and Tom
    Phoenix already broke this rule in their book "Learning Perl," as they
    use the match variables rather liberally in their examples. (I even
    included a quote from their book in the CONCLUSION section of my
    original post, in case you want to read some of their thoughts about
    the match variables.)


    > >my $badProgram = "$prefix $line";
    > >
    > >timethese($count, {good => $line});
    > >timethese($count, {bad => $badProgram, prefix => $prefix});
    > >__END__

    >
    > What is the point of benchmarking two completely
    > different snippets, one of which just does something
    > more than the other?


    They're not two completely different snippets; they both have the code
    in the $line variable in common. The "bad" program has an extra line
    of code, which is used to "taint" the rest of the program (due to the
    $' variable). Of course, it's not fair to compare the times of "good"
    with "bad," as the "bad" code runs an extra line of code (which happens
    to be the taint line). Therefore, I also benchmark that line of code
    so that it can be subtracted from the "bad" code time. This difference
    is what is compared to the "good" code time.
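
    Concretely, that subtraction approach can be sketched as follows,
    reassembled from the $prefix and $line snippets quoted in this thread
    (the $count value here is illustrative, not the one from my original
    post):

    #!/usr/bin/perl
    # A sketch of the subtraction benchmark described above.
    use strict;
    use Benchmark qw(timethese);

    my $count  = 500_000;                          # illustrative value
    my $prefix = q!$a = $' if "a" =~ m/a/;!;       # the "taint line"
    my $line   = '$a = 1 if "abc=xyz" =~ m/=/;';   # the "candidate line"
    my $badProgram = "$prefix $line";

    # "good" runs the candidate line alone; "bad" prepends the taint
    # line to it; "prefix" times the taint line by itself, so its cost
    # can be subtracted from "bad" before comparing against "good".
    timethese($count, { good => $line });
    timethese($count, { bad => $badProgram, prefix => $prefix });

    __END__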

    > Also, indeed lines of code benchmarked as strings are
    > (string-)eval()ed which means that they are parsed and
    > executed as stand-alone perl programs, but (as of the
    > docs => see!) _in the lexical context_ of the current
    > program, which means that the interpreter they're executed
    > with is not a total stranger to the current one.


    I'm not quite sure if I understand you perfectly here. It sounds like
    you're saying that even if I use $' once in my code, it "spoils" all
    the regular expressions (even the ones that have no need for $'). If
    this is what you mean, I have to point out that I'm eval()ing the "bad"
    code (the code with $') AFTER the "good" code is eval()ed. In other
    words, the "good" code finishes executing before the Perl interpreter
    ever gets a chance to eval() the "bad" code and discover that the $'
    variable is being used at all. As a result, the $' taint in the "bad"
    code shouldn't get a chance to spoil the "good" code.


    > PS: FWIW I _partly_ agree with you that the general habit
    > of frowning upon the use of $`, $&, $' a priori is to some
    > extent exaggerated.


    Thanks for the comment. Of course, there are times when it's bad to
    use them, and other times when it's really no big deal. By making
    benchmarks on several real programs I've found that the penalty of
    using the match variables is almost always negligible. I wouldn't mind
    feedback from other people who have benchmarked real, useful programs
    (with and without the match variables) to see if they discover a
    significant penalty. To be honest, I had to go out of my way to create
    unrealistic data in order to produce a non-negligible penalty. (That
    doesn't mean the penalty will never be measurable in real code; it
    just means that only under certain conditions does it become
    measurable.)

    And thanks for your feedback, Michele. It was appreciated.
    Good luck in your studies!

    -- Jean-Luc
    Feb 1, 2005, #2

  3. J. Romano

    Guest wrote:
    > (Okay, so I know that Michele will probably not read this post, but
    > I've decided to answer his reply anyway, for the benefit of anybody
    > who might be interested in this thread.)


    Well, every now and again I give a peek into google-groups...

    > > >my $prefix = q!$a = $' if "a" =~ m/a/;!;
    > > >my $line = '$a = 1 if "abc=xyz" =~ m/=/;';

    > >
    > > Why are you (running under strict and) using as a
    > > generic variable the predefined global variable $a?

    [snip]
    > > I mean, this post of yours has some flavour of a tutorial: so why
    > > exposing to potential newbies something we usually warn them
    > > against?
    >
    > Good question. I'll give three answers:
    >
    > First, I didn't mean this post to be a tutorial. I meant it to be
    > easy

    I didn't claim it is. I said it "has some flavour of", which reflects
    the impression I get out of it.

    > Second, I don't think there's much harm in teaching potential newbies
    > about $', $&, and $`, as long as the obligatory warning is given (as
    > it almost always is).


    I was not talking about this, but about using $a and $b as general
    purpose variables.

    > And third, if we're not supposed to expose potential newbies to these
    > match variables, then I have to point out that Randal Schwartz and
    > Tom

    Ditto as above!

    > > Also, indeed lines of code benchmarked as strings are
    > > (string-)eval()ed which means that they are parsed and
    > > executed as stand-alone perl programs, but (as of the
    > > docs => see!) _in the lexical context_ of the current
    > > program, which means that the interpreter they're executed
    > > with is not a total stranger to the current one.

    >
    > I'm not quite sure if I understand you perfectly here. It sounds
    > like you're saying that even if I use $' once in my code, it
    > "spoils" all the regular expressions (even the ones that have no
    > need for $'). If


    This is what's written in the docs.

    > this is what you mean, I have to point out that I'm eval()ing the
    > "bad" code (the code with $') AFTER the "good" code is eval()ed. In
    > other


    Well, but then I _think_ (although I'm not really _sure_) that even if
    you're running the code through a string eval() the interpreters run to
    execute the code _are_ still affected by the side effect of using $',
    or, as you say, 'taintedness'.

    Why do I think so? Because, still as of the docs and as already hinted
    above eval()ed strings are parsed and run in _the lexical context_ of
    the current program. But let's check for ourselves; consider this
    script:

    | #!/usr/bin/perl
    |
    | use strict;
    | use warnings;
    | use Benchmark qw/:all :hireswallclock/;
    |
    | # 'aaaa' =~ /a/ and $';
    |
    | timethis -30, q{ 'aaaa' =~ /a/ }, 'all';
    |
    | __END__

    Running it I get

    | all: 36.9275 wallclock secs (35.77 usr + 0.14 sys = 35.91 CPU)
    | @ 1343370.29/s (n=48240427)

    whereas if I uncomment the commented line I get

    | all: 32.1861 wallclock secs (30.69 usr + 0.18 sys = 30.87 CPU)
    | @ 417959.18/s (n=12902400)

    (Output slightly edited for clarity in both cases, but then please try
    this yourself.)

    I think that the figures are significant.

    > words, the "good" code finishes executing before the Perl interpreter
    > ever gets a chance to eval() the "bad" code and discover that the $'
    > variable is being used at all. As a result, the $' taint in the
    > "bad" code shouldn't get a chance to spoil the "good" code.

    Are you sure? What about the test above then?!?


    HTH,
    Michele
    Feb 2, 2005, #3
  4. J. Romano

    Guest wrote:
    >
    > Well, every now and again I give a peek into google-groups...


    You're welcome every time, then!

    > Well, but then I _think_ (although I'm not really _sure_)
    > that even if you're running the code through a string eval()
    > the interpreters run to execute the code _are_ still affected
    > by the side effect of using $', or, as you say, 'taintedness'.
    >
    > Why do I think so? Because, still as of the docs and as already
    > hinted above eval()ed strings are parsed and run in _the lexical
    > context_ of the current program. But let's check for ourselves;
    > consider this script:


    I agree with you completely when you say that code running through a
    string eval() is still tainted, but I think you're missing one key
    point: The code is not tainted if the string containing the $' has not
    been eval()ed yet. Take your example code:

    > | #!/usr/bin/perl
    > |
    > | use strict;
    > | use warnings;
    > | use Benchmark qw/:all :hireswallclock/;
    > |
    > | # 'aaaa' =~ /a/ and $';
    > |
    > | timethis -30, q{ 'aaaa' =~ /a/ }, 'all';
    > |
    > | __END__


    Commenting out the line with $' will cause the Benchmark code to be
    tainted, exactly as you say. But, if the $' line is eval()ed AFTER the
    Benchmark code, then it cannot taint the Benchmark code, as the Perl
    interpreter does not know yet that the $' variable exists.

    To illustrate, consider the following code (which is just your code
    with a few modifications):


    #!/usr/bin/perl

    use strict;
    use warnings;
    use Benchmark qw/:all :hireswallclock/;

    timethis -30, q{ 'aaaa' =~ /a/ }, 'before';
    timethis 1, q{ 'aaaa' =~ /a/ and $' }, 'ignore';
    timethis -30, q{ 'aaaa' =~ /a/ }, 'after';

    __END__


    This modified code differs from your original code in that the "good"
    code is benchmarked twice: Once before the $' is eval()ed and once
    after (according to "perldoc Benchmark", a string passed into
    timethis() is eval()ed). The output I get (slightly edited for
    clarity) is:

    | before: 32.0156 wallclock secs @ 2853800.97/s (n=90562520)
    | after: 30.4844 wallclock secs @ 589883.69/s (n=17724825)

    (I left out the output of the second benchmark, since its only purpose
    is to "taint" the code.)

    As you can see, if eval()ing the "taint" code had tainted ALL the
    code (including code benchmarked before it), then both the "before"
    and "after" benchmarks would have shown roughly the same rate. But we
    see from the output that the "before" benchmark performed much faster
    than the "after" benchmark.

    Therefore, since the only thing that happened between the two
    benchmarks was an eval/benchmark of the "tainter" code, it stands to
    reason that any eval()ed code with $' only taints code that is eval()ed
    after, and not before.

    This is consistent with the benchmark results you reported. In your
    script, the $' line was compiled before run time, while the
    benchmarked code was eval()ed during run time (and was therefore
    affected by the $' variable).

    So if you wanted to modify your benchmark program, you could change the
    taint line from:

    | # 'aaaa' =~ /a/ and $';

    to:

    | # eval( q{ 'aaaa' =~ /a/ and $' } );

    And place it AFTER the call to timethis(). Then you'll see that it
    makes no difference whether it is commented or not. In case you're
    interested, the output I got was:

    COMMENTED: ... @ 3028027.55/s (n=93959695)
    UNCOMMENTED: ... @ 3085748.57/s (n=95321859)

    The code that eval()ed the $' actually ran slightly faster than the
    code that had the $' line commented. In this special case, it doesn't
    mean that $' made any code any faster, but just that it made no
    difference in the end, and that the speed difference between the two
    runs is negligible.
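
    Putting those pieces together, the whole modified script would look
    something like this (a sketch assembled from the fragments above):

    #!/usr/bin/perl
    # Michele's benchmark, modified as described: the taint line is
    # wrapped in a string eval() and moved AFTER the call to timethis(),
    # so the benchmarked code runs before perl ever sees the $' variable.
    use strict;
    use warnings;
    use Benchmark qw/:all :hireswallclock/;

    timethis -30, q{ 'aaaa' =~ /a/ }, 'all';

    eval( q{ 'aaaa' =~ /a/ and $' } );

    __END__

    With the eval() line in place (or commented out), the rate reported
    for 'all' should be essentially the same either way.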

    (In all my benchmarks that I included in this thread's original post, I
    made sure that the code without $' was always eval()ed or benchmarked
    before the code that used it.)
    Anyway, thanks again for your comments, Michele.

    -- Jean-Luc
    Feb 2, 2005, #4
  5. J. Romano

    Guest wrote:
    > > Well, but then I _think_ (although I'm not really _sure_)
    > > that even if you're running the code through a string eval()
    > > the interpreters run to execute the code _are_ still affected
    > > by the side effect of using $', or, as you say, 'taintedness'.

    [snip]
    > I agree with you completely when you say that code running through a
    > string eval() is still tainted, but I think you're missing one key
    > point: The code is not tainted if the string containing the $' has
    > not been eval()ed yet. Take your example code:

    [snip]

    I haven't tried your modified benchmarks myself, but I'll take your
    word for it. It seems I kind of misunderstood what you meant...
    Michele
    Feb 3, 2005, #5