Issue: unexpected value in $2 (Perl 5.10.1)

Discussion in 'Perl Misc' started by John Bokma, Mar 22, 2013.

  1. John Bokma

    John Bokma Guest

    The following piece of code assigns an unexpected value to $2 in a
    program I am working on. When I run tests, it works as expected, but
    when I run the actual program (which parses currently 2 large XML files)
    $2 gets assigned a random value when the second XML file is parsed [1]:

    $iso_gmt =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
    or croak "Not a valid ISO GMT date time: '$iso_gmt'";
    # Note: without the assingment below $2 might be set to a different
    # value for an unknown reason (bug in XML::parser C code?)
    # perl, v5.10.1
    #my $month = $2;
    my $seconds_since_epoch;
    my $error;
    $seconds_since_epoch = timegm( $6, $5, $4, $3, $2 - 1, $1 - 1900 );

    $iso_gmt = '2012-10-30T21:05:01Z';

    Examples of errors:

    Month '20129103' out of range 0..11
    Month '11826511' out of range 0..11
    Month '20079951' out of range 0..11
    Month '15160655' out of range 0..11
    Month '21972303' out of range 0..11
    Month '10208591' out of range 0..11

    0133254f - 20129103
    00b4754f - 11826511
    0132654f - 20079951
    00e7554f - 15160655
    014f454f - 21972303
    009bc54f - 10208591

    ^^^ looks like there is a pattern.

    Uncommenting the assignment of $2 to $month removes the effect.
    Reversing the parse order of the 2 XML files also removes the effect.

    Anyone a suggestion (or 2) to pin point what goes wrong here (or is this
    a know bug?). It looks like memory accidentally gets overwritten.


    [1] using XML::parser
     
    John Bokma, Mar 22, 2013
    #1
    1. Advertisements

  2. John Bokma

    John Bokma Guest

    My apologies for the unclear write up. The above piece of code is called
    by code that uses XML::parser to parse two XML files. Since the issue
    seems to depend on the order in which the XML files are parsed I don't
    think it's possible (or: easy) to create a minimal test case (otherwise
    I certainly would've posted that one). Note: tests that I run via prove
    to test the above code don't show the issue.
    Yes, same here, usually ;-).
    timegm throws the "Month ... out of range ...". I used a one liner to
    convert those values to hex (given above).

    [..]
    Thanks, will look into those two options. Right now I am not able to
    reproduce it. Most likely due to code changes.
     
    John Bokma, Mar 23, 2013
    #2
    1. Advertisements

  3. John Bokma

    Klaus Guest

    I agree, but I would also add that it is perfectly safe to use $1, $2,
    $3, ... in an assignment to local variables immediatly after $iso_gmt
    =~ /.../ or croak "..."

    For example, the following lines are perfectly safe:

    $iso_gmt =~ /.../ or croak "...";
    my ($year, $month, $day, $hour, $min, $sec)
    = ($1, $2, $3, $4, $5, $6);

    However, the following line is not safe:

    I am very suspicious about the fact that $1, $2, $3, ... are passed by
    reference to a subroutine timegm( $6, $5, $4, $3, $2 - 1, $1 - 1900 ).

    Variables such as $1, $2, $3, ... have very complicated scoping rules.

    Passing such variables by reference to a subroutine makes it very hard
    to guarantee the integrity of $1, $2, $3, ...
     
    Klaus, Mar 23, 2013
    #3
  4. John Bokma

    John Bokma Guest

    Can you give me an example of calling a sub that shows this unsafe
    behavior? And/or the scoping issues you mention?

    I've always considered it perfectly safe to do something like:

    $foo =~ /(.)/;
    bar( $1 );

    if not, I like to know cases this can go wrong. Thanks.
     
    John Bokma, Mar 23, 2013
    #4
  5. John Bokma

    Klaus Guest

    Here is an example where

    Case 1 uses the safe bar($var)
    Case 2 uses the unsafe bar($1)

    ==============================================
    use strict;
    use warnings;

    my $foo = 'abc';

    $foo =~ /^(..).$/ or die "Error 1";
    my $var = $1;
    print "Case 1: ", bar($var), "\n";

    $foo =~ /^(..).$/ or die "Error 2";
    print "Case 2: ", bar($1), "\n";

    sub bar {
    my $z = '';

    $_[0] =~ /^(.).$/ or return "Error 3";
    $z .= $1;

    $_[0] =~ /^.(.)$/ or return "Error 4";
    $z .= $1;

    return "ok ('$z')";
    }
    ==============================================

    The output is:
    ==============================================
    Case 1: ok ('ab')
    Case 2: Error 4
    ==============================================
     
    Klaus, Mar 23, 2013
    #5
  6. John Bokma

    Klaus Guest

    see perldoc perlre:

    ++ These special variables, like the %+ hash and the numbered match
    ++ variables ($1 , $2 , $3 , etc.) are dynamically scoped until the
    ++ end of the enclosing block or until the next successful match,
    ++ whichever comes first.

    The scoping issue is that the execution of timegm($6, ...) falls in
    the same dynamically scoped block as the previously executed $iso_gmt
    =~ /.../ or croak "...";

    timegm($6, ...) can itself have regular expressions, in which case
    that regular expression overwrites whatever is contained in $6.

    The issue is that $6 is aliased to the first parameter of
    timegm($6, ...).

    These cases are not necessarily wrong, but they are very difficult to
    get right.
     
    Klaus, Mar 23, 2013
    #6
  7. [...]
    'use warnings' wouldn't complain about .= with a 'no value' variable
    on the left hand side and even if this wasn't the case, the useless
    concatenation could simply be replaced with a proper assignment in
    order to avoid the "cargo cult 'initialization'" (my $z = '').
     
    Rainer Weikusat, Mar 23, 2013
    #7
  8. While that is theoretically possible, it doesn't seem to be the case
    with timegm(). I just skimmed the source and don't see any use of
    regexps there and frankly I can't think of any legitimate use timegm()
    could have for regexps.

    So I doubt that this is the issue. (Also I think John would have been
    able to extract a minimal test case if it was that easy)

    hp
     
    Peter J. Holzer, Mar 23, 2013
    #8
  9. #!/usr/bin/perl

    use strict;
    use warnings;
    use feature qw{ say };

    sub holy_cosmos {
    my $input = shift @_;
    'holy_hallelujah' =~ m{holy_(.*)};
    say $1 // 'missing'
    }

    'holy_rainbow' =~ m{(.*)_(.*)};
    say $1 // 'not found';
    holy_cosmos $1;
    say $1 // 'not found';

    __END__

    {2109:6} [0:0]% perl ~/foo.90HmzN.pl
    holy
    hallelujah
    holy
    {2374:7} [0:0]%

    I doubt there's any. Probably, that was the case in pre-historic Perls.
    Or this is some XS that directly manipulates variables in other scope.

    *CUT*
     
    Eric Pozharski, Mar 24, 2013
    #9
  10. Eric> sub holy_cosmos {
    Eric> my $input = shift @_;
    Eric> 'holy_hallelujah' =~ m{holy_(.*)};
    Eric> say $1 // 'missing'
    Eric> }

    This is bad. Very bad. If the match fails, it DOES NOT RESET the match
    variables. So you have broken code from the getgo.
     
    Randal L. Schwartz, Mar 25, 2013
    #10
  11. Slowly: they said
    (<>)
    that this:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use feature qw{ say };

    sub foo {
    my ($x) = @_;
    "y" =~ /(.)/;
    say "$x:$_[0]";
    }

    "x" =~ /(.)/;
    foo $1;
    say $1;
    __END__

    should be this:

    x:y
    y

    or this:

    y:y
    y

    instead it is this:

    {1894:4} [0:0]% perl ~/foo.H2vhcu.pl
    x:y
    x

    p.s. I've already seen what Devel::peek::Dump said me, has nothing to
    do with subs, @_, or aliasing.
     
    Eric Pozharski, Mar 25, 2013
    #11
  12. How do you reset match variables?

    p.s. I already feel much better and would like to pump it even further.
     
    Eric Pozharski, Mar 25, 2013
    #12
  13. John Bokma

    Klaus Guest

    Usually you don't reset match variables at all.

    The point is to test the match, if it fails, then you don't look at
    match variables.

    For example:

    if ('holy_hallelujah' =~ m{holy_(.*)}) {
    say $1;
    }
    else {
    # do not look at match variables...
    say 'missing';
    }
     
    Klaus, Mar 25, 2013
    #13
  14. The next successful match will 'reset' them. In this case, this means
    if the 'holy_hallelujah' match could fail (which it can't but a
    'non-constant' one could), the line below would test the value the
    last successful match put into $1.

    As Klaus already wrote: The easiest way to stay clear of $1
    .... pitfalls is 'never use them except in code which is conditionally
    executed when "the match" succeeded'.
     
    Rainer Weikusat, Mar 25, 2013
    #14
  15. Giganews filters googlegroups? Amazing. Count me in.

    *SKIP*
    #!/usr/bin/perl

    use strict;
    use warnings;
    use feature qw{ say };

    'holy_rainbow' =~ m{(.*)_(.*)};
    say $1 // 'not found';

    {
    'holy_hallelujah' =~ m{holy_(.*)};
    say $1 // 'missing';
    }

    say $1 // 'not found';

    __END__

    {11145:5} [0:0]% perl ~/foo.4PonDs.pl
    holy
    hallelujah
    holy

    See? And no @_ involved.

    *CUT*
     
    Eric Pozharski, Mar 26, 2013
    #15
  16. [...]
    What is this now supposed to demonstrate? As documented, the $1
    .... are always implicitly local to the enclosing block. This subthread
    was about passing them as arguments to a subroutine call which will
    cause them to be aliased to @_ slots and a successful pattern match
    executed by the subroutine will then change the arguments.

    Slightly less contrived example
    -----------------------------
    sub move_suffix
    {
    $_[0] =~ /_(.*)$/ and return "$_[1]_$1";
    return $_[1];
    }

    print(move_suffix('aaa_bbb', 'ccc'), "\n");

    'hhh_jjj' =~ /(.*)_/ and print(move_suffix('aaa_bbb', $1), "\n");
    ------------------------------

    This will print

    ccc_bbb
    bbb_bbb

    instead of

    ccc_bbb
    hhh_bbb

    Since assigning a scalar to another scalar implies copying the value, I
    usually don't use scalar assignments in simple (frequently executed)
    subroutines which don't need 'modifiable state variables' initialized
    to the passed argument values.
     
    Rainer Weikusat, Mar 26, 2013
    #16
  17. The statement was (paraphrased) that passing $1 ... as arguments to a
    subroutine call wouldn't be safe. This actually demonstrated by the
    example you posted because the value of $_[0] does change because of
    the pattern match: If something other than $1 had been passed, the
    output would be two times the value of that, not the value of the
    passed arguments once followed by y.
     
    Rainer Weikusat, Mar 26, 2013
    #17
  18. While the answer to your question was already demonstrated in this
    thread (a year ago!), I still want to add my 2¢: I ***always*** write
    such calls as
    bar "$1";

    I think that the slowdown is negligible, while the benefits are
    enormous (you may guess that I spent many hours tracing this kind of bugs).

    Yours,
    Ilya
     
    Ilya Zakharevich, Jun 1, 2014
    #18
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.