Issue: unexpected value in $2 (Perl 5.10.1)

J

John Bokma

The following piece of code assigns an unexpected value to $2 in a
program I am working on. When I run tests, it works as expected, but
when I run the actual program (which parses currently 2 large XML files)
$2 gets assigned a random value when the second XML file is parsed [1]:

$iso_gmt =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
or croak "Not a valid ISO GMT date time: '$iso_gmt'";
# Note: without the assingment below $2 might be set to a different
# value for an unknown reason (bug in XML::parser C code?)
# perl, v5.10.1
#my $month = $2;
my $seconds_since_epoch;
my $error;
$seconds_since_epoch = timegm( $6, $5, $4, $3, $2 - 1, $1 - 1900 );

$iso_gmt = '2012-10-30T21:05:01Z';

Examples of errors:

Month '20129103' out of range 0..11
Month '11826511' out of range 0..11
Month '20079951' out of range 0..11
Month '15160655' out of range 0..11
Month '21972303' out of range 0..11
Month '10208591' out of range 0..11

0133254f - 20129103
00b4754f - 11826511
0132654f - 20079951
00e7554f - 15160655
014f454f - 21972303
009bc54f - 10208591

^^^ looks like there is a pattern.

Uncommenting the assignment of $2 to $month removes the effect.
Reversing the parse order of the 2 XML files also removes the effect.

Anyone a suggestion (or 2) to pin point what goes wrong here (or is this
a know bug?). It looks like memory accidentally gets overwritten.


[1] using XML::parser
 
J

John Bokma

Ben Morrow said:
Quoth John Bokma said:
The following piece of code assigns an unexpected value to $2 in a
program I am working on. When I run tests, it works as expected, but
when I run the actual program (which parses currently 2 large XML files)
$2 gets assigned a random value when the second XML file is parsed [1]:

$iso_gmt =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
or croak "Not a valid ISO GMT date time: '$iso_gmt'";
# Note: without the assingment below $2 might be set to a different
# value for an unknown reason (bug in XML::parser C code?)
# perl, v5.10.1
#my $month = $2;
my $seconds_since_epoch;
my $error;
$seconds_since_epoch = timegm( $6, $5, $4, $3, $2 - 1, $1 - 1900 );

This code doesn't call XML::parser. Where are you calling it?

My apologies for the unclear write up. The above piece of code is called
by code that uses XML::parser to parse two XML files. Since the issue
seems to depend on the order in which the XML files are parsed I don't
think it's possible (or: easy) to create a minimal test case (otherwise
I certainly would've posted that one). Note: tests that I run via prove
to test the above code don't show the issue.
Also, I would usually want to assign the return value of m// rather than
relying on globals:

Yes, same here, usually ;-).
Where does this output come from?

timegm throws the "Month ... out of range ...". I used a one liner to
convert those values to hex (given above).

[..]
The output of Devel::peek::Dump($2) at various points would be helpful.
If you can it would also be worth trying with a perl built with
-DDEBUGGING, to see if you get an assertion failure.

Thanks, will look into those two options. Right now I am not able to
reproduce it. Most likely due to code changes.
 
K

Klaus

Also, I would usually want to assign the return value of m// rather than
relying on globals:
    my ($year, $month, $day, $hour, $min, $sec) =
        $iso_gmt =~ /.../;

I agree, but I would also add that it is perfectly safe to use $1, $2,
$3, ... in an assignment to local variables immediatly after $iso_gmt
=~ /.../ or croak "..."

For example, the following lines are perfectly safe:

$iso_gmt =~ /.../ or croak "...";
my ($year, $month, $day, $hour, $min, $sec)
= ($1, $2, $3, $4, $5, $6);

However, the following line is not safe:

I am very suspicious about the fact that $1, $2, $3, ... are passed by
reference to a subroutine timegm( $6, $5, $4, $3, $2 - 1, $1 - 1900 ).

Variables such as $1, $2, $3, ... have very complicated scoping rules.

Passing such variables by reference to a subroutine makes it very hard
to guarantee the integrity of $1, $2, $3, ...
 
J

John Bokma

Klaus said:
However, the following line is not safe:

Can you give me an example of calling a sub that shows this unsafe
behavior? And/or the scoping issues you mention?

I've always considered it perfectly safe to do something like:

$foo =~ /(.)/;
bar( $1 );

if not, I like to know cases this can go wrong. Thanks.
 
K

Klaus

Can you give me an example of calling a sub that shows this unsafe
behavior? And/or the scoping issues you mention?

I've always considered it perfectly safe to do something like:

$foo =~ /(.)/;
bar( $1 );

if not, I like to know cases this can go wrong. Thanks.

Here is an example where

Case 1 uses the safe bar($var)
Case 2 uses the unsafe bar($1)

==============================================
use strict;
use warnings;

my $foo = 'abc';

$foo =~ /^(..).$/ or die "Error 1";
my $var = $1;
print "Case 1: ", bar($var), "\n";

$foo =~ /^(..).$/ or die "Error 2";
print "Case 2: ", bar($1), "\n";

sub bar {
my $z = '';

$_[0] =~ /^(.).$/ or return "Error 3";
$z .= $1;

$_[0] =~ /^.(.)$/ or return "Error 4";
$z .= $1;

return "ok ('$z')";
}
==============================================

The output is:
==============================================
Case 1: ok ('ab')
Case 2: Error 4
==============================================
 
K

Klaus

And/or the scoping issues you mention?

see perldoc perlre:

++ These special variables, like the %+ hash and the numbered match
++ variables ($1 , $2 , $3 , etc.) are dynamically scoped until the
++ end of the enclosing block or until the next successful match,
++ whichever comes first.

The scoping issue is that the execution of timegm($6, ...) falls in
the same dynamically scoped block as the previously executed $iso_gmt
=~ /.../ or croak "...";

timegm($6, ...) can itself have regular expressions, in which case
that regular expression overwrites whatever is contained in $6.

The issue is that $6 is aliased to the first parameter of
timegm($6, ...).

These cases are not necessarily wrong, but they are very difficult to
get right.
 
R

Rainer Weikusat

[...]
==============================================
use strict;
use warnings;

my $foo = 'abc';

$foo =~ /^(..).$/ or die "Error 1";
my $var = $1;
print "Case 1: ", bar($var), "\n";

$foo =~ /^(..).$/ or die "Error 2";
print "Case 2: ", bar($1), "\n";

sub bar {
my $z = '';

$_[0] =~ /^(.).$/ or return "Error 3";
$z .= $1;

'use warnings' wouldn't complain about .= with a 'no value' variable
on the left hand side and even if this wasn't the case, the useless
concatenation could simply be replaced with a proper assignment in
order to avoid the "cargo cult 'initialization'" (my $z = '').
 
P

Peter J. Holzer

see perldoc perlre:

++ These special variables, like the %+ hash and the numbered match
++ variables ($1 , $2 , $3 , etc.) are dynamically scoped until the
++ end of the enclosing block or until the next successful match,
++ whichever comes first.

The scoping issue is that the execution of timegm($6, ...) falls in
the same dynamically scoped block as the previously executed $iso_gmt
=~ /.../ or croak "...";

timegm($6, ...) can itself have regular expressions, in which case
that regular expression overwrites whatever is contained in $6.

While that is theoretically possible, it doesn't seem to be the case
with timegm(). I just skimmed the source and don't see any use of
regexps there and frankly I can't think of any legitimate use timegm()
could have for regexps.

The issue is that $6 is aliased to the first parameter of
timegm($6, ...).

So I doubt that this is the issue. (Also I think John would have been
able to extract a minimal test case if it was that easy)

hp
 
E

Eric Pozharski

with said:
Can you give me an example of calling a sub that shows this unsafe
behavior? And/or the scoping issues you mention?

#!/usr/bin/perl

use strict;
use warnings;
use feature qw{ say };

sub holy_cosmos {
my $input = shift @_;
'holy_hallelujah' =~ m{holy_(.*)};
say $1 // 'missing'
}

'holy_rainbow' =~ m{(.*)_(.*)};
say $1 // 'not found';
holy_cosmos $1;
say $1 // 'not found';

__END__

{2109:6} [0:0]% perl ~/foo.90HmzN.pl
holy
hallelujah
holy
{2374:7} [0:0]%

I doubt there's any. Probably, that was the case in pre-historic Perls.
Or this is some XS that directly manipulates variables in other scope.

*CUT*
 
R

Randal L. Schwartz

Eric> sub holy_cosmos {
Eric> my $input = shift @_;
Eric> 'holy_hallelujah' =~ m{holy_(.*)};
Eric> say $1 // 'missing'
Eric> }

This is bad. Very bad. If the match fails, it DOES NOT RESET the match
variables. So you have broken code from the getgo.
 
E

Eric Pozharski

Try this:

sub foo {
my ($x) = @_;
"y" =~ /(.)/;
say "$x:$_[0]";
}

"x" =~ /(.)/;
foo $1;

Note that this can only happen if you access @_ directly (or by
reference) after doing a pattern match, which is not common.

Slowly: they said
(<[email protected]>)
that this:

#!/usr/bin/perl

use strict;
use warnings;
use feature qw{ say };

sub foo {
my ($x) = @_;
"y" =~ /(.)/;
say "$x:$_[0]";
}

"x" =~ /(.)/;
foo $1;
say $1;
__END__

should be this:

x:y
y

or this:

y:y
y

instead it is this:

{1894:4} [0:0]% perl ~/foo.H2vhcu.pl
x:y
x

p.s. I've already seen what Devel::peek::Dump said me, has nothing to
do with subs, @_, or aliasing.
 
E

Eric Pozharski

with said:
Eric> sub holy_cosmos {
Eric> my $input = shift @_;
Eric> 'holy_hallelujah' =~ m{holy_(.*)};
Eric> say $1 // 'missing'
Eric> }

This is bad. Very bad. If the match fails, it DOES NOT RESET the match
variables. So you have broken code from the getgo.

How do you reset match variables?

p.s. I already feel much better and would like to pump it even further.
 
K

Klaus

How do you reset match variables?

Usually you don't reset match variables at all.

The point is to test the match, if it fails, then you don't look at
match variables.

For example:

if ('holy_hallelujah' =~ m{holy_(.*)}) {
say $1;
}
else {
# do not look at match variables...
say 'missing';
}
 
R

Rainer Weikusat

Eric Pozharski said:
How do you reset match variables?

The next successful match will 'reset' them. In this case, this means
if the 'holy_hallelujah' match could fail (which it can't but a
'non-constant' one could), the line below would test the value the
last successful match put into $1.

As Klaus already wrote: The easiest way to stay clear of $1
.... pitfalls is 'never use them except in code which is conditionally
executed when "the match" succeeded'.
 
E

Eric Pozharski

with said:
Who they? (Google Groups appears to be so broken it can't look up a
msgid any more...)

Giganews filters googlegroups? Amazing. Count me in.

*SKIP*
Then you're using it wrong, because this has everything to do with @_
and aliasing.

#!/usr/bin/perl

use strict;
use warnings;
use feature qw{ say };

'holy_rainbow' =~ m{(.*)_(.*)};
say $1 // 'not found';

{
'holy_hallelujah' =~ m{holy_(.*)};
say $1 // 'missing';
}

say $1 // 'not found';

__END__

{11145:5} [0:0]% perl ~/foo.4PonDs.pl
holy
hallelujah
holy

See? And no @_ involved.

*CUT*
 
R

Rainer Weikusat

[...]
Then you're using it wrong, because this has everything to do with @_
and aliasing.

#!/usr/bin/perl

use strict;
use warnings;
use feature qw{ say };

'holy_rainbow' =~ m{(.*)_(.*)};
say $1 // 'not found';

{
'holy_hallelujah' =~ m{holy_(.*)};
say $1 // 'missing';
}

say $1 // 'not found';

__END__

{11145:5} [0:0]% perl ~/foo.4PonDs.pl
holy
hallelujah
holy

See? And no @_ involved.

What is this now supposed to demonstrate? As documented, the $1
.... are always implicitly local to the enclosing block. This subthread
was about passing them as arguments to a subroutine call which will
cause them to be aliased to @_ slots and a successful pattern match
executed by the subroutine will then change the arguments.

Slightly less contrived example
-----------------------------
sub move_suffix
{
$_[0] =~ /_(.*)$/ and return "$_[1]_$1";
return $_[1];
}

print(move_suffix('aaa_bbb', 'ccc'), "\n");

'hhh_jjj' =~ /(.*)_/ and print(move_suffix('aaa_bbb', $1), "\n");
------------------------------

This will print

ccc_bbb
bbb_bbb

instead of

ccc_bbb
hhh_bbb

Since assigning a scalar to another scalar implies copying the value, I
usually don't use scalar assignments in simple (frequently executed)
subroutines which don't need 'modifiable state variables' initialized
to the passed argument values.
 
R

Rainer Weikusat

Eric Pozharski said:
Try this:

sub foo {
my ($x) = @_;
"y" =~ /(.)/;
say "$x:$_[0]";
}

"x" =~ /(.)/;
foo $1;

Note that this can only happen if you access @_ directly (or by
reference) after doing a pattern match, which is not common.

Slowly: they said
(<[email protected]>)
that this:

#!/usr/bin/perl

use strict;
use warnings;
use feature qw{ say };

sub foo {
my ($x) = @_;
"y" =~ /(.)/;
say "$x:$_[0]";
}

"x" =~ /(.)/;
foo $1;
say $1;
__END__

should be this:

x:y
y

or this:

y:y
y

The statement was (paraphrased) that passing $1 ... as arguments to a
subroutine call wouldn't be safe. This actually demonstrated by the
example you posted because the value of $_[0] does change because of
the pattern match: If something other than $1 had been passed, the
output would be two times the value of that, not the value of the
passed arguments once followed by y.
 
I

Ilya Zakharevich

Can you give me an example of calling a sub that shows this unsafe
behavior? And/or the scoping issues you mention?

I've always considered it perfectly safe to do something like:

$foo =~ /(.)/;
bar( $1 );

if not, I like to know cases this can go wrong. Thanks.

While the answer to your question was already demonstrated in this
thread (a year ago!), I still want to add my 2¢: I ***always*** write
such calls as
bar "$1";

I think that the slowdown is negligible, while the benefits are
enormous (you may guess that I spent many hours tracing this kind of bugs).

Yours,
Ilya
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,733
Messages
2,569,439
Members
44,829
Latest member
PIXThurman

Latest Threads

Top