Lexical variables in (?{...}) regexp constructs

Ala · Jun 15, 2009

Hi all,

In perlre, the documentation for (?{...}) says:

"Due to an unfortunate implementation issue, the Perl code contained
in these blocks is treated as a compile time closure that can have
seemingly bizarre consequences when used with lexically scoped
variables inside of subroutines or loops. There are various
workarounds for this, including simply using global variables instead.
If you are using this construct and strange results occur then check
for the use of lexically scoped variables."

I'm indeed seeing weird things, mainly variables getting undefined for
no reason.
Any ideas on what the "various workarounds" that TFM is speaking of
are? I'd rather stay as far away from global vars as I can.

Thanks,
--Ala

sln · Jun 16, 2009

Probably the key phrase is "the Perl code contained in these blocks".
Embedding code inside a regular expression really has limited use
unless it is combined with a conditional or used to immediately store
the value of the last capture group.

It gets better: you can't nest another regular expression in the block
(since the engine is not reentrant).

Results seem better when you call a named subroutine from the
code block. There, lexicals seem to work and m// seems to work, but not s///
(the latter causes a crash on my machine, so be careful about calling a
Perl function that uses the regex engine).

It seems the lack of explanation and the numerous caveats are meant as a warning
to stay clear.

Below are a few examples.
The first one tries a lexical within the code block (not too good).
The second calls a subroutine (that uses lexicals) from within the code block.
The third is an example of somebody's IP parser I cleaned up that
shows an extended conditional and code embedding (using 5.10).

Anyway, it's hit or miss with extended/experimental stuff.

-sln

## ex. 1
================
use strict;
use warnings;

my $string = "yes no yes no";

while ( $string =~ /yes(?{my $test = printmsg(); print "test = '$test'\n";})/g) {}

sub printmsg
{
    print "found yes\n";
    '';
}
__END__
Output:
found yes
test = ''
Use of uninitialized value $string in pattern match (m//) at dd.pl line 6.
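
The perlre passage quoted at the top of the thread names package (global) variables as one workaround. A minimal sketch of that idea follows; the names count_yes and $count are made up for illustration. The code block touches only a package variable declared with "our", and "local" gives each call its own value, so nothing lexical has to be captured by the block.

```perl
use strict;
use warnings;

# Package variable: it lives in the symbol table rather than in a
# lexical pad, so the code block has no lexical to close over.
our $count;

# Hypothetical helper: counts occurrences of "yes" using a (?{...}) block.
sub count_yes {
    my ($string) = @_;
    local $count = 0;    # fresh, dynamically scoped value for each call
    1 while $string =~ /yes(?{ $count++ })/g;
    return $count;
}

print count_yes("yes no yes no"), "\n";    # prints 2
print count_yes("yes yes yes"), "\n";      # prints 3
```

The trade-off is the usual one for globals: the variable is visible package-wide, so "local" is what keeps separate calls from stepping on each other.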

## ex. 2
================
use strict;
use warnings;

my $string = "yes no yes no";
my $test = "this is test";

while ( $string =~ /yes(?{$test = printmsg($test);})/g) {}
print "test = '$test'\n";

sub printmsg
{
    my $param = shift;
    my $count = 2;

    while ($count--) {
        print "($count)found yes, was passed '$param'\n";
    }
    return '';
    ## Cannot do regex if being called from embedded code
}
__END__
Output:
(1)found yes, was passed 'this is test'
(0)found yes, was passed 'this is test'
(1)found yes, was passed ''
(0)found yes, was passed ''
test = ''

## ex. 3
================
## IpMatch_5_10.pl
## (To test new Perl 5.10 conditionals)
##

require 5.10.0; # 5.10 only, new extended regex
use strict;
use warnings;

my $Octlimit = 255;

my $OctetPat = qr/
    \b (\d{1,3}) \b    # capture a 1 to 3 digit number on word boundaries

    (?(?{              # start conditional code block

        # print "$^N\n";   # uncomment to print what matched last
        $^N > $Octlimit    # condition: is number > octet limit?

    })                 # end code block

        (*FAIL)        # yes, condition is true: force pattern to fail for this number
    )
/x;

my $dottedQuadPat = qr/ # Capture quad parts to named variables in the %+ hash
    \s*
    (?<O1>$OctetPat)
    \.
    (?<O2>$OctetPat)
    \.
    (?<O3>$OctetPat)
    \.
    (?<O4>$OctetPat)
    \s*
/x;

my $DressedIPv4Pat = qr/ # Capture dressed quad parts to named variables in the %+ hash
    \s* \[
    $dottedQuadPat
    \] \s*
/x;

while (my $ip = <DATA>)
{
    chomp $ip;
    next if !length($ip);
    print "IP:\n'$ip'\n";

    ## Match all valid ip octets
    my @match = $ip =~ /$OctetPat/g;
    if (@match)
    {
        print " ++ matched single octets\n";
        for my $val (@match) {
            print " $val\n";
        }
    } else {
        print " -- no single octet match\n";
    }

    ## Match dotted quad ip
    if ($ip =~ /^$dottedQuadPat$/)
    {
        print " ++ matched quad #.#.#.#\n";
        foreach my $key (sort keys %+) {
            print " $key = $+{$key}\n";
        }
    } else {
        print " -- no strict quad match\n";
    }

    ## Match dressed dotted quad ip
    if ($ip =~ /^$DressedIPv4Pat$/)
    {
        print " ++ matched dressed quad [#.#.#.#]\n";
        foreach my $key (sort keys %+) {
            print " $key = $+{$key}\n";
        }
    } else {
        print " -- no strict dressed quad match\n";
    }
}
__DATA__

1.12.123.254.255.256.4872
1.12.123.254
[123.254.255.255]
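
For comparison, the octet range check in ex. 3 can also be done without the experimental (?{...}) conditional, by spelling out the 0..255 range as a plain alternation. This is only a sketch under a couple of assumptions: the names $octet and $quad are made up, and unlike \d{1,3} this version rejects leading zeros such as "007".

```perl
use strict;
use warnings;

# One octet, 0..255, no code block needed:
#   250-255 | 200-249 | 100-199 | 0-99 (no leading zero)
my $octet = qr/(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])/;

# Strict dotted quad, optional surrounding whitespace.
my $quad = qr/\A\s*($octet)\.($octet)\.($octet)\.($octet)\s*\z/;

for my $ip ("1.12.123.254", "1.12.123.256", "123.254.255.255") {
    if (my @parts = $ip =~ $quad) {
        print "$ip: @parts\n";
    } else {
        print "$ip: no match\n";
    }
}
```

The alternation is more verbose than comparing $^N against a limit, but it stays within ordinary regex features and has no interaction with lexical variables at all.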

Ala · Jun 16, 2009

Thanks for the reply. I realize my post wasn't very informative, and
that I'm probably playing with fire.

Storing the last captured match is exactly what I'm using this
construct for. The basic idea is that I created a module to help parse
some non-trivial file format. The constructor of this module allows
the user to specify which parts of each record to capture and into
which variable to put it. It returns a compiled regexp that the user
can use when parsing. Something like this contrived example:

use R;
my ($name, $x, $y);
my $rgx = R->new(-capture => {
    name => \$name,
    locx => \$x,
    locy => \$y,
});

#... later on .. use $rgx ..
while (<$fh>) {
    if (/$rgx/) {
        print "$name is at ($x, $y).\n";
    }
}

I used this module as part of a larger module, let's call it M, that
parses the whole file and defines some other methods to manipulate the
data.
For the most part, this works perfectly well. Weird things start to
happen, seemingly in a random fashion, when I instantiate module M
multiple times in a loop.

I guess I shouldn't be relying on experimental features, but since the
docs mentioned a work-around, I was curious.
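
One workaround that avoids (?{...}) altogether, assuming Perl 5.10 or later, is to use named captures and copy the values out of %+ after a successful match. The sketch below is not the real R module: make_matcher and the "name x,y" record layout are invented for illustration, and it returns a matcher closure rather than a compiled regexp, which is a change to the interface described above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for R->new: takes capture-name => scalar-ref pairs
# and returns a closure that matches one line and fills the referenced
# variables from the named captures.
sub make_matcher {
    my (%capture) = @_;
    my $rgx = qr/(?<name>[A-Za-z]+)\s+(?<locx>\d+),(?<locy>\d+)/;
    return sub {
        my ($line) = @_;
        return 0 unless $line =~ $rgx;
        for my $key (keys %capture) {
            # Copy only the captures the caller asked for.
            ${ $capture{$key} } = $+{$key} if exists $+{$key};
        }
        return 1;
    };
}

my ($name, $x, $y);
my $match = make_matcher(name => \$name, locx => \$x, locy => \$y);

for my $line ("alpha 10,20", "no match here", "beta 3,4") {
    print "$name is at ($x, $y).\n" if $match->($line);
}
```

Because the closure is an ordinary Perl sub, each call to make_matcher gets its own set of references, so creating many matchers in a loop should not make them share state.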

Thanks,
--Ala

sln · Jun 16, 2009

As far as I know, lexicals scoped within the block that creates the
regex should be OK, whether they are references or not (I haven't tried it, but I assume it's OK).
Pseudo-example:
SCOPE:
{
    my ($name, $x, $y);
    my ($ref_name, $ref_x, $ref_y) = (\$name, \$x, \$y);

    my $rgx = qr/([a-z,A-Z]+)(?{$$name = $^N})(\d\d)(?{$$x = $^N}),(\d\d)(?{$$y = $^N})/;
    while (<$fh>) {
        if (/$rgx/) {
            print "$name is at ($x, $y).\n";
        }
    }
};

If what you say is true, this is the case:

package M;
use R;
my ($name, $x, $y);
my $rgx = R->new();
.... more code
if (/$rgx/) { }
1;

As an aside, R->new() is being used as just a class function call.
It does not appear to bless() anything, since it is returning a string scalar.

Then somewhere else, you create multiple instances of M
(or just call M methods) in a loop.

The my ($name, $x, $y) variables appear to be file scoped.
In the context I wrote, there is only one instance of ($name, $x, $y) no
matter how many instances of M you create (class scoped?).

If you had a method in M that creates many $rgx's, it would have to store
those qr// objects in an object-based (M) blessed referent (hash or array) for them,
and thereby the variables they reference ($name, $x, $y), to persist.

Usually this involves fleshing out either R or M with references to unique
lexicals.

So that you end up with something like this (usually):

while (<$fh>) {
    if (/$obj->{rgx}/) {
        print "$obj->{name} is at ($obj->{x}, $obj->{y}).\n";
    }
}

More than likely you would need accessors.
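
A rough sketch of what per-object state with accessors could look like while still using (?{...}). Everything here is hypothetical (the class name Rec, the %field scratch hash, the "name x,y" record layout). The code blocks write only to a package variable, the regexp is compiled once, and the parse method copies the values into the object only after a successful match, so a failed or backtracked match leaves the object untouched.

```perl
use strict;
use warnings;

package Rec;    # hypothetical class name

our %field;     # scratch space: the only thing the code blocks touch

# Compiled once at file level; the code blocks reference no lexicals.
my $RGX = qr/
    ([A-Za-z]+) (?{ $field{name} = $^N })
    \s+
    (\d+)       (?{ $field{x}    = $^N })
    ,
    (\d+)       (?{ $field{y}    = $^N })
/x;

sub new { my ($class) = @_; return bless {}, $class }

sub parse {
    my ($self, $line) = @_;
    local %field;                      # fresh scratch space for this match
    return 0 unless $line =~ $RGX;
    @{$self}{qw(name x y)} = @field{qw(name x y)};   # commit on success only
    return 1;
}

# Accessors, so callers never reach into the hash directly.
sub name { $_[0]{name} }
sub x    { $_[0]{x} }
sub y    { $_[0]{y} }

package main;

my @recs = map { Rec->new } 1 .. 2;
$recs[0]->parse("alpha 10,20");
$recs[1]->parse("beta 3,4");
printf "%s is at (%s, %s).\n", $_->name, $_->x, $_->y for @recs;
```

Each instance keeps its own name/x/y, which is the property that gets lost when a single set of file-scoped lexicals is shared by every instance.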

Hope this helps.
-sln

sln · Jun 16, 2009

[snip]
More than likely you would need accessors.

Here's an example of making a custom regex class. This is far removed
from what you want though.

In your case, you would create the regex in the constructor,
add unique references (to lexicals) in a qr// statement, then
assign it to the object hash, i.e. $self->{rgx} = qr//, then return
the instance. The references must be unique if that's what your goal is.

-sln

-----------------------------------

###
package RxP;
our @ISA = qw();

sub new
{
    my $self;
    my $class = shift;
    if (defined($_[0]) && ref($_[0]) eq 'RxP') {
        %{$self} = %{$_[0]};
        return bless ($self, $class);
    }
    $self = {
        'regex'    => '',
        'code'     => sub{''},
        'type'     => 's',
        'dflt_sub' => \&search
    };
    while (my ($name, $val) = splice (@_, 0, 2)) {
        next if (!defined $val);
        if ('regex' eq lc $name) {
            $self->{'regex'} = $val;
        }
        elsif ('code' eq lc $name && ref($val) eq 'CODE') {
            $self->{'code'} = $val;
        }
        elsif ('type' eq lc $name && $val =~ /(sg|gs|rg|gr|s|r)/i) {
            set_type ($self, lc $1);
        }
    }
    return bless ($self, $class);
}
sub get_type
{
    return $_[0]->{'type'};
}
sub set_type
{
    return 0 unless (defined $_[1]);
    if ($_[1] =~ /(sg|gs|rg|gr|s|r)/i) {
        $_[0]->{'dflt_sub'} = {
            's'  => \&search,
            'sg' => \&search_g,
            'gs' => \&search_g,
            'r'  => \&replace,
            'rg' => \&replace_g,
            'gr' => \&replace_g
        }->{lc $1};
        $_[0]->{'type'} = lc $1;
        return 1;
    }
    return 0;
}
sub clone
{
    # clone self, return new
    return RxP->new( $_[0]);
}
sub copy
{
    # copy other to self, return self
    return $_[0] unless (defined $_[1] && ref($_[1]) eq 'RxP');
    %{$_[0]} = %{$_[1]}; # no need for deep recursion
    return $_[0];
}
sub apply
{
    return 0 unless (defined $_[1]);
    return &{$_[0]->{'dflt_sub'}};
}
sub search
{
    return 0 unless (defined $_[1]);
    return $_[1] =~ /$_[0]->{'regex'}/;
}
sub search_g
{
    return 0 unless (defined $_[1]);
    return $_[1] =~ /$_[0]->{'regex'}/g;
}
sub replace
{
    return 0 unless (defined $_[1]);
    return $_[1] =~ s/$_[0]->{'regex'}/&{$_[0]->{'code'}}/e;
}
sub replace_g
{
    return 0 unless (defined $_[1]);
    return $_[1] =~ s/$_[0]->{'regex'}/&{$_[0]->{'code'}}/ge;
}

sln · Jun 17, 2009

A correction to the pseudo-example in my earlier reply, where I wrote:

my ($ref_name, $ref_x, $ref_y) = (\$name, \$x, \$y);

my $rgx = qr/([a-z,A-Z]+)(?{$$name = $^N})(\d\d)(?{$$x = $^N}),(\d\d)(?{$$y = $^N})/;

The code blocks should dereference the references, i.e. use
$$ref_name, $$ref_x and $$ref_y, not $$name, $$x and $$y.

Sorry, I just mindlessly do this, depending too much on the compiler/interpreter to catch stuff.
But that's what crafts are... fix/debug/refactor, rinse, repeat.

-sln
