Strange behavior by regex with variable

DJ Stunks

Hello all,

In order to not have to perpetually be escaping a vertical bar field
separator in my regular expressions I tried to use a variable instead
(as in _PBP_).

However, I'm getting some strange behavior. Specifically, perl
believes the match has succeeded, but the capture is not performed.

Here's a short but complete script which demonstrates the issue.
Please let me know if you have any ideas.

-jp

PS: witnessed on perl v5.8.7 Binary build 813 [148120] for
MSWin32-x86-multi-thread

C:\tmp>cat tmp2.pl
#!/usr/bin/perl

use strict;
use warnings;

my $BAR = q{|};
my $string = q{|this.string_contains-some-2342:stuff|};

if (my ($stuff) = $string =~ m{^\| ([^|]+) \|$}x) {
    print "'$stuff' matched\n";
}

if (my ($stuff) = $string =~ m{^$BAR ([^|]+) $BAR$}xo) {
    print "'$stuff' matched using half \$BARs\n";
}

if (my ($stuff) = $string =~ m{^$BAR ([^$BAR]+) $BAR$}xo) {
    print "'$stuff' matched using all \$BARs\n";
}

__END__

C:\tmp>tmp2.pl
'this.string_contains-some-2342:stuff' matched
Use of uninitialized value in concatenation (.) or string at
C:\tmp\tmp2.pl line 14.
'' matched using half $BARs
Use of uninitialized value in concatenation (.) or string at
C:\tmp\tmp2.pl line 18.
'' matched using all $BARs
 
Paul Lalli

DJ said:
In order to not have to perpetually be escaping a vertical bar field
separator in my regular expressions I tried to use a variable instead
(as in _PBP_).

I can only assume you misread the relevant part of PBP. Simply putting
a regular expression character in a variable does not prevent it from
being interpreted as a regular expression character.

if (my ($stuff) = $string =~ m{^\| ([^|]+) \|$}x) {

Here, you escaped the | characters
print "'$stuff' matched\n";
}

if (my ($stuff) = $string =~ m{^$BAR ([^|]+) $BAR$}xo) {

Here, the regexp is sent through double-quotish interpolation first,
meaning $BAR gets replaced with |. Then the result is sent through
the regexp parser, which interprets | as the regexp alternation
character.
print "'$stuff' matched using half \$BARs\n";
}

if (my ($stuff) = $string =~ m{^$BAR ([^$BAR]+) $BAR$}xo) {

Same thing here... just more of them. :)
print "'$stuff' matched using all \$BARs\n";
}

__END__

If your variables are going to contain regular expression characters,
you need to quote-meta them:

if (/\Q$BAR\E ... /) { ... }

I don't have my copy of PBP on me, so I can't help to explain what part
of it led you to this faulty belief...
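
To make the alternation visible, here is a minimal sketch that reuses $BAR and $string from the script above; @unquoted and @quoted are illustrative names only, and the /o flag is left out because it is not needed to show the effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $BAR    = q{|};
my $string = q{|this.string_contains-some-2342:stuff|};

# After interpolation the pattern is /^| ([^|]+) |$/x, i.e. three
# alternatives: "^", "([^|]+)" and "$". The zero-width "^" branch
# succeeds first, so group 1 never takes part in the match.
my @unquoted = $string =~ m{^$BAR ([^|]+) $BAR$}x;
print scalar(@unquoted), " value(s) returned, capture is ",
    defined $unquoted[0] ? "'$unquoted[0]'" : 'undef', "\n";

# \Q...\E (quotemeta) turns the interpolated bar into a literal \|.
my @quoted = $string =~ m{^\Q$BAR\E ([^|]+) \Q$BAR\E$}x;
print "capture is '$quoted[0]'\n";
```

The if test in the original script is true even though nothing was captured because a list assignment in scalar context yields the number of values on its right-hand side, which is 1 here (a single undef).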

Paul Lalli
 
DJ Stunks

Paul said:
I can only assume you misread the relevant part of PBP. Simply putting
a regular expression character in a variable does not prevent it from
being interpreted as a regular expression character.

of course.

Thanks. I must have misunderstood Pastor Conway :p

-jp
 
it_says_BALLS_on_your forehead

Paul said:
If your variables are going to contain regular expression characters,
you need to quote-meta them:

if (/\Q$BAR\E ... /) { ... }

I don't have my copy of PBP on me, so I can't help to explain what part
of it led you to this faulty belief...


yeah, earlier this week, someone was getting double parens because they
didn't realize that after variable interpolation, the regex engine
still treats parens as capturing. i tried the following code:

use strict;
use warnings;

my $BAR = qr{\|};
my $string = q{|this.string_contains-some-2342:stuff|};


my ($stuff) = $string =~ m/^\| ([^|]+) \|$/x;
print "'$stuff' matched\n" if $stuff;


my ($stuff2) = $string =~ m/^$BAR ([^|]+) $BAR$/xo;
print "'$stuff2' matched using half \$BARs\n" if $stuff2;


my ($stuff3) = $string =~ m/^$BAR ([^$BAR]+) $BAR$/xo;
print "'$stuff3' matched using all \$BARs\n" if $stuff3;

__END__

and got:
[nagano 23] ~/1-perl > try_varpipe.pl
'this.string_contains-some-2342:stuff' matched
'this.string_contains-some-2342:stuff' matched using half $BARs
[nagano 24] ~/1-perl >

it looks like character classes behave in such a way as to prevent the
last regex from matching.
 
DJ Stunks

it_says_BALLS_on_your forehead said:
it looks like character classes behave in such a way as to prevent the
last regex from matching.

I modified my test script as suggested by Paul and now it works fine,
including the negated character class.

I think it's the qr// that caused your third test to fail, BALLS.

C:\tmp>cat tmp2.pl
#!/usr/bin/perl

use strict;
use warnings;

my $BAR = quotemeta q{|};
my $string = q{|this.string_contains-some-2342:stuff|};

if (my ($stuff) = $string =~ m{^\| ([^|]+) \|$}x) {
    print "'$stuff' matched\n";
}

if (my ($stuff) = $string =~ m{^$BAR ([^|]+) $BAR$}xo) {
    print "'$stuff' matched using half \$BARs\n";
}

if (my ($stuff) = $string =~ m{^$BAR ([^$BAR]+) $BAR$}xo) {
    print "'$stuff' matched using all \$BARs\n";
}

__END__

C:\tmp>tmp2.pl
'this.string_contains-some-2342:stuff' matched
'this.string_contains-some-2342:stuff' matched using half $BARs
'this.string_contains-some-2342:stuff' matched using all $BARs
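
A likely reason the qr// version fails only inside the negated character class: a qr// object does not interpolate as a bare \| but with a wrapper around it, and inside [^...] every character of that wrapper becomes a member of the class. The sketch below is only an illustration under that assumption; $bar_qr, $bar_qm, $with_qr and $with_qm are made-up names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = q{|this.string_contains-some-2342:stuff|};
my $bar_qr = qr{\|};           # compiled regex object
my $bar_qm = quotemeta q{|};   # plain string containing \|

# A qr// object stringifies with a wrapper such as (?-xism:\|) or
# (?^:\|), depending on the perl version.
print "qr// interpolates as: $bar_qr\n";

# Inside [^...] the wrapper characters are excluded too, including ':'
# (and on older perls 'x', 'i', 's', 'm'), all of which occur in $string.
my ($with_qr) = $string =~ m{^$bar_qr ([^$bar_qr]+) $bar_qr$}x;
my ($with_qm) = $string =~ m{^$bar_qm ([^$bar_qm]+) $bar_qm$}x;

print "qr// in class:      ", defined $with_qr ? "'$with_qr'" : 'no match', "\n";
print "quotemeta in class: ", defined $with_qm ? "'$with_qm'" : 'no match', "\n";
```

Outside a character class the wrapper is harmless, which would explain why the "half $BARs" test still matched with qr//.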

-jp
 
it_says_BALLS_on_your forehead

DJ said:
I modified my test script as suggested by Paul and now it works fine,
including the negated character class.

I think it's the qr// that caused your third test to fail, BALLS.

ahh, i think you're right. i replaced that line with:
my $BAR = q{\|};

and all three worked :).
 
Dr.Ruud

it_says_BALLS_on_your forehead wrote:

ahh, i think you're right. i replaced that line with:
my $BAR = q{\|};

and all three worked :).

Alternative:

qr/[|]/

Test:

echo '-|-' \
| perl -ne 'chomp; $r=qr/[|]/; \
print "bar\n" if /\A(.$r.)\z/ and $1 eq $_'

I assume that the regex-engine optimizes away single-character-sets.
 
robic0

DJ Stunks said:
In order to not have to perpetually be escaping a vertical bar field
separator in my regular expressions I tried to use a variable instead
(as in _PBP_).

Jeez, you may have to go back to /|/. But you higher order folks shouldn't
be pissed off in your qualified environments and narrow escopements then,
should you?
 
John W. Krahn

Dr.Ruud said:
I assume that the regex-engine optimizes away single-character-sets.

Well, let's see what Benchmark says:

$ perl -MBenchmark=cmpthese -e'
cmpthese -10, {
lit => q{ q/abcdefghijklmn/ =~ /klmn/ },
cc => q{ q/abcdefghijklmn/ =~ /[k][l][m][n]/ } }
'
         Rate   cc  lit
cc  2192887/s   -- -41%
lit 3741724/s  71%   --




John
 
Dr.Ruud

John W. Krahn wrote:
Dr.Ruud:
I assume that the regex-engine optimizes away single-character-sets.

Well, let's see what Benchmark says:

That doesn't show whether the regex-engine optimizes away
single-character-sets or not.

I tried to get closer to the metal and rewrote your test to this:

#!/usr/bin/perl
use strict;
use warnings;

use Benchmark qw/cmpthese/;

sub say {
    local $\ = "\n";
    print '';
    if (@_) {
        print "<$_>" for @_;
        print '';
    }
}

my $s = 'abcdefghijklmn_jklmn_jklmnopquvwxyz';
my $qr_lit = qr/klmn/;
my $qr_cs = qr/[k][l][m][n]/;
my $do_lit = sub { scalar( () = $_[0] =~ /$qr_lit/g ) };
my $do_cs = sub { scalar( () = $_[0] =~ /$qr_cs/g ) };

say $s, $qr_lit, $qr_cs, &$do_lit($s), &$do_cs($s);

if (1) {
    cmpthese -5, {
        lit => q{ $s =~ $qr_lit; },
        cs  => q{ $s =~ $qr_cs ; },
    };
}

say $s, $qr_lit, $qr_cs, &$do_lit($s), &$do_cs($s);

__END__

and then they both win sometimes.
:)

The iterations-per-second are higher with //o:
lit => q{ $s =~ /$qr_lit/o; }


(perl v5.8.6, i386-freebsd-64int)
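
One possible explanation for "they both win sometimes": as far as I know, Benchmark evaluates string snippets in its own lexical scope, so file-level my variables such as $s, $qr_lit and $qr_cs would not be visible there and both snippets would time an empty match. A hedged sketch of the same comparison with code references, which close over the lexicals, follows; %tests is an illustrative name and the one-second run time is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw/cmpthese/;

my $s      = 'abcdefghijklmn_jklmn_jklmnopquvwxyz';
my $qr_lit = qr/klmn/;
my $qr_cs  = qr/[k][l][m][n]/;

# Code references capture the lexicals declared above, so each sub
# really matches $s against its own pattern.
my %tests = (
    lit => sub { $s =~ $qr_lit },
    cs  => sub { $s =~ $qr_cs },
);

# Sanity check: both subs must actually match before being timed.
for my $name (sort keys %tests) {
    die "$name does not match\n" unless $tests{$name}->();
}

# Run each sub for at least one CPU second and print the comparison.
cmpthese( -1, \%tests );
```

The call overhead of the subs is included in both timings, so only the relative difference between the two rows is meaningful.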
 
