Strange behavior of 'Alternative capture group numbering'


Raymundo

Hello,

First of all, I apologize for my poor English.

I'm reading "perlretut" (Perl Regular Expression Tutorial) of version
5.14 now:
http://perldoc.perl.org/perlretut.html

While I was reading "Alternative capture group numbering" section,
I wrote a simple test program to practice it myself.

I'm using Strawberry Perl 5.12.3 on Windows XP.

Here is my code:
-----
#!perl
use strict;
use warnings;

# Stop at end of input instead of looping forever on an undefined value.
while ( defined( my $input = <STDIN> ) ) {
    chomp $input;
    if ( $input =~ /(?|(a)(b)|(c))(d)/ ) {
        print "1[$1] 2[$2] 3[$3]\n";
    }
}
-----

Here is the result:
-----
abd
1[a] 2[b] 3[d]
cd
Use of uninitialized value $2 in concatenation (.) or string at d:\Temp
\test.pl line 13, <STDIN> line 2.
1[c] 2[] 3[d]
----

Okay. This is what I expected and what the document said. 'd' is
assigned to $3 because the maximum number in the alternative numbering
group is 2.

Then I modified the pattern, changing only the order of the two
alternatives inside the alternative numbering group:
-----
if ( $input =~ /(?|(c)|(a)(b))(d)/ ) {
-----
This is the result:
-----
abd
Use of uninitialized value $3 in concatenation (.) or string at d:\Temp
\test.pl line 13, <STDIN> line 1.
1[a] 2[d] 3[]
cd
Use of uninitialized value $3 in concatenation (.) or string at d:\Temp
\test.pl line 13, <STDIN> line 2.
1[c] 2[d] 3[]
----

I have no idea why the result differs from the first one.
Why is 'd' in $2 and not in $3? Where did the 'b' of 'abd' go after matching?

Is this a bug? Or is there something that I misunderstand?

Any help would be appreciated.
Thank you.
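
For anyone who wants to reproduce this without typing at a prompt, here is a minimal sketch that runs both patterns against both inputs. The helper name captures() and the labels are made up for illustration; the patterns are the two from the post. What it prints depends on the perl version: on a perl without the problem, both patterns should report 'd' in group 3.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: return the captures of $re against $input as a list,
# or an empty list when the pattern does not match.
sub captures {
    my ( $re, $input ) = @_;
    my @caps = $input =~ $re;
    return @caps;
}

my @patterns = (
    [ 'long branch first', qr/(?|(a)(b)|(c))(d)/ ],
    [ 'long branch last',  qr/(?|(c)|(a)(b))(d)/ ],
);

for my $pattern (@patterns) {
    my ( $label, $re ) = @$pattern;
    for my $input (qw(abd cd)) {
        my @caps = captures( $re, $input ) or next;
        my @shown;
        for my $i ( 0 .. $#caps ) {
            # A group that did not take part in the match is undefined.
            my $value = defined $caps[$i] ? $caps[$i] : 'undef';
            push @shown, sprintf '%d[%s]', $i + 1, $value;
        }
        print "$label, $input: @shown\n";
    }
}
```

Printing 'undef' explicitly avoids the "uninitialized value" warnings and makes it obvious which group is missing.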
 

sln


It's hard to call this a single bug. In my experience the whole
branch-reset feature is buggy and tends to crash at the drop of
a hat.

Using the regex debug mechanism, some observations can be made.
The last branch-reset alternation is labeled BRANCH (FAIL).
Apparently, the number of capture buffers in this branch is
NOT counted when calculating the largest number of buffers.
Therefore, the number given to the capture buffer after the
branch-reset is based on the largest count among the branches
BEFORE the last branch.

Example:

(?|
    (x) ()
  |
    (c)
  |
    (a) (b) (r)
)
(d)

Produces this code:

1: BRANCH (13)
2: OPEN1 (4)
4: EXACT <x> (6)
6: CLOSE1 (8)
8: OPEN2 (11)
10: NOTHING (11)
11: CLOSE2 (40)
13: BRANCH (20)
14: OPEN1 (16)
16: EXACT <c> (18)
18: CLOSE1 (40)
20: BRANCH (FAIL)
21: OPEN1 (23)
23: EXACT <a> (25)
25: CLOSE1 (27)
27: OPEN2 (29)
29: EXACT <b> (31)
31: CLOSE2 (33)
33: OPEN3 (35)
35: EXACT <r> (37)
37: CLOSE3 (40)
39: TAIL (40)
40: OPEN3 (42)
42: EXACT <d> (44)
44: CLOSE3 (46)
46: END (0)

You can see that (d) is capture buffer 3, but it should be 4.
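
A dump like the one above can be produced with the core re pragma. This is a minimal sketch, assuming the listing came from "use re 'debug'" or something equivalent; the post does not say which switch was used. The compiled program goes to STDERR, and its exact layout varies between perl versions.

```perl
#!perl
use strict;
use warnings;

# Assumption: plain "use re 'debug'" is enough to show the compiled program.
# The pragma is lexically scoped, so it covers the qr// below.
use re 'debug';

# Same pattern as in the example, written with /x so the whitespace is ignored.
my $re = qr/
    (?|
        (x) ()
      |
        (c)
      |
        (a) (b) (r)
    )
    (d)
/x;
```

In the output, look for the OPENn node that follows the last BRANCH: its number is the number perl assigned to (d).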

So the simple solution is that the largest number of capture buffers
should not be in the last branch.

There are a couple of ways around this.

1 - Pad a different branch with a NOTHING capture group.
(?|
(c) ()
| (a)(b)
)
(d)

or,

2 - Move the largest number of captures into another branch.
(?|
(a)(b)
| (c)
)
(d)
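
As a quick check of workaround 1, here is a minimal sketch that pads the shorter branch with an empty group and prints the captures. The inputs are the two from the original question; nothing else is assumed.

```perl
#!perl
use strict;
use warnings;

# Workaround 1: the shorter branch is padded with an empty group, so the
# last branch no longer holds more groups than the earlier ones.
my $re = qr/(?|(c)()|(a)(b))(d)/;

for my $input (qw(abd cd)) {
    if ( $input =~ $re ) {
        # $2 is defined in both cases: 'b' for abd, the empty string for cd.
        print "$input: 1[$1] 2[$2] 3[$3]\n";
    }
}
```

On a perl that numbers (d) as 3, this should print "abd: 1[a] 2[b] 3[d]" and "cd: 1[c] 2[] 3[d]". A side effect of the padding is that $2 is an empty string rather than undefined when the (c) branch matches, so the warning goes away too.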

This is just an observation that seems to hold true.
In my mind, branch-reset in Perl or any PCRE engine is just
one big bug, and should be avoided.

-sln
 

Raymundo

Quoth (e-mail address removed):


It looks to me like a bug in perl, and it appears to have been fixed in
5.14.

If you have any other instances of (?|) causing problems (that persist
in 5.14), and certainly if you have any examples of crashes, you should
report them with perlbug.

Ben
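
If code has to run on both old and new perls, a runtime probe is one way to find out whether the installed perl numbers the groups correctly. This is only a sketch: the function name branch_reset_miscounts() is made up, and it tests just the one pattern from this thread, not branch-reset in general.

```perl
#!perl
use strict;
use warnings;

# Hypothetical probe: returns 1 when (d) after the branch-reset group is
# not numbered 3, which is the symptom described in this thread.
sub branch_reset_miscounts {
    if ( 'abd' =~ /(?|(c)|(a)(b))(d)/ ) {
        # On an affected perl 'd' lands in $2 and $3 stays undefined.
        return defined $3 ? 0 : 1;
    }
    return 1;    # the pattern should always match 'abd'
}

printf "perl %vd: branch-reset numbering looks %s\n",
    $^V, branch_reset_miscounts() ? 'broken' : 'correct';
```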


Thank you, sln and Ben.

I posted the same question on Twitter and received replies saying
that 5.14 gives the correct results. One of my followers sent me this link:
http://perl5.git.perl.org/perl.git/commit/fd4be6f07df0e6a021290ef721c5d73550e0248c


Happy New Year~ :)

G.Y.Park from South Korea
 
