Strange behavior of 'Alternative capture group numbering'

Discussion in 'Perl Misc' started by Raymundo, Jan 1, 2012.

  1. Raymundo

    Raymundo Guest

    Hello,

    First of all, I apologize for my English.

    I'm currently reading the 5.14 version of "perlretut" (Perl Regular
    Expression Tutorial):
    http://perldoc.perl.org/perlretut.html

    While reading the "Alternative capture group numbering" section,
    I wrote a simple test program to try it out myself.

    I'm using Strawberry Perl 5.12.3 on Windows XP.

    Here is my code:
    -----
    #!perl
    use strict;
    use warnings;

    # Read lines until end of input and show the first three capture groups.
    while ( defined( my $input = <STDIN> ) ) {
        chomp $input;
        if ( $input =~ /(?|(a)(b)|(c))(d)/ ) {
            print "1[$1] 2[$2] 3[$3]\n";
        }
    }
    -----

    Here is the result:
    -----
    abd
    1[a] 2[b] 3[d]
    cd
    Use of uninitialized value $2 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 2.
    1[c] 2[] 3[d]
    ----

    Okay. This is what I expected and what the documentation says. 'd' is
    assigned to $3 because the highest group number used inside the
    alternative numbering group is 2.
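
    The numbering rule described here can be checked without typing input by hand. The following is only a minimal sketch of the same test; `captures` is a hypothetical helper name introduced for illustration, and 'undef' is printed in place of a group that did not take part in the match.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: return the three capture groups as printable strings,
# with 'undef' standing in for a group that did not participate.
sub captures {
    my ($input) = @_;
    return unless $input =~ /(?|(a)(b)|(c))(d)/;
    return map { defined($_) ? $_ : 'undef' } ( $1, $2, $3 );
}

for my $input (qw(abd cd)) {
    my @groups = captures($input);
    print "$input: 1[$groups[0]] 2[$groups[1]] 3[$groups[2]]\n";
}
```

    With this pattern both branches number their groups from 1, so (d) comes after the widest branch and is group 3 for either input.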

    Then I modified the pattern, only swapping the order of the two
    alternatives inside the alternative numbering group:
    -----
    if ( $input =~ /(?|(c)|(a)(b))(d)/ ) {
    -----
    This is the result:
    -----
    abd
    Use of uninitialized value $3 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 1.
    1[a] 2[d] 3[]
    cd
    Use of uninitialized value $3 in concatenation (.) or string at d:\Temp\test.pl line 13, <STDIN> line 2.
    1[c] 2[d] 3[]
    ----

    I have no idea why this result differs from the first one.
    Why is 'd' in $2, not $3? Where did the 'b' of 'abd' go after matching?

    Is this a bug? Or is there something that I misunderstand?

    Any help would be appreciated.
    Thank you.
    Raymundo, Jan 1, 2012
    #1

  2. Raymundo

    Guest

    On Sun, 1 Jan 2012 08:59:44 -0800 (PST), Raymundo <> wrote:

    >I have no idea why the result differs from the first one.
    >Why is 'd' in $2, not $3? Where did the 'b' of 'abd' go after matching?
    >
    >Is this a bug? Or is there something that I misunderstand?
    >


    It's probably not a bug if you had to program the branch-reset code,
    because the whole thing is buggy and tends to crash at the drop of
    a hat.

    Using the regex debug mechanism, some observations can be made.
    The last branch-reset alternation is labeled BRANCH (FAIL).
    Apparently, the number of capture buffers in this branch is
    NOT counted when calculating the largest number of buffers.
    Therefore, the capture buffer that follows the branch-reset group is
    numbered from the largest count among the branches BEFORE the last
    branch.
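
    For anyone who wants to look at the compiled program themselves, the `re` pragma can print it. This is only a sketch: the pattern is the reordered one from the original question, and the exact layout of the dump may differ between perl versions.

```perl
#!perl
use strict;
use warnings;

my $re;
{
    # 'use re "debug"' is lexically scoped; the compiled regex program
    # is written to STDERR, not STDOUT.
    use re 'debug';
    $re = qr/(?|(c)|(a)(b))(d)/;
}

# Roughly equivalent from the command line:
#   perl -Mre=debug -e "qr/(?|(c)|(a)(b))(d)/"
```

    In the dump, the number after OPEN on the node for (d) shows which capture buffer it was assigned.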

    Example:

    (?|
    (x) ()
    |
    (c)
    |
    (a) (b) (r)
    )
    (d)

    Produces this code:

    1: BRANCH (13)
    2: OPEN1 (4)
    4: EXACT <x> (6)
    6: CLOSE1 (8)
    8: OPEN2 (11)
    10: NOTHING (11)
    11: CLOSE2 (40)
    13: BRANCH (20)
    14: OPEN1 (16)
    16: EXACT <c> (18)
    18: CLOSE1 (40)
    20: BRANCH (FAIL)
    21: OPEN1 (23)
    23: EXACT <a> (25)
    25: CLOSE1 (27)
    27: OPEN2 (29)
    29: EXACT <b> (31)
    31: CLOSE2 (33)
    33: OPEN3 (35)
    35: EXACT <r> (37)
    37: CLOSE3 (40)
    39: TAIL (40)
    40: OPEN3 (42)
    42: EXACT <d> (44)
    44: CLOSE3 (46)
    46: END (0)

    You can see that (d) is capture buffer 3, but it should be 4.

    So the simple rule is that the branch with the largest number of
    capture buffers should not be the last branch.

    There are a couple of ways around this.

    1 - Pad an earlier branch with an empty (NOTHING) capture group.
    (?|
    (c) ()
    | (a)(b)
    )
    (d)

    or,

    2 - Move the branch with the most captures ahead of the last branch.
    (?|
    (a)(b)
    | (c)
    )
    (d)
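
    Both workarounds can be tried side by side. This is a sketch under my own naming: `%workaround` and `third_group` are hypothetical names for illustration, and the patterns are the two from the list above written with /x.

```perl
#!perl
use strict;
use warnings;

# The two workaround patterns, written with /x so the spacing is ignored.
my %workaround = (
    padded    => qr/(?| (c) () | (a)(b) ) (d)/x,
    reordered => qr/(?| (a)(b) | (c) ) (d)/x,
);

# Hypothetical helper: return whatever ended up in $3, or undef when the
# input does not match or group 3 did not take part in the match.
sub third_group {
    my ( $re, $input ) = @_;
    return $input =~ $re ? $3 : undef;
}

for my $name ( sort keys %workaround ) {
    for my $input (qw(abd cd)) {
        my $third = third_group( $workaround{$name}, $input );
        printf "%-9s %-3s 3[%s]\n", $name, $input,
            defined $third ? $third : 'undef';
    }
}
```

    In both patterns a branch other than the last one already uses two groups, so (d) should be reported in $3 for either input.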

    This is just an observation that seems to hold true.
    In my mind, branch-reset in Perl or any PCRE engine is just
    one big bug, and should be avoided.

    -sln
    , Jan 1, 2012
    #2

  3. Raymundo

    Raymundo Guest

    On Jan 2, 7:16 AM, Ben Morrow <> wrote:
    > Quoth :
    >
    >
    > It looks to me like a bug in perl, and it appears to have been fixed in
    > 5.14.
    >
    > If you have any other instances of (?|) causing problems (that persist
    > in 5.14), and certainly if you have any examples of crashes, you should
    > report them with perlbug.
    >
    > Ben



    Thank you, sln and Ben.

    I also posted the same question on Twitter and received replies
    saying that 5.14 gives the correct results. One of my followers sent
    me this link:
    http://perl5.git.perl.org/perl.git/commit/fd4be6f07df0e6a021290ef721c5d73550e0248c
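
    A quick way to see whether a particular perl shows the behavior from my first post is to run the reordered pattern and look at $3. This is just a sketch; `branch_reset_numbering_ok` is a hypothetical name, and it only checks this one pattern.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the trailing (d) is numbered after the
# widest branch (the documented behaviour), 0 otherwise.
sub branch_reset_numbering_ok {
    return ( 'abd' =~ /(?|(c)|(a)(b))(d)/ && defined $3 && $3 eq 'd' ) ? 1 : 0;
}

printf "perl %vd: branch-reset numbering %s\n", $^V,
    branch_reset_numbering_ok()
    ? 'matches the documentation'
    : 'looks affected';
```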


    Happy New Year~ :)

    G.Y.Park from South Korea
    Raymundo, Jan 1, 2012
    #3