Backreferences: alias vs copy

Discussion in 'Perl Misc' started by Michael Carman, Aug 10, 2008.

  1. In a separate thread someone recently asked what happens if they modify
    the variable in a 'while ($var =~ /pattern/g)' loop. In crafting a
    sample program I noticed something that surprised me a little:

    my $s = 'abc';

    while ($s =~ /(\w)/g) {
        print "$1 - ";
        $s = 'xyz' if $1 eq 'b';
        print "$1\n";
    }
    __END__
    a - a
    b - y
    x - x
    y - y
    z - z

    In the second result, you can see that the value of $1 changes after
    reassigning $s. Its value becomes the text from the new string at the
    position corresponding to the match against the old one. This makes it
    pretty clear that $1 is actually an alias instead of a copy but I can't
    find this documented anywhere.

    That made me wonder what would happen if the new string was shorter than
    the match position in the old one. Consider

    my $s = 'abc';

    while ($s =~ /(\w)/g) {
        print "$1 - ";
        $s = 'x' if $1 eq 'c';
        print "$1\n";
    }
    __END__
    a - a
    b - b
    c - c # <--
    x - x

    as well as:

    my $s = 'abc';

    while ($s =~ /(\w)/g) {
        print "$1 - ";
        $s = 'xy' if $1 eq 'c';
        print "$1\n";
    }
    __END__

    a - a
    b - b
    c - # <--
    x - x
    y - y

    If that doesn't scream "NUL terminated C string!" I don't know what does.

    Is this documented anywhere, preferably with a caveat about using $1 and
    kin after you've changed the match string?
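
    One defensive pattern that does not depend on whether $1 is an alias
    or a copy is to copy the capture into a lexical before touching the
    matched string. A minimal sketch; the names $captured and @seen are
    illustrative and not from the original program:

    ```perl
    use strict;
    use warnings;

    my $s = 'abc';
    my @seen;

    while ($s =~ /(\w)/g) {
        my $captured = $1;               # take a real copy of the capture right away
        $s = 'xyz' if $captured eq 'b';  # reassigning $s also resets pos($s)
        push @seen, $captured;           # $captured no longer depends on $s
    }

    print join(' ', @seen), "\n";
    ```

    The loop still restarts on the new string after the reassignment, but
    the value recorded for each iteration is the one that was matched.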

    -mjc
     
    Michael Carman, Aug 10, 2008
    #1

  2. On Aug 10, 8:04 am, Michael Carman <> wrote:
    > In a separate thread someone recently asked what happens if they modify
    > the variable in a 'while ($var =~ /pattern/g)' loop. In crafting a
    > sample program I noticed something that surprised me a little:
    >
    > my $s = 'abc';
    >
    > while ($s =~ /(\w)/g) {
    > print "$1 - ";
    > $s = 'xyz' if $1 eq 'b';
    > print "$1\n";
    > }
    > __END__
    > a - a
    > b - y
    > x - x
    > y - y
    > z - z
    >
    > In the second result, you can see that the value of $1 changes after
    > reassigning $s. Its value becomes the text from the new string at the
    > position corresponding to the match against the old one. This makes it
    > pretty clear that $1 is actually an alias instead of a copy but I can't
    > find this documented anywhere.
    >


    I can't find anything completely explicit, but the performance penalty
    of copying would be prohibitive. There'd be a double whammy if the
    backreference was captured but not used afterwards.

    > That made me wonder what would happen if the new string was shorter than
    > the match position in the old one. Consider
    >
    > my $s = 'abc';
    >
    > while ($s =~ /(\w)/g) {
    > print "$1 - ";
    > $s = 'x' if $1 eq 'c';
    > print "$1\n";
    > }
    > __END__
    > a - a
    > b - b
    > c - c # <--
    > x - x
    >
    > as well as:
    >
    > my $s = 'abc';
    >
    > while ($s =~ /(\w)/g) {
    > print "$1 - ";
    > $s = 'xy' if $1 eq 'c';
    > print "$1\n";
    > }
    > __END__
    >
    > a - a
    > b - b
    > c - # <--
    > x - x
    > y - y
    >
    > If that doesn't scream "NUL terminated C string!" I don't know what does.
    >
    > Is this documented anywhere, preferably with a caveat about using $1 and
    > kin after you've changed the match string?
    >


    The only hint I saw was perlre's warning that once $& is seen, the copy
    price tag extends to $1, $2, etc. as well:


    WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in
    the program, it has to provide them for every pattern match. This may
    substantially slow your program. Perl uses the same mechanism to
    produce $1, $2, etc, so you also pay a price for each pattern that
    contains capturing parens...


    It seems a clear inference can be made from that: no copy occurs in
    the absence of $&.

    --
    Charles DeRykus
     
    comp.lang.c++, Aug 14, 2008
    #2

  3. "comp.lang.c++" <> wrote:
    > >
    > > Is this documented anywhere, preferably with a caveat about using $1
    > > and kin after you've changed the match string?
    > >

    >
    > The only hint I saw was perlre's
    > warning that once $& is seen, the copy price tag extends to $1, $2,
    > etc as well:
    >
    > WARNING: Once Perl sees that you
    > need one of $&, $`, or $'
    > anywhere in the program, it has
    > to provide them for every
    > pattern match. This may
    > substantially slow your program.
    > Perl uses the same mechanism to
    > produce $1, $2, etc, so you
    > also pay a price for each pattern
    > that contains capturing parens...


    I think you are misinterpreting that. It goes on to say:

    > But if you never use $&, $` or $', then patterns without capturing
    > parentheses will not be penalized.


    This seems to imply that patterns *with* capturing parentheses will be
    penalized, even in the absence of $&, $` or $'.


    > That seems like a clear inference could be made that no copy occurs in
    > the absence of $&.


    Maybe that is what is actually happening, but it seems far from clear based
    on the documents.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Aug 14, 2008
    #3
  4. On Aug 14, 11:42 am, wrote:
    > "comp.lang.c++" <> wrote:
    >
    > > > Is this documented anywhere, preferably with a caveat about using $1
    > > > and kin after you've changed the match string?

    >
    > > The only hint I saw was perlre's
    > > warning that once $& is seen, the copy price tag extends to $1, $2,
    > > etc as well:

    >
    > > WARNING: Once Perl sees that you
    > > need one of $&, $`, or $'
    > > anywhere in the program, it has
    > > to provide them for every
    > > pattern match. This may
    > > substantially slow your program.
    > > Perl uses the same mechanism to
    > > produce $1, $2, etc, so you
    > > also pay a price for each pattern
    > > that contains capturing parens...

    >
    > I think you are misinterpreting that. It goes on to say:
    >
    > > But if you never use $&, $` or $', then patterns without capturing
    > > parentheses will not be penalized.

    >
    > This seems to imply that patterns *with* capturing parentheses will be
    > penalized, even in the absence of $&, $` or $'.
    >


    No, I think capturing parens actually copy if $& is in the picture.
    Compare the output below with the original:

    my $s = 'abc';
    while ($s =~ /(\w)/g) {
        print "$&: $1 - ";
        print "$1 - ";
        $s = 'xyz' if $1 eq 'b';
        print "$1\n";
    }
    __END__
    a: a - a
    b: b - b
    x: x - x
    y: y - y
    z: z - z

    --
    Charles DeRykus
     
    comp.lang.c++, Aug 14, 2008
    #4
  5. "comp.lang.c++" <> wrote:
    > On Aug 14, 11:42 am, wrote:
    > > "comp.lang.c++" <> wrote:
    > >
    > > > > Is this documented anywhere, preferably with a caveat about using
    > > > > $1 and kin after you've changed the match string?

    > >
    > > > The only hint I saw was perlre's
    > > > warning that once $& is seen, the copy price tag extends to $1, $2,
    > > > etc as well:

    > >
    > > > WARNING: Once Perl sees that you
    > > > need one of $&, $`, or $'
    > > > anywhere in the program, it has
    > > > to provide them for every
    > > > pattern match. This may
    > > > substantially slow your program.
    > > > Perl uses the same mechanism to
    > > > produce $1, $2, etc, so you
    > > > also pay a price for each pattern
    > > > that contains capturing parens...

    > >
    > > I think you are misinterpreting that. It goes on to say:
    > >
    > > > But if you never use $&, $` or $', then patterns without capturing
    > > > parentheses will not be penalized.

    > >
    > > This seems to imply that patterns *with* capturing parentheses will be
    > > penalized, even in the absence of $&, $` or $'.
    > >

    >
    > No, I think capturing parens
    > actually copy if $& is in the
    > picture. Compare below with
    > orig. output:
    >
    > my $s = 'abc';
    > while ($s =~ /(\w)/g) {
    > print "$&: $1 - ";
    > print "$1 - ";
    > $s = 'xyz' if $1 eq 'b';
    > print "$1\n";
    > }


    Based on my experimentation:

    In the absence of /g, capturing parentheses always copy.

    In the presence of $&, capturing parentheses always copy.

    They alias only if they are used with /g and only if $& (etc.) has not
    been seen.

    Odd.
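
    To check which case a given perl falls into, a small probe along these
    lines can be used. It is only a sketch: the helper names probe_with_g
    and probe_without_g are made up for illustration, and the result it
    reports depends on the perl version it runs under, so no particular
    output is assumed here.

    ```perl
    use strict;
    use warnings;

    # Each probe returns 'copy' if $1 keeps its value after the matched
    # string is reassigned, and 'alias' if it changes. The special match
    # variables are deliberately not used anywhere in this file, since
    # their mere presence is reported to change the result.

    sub probe_with_g {
        my $s = 'abc';
        while ($s =~ /(\w)/g) {
            next unless $1 eq 'b';
            $s = 'xyz';
            return $1 eq 'b' ? 'copy' : 'alias';
        }
        return 'no match';
    }

    sub probe_without_g {
        my $s = 'abc';
        if ($s =~ /(b)/) {
            $s = 'xyz';
            return $1 eq 'b' ? 'copy' : 'alias';
        }
        return 'no match';
    }

    printf "with /g:    %s\n", probe_with_g();
    printf "without /g: %s\n", probe_without_g();
    ```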

    If you use a string eval to inspect $&, $' or $` (so that Perl doesn't
    see them coming), then those variables are set by alias vs. copy under the
    same conditions the capturing parentheses are. And if the regex doesn't
    have any capturing parentheses, then $& etc. are set by alias. That was a
    surprise; I figured they wouldn't get set at all when Perl doesn't see them
    coming and there were no capturing parentheses.


    Xho

     
    , Aug 14, 2008
    #5
  6. comp.lang.c++ wrote:
    > On Aug 10, 8:04 am, Michael Carman <> wrote:
    >> This makes it pretty clear that $1 is actually an alias instead of
    >> a copy but I can't find this documented anywhere.

    >
    > I can't find anything completely explicit but the performance penalty
    > would be prohibitive.


    Yes, the behavior isn't surprising at all if you think about the
    implementation a little.

    > WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere
    > in the program, it has to provide them for every pattern match. This
    > may substantially slow your program. Perl uses the same mechanism to
    > produce $1, $2, etc, so you also pay a price for each pattern that
    > contains capturing parens...
    >
    > That seems like a clear inference could be made that no copy occurs
    > in the absence of $&.


    All that says is that if you use those variables perl has to track the
    prematch, match, and postmatch for every regular expression. This is
    because they're set after a successful match, and when you use them perl
    has no way of knowing which regex will have been the last successful one.

    Capturing parens only introduce the overhead for the regexes in which
    they are used because it's clear that they only apply there.

    After poking around a bit more, I noticed that perlvar has this to say
    in the entry for @- (@LAST_MATCH_START):

    $1 is the same as "substr($var, $-[1], $+[1] - $-[1])"

    I had always read that as "is equivalent to" but it would appear that a
    literal interpretation is warranted. They really are the exact same.
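
    The perlvar equivalence can be checked directly after a match. A short
    sketch, with a made-up input string and the illustrative names
    $capture and $via_substr:

    ```perl
    use strict;
    use warnings;

    my $var = 'key=value';
    my ($capture, $via_substr);

    if ($var =~ /=(\w+)/) {
        $capture = $1;
        # @- and @+ hold the start and end offsets of each capture group
        $via_substr = substr($var, $-[1], $+[1] - $-[1]);
    }

    print "capture:    $capture\n";
    print "via substr: $via_substr\n";
    ```

    Both lines print the same text as long as $var has not been modified
    since the match.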

    -mjc
     
    Michael Carman, Aug 15, 2008
    #6
  7. wrote:
    > In the absence of /g, capturing parentheses always copy.
    >
    > In the presence of $&, capturing parentheses always copy.
    >
    > They alias only if they are used with a /g and only if $& (etc) has
    > not been seen.


    I see the same behavior, though I wonder if in the presence of $& it's
    actually $& that's the copy and then $1 and friends alias to it instead
    of to the original string. There's probably no way of knowing without
    mucking through the guts.

    > If you use a string eval to inspect $&, $' or $` (so that Perl
    > doesn't see them coming), then those variables are set by alias vs.
    > copy under the same conditions the capturing parentheses are.


    Actually, it's weirder than that:

    perl -e "$_ = 'abc123'; /\d/; $_ = 'xyz789'; print qq{[$&]}"
    [1]

    perl -e "$_ = 'abc123'; /1/; $_ = 'xyz789'; print qq{[$&]}"
    [1]

    perl -e "$_ = 'abc123'; /\d/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
    [7]

    perl -e "$_ = 'abc123'; /1/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
    []

    perl -e "$_ = 'abc123'; /[0-9]/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
    [7]

    perl -e "$_ = 'abc123'; /\w1/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
    [z7]

    > And if the regex doesn't have any capturing parentheses, then $& etc
    > are set by alias. That was a surprise; I figured they wouldn't get
    > set at all when Perl doesn't see them coming and there were no
    > capturing parentheses.


    Agreed. I was particularly surprised by that as well, although it
    depends on the pattern. If you match literal text $& isn't set; you'll
    get an uninitialized value warning if you add -w.

    If you match against things like /\d/, /[0-9]/, or /(?:1|2)/ then $&
    does get set. Patterns such as /[1]/ and /(?:1)/ don't set it,
    presumably because they can be simplified to a literal /1/.

    It appears that the aliasing (at least for a stealth $&) is a side
    effect of the regex engine potentially needing to backtrack. I suspect
    that for the literal matches perl is calling index() to look for a
    substring instead of invoking the regex engine.

    It's possible that the behavior of $1 is the result of a similar
    implementation detail/optimization. I'm hesitant to call it a bug,
    though it might be.

    -mjc
     
    Michael Carman, Aug 15, 2008
    #7
  8. Michael Carman wrote:
    >
    > After poking around a bit more, I noticed that perlvar has this to say
    > in the entry for @- (@LAST_MATCH_START):
    >
    > $1 is the same as "substr($var, $-[1], $+[1] - $-[1])"
    >
    > I had always read that as "is equivalent to" but it would appear that a
    > literal interpretation is warranted. They really are the exact same.


    They are not *exactly* the same. You can assign to substr($var, $-[1],
    $+[1] - $-[1]) but you cannot assign to $1.
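
    The difference can be seen in a few lines. This is a sketch with a
    made-up input string; the names $assigned and $error are illustrative,
    and the eval is there only to trap the run-time error from the
    assignment to $1:

    ```perl
    use strict;
    use warnings;

    my $var = 'hello world';

    if ($var =~ /(world)/) {
        # substr() is an lvalue, so this replaces the captured span in $var
        substr($var, $-[1], $+[1] - $-[1]) = 'there';
    }

    # Assigning to $1 is rejected at run time; eval traps the error.
    my $assigned = eval { $1 = 'x'; 1 };
    my $error    = $assigned ? '' : $@;

    print "var:   $var\n";
    print "error: $error";
    ```

    The substr() assignment changes $var in place, while the assignment to
    $1 dies with a "Modification of a read-only value attempted" error.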



    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
     
    John W. Krahn, Aug 15, 2008
    #8
  9. John W. Krahn wrote:
    ) Michael Carman wrote:
    )>
    )> After poking around a bit more, I noticed that perlvar has this to say
    )> in the entry for @- (@LAST_MATCH_START):
    )>
    )> $1 is the same as "substr($var, $-[1], $+[1] - $-[1])"
    )>
    )> I had always read that as "is equivalent to" but it would appear that a
    )> literal interpretation is warranted. They really are the exact same.
    )
    ) They are not *exactly* the same. You can assign to substr($var, $-[1],
    ) $+[1] - $-[1]) but you cannot assign to $1.

    Which is a pity, IMHO. Assigning to $1 would be a good feature.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Aug 15, 2008
    #9
  10. wrote:
    > "comp.lang.c++" <> wrote:


    >> But if you never use $&, $` or $', then patterns without capturing
    >> parentheses will not be penalized.

    >
    > This seems to imply that patterns *with* capturing parentheses will be
    > penalized, even in the absence of $&, $` or $'.


    No.

    Without the special patterns, this penalisation just doesn't occur.
    This penalisation is only there when the special patterns are there.
    A single occurrence of the patterns makes Perl do something extra (like
    capturing) for every regex, but if a regex is already capturing anyway,
    the penalisation is less personal.
    (etc., like the Parrot sketch)

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Aug 15, 2008
    #10
  11. bugbear wrote:
    ) Willem wrote:
    )> John W. Krahn wrote:
    )> ) They are not *exactly* the same. You can assign to substr($var, $-[1],
    )> ) $+[1] - $-[1]) but you cannot assign to $1.
    )>
    )> Which is a pity, IMHO. Assigning to $1 would be a good feature.
    )
    ) ...of which an equivalent is shown above ! :)

    Agreed, it wouldn't be much more than syntactic sugar, but a lot
    of the language is just that: syntactic sugar.
    That should also make it reasonably easy to implement, I would venture.
    (Unless $1 were sometimes a copy and not always an alias...)


    SaSW, Willem
     
    Willem, Aug 15, 2008
    #11
  12. John W. Krahn wrote:
    > They are not *exactly* the same. You can assign to
    > substr($var, $-[1], $+[1] - $-[1]) but you cannot assign to $1.


    Well, yes, there is that. :)

    Willem wrote:
    > a lot of the language is just that: syntactic sugar. That should also
    > make it reasonably easy to implement, I would venture. (Unless $1
    > were sometimes a copy and not always an alias...)


    Judging by the experiments in the other branch of this thread, that
    appears to be the case.

    -mjc
     
    Michael Carman, Aug 15, 2008
    #12