Backreferences: alias vs copy


Michael Carman

In a separate thread someone recently asked what happens if they modify
the variable in a 'while ($var =~ /pattern/g)' loop. In crafting a
sample program I noticed something that surprised me a little:

my $s = 'abc';

while ($s =~ /(\w)/g) {
    print "$1 - ";
    $s = 'xyz' if $1 eq 'b';
    print "$1\n";
}
__END__
a - a
b - y
x - x
y - y
z - z

In the second result, you can see that the value of $1 changes after
reassigning $s. Its value becomes the text from the new string at the
position corresponding to the match against the old one. This makes it
pretty clear that $1 is actually an alias instead of a copy but I can't
find this documented anywhere.

That made me wonder what would happen if the new string was shorter than
the match position in the old one. Consider

my $s = 'abc';

while ($s =~ /(\w)/g) {
    print "$1 - ";
    $s = 'x' if $1 eq 'c';
    print "$1\n";
}
__END__
a - a
b - b
c - c # <--
x - x

as well as:

my $s = 'abc';

while ($s =~ /(\w)/g) {
    print "$1 - ";
    $s = 'xy' if $1 eq 'c';
    print "$1\n";
}
__END__

a - a
b - b
c - # <--
x - x
y - y

If that doesn't scream "NUL terminated C string!" I don't know what does.

Is this documented anywhere, preferably with a caveat about using $1 and
kin after you've changed the match string?
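
Whatever the answer turns out to be, one way to sidestep the question is to copy $1 into a lexical before touching the matched string. A minimal sketch of the first loop rewritten that way (the sub name walk_and_swap is made up for illustration, and the lines are collected in a list instead of printed directly):

```perl
use strict;
use warnings;

# Hypothetical helper: same loop as the first example, but $1 is copied
# into a lexical before $s is reassigned, so later output uses the copy.
sub walk_and_swap {
    my ($s) = @_;
    my @lines;
    while ($s =~ /(\w)/g) {
        my $char = $1;                 # private copy of the capture
        $s = 'xyz' if $char eq 'b';    # reassigning $s also resets pos($s)
        push @lines, "$char - $char";
    }
    return @lines;
}

print "$_\n" for walk_and_swap('abc');
```

Because $char is an ordinary copy, the second line comes out as "b - b" no matter how perl implements $1 underneath. The loop then carries on from the start of the new string, since assigning to $s resets its match position.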

-mjc
 

comp.lang.c++

In a separate thread someone recently asked what happens if they modify
the variable in a 'while ($var =~ /pattern/g)' loop. In crafting a
sample program I noticed something that surprised me a little:

my $s = 'abc';

while ($s =~ /(\w)/g) {
    print "$1 - ";
    $s = 'xyz' if $1 eq 'b';
    print "$1\n";
}
__END__
a - a
b - y
x - x
y - y
z - z

In the second result, you can see that the value of $1 changes after
reassigning $s. Its value becomes the text from the new string at the
position corresponding to the match against the old one. This makes it
pretty clear that $1 is actually an alias instead of a copy but I can't
find this documented anywhere.

I can't find anything completely explicit but the performance penalty
would be prohibitive. There'd be a double whammy if the backref. was
captured but not used afterwards.

That made me wonder what would happen if the new string was shorter than
the match position in the old one. Consider

my $s = 'abc';

while ($s =~ /(\w)/g) {
    print "$1 - ";
    $s = 'x' if $1 eq 'c';
    print "$1\n";
}
__END__
a - a
b - b
c - c # <--
x - x

as well as:

my $s = 'abc';

while ($s =~ /(\w)/g) {
    print "$1 - ";
    $s = 'xy' if $1 eq 'c';
    print "$1\n";
}
__END__

a - a
b - b
c - # <--
x - x
y - y

If that doesn't scream "NUL terminated C string!" I don't know what does.

Is this documented anywhere, preferably with a caveat about using $1 and
kin after you've changed the match string?

The only hint I saw was perlre's
warning that once $& is seen, the copy price tag extends to $1, $2,
etc as well:


WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere
in the program, it has to provide them for every pattern match. This
may substantially slow your program. Perl uses the same mechanism to
produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parens...


It seems a clear inference could be made that no copy occurs in
the absence of $&.
 

xhoster

comp.lang.c++ said:
The only hint I saw was perlre's
warning that once $& is seen, the copy price tag extends to $1, $2,
etc as well:

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere
in the program, it has to provide them for every pattern match. This
may substantially slow your program. Perl uses the same mechanism to
produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parens...

I think you are misinterpreting that. It goes on to say:
But if you never use $&, $` or $', then patterns without capturing
parentheses will not be penalized.

This seems to imply that patterns *with* capturing parentheses will be
penalized, even in the absence of $&, $` or $'.

It seems a clear inference could be made that no copy occurs in
the absence of $&.

Maybe that is what is actually happening, but it seems far from clear based
on the documents.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

comp.lang.c++

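I think you are misinterpreting that. It goes on to say:
But if you never use $&, $` or $', then patterns without capturing
parentheses will not be penalized.
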
This seems to imply that patterns *with* capturing parentheses will be
penalized, even in the absence of $&, $` or $'.

No, I think capturing parens actually copy if $& is in the picture.
Compare below with orig. output:

my $s = 'abc';
while ($s =~ /(\w)/g) {
    print "$&: $1 - ";
    $s = 'xyz' if $1 eq 'b';
    print "$1\n";
}
__END__
a: a - a
b: b - b
x: x - x
y: y - y
z: z - z
 

xhoster

comp.lang.c++ said:
No, I think capturing parens actually copy if $& is in the picture.
Compare below with orig. output:

my $s = 'abc';
while ($s =~ /(\w)/g) {
    print "$&: $1 - ";
    $s = 'xyz' if $1 eq 'b';
    print "$1\n";
}

Based on my experimentation:

In the absence of /g, capturing parentheses always copy.

In the presence of $&, capturing parentheses always copy.

They alias only if they are used with a /g and only if $& (etc) has not
been seen.

Odd.

If you use a string eval to inspect $&, $' or $` (so that Perl doesn't
see them coming), then those variables are set by alias vs. copy under the
same conditions the capturing parentheses are. And if the regex doesn't
have any capturing parentheses, then $& etc are set by alias. That was a
surprise; I figured they wouldn't get set at all when Perl doesn't see them
coming and there were no capturing parentheses.
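
For anyone who wants to repeat the experiment, here is a rough probe that reports what $1 holds after the matched variable is reassigned, once with /g and once without. The sub names are made up for illustration, and I would expect the answer to depend on the perl version and build, so the sketch only reports what it sees. Keep $&, $` and $' out of the file, since their mere presence is one of the conditions being tested.

```perl
use strict;
use warnings;

# Hypothetical probes: capture 'b', reassign the matched variable, and
# report what $1 holds afterwards. Neither sub mentions $&, $` or $'.
sub capture_after_reassign_with_g {
    my $s = 'abc';
    while ($s =~ /(\w)/g) {
        next unless $1 eq 'b';
        $s = 'xyz';
        my $after = $1;    # read $1 again, after the reassignment
        return $after;
    }
    return;
}

sub capture_after_reassign_without_g {
    my $s = 'abc';
    if ($s =~ /(b)/) {
        $s = 'xyz';
        my $after = $1;    # read $1 again, after the reassignment
        return $after;
    }
    return;
}

for my $probe (
    [ 'with /g'    => \&capture_after_reassign_with_g ],
    [ 'without /g' => \&capture_after_reassign_without_g ],
) {
    my ($label, $code) = @$probe;
    my $after = $code->();
    printf "%-10s \$1 is %s\n", $label,
        defined $after ? "'$after'" : 'undef';
}
```

A result of 'b' means $1 behaved like a copy in that case; anything else means it followed the new contents of the string.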


Xho

 

Michael Carman

comp.lang.c++ said:
I can't find anything completely explicit but the performance penalty
would be prohibitive.

Yes, the behavior isn't surprising at all if you think about the
implementation a little.

WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere
in the program, it has to provide them for every pattern match. This
may substantially slow your program. Perl uses the same mechanism to
produce $1, $2, etc, so you also pay a price for each pattern that
contains capturing parens...

It seems a clear inference could be made that no copy occurs
in the absence of $&.

All that says is that if you use those variables perl has to track the
prematch, match, and postmatch for every regular expression. This is
because they're set after a successful match, and when you use them perl
has no way of knowing which regex will have been the last successful one.

Capturing parens only introduce the overhead for the regexes in which
they are used because it's clear that they only apply there.

After poking around a bit more, I noticed that perlvar has this to say
in the entry for @- (@LAST_MATCH_START):

$1 is the same as "substr($var, $-[1], $+[1] - $-[1])"

I had always read that as "is equivalent to" but it would appear that a
literal interpretation is warranted. They really are the exact same.
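
For reference, the perlvar expression can be checked directly. A small sketch, read straight after the match and before anything modifies the string (the helper name capture_both_ways and the sample data are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: return group 1 both ways, read straight after the
# match and before anything modifies $var. Assumes $regex has a group 1
# that took part in the match.
sub capture_both_ways {
    my ($var, $regex) = @_;
    return unless $var =~ $regex;
    my $direct  = $1;
    my $offsets = substr($var, $-[1], $+[1] - $-[1]);
    return ($direct, $offsets);
}

my ($direct, $offsets) = capture_both_ways('key=value', qr/=(\w+)/);
print "\$1: '$direct', substr: '$offsets'\n";
```

Both values come out as 'value' here. The interesting question in this thread is only what happens once $var changes after the match.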

-mjc
 

Michael Carman

In the absence of /g, capturing parentheses always copy.

In the presence of $&, capturing parentheses always copy.

They alias only if they are used with a /g and only if $& (etc) has
not been seen.

I see the same behavior, though I wonder if in the presence of $& it's
actually $& that's the copy and then $1 and friends alias to it instead
of to the original string. There's probably no way of knowing without
mucking through the guts.

If you use a string eval to inspect $&, $' or $` (so that Perl
doesn't see them coming), then those variables are set by alias vs.
copy under the same conditions the capturing parentheses are.

Actually, it's weirder than that:

perl -e "$_ = 'abc123'; /\d/; $_ = 'xyz789'; print qq{[$&]}"
[1]

perl -e "$_ = 'abc123'; /1/; $_ = 'xyz789'; print qq{[$&]}"
[1]

perl -e "$_ = 'abc123'; /\d/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
[7]

perl -e "$_ = 'abc123'; /1/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
[]

perl -e "$_ = 'abc123'; /[0-9]/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
[7]

perl -e "$_ = 'abc123'; /\w1/; $_ = 'xyz789'; eval 'print qq{[$&]}'"
[z7]

And if the regex doesn't have any capturing parentheses, then $& etc
are set by alias. That was a surprise; I figured they wouldn't get
set at all when Perl doesn't see them coming and there were no
capturing parentheses.

Agreed. I was particularly surprised by that as well, although it
depends on the pattern. If you match literal text $& isn't set; you'll
get an uninitialized value warning if you add -w.

If you match against things like /\d/, /[0-9]/, or /(?:1|2)/ then $&
does get set. Patterns such as /[1]/ and /(?:1)/ don't set it,
presumably because they can be simplified to a literal /1/.

It appears that the aliasing (at least for a stealth $&) is a side
effect of the regex engine potentially needing to backtrack. I suspect
that for the literal matches perl is calling index() to look for a
substring instead of invoking the regex engine.

It's possible that the behavior of $1 is the result of a similar
implementation detail/optimization. I'm hesitant to call it a bug,
though it might be.

-mjc
 

John W. Krahn

Michael said:
After poking around a bit more, I noticed that perlvar has this to say
in the entry for @- (@LAST_MATCH_START):

$1 is the same as "substr($var, $-[1], $+[1] - $-[1])"

I had always read that as "is equivalent to" but it would appear that a
literal interpretation is warranted. They really are the exact same.

They are not *exactly* the same. You can assign to substr($var, $-[1],
$+[1] - $-[1]) but you cannot assign to $1.
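
A quick sketch of the difference, assuming a pattern whose first group takes part in the match (the helper name replace_first_capture is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: replace the text captured by group 1, using the
# offsets in @- and @+ with an lvalue substr. Assumes $regex has a
# group 1 that took part in the match.
sub replace_first_capture {
    my ($var, $regex, $new) = @_;
    return $var unless $var =~ $regex;
    substr($var, $-[1], $+[1] - $-[1]) = $new;    # lvalue substr
    return $var;
}

print replace_first_capture('hello world', qr/(world)/, 'there'), "\n";

# Assigning to $1 itself is refused at run time, so trap it with eval.
my $assigned = eval { $1 = 'x'; 1 };
print "assigning to \$1 failed: $@" unless $assigned;
```

The first print gives "hello there". The eval block dies with perl's read-only modification error, which is the asymmetry described above.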



John
 

Willem

John W. Krahn wrote:
) Michael Carman wrote:
)>
)> After poking around a bit more, I noticed that perlvar has this to say
)> in the entry for @- (@LAST_MATCH_START):
)>
)> $1 is the same as "substr($var, $-[1], $+[1] - $-[1])"
)>
)> I had always read that as "is equivalent to" but it would appear that a
)> literal interpretation is warranted. They really are the exact same.
)
) They are not *exactly* the same. You can assign to substr($var, $-[1],
) $+[1] - $-[1]) but you cannot assign to $1.

Which is a pity, IMHO. Assigning to $1 would be a good feature.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

Dr.Ruud

(e-mail address removed) wrote:
This seems to imply that patterns *with* capturing parentheses will be
penalized, even in the absence of $&, $` or $'.

No.

Without the special variables ($&, $` and $'), this penalisation just
doesn't occur. It is only there when the special variables are there.
A single occurrence of those variables makes Perl do something extra (like
capturing) for every regex, but if a regex is already capturing anyway,
the penalisation is less personal.
(etc., like the Parrot sketch)
 

Willem

bugbear wrote:
) Willem wrote:
)> John W. Krahn wrote:
)> ) They are not *exactly* the same. You can assign to substr($var, $-[1],
)> ) $+[1] - $-[1]) but you cannot assign to $1.
)>
)> Which is a pity, IMHO. Assigning to $1 would be a good feature.
)
) ...of which an equivalent is shown above ! :)

Agreed, it wouldn't be much more than syntactic sugar, but a lot
of the language is just that: syntactic sugar.
That should also make it reasonably easy to implement, I would venture.
(Unless $1 were sometimes a copy and not always an alias...)


SaSW, Willem
 

Michael Carman

John said:
They are not *exactly* the same. You can assign to
substr($var, $-[1], $+[1] - $-[1]) but you cannot assign to $1.

Well, yes, there is that. :)

a lot of the language is just that: syntactic sugar. That should also
make it reasonably easy to implement, I would venture. (Unless $1
were sometimes a copy and not always an alias...)

Judging by the experiments in the other branch of this thread, that
appears to be the case.

-mjc
 
