Interpolation of qr-regexes containing backreferences

Haakon Riiser

I just noticed that backreferences in qr-regexes behave differently
from what I expected when they are interpolated into a new regex.
I expected that the meaning of the backreference shouldn't change
when interpolated into a new regex. I.e., one should be able to
do things like:

$re1 = qr{(.)\1};
$re2 = qr{($re1$re1)};

which I would expect to be equivalent to

$re2 = qr{((.)\2(.)\3)};

Perl 5.8.3 instead does this:

$re2 = qr{((.)\1(.)\1)};

I searched for the problem on Google, and found that it has been
known for at least three years. Since it's still here, does that
mean that there's another solution that does not require me to
drop the interpolation and write the entire regex as one big chunk?

Thanks in advance for any replies.
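
For anyone who wants to reproduce the behavior described above, here is a minimal sketch. It is not part of the original post, and the test string 'aabb' is an assumption chosen for illustration. It prints the combined pattern as Perl stringifies it, then checks $re1 on its own and inside $re2.

```perl
use strict;
use warnings;

my $re1 = qr{(.)\1};
my $re2 = qr{($re1$re1)};

# Stringifying shows the interpolated source: both copies still say \1.
my $source = "$re2";
print "combined pattern: $source\n";

# On its own, $re1 matches a doubled character at the start of the string.
my $alone = ('aabb' =~ /\A$re1/) ? 1 : 0;

# Inside $re2, \1 refers to the outer group, which is still open when \1 is
# reached, so the combined pattern does not match 'aabb'.
my $combined = ('aabb' =~ /\A$re2\z/) ? 1 : 0;

print "alone: $alone, combined: $combined\n";
```

The exact wrapper around each interpolated piece in the printed pattern depends on the Perl version, but the \1 inside it is not renumbered.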
 
Ben Morrow

You could try (untested):

my $re1 = qr[(.)(??{$^N})];
my $re2 = qr[($re1$re1)];

Ben
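
A small sketch for trying out the suggestion above. The quotemeta call and the test strings are additions for illustration, not part of Ben's post: without quotemeta, a captured metacharacter such as '.' would be compiled as a pattern and match any character.

```perl
use strict;
use warnings;

# (??{ ... }) runs the code at match time and uses its result as a pattern.
# $^N holds the text of the most recently closed group, i.e. the (.) just
# before it, so no group number is hard-coded into $re1.
my $re1 = qr[(.)(??{ quotemeta $^N })];
my $re2 = qr[($re1$re1)];

# Hypothetical test strings: two doubled pairs should match, others should not.
my @hits = grep { /\A$re2\z/ } qw(aabb abab ..zz .xzz);
print "matched: @hits\n";
```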
 
Haakon Riiser

[Ben Morrow]
Haakon Riiser said:
[...] I.e., one should be able to do things like:

$re1 = qr{(.)\1};
$re2 = qr{($re1$re1)};

which I would expect to be equivalent to

$re2 = qr{((.)\2(.)\3)};

You could try (untested):

my $re1 = qr[(.)(??{$^N})];
my $re2 = qr[($re1$re1)];

Thanks, this works great! I've usually tried to avoid "highly
experimental" regex features such as (??{ ... }), but it's been
marked highly experimental for a few years now, so how dangerous
could it be?

I should probably reread that section of the regex manual since
I didn't pay too much attention to it the first time, it being
experimental and all. :)
 
Haakon Riiser

[Ben Morrow]
You could try (untested):

my $re1 = qr[(.)(??{$^N})];
my $re2 = qr[($re1$re1)];

One question regarding the behavior of (??{ ... }):
Take the following code: (Notice that there are two versions of
the $quoted_literal regex. The first one uses (??{ ... }) and $^N
and the other one uses the delimiter directly.)

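use warnings;

# Version 1: capture the delimiter, then build the rest from $^N.
$quoted_literal = qr/
    (")
    (??{ "[^$^N]*$^N" })
/x;

# Version 2: the delimiter written out directly.
# This assignment overrides version 1.
$quoted_literal = qr/
    "
    [^"]*
    "
/x;

$data = 'this is "hello" world';
@list = $data =~ /($quoted_literal|[^"]*)/g;
for ($i = 0; $i < @list; $i++) {
    printf "[$i] '\%s'\n", defined $list[$i] ? $list[$i] : "UNDEFINED";
}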

If I run this program as it is (using the simple direct version of
$quoted_literal) the output is

[0] 'this is '
[1] '"hello"'
[2] ' world'
[3] ''

If the simple version of $quoted_literal is removed, i.e. making the
script use the (??{ ... }) / $^N version, the result is completely
different:

[0] 'this is '
[1] 'UNDEFINED'
[2] '"hello"'
[3] '"'
[4] ' world'
[5] 'UNDEFINED'
[6] ''
[7] 'UNDEFINED'

As I understood it, the two versions of $quoted_literal should
match exactly the same text, so I can't figure out why the results
aren't the same. Any help in understanding why this happens,
and preferably fixing it, is greatly appreciated.
 
Ben Morrow

The regex with (??{}) in it has an extra set of parentheses. If you
take the second output again, and number the rows:

[0] 'this is '   $1
[1] 'UNDEFINED'  $2
[2] '"hello"'    $1
[3] '"'          $2
[4] ' world'     $1
[5] 'UNDEFINED'  $2
[6] ''           $1
[7] 'UNDEFINED'  $2

it should be clear. BTW, you would almost certainly be better off
using Text::Balanced for this sort of thing.

Ben
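
The point about the extra parentheses can be sketched as follows. This is an illustration only: the slice that keeps every other element is one possible workaround and an assumption, not something proposed in the thread.

```perl
use strict;
use warnings;

my $quoted_literal = qr/
    (")
    (??{ "[^$^N]*$^N" })
/x;

my $data = 'this is "hello" world';

# In list context, m//g returns every capture group of every match, so the
# inner (") contributes a second value per match.
my @list = $data =~ /($quoted_literal|[^"]*)/g;

# Keep only the outer group: it sits at the even indexes of @list.
my @tokens = @list[ grep { $_ % 2 == 0 } 0 .. $#list ];

printf "%d captures, %d tokens\n", scalar @list, scalar @tokens;
print "'$_'\n" for @tokens;
```

With one inner group the outer capture is every second element; a pattern with more inner groups would need a different stride.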
 
Haakon Riiser

[Ben Morrow]
The regex with (??{}) in it has an extra set of parentheses. If
you take the second output again, and number the rows:
[0] 'this is ' $1
[1] 'UNDEFINED' $2
[2] '"hello"' $1
[3] '"' $2
[4] ' world' $1
[5] 'UNDEFINED' $2
[6] '' $1
[7] 'UNDEFINED' $2

it should be clear.

Argh, I can't believe I didn't spot that one. Time to take a
break I guess. :)

BTW, you would almost certainly be better off using
Text::Balanced for this sort of thing.

That would require me to totally rewrite my tokenizer. I was
working on a small parser (using the wonderful Parse::Yapp),
and did the entire tokenizing with a single regex-match.

@tokens = $raw_data =~ m{
$comment | ( $quoted_literal | $special | $op | $unquoted_literal )
}gx;

The language is quite simple, so it is possible to do every regex
without using internal capturing. The only construct that would
be simplified with backreferences was $quoted_literal, which
supports three types of strings: double quoted, single quoted,
and user-defined delimiter.

" ... "
' ... '
^c ... c

where c can be any character, and the delimiters can be escaped
by putting two of them next to each other:

'foo ''bar'' baz' == foo 'bar' baz

Since the third string type supports any character as a delimiter,
it would be nice if I could use backreferences. Now that that's
out of the question, I chose instead to generate a bunch of regexes
(one for each ASCII character) using sprintf. Not as elegant,
but it works, and it's probably faster than the equivalent solution
with backreferences would have been.
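
The original code was not posted, so the following is only a rough sketch of what generating one regex per delimiter with sprintf might look like. The helper name delimited, the restriction to printable ASCII (codes 33 to 126), and the sample input are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: build the pattern source for one delimiter. Inside
# the string, a doubled delimiter stands for one literal delimiter.
sub delimited {
    my ($open, $delim) = @_;
    my $d = quotemeta $delim;
    return sprintf '%s%s(?:[^%s]|%s%s)*%s', quotemeta($open), $d, $d, $d, $d, $d;
}

my @alternatives = (
    delimited('', '"'),                               # " ... "
    delimited('', "'"),                               # ' ... '
    map { delimited('^', chr($_)) } 33 .. 126,        # ^c ... c
);

# No capturing parentheses anywhere, so the pattern can be interpolated
# into a larger tokenizer regex without shifting group numbers.
my $alt = join '|', @alternatives;
my $quoted_literal = qr/(?:$alt)/;

my @found = q{say 'foo ''bar'' baz' ^|a||b| "x""y"} =~ /($quoted_literal)/g;
print "$_\n" for @found;
```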
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Haakon Riiser
$re1 = qr{(.)\1};
$re2 = qr{($re1$re1)};

which I would expect to be equivalent to

$re2 = qr{((.)\2(.)\3)};

What makes you expect this? qr() is an analogue of qq() etc...
Perl 5.8.3 instead does this:

$re2 = qr{((.)\1(.)\1)};

As designed...

Hope this helps,
Ilya
 
Haakon Riiser

[Ilya Zakharevich]
Who cares? What is important is how it *is* designed.

Who cares how it is designed? You asked me what made *me* expect
that qr regexes can be interpolated with predictable behavior.
The answer was, of course, that this would make sense to me,
while the current design makes no sense since you can accomplish
the same thing by interpolating a string representation of the
regex, while the more useful case of localized regex scope w/o
capturing side effects is impossible to achieve.

"Exactly as written"??? And what do you think it would be, q// or qq//?
(One cannot replace qr() by qq(), any more than replace qq() by q().)

I shouldn't have to explain what I mean by "exactly as written".
In the case with q, that means character-by-character. With qq,
it means that the result of the string processing (translation of
character escapes such as \n and \t, and variable interpolation)
is interpolated directly.

That's (??{}). Why do you want to merge two different cases into one?

As I said in the previous post,

Interpolation of qr// should rewrite the regex, if necessary,
so that it matches the same text as it would match when used on
its own. This is much more useful, since you can then build
up a large regex from several small qr chunks, without having
to worry that modifications to one of the building blocks will
suddenly break regexes interpolated after it.

I think that the string type interpolation of qr that you think
is so well designed is an ugly kludge that makes big regexes
hard to maintain. I can't see *any* reason as to why you can't
simply create the regex as a regular string and interpolate that,
if you so desperately need separate regex building blocks that
can refer to each other. qr regexes could then be used when you
need the regexes to be completely shielded from each other (which
in my experience is *much* more common than wanting spaghetti
code regexes), and we wouldn't have to resort to (??{}) to get
something as common as backreferences.

I sure hope Ben Morrow was right when he said that qr interpolation
works the way I like it in Perl 6.
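
A note for readers on later Perls: Perl 5.10 and newer (not the 5.8.3 discussed in this thread) support relative backreferences, written \g{-1}, which refer to the group that closed most recently before that point in the pattern. The sketch below assumes such a Perl; the test strings are illustrative.

```perl
use strict;
use warnings;

# \g{-1} means "the capture group immediately before this point", so it
# still refers to its own (.) after being interpolated elsewhere.
my $re1 = qr/(.)\g{-1}/;
my $re2 = qr/($re1$re1)/;

my $ok  = ('aabb' =~ /\A$re2\z/) ? 1 : 0;
my @groups = ($1, $2, $3);          # outer group, first (.), second (.)
my $bad = ('abab' =~ /\A$re2\z/) ? 1 : 0;

print "aabb: $ok (@groups), abab: $bad\n";
```

The inner groups still capture into $2 and $3 of the combined regex, so this addresses the renumbering problem but not the capturing side effect.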
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Haakon Riiser
Who cares how it is designed? You asked me what made *me* expect
that qr regexes can be interpolated with predictable behavior.

Do not put words in my mouth, please.

The answer was, of course, that this would make sense to me,

So what the documentation says does not matter, right?

while the current design makes no sense since you can accomplish
the same thing by interpolating a string representation of the
regex, while the more useful case of localized regex scope w/o
capturing side effects is impossible to achieve.

I see that you not only do not read the docs, but also do not read the
answers to your questions on this newsgroup.

[Omitting meaningless suggestions already refuted in the preceding
discussion.]

Hope this helps,
Ilya
 
gnari

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to
Haakon Riiser
What makes you expect this? qr() is an analogue of qq() etc...
That's not how I would design it.
Who cares? What is important is how it *is* designed.
Who cares how it is designed? You asked me what made *me* expect
that qr regexes can be interpolated with predictable behavior.

Do not put words in my mouth, please.
The answer was, of course, that this would make sense to me,

So what the documentation says does not matter, right?
while the current design makes no sense since you can accomplish
the same thing by interpolating a string representation of the
regex, while the more useful case of localized regex scope w/o
capturing side effects is impossible to achieve.

I see that you not only do not read the docs, but also do not read the
answers to your questions on this newsgroup.

hey. no need to let this degenerate into a flame war.

looked to me like the OP was familiar with the way it works,
but was expressing his view that he would have expected it to
be implemented differently than it is. some of the follow-ups
have been interesting, actually, and the original question was not
without merit.

gnari.
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
gnari
hey. no need to let this degenerate into a flame war.

looked to me like the OP was familiar with the way it works,

Except the knowledge of the bug that qr(\2) does not work, I did not
observe any familiarity. He claims that the result of qr(whatever) is
the same as qq(whatever); he claims that some things cannot be done,
etc.
but was expressing his view that he would have expected it to
be implemented differently than it is.

I noticed this. But *why* do you think this view deserves to be
shared? Different people have different expectations. But the only
place this matters (once the initial design stage is behind us) is: if
the docs do not clear up the ambiguities, the docs must be corrected.

But it does not look like this is the topic of this discussion...

Yours,
Ilya
 
Haakon Riiser

[Ilya Zakharevich]

What are we really discussing here? In my last posts, I have
merely been stating how I would have designed qr interpolation,
and I have tried to describe the reasons for it. I now know that
it wasn't intended to work that way in Perl 5, and of course I
accept that. There's really nothing to argue over, unless you're
offended that I don't agree with the current implementation.
 
Haakon Riiser

[Ilya Zakharevich]
He claims that the result of qr(whatever) is the same as
qq(whatever);

Yes, I believed it was. If you could give me a simple example of
the potential differences in $re2 in the following two examples,
I would appreciate it. (Really, I'm not being sarcastic. :)

# Example 1: Interpolating a qr-regex
$re1 = qr(whatever);
$re2 = qr($re1);

# Example 2: Interpolating a regex stored as a qq-string
$re1 = qq(whatever);
$re2 = qr($re1);

he claims that some things cannot be done, etc.

Yes, I claimed that there was no way to use capturing in an
interpolated regex without causing some side effects. E.g.,
if you say

$re2 = qr($re1);

then

$data =~ $re2;

will capture into $1, $2, ... if you use capturing parentheses in
$re1. I claimed that it was impossible to use capturing locally in
$re1 without causing this side effect. If you can prove me wrong,
I'd be grateful if you can show me how to do it. It would actually
be of great help to me in the project I'm currently working on.
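
On the first question (qr object versus string), two differences can be shown with a short sketch. For a plain pattern like "whatever" with no flags and no alternation the two examples should match the same text; the patterns and test strings below are assumptions chosen to make the differences visible.

```perl
use strict;
use warnings;

# Difference 1: a qr object carries its own flags; a string has none.
my $obj_i = qr/whatever/i;
my $str_i = 'whatever';
my $flags_obj = ('WHATEVER' =~ /\A$obj_i\z/) ? 1 : 0;   # /i travels with the object
my $flags_str = ('WHATEVER' =~ /\A$str_i\z/) ? 1 : 0;   # plain text, no /i

# Difference 2: a qr object is interpolated as a self-contained group,
# while a string is pasted in as-is.
my $obj_alt = qr/a|b/;
my $str_alt = 'a|b';
my $group_obj = ('xb' =~ /\A$obj_alt\z/) ? 1 : 0;   # behaves like \A(?:a|b)\z
my $group_str = ('xb' =~ /\A$str_alt\z/) ? 1 : 0;   # behaves like \Aa|b\z

print "flags: obj=$flags_obj str=$flags_str\n";
print "grouping: obj=$group_obj str=$group_str\n";
```

Neither difference changes how capture groups are numbered, which is the behavior the thread started with.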
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Haakon Riiser
Yes, I claimed that there was no way to use capturing in an
interpolated regex without causing some side effects. E.g.,
if you say

$re2 = qr($re1);

then

$data =~ $re2;

will capture into $1, $2, ... if you use capturing parentheses in
$re1. I claimed that it was impossible to use capturing locally in
$re1 without causing this side effect. If you can prove me wrong,

If you specify your problem, I'm sure a lot of people will be glad to
help you. I, personally, cannot grok what it is exactly you want to
achieve.

Hope this helps,
Ilya
 
