Can someone 'splain why this regex won't work both ways?

spydox · Apr 14, 2008

I'm trying to find a repeated number in a string, like 122345 finds
22.

This works:

/(\d)\1/

This doesn't:

/\1(\d)/

I guess LLR parsing is to blame, but shouldn't the second example
first try to FIND a $1 then check to see if there is a \1, and repeat
that process moving L to R?

I though Perl sort of went to and fro trying to do matching. To me,
there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
preceeding it.

I was a little surprized this didn't work although I can sort of see
why in a way too. In some ways it seems to me that regexes should be
*disconnected* from parsing - just answer the question does this
match?

Ben Morrow · Apr 14, 2008

Quoth (e-mail address removed):

I'm trying to find a repeated number in a string, like 122345 finds
22.

This works:

/(\d)\1/

This doesn't:

/\1(\d)/

I guess LLR parsing is to blame, but shouldn't the second example
first try to FIND a $1 then check to see if there is a \1, and repeat
that process moving L to R?

I though Perl sort of went to and fro trying to do matching. To me,
there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
preceeding it.

There are two separate operations here which you are confusing. First
perl parses the regex itself, and compiles it into an internal form.
Then it matches that regex against the string you provide. The second
will backtrack, under some circumstances; the first won't.

Ben

A. Sinan Unur · Apr 14, 2008

(e-mail address removed) wrote in
(e-mail address removed)
:

I'm trying to find a repeated number in a string, like 122345
finds 22.

This works:

/(\d)\1/

This doesn't:

/\1(\d)/

I guess LLR parsing is to blame,
....

I was a little surprized this didn't work although I can sort of
see why in a way too. In some ways it seems to me that regexes
should be *disconnected* from parsing - just answer the question
does this match?

I don't look at this as a parsing issue. Rather, it is a "the
universe must make sense" kind of issue: The first match does not
exist before the first match. That makes sense to me. It may not
make sense to you.

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

spydox · Apr 14, 2008

Quoth (e-mail address removed):

There are two separate operations here which you are confusing. First
perl parses the regex itself, and compiles it into an internal form.
Then it matches that regex against the string you provide. The second
will backtrack, under some circumstances; the first won't.

Ben

Understood, and I appreciate the insight. It makes sense.
Yet, when all else apparently *fails*, in my experience, and I've
heard MJD and others say this, Perl will "do its best" to match. To
me, unless it *also* tried backtracking, it gave up too soon..

spydox · Apr 14, 2008

..
..
..

..
..

I don't look at this as a parsing issue. Rather, it is a "the
universe must make sense" kind of issue: The first match does not
exist before the first match. That makes sense to me. It may not
make sense to you.

To me, like conventional pattern-recognition, of say two tanks next to
each other, the system should accept it whether the match is described
either way:

find a tank with another identical tank to it's left

*or*

find a tank with another identical tank to it's right

The system should have no *context-sensitivity* where only one of the
two matches. Sure, internally an algorithm may be scanning L to R or R
to L or whatever, but the user should not even be concerned with that,
at least in this case. I still think it gave up too soon- it should
have tried R to L (backtracking) when L to R failed.

Just IMHO, thank-you for your thoughts. This area seems just a bit
gray to me I'd be very interested in Damain or Mark's thoughts.

Ben Morrow · Apr 14, 2008

Quoth (e-mail address removed):

Understood, and I appreciate the insight. It makes sense.
Yet, when all else apparently *fails*, in my experience, and I've
heard MJD and others say this, Perl will "do its best" to match. To
me, unless it *also* tried backtracking, it gave up too soon..

No, you're still not understanding. Perl will only backtrack *while
trying to match*. Compiling the regex comes long before that.

Ben

Willem · Apr 14, 2008

(e-mail address removed) wrote:
) Understood, and I appreciate the insight. It makes sense.
) Yet, when all else apparently *fails*, in my experience, and I've
) heard MJD and others say this, Perl will "do its best" to match. To
) me, unless it *also* tried backtracking, it gave up too soon..

That's not what backtracking means.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

J.D. Baldwin · Apr 14, 2008

In the previous article said:
To me, like conventional pattern-recognition, of say two tanks next to
each other, the system should accept it whether the match is described
either way:

find a tank with another identical tank to it's left

*or*

find a tank with another identical tank to it's right

A better phrasing:

find a tank, then find another one to its right

*or*

find another one to its left, then find a tank

One of these phrasings makes sense; the other does not. Or, rather,
the other doesn't and one of the phrasings makes sense.

If you want a more formal justification, here's what the Camel Book
says about these. Note the two instances of the word "later":

1.7.4. Backreferences

[...] A pair of parentheses around a part of a regular
expression causes whatever was matched by that part to be
remembered for later use. It doesn't change what the part
matches, so /\d+/ and /(\d+)/ will still match as many digits
as possible, but in the latter case they will be remembered in
a special variable to be backreferenced later.

A. Sinan Unur · Apr 14, 2008

(e-mail address removed) wrote in
@f63g2000hsf.googlegroups.co
m:

[ please do not snip attributions ]

To me, like conventional pattern-recognition, of say two tanks
next to each other, the system should accept it whether the match
is described either way:

find a tank with another identical tank to it's left

*or*

find a tank with another identical tank to it's right

The system should have no *context-sensitivity* where only one of
the two matches. Sure, internally an algorithm may be scanning L
to R or R to L or whatever, but the user should not even be
concerned with that, at least in this case. I still think it gave
up too soon- it should have tried R to L (backtracking) when L to
R failed.

What you seem to want is a "match two identical characters"
operator. For this particular case, you can achieve that by using:

=for example

my @strings = qw( 1222345 1233345 );

s/00|11|22|33|44|55|66|77|88|99// for @strings;

print "$_\n" for @strings;

=cut

When you use a character class, every element of that class is
considered equivalent to every other one. So, for example, when you
write

/\d{2}/

that does find two characters that are in the same equivalence
class.

The tank analogy works perftectly here because there are no two
identical tanks in the world. Instead, there are equivalence classes
of tanks. Tanks that are the same model, tanks in the same unit etc.

If what you want is to say,

find a tank, then find another tank that is the same
model as the one you just found

well, that is equivalent to /(\d)\1/

J. D. Baldwin gives perfect examples of why /\1(\d)/ does not make
sense: Finding another tank in the same equivalence class as the one
you first found comes after first finding a tank.

Just IMHO, thank-you for your thoughts. This area seems just a bit
gray to me I'd be very interested in Damain or Mark's thoughts.

s/Damain/Damian/

My feeble mind looks at the following:

#!/usr/bin/perl

use strict;
use warnings;

use 5.010;

for ( my @a = qw( 1222345 1233345 ) ) {
s/(?<tank>\d)\K\k<tank>// and print "$_\n";
}

for ( my @a = qw( 1222345 1233345 ) ) {
s/(?<tank>\d)\K\k<tank>+// and print "$_\n";
}

for ( my @a = qw( 1222345 1233345 ) ) {
s/(?<tank>\d)\k<tank>// and print "$_\n";
}

__END__

thinks that the third one is the most natural (that is, find a tank,
then find another tank in the same equivalence class) to the other
ones.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

xhoster · Apr 14, 2008

Ben Morrow said:
Quoth (e-mail address removed):

No, you're still not understanding. Perl will only backtrack *while
trying to match*. Compiling the regex comes long before that.

I think that that is what he is talking about, the when trying to match
part. His use of "parsing" in the original question was ill-advised, but I
think you latched onto the the bad phrasing and rather than the intended
question, and now won't let him correct his poor phrasing. First perl
parses and compiles the regular expression, then it uses that compiled
expression to match (or loosely speaking "parse") the target string.

Perl parses and compiles /\1(.)/ without error or warning (which surprised
me). But then what does it do with it?

Conceptually, it could temporarily treat the \1 as ".*", and then when/if
the capture is matched go back and verify that it is the same as the thing
previously matched by the tentative .* cum \1. I don't know that I would
call this backtracking (as the OP seems to be doing), but I can't think of
anything obviously better to call it. Or it could reorder things give
an identical compiled regex as /(.)\1/. I don't know if these two things
would give the same answer as each other in all cases (if so, the latter
would surely be faster).

I think that that is what the OP thought it should do. Obviously, Perl
doesn't do either of those thing. I can't figure out what it does do. I
thought it would treat \1 preceding any capture as the empty string, but
apparently it doesn't do that, either. It seems to act as something
unmatchable.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Peter J. Holzer · Apr 14, 2008

To me, like conventional pattern-recognition, of say two tanks next to
each other, the system should accept it whether the match is described
either way:

find a tank with another identical tank to it's left

*or*

find a tank with another identical tank to it's right

The system should have no *context-sensitivity* where only one of the
two matches. Sure, internally an algorithm may be scanning L to R or R
to L or whatever, but the user should not even be concerned with that,
at least in this case. I still think it gave up too soon- it should
have tried R to L (backtracking) when L to R failed.

Backtracking doesn't mean scanning right to left. Backtracking means to
go back to the last point where you had a choice and try the other
alternative(s).

So, for example if you have a pattern /foo(bar|baz)/, after matching
"foo", you have a choice between trying to match "bar" or "baz". The
regex engine will try to match "bar" first, and if that fails, it will
backtrack to the point before it tried that and then try to match "baz"
instead.

But in a pattern like /\1(a)/ there is no choice: It needs to start by
matching the string in the first capture group, but that hasn't been
defined yet, so it must fail. (Well, it could try all possible strings,
but that would be extremely inefficient).

hp

xhoster · Apr 14, 2008

A better phrasing:

find a tank, then find another one to its right

*or*

find another one to its left, then find a tank

One of these phrasings makes sense; the other does not. Or, rather,
the other doesn't and one of the phrasings makes sense.

If you want a more formal justification, here's what the Camel Book
says about these. Note the two instances of the word "later":

I think that that is his point, an objection to the notion that
left and right typographically equate to "earlier" and "later"
chronologically.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Ilya Zakharevich · Apr 14, 2008

[A complimentary Cc of this posting was sent to

I'm trying to find a repeated number in a string, like 122345 finds
22.

This works:

/(\d)\1/

This doesn't:

/\1(\d)/

This depends on what you mean by "works". It works in the sense that
it does not match (as it should not). I do not find it documented in
perlre, but \3 will fail to match if group 3 did not match "yet".

Hope this helps,
Ilya

P.S. perl -Mre=debugcolor -wle "q(aa) =~ /\1(a)/"

Ben Morrow · Apr 15, 2008

Quoth (e-mail address removed):

I think that that is what he is talking about, the when trying to match
part. His use of "parsing" in the original question was ill-advised, but I
think you latched onto the the bad phrasing and rather than the intended
question, and now won't let him correct his poor phrasing. First perl
parses and compiles the regular expression, then it uses that compiled
expression to match (or loosely speaking "parse") the target string.

You're right, I was misunderstanding the OP's misunderstanding.

Perl parses and compiles /\1(.)/ without error or warning (which surprised
me). But then what does it do with it?

I was assuming (without having tried it) that the regex was failing at
the compile stage. It seems I was wrong...

Conceptually, it could temporarily treat the \1 as ".*", and then when/if
the capture is matched go back and verify that it is the same as the thing
previously matched by the tentative .* cum \1.

Something like this can be done with

/(.*)(a)(??{ $1 eq $2 ? "(?

" : "(?!)" })/

using a code assertion to insert either a 'succeed' or a 'fail and
backtrack' item into the regex at runtime. Not that I'd recommend this,
of course...

I don't know that I would
call this backtracking (as the OP seems to be doing), but I can't think of
anything obviously better to call it. Or it could reorder things give
an identical compiled regex as /(.)\1/. I don't know if these two things
would give the same answer as each other in all cases (if so, the latter
would surely be faster).

I think that that is what the OP thought it should do. Obviously, Perl
doesn't do either of those thing. I can't figure out what it does do. I
thought it would treat \1 preceding any capture as the empty string, but
apparently it doesn't do that, either. It seems to act as something
unmatchable.

All the $N start as undef, which is unmatchable as you found (-Mre=debug
is useful for finding out what is going on). Whenever perl backtracks to
retry part of a match, it clears any $N set by the part of the match it
is backtracking over, so /\1(.)/ couldn't match even if it did
backtrack, as $1 would be undef again by the time it got to retry the
\1. (Perl doesn't in fact backtrack, as it knows nothing has changed so
this would be an infinite loop.)

It is, however, possible to get \1 to match when it appears earlier in
the expression than the first brackets (which is why it's not a syntax
error); you just have to make sure it gets set first. For instance,

"abac" =~ /^(?:\1c|(a)b)+$/

matches. The first time through the +, $1 is undef so the \1c part fails;
but the (a)b part succeeds so $1 gets set. Then it goes round the + loop
again, and this time $1 is 'a' so the first branch can succeed.

Ben

comp.llang.perl.moderated · Apr 16, 2008

...

It is, however, possible to get \1 to match when it appears earlier in
the expression than the first brackets (which is why it's not a syntax
error); you just have to make sure it gets set first. For instance,

"abac" =~ /^(?:\1c|(a)b)+$/

matches. The first time through the +, $1 is undef so the \1c part fails;
but the (a)b part succeeds so $1 gets set. Then it goes round the + loop
again, and this time $1 is 'a' so the first branch can succeed.

Slightly different example sans alternation could work too if $N is
optional:

"abab" =~ /(?

\2)?(b))+/

I'd have the same enthusiasm for a root canal though...

Ilya Zakharevich · Apr 17, 2008

[A complimentary Cc of this posting was sent to
comp.llang.perl.moderated

Slightly different example sans alternation could work too if $N is
optional:

"abab" =~ /(?\2)?(b))+/

I'd have the same enthusiasm for a root canal though...

Extra parens are considered harmfull:

perl -wle "q(abab) =~ /(?: \1? (b) )+/x or die"

(not much beauty gained or lost, of course...).

Yours,
Ilya

Can someone tell me what's wrong with this question on StackOverflow?	0	Aug 19, 2023
Decoding no of ways and printing each decode message	2	Jun 1, 2021
can someone help me with this baffling regex ?	4	Feb 8, 2011
Why is this WordPress comments form not submitting?	1	Jan 12, 2020
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
Why won't my (j)query work?	86	Dec 29, 2009
Doing both regex match and assignment within a If loop?	7	Mar 29, 2013
why does this work?	6	Aug 8, 2012

Can someone 'splain why this regex won't work both ways?

spydox

Ben Morrow

A. Sinan Unur

spydox

spydox

Ben Morrow

Willem

J.D. Baldwin

A. Sinan Unur

xhoster

Peter J. Holzer

xhoster

Ilya Zakharevich

Ben Morrow

comp.llang.perl.moderated

Ilya Zakharevich

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads