Regular Expressions: Greedy Matching

J

Jose Luis

Perl regular expressions normally match the longest string possible, but this example prints "a|a". Why does it match?


<<snip>>

"aax" =~ m/(^a*)([^x])/;
print "$1|$2\n" ;

<<snip>>


Thanks in advance,
Jose Luis.
 
P

Peter Makholm

Jose Luis said:
Perl regular expressions normally match the longest string possible,
but this example prints "a|a". Why does it match?

Of all the possible matches, Perl will in general find the lef-most
longest match.
"aax" =~ m/(^a*)([^x])/; print "$1|$2\n" ;

There are two possible matches (a bit sloppy syntax):

$1 = "^a", $2 = "a"

and

$1 = "^", $2 = "a"

The first of these posibilities is the left-most longest match.

Did you expect $1 to be something like "^aa"? Perl tried this but
decided that it was not possible to comply with the requirement that $2
should consist of one non-"x" character. And therefore it backtracked
and tried something shorter for $1.

//Makholm
 
U

Uri Guttman

JL> Perl regular expressions normally match the longest string possible, but this example prints "a|a". Why does it match?

JL> "aax" =~ m/(^a*)([^x])/;
JL> print "$1|$2\n" ;

greed doesn't overrule actual matching. the regex first will match all
of the leading 'a' chars but then the not x will fail. so it backtracks
and shortens the previous match (because it can) and tries again from
there. then the [^x] matches the 2nd 'a' and the match succeeds. what
you are thinking about is disallowing backtracking. perl6 can do that
and there is a way to do that in perl5 with the (?< ) construct.
it is documented in perlre.

another new feature which is simpler to use is adding a + after a
quantifier. this is called a possesive quantifier and it won't give back
text when backtracking. this is also in perlre. these are both in 5.10.

so your code would be changed to (untested):

"aax" =~ m/(^a*+)([^x])/;

and that should not match as the first grab will get all the 'a' chars
and not give any back during backtracking and the second part will never
match.

uri
 
I

Ilya Zakharevich

Perl regular expressions normally match the longest string possible

What made you think so? Which match is prefered has NOTHING to do
with length. All documented...

Ilya
 
P

Peter Makholm

Ilya Zakharevich said:
What made you think so? Which match is prefered has NOTHING to do
with length. All documented...

The Camel Book does say: "Perl's matching is normally leftmost longest;
with minimal matching it becomes leftmost shortest. But the leftmost
part never varies ad is the dominant criterion"

(Third edition, chapter 5. (page 178))

Well, nowadays there are fun tricks with possesive quantifiers,
independent subexpressions and verbs to control backtracking. But to say
that it has NOTHING to do with length and that this is documented
requires you to leave the Camel Book out of the corpus of Perl5
documentation.

Not that this makes Jose Luis's formulation correct. But in general
length has some relevance.

//Makholm
 
J

Jim Gibson

<1f7de2a9-fd67-4db1-b6d3-1da850c956c2@glegroupsg2000goo.googlegroups.com>
Perl regular expressions normally match the longest string possible, but this
example prints "a|a". Why does it match?


<<snip>>

"aax" =~ m/(^a*)([^x])/;
print "$1|$2\n" ;

Perl will _start_ with the longest possible match for 'a*' in your
regular expression (because a non-qualified '*' is "greedy"), but if it
fails to match because of what follows, then it will shorten the
sub-match by one character and try again. Since the next character
after the first-matched 'aa' is 'x', and you have specified that the
next character be anything but 'x' ([^x]), the match with 'aa' fails.
Perl tries again with 'a', and since the next 'a' matches [^x], the
match succeeds and "a|a" is printed.
 
C

C.DeRykus

Perl regular expressions normally match the longest string possible, but this example prints "a|a". Why does it match?

<<snip>>

"aax" =~ m/(^a*)([^x])/;
print "$1|$2\n" ;

Maybe you're after one of the below:


m/(^a*)([^x]*)/ # * is 0 or more

m/(^a*)([^x]?)/ # ? is 0 or one

This'll enable a max match of "a" for $1 since
* or ? quantifiers will cause $2 to match even
with zero [^x] characters.
 
I

Ilya Zakharevich

The Camel Book does say: "Perl's matching is normally leftmost longest;
with minimal matching it becomes leftmost shortest. But the leftmost
part never varies ad is the dominant criterion"

(Third edition, chapter 5. (page 178))

Well, nowadays there are fun tricks with possesive quantifiers,
independent subexpressions and verbs to control backtracking. But to say
that it has NOTHING to do with length and that this is documented
requires you to leave the Camel Book out of the corpus of Perl5
documentation.

??? Perl documentation comes with Perl. Does Camel book come with
Perl? Make your own conclusion...
Not that this makes Jose Luis's formulation correct. But in general
length has some relevance.

"Some relevance" and "nothing to do" are practically synonims...

Hope this helps,
Ilya
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top