Regex /(X*)/

ulrich_martin

Hello,

I would like to ask you for help. Could anybody explain to me why the regex
"/(X*)/" is not able to catch the Xs in the string "aXXXb"? The quantifier "*" in
this regex is greedy and there is no "^" anchor, so I would
expect $1 to contain XXX. I know (I have read it) that it is
possible to use + instead of *, but I would like to know why the "*"
quantifier doesn't catch it.

I have found this example in perlretut:
Finally,
"aXXXb" =~ /(X*)/; # matches with $1 = ''
because it can match zero copies of 'X' at the beginning of
the string. If you definitely want to match at least one
'X', use "X+", not "X*".

M.
 
Damian Lukowski

> I have found this example in perlretut:
> Finally,
> "aXXXb" =~ /(X*)/; # matches with $1 = ''
> because it can match zero copies of 'X' at the beginning of
> the string. If you definitely want to match at least one
> 'X', use "X+", not "X*".

Well, that is the explanation. Perl tries to match as soon as possible.
"Sooner" is more important than "longer".
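
To make "sooner beats longer" concrete, here is a minimal sketch that records where each match starts, using Perl's built-in @- array (start offsets of the last successful match). The variable names are just for illustration.

```perl
use strict;
use warnings;

my $str = "aXXXb";

# X* succeeds at offset 0 with an empty match, so the engine never reaches the Xs.
$str =~ /(X*)/ or die "unexpected: no match";
my ($star_start, $star_text) = ($-[0], $1);

# X+ cannot succeed at offset 0, so the engine moves on and succeeds at offset 1.
$str =~ /(X+)/ or die "unexpected: no match";
my ($plus_start, $plus_text) = ($-[0], $1);

print "X*: offset $star_start, captured '$star_text'\n";   # X*: offset 0, captured ''
print "X+: offset $plus_start, captured '$plus_text'\n";   # X+: offset 1, captured 'XXX'
```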
 
Paul Lalli

> Hello,
>
> I would like to ask you for help. Could anybody explain to me why the regex
> "/(X*)/" is not able to catch the Xs in the string "aXXXb"? The quantifier "*" in
> this regex is greedy and there is no "^" anchor, so I would
> expect $1 to contain XXX. I know (I have read it) that it is
> possible to use + instead of *, but I would like to know why the "*"
> quantifier doesn't catch it.
>
> I have found this example in perlretut:
> Finally,
> "aXXXb" =~ /(X*)/; # matches with $1 = ''
> because it can match zero copies of 'X' at the beginning of
> the string. If you definitely want to match at least one
> 'X', use "X+", not "X*".

Because greediness takes second place to position. Perl attempts to
find the FIRST match that it can. Once it's started successfully
matching, only then does the greediness of quantifiers come into play.

Take a look at all the places /(X*)/ could match aXXXb....

while ("aXXXb" =~ /(X*)/g) {
    print "$`<<$&>>$'\n";
}

<<>>aXXXb
a<<XXX>>b
aXXX<<>>b
aXXXb<<>>


The first time through, it matches the empty string right at the
beginning of the string.
The second time through, it matches the XXX.
The third time through, it matches the empty string between the last X
and the b.
The final time through, it matches the empty string after the b, at the
end of the string.


Paul Lalli
 
jl_post

> I would like to ask you for help. Could anybody explain to me why the regex
> "/(X*)/" is not able to catch the Xs in the string "aXXXb"? The quantifier "*" in
> this regex is greedy and there is no "^" anchor, so I would
> expect $1 to contain XXX. I know (I have read it) that it is
> possible to use + instead of *, but I would like to know why the "*"
> quantifier doesn't catch it.
>
> I have found this example in perlretut:
> Finally,
> "aXXXb" =~ /(X*)/; # matches with $1 = ''
> because it can match zero copies of 'X' at the beginning of
> the string. If you definitely want to match at least one
> 'X', use "X+", not "X*".


Basically, a lot of people mistakenly think that the greedy '*'
quantifier makes m/(X*)/ match the LONGEST string of Xs. But in
reality, m/(X*)/ matches AS SOON AS POSSIBLE, and '*' just makes (X*)
gobble as much as it can once a match is found.

I believe it was "Learning Perl" (the "llama" book) that said
that if a regular expression can match the empty string, then it will
always return true no matter what string it is given. And the regular
expression m/(X*)/ does return true when used with an empty string, as
'' has zero-or-more instances of 'X' inside it. Therefore, even this
match succeeds:

"ab" =~ /(X*)/; # $1 gets set to ''

It succeeds because it found zero-or-more Xs at the very beginning of
the string. Likewise, the match:

"aXXXb" =~ /(X*)/; # $1 gets set to ''

also succeeds by finding zero-or-more Xs at the very beginning of the
string. It stops searching after that because it found a match, and
has no need to continue any further.
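
As a quick check of that claim, the sketch below runs the same pattern against a few sample strings (chosen only for illustration) and records what $1 holds after each match:

```perl
use strict;
use warnings;

my %captured;
for my $str ('', 'ab', 'aXXXb') {
    if ($str =~ /(X*)/) {
        # Every string has "zero Xs" at offset 0, so this branch always runs.
        $captured{$str} = $1;
        print "'$str' matched, \$1 = '$1'\n";
    }
    else {
        print "'$str' did not match\n";   # never reached for this pattern
    }
}
```

All three strings match, and in each case $1 is the empty string found at offset 0.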

If you really wanted a regular expression that would match at least
one X, then you should use the '+' quantifier instead of '*', like
this:

"aXXXb" =~ /(X+)/; # $1 gets set to 'XXX'

but since it still matches as soon as possible, it wouldn't match a
longer string of Xs, as shown here:

"aXXXbXXXXXc" =~ /(X+)/; # $1 still gets set to 'XXX'

If you wanted to match the longest string of Xs, you'd have to loop
through all the strings of Xs and record the longest one. You can do
this with the /g modifier like this:

my $longestString = '';
while ( "aXXXbXXXXXcXd" =~ m/(X+)/g )
{
    $longestString = $1 if length($1) > length($longestString);
}
print "$longestString\n"; # prints 'XXXXX'

So remember, it is a mistake to think that the '*' and '+'
quantifiers match the longest instance of a string; they just match as
much as they can (or "gobble" up as much as they can) once a match has
been found -- even if the match was found at the very beginning of the
string.

This means that if a regular expression can match an empty string,
then the '*' quantifier will match an empty string at offset 0 unless
the thing it quantifies actually occurs at the very beginning of the
string.
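
Here is a small sketch of that last point. The sample strings and the /a(X*)/ variant are my own additions for illustration: the first shows the Xs sitting at the start of the string, and the last shows that putting a literal before the group also pins down where (X*) is allowed to begin.

```perl
use strict;
use warnings;

# Xs at the very beginning: the leftmost match is also a non-empty one.
my ($at_start) = "XXXab" =~ /(X*)/;

# Xs later in the string: the leftmost match is the empty one at offset 0.
my ($later) = "aXXXb" =~ /(X*)/;

# A literal 'a' before the group forces (X*) to start right after the 'a'.
my ($after_a) = "aXXXb" =~ /a(X*)/;

print "at start: '$at_start'\n";   # at start: 'XXX'
print "later:    '$later'\n";      # later:    ''
print "after a:  '$after_a'\n";    # after a:  'XXX'
```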

I hope this explanation helps.

-- Jean-Luc
 
xhoster

> Hello,
>
> I would like to ask you for help. Could anybody explain to me why the regex
> "/(X*)/" is not able to catch the Xs in the string "aXXXb"? The quantifier "*" in
> this regex is greedy and there is no "^" anchor, so I would
> expect $1 to contain XXX.

"Greedy" is a term of art in computer science. It does not have exactly
the same meaning as it does in religion or ethics or Marxism. Alas,
even the term of art isn't all that unambiguous in this context, as
there is no objective way of knowing what "locally optimal" means in the
regex context. Fortunately the documentation doesn't rely on you knowing
exactly what it means by greedy; it goes on to explain what the behavior
actually is. So don't get hung up on loaded words. I think the docs should
remove that reference and just stick to describing the behavior explicitly.

> I know (I have read it) that it is
> possible to use + instead of *, but I would like to know why the "*"
> quantifier doesn't catch it.

Because it doesn't look ahead to see what better thing in the future might
happen, it makes local decisions. That is what greedy means in the term
of art, but in this case it is applying not to the "*" but to the scanning.
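
A rough model of that scanning, as a sketch only: first_match is a hypothetical helper written for this post, and the real engine is far more optimized than this loop. It tries the pattern at offset 0, then 1, and so on, and takes the first offset where it succeeds without ever comparing against later candidates. The \G assertion pins each attempt to the current pos().

```perl
use strict;
use warnings;

# Hypothetical helper: try $re at each start offset, left to right, and
# return (offset, matched text) for the first attempt that succeeds.
sub first_match {
    my ($str, $re) = @_;
    for my $start (0 .. length($str)) {
        pos($str) = $start;              # where the next attempt must begin
        if ($str =~ /\G($re)/gc) {       # \G anchors the attempt at pos()
            return ($start, $1);         # local decision: first success wins
        }
    }
    return;                              # no offset worked
}

my ($star_off, $star_txt) = first_match("aXXXb", qr/X*/);
my ($plus_off, $plus_txt) = first_match("aXXXb", qr/X+/);

print "X*: offset $star_off, text '$star_txt'\n";   # X*: offset 0, text ''
print "X+: offset $plus_off, text '$plus_txt'\n";   # X+: offset 1, text 'XXX'
```

Within a single attempt the quantifier is still greedy (X+ takes all three Xs at offset 1); what never happens is a comparison between the attempt at offset 0 and a "better" one further right.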

Xho

 
ulrich_martin

(e-mail address removed) wrote on VCCLXXIX September MCMXCIII:
$$ Hello,
$$
$$ I would like to ask you for help. Could anybody explain to me why the regex
$$ "/(X*)/" is not able to catch the Xs in the string "aXXXb"? The quantifier "*" in
$$ this regex is greedy and there is no "^" anchor, so I would
$$ expect $1 to contain XXX. I know (I have read it) that it is
$$ possible to use + instead of *, but I would like to know why the "*"
$$ quantifier doesn't catch it.

Because if a regexp can match in more than one place in the subject string,
it will match at the leftmost position.

/(X*)/ matches 0 or more X's. "aXXXb" starts with zero X's. So it matches
at the beginning of the string. With 0 X's.

$$ I have found this example in perlretut:
$$ Finally,
$$ "aXXXb" =~ /(X*)/; # matches with $1 = ''
$$ because it can match zero copies of 'X' at the beginning of
$$ the string. If you definitely want to match at least one
$$ 'X', use "X+", not "X*".

Right.

Abigail

Thank you all very much for the perfect explanations.
 
