I would like to ask for help. Could anybody explain to me why the regex
"/(X*)/" is not able to catch X in the string "aXXXb"? The quantifier "*" in
this regex is greedy and there is no "^" anchor, so I would
expect $1 to contain XXX. I know (I have read it) that it is
possible to use + instead of *, but I would like to know why the "*"
quantifier doesn't catch it.
I have found this example in perlretut:
Finally,
"aXXXb" =~ /(X*)/; # matches with $1 = ''
because it can match zero copies of 'X' at the beginning of
the string. If you definitely want to match at least one
'X', use "X+", not "X*".
Basically, a lot of people mistakenly think that the greedy '*'
quantifier makes m/(X*)/ match the LONGEST string of Xs. But in
reality, m/(X*)/ matches AS SOON AS POSSIBLE, that is, at the leftmost
position where any match can succeed, and '*' just makes (X*) gobble
as much as it can starting from that position.
I believe it was "Learning Perl" (the "llama" book) that said
that if a regular expression can match an empty string, then it will
always return true no matter what string it is given. And the regular
expression m/(X*)/ does return true when used with an empty string, as
'' contains zero-or-more instances of 'X'. Therefore, even this
match succeeds:
"ab" =~ /(X*)/; # $1 gets set to ''
It succeeds because it found zero-or-more Xs at the very beginning of
the string. Likewise, the match:
"aXXXb" =~ /(X*)/; # $1 gets set to ''
also succeeds by finding zero-or-more Xs at the very beginning of the
string. It stops searching after that because it found a match, and
has no need to continue any further.
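If you want to see for yourself where the match happened, here is a small sketch using Perl's built-in @- array, whose first element holds the start offset of the last successful match. The helper name where_matched is just something I made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the start offset of the whole match
# and the contents of the first capture group, or an empty list on failure.
sub where_matched {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    my ($start, $captured) = ($-[0], $1);   # $-[0] is the offset where the match began
    return ($start, $captured);
}

my ($pos_star, $cap_star) = where_matched('aXXXb', qr/(X*)/);
my ($pos_plus, $cap_plus) = where_matched('aXXXb', qr/(X+)/);

print "X*: offset $pos_star, captured '$cap_star'\n";   # X*: offset 0, captured ''
print "X+: offset $pos_plus, captured '$cap_plus'\n";   # X+: offset 1, captured 'XXX'
```

The '*' version succeeds at offset 0, before the engine ever looks at the Xs, while the '+' version cannot succeed until offset 1.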
If you really wanted a regular expression that would match at least
one X, then you should use the '+' quantifier instead of '*', like
this:
"aXXXb" =~ /(X+)/; # $1 gets set to 'XXX'
but since it still matches as soon as possible, it wouldn't match a
longer string of Xs, as shown here:
"aXXXbXXXXXc" =~ /(X+)/; # $1 still gets set to 'XXX'
If you wanted to match the longest string of Xs, you'd have to loop
through all the strings of Xs and record the longest one. You can do
this with the /g modifier like this:
my $longestString = '';
while ( "aXXXbXXXXXcXd" =~ m/(X+)/g )
{
    $longestString = $1 if length($1) > length($longestString);
}
print "$longestString\n"; # prints 'XXXXX'
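As a variation on the loop above, you could also collect every run of Xs at once by using /g in list context and then pick the longest. This is only a sketch: the sub name longest_run is made up for the example, and it relies on reduce from the core List::Util module.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical helper: returns the longest run of Xs in a string,
# or undef if the string contains no X at all.
sub longest_run {
    my ($string) = @_;
    my @runs = $string =~ /(X+)/g;    # in list context, /g returns every capture
    return reduce { length($b) > length($a) ? $b : $a } @runs;
}

my $longest = longest_run('aXXXbXXXXXcXd');
print "$longest\n";    # prints 'XXXXX'
```

When two runs are equally long, this version keeps the first one, just like the while loop does.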
So remember, it is a mistake to think that the '*' and '+'
quantifiers match the longest instance of a string; they just match as
much as they can (or "gobble" up as much as they can) once a match has
been found -- even if the match was found at the very beginning of the
string.
This means that if a regular expression can match an empty string,
then it will match an empty string at the very beginning of the string
unless what the '*' is quantifying happens to be right there.
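To see every place where (X*) is able to match, empty matches included, you can run it with /g in list context. This is just an illustrative sketch, and the variable name @captures is my own:

```perl
use strict;
use warnings;

# Collect what (X*) captures at each successive match position.
my @captures = "aXXXb" =~ /(X*)/g;

# Quote each capture so the empty strings are visible.
print join( ',', map { "'$_'" } @captures ), "\n";    # prints '','XXX','',''
```

The first capture is the empty match in front of the 'a', which is the one a plain m/(X*)/ without /g stops at.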
I hope this explanation helps.
-- Jean-Luc