Regex /(X*)/

Discussion in 'Perl Misc' started by ulrich_martin@seznam.cz, Feb 13, 2008.

  1. Guest

    Hello,

    I would like to ask you for help. Could anybody explain to me why the regex
    "/(X*)/" is not able to catch X in the string "aXXXb"? The quantifier "*" in
    this regex is a greedy one and there is no anchor "^", so I would
    expect that $1 would contain XXX. I know (I have read it) that it is
    possible to use + instead of *, but I would like to know why the "*"
    quantifier doesn't catch it.

    I have found this example in perlretut:
    Finally,
    "aXXXb" =~ /(X*)/; # matches with $1 = ''
    because it can match zero copies of 'X' at the beginning of
    the string. If you definitely want to match at least one
    'X', use "X+", not "X*".

    M.
     
    , Feb 13, 2008
    #1

  2. Damian Lukowski wrote:
    > I have found this example in perlretut:
    > Finally,
    > "aXXXb" =~ /(X*)/; # matches with $1 = ''
    > because it can match zero copies of 'X' at the beginning of
    > the string. If you definitely want to match at least one
    > 'X', use "X+", not "X*".


    Well, that is the explanation. Perl tries to match as soon as possible.
    "Sooner" is more important than "longer".
     
    Damian Lukowski, Feb 13, 2008
    #2

  3. Paul Lalli Guest

    On Feb 13, 3:25 am, wrote:
    > Hello,
    >
    > I would like to ask you for help. Could anybody explain to me why the regex
    > "/(X*)/" is not able to catch X in the string "aXXXb"? The quantifier "*" in
    > this regex is a greedy one and there is no anchor "^", so I would
    > expect that $1 would contain XXX. I know (I have read it) that it is
    > possible to use + instead of *, but I would like to know why the "*"
    > quantifier doesn't catch it.
    >
    > I have found this example in perlretut:
    > Finally,
    > "aXXXb" =~ /(X*)/; # matches with $1 = ''
    > because it can match zero copies of 'X' at the beginning of
    > the string.  If you definitely want to match at least one
    > 'X', use "X+", not "X*".


    Because greediness takes second place to position. Perl attempts to
    find the FIRST match that it can. Once it's started successfully
    matching, only then does the greediness of quantifiers come into play.

    Take a look at all the places /(X*)/ could match aXXXb....

    while ("aXXXb" =~ /(X*)/g) {
        print "$`<<$&>>$'\n";
    }

    <<>>aXXXb
    a<<XXX>>b
    aXXX<<>>b
    aXXXb<<>>


    The first time through, it matches right at the beginning of the
    string.
    The second time through, it matches the XXX.
    The third time through, it matches the empty string between the last X
    and the b.
    The final time through, it matches after the b, at the end of the
    string.
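
    The same four matches can be listed with their offsets instead of the
    $`, $& and $' variables. This is only a sketch: the variable names
    $text and @matches are made up for illustration, while @- and @+ are
    Perl's built-in arrays holding the start and end offsets of the last
    successful match.

```perl
use strict;
use warnings;

my $text = "aXXXb";
my @matches;    # each entry: [start offset, end offset, captured text]

# With /g in scalar context, each iteration resumes where the previous
# match ended. $-[0] and $+[0] are the start and end of the current match.
while ($text =~ /(X*)/g) {
    push @matches, [ $-[0], $+[0], $1 ];
}

printf "start=%d end=%d text='%s'\n", @$_ for @matches;
# start=0 end=0 text=''
# start=1 end=4 text='XXX'
# start=4 end=4 text=''
# start=5 end=5 text=''
```

    The first entry is the one a plain (non-/g) match returns, which is
    why $1 ends up empty.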


    Paul Lalli
     
    Paul Lalli, Feb 13, 2008
    #3
  4. Guest

    On Feb 13, 1:25 am, wrote:
    >
    > I would like to ask you for help. Could anybody explain to me why the regex
    > "/(X*)/" is not able to catch X in the string "aXXXb"? The quantifier "*" in
    > this regex is a greedy one and there is no anchor "^", so I would
    > expect that $1 would contain XXX. I know (I have read it) that it is
    > possible to use + instead of *, but I would like to know why the "*"
    > quantifier doesn't catch it.
    >
    > I have found this example in perlretut:
    > Finally,
    > "aXXXb" =~ /(X*)/; # matches with $1 = ''
    > because it can match zero copies of 'X' at the beginning of
    > the string. If you definitely want to match at least one
    > 'X', use "X+", not "X*".



    Basically, a lot of people mistakenly think that the greedy '*'
    quantifier makes m/(X*)/ match the LONGEST string of Xs. But in
    reality, m/(X*)/ matches AS SOON AS POSSIBLE, and '*' just makes (X*)
    gobble as much as it can once a match is found.

    I believe it was "Learning Perl" (the "llama" book) that said
    that if a regular expression can match the empty string, then a match
    against it succeeds no matter what string it is given. And the regular
    expression m/(X*)/ does match the empty string, since '' contains
    zero-or-more instances of 'X'. Therefore, even this
    match succeeds:

    "ab" =~ /(X*)/; # $1 gets set to ''

    It succeeds because it found zero-or-more Xs at the very beginning of
    the string. Likewise, the match:

    "aXXXb" =~ /(X*)/; # $1 gets set to ''

    also succeeds by finding zero-or-more Xs at the very beginning of the
    string. It stops searching after that because it found a match, and
    has no need to continue any further.

    If you really wanted a regular expression that would match at least
    one X, then you should use the '+' quantifier instead of '*', like
    this:

    "aXXXb" =~ /(X+)/; # $1 gets set to 'XXX'

    but since it still matches as soon as possible, it wouldn't match a
    longer string of Xs, as shown here:

    "aXXXbXXXXXc" =~ /(X+)/; # $1 still gets set to 'XXX'

    If you wanted to match the longest string of Xs, you'd have to loop
    through all the strings of Xs and record the longest one. You can do
    this with the /g modifier like this:

    my $longestString = '';
    while ( "aXXXbXXXXXcXd" =~ m/(X+)/g )
    {
        $longestString = $1 if length($1) > length($longestString);
    }
    print "$longestString\n"; # prints 'XXXXX'
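
    An alternative sketch of the same idea, without the explicit loop: in
    list context, a /g match returns every capture at once, and reduce
    from the core List::Util module can then pick the longest. The names
    $text, @runs and $longest are illustrative only.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

my $text = "aXXXbXXXXXcXd";

# In list context, /g returns all captured runs: ('XXX', 'XXXXX', 'X').
my @runs = $text =~ /(X+)/g;

# Keep the longer of each pair; on a tie the earlier run wins.
my $longest = reduce { length($b) > length($a) ? $b : $a } @runs;

# reduce returns undef for an empty list, i.e. when there is no X at all.
$longest = '' unless defined $longest;

print "$longest\n";    # prints 'XXXXX'
```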

    So remember, it is a mistake to think that the '*' and '+'
    quantifiers match the longest instance of a string; they just match as
    much as they can (or "gobble" up as much as they can) once a match has
    been found -- even if the match was found at the very beginning of the
    string.

    This means that if a regular expression can match the empty string,
    then a single match with the '*' quantifier will capture an empty
    string unless the text it quantifies starts at the very beginning of
    the string.
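
    A small sketch of that rule, using a hypothetical helper named
    first_x_run (not something from the Perl documentation) that returns
    whatever a single /(X*)/ match captures:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text captured by one /(X*)/ match.
# The pattern can match the empty string, so the match never fails.
sub first_x_run {
    my ($text) = @_;
    return $text =~ /(X*)/ ? $1 : undef;
}

for my $text ("aXXXb", "XXXab", "ab") {
    printf "%s => '%s'\n", $text, first_x_run($text);
}
# aXXXb => ''
# XXXab => 'XXX'
# ab => ''
```

    Only "XXXab" yields a non-empty capture, because only there do the Xs
    begin at offset 0, where the first match attempt is made.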

    I hope this explanation helps.

    -- Jean-Luc
     
    , Feb 13, 2008
    #4
  5. Guest

    wrote:
    > Hello,
    >
    > I would like to ask you for help. Could anybody explain to me why the regex
    > "/(X*)/" is not able to catch X in the string "aXXXb"? The quantifier "*" in
    > this regex is a greedy one and there is no anchor "^", so I would
    > expect that $1 would contain XXX.


    "Greedy" is a term of art in computer science. It does not have exactly
    the same meaning as it does in religion or ethics or Marxism. Alas,
    even the term of art isn't unambiguous in this context, as
    there is no objective way of knowing what "locally optimal" means in the
    regex context. Fortunately, the documentation doesn't rely on you knowing
    exactly what it means by greedy; it goes on to explain what the behavior
    actually is. So don't get hung up on loaded words. I think the docs should
    remove that reference and just stick to describing the behavior explicitly.

    > I know (I have read it) that it is
    > possible to use + instead of *, but I would like to know why the "*"
    > quantifier doesn't catch it.


    Because it doesn't look ahead to see whether something better might
    come later; it makes local decisions. That is what greedy means as a term
    of art, but in this case it applies not to the "*" but to the scanning.
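
    The scanning can be imitated by hand to see those local decisions. This
    is a conceptual sketch only, not how the real engine is implemented
    (which has many optimizations); leftmost_match is a hypothetical helper
    name. It tries the pattern at offset 0, then 1, and so on, and accepts
    the first offset where the pattern succeeds, without ever comparing
    candidates.

```perl
use strict;
use warnings;

# Hypothetical helper: attempt the pattern at each start offset in turn
# and return (offset, matched text) for the first offset that succeeds.
sub leftmost_match {
    my ($text, $re) = @_;
    for my $start (0 .. length $text) {
        pos($text) = $start;
        # \G pins the attempt to $start, so the regex itself does no scanning.
        if ($text =~ /\G($re)/g) {
            return ($start, $1);
        }
    }
    return;    # no offset worked
}

my ($at, $got) = leftmost_match("aXXXb", qr/X*/);
print "offset $at, matched '$got'\n";    # offset 0, matched ''

($at, $got) = leftmost_match("aXXXb", qr/X+/);
print "offset $at, matched '$got'\n";    # offset 1, matched 'XXX'
```

    With X* the very first attempt succeeds with zero Xs, so the loop never
    reaches offset 1. With X+ the attempt at offset 0 fails, and only then
    does the quantifier get the chance to take all three Xs.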

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Feb 13, 2008
    #5
  6. Guest

    On Feb 13, 11:59 pm, Abigail wrote:
    > _
    > () wrote on VCCLXXIX
    > September MCMXCIII in <URL:news:>:
    > $$ Hello,
    > $$
    > $$ I would like to ask you for help. Could anybody explain to me why the regex
    > $$ "/(X*)/" is not able to catch X in the string "aXXXb"? The quantifier "*" in
    > $$ this regex is a greedy one and there is no anchor "^", so I would
    > $$ expect that $1 would contain XXX. I know (I have read it) that it is
    > $$ possible to use + instead of *, but I would like to know why the "*"
    > $$ quantifier doesn't catch it.
    >
    > Because if a regexp can match in more than one way in the subject string,
    > it will match at the left most position.
    >
    > /(X*)/ matches 0 or more X's. "aXXXb" starts with zero X's. So it matches
    > at the beginning of the string. With 0 X's.
    >
    > $$ I have found this example in perlretut:
    > $$ Finally,
    > $$ "aXXXb" =~ /(X*)/; # matches with $1 = ''
    > $$ because it can match zero copies of 'X' at the beginning of
    > $$ the string. If you definitely want to match at least one
    > $$ 'X', use "X+", not "X*".
    >
    > Right.
    >
    > Abigail
    > --
    > use lib sub {($\) = split /\./ => pop; print $"};
    > eval "use Just" || eval "use another" || eval "use Perl" || eval "use Hacker";


    Thank you all very much for the perfect explanations.
     
    , Feb 14, 2008
    #6