another strange regexp case

Discussion in 'Ruby' started by Kristof Bastiaensen, Jun 29, 2004.

  1. Hi,

    here is another regexp behaviour which surprises me.
    There may be some logic behind it, but I fail to see it...

    irb(main):004:0> /(theone)?/.match(" theone").to_a
    => ["", nil]

    irb(main):003:0> /(theone)?/.match("theone").to_a
    => ["theone", "theone"]

    irb(main):005:0> / (theone)?/.match(" theone").to_a
    => [" theone", "theone"]

    In the first case, it doesn't match "theone", but in
    the second and third it does...

    Could anyone explain this?

    Kristof
     
    Kristof Bastiaensen, Jun 29, 2004
    #1

  2. Kristof Bastiaensen

    ts Guest

    K> irb(main):004:0> /(theone)?/.match(" theone").to_a
    K> => ["", nil]

    When the regexp engine tries to match `t' it fails, because the first
    character is ` ', and the regexp still succeeds because `theone' is optional.

    K> irb(main):003:0> /(theone)?/.match("theone").to_a
    K> => ["theone", "theone"]

    Here it can match `theone' on its first try.
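
To make this visible, here is a small sketch that inspects where each match starts and what it covers. It only uses MatchData#begin and MatchData#[]; the variable names are just for illustration.

```ruby
# First case: the attempt at position 0 succeeds with zero characters,
# because the only thing in the pattern is optional.
m = /(theone)?/.match(" theone")
first_case = [m.begin(0), m[0], m[1]]

# Second case: `theone' is right there at position 0, so the greedy `?'
# takes it.
m = /(theone)?/.match("theone")
second_case = [m.begin(0), m[0], m[1]]

p first_case    # => [0, "", nil]
p second_case   # => [0, "theone", "theone"]
```

Both matches start at offset 0; the only difference is how much could be consumed from there.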


    Guy Decoux
     
    ts, Jun 29, 2004
    #2

  3. Kristof Bastiaensen

    Ara.T.Howard Guest

    ? means 'zero or one'

    we start at the beginning of ' theone' and instantly find a match: zero of
    them.
    same here.
    same here. ;-)


    remember regexp engines work (well, some of them) by starting at a position and
    consuming chars while the pattern matches; iff the whole pattern was used we
    have a positive match, otherwise not. so in all these cases we start like so

    ' theone'
     ^
     |
     |
     ptr

    and drive with the regexp asking "does the regexp match starting here? if so
    how many chars did it consume?" the consumed chars are returned in $& and the
    parts captured by groups in $1, $2, etc. in all the cases above this explains
    the matching.
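
the procedure described above can be sketched in a few lines of ruby. this is only a rough model, not how the real engine is implemented: leftmost_match is a made-up helper that slices the string at each position and retries with \A, so lookbehinds and the like would not behave the same as in a real search.

```ruby
# Hypothetical helper: try the pattern at each start position in turn and
# report the first position where it succeeds, plus what it consumed there.
def leftmost_match(re, str)
  anchored = /\A#{re}/              # only accept a match that begins at the cursor
  (0..str.length).each do |pos|
    m = anchored.match(str[pos..-1])
    return [pos, m[0]] if m         # first position that works wins
  end
  nil                               # no position worked
end

p leftmost_match(/(theone)?/, " theone")    # => [0, ""]
p leftmost_match(/ (theone)?/, " theone")   # => [0, " theone"]
p leftmost_match(/theone/, " theone")       # => [1, "theone"]
```

only the last pattern, which cannot match zero chars, forces the cursor to move on to position 1.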

    note that some regexp engines work in the reverse sense but the effect is
    largely the same...
    so it matched in all cases -- sometimes zero times, sometimes one time. this
    is what you asked the regexp to do. i try to follow these rules when
    composing regexps:

    - always use anchors ^ and $
    - never use anything that can match 'zero' things

    it's the 'zero' thing that surprised you. your first two regexps match even
    the empty string!

    obviously this is not always possible but i will maintain this:

    if you create a regexp without anchors and with portions that can match zero
    things and have not done so out of absolute need - your code has a bug.
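
a small sketch of what the two rules buy you, with made-up variable names. one thing worth knowing when applying them in ruby: ^ and $ anchor at line boundaries, while \A and \z anchor at the start and end of the whole string.

```ruby
loose  = /(theone)?/       # unanchored, and everything in it is optional
strict = /\A(theone)\z/    # anchored at both ends, nothing optional

loose_on_empty   = loose.match("")           # succeeds with an empty match
strict_on_padded = strict.match(" theone")   # nil: the leading space is not allowed
strict_on_exact  = strict.match("theone")    # matches, group 1 is "theone"

# ^ and $ work per line, so this still matches on the second line ...
line_anchored   = /^theone$/.match("junk\ntheone")
# ... whereas \A and \z insist on the whole string.
string_anchored = /\Atheone\z/.match("junk\ntheone")

p loose_on_empty[0]      # => ""
p strict_on_padded       # => nil
p strict_on_exact[1]     # => "theone"
p line_anchored[0]       # => "theone"
p string_anchored        # => nil
```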

    kind regards.

    -a
    --
    ===============================================================================
    | EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
    | PHONE :: 303.497.6469
    | A flower falls, even though we love it;
    | and a weed grows, even though we do not love it.
    | --Dogen
    ===============================================================================
     
    Ara.T.Howard, Jun 29, 2004
    #3
  4. I think that this is about how greediness in Regexps works:

    A Regexp will try to match as much as possible starting at the current
    position, but even a "bad" match at the current position will be better
    than a "good" match at a later position in the String.

    Maybe it would be possible to do a version of .match that finds the
    "best" (== longest, if greedy) match in the whole string. I assume that
    it would be based on .scan in some kind of way.
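
A rough sketch of that idea, assuming "best" simply means the longest matched text. longest_match is a hypothetical helper, not something in the standard library, and since .scan only yields non-overlapping matches it does not consider every possible start position.

```ruby
# Hypothetical helper: walk over all non-overlapping matches with String#scan
# and keep the MatchData whose matched text is longest (the earliest one wins ties).
def longest_match(re, str)
  best = nil
  str.scan(re) do
    m = Regexp.last_match
    best = m if best.nil? || m[0].length > best[0].length
  end
  best
end

m = longest_match(/(theone)?/, " theone")
p m.to_a       # => ["theone", "theone"]
p m.begin(0)   # => 1
```

With this, the first case from the original post finds "theone" at offset 1 instead of the empty match at offset 0.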

    Regards,
    Florian Gross
     
    Florian Gross, Jun 29, 2004
    #4
  5. On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:

    Thanks for the answer. I expected the pattern to expand greedily,
    but I forgot it will return the first match, which is the empty
    match. You are right, /(theone)?/ is a silly thing to write;
    in the end I just needed a different regexp for my problem.

    Thanks,
    Kristof
     
    Kristof Bastiaensen, Jun 29, 2004
    #5
  6. This is a case of the simple general rule "Watch out for regular
    expressions that match the empty string". All sorts of problems can arise
    when using them and usually you don't want to match an empty string
    anyway.
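
A short illustration of the kind of problem meant here, using a made-up pattern /x*/ that can match zero characters. Checking a regexp against "" is one simple way to spot the situation.

```ruby
re = /x*/    # can match zero characters

# A pattern that matches "" will also "match" at every position of any string.
matches_empty = !re.match("").nil?

# scan reports an empty match before each character and one at the end.
scanned = "abc".scan(re)

# gsub inserts the replacement at each of those empty matches.
replaced = "abc".gsub(re, "-")

p matches_empty   # => true
p scanned         # => ["", "", "", ""]
p replaced        # => "-a-b-c-"
```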

    Kind regards

    robert
     
    Robert Klemme, Jun 30, 2004
    #6
