another strange regexp case

Discussion in 'Ruby' started by Kristof Bastiaensen, Jun 29, 2004.

  1. Hi,

    here is another regexp behaviour which surprises me.
    There may be some logic behind it, but I fail to see it...

    irb(main):004:0> /(theone)?/.match(" theone").to_a
    => ["", nil]

    irb(main):003:0> /(theone)?/.match("theone").to_a
    => ["theone", "theone"]

    irb(main):005:0> / (theone)?/.match(" theone").to_a
    => [" theone", "theone"]

    In the first case, it doesn't match "theone", but in
    the second and third it does...

    Could anyone explain this?

    Kristof
    Kristof Bastiaensen, Jun 29, 2004
    #1

  2. ts Guest

    >>>>> "K" == Kristof Bastiaensen <> writes:

    K> irb(main):004:0> /(theone)?/.match(" theone").to_a
    K> => ["", nil]

    When the regexp engine tries to match `t' it fails, because the first
    character is ` ', but the regexp still succeeds because `theone' is optional

    K> irb(main):003:0> /(theone)?/.match("theone").to_a
    K> => ["theone", "theone"]

    Here it can match `theone' on its first try
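
To make the point above visible, here is a minimal sketch that reports where each match starts and ends. The helper name `describe_match` is made up for illustration; `MatchData#begin` and `MatchData#end` are standard Ruby.

```ruby
# Hypothetical helper: report the offsets of the overall match and its first group.
def describe_match(regexp, string)
  m = regexp.match(string)
  return nil unless m
  { start: m.begin(0), stop: m.end(0), matched: m[0], group1: m[1] }
end

# Empty match at offset 0: the optional group matched zero times.
p describe_match(/(theone)?/, " theone")
# Match from offset 0 to 6: the group matched once on the first try.
p describe_match(/(theone)?/, "theone")
# Match from offset 0 to 7: the literal space is consumed first, then the group.
p describe_match(/ (theone)?/, " theone")
```

In all three cases the match begins at offset 0; only its length differs.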


    Guy Decoux
    ts, Jun 29, 2004
    #2

  3. Ara.T.Howard Guest

    On Tue, 29 Jun 2004, Kristof Bastiaensen wrote:


    > irb(main):004:0> /(theone)?/.match(" theone").to_a
    > => ["", nil]


    ? means 'zero or one'

    we start at the beginning of ' theone' and instantly find a match: zero of
    them.

    > irb(main):003:0> /(theone)?/.match("theone").to_a
    > => ["theone", "theone"]


    same here.

    > irb(main):005:0> / (theone)?/.match(" theone").to_a
    > => [" theone", "theone"]


    same here. ;-)


    remember regexp engines work (well, some of them) by starting at a position and
    consuming chars while the pattern matches. iff all of the pattern was used we
    have a positive match, otherwise not. so in all these cases we start like so

    ' theone'
     ^
     |
     ptr

    and drive with the regexp, asking "does the regexp match starting here? if so,
    how many chars did it consume?" the consumed chars are returned in $& and the
    captured groups in $1, $2, etc. in all the cases above this explains the matching.
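
The position-by-position search described above can be imitated in Ruby itself. This is only a rough sketch, not how the engine is actually implemented: the helper `first_match_by_position` is hypothetical, and it relies on `\G` anchoring the attempt to the offset passed to `Regexp#match`.

```ruby
# Hypothetical helper: try the pattern at offset 0, then 1, then 2, ...
# and return the first offset where it succeeds, with the text it consumed.
def first_match_by_position(source, string)
  anchored = Regexp.new("\\G(?:#{source})") # \G pins the attempt to `pos`
  (0..string.length).each do |pos|
    m = anchored.match(string, pos)
    return [pos, m[0]] if m
  end
  nil
end

# The optional group succeeds immediately at offset 0 by consuming nothing.
p first_match_by_position("(theone)?", " theone")
# Without the ?, offset 0 fails and the search has to move on to offset 1.
p first_match_by_position("theone", " theone")
```

The search stops at the first offset that yields any match at all, which is why the empty match at offset 0 wins over the longer match one character later.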

    note that some regexp engines work in the reverse sense but the effect is
    largely the same...

    > In the first case, it doesn't match "theone", but in the second and third it
    > does...


    so it matched in all cases -- sometimes zero times, sometimes one time. this
    is what you asked the regexp to do. i try to follow these rules when
    composing regexps:

    - always use anchors ^ and $
    - never use anything that can match 'zero' things

    it's the 'zero' thing that surprised you. your first two regexps match even
    the empty string!

    obviously this is not always possible but i will maintain this:

    if you create a regexp without anchors and with portions that can match zero
    things and have not done so out of absolute need - your code has a bug.
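
A short sketch of the two rules applied to this thread's pattern. The constant names are invented for illustration. One assumption worth flagging: in Ruby, `^` and `$` anchor at line boundaries, so the sketch uses `\A` and `\z` to anchor at the start and end of the whole string.

```ruby
# Anchored, with a mandatory group: matches the whole string or not at all.
STRICT = /\A(theone)\z/
# Unanchored but mandatory: the engine must scan forward until it finds "theone".
REQUIRED = /(theone)/
# The original pattern, kept for comparison: it can match zero characters.
OPTIONAL = /(theone)?/

p STRICT.match("theone").to_a     # ["theone", "theone"]
p STRICT.match(" theone")         # nil, the leading space breaks the anchor
p REQUIRED.match(" theone").to_a  # ["theone", "theone"]
p OPTIONAL.match?("")             # true, it matches even the empty string
```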

    kind regards.

    -a
    --
    ===============================================================================
    | EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
    | PHONE :: 303.497.6469
    | A flower falls, even though we love it;
    | and a weed grows, even though we do not love it.
    | --Dogen
    ===============================================================================
    Ara.T.Howard, Jun 29, 2004
    #3
  4. Kristof Bastiaensen wrote:
    > Hi,


    Moin!

    > here is another regexp behaviour which surprises me.
    > There may be some logic behind it, but I fail to see it...
    > irb(main):004:0> /(theone)?/.match(" theone").to_a
    > => ["", nil]


    I think that this is about how greediness in Regexps works:

    A Regexp will try to match as much as possible starting at the current
    position, but even a "bad" match at the current position will be better
    than a "good" match at a later position in the String.

    Maybe it would be possible to do a version of .match that finds the
    "best" (== longest, if greedy) match in the whole string. I assume that
    it would be based on .scan in some kind of way.
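
A minimal sketch of that idea, assuming "best" simply means the longest overall match. The method `longest_match` is hypothetical and not part of Ruby. Since `scan` only visits non-overlapping matches from left to right, this is an approximation rather than a true search over every possible match.

```ruby
# Hypothetical helper: collect the MatchData of every scan hit, keep the longest.
def longest_match(string, regexp)
  matches = []
  string.scan(regexp) { matches << Regexp.last_match }
  matches.max_by { |m| m[0].length }
end

m = longest_match(" theone", /(theone)?/)
p m.to_a      # ["theone", "theone"]
p m.begin(0)  # 1, the match starts after the leading space
```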

    Regards,
    Florian Gross
    Florian Gross, Jun 29, 2004
    #4
  5. On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:

    <snip>
    >> irb(main):004:0> /(theone)?/.match(" theone").to_a
    >> => ["", nil]

    >
    > ? means 'zero or one'
    >
    > we start at the beginning of ' theone' and instantly find a match: zero of
    > them.

    <snip>
    >
    > if you create a regexp without anchors and with portions that can match zero
    > things and have not done so out of absolute need - your code has a bug.


    Thanks for the answer. I expected the pattern to expand greedily,
    but I forgot it will return the first match, which is the empty
    match. You are right, /(theone)?/ is a silly thing to write;
    in the end I just needed another regexp for my problem.

    Thanks,
    Kristof
    Kristof Bastiaensen, Jun 29, 2004
    #5
  6. "Kristof Bastiaensen" <> wrote in message
    news:p...
    > On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:
    >
    > <snip>
    > >> irb(main):004:0> /(theone)?/.match(" theone").to_a
    > >> => ["", nil]
    > >
    > > ? means 'zero or one'
    > >
    > > we start at the beginning of ' theone' and instantly find a match: zero
    > > of them.
    > <snip>
    > >
    > > if you create a regexp without anchors and with portions that can match
    > > zero things and have not done so out of absolute need - your code has a
    > > bug.
    >
    > Thanks for the answer. I expected the pattern to expand greedily,
    > but I forgot it will return the first match, which is the empty
    > match. You are right, /(theone)?/ is a silly thing to write;
    > in the end I just needed another regexp for my problem.


    This is a case of the simple general rule "Watch out for regular
    expressions that match the empty string". All sorts of problems can arise
    when using them and usually you don't want to match an empty string
    anyway.
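
Two small illustrations of the kind of problem meant here, offered as a sketch rather than a complete list. The constant names are invented for the example.

```ruby
# A pattern that can match the empty string matches between every character,
# so gsub inserts the replacement everywhere.
DASHED = "abc".gsub(/x*/, "-")
p DASHED  # "-a-b-c-"

# scan reports empty matches too, so the real hit is surrounded by
# entries whose capture group is nil.
SCANNED = " theone".scan(/(theone)?/)
p SCANNED

# Dropping the ? leaves only the match that was actually wanted.
p " theone".scan(/(theone)/)  # [["theone"]]
```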

    Kind regards

    robert
    Robert Klemme, Jun 30, 2004
    #6
