Re: regexp non-greedy matching bug?

Discussion in 'Python' started by Tim Peters, Dec 4, 2005.

  1. Tim Peters

    Tim Peters Guest

    [John Hazen]
    > I want to match one or two instances of a pattern in a string.
    >
    > According to the docs for the 're' module
    > ( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier
    > is greedy by default, and adding a '?' after a qualifier makes it
    > non-greedy.


    >> The "*", "+", and "?" qualifiers are all greedy...
    >> Adding "?" after the qualifier makes it perform the match in
    >> non-greedy or minimal fashion...


    > In the following example, though my re is intended to allow for 1 or 2
    > instinces of 'foo', there are 2 in the string I'm matching. So, I would
    > expect group(1) and group(3) to both be populated. (When I remove the
    > conditional match on the 2nd foo, the grouping is as I expect.)
    >
    > $ python2.4
    > Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
    > [GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import re
    > >>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')
    > >>> foofoo.match(s).group(0)

    > 'foobarbazfoobar'
    > >>> foofoo.match(s).group(1)

    > 'foo'
    > >>> foofoo.match(s).group(2)

    > ''
    > >>> foofoo.match(s).group(3)
    > >>> foofoo.match(s).group(4)

    > 'barbazfoobar'


    Your problem isn't that

    (foo)?

    is not greedy (it is greedy), it's that your first

    (.*?)

    is not greedy. Remember that regexps also work left to right. When
    you coded your first

    (.*?)

    you're asking (because of the '?') the regexp engine to chew up the
    fewest possible number of characters at that point such that the
    _rest_ of the regexp _can_ match. By chewing up no characters at all,
    the rest of the regexp can in fact match, so that's what the engine
    did -- your second

    (foo)?

    is optional, telling the engine you don't require that `foo` to match.
    The engine took you at your word about that ;-)

    > >>> foofoo = re.compile(r'^(foo)(.*?)(foo)(.*?)$')


    In this case your second `foo` is not optional. The behavior wrt the first

    (.*?)

    really doesn't change: the regexp engine again chews up the fewest
    number of characters at that point such that the rest of the regexp
    can match. But because your second `foo` isn't optional in this case,
    the engine can't get away with matching 0 characters in this case. It
    still matches the fewest number it can match there, though (consistent
    with the rest of the pattern matching too).

    > >>> foofoo.match(s).group(0)

    > 'foobarbazfoobar'
    > >>> foofoo.match(s).group(1)

    > 'foo'
    > >>> foofoo.match(s).group(2)

    > 'barbaz'
    > >>> foofoo.match(s).group(3)

    > 'foo'
    > >>> foofoo.match(s).group(4)

    > 'bar'
    > >>>

    >
    > So, is this a bug, or just a problem with my understanding?


    The behavior is what I expected ;-)

    > If it's my brain that's broken, what's the proper way to do this with regexps?


    Sorry, I'm unclear on (exactly) what it is you're trying to
    accomplish. Maybe what you're looking for is

    ^P(.*P)?.*$

    ?

    > And, if the above is expected behavior, should I submit a doc bug? It's
    > clear that the "?" qualifier (applied to the second foo group) is _not_
    > greedy in this situation.


    See above: that's not only not clear, it's not true. Consider a
    related but much simpler example:

    >>> m = re.match(r'a(b)?(b)?c', 'abc')
    >>> m.groups()

    ('b', None)

    Both instances of

    (b)?

    are "greedy" there, and that the second one didn't match "b" does not
    mean that the second one is not greedy. It _couldn't_ match without
    violating that the _first_ is greedy, and the first "wins" because
    regexps work left to right. It may be harder to see that the same
    principle is at work in your example, but it is: your second (foo)?
    couldn't match without violating that your first (.*?) asks for a
    minimal match. My

    ^P(.*P)?.*$

    above asks the engine to match two instances of P if possible, but to
    settle for one if that's all it can find.
     
    Tim Peters, Dec 4, 2005
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. kaeli
    Replies:
    3
    Views:
    11,352
  2. Sam Pointon

    regexp non-greedy matching bug?

    Sam Pointon, Dec 4, 2005, in forum: Python
    Replies:
    8
    Views:
    372
    Fredrik Lundh
    Dec 5, 2005
  3. John Hazen

    Re: regexp non-greedy matching bug?

    John Hazen, Dec 4, 2005, in forum: Python
    Replies:
    0
    Views:
    406
    John Hazen
    Dec 4, 2005
  4. Dan Kelly

    Greedy and non greedy quantifiers

    Dan Kelly, Jan 17, 2008, in forum: Ruby
    Replies:
    4
    Views:
    149
    Robert Klemme
    Jan 19, 2008
  5. Matt Garrish

    greedy v. non-greedy matching

    Matt Garrish, Feb 16, 2004, in forum: Perl Misc
    Replies:
    4
    Views:
    167
    Matt Garrish
    Feb 16, 2004
Loading...

Share This Page