regexp non-greedy matching bug?

Discussion in 'Python' started by Sam Pointon, Dec 4, 2005.

  1. Sam Pointon

    Sam Pointon Guest

    My understanding of .*? and its ilk is that they will match as little
    as is possible for the rest of the pattern to match, like .* will match
    as much as possible. In the first instance, the first (.*?) did not
    have to match anything, because all of the rest of the pattern can be
    matched 0 or more times. I think that such a situation (non-greedy
    operator followed by operators that match 0 or more times) will never
    match. However, in the second instance, it has to match something later
    on in the string, so it will capture something.

    I believe that this is an operator precedence problem (greedy ? losing
    to .*?), but is to be expected. So, if this is the case, by all means
    it should be added in a note to the docs.

    However, I am not a regular expression expert, so my analysis of the
    situation may be well off the mark.
     
    Sam Pointon, Dec 4, 2005
    #1
    1. Advertising

  2. Sam Pointon

    John Hazen Guest

    I want to match one or two instances of a pattern in a string.

    According to the docs for the 're' module
    ( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier
    is greedy by default, and adding a '?' after a qualifier makes it
    non-greedy.

    > The "*", "+", and "?" qualifiers are all greedy...
    > Adding "?" after the qualifier makes it perform the match in
    > non-greedy or minimal fashion...


    In the following example, though my re is intended to allow for 1 or 2
    instinces of 'foo', there are 2 in the string I'm matching. So, I would
    expect group(1) and group(3) to both be populated. (When I remove the
    conditional match on the 2nd foo, the grouping is as I expect.)

    $ python2.4
    Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
    [GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')
    >>> foofoo.match(s).group(0)

    'foobarbazfoobar'
    >>> foofoo.match(s).group(1)

    'foo'
    >>> foofoo.match(s).group(2)

    ''
    >>> foofoo.match(s).group(3)
    >>> foofoo.match(s).group(4)

    'barbazfoobar'
    >>> foofoo = re.compile(r'^(foo)(.*?)(foo)(.*?)$')
    >>> foofoo.match(s).group(0)

    'foobarbazfoobar'
    >>> foofoo.match(s).group(1)

    'foo'
    >>> foofoo.match(s).group(2)

    'barbaz'
    >>> foofoo.match(s).group(3)

    'foo'
    >>> foofoo.match(s).group(4)

    'bar'
    >>>



    So, is this a bug, or just a problem with my understanding? If it's my
    brain that's broken, what's the proper way to do this with regexps?

    And, if the above is expected behavior, should I submit a doc bug? It's
    clear that the "?" qualifier (applied to the second foo group) is _not_
    greedy in this situation.

    -John
     
    John Hazen, Dec 4, 2005
    #2
    1. Advertising

  3. Sam Pointon

    Mike Meyer Guest

    John Hazen <> writes:
    > I want to match one or two instances of a pattern in a string.


    Then you should be using the split() method of the match object on the
    pattern in question.

    > According to the docs for the 're' module
    > ( http://python.org/doc/current/lib/re-syntax.html ) the '?' qualifier
    > is greedy by default, and adding a '?' after a qualifier makes it
    > non-greedy.
    >
    >> The "*", "+", and "?" qualifiers are all greedy...
    >> Adding "?" after the qualifier makes it perform the match in
    >> non-greedy or minimal fashion...


    The thing to understand is that regular expressions are *search*
    functions, that return the first parsing that matches. They search a
    space of possible matches to each term in the expression. If some term
    fails to match, the preceeding term goes on to its next match, and you
    try again. The "greedy" vs. "non-greedy" describes the order that the
    term in question tries matches. If it's greedy, it will try the
    longest possible match first. If it's non-greedy, it'll try the
    shortest possible match first.

    > $ python2.4
    > Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
    > [GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> import re
    >>>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')
    >>>> foofoo.match(s).group(0)

    > 'foobarbazfoobar'
    >>>> foofoo.match(s).group(1)

    > 'foo'
    >>>> foofoo.match(s).group(2)

    > ''
    >>>> foofoo.match(s).group(3)
    >>>> foofoo.match(s).group(4)

    > 'barbazfoobar'


    First, this pattern doesn't look for one or two instances of "foo" in
    a string. It looks for a string that starts with "foo" and maybe has a
    second "foo" in it as well.

    Here's what it does:

    ^ must match the beginning of the string (BTW, you can get the same
    behavior by leaving off the ^ and using search instead of match).

    (foo) must match the first foo.

    (.*?) is non-greedy, so it tries the shortest thing that matches, the
    empty string.

    (foo)? matches what's next in the string by matching 0 occurrences of
    (foo).

    (.*?) tries the empty string as well.

    $ fails to match the end of the string, so we backtrack, and the
    second (.*?) tries again, adding a character. This keeps on until the
    (.*?) matches the rest of the string and the $ succeeds.

    Here, the only pattern that gets retried is the last one. It keeps
    adding characters until it swallows the remainder of the string. This
    is exactly what's expected. You can't change this behavior with
    greedy/non-greedy, as that last term will always try searching the
    entire string before it fails and the preceeding "(foo)?" gets to try
    something else.

    >>>> foofoo = re.compile(r'^(foo)(.*?)(foo)(.*?)$')
    >>>> foofoo.match(s).group(0)

    > 'foobarbazfoobar'
    >>>> foofoo.match(s).group(1)

    > 'foo'
    >>>> foofoo.match(s).group(2)

    > 'barbaz'
    >>>> foofoo.match(s).group(3)

    > 'foo'
    >>>> foofoo.match(s).group(4)

    > 'bar'
    >>>>


    This time, the second (foo) keeps failing, forcing the first (.*?) to
    add characters until the (foo) succeeds.

    > So, is this a bug, or just a problem with my understanding? If it's my
    > brain that's broken, what's the proper way to do this with regexps?


    To do what you said you want to do, you want to use the split method:

    foo = re.compile('foo')
    if 2 <= len(foo.split(s)) <= 3:
    print "We had one or two 'foo's"

    As the founder of SPARE, I'd say the proper way to do this is with
    string methods:

    if 2 <= len(s.split('foo')) <= 3:
    print "We had one or two 'foo's"

    Where the two split's will produce exactly the same things.

    You might be able to match "starts with 'foo' and has at most one
    other 'foo'" with lookahead patternds, but I'm not going to go into
    that. If you really need the string to start with foo, I'd just use
    "s.startswith('foo')" to check it.

    <mike
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Dec 4, 2005
    #3
  4. Sam Pointon

    John Hazen Guest

    [Mike Meyer]
    > The thing to understand is that regular expressions are *search*
    > functions, that return the first parsing that matches. They search a
    > space of possible matches to each term in the expression. If some term
    > fails to match, the preceeding term goes on to its next match, and you
    > try again. The "greedy" vs. "non-greedy" describes the order that the
    > term in question tries matches. If it's greedy, it will try the
    > longest possible match first. If it's non-greedy, it'll try the
    > shortest possible match first.


    That's a good explanation. Thanks.

    [John]
    > > I want to match one or two instances of a pattern in a string.
    > >>>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')


    [Mike]
    > First, this pattern doesn't look for one or two instances of "foo" in
    > a string. It looks for a string that starts with "foo" and maybe has a
    > second "foo" in it as well.


    Right. In simplifying the expression for public consumption, one of the
    terms I dropped was r'^.*?(foo)...'.

    > To do what you said you want to do, you want to use the split method:
    >
    > foo = re.compile('foo')
    > if 2 <= len(foo.split(s)) <= 3:
    > print "We had one or two 'foo's"


    Well, this would solve my dumbed down example, but each foo in the
    original expression was a stand-in for a more complex term. I was using
    match groups to extract the parts of the match that I wanted. Here's an
    example (using Tim's correction) that actually demonstrates what I'm
    doing:

    >>> s = 'zzzfoo123barxxxfoo456baryyy'
    >>> s2 = 'zzzfoo123barxxxfooyyy'
    >>> foobar2 = re.compile(r'^.*?foo(\d+)bar(.*foo(\d+)bar)?.*$')
    >>> print foobar2.match(s).group(1)

    123
    >>> print foobar2.match(s).group(3)

    456
    >>> print foobar2.match(s2).group(1)

    123
    >>> print foobar2.match(s2).group(3)

    None
    >>>



    Looking at re.split, it doesn't look like it returns the actual matching
    text, so I don't think that fits my need.

    > As the founder of SPARE...


    Hmm, not a very effective name. A google search didn't fing any obvious
    hits (even after adding the "python" qualifier, and removing "spare time"
    and "spare parts" hits). (I couldn't find it off your homepage,
    either.)

    Thanks for your help. If you have any suggestions about a non-re way to
    do the above, I'd be interested.

    -John
     
    John Hazen, Dec 4, 2005
    #4
  5. Mike Meyer wrote:

    > ^ must match the beginning of the string (BTW, you can get the same
    > behavior by leaving off the ^ and using search instead of match).


    that's backwards, isn't it? using ^ with match is usually pointless (since
    match only looks at the first position anyway), and using ^ with search
    is also usually pointless...

    (the only time you really need ^ is in certain multiline searches; depending
    on what flags you use, ^ can match not only at the beginning of a string,
    but also after each newline character in the target string)

    </F>
     
    Fredrik Lundh, Dec 4, 2005
    #5
  6. Sam Pointon

    Mike Meyer Guest

    John Hazen <> writes:
    >> To do what you said you want to do, you want to use the split method:
    >>
    >> foo = re.compile('foo')
    >> if 2 <= len(foo.split(s)) <= 3:
    >> print "We had one or two 'foo's"

    >
    > Well, this would solve my dumbed down example, but each foo in the
    > original expression was a stand-in for a more complex term.


    That actually doesn't matter. Just replace 'foo' with your more
    complex term.

    >>> foo2 = re.compile(r'foo(\d+)bar')


    > I was using
    > match groups to extract the parts of the match that I wanted. Here's an
    > example (using Tim's correction) that actually demonstrates what I'm
    > doing:
    >>>> s = 'zzzfoo123barxxxfoo456baryyy'
    >>>> s2 = 'zzzfoo123barxxxfooyyy'
    >>>> foobar2 = re.compile(r'^.*?foo(\d+)bar(.*foo(\d+)bar)?.*$')
    >>>> print foobar2.match(s).group(1)

    > 123
    >>>> print foobar2.match(s).group(3)

    > 456


    >>> foo2.split('zzzfoo123barxxxfoo456baryyy')

    ['zzz', '123', 'xxx', '456', 'yyy']
    >>>


    >>>> print foobar2.match(s2).group(1)

    > 123
    >>>> print foobar2.match(s2).group(3)

    > None
    >>>>


    >>> foo2.split('zzzfoo123barxxxfooyyy')

    ['zzz', '123', 'xxxfooyyy']

    > Looking at re.split, it doesn't look like it returns the actual matching
    > text, so I don't think that fits my need.


    split() returns the text matched by groups in the pattern used to do
    the split, and is documented as doing so. The solution you gave
    doesn't return "the actual matching text" for the instances, but just
    the text in the groups inside that text - which is exactly what
    split() returns.

    While on that topic, I'll note that the solution Tim gave you doesn't
    solve the problem as I originally understood it, either. You said you
    wanted to match one or two instances, which I read as only one or two
    instances, so that more than two instances would be treated as a
    failure. On rereading it, I can see where I was wrong.

    >> As the founder of SPARE...

    > Hmm, not a very effective name. A google search didn't fing any obvious
    > hits (even after adding the "python" qualifier, and removing "spare time"
    > and "spare parts" hits). (I couldn't find it off your homepage,
    > either.)


    That's the Society for the Prevention of Abuse of Regular
    Expressions. One of these days, my proof-reader will get back to me
    and I'll add a link about it to my home page.

    <mike
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Dec 4, 2005
    #6
  7. Sam Pointon

    Mike Meyer Guest

    "Fredrik Lundh" <> writes:
    > Mike Meyer wrote:
    >> ^ must match the beginning of the string (BTW, you can get the same
    >> behavior by leaving off the ^ and using search instead of match).

    > that's backwards, isn't it? using ^ with match is usually pointless (since
    > match only looks at the first position anyway), and using ^ with search
    > is also usually pointless...


    Yup, you're right. The '^' was redundant in the OP's post.

    <mike
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Dec 4, 2005
    #7
  8. Sam Pointon

    Aahz Guest

    In article <>,
    Fredrik Lundh <> wrote:
    >
    >that's backwards, isn't it? using ^ with match is usually pointless (since
    >match only looks at the first position anyway), and using ^ with search
    >is also usually pointless...


    While you're technically correct, I've been bitten too many times by
    forgetting whether to use match() or search(). I've fixed that problem
    by choosing to always use search() and combine with ^ as appropriate.
    --
    Aahz () <*> http://www.pythoncraft.com/

    "Don't listen to schmucks on USENET when making legal decisions. Hire
    yourself a competent schmuck." --USENET schmuck (aka Robert Kern)
     
    Aahz, Dec 4, 2005
    #8
  9. Aahz wrote:

    > While you're technically correct, I've been bitten too many times by
    > forgetting whether to use match() or search(). I've fixed that problem
    > by choosing to always use search() and combine with ^ as appropriate.


    that's a bit suboptimal, though, at least for cases where failed matches
    are rather common:

    C:\>timeit -s "import re; p = re.compile('b')" "p.match('a'*100)"
    100000 loops, best of 3: 6.14 usec per loop

    C:\>timeit -s "import re; p = re.compile('^b')" "p.match('a'*100)"
    100000 loops, best of 3: 6.25 usec per loop

    C:\>timeit -s "import re; p = re.compile('^b')" "p.search('a'*100)"
    100000 loops, best of 3: 15.4 usec per loop

    (afaik, search doesn't have any heuristics for figuring out if it can skip
    the search, so it'll check ^ against all available positions)

    on the other hand, benchmarking RE:s always results in confusing
    results:

    C:\>timeit -s "import re; p = re.compile('b')" "p.search('a'*100)"
    100000 loops, best of 3: 4.32 usec per loop

    (should this really be *faster* than match for this case ?)

    </F>
     
    Fredrik Lundh, Dec 5, 2005
    #9
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. kaeli
    Replies:
    3
    Views:
    11,409
  2. Tim Peters

    Re: regexp non-greedy matching bug?

    Tim Peters, Dec 4, 2005, in forum: Python
    Replies:
    0
    Views:
    406
    Tim Peters
    Dec 4, 2005
  3. John Hazen

    Re: regexp non-greedy matching bug?

    John Hazen, Dec 4, 2005, in forum: Python
    Replies:
    0
    Views:
    413
    John Hazen
    Dec 4, 2005
  4. Dan Kelly

    Greedy and non greedy quantifiers

    Dan Kelly, Jan 17, 2008, in forum: Ruby
    Replies:
    4
    Views:
    168
    Robert Klemme
    Jan 19, 2008
  5. Matt Garrish

    greedy v. non-greedy matching

    Matt Garrish, Feb 16, 2004, in forum: Perl Misc
    Replies:
    4
    Views:
    174
    Matt Garrish
    Feb 16, 2004
Loading...

Share This Page