Something confusing about non-greedy reg exp match

gburdell1

If I do this:

import re
a=re.search(r'hello.*?money', 'hello how are you hello funny money')

I would expect a.group(0) to be "hello funny money", since .*? is a
non-greedy match. But instead, I get the whole sentence, "hello how
are you hello funny money".

Is this expected behavior? How can I specify the correct regexp so
that I get "hello funny money" ?
 
r

Here's one way, but it depends greatly on the actual pattern you
seek...

re.search(r'hello \w+ money', 'hello how are you hello funny money').group()

Your regex matches the whole string because it means "hello" followed
by any number of *anythings* up to "money". You see?


wisdom = get_enlightend(http://jjsenlightenments.blogspot.com/)
 
Mark Tolonen

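A non-greedy match matches the fewest characters before matching the text
*after* the non-greedy match. For example, against the string
'hello how are you hello funny money and more money', the non-greedy
hello.*?money stops at the first "money" and returns

'hello how are you hello funny money'

while the greedy hello.*money runs on to the last "money" and returns

'hello how are you hello funny money and more money'

This is why it is difficult to use regular expressions to match nested
objects like parentheses or XML tags. In your case you'll need something
extra to not match the first hello, so that the result is

'hello funny money'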

-Mark
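
The "something extra" is not spelled out in the post. One possibility, given here as a sketch and not as Mark's own pattern, is to put a greedy .* in front of a capture group, so that the captured "hello" is pushed as far right as it can go. The variable names are illustrative only.

```python
import re

text = 'hello how are you hello funny money and more money'

# The leading greedy .* swallows as much as it can, then backs off only
# until "hello.*?money" can match, so group(1) starts at the last "hello".
# Inside the group, the non-greedy .*? stops at the first "money" after it.
closest = re.search(r'.*(hello.*?money)', text).group(1)
print(closest)  # hello funny money
```

This finds only one result per search (the one beginning at the last "hello" that is followed by "money"), so it does not help when several separate hello...money pairs are wanted.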
 
r

EDIT:
Your regex matches the whole string because it means...

"hello" followed by any number of *anythings* up to the first
occurrence of "money".

You see?
 
George Burdell

I see now. I also understand r's response. But what if there are many
"hello"'s before "money," and I don't know how many there are? In
other words, I want to find every occurrence of "money," and for each
occurrence, I want to scan back to the first occurrence of "hello."
How can this be done?
 
George Burdell

I should say the "closest" occurrence of "hello," to be more clear.
 
George Burdell

I see now. I also understand r's response. But what if there are many
"hello"'s before "money," and I don't know how many there are? In
other words, I want to find every occurrence of "money," and for each
occurrence, I want to scan in the reverse (left) direction to the
closest occurrence of "hello." How can this be done?
 
7stud

Is this expected behavior?

Yes. search() finds the *first* match, meaning the one that starts at
the leftmost position where any match is possible. The non-greedy
quantifier does not transform search() into a function that finds all
possible matches and then picks the shortest one. Instead, it makes
search() return the shortest match that begins at that leftmost
position (versus the default, which is the longest match beginning
there). In your case only one match can begin at the first "hello",
so the non-greedy quantifier changes nothing.
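
A small sketch of the point above: the start of the match is decided first, and only then does the quantifier pick the length. Passing a start position to a compiled pattern's search() (the second argument) makes the later, shorter match visible. The variable names are made up for the illustration.

```python
import re

text = 'hello how are you hello funny money'
pattern = re.compile(r'hello.*?money')

# The leftmost possible start wins, even though a shorter match exists later.
first = pattern.search(text)
print(first.start(), first.group(0))   # 0 hello how are you hello funny money

# Begin scanning at index 1, just past the start of the first "hello".
later = pattern.search(text, 1)
print(later.start(), later.group(0))   # 18 hello funny money
```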
 
MRAB

George said:
I see now. I also understand r's response. But what if there are many
"hello"'s before "money," and I don't know how many there are? In
other words, I want to find every occurrence of "money," and for each
occurrence, I want to scan in the reverse (left) direction to the
closest occurrence of "hello." How can this be done?
'hello funny money'
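
The pattern that produced this output is missing from the post. One pattern that gives the same result, offered as an assumption and not as MRAB's own, uses a negative lookahead so that the stretch between "hello" and "money" can never contain another "hello".

```python
import re

# (?:(?!hello).)*? consumes one character at a time, but only at positions
# where a new "hello" does not begin, so a match cannot span a second "hello".
pattern = re.compile(r'hello(?:(?!hello).)*?money')

print(pattern.findall('hello how are you hello funny money'))
# ['hello funny money']

text = 'hello a hello b money hello c money'
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
# ['hello b money', 'hello c money']
```

Because finditer() returns non-overlapping matches, each "hello" is paired with at most one "money"; a second "money" that follows with no new "hello" before it is not reported.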
 
Paul McGuire

By recognising the task: not expression matching, but lexing and
parsing. For which you might find the ‘pyparsing’ library of use
<URL:http://pyparsing.wikispaces.com/>.

Even pyparsing has to go through some gyrations to do this sort of
"match, then backup" parsing. Here is my solution:
[['hello funny money']]


SkipTo is analogous to the OP's .*?, but the failOn attribute adds the
logic "if this string is found before matching the target string, then
fail". So pyparsing scans through the string, matches the first
"hello", attempts to skip to the next occurrence of "money", but finds
another "hello" first, so this parse fails. Then the scan continues
until the next "hello" is found, and this time, SkipTo successfully
finds "money" without first hitting a "hello". I then had to wrap the
whole thing in a helper method originalTextFor, otherwise I get an
ugly grouping of separate strings.

So I still don't really have any kind of "backup after matching"
parsing, I just turned this into a qualified forward match. One could
do a similar thing with a parse action. If you could attach some kind
of validating function to a field within a regex, you could have done
the same thing there.

-- Paul
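
The "qualified forward match" described above can be imitated in plain Python without any parsing library. This is only a sketch of the logic as described in the post (skip to the target, but fail if another string turns up first); it is not pyparsing's implementation, and the helper names skip_to and qualified_matches are hypothetical.

```python
def skip_to(text, start, target, fail_on):
    """Return the index just past the next `target` at or after `start`,
    or None if `fail_on` appears first or `target` never appears."""
    target_at = text.find(target, start)
    if target_at == -1:
        return None
    fail_at = text.find(fail_on, start)
    if fail_at != -1 and fail_at < target_at:
        return None
    return target_at + len(target)


def qualified_matches(text, opener='hello', target='money'):
    """Collect every opener...target span that contains no second opener."""
    results = []
    pos = 0
    while True:
        begin = text.find(opener, pos)
        if begin == -1:
            break
        end = skip_to(text, begin + len(opener), target, fail_on=opener)
        if end is None:
            pos = begin + 1      # this opener fails; keep scanning forward
        else:
            results.append(text[begin:end])
            pos = end            # resume after the completed match
    return results


print(qualified_matches('hello how are you hello funny money'))
# ['hello funny money']
```

As in the post, nothing here backs up after finding "money"; each "hello" is tried in turn and rejected if another "hello" comes before the next "money".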
 
