Regexp : repeated group identification

candide

Consider the following code

# ----------------------------
import re

z = re.match(r'(Spam\d)+', 'Spam4Spam2Spam7Spam8')
print(z.group(0))
print(z.group(1))
# ----------------------------

outputting :

----------------------------
Spam4Spam2Spam7Spam8
Spam8
----------------------------

The '(Spam\d)+' regexp is tested against 'Spam4Spam2Spam7Spam8' and the
regexp matches the string.

Group numbered one within the regex '(Spam\d)+' refers to Spam\d

The four substrings

Spam4 Spam2 Spam7 and Spam8

match the group numbered 1.

So I don't understand why z.group(1) gives the last substring (i.e. Spam8,
as the output shows). Why not another one, Spam4 for example?
 
Vlastimil Brom

2011/12/14 candide said:
So I don't understand why z.group(1) gives the last substring (i.e. Spam8,
as the output shows). Why not another one, Spam4 for example?

Hi,
you may find a tiny notice in the re docs on this:
http://docs.python.org/library/re.html#re.MatchObject.group

"If a group is contained in a part of the pattern that matched
multiple times, the last match is returned."
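The documented behaviour can be checked directly with the standard re module. The following is only a small sketch using the string from the question; the variable names are chosen here for illustration.

```python
import re

m = re.match(r'(Spam\d)+', 'Spam4Spam2Spam7Spam8')

# group(0) is the whole match. group(1) is overwritten on every
# repetition of the group, so only the final iteration survives.
whole = m.group(0)
last = m.group(1)
last_span = m.span(1)  # position of the surviving capture

print(whole)      # Spam4Spam2Spam7Spam8
print(last)       # Spam8
print(last_span)  # (15, 20)
```

The span shows that group 1 points at the end of the string, i.e. at the last repetition and not at the first one.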

If you need to work with the content captured in the repeated group,
you may check the new regex implementation:
http://pypi.python.org/pypi/regex

Which has a special "captures" method of the match object for this
(beyond many other improvements):
import regex
m = regex.match(r'(Spam\d)+', 'Spam4Spam2Spam7Spam8')
print(m.captures(1))
# ['Spam4', 'Spam2', 'Spam7', 'Spam8']
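If installing the third-party regex module is not an option, a similar result can probably be had with the standard re module alone by matching first and then rescanning the matched text with the unrepeated sub-pattern. This is a workaround sketched under that assumption, not an equivalent of captures(); the names below are illustrative.

```python
import re

text = 'Spam4Spam2Spam7Spam8'

# Step 1: make sure the whole repeated pattern matches.
z = re.match(r'(Spam\d)+', text)

# Step 2: rescan the matched text with the pattern of a single
# repetition to recover every occurrence, not just the last one.
all_parts = re.findall(r'Spam\d', z.group(0))

print(all_parts)  # ['Spam4', 'Spam2', 'Spam7', 'Spam8']
```

This works here because the repetitions are adjacent and the inner pattern is simple; for nested or overlapping groups the rescan may not reproduce what the repeated group actually captured.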

hth,
vbr
 
candide

On 14/12/2011 12:34, Vlastimil Brom wrote:
"If a group is contained in a part of the pattern that matched
multiple times, the last match is returned."

I missed this point, your answer matches my question ;) thanks.

If you need to work with the content captured in the repeated group,
you may check the new regex implementation:
http://pypi.python.org/pypi/regex

Which has a special "captures" method of the match object for this
(beyond many other improvements):
import regex
m = regex.match(r'(Spam\d)+', 'Spam4Spam2Spam7Spam8')
print(m.captures(1))
# ['Spam4', 'Spam2', 'Spam7', 'Spam8']


Thanks for the reference and the example. I didn't know about this
reimplementation. I hope it offers the Aho-Corasick algorithm, which
allows searching for multiple keys at once.
 
Vlastimil Brom

2011/12/14 candide said:
Thanks for the reference and the example. I didn't know about this
reimplementation. I hope it offers the Aho-Corasick algorithm, which
allows searching for multiple keys at once.

Hi,
I am not sure about the underlying algorithm (it could as well be an
internal expansion of the alternatives like ...|...|...), but you can
use a list (set, actually) of alternatives to search for.
Check the "named lists" feature, written as
\L<...>
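For comparison, the "expansion of the alternatives" idea can be sketched with the standard re module only. This is neither Aho-Corasick nor a statement about how the regex module handles named lists internally; the keys and the sample text below are made up for illustration.

```python
import re

# Hypothetical set of keys to search for.
keys = {'Spam', 'Eggs', 'Spamalot'}

# Try longer keys first so 'Spamalot' is preferred over its prefix 'Spam',
# and escape each key so it is matched literally.
ordered = sorted(keys, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in ordered))

found = pattern.findall('Eggs, Spam and Spamalot')
print(found)  # ['Eggs', 'Spam', 'Spamalot']
```

With many keys, an alternation like this can become slow in a backtracking engine, which is presumably why a dedicated multi-key algorithm was asked about.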

hth,
vbr
 
