Capturing repeating group matches in regular expressions

James Collier

Is it possible to capture the results of repeating group matches in the
python regular expression module?

To illustrate, what I want is:
>>> import re
>>> re1 = re.compile("([a-z]W)([a-z]X)+([a-z]Y)")
>>> mo1 = re1.match("aWbXcXdXeXfY")
>>> print mo1.groupsButNotAsWeKnowIt()  # hypothetical method, does not exist
('aW', 'bX', 'cX', 'dX', 'eX', 'fY')

instead of
('aW', 'eX', 'fY')

.... which captures only the last match from the second group.
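
For reference, here is a small runnable sketch of what the standard re module does with that pattern. It is written for Python 3, which is an assumption on my part, and the variable names are only illustrative.

```python
import re

# Same pattern as in the question: one W token, one or more X tokens, one Y token.
pattern = re.compile(r"([a-z]W)([a-z]X)+([a-z]Y)")
match = pattern.match("aWbXcXdXeXfY")

# groups() keeps a single value per group, so the repeated group
# reports only its final repetition.
print(match.groups())  # ('aW', 'eX', 'fY')
```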

Of course, one option is to break out the substring containing the
repeating group and then use split() or findall() within the substring,
but, but, but ... I'd like to do it in one hit if possible.
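
A minimal sketch of that two-step workaround, assuming Python 3. The helper name all_tokens is hypothetical and not part of re. An outer group wraps the whole repeated run, and findall() then splits that run into its pieces.

```python
import re

# The outer group captures the whole run of X tokens as one substring;
# the inner (?:...) group repeats without capturing.
OUTER = re.compile(r"([a-z]W)((?:[a-z]X)+)([a-z]Y)")
INNER = re.compile(r"[a-z]X")


def all_tokens(text):
    """Return every token as a tuple, or None if text does not match."""
    match = OUTER.match(text)
    if match is None:
        return None
    first, run, last = match.groups()
    # Second step: break the captured run into individual X tokens.
    return (first, *INNER.findall(run), last)


print(all_tokens("aWbXcXdXeXfY"))  # ('aW', 'bX', 'cX', 'dX', 'eX', 'fY')
```

This still takes two passes over part of the string, so it is not the "one hit" asked for, but it keeps the change small.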

I believe someone has raised a similar question before, but I can't find
a definitive answer. It may be so stunningly obvious that nobody ever
bothers to answer - if so, could some kind soul please humour me and at
least point me to what I'm not seeing in the Fine Manual.

Thanks,
James.
 
Paul McGuire

It's a bit wordy, but perhaps the ability to easily structure and retrieve
your returned tokens may sway you.

Download pyparsing at http://pyparsing.sourceforge.net

-- Paul


from pyparsing import Word,OneOrMore

# define parse grammar
lowers = "abcdefghijklmnopqrstuvwxyz"
endsWithW = Word(lowers,"W",exact=2)
endsWithX = Word(lowers,"X",exact=2)
endsWithY = Word(lowers,"Y",exact=2)

patt = endsWithW.setResultsName("W") + \
       OneOrMore( endsWithX ).setResultsName("X") + \
       endsWithY.setResultsName("Y")

# extract tokens from input string
tokens = patt.parseString("aWbXcXdXeXfY")

# tokens can be accessed as a list
print "tokens:",tokens

# tokens can be coerced to be a true list
print "tokens.asList():",tokens.asList()

# tokens can be a dictionary, if results names specified
print "tokens.keys():",tokens.keys()
print "tokens['W']:",tokens['W']
print "tokens['X']:",tokens['X']
print "tokens['Y']:",tokens['Y']

# if results names are valid attribute names, tokens can even be accessed as attributes
print "tokens.W:",tokens.W
print "tokens.X:",tokens.X
print "tokens.Y:",tokens.Y


Gives:

tokens: ['aW', 'bX', 'cX', 'dX', 'eX', 'fY']
tokens.asList(): ['aW', 'bX', 'cX', 'dX', 'eX', 'fY']
tokens.keys(): ['Y', 'X', 'W']
tokens['W']: aW
tokens['X']: ['bX', 'cX', 'dX', 'eX']
tokens['Y']: fY
tokens.W: aW
tokens.X: ['bX', 'cX', 'dX', 'eX']
tokens.Y: fY
 
James Collier

Michael said:
Not easily; there's a small discussion on python-dev at the moment
about this. Erik Heneryd hacked up something that might be useful:

http://mail.python.org/pipermail/python-dev/attachments/20040810/a5e602ab/structmatch.py

And there's always the "use a real parser" option :)

Cheers,
mwh

Many thanks for the answer Michael - I take your point on the "real parser"
option, but I don't feel that the nut I'm cracking has that thick a shell.

It is some coincidence that this should be under current discussion on
python-dev. For what it's worth, I'd support Mike Coleman's PEP.

To give some more background, I'm tweaking someone else's code and therefore
I want to keep the change as concise as is reasonable. structmatch() is exactly
what I'm looking for - but for now I'll just split the task into two parts.

Thanks again -- James.
 
Erik Heneryd
