Groups in regular expressions don't repeat as expected


John Nagle

Here's something that surprised me about Python regular expressions.
>>> import re
>>> krex = re.compile(r"^([a-z])+$")
>>> s = "abcdef"
>>> ms = krex.match(s)
>>> ms.groups()
('f',)

The parentheses indicate a capturing group within the
regular expression, and the "+" indicates that the
group can appear one or more times. The regular
expression matches that way. But instead of returning
a captured group for each character, it returns only the
last one.

The documentation in fact says that, at

http://docs.python.org/library/re.html

"If a group is contained in a part of the pattern that matched multiple
times, the last match is returned."

That's kind of lame, though. I'd expect that there would be some way
to retrieve all matches.

John Nagle
 
Neil Cerutti

Try .findall.
 
MRAB

You should take a look at the regex module on PyPI. :)
 
John Nagle

Findall does something a bit different. It returns a list of
matches of the entire pattern, not repeats of groups within
the pattern.

Consider a regular expression for matching domain names:
>>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
>>> s = 'www.example.com'
>>> ms = kre.match(s)
>>> ms.groups()
('www', 'com')
>>> msall = kre.findall(s)
>>> msall
[('www', 'com')]

This is just a simple example. But it illustrates an unnecessary
limitation. The matcher can do the repeated matching; you just can't
get the results out.
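
With the standard re module, one workaround is to use the pattern only to validate the name and then recover the labels from the matched text. This is only a rough sketch: the helper name domain_labels is made up for illustration, and it sidesteps the limitation rather than removing it.

```python
import re

# Same pattern as in the example above; it checks the overall shape of the name.
DOMAIN_RE = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')

def domain_labels(name):
    """Hypothetical helper: return every dot-separated label, or None on no match."""
    m = DOMAIN_RE.match(name)
    if m is None:
        return None
    # The match object keeps only the last repeat of group 2,
    # so the labels are recovered by splitting the matched text.
    return m.group(0).split('.')

print(domain_labels('www.example.com'))  # ['www', 'example', 'com']
print(domain_labels('not a domain'))     # None
```

That works here because the separator is a fixed character. For patterns where the repeated part cannot be found by a simple split, this trick does not carry over.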

John Nagle
 
Neil Cerutti

Thanks for the further explanation.

Assuming a fake API that returned multiple group matches as a
tuple:
>>> print(re.match(r"^([a-z])+$", "abcdef").groups())
(('a', 'b', 'c', 'd', 'e', 'f'),)

I was thinking of applying findall something like this, but you
have to make multiple calls:
>>> m = re.match(r"^[a-z]+$", s)
>>> if m:
...     print(re.findall(r"[a-z]", m.group()))
...
['a', 'b', 'c', 'd', 'e', 'f']

I can see that getting really annoying. Is there a better way to
make multiple group matches accessible without adding a third
element type as a group element?
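
For what it's worth, the two calls above can at least be folded into one helper so the annoyance lives in a single place. This is a sketch only: repeated_captures is a hypothetical name, not part of the re module, and it assumes the inner pattern can be applied independently to the text the outer pattern matched.

```python
import re

def repeated_captures(outer, inner, text):
    """Hypothetical helper: validate `text` with `outer`, then list every `inner` match.

    Returns None when the outer pattern does not match at all.
    """
    m = re.match(outer, text)
    if m is None:
        return None
    # finditer walks the matched text and yields one match object per repeat.
    return [im.group() for im in re.finditer(inner, m.group())]

print(repeated_captures(r"^[a-z]+$", r"[a-z]", "abcdef"))
# ['a', 'b', 'c', 'd', 'e', 'f']
print(repeated_captures(r"^[a-z]+$", r"[a-z]", "abc123"))
# None
```

It still runs two passes over the string, so it hides the extra call rather than avoiding it.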
 
Vlastimil Brom

2011/4/20 John Nagle said:
Here's something that surprised me about Python regular expressions.
krex = re.compile(r"^([a-z])+$")
s = "abcdef"
ms = krex.match(s)
ms.groups()
('f',)

...

"If a group is contained in a part of the pattern that matched multiple
times, the last match is returned."

That's kind of lame, though. I'd expect that there would be some way
to retrieve all matches.

                                       John Nagle

Hi,
do you mean something like:
>>> import regex
>>> ms = regex.match(r"^([a-z])+$", "abcdef")
>>> ms.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']

>>> help(ms.captures)
Help on built-in function captures:

captures(...)
    captures([group1, ...]) --> list of strings or tuple of list of strings.
    Return the captures of one or more subgroups of the match. If there is a
    single argument, the result is a list of strings; if there are multiple
    arguments, the result is a tuple of lists with one item per argument; if
    there are no arguments, the captures of the whole match is returned. Group
    0 is the whole match.

cf.
http://pypi.python.org/pypi/regex

hth,
vbr
 
Vlastimil Brom

2011/4/20 MRAB said:
You should take a look at the regex module on PyPI. :)

Ah well...
sorry for possibly destroying the point and the aha! effect ...
vbr
 
John Nagle

The most elegant solution would be to have a regular expression
function that returned a tree of tuples or lists. Then you could
express an entire language syntax as a regular expression and
get out a parse tree.

Since the regular expression system is actually doing that work,
then discarding the results, it seems a reasonable extension.
I'm not suggesting extending regular expression matching itself,
just the way the results are stored.
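
To make the shape of such a result concrete, here is a rough sketch of the kind of nested tuple I have in mind. The toy "key=v1,v2;key=v3" syntax and the parse_tree name are invented for illustration, and the tree is built by hand with str.split after the regular expression has validated the input, precisely because re does not hand back the repeated captures.

```python
import re

ITEM = r"[a-z]+"
# One entry is "key=value,value,..."; the whole text is entries joined by ";".
ENTRY = rf"{ITEM}={ITEM}(?:,{ITEM})*"
WHOLE = re.compile(rf"{ENTRY}(?:;{ENTRY})*")

def parse_tree(text):
    """Hypothetical helper: return a tuple of (key, (values...)) pairs, or None."""
    # fullmatch validates the complete string against the toy grammar.
    if WHOLE.fullmatch(text) is None:
        return None
    tree = []
    for entry in text.split(";"):
        key, _, values = entry.partition("=")
        tree.append((key, tuple(values.split(","))))
    return tuple(tree)

print(parse_tree("a=x,y;b=z"))  # (('a', ('x', 'y')), ('b', ('z',)))
print(parse_tree("a=;b"))       # None
```

A matcher that kept every repeat could return that structure directly from the nested groups, without the second pass.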

John Nagle
 
