Groups in regular expressions don't repeat as expected

John Nagle · Apr 20, 2011

Here's something that surprised me about Python regular expressions.

>>> import re
>>> krex = re.compile(r"^([a-z])+$")
>>> s = "abcdef"
>>> ms = krex.match(s)
>>> ms.groups()
('f',)

The parentheses indicate a capturing group within the
regular expression, and the "+" indicates that the
group can appear one or more times. The regular
expression matches that way. But instead of returning
a captured group for each character, it returns only the
last one.

The documentation does in fact say so, at

http://docs.python.org/library/re.html

"If a group is contained in a part of the pattern that matched multiple
times, the last match is returned."

That's kind of lame, though. I'd expect that there would be some way
to retrieve all matches.
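
A minimal sketch of the behaviour described above, using only the standard `re` module. It compares the span of the whole match with the span of group 1, which suggests that the group keeps only its final iteration:

```python
import re

ms = re.match(r"^([a-z])+$", "abcdef")

# The whole pattern consumed all six characters...
print(ms.group(0))  # abcdef
print(ms.span(0))   # (0, 6)

# ...but group 1 only remembers the last repetition.
print(ms.group(1))  # f
print(ms.span(1))   # (5, 6)
```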

John Nagle

Neil Cerutti · Apr 20, 2011

.findall

MRAB · Apr 20, 2011

You should take a look at the regex module on PyPI.

John Nagle · Apr 20, 2011

Neil Cerutti said:
.findall

Findall does something a bit different. It returns a list of
matches of the entire pattern, not repeats of groups within
the pattern.
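
To illustrate the difference, here is a small sketch with the standard `re` module. The input string "abc def" is an assumption chosen for the example, not something from the posts above:

```python
import re

# One result per match of the *entire* pattern.
whole = re.findall(r"[a-z]+", "abc def")
print(whole)  # ['abc', 'def']

# With a repeated capturing group, findall reports the group's value for
# each overall match, and that value is still only the last repetition.
last_only = re.findall(r"([a-z])+", "abc def")
print(last_only)  # ['c', 'f']
```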

Consider a regular expression for matching domain names:

>>> import re
>>> kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')
>>> s = 'www.example.com'
>>> ms = kre.match(s)
>>> ms.groups()
('www', 'com')
>>> msall = kre.findall(s)
>>> msall
[('www', 'com')]

This is just a simple example. But it illustrates an unnecessary
limitation. The matcher can do the repeated matching; you just can't
get the results out.
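
One possible workaround with the standard library is a second pass: use the full pattern only to validate the string, then pull out the labels separately. This is a sketch, not a fix for the limitation itself. The names `label_re` and `domain_labels` are hypothetical, introduced only for this example:

```python
import re

# Same pattern as in the post; used here only to validate the overall shape.
kre = re.compile(r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$')

# Hypothetical helper pattern for a single label (an assumption for this sketch).
label_re = re.compile(r'[a-zA-Z0-9\-]+')

def domain_labels(s):
    """Return every label of s, or None if s does not match kre."""
    if kre.match(s) is None:
        return None
    return label_re.findall(s)

print(domain_labels('www.example.com'))  # ['www', 'example', 'com']
print(domain_labels('localhost'))        # None
```

The cost is that the string is scanned twice and the label syntax is written down twice, which is the kind of duplication the post is complaining about.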

John Nagle

Neil Cerutti · Apr 21, 2011

Thanks for the further explanation.

Assuming a fake API that returned multiple group matches as a
tuple:

>>> print(re.match(r"^([a-z])+$", "abcdef").groups())
(('a', 'b', 'c', 'd', 'e', 'f'),)

I was thinking of applying findall something like this, but you
have to make multiple calls:

>>> m = re.match(r"^[a-z]+$", s)
>>> if m:
...     print(re.findall(r"[a-z]", m.group()))
...
['a', 'b', 'c', 'd', 'e', 'f']

I can see that getting really annoying. Is there a better way to
make multiple group matches accessible without adding a third
element type as a group element?

Vlastimil Brom · Apr 21, 2011

2011/4/20 John Nagle said:
Here's something that surprised me about Python regular expressions.

>>> krex = re.compile(r"^([a-z])+$")
>>> s = "abcdef"
>>> ms = krex.match(s)
>>> ms.groups()
('f',)

...

"If a group is contained in a part of the pattern that matched multiple
times, the last match is returned."

That's kind of lame, though. I'd expect that there would be some way
to retrieve all matches.

John Nagle

Hi,
do you mean something like:

>>> import regex
>>> ms = regex.match(r"^([a-z])+$", "abcdef")
>>> ms.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']
>>> help(ms.captures)
Help on built-in function captures:

captures(...)
    captures([group1, ...]) --> list of strings or tuple of list of strings.
    Return the captures of one or more subgroups of the match. If there is a
    single argument, the result is a list of strings; if there are multiple
    arguments, the result is a tuple of lists with one item per argument; if
    there are no arguments, the captures of the whole match is returned. Group
    0 is the whole match.

cf.
http://pypi.python.org/pypi/regex
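
Applied to the domain-name pattern from earlier in the thread, `captures` might be used as in the sketch below. It assumes the third-party `regex` package is installed, and the import guard is only there so the snippet degrades gracefully when it is not:

```python
try:
    import regex  # third-party package from PyPI, not part of the standard library
except ImportError:
    regex = None

pattern = r'^([a-zA-Z0-9\-]+)(?:\.([a-zA-Z0-9\-]+))+$'
s = 'www.example.com'

if regex is not None:
    m = regex.match(pattern, s)
    print(m.groups())     # ('www', 'com'), the same as with re
    print(m.captures(1))  # ['www']
    print(m.captures(2))  # ['example', 'com']
else:
    print("regex module not installed")
```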

hth,
vbr

Vlastimil Brom · Apr 21, 2011

2011/4/20 MRAB said:
You should take a look at the regex module on PyPI.

Ah well...
sorry for possibly destroying the point and the aha! effect ...
vbr

John Nagle · Apr 24, 2011

Neil Cerutti said:
I can see that getting really annoying. Is there a better way to
make multiple group matches accessible without adding a third
element type as a group element?

The most elegant solution would be to have a regular expression
function that returned a tree of tuples or lists. Then you could
express an entire language syntax as a regular expression and
get out a parse tree.

Since the regular expression system is actually doing that work,
then discarding the results, it seems a reasonable extension.
I'm not suggesting extending regular expression matching itself,
just the way the results are stored.
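
As a rough approximation of that idea with today's `re` module, one can drive the matcher by hand and collect a tuple per repetition. This is only a sketch of the kind of nested result being described, not the proposed extension. The key=value mini-grammar and the names `PAIR` and `parse_pairs` are assumptions made up for the example:

```python
import re

# Hypothetical mini-grammar: comma-separated key=value pairs.
PAIR = re.compile(r"(\w+)=(\w+)")

def parse_pairs(text):
    """Return a list of (key, value) tuples, or None if text is malformed."""
    pos, out = 0, []
    while True:
        m = PAIR.match(text, pos)   # anchored match starting at pos
        if m is None:
            return None
        out.append(m.groups())      # keep every repetition, not just the last
        pos = m.end()
        if pos == len(text):
            return out
        if text[pos] != ",":
            return None
        pos += 1                    # skip the separator

print(parse_pairs("a=1,b=2,c=3"))  # [('a', '1'), ('b', '2'), ('c', '3')]
print(parse_pairs("a=1,,b=2"))     # None
```

The loop does explicitly what the repeated group does internally, which is why keeping the intermediate results inside the matcher looks like a natural extension.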

John Nagle
