regex help: splitting string gets weird groups

gry · Apr 8, 2010

[ python3.1.1, re.__version__='2.2.1' ]
I'm trying to use re to split a string into (any number of) pieces of
these kinds:
1) contiguous runs of letters
2) contiguous runs of digits
3) single other characters

e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
'.', 'in', '#', '=', 1234]
I tried:

re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()

('1234', 'in', '1234', '=')

Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
group? Is my regexp illegal somehow and confusing the engine?

I *would* like to understand what's wrong with this regex, though if
someone has a neat other way to do the above task, I'm also interested
in suggestions.

MRAB · Apr 8, 2010

gry said:
Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
group? Is my regexp illegal somehow and confusing the engine?

I *would* like to understand what's wrong with this regex, though if
someone has a neat other way to do the above task, I'm also interested
in suggestions.

If the regex was illegal then it would raise an exception. It's doing
exactly what you're asking it to do!

First of all, there are 4 groups, with group 1 containing groups 2..4 as
alternatives, so group 1 will match whatever groups 2..4 match:

Group 1: (([A-Za-z]+)|([0-9]+)|([-.#=]))
Group 2: ([A-Za-z]+)
Group 3: ([0-9]+)
Group 4: ([-.#=])

It matches like this:

Group 1 and group 3 match '555'.
Group 1 and group 2 match 'tHe'.
Group 1 and group 4 match '-'.
Group 1 and group 2 match 'rain'.
Group 1 and group 4 match '.'.
Group 1 and group 2 match 'in'.
Group 1 and group 4 match '#'.
Group 1 and group 4 match '='.
Group 1 and group 3 match '1234'.

If a group matches then any earlier match of that group is discarded,
so:

Group 1 finishes with '1234'.
Group 2 finishes with 'in'.
Group 3 finishes with '1234'.
Group 4 finishes with '='.
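The overwrite behaviour described above can be reproduced in a few lines. This is a minimal sketch using only the standard re module; the short pattern `(\d)+` is an illustrative example that does not appear in the thread.

```python
import re

# A repeated group keeps only the text from its final iteration.
m = re.match(r'(\d)+', '123')
last_digit = m.group(1)   # the captures '1' and '2' were overwritten

# The same rule explains the result from the original pattern.
pattern = re.compile(r'^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$')
groups = pattern.match('555tHe-rain.in#=1234').groups()

print(last_digit)   # 3
print(groups)       # ('1234', 'in', '1234', '=')
```

Each group slot holds one value, so a group inside a repetition cannot report every piece it matched along the way.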

A solution is:

>>> re.findall('[A-Za-z]+|[0-9]+|[-.#=]', '555tHe-rain.in#=1234')

['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

Note: re.findall() returns a list of matches. If the regex doesn't
contain any groups, it returns the matched substrings; if it contains
one group, it returns what that group matched instead. Compare:

>>> re.findall("a(.)", "ax ay")
['x', 'y']
>>> re.findall("a.", "ax ay")
['ax', 'ay']
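To round out the comparison, here is a small sketch of how re.findall() behaves with zero, one, and several groups. The two-group and non-capturing patterns are illustrative additions, not taken from the thread.

```python
import re

text = "ax ay"

no_groups = re.findall(r"a.", text)            # whole matches
one_group = re.findall(r"a(.)", text)          # contents of the single group
two_groups = re.findall(r"(a)(.)", text)       # one tuple per match
non_capturing = re.findall(r"(?:a)(.)", text)  # (?:...) is not reported

print(no_groups)       # ['ax', 'ay']
print(one_group)       # ['x', 'y']
print(two_groups)      # [('a', 'x'), ('a', 'y')]
print(non_capturing)   # ['x', 'y']
```

If grouping is needed only for alternation or repetition, writing it as (?:...) keeps findall() returning plain strings.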

Jon Clements · Apr 8, 2010

I would avoid .match and use .findall (if you walk through them both
together, it'll make sense what's happening with your match string).

s = """555tHe-rain.in#=1234"""
re.findall('[A-Za-z]+|[0-9]+|[-.#=]', s)

['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

hth,

Jon.

Patrick Maupin · Apr 8, 2010

IMO, for most purposes, for people who don't want to become re
experts, the easiest, fastest, best, most predictable way to use re is
re.split. You can either call re.split directly, or, if you are going
to be splitting on the same pattern over and over, compile the pattern
and grab its split method. Use a *single* capture group in the
pattern, that covers the *whole* pattern. In the case of your example
data:

import re
splitter=re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split
s='555tHe-rain.in#=1234'
[x for x in splitter(s) if x]

['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The reason for the list comprehension is that re.split will always
return a non-matching string between matches. Sometimes this is
useful even when it is a null string (see recent discussion in the
group about splitting digits out of a string), but if you don't care
to see null (empty) strings, this comprehension will remove them.
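For reference, a short sketch of what the split returns before the empty strings are filtered out, using the same pattern and sample string as above. The variable names raw and tokens are just illustrative.

```python
import re

splitter = re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split
s = '555tHe-rain.in#=1234'

raw = splitter(s)                  # separators and captured matches alternate
tokens = [x for x in raw if x]     # drop the empty separators

print(raw)
# ['', '555', '', 'tHe', '', '-', '', 'rain', '', '.', '', 'in',
#  '', '#', '', '=', '', '1234', '']
print(tokens)
# ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
```

The even-indexed items are the text between matches (all empty here because every character belongs to some match), and the odd-indexed items are the captured matches.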

The reason for a single capture group that covers the whole pattern is
that it is much easier to reason about the output. The split will
give you all your data, in order, so joining the pieces back together
reproduces the original string.
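The property that the split keeps all the data, in order, can be checked directly. This is a sketch under the assumption that the check is a simple join-and-compare; the second sample string, 'a b', is made up here to show a character that the pattern does not match.

```python
import re

pattern = '[A-Za-z]+|[0-9]+|[-.#=]'
splitter = re.compile('(' + pattern + ')').split

s = '555tHe-rain.in#=1234'
roundtrip_ok = ''.join(splitter(s)) == s

# A space is not covered by the pattern. split() still keeps it,
# while findall() silently drops it.
t = 'a b'
split_keeps_all = ''.join(splitter(t)) == t
findall_keeps_all = ''.join(re.findall(pattern, t)) == t

print(roundtrip_ok)        # True
print(split_keeps_all)     # True
print(findall_keeps_all)   # False
```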

HTH,
Pat

Tim Chase · Apr 8, 2010

gry said:
Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
group? Is my regexp illegal somehow and confusing the engine?

Well, I'm not sure what it thinks it's finding, but nested capture groups
always produce somewhat weird results for me (I suspect that's what's
triggering the duplication). Additionally, you're only searching for
one match: .match() returns a single match object or None, not every
match of the repeated outer group.

Tweaking your original, I used

>>> s='555tHe-rain.in#=1234'
>>> import re
>>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
>>> r.findall(s)

['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The only difference between my results and your results is that the 555
and 1234 come back as strings, not ints.
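If the digit runs are wanted as ints, one possible approach is to name each alternative and convert based on which one matched. This is only a sketch: the group names number, word and other and the tokenize helper are hypothetical, and ASCII-only character classes are assumed.

```python
import re

# One named group per kind of token; the names are illustrative.
token_re = re.compile(
    r'(?P<number>[0-9]+)|(?P<word>[A-Za-z]+)|(?P<other>.)'
)

def tokenize(s):
    """Return the tokens of s, with digit runs converted to int."""
    tokens = []
    for m in token_re.finditer(s):
        text = m.group()
        # m.lastgroup is the name of the alternative that matched.
        tokens.append(int(text) if m.lastgroup == 'number' else text)
    return tokens

print(tokenize('555tHe-rain.in#=1234'))
# [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
```

Using finditer() rather than findall() gives access to the match object, which is what makes the per-token conversion possible.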

-tkc

gry · Apr 8, 2010

....

Group 1 and group 4 match '='.
Group 1 and group 3 match '1234'.

If a group matches then any earlier match of that group is discarded,

Wow, that makes this much clearer! I wonder if this behaviour
shouldn't be mentioned in some form in the python docs?
Thanks much!

Jon Clements · Apr 8, 2010

Avoiding re's (for a bit of fun):
(no good for unicode obviously)

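import string
from itertools import groupby, chain, repeat, count

s = """555tHe-rain.in#=1234"""

# Map every character to a group key: 'L' for letters, 'D' for digits,
# and a distinct integer for each punctuation character.
# (Python 3: the built-in zip replaces itertools.izip.)
unique_group = count()
lookup = dict(
    chain(
        zip(string.ascii_letters, repeat('L')),
        zip(string.digits, repeat('D')),
        zip(string.punctuation, unique_group)
    )
)
# Converters per key; keys without an entry are passed through unchanged.
parse = dict(D=int, L=str.capitalize)

print([parse.get(key, lambda L: L)(''.join(items)) for key, items in
       groupby(s, lambda L: lookup[L])])
[555, 'The', '-', 'Rain', '.', 'In', '#', '=', 1234]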
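On the "no good for unicode" caveat: one possible variation is to classify characters with str methods instead of a lookup table. This is a sketch, not Jon's code; the classify and tokenize helpers are hypothetical. Unlike the version above, it returns every token as a string and always splits repeated punctuation into single characters.

```python
from itertools import groupby

def classify(pair):
    """Group key for an (index, character) pair."""
    index, ch = pair
    if ch.isalpha():
        return 'L'      # run of letters
    if ch.isdecimal():
        return 'D'      # run of digits
    return index        # unique key, so each other character stands alone

def tokenize(s):
    return [''.join(ch for _, ch in group)
            for _, group in groupby(enumerate(s), classify)]

print(tokenize('555tHe-rain.in#=1234'))
# ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
print(tokenize('naïve..42'))
# ['naïve', '.', '.', '42']
```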

Jon.

gry · Apr 8, 2010

>>> s='555tHe-rain.in#=1234'
>>> import re
>>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
>>> r.findall(s)
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

This is nice and simple and has the invertible property that Patrick
mentioned above. Thanks much!

Patrick Maupin · Apr 8, 2010

Yes, like using split(), this is invertible. But you will see a
difference (and for a given task, you might prefer one way or the
other) if, for example, you put a few consecutive spaces in the middle
of your string, where this pattern and findall() will return each
space individually, and split() will return them all together.
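A quick sketch of that difference, using the two patterns from earlier in the thread on a made-up string with three consecutive spaces:

```python
import re

s = 'the   rain'

# findall() with a catch-all '.' returns each space on its own.
findall_tokens = re.findall(r'([a-zA-Z]+|\d+|.)', s)

# split() returns the whole unmatched run as one string.
split_tokens = [x for x in re.split('([A-Za-z]+|[0-9]+|[-.#=])', s) if x]

print(findall_tokens)   # ['the', ' ', ' ', ' ', 'rain']
print(split_tokens)     # ['the', '   ', 'rain']
```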

You *can* fix up the pattern for findall() so that it has the same
properties as the split(), but it will almost always be a more
complicated pattern than for the equivalent split().
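As an illustration of the extra complexity, here is one way the findall() pattern might be extended to behave like the split: add a final alternative that matches a run of anything the other alternatives do not. The patterns and the sample string are assumptions made for this sketch.

```python
import re

split_pattern = re.compile('([A-Za-z]+|[0-9]+|[-.#=])')

# Same three alternatives, plus a negated class for "everything else".
findall_pattern = re.compile('[A-Za-z]+|[0-9]+|[-.#=]|[^A-Za-z0-9.#=-]+')

s = 'the   rain-12'
via_split = [x for x in split_pattern.split(s) if x]
via_findall = findall_pattern.findall(s)

print(via_split)     # ['the', '   ', 'rain', '-', '12']
print(via_findall)   # ['the', '   ', 'rain', '-', '12']
```

The negated class has to be kept in sync with the other alternatives by hand, which is the cost being described.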

Another thing you can do with split(): if you *think* you have a
pattern that fully covers every string you expect to throw at it, but
would like to verify this, you can make use of the fact that split()
returns a string between each match (and before the first match and
after the last match). So if you expect that every character in your
entire string should be a part of a match, you can do something like:

strings = splitter(s)
tokens = strings[1::2]
assert not ''.join(strings[::2])
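The same check can be wrapped into a self-contained helper. This is a sketch: the function name tokenize_strict and the choice of ValueError are assumptions, not something from the thread.

```python
import re

splitter = re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split

def tokenize_strict(s):
    """Return the tokens of s, or raise if any character is unmatched."""
    strings = splitter(s)
    leftovers = ''.join(strings[::2])   # text between, before and after matches
    if leftovers:
        raise ValueError('unmatched characters: %r' % leftovers)
    return strings[1::2]                # the captured matches

print(tokenize_strict('555tHe-rain.in#=1234'))
# ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
```

Raising an exception instead of using assert keeps the check active even when Python runs with assertions disabled.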

Regards,
Pat
