regex problem

Odd-R.

Input is a string of four-digit sequences, possibly
joined in pairs by a -, for instance like this:

"1234,2222-8888,4567,"

My regular expression is like this:

rx1=re.compile(r"""\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z""")

When running rx1.findall("1234,2222-8888,4567,")

I only get the last match as the result. Isn't
findall supposed to return all the matches?

Thanks in advance.
 
Thomas Guettler

On Tue, 26 Jul 2005 09:57:23 +0000, Odd-R. wrote:
Input is a string of four digit sequences, possibly
separated by a -, for instance like this

"1234,2222-8888,4567,"

My regular expression is like this:

rx1=re.compile(r"""\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z""")

Hi,

try it without \A and \Z (and without the * after the group)

import re
rx1=re.compile(r"""(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)""")
print(rx1.findall("1234,2222-8888,4567,"))
# --> ['1234,', '2222-8888,', '4567,']

Thomas
 
John Machin

Odd-R. said:
Input is a string of four digit sequences, possibly
separated by a -, for instance like this

"1234,2222-8888,4567,"

My regular expression is like this:

rx1=re.compile(r"""\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z""")

When running rx1.findall("1234,2222-8888,4567,")

I only get the last match as the result. Isn't
findall supposed to return all the matches?

For a start, an expression that starts with \A and ends with \Z will
match the whole string (or not match at all). You have only one match.

Secondly, as you have a group in your expression, findall returns what
the group matches. Your expression matches zero or more of what your
group matches, provided there is nothing else at the start/end of the
string. The "zero or more" makes the re engine waltz about a bit; when
the music stopped, the group was matching "4567,".
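To make the first two points concrete, here is a small sketch using the pattern and input from the original post. The variable names (text, anchored, m) are mine, not from the thread.

```python
import re

text = "1234,2222-8888,4567,"

# The original anchored pattern: a single group, repeated with *.
anchored = re.compile(r"\A(\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,)*\Z")

m = anchored.search(text)
print(m.group(0))              # the one and only match spans the whole string
print(m.group(1))              # the group keeps only its final repetition: 4567,
print(anchored.findall(text))  # findall reports the group, hence ['4567,']
```

A repeated group is overwritten on each pass through the loop, so only its last repetition survives in the result.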

Thirdly, findall should be thought of as merely a wrapper around a loop
using the search method -- it finds all non-overlapping matches of a
pattern. So the clue to get from this is that you need a really simple
pattern, like the following. You *don't* have to write an expression
that does the looping.
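As a rough illustration of the "wrapper around a loop" idea, the sketch below reimplements findall by hand for a pattern that has no groups. The helper name findall_by_hand is hypothetical, and this is only an approximation of what the re module does internally (it ignores the group-handling rules of the real findall).

```python
import re

def findall_by_hand(pattern, text):
    """Collect non-overlapping matches by calling search in a loop."""
    results = []
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        results.append(m.group(0))
        # Resume after this match; step one character past an empty match
        # so the loop always makes progress.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results

item = re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")
print(findall_by_hand(item, "1234,2222-8888,4567,"))
```

Because the loop lives in findall, the pattern only has to describe a single item.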

So here's the mean lean no-flab version -- you don't even need the
parentheses (sorry, Thomas).

rx1=re.compile(r"""\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,""")
rx1.findall("1234,2222-8888,4567,")

['1234,', '2222-8888,', '4567,']

HTH,
John
 
Duncan Booth

John said:
So here's the mean lean no-flab version -- you don't even need the
parentheses (sorry, Thomas).

rx1=re.compile(r"""\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,""")
rx1.findall("1234,2222-8888,4567,")

['1234,', '2222-8888,', '4567,']

No flab? What about all that repetition of \d? A less flabby version:
['1234,', '2222-8888,', '4567,']
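
The code for the less flabby version did not survive in this copy of the thread; only its output did. Going by the follow-up reply (factor out the common prefix, make the -1234 part optional, use a {4} count), it was probably something along the lines of the sketch below. Treat the factored pattern as my guess, not as a quote of the original post.

```python
import re

text = "1234,2222-8888,4567,"

# The spelled-out version from earlier in the thread.
spelled_out = re.compile(r"\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,")

# Assumed reconstruction: shared prefix \d{4}, then an optional "-dddd" part.
# The group is non-capturing so findall still returns whole matches.
factored = re.compile(r"\b\d{4}(?:-\d{4})?,")

print(factored.findall(text))
print(factored.findall(text) == spelled_out.findall(text))
```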
 
John Machin

Duncan said:
John Machin wrote:

So here's the mean lean no-flab version -- you don't even need the
parentheses (sorry, Thomas).

rx1=re.compile(r"""\b\d\d\d\d,|\b\d\d\d\d-\d\d\d\d,""")
rx1.findall("1234,2222-8888,4567,")

['1234,', '2222-8888,', '4567,']


No flab? What about all that repetition of \d? A less flabby version:


['1234,', '2222-8888,', '4567,']


OK, good idea to factor out the prefix and follow it by optional -1234.
However optimising re engines do common prefix factoring, *and* they
rewrite stuff like x{4} as xxxx.

Cheers,
John
 
Odd-R.

['1234,', '2222-8888,', '4567,']

Thanks all for the good advice. However, this last expression
also matches the first four digits when an item in the input has
more than four digits. To rule that out, I first match the whole
string against this:

regex=re.compile(r"""\A(\b\d{4},|\d{4}-\d{4},)*(\b\d{4}|\d{4}-\d{4})\Z""")

If this matches, I run findall with your expression, and then I get
the desired result.
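
A minimal sketch of this validate-first, extract-second approach follows. Note the assumptions: the names VALID, ITEM and parse_items are made up for illustration, it relies on re.fullmatch (Python 3.4 or later) in place of \A...\Z, and it assumes every item ends with a comma, as in the sample from the first post. The validation pattern posted above expects the last item without a trailing comma, so adjust to whichever format the real input has.

```python
import re

# Assumed format: every item is dddd or dddd-dddd, each followed by a comma.
VALID = re.compile(r"(?:\d{4}(?:-\d{4})?,)*")
ITEM = re.compile(r"\d{4}(?:-\d{4})?,")

def parse_items(text):
    """Validate the whole string first, then pull out the individual items."""
    if VALID.fullmatch(text) is None:
        raise ValueError("malformed input: %r" % text)
    return ITEM.findall(text)

print(parse_items("1234,2222-8888,4567,"))

try:
    parse_items("1234,56789,")   # second item has five digits
except ValueError as exc:
    print(exc)
```

Without the validation step, findall would quietly return the items it can find and skip the malformed one.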
 
