Trouble with regular expressions

LaundroMat · Feb 7, 2009

Hi,

I'm quite new to regular expressions, and I wonder if anyone here
could help me out.

I'm looking to split strings that ideally look like this: "Update: New
item (Household)" into a group.
This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
("Update", "New item", "(Household)")

Some strings will look like this however: "Update: New item (item)
(Household)". The expression above still does its job, as it returns
("Update", "New item (item)", "(Household)").

It does not work however when there is no text in parentheses (eg
"Update: new item"). How can I get the expression to return a tuple
such as ("Update:", "new item", None)?

Thanks in advance,

Mathieu

John Machin · Feb 7, 2009

I don't see how it can be done without some post-matching adjustment.
Try this:

C:\junk>type mathieu.py
import re

tests = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
    "minimal",
    "parenthesis (plague) (has) (struck)",
    ]

regex = re.compile(r"""
    (Update:)?      # optional prefix
    \s*             # ignore whitespace
    ([^()]*)        # any non-parentheses stuff
    (\([^()]*\))?   # optional (blahblah)
    \s*             # ignore whitespace
    (\([^()]*\))?   # another optional (blahblah)
    $
    """, re.VERBOSE)

for i, test in enumerate(tests):
    print("Test #%d: %r" % (i, test))
    m = regex.match(test)
    if not m:
        print("No match")
    else:
        g = m.groups()
        print(g)
        if g[3] is not None:
            # two parenthesised parts: fold the first into the item text
            x = (g[0], g[1] + g[2], g[3])
        else:
            x = g[:3]
        print(x)
    print("")

C:\junk>mathieu.py
Test #0: 'Update: New item (Household)'
('Update:', 'New item ', '(Household)', None)
('Update:', 'New item ', '(Household)')

Test #1: 'Update: New item (item) (Household)'
('Update:', 'New item ', '(item)', '(Household)')
('Update:', 'New item (item)', '(Household)')

Test #2: 'Update: new item'
('Update:', 'new item', None, None)
('Update:', 'new item', None)

Test #3: 'minimal'
(None, 'minimal', None, None)
(None, 'minimal', None)

Test #4: 'parenthesis (plague) (has) (struck)'
No match

HTH,
John

MRAB · Feb 7, 2009

You need to make the last group optional and also make the middle group
lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.

(?:...) is the non-capturing version of (...).

If you don't make the middle group lazy then it'll capture the rest of
the string and the last group would never match anything!
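
The effect of the lazy quantifier can be checked with a small sketch. The two patterns below are illustrative: they differ only in the middle group, and the non-capturing wrapper is left out because it does not change the groups that are returned. The variable names are made up for this example.

```python
import re

# Both patterns make the last group optional; only the middle group differs.
greedy = re.compile(r'^(Update:)?(.*)(\(.*\))?$')   # greedy middle group
lazy = re.compile(r'^(Update:)?(.*?)(\(.*\))?$')    # lazy middle group

with_parens = "Update: New item (Household)"
without_parens = "Update: new item"

# Greedy: (.*) swallows the rest of the string, so the optional
# last group is left with nothing to match.
greedy_with = greedy.match(with_parens).groups()

# Lazy: (.*?) grows one character at a time, giving the last group
# a chance to match the parenthesised part.
lazy_with = lazy.match(with_parens).groups()
lazy_without = lazy.match(without_parens).groups()

print(greedy_with)    # ('Update:', ' New item (Household)', None)
print(lazy_with)      # ('Update:', ' New item ', '(Household)')
print(lazy_without)   # ('Update:', ' new item', None)
```

Note that the middle group keeps its surrounding whitespace in both cases; calling strip() on it afterwards is one way to tidy that up.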

John Machin · Feb 7, 2009

Not quite true; it actually returns
('Update:', ' New item (item) ', '(Household)')
However ignoring the difference in whitespace, the OP's intention is
clear. Yours returns
('Update:', ' New item ', '(item) (Household)')

You need to make the last group optional and also make the middle group
lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.

Why do you perpetuate the redundant ^ anchor?

(?:...) is the non-capturing version of (...).

Why do you use
(?:(subpattern))?
instead of just plain
(subpattern)?
?

MRAB · Feb 7, 2009

John said:
Not quite true; it actually returns
('Update:', ' New item (item) ', '(Household)')
However ignoring the difference in whitespace, the OP's intention is
clear. Yours returns
('Update:', ' New item ', '(item) (Household)')

The OP said it works OK, which I took to mean that the OP was OK with
the extra whitespace, which can be easily stripped off. Close enough!

Why do you perpetuate the redundant ^ anchor?

The OP didn't say whether search() or match() was being used. With the ^
it doesn't matter.
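
As far as the returned result goes, this can be illustrated with a minimal sketch. The sample strings and variable names are assumptions made for the example; the 'frobozz' pattern is borrowed from the timing runs in this thread. It says nothing about speed, only about which calls find a match.

```python
import re

anchored = re.compile(r'^frobozz')
unanchored = re.compile(r'frobozz')

text = "xx frobozz"

# With the ^ anchor, match() and search() agree: neither finds anything,
# because the pattern can only succeed at the start of the string.
anchored_match = anchored.match(text)
anchored_search = anchored.search(text)

# Without the anchor the two calls differ: match() only tries position 0,
# search() scans forward and finds the word further along.
unanchored_match = unanchored.match(text)
unanchored_search = unanchored.search(text)

# Caveat: under re.MULTILINE, ^ also matches after a newline, so search()
# can succeed on a later line while match() still only tries position 0.
multi = re.compile(r'^frobozz', re.MULTILINE)
multi_match = multi.match("xx\nfrobozz")
multi_search = multi.search("xx\nfrobozz")

print(anchored_match, anchored_search)     # None None
print(unanchored_match is None)            # True
print(unanchored_search.span())            # (3, 10)
print(multi_match is None, multi_search.span())
```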

Why do you use
(?:(subpattern))?
instead of just plain
(subpattern)?
?

Oops, you're right. I was distracted by the \( and \)!

John Machin · Feb 7, 2009

The OP said it works OK, which I took to mean that the OP was OK with
the extra whitespace, which can be easily stripped off. Close enough!

As I said, the whitespace difference [between what the OP said his
regex did and what it actually does] is not the problem. The problem
is that the OP's "works OK" included (item) in the 2nd group, whereas
yours includes (item) in the 3rd group.
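
The difference can be reproduced with the sketch below. The first two patterns are the ones discussed in this thread (the second without its redundant non-capturing wrapper). The third, named variant here, is a hypothetical alternative that nobody in the thread proposed: it restricts the last group to a single parenthesised part and trims whitespace around the middle group. It is untested beyond the sample strings shown.

```python
import re

op_pattern = re.compile(r'^(Update:)?(.*)(\(.*\))$')
lazy_pattern = re.compile(r'^(Update:)?(.*?)(\(.*\))?$')
# Hypothetical variant: last group may not contain nested parentheses,
# so only the final "(...)" part can land in it.
variant = re.compile(r'^(Update:)?\s*(.*?)\s*(\([^()]*\))?$')

text = "Update: New item (item) (Household)"

op_groups = op_pattern.match(text).groups()
lazy_groups = lazy_pattern.match(text).groups()
variant_groups = variant.match(text).groups()
variant_plain = variant.match("Update: new item").groups()

print(op_groups)       # ('Update:', ' New item (item) ', '(Household)')
print(lazy_groups)     # ('Update:', ' New item ', '(item) (Household)')
print(variant_groups)  # ('Update:', 'New item (item)', '(Household)')
print(variant_plain)   # ('Update:', 'new item', None)
```

Unlike a pattern that rejects more than two parenthesised parts, this variant accepts any number of them and puts only the last one in the third group, which may or may not be what is wanted.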

The OP didn't say whether search() or match() was being used. With the ^
it doesn't matter.

It *does* matter. re.search() is suboptimal; after failing at the
first position, it's not smart enough to give up if the pattern has a
front anchor.

[win32, 2.6.1]
C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100000 loops, best of 3: 4.37 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
10000 loops, best of 3: 34.1 usec per loop

Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9
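
The same comparison can be run from a script instead of the command line. This is only a rough sketch: the time_call helper and the loop count are made up for the example, and the numbers it prints depend on the interpreter version and the machine, so no expected figures are given.

```python
import re
import timeit

rx = re.compile('^frobozz')

def time_call(method, txt, number=20000):
    """Return the average time per call, in microseconds."""
    total = timeit.timeit(lambda: method(txt), number=number)
    return total / number * 1e6

for size in (100, 1000):
    txt = 'x' * size
    # Neither call can succeed; only the time taken to fail is of interest.
    assert rx.match(txt) is None and rx.search(txt) is None
    print("len=%4d  match: %.2f usec  search: %.2f usec"
          % (size, time_call(rx.match, txt), time_call(rx.search, txt)))
```

If search() takes noticeably longer on the 1000-character string than on the 100-character one, it is still retrying the anchored pattern at later positions.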

MRAB · Feb 8, 2009

John said:
The OP said it works OK, which I took to mean that the OP was OK with
the extra whitespace, which can be easily stripped off. Close enough!

As I said, the whitespace difference [between what the OP said his
regex did and what it actually does] is not the problem. The problem
is that the OP's "works OK" included (item) in the 2nd group, whereas
yours includes (item) in the 3rd group.

Ugh, right again!

That just shows what happens when I try to post while debugging!

The OP didn't say whether search() or match() was being used. With the ^
it doesn't matter.

It *does* matter. re.search() is suboptimal; after failing at the
first position, it's not smart enough to give up if the pattern has a
front anchor.

On my PC the numbers for Python 2.6 are:

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.02 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.04 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100000 loops, best of 3: 3.69 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
10000 loops, best of 3: 28.6 usec per loop

I'm currently working on the re module and I've fixed that problem:

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.28 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.23 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
1000000 loops, best of 3: 1.21 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
1000000 loops, best of 3: 1.21 usec per loop

Hmm. Needs more tweaking...
