Trouble with regular expressions

LaundroMat

Hi,

I'm quite new to regular expressions, and I wonder if anyone here
could help me out.

I'm looking to split strings that ideally look like this: "Update: New
item (Household)" into a group.
This expression works ok: '^(Update:)?(.*)(\(.*\))$' - it returns
("Update", "New item", "(Household)")

Some strings will look like this however: "Update: New item (item)
(Household)". The expression above still does its job, as it returns
("Update", "New item (item)", "(Household)").

However, it does not work when there is no text in parentheses (e.g.
"Update: new item"). How can I get the expression to return a tuple
such as ("Update:", "new item", None)?

Thanks in advance,

Mathieu
 
John Machin

I don't see how it can be done without some post-matching adjustment.
Try this:

C:\junk>type mathieu.py
import re

tests = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
    "minimal",
    "parenthesis (plague) (has) (struck)",
]

# Raw string, so that \s and \( reach the regex engine unchanged
regex = re.compile(r"""
    (Update:)?      # optional prefix
    \s*             # ignore whitespace
    ([^()]*)        # any non-parentheses stuff
    (\([^()]*\))?   # optional (blahblah)
    \s*             # ignore whitespace
    (\([^()]*\))?   # another optional (blahblah)
    $
    """, re.VERBOSE)

for i, test in enumerate(tests):
    print("Test #%d: %r" % (i, test))
    m = regex.match(test)
    if not m:
        print("No match")
    else:
        g = m.groups()
        print(g)
        if g[3] is not None:
            # two parenthesized parts: fold the first into the item text
            x = (g[0], g[1] + g[2], g[3])
        else:
            x = g[:3]
        print(x)
    print("")

C:\junk>mathieu.py
Test #0: 'Update: New item (Household)'
('Update:', 'New item ', '(Household)', None)
('Update:', 'New item ', '(Household)')

Test #1: 'Update: New item (item) (Household)'
('Update:', 'New item ', '(item)', '(Household)')
('Update:', 'New item (item)', '(Household)')

Test #2: 'Update: new item'
('Update:', 'new item', None, None)
('Update:', 'new item', None)

Test #3: 'minimal'
(None, 'minimal', None, None)
(None, 'minimal', None)

Test #4: 'parenthesis (plague) (has) (struck)'
No match

HTH,
John
 
MRAB

You need to make the last group optional and also make the middle group
lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.

(?:...) is the non-capturing version of (...).

If you don't make the middle group lazy then it'll capture the rest of
the string and the last group would never match anything!
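
To see the effect of the lazy quantifier, both variants can be run against the three sample strings from the original post. This is only an illustrative sketch: the names `lazy`, `greedy` and `samples` are made up here, and `match()` is an assumption, since the original post does not say which method was used.

```python
import re

# The pattern as posted, and the same pattern with a greedy middle group.
lazy = re.compile(r'^(Update:)?(.*?)(?:(\(.*\)))?$')
greedy = re.compile(r'^(Update:)?(.*)(?:(\(.*\)))?$')

samples = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
]

for s in samples:
    print(repr(s))
    print("  lazy:  ", lazy.match(s).groups())
    print("  greedy:", greedy.match(s).groups())
```

With the greedy middle group the last group comes back as None for every sample, because `.*` has already consumed the parentheses. With the lazy version the last group starts at the first opening parenthesis it finds.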
 
John Machin

LaundroMat said:
> The expression above still does its job, as it returns
> ("Update", "New item (item)", "(Household)").

Not quite true; it actually returns
('Update:', ' New item (item) ', '(Household)')
However, ignoring the difference in whitespace, the OP's intention is
clear. Yours returns
('Update:', ' New item ', '(item) (Household)')

MRAB said:
> You need to make the last group optional and also make the middle group
> lazy: r'^(Update:)?(.*?)(?:(\(.*\)))?$'.

Why do you perpetuate the redundant ^ anchor?

> (?:...) is the non-capturing version of (...).

Why do you use
(?:(subpattern))?
instead of just plain
(subpattern)?
?
 
MRAB

John said:
> Not quite true; it actually returns
> ('Update:', ' New item (item) ', '(Household)')
> However ignoring the difference in whitespace, the OP's intention is
> clear. Yours returns
> ('Update:', ' New item ', '(item) (Household)')

The OP said it works OK, which I took to mean that the OP was OK with
the extra whitespace, which can be easily stripped off. Close enough!

> Why do you perpetuate the redundant ^ anchor?

The OP didn't say whether search() or match() was being used. With the ^
it doesn't matter.

> Why do you use
> (?:(subpattern))?
> instead of just plain
> (subpattern)?
> ?

Oops, you're right. I was distracted by the \( and \)! :)
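
The equivalence of the two spellings is easy to check. A small sketch follows; the names `wrapped`, `plain` and `samples` are invented for illustration, and the sample strings are the ones from the original post.

```python
import re

# (?:(subpattern))? versus plain (subpattern)? in the otherwise same pattern
wrapped = re.compile(r'^(Update:)?(.*?)(?:(\(.*\)))?$')
plain = re.compile(r'^(Update:)?(.*?)(\(.*\))?$')

samples = [
    "Update: New item (Household)",
    "Update: New item (item) (Household)",
    "Update: new item",
]

for s in samples:
    # The non-capturing wrapper adds no group and changes no result.
    assert wrapped.match(s).groups() == plain.match(s).groups()

# Both patterns have the same number of capturing groups.
print(wrapped.groups, plain.groups)
```

The `(?:...)` wrapper only matters when several items need to be grouped without being captured; around a single capturing group it does nothing.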
 
John Machin

MRAB said:
> The OP said it works OK, which I took to mean that the OP was OK with
> the extra whitespace, which can be easily stripped off. Close enough!

As I said, the whitespace difference [between what the OP said his
regex did and what it actually does] is not the problem. The problem
is that the OP's "works OK" included (item) in the 2nd group, whereas
yours includes (item) in the 3rd group.
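
For comparison, here is one way the split described above (everything up to the last parenthesized part in the 2nd group, the last parenthesized part alone in the 3rd) might be done in a single pattern. This is a sketch that was not posted in the thread: `pattern` and `split_title` are hypothetical names, and it assumes the trailing part, when present, ends the string and contains no nested parentheses.

```python
import re

# Hypothetical alternative, not from the thread: a lazy middle group plus a
# last group that cannot itself contain parentheses, anchored at the end.
pattern = re.compile(r'^(Update:)?\s*(.*?)\s*(\([^()]*\))?$')

def split_title(text):
    """Return (prefix, item, last_parenthesized_part), or None if no match."""
    m = pattern.match(text)
    return m.groups() if m else None

for s in ("Update: New item (Household)",
          "Update: New item (item) (Household)",
          "Update: new item"):
    print(split_title(s))
```

Because the last group excludes parentheses and must be followed by the end of the string, the lazy middle group is forced to grow until only the final parenthesized part is left.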

> The OP didn't say whether search() or match() was being used. With the ^
> it doesn't matter.

It *does* matter. re.search() is suboptimal; after failing at the
first position, it's not smart enough to give up if the pattern has a
front anchor.

[win32, 2.6.1]
C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.17 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100000 loops, best of 3: 4.37 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
10000 loops, best of 3: 34.1 usec per loop

Corresponding figures for 3.0 are 1.02, 1.02, 3.99, and 32.9
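
The same comparison can be scripted with the standard timeit module instead of the command line. A minimal sketch: `time_call` is a helper name invented here, the lambda adds a little call overhead to every measurement, and the figures depend on the Python version and machine, so none are shown.

```python
import re
import timeit

rx = re.compile('^frobozz')

def time_call(method, txt, number=100000):
    """Return the average time in seconds of one method(txt) call."""
    return timeit.timeit(lambda: method(txt), number=number) / number

for size in (100, 1000):
    txt = size * 'x'
    # Both calls fail to match; only the time taken may differ.
    assert rx.match(txt) is None
    assert rx.search(txt) is None
    print("len=%4d  match: %.2e s  search: %.2e s"
          % (size, time_call(rx.match, txt), time_call(rx.search, txt)))
```

If search() takes longer as the text grows while match() stays flat, the engine is still scanning forward despite the front anchor.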
 
MRAB

John said:
>> The OP said it works OK, which I took to mean that the OP was OK with
>> the extra whitespace, which can be easily stripped off. Close enough!
>
> As I said, the whitespace difference [between what the OP said his
> regex did and what it actually does] is not the problem. The problem
> is that the OP's "works OK" included (item) in the 2nd group, whereas
> yours includes (item) in the 3rd group.

Ugh, right again!

That just shows what happens when I try to post while debugging! :)

>> The OP didn't say whether search() or match() was being used. With the ^
>> it doesn't matter.
>
> It *does* matter. re.search() is suboptimal; after failing at the
> first position, it's not smart enough to give up if the pattern has a
> front anchor.

On my PC the numbers for Python 2.6 are:

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.02 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.04 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
100000 loops, best of 3: 3.69 usec per loop

C:\Python26>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
10000 loops, best of 3: 28.6 usec per loop

I'm currently working on the re module and I've fixed that problem:

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.28 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.match(txt)"
1000000 loops, best of 3: 1.23 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=100*'x'" "assert not rx.search(txt)"
1000000 loops, best of 3: 1.21 usec per loop

C:\Python27>python -mtimeit -s"import re;rx=re.compile('^frobozz');txt=1000*'x'" "assert not rx.search(txt)"
1000000 loops, best of 3: 1.21 usec per loop

Hmm. Needs more tweaking...
 
