Question: Optional Regular Expression Grouping

galyle · Oct 10, 2011

Hi, I've looked through this forum, but I haven't been able to find a
resolution to the problem I'm having (maybe I didn't look hard enough
-- I have to believe this has come up before). The problem is this:
I have a file which has 0, 2, or 3 groups that I'd like to record;
however, in the case of 3 groups, the third group is correctly
captured, but the first two groups get collapsed into just one group.
I'm sure that I'm missing something in the way I've constructed my
regular expression, but I can't figure out what's wrong. Does anyone
have any suggestions?

The demo below showcases the problem I'm having:

import re

valid_line = re.compile('^\[(\S+)\]\[(\S+)\](?:\s+|\[(\S+)\])=|\s+[\d\[\']+.*$')
line1 = "[field1][field2] = blarg"
line2 = " 'a continuation of blarg'"
line3 = "[field1][field2][field3] = blorg"

m = valid_line.match(line1)
print 'Expected: ' + m.group(1) + ', ' + m.group(2)
m = valid_line.match(line2)
print 'Expected: ' + str(m.group(1))
m = valid_line.match(line3)
print 'Uh-oh: ' + m.group(1) + ', ' + m.group(2)
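
For readers on Python 3, here is a minimal sketch that reproduces the collapsed groups on line3. It assumes the same pattern as in the post, written as a raw string; the names `original` and `collapsed` are only illustrative. Because "]" and "[" are themselves non-whitespace, the greedy \S+ can run across bracket boundaries, and the only way this pattern can get past the space before "=" is to let the first group swallow two fields.

```python
import re

# Same pattern as in the post, as a raw string so it runs cleanly on Python 3.
original = re.compile(r"^\[(\S+)\]\[(\S+)\](?:\s+|\[(\S+)\])=|\s+[\d\[']+.*$")

line3 = "[field1][field2][field3] = blorg"

# groups() returns all three captures at once; unused groups come back as None.
collapsed = original.match(line3).groups()
print(collapsed)  # ('field1][field2', 'field3', None)
```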

MRAB · Oct 10, 2011

Instead of "\S" I'd recommend using "[^\]]", or using a lazy repetition
"\S+?".

You'll also need to handle the space before the "=" in line3.

valid_line = re.compile(r'^\[([^\]]+)\]\[([^\]]+)\](?:\s+|\[([^\]]+)\])\s*=|\s+[\d\[\']+.*$')
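
As a quick check, the sketch below runs a pattern of this shape (negated character classes plus \s* before the "=") against the three sample lines from the question. It is written for Python 3, and the `lines` and `results` names are just for illustration.

```python
import re

# [^\]]+ cannot cross a closing bracket, so each capture stays inside
# its own pair of brackets; \s*= tolerates whitespace before the "=".
valid_line = re.compile(
    r"^\[([^\]]+)\]\[([^\]]+)\](?:\s+|\[([^\]]+)\])\s*=|\s+[\d\[']+.*$"
)

lines = [
    "[field1][field2] = blarg",
    " 'a continuation of blarg'",
    "[field1][field2][field3] = blorg",
]

results = [valid_line.match(line).groups() for line in lines]
for line, groups in zip(lines, results):
    print(repr(line), "->", groups)
```

The continuation line still matches through the second alternative, so all three of its groups are None.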

Vlastimil Brom · Oct 10, 2011

Hi,
I believe the space before = is causing problems (or rather, the pattern
is missing it); you also need non-greedy quantifiers (+?) to match as
little as possible, as opposed to the greedy default:

valid_line = re.compile('^\[(\S+?)\]\[(\S+?)\](?:\s+|\[(\S+)\])\s*=|\s+[\d\[\']+.*$')

or you can use a character class that explicitly excludes the closing ], like:

valid_line = re.compile('^\[([^\]]+)\]\[([^\]]+)\](?:\s+|\[([^\]]+)\])\s*=|\s+[\d\[\']+.*$')

hth
vbr

Ian Kelly · Oct 11, 2011

MRAB said:
Instead of "\S" I'd recommend using "[^\]]", or using a lazy repetition
"\S+?".

Preferably the former. The core problem is that the regex matches
ambiguously on the problem string. Lazy repetition doesn't remove
that ambiguity; it merely attempts to make the module prefer the match
that you prefer.
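
To illustrate the ambiguity point, here is a small sketch with two simplified patterns (not the full one from the thread) and a made-up input that has one more bracketed field than the pattern allows for. The lazy version still finds a match by stretching its second group across a bracket boundary, while the negated class simply refuses to match.

```python
import re

# Simplified two-field patterns, used only to show the difference.
lazy = re.compile(r"^\[(\S+?)\]\[(\S+?)\]\s*=")
strict = re.compile(r"^\[([^\]]+)\]\[([^\]]+)\]\s*=")

sample = "[a][b][c] = value"  # hypothetical input with an unexpected third field

# Lazy repetition prefers short matches but will grow when it has to.
lazy_groups = lazy.match(sample).groups()
print(lazy_groups)   # ('a', 'b][c')

# The negated class can never contain "]", so there is no match at all.
strict_match = strict.match(sample)
print(strict_match)  # None
```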

Other notes to the OP: Always use raw strings (r'') when writing
regex patterns, to make sure the backslashes are escape characters in
the pattern rather than in the string literal.
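
A short sketch of why this matters, using \b as the example (it is not part of the OP's pattern; the variable names are illustrative). In a normal string literal "\b" is a backspace character, so the regex engine never sees a word-boundary assertion.

```python
import re

plain = "\bfoo"   # "\b" becomes a single backspace character
raw = r"\bfoo"    # backslash and "b" survive, so re sees a word boundary

print(len(plain), len(raw))  # 4 5

text = "a foo"
plain_hit = re.search(plain, text)  # looks for backspace + "foo"
raw_hit = re.search(raw, text)      # looks for "foo" at a word boundary

print(plain_hit)        # None
print(raw_hit.group())  # foo
```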

The '^foo|bar$' construct you're using is wonky. I think you're
writing this to mean "match if the entire string is either 'foo' or
'bar'". But what that actually matches is "anything that either
starts with 'foo' or ends with 'bar'". The correct way to do the
former would be either '^foo$|^bar$' or '^(?:foo|bar)$'.
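
The difference is easy to see with a few test strings. This is only a sketch with placeholder patterns ('foo' and 'bar' stand in for the two real alternatives), and the sample strings are made up.

```python
import re

# Alternation binds more loosely than the anchors: this is (^foo) OR (bar$).
loose = re.compile(r"^foo|bar$")
# Grouping the alternatives makes both anchors apply to each of them.
strict = re.compile(r"^(?:foo|bar)$")

samples = ["foo", "bar", "foobaz", "xbar"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # ['foo', 'bar', 'foobaz', 'xbar']
print(strict_hits)  # ['foo', 'bar']
```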

galyle · Oct 11, 2011

Thanks, I had a feeling that greedy matching in my expression was
causing the problem. Your suggestion makes sense to me, and works quite
well.
