Question: Optional Regular Expression Grouping

galyle

Hi, I've looked through this forum, but I haven't been able to find a
resolution to the problem I'm having (maybe I didn't look hard enough
-- I have to believe this has come up before). The problem is this:
I have a file which has 0, 2, or 3 groups that I'd like to record;
however, in the case of 3 groups, the third group is correctly
captured, but the first two groups get collapsed into just one group.
I'm sure that I'm missing something in the way I've constructed my
regular expression, but I can't figure out what's wrong. Does anyone
have any suggestions?

The demo below showcases the problem I'm having:

import re

valid_line = re.compile('^\[(\S+)\]\[(\S+)\](?:\s+|\[(\S+)\])=|\s+[\d\[\']+.*$')
line1 = "[field1][field2] = blarg"
line2 = "    'a continuation of blarg'"
line3 = "[field1][field2][field3] = blorg"

m = valid_line.match(line1)
print 'Expected: ' + m.group(1) + ', ' + m.group(2)
m = valid_line.match(line2)
print 'Expected: ' + str(m.group(1))
m = valid_line.match(line3)
print 'Uh-oh: ' + m.group(1) + ', ' + m.group(2)
 
MRAB

Instead of "\S" I'd recommend using "[^\]]", or using a lazy repetition
"\S+?".

You'll also need to handle the space before the "=" in line3.

valid_line = re.compile(r'^\[([^\]]+)\]\[([^\]]+)\](?:\s+|\[([^\]]+)\])\s*=|\s+[\d\[\']+.*$')
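
As a quick check of the negated-class suggestion, here is a minimal sketch in Python 3 syntax (the thread's own code uses Python 2 print statements). It assumes each field is meant to be captured with ([^\]]+); the lines list and the results name are only for illustration.

```python
import re

# Negated character class: a field can never run past its closing bracket.
valid_line = re.compile(
    r"^\[([^\]]+)\]\[([^\]]+)\](?:\s+|\[([^\]]+)\])\s*=|\s+[\d\[']+.*$"
)

# The three sample lines from the question.
lines = [
    "[field1][field2] = blarg",
    "    'a continuation of blarg'",
    "[field1][field2][field3] = blorg",
]

# Collect the captured groups for each line; unused groups come back as None.
results = [valid_line.match(line).groups() for line in lines]
for line, groups in zip(lines, results):
    print(repr(line), "->", groups)
```

With this pattern the three-field line should yield three separate groups, and the continuation line should match with every group set to None.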
 
Vlastimil Brom

Replying to galyle's post of 2011/10/10:

Hi,
I believe the space before = is causing problems (or rather, the pattern
not allowing for it); you also need non-greedy quantifiers +? to match as
little as possible, as opposed to the greedy default:

valid_line = re.compile('^\[(\S+?)\]\[(\S+?)\](?:\s+|\[(\S+)\])\s*=|\s+[\d\[\']+.*$')

or you can use character classes that explicitly exclude the closing ], like:

valid_line = re.compile('^\[([^\]]+)\]\[([^\]]+)\](?:\s+|\[([^\]]+)\])\s*=|\s+[\d\[\']+.*$')

hth
vbr
 
Ian Kelly

MRAB said:
Instead of "\S" I'd recommend using "[^\]]", or using a lazy repetition
"\S+?".

Preferably the former. The core problem is that the regex matches
ambiguously on the problem string. Lazy repetition doesn't remove
that ambiguity; it merely attempts to make the module prefer the match
that you prefer.
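
A small sketch of that ambiguity, in Python 3 syntax. Both patterns below deliberately leave out any allowance for the space before "=", and the names lazy and strict are only for illustration. The lazy version still finds a match by letting group 2 grow across a bracket, while the negated class has no such match available and fails outright.

```python
import re

line3 = "[field1][field2][field3] = blorg"

# Lazy quantifiers, but nothing to absorb the space before "=": a match
# still exists in which group 2 swallows "field2][field3", and the engine
# backtracks into it.
lazy = re.compile(r"^\[(\S+?)\]\[(\S+?)\](?:\s+|\[(\S+)\])=")
lazy_groups = lazy.match(line3).groups()

# Negated class with the same omission: no field may contain "]", so the
# result is a plain failure instead of a wrong capture.
strict = re.compile(r"^\[([^\]]+)\]\[([^\]]+)\](?:\s+|\[([^\]]+)\])=")
strict_match = strict.match(line3)

print(lazy_groups)   # group 2 spans two fields
print(strict_match)  # None
```

A failed match is usually easier to notice and debug than a match with the wrong groups, which is one reason to favour the negated class.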

Other notes to the OP: Always use raw strings (r'') when writing
regex patterns, to make sure the backslashes are escape characters in
the pattern rather than in the string literal.
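
To illustrate the raw-string point, here is a short sketch using \b, which means backspace in an ordinary string literal but a word boundary in a regex. The sample text is made up for the example. The pattern in the question happens to survive without r'' only because sequences like \[ and \S are not string escapes.

```python
import re

# In a normal literal "\b" is the backspace character (one char, 0x08);
# in a raw literal it stays as backslash + "b", which re treats as a
# word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))           # the raw form keeps both backslashes
print(re.search(plain, "a cat sat"))  # looks for literal backspaces
print(re.search(raw, "a cat sat"))    # finds the word "cat"
```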

The '^foo|bar$' construct you're using is wonky. I think you're
writing this to mean "match if the entire string is either 'foo' or
'bar'". But what that actually matches is "anything that either
starts with 'foo' or ends with 'bar'". The correct way to do the
former would be either '^foo$|^bar$' or '^(?:foo|bar)$'.
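
The difference can be seen with a few made-up strings. This sketch uses search() rather than match() so that the 'bar$' branch is free to match at the end of a longer string; the names loose and strict are illustrative.

```python
import re

loose = re.compile(r"^foo|bar$")       # parsed as (^foo) OR (bar$)
strict = re.compile(r"^(?:foo|bar)$")  # whole string is "foo" or "bar"

samples = ["foo", "bar", "foolish", "crowbar"]

# The loose form accepts anything starting with "foo" or ending with "bar".
loose_hits = [s for s in samples if loose.search(s)]
# The grouped form accepts only the two exact strings.
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)
print(strict_hits)
```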
 
galyle

Replying to Vlastimil Brom:

Thanks, I had a feeling that greedy matching in my expression was
causing the problem. Your suggestion makes sense to me, and works quite
well.
 
