unexpected regexp behaviour using 'A|B|C.....'


AlienBaby

When using re patterns of the form 'A|B|C|...', the docs seem to
suggest that once any of A, B, C, ... matches, it is captured and no
further patterns are tried. But I am seeing:

st=' Id Name Prov Type CopyOf BsId Rd -Detailed_State- Adm Snp Usr VSize'

p='Type *'
re.search(p,st).group()
'Type '

p='Type *| *Type'
re.search(p,st).group()
' Type'


Shouldn’t the second search return the same as the first, if further
patterns are not tried?

The documentation appears to suggest the first match should be
returned, or am I misunderstanding?

'|'
A|B, where A and B can be arbitrary REs, creates a regular expression
that will match either A or B. An arbitrary number of REs can be
separated by the '|' in this way. This can be used inside groups (see
below) as well. As the target string is scanned, REs separated by '|'
are tried from left to right.

When one pattern completely matches, that branch is accepted. This
means that once A matches, B will not be tested further, even if it
would produce a longer overall match.

In other words, the '|' operator is never greedy. To match a literal
'|', use \|, or enclose it inside a character class, as in [|].
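
As a minimal sketch of what that passage describes, here are two made-up patterns (not the ones from the question) in which both alternatives can match at the same starting position. The left alternative is accepted even though the right one would give a longer match:

```python
import re

# Illustrative patterns only; they are not taken from the post above.
# Both alternatives can match at position 0 of 'ab'. The left one is
# accepted, even though the right one would produce a longer match.
short_first = re.search(r'a|ab', 'ab')
print(short_first.group())   # a

# Swapping the order lets the longer alternative win at the same position.
long_first = re.search(r'ab|a', 'ab')
print(long_first.group())    # ab
```

Note that this only concerns alternatives competing at one and the same starting position in the string.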
 
Peter Otten

All alternatives are tried at a given starting position in the string before
the algorithm advances to the next position. The second alternative,
" *Type" (spaces followed by the character sequence "Type"), matches
starting right after "Prov" in your example. The search stops at that first
match, so the first alternative, "Type" and any following spaces, which
could only match one position later, after "Prov ", never gets tried there.

Maybe you accidentally typed one extra " "? If you didn't, " +Type" would be
clearer.
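
A small sketch to make the position rule visible, assuming the header string from the question is joined onto one line with single spaces and the patterns are used exactly as they appear in the post. The variable names are only for illustration:

```python
import re

# Header line from the question, joined onto one line (single spaces assumed).
st = ' Id Name Prov Type CopyOf BsId Rd -Detailed_State- Adm Snp Usr VSize'

first = re.search(r'Type *', st)
both = re.search(r'Type *| *Type', st)

# The combined pattern matches one character earlier, at the space after
# "Prov". At that position 'Type *' fails and ' *Type' succeeds.
print(first.start(), repr(first.group()))
print(both.start(), repr(both.group()))

# Reordering the alternatives does not change the result here, because
# only one of them can match at the earliest matching position.
swapped = re.search(r' *Type|Type *', st)
print(swapped.start(), repr(swapped.group()))
```

The earliest starting position at which any alternative matches decides the result. The left-to-right order only matters when more than one alternative can match at that same position.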
 
