Nothing to repeat

Tom Anderson

Hello everyone, long time no see,

This is probably not a Python problem, but rather a regular expressions
problem.

I want, for the sake of argument, to match strings comprising any number
of occurrences of 'spa', each interspersed by any number of occurrences of
'm'. 'Any number' includes zero, so the whole pattern should match the
empty string.

Here's the conversation Python and i had about it:

Python 2.6.4 (r264:75706, Jun 4 2010, 18:20:16)
[GCC 4.4.4 20100503 (Red Hat 4.4.4-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

What's going on here? Why is there nothing to repeat? Is the problem
having one *'d term inside another?

Now, i could actually rewrite this particular pattern as '(spa|m)*'. But
what i neglected to mention above is that i'm actually generating patterns
from structures of objects (representations of XML DTDs, as it happens),
and as it stands, patterns like this are a possibility.

Any thoughts on what i should do? Do i have to bite the bullet and apply
some cleverness in my pattern generation to avoid situations like this?

Thanks,
tom
 
Ian

I think you want to anchor your list, or anything will match. Perhaps

re.compile('/^(spa(m)+)*$/')

is what you need.

Regards

Ian
 
Martin Gregorie

> Any thoughts on what i should do? Do i have to bite the bullet and apply
> some cleverness in my pattern generation to avoid situations like this?

This sort of works:

import re

p = re.compile("(spam*)*")
with open("test.txt") as f:
    for line in f:
        print("input line: %s" % line.strip())
        # findall returns the text of group 1 for every match, including empty ones
        for m in p.findall(line):
            if m != "":
                print("==> %s" % m)

when I feed it
=======================test.txt===========================
a line with no match
spa should match
spam should match
so should all of spaspamspammspammm
and so should all of spa spam spamm spammm
no match again.
=======================test.txt===========================

it produces:

input line: a line with no match
input line: spa should match
==> spa
input line: spam should match
==> spam
input line: so should all of spaspamspammspammm
==> spammm
input line: and so should all of spa spam spamm spammm
==> spa
==> spam
==> spamm
==> spammm
input line: no match again.

so obviously there's a problem with greedy matching where there are no
separators between adjacent matching strings. I tried non-greedy
matching, e.g. r'(spam*?)*', but this was worse, so I'll be interested to
see how the real regex mavens do it.
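
For what it's worth, the single '==> spammm' on the run-together line looks less like a greediness problem and more like the way captures behave: a capturing group under a '*' keeps only its last iteration, and findall reports the group rather than the whole match. A minimal sketch of the difference, written for Python 3 rather than the 2.6 used above, with the sample line copied from test.txt and the variable names being my own:

```python
import re

line = "so should all of spaspamspammspammm"

# A repeated capturing group keeps only its final iteration, so findall
# reports one item per overall match, not one per 'spam*' token.
grouped = [g for g in re.findall(r"(spam*)*", line) if g]

# Dropping the outer repetition lets findall return every token.
tokens = re.findall(r"spam*", line)

# A non-capturing group with finditer yields each unbroken run as a whole.
runs = [m.group(0) for m in re.finditer(r"(?:spam*)+", line)]

print(grouped)
print(tokens)
print(runs)
```

Which of the three you want depends on whether you care about the individual tokens or about the whole run.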
 
Ian

> I think you want to anchor your list, or anything will match. Perhaps

My bad - this is better

re.compile('^((spa)*(m)*)+$')

search finds match in 'spa', 'spaspaspa', 'spammmspa', '' and 'mmm'

search fails on 'spats', 'mats' and others.
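
A quick way to check those claims, and to see why the anchors matter, is sketched below. It assumes Python 3; the list names are made up for the illustration, and the unanchored variant is only there for contrast.

```python
import re

anchored = re.compile(r"^((spa)*(m)*)+$")

should_match = ["spa", "spaspaspa", "spammmspa", "", "mmm"]
should_fail = ["spats", "mats"]

# Strings from the first list that the anchored pattern accepts.
matched = [s for s in should_match if anchored.search(s)]
# Strings from the second list that it rejects.
rejected = [s for s in should_fail if anchored.search(s) is None]

# Without the anchors, search is satisfied by an empty match at
# position 0, so it "finds" the pattern in any string at all.
unanchored = re.compile(r"((spa)*(m)*)+")
empty_hit = unanchored.search("xyz")

print(matched)
print(rejected)
print(repr(empty_hit.group(0)))
```

Note that this pattern also accepts 'mmm', so it is a reading of the spec in which 'm' may appear without any 'spa'.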
 
Terry Reedy

> Hello everyone, long time no see,
>
> This is probably not a Python problem, but rather a regular expressions
> problem.
>
> I want, for the sake of arguments, to match strings comprising any
> number of occurrences of 'spa', each interspersed by any number of
> occurrences of the 'm'. 'any number' includes zero, so the whole pattern
> should match the empty string.

Are you sure? A pattern that matches the empty string matches every string.

> Here's the conversation Python and i had about it:
>
> Python 2.6.4 (r264:75706, Jun 4 2010, 18:20:16)
> [GCC 4.4.4 20100503 (Red Hat 4.4.4-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.


I believe the precedence rule that * binds tighter than | (not in the doc)
makes this re the same as "(spa|(m)*)*", which gives the same error traceback.
I believe that for this, re compiles first (spa)* and then ((m)*)*, and
the latter gives the same traceback. Either would seem to match strings
of 'm's without any 'spa', which is not your spec.

"((spa|m)*)*" does compile, so it is not the nesting itself.

The doc does not give the formal grammar for Python re's, so it is hard
to pinpoint which informal rule is violated, or if indeed the error is a
bug. Someone else may do better.

> Now, i could actually rewrite this particular pattern as '(spa|m)*'.

That also does not match your spec.

> Any thoughts on what i should do? Do i have to bite the bullet and apply
> some cleverness in my pattern generation to avoid situations like this?

Well, it has to generate legal re's according to the engine you are
using (with whatever bugs and limitations it has).
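
One defensive approach for generated patterns is to test whether the engine accepts a pattern before relying on it, and to fall back to an equivalent form without a star inside a star. The sketch below assumes Python 3. The `compiles` helper is hypothetical, and the `nested` pattern is my assumption about the kind of expression under discussion (the original input line is not shown in the thread). Whether `nested` compiles depends on the Python version, so the code checks rather than assumes.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Assumed example: a starred alternative inside a starred group.
nested = r"(spa|m*)*"
# Describes the same set of strings, with no star inside the star.
flattened = r"(spa|m)*"

samples = ["", "spa", "mmm", "spammmspa", "spats"]

flat_re = re.compile(flattened)
flat_hits = [s for s in samples if flat_re.fullmatch(s)]

# Only exercise the nested form if this engine accepts it.
nested_hits = None
if compiles(nested):
    nested_re = re.compile(nested)
    nested_hits = [s for s in samples if nested_re.fullmatch(s)]

print(flat_hits)
print(nested_hits)
```

In a generator, the same idea could become a rewrite rule: when a starred group contains an alternative that is itself starred, drop the inner star before emitting the pattern.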
 
