Pathological regular expression

D

David Liang

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6
and 3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

....hangs the interpreter. For reference, if the first command had been
_re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

(off by one character z) it works fine, and all the equivalent
operations work in sed and awk. Am I missing something about Python
RE's?

-David
 
D

David Liang

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6
and 3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter. For reference, if the first command had been
_re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

(off by one character z) it works fine, and all the equivalent
operations work in sed and awk. Am I missing something about Python
RE's?

-David

The problem was the redundant +'s; the fixed RE is

_re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')
 
D

David Liang

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6
and 3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter. For reference, if the first command had been
_re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

(off by one character z) it works fine, and all the equivalent
operations work in sed and awk. Am I missing something about Python
RE's?

-David

The problem was the redundant +'s; the fixed RE is

_re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')
 
S

Steven D'Aprano

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter.


I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?
 
M

MRAB

Steven said:
Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter.


I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?
It's not a bug but one of those pathological cases where there are a LOT
of possibilities to try (we're talking about exponential growth here).

Hmm, maybe the re module needs to let the user set a timeout control
that raises an exception if it takes longer than n seconds (it can
already be interrupted by ^C)...
 
J

John Machin

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):
Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))
...hangs the interpreter.

I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?

IMHO it's not a bug -- s/hang/takes a long time to compute/

Just look at it: 2 + operators and 3 * operators ... It's one of those
"come back after lunch" REs.
 
D

Dotan Cohen

IMHO it's not a bug -- s/hang/takes a long time to compute/
‎That is quite what a hang is, and why the timeout was invented. The
real bug is that there is no timeout mechanism.
Just look at it: 2 + operators and 3 * operators ... It's one of those
"come back after lunch" REs.

Some users would not be able to tell that from the beginning. While a
timeout override would be necessary in these cases, a timeout should
be set.
 
M

MRAB

Dotan said:
‎That is quite what a hang is, and why the timeout was invented. The
real bug is that there is no timeout mechanism.
I wouldn't call it a "hang" because it is actually doing work. If it was
'stuck' on a certain part of the text and not progressing then it would
be a hang.
 
A

Aaron Brady

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):
Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))
...hangs the interpreter.

I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?

How long does it take on a short example?
 
S

Steven D'Aprano

IMHO it's not a bug -- s/hang/takes a long time to compute/

Just look at it: 2 + operators and 3 * operators ... It's one of those
"come back after lunch" REs.

Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.
 
A

Aaron Brady

Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.

While beyond the limits of a long lunch, that would make a nice
vacation?
 
M

MRAB

Steven said:
Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.
I've just tried all 3 regexes in Python 2.5.2. All could be interrupted
by ctrl-C.
 
J

John Machin

Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.

If you don't know, experiment!

import re, time
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
for end in ("# yadda", ""):
for start in ("", '"#"'):
for size in xrange(25):
line = start + "x" * size + end
print line
t0 = time.clock()
result = _re_comments_nc.sub(r"\1", line)
t1 = time.clock()
print "%s (%.6f secs)" % (result, t1 - t0)

Observations:
1. Runs at light speed in first two cases (genuine comemnt to be
removed)
2. Takes approx O(2**N) in the 3rd case (no # in sight)
3. Takes even longer to produce a *WRONG* result in the 4th case (# in
a string literal)
4. The RE assumes that only " can quote a string literal; what about '
''' and """?

Cluesticks:
1. if "#" not in line:
dont_need_no_stinking_regexes()
2. try timing on short pieces of input
3. Test the RE, get it correct
 
J

John Machin

            result = _re_comments_nc.sub(r"\1", line)

s/_nc// ... that's an artifact of timing it with Non-Capture groups
(?:blahblah) on the two internal groups that don't need to be
capturing (results identical, and no perceptible effect on running
time)
 
S

Steven D'Aprano

If you don't know, experiment!

My original test has now been running for close to ten hours now, and
still can't be interrupted with ctrl-C. However that's in Python 2.5,
having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
that this bug of "regex hangs the interpreter" is not worth reporting, as
it effects only older versions.

[...]
3. Test the RE, get it correct

They're not my regexes. I don't care whether they are correct or not, I'm
more concerned about them locking up the interpreter.
 
A

Aaron Brady

On Apr 11, 7:31 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:
_
My original test has now been running for close to ten hours now, and
still can't be interrupted with ctrl-C. However that's in Python 2.5,
having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
that this bug of "regex hangs the interpreter" is not worth reporting, as
it effects only older versions.

[...]
3. Test the RE, get it correct

They're not my regexes. I don't care whether they are correct or not, I'm
more concerned about them locking up the interpreter.

I don't believe in second consoles or kill(1)... or telnet. I wonder
if the hanging is symptomatic of a wider-spread bug, that shorter
operations would never reveal. ...
 
J

John Machin

My original test has now been running for close to ten hours now, and
still can't be interrupted with ctrl-C. However that's in Python 2.5,

What platform are you running on? It works OK for me back to 2.1 on
Windows XP SP3:

Traceback (most recent call last):
File "weirdre.py", line 34, in ?
result = _re_comments.sub(r"\1", line)
File "C:\python21\lib\sre.py", line 164, in _sub
return _subn(pattern, template, string, count)[0]
File "C:\python21\lib\sre.py", line 179, in _subn
m = c.search()
KeyboardInterrupt

having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
that this bug of "regex hangs the interpreter" is not worth reporting, as
it effects only older versions.

[...]
3. Test the RE, get it correct

They're not my regexes. I don't care whether they are correct or not, I'm
more concerned about them locking up the interpreter.

That's understood -- it was a shotgun reply to the OP and most
responders; each can take their own pellets ;-)
 
F

franck

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?

It's not a bug, it's a feature! ;-)
Indeed, in order to handle constructs like (?P=name), RE engines have
to use inefficient algorithms. In my opinion, this is not the problem
of using a pathological regular expression, but rather a problem of
the RE engine that is not optimized to use the faster approach when
possible.

This is well known problem very well explained on:
http://swtch.com/~rsc/regexp/regexp1.html

Cheers,
Franck
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top