Pathological regular expression

David Liang · Apr 9, 2009

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6
and 3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

....hangs the interpreter. For reference, if the first command had been
_re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

(off by one character z) it works fine, and all the equivalent
operations work in sed and awk. Am I missing something about Python
RE's?

-David

David Liang · Apr 9, 2009

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6
and 3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter. For reference, if the first command had been
_re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

(off by one character z) it works fine, and all the equivalent
operations work in sed and awk. Am I missing something about Python
RE's?

-David

The problem was the redundant +'s; the fixed RE is

_re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')

David Liang · Apr 9, 2009

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6
and 3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter. For reference, if the first command had been
_re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

(off by one character z) it works fine, and all the equivalent
operations work in sed and awk. Am I missing something about Python
RE's?

-David

The problem was the redundant +'s; the fixed RE is

_re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')

Steven D'Aprano · Apr 11, 2009

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter.

I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?

MRAB · Apr 11, 2009

Steven said:
Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

...hangs the interpreter.

Click to expand...

I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?

It's not a bug but one of those pathological cases where there are a LOT
of possibilities to try (we're talking about exponential growth here).

Hmm, maybe the re module needs to let the user set a timeout control
that raises an exception if it takes longer than n seconds (it can
already be interrupted by ^C)...

John Machin · Apr 11, 2009

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):

Click to expand...

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

Click to expand...

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

Click to expand...

...hangs the interpreter.

Click to expand...

I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?

IMHO it's not a bug -- s/hang/takes a long time to compute/

Just look at it: 2 + operators and 3 * operators ... It's one of those
"come back after lunch" REs.

Dotan Cohen · Apr 11, 2009

IMHO it's not a bug -- s/hang/takes a long time to compute/
â€ŽThat is quite what a hang is, and why the timeout was invented. The
real bug is that there is no timeout mechanism.

Just look at it: 2 + operators and 3 * operators ... It's one of those
"come back after lunch" REs.

Some users would not be able to tell that from the beginning. While a
timeout override would be necessary in these cases, a timeout should
be set.

MRAB · Apr 11, 2009

Dotan said:
â€ŽThat is quite what a hang is, and why the timeout was invented. The
real bug is that there is no timeout mechanism.

I wouldn't call it a "hang" because it is actually doing work. If it was
'stuck' on a certain part of the text and not progressing then it would
be a hang.

Aaron Brady · Apr 11, 2009

Hi all,
I'm having a weird problem with a regular expression (tested in 2.6 and
3.0):

Click to expand...

Basically, any of these:
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
_re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

Click to expand...

followed by for example,
line = r'~/.[m]ozilla/firefox/*.default/chrome'
print(_re_comments.sub(r'\1', line))

Click to expand...

...hangs the interpreter.

Click to expand...

I can confirm the first one hangs the interpreter in Python 2.5 as well.
I haven't tested the other two.

To my mind, this is a bug in the RE engine. Is there any reason to not
treat it as a bug?

How long does it take on a short example?

Steven D'Aprano · Apr 11, 2009

IMHO it's not a bug -- s/hang/takes a long time to compute/

Just look at it: 2 + operators and 3 * operators ... It's one of those
"come back after lunch" REs.

Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.

Dotan Cohen · Apr 11, 2009

Well, it's been running now for about two and a half hours, that's a

rather long lunch.

I'd also like a pony!

Aaron Brady · Apr 11, 2009

Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.

While beyond the limits of a long lunch, that would make a nice
vacation?

MRAB · Apr 11, 2009

Steven said:
Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.

I've just tried all 3 regexes in Python 2.5.2. All could be interrupted
by ctrl-C.

John Machin · Apr 12, 2009

Well, it's been running now for about two and a half hours, that's a
rather long lunch. And despite MRAB's assertion, it *cannot* be
interrupted by ctrl-C. That means that to all intents and purposes, the
interpreter has locked up for the duration of the calculation, which may
be days or weeks for all I know.

If you don't know, experiment!

import re, time
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
for end in ("# yadda", ""):
for start in ("", '"#"'):
for size in xrange(25):
line = start + "x" * size + end
print line
t0 = time.clock()
result = _re_comments_nc.sub(r"\1", line)
t1 = time.clock()
print "%s (%.6f secs)" % (result, t1 - t0)

Observations:
1. Runs at light speed in first two cases (genuine comemnt to be
removed)
2. Takes approx O(2**N) in the 3rd case (no # in sight)
3. Takes even longer to produce a *WRONG* result in the 4th case (# in
a string literal)
4. The RE assumes that only " can quote a string literal; what about '
''' and """?

Cluesticks:
1. if "#" not in line:
dont_need_no_stinking_regexes()
2. try timing on short pieces of input
3. Test the RE, get it correct

John Machin · Apr 12, 2009

result = _re_comments_nc.sub(r"\1", line)

s/_nc// ... that's an artifact of timing it with Non-Capture groups
(?:blahblah) on the two internal groups that don't need to be
capturing (results identical, and no perceptible effect on running
time)

Steven D'Aprano · Apr 12, 2009

If you don't know, experiment!

My original test has now been running for close to ten hours now, and
still can't be interrupted with ctrl-C. However that's in Python 2.5,
having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
that this bug of "regex hangs the interpreter" is not worth reporting, as
it effects only older versions.

[...]

3. Test the RE, get it correct

They're not my regexes. I don't care whether they are correct or not, I'm
more concerned about them locking up the interpreter.

Aaron Brady · Apr 12, 2009

On Apr 11, 7:31 pm, Steven D'Aprano <st...@REMOVE-THIS-
cybersource.com.au> wrote:
_

My original test has now been running for close to ten hours now, and
still can't be interrupted with ctrl-C. However that's in Python 2.5,
having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
that this bug of "regex hangs the interpreter" is not worth reporting, as
it effects only older versions.

[...]

3. Test the RE, get it correct

Click to expand...

They're not my regexes. I don't care whether they are correct or not, I'm
more concerned about them locking up the interpreter.

I don't believe in second consoles or kill(1)... or telnet. I wonder
if the hanging is symptomatic of a wider-spread bug, that shorter
operations would never reveal. ...

John Machin · Apr 12, 2009

My original test has now been running for close to ten hours now, and
still can't be interrupted with ctrl-C. However that's in Python 2.5,

What platform are you running on? It works OK for me back to 2.1 on
Windows XP SP3:

Traceback (most recent call last):
File "weirdre.py", line 34, in ?
result = _re_comments.sub(r"\1", line)
File "C:\python21\lib\sre.py", line 164, in _sub
return _subn(pattern, template, string, count)[0]
File "C:\python21\lib\sre.py", line 179, in _subn
m = c.search()
KeyboardInterrupt

having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
that this bug of "regex hangs the interpreter" is not worth reporting, as
it effects only older versions.

[...]

3. Test the RE, get it correct

Click to expand...

They're not my regexes. I don't care whether they are correct or not, I'm
more concerned about them locking up the interpreter.

That's understood -- it was a shotgun reply to the OP and most
responders; each can take their own pellets ;-)

franck · Apr 17, 2009

To my mind, this is a bug in the RE engine. Is there any reason to not

treat it as a bug?

It's not a bug, it's a feature! ;-)
Indeed, in order to handle constructs like (?P=name), RE engines have
to use inefficient algorithms. In my opinion, this is not the problem
of using a pathological regular expression, but rather a problem of
the RE engine that is not optimized to use the faster approach when
possible.

This is well known problem very well explained on:
http://swtch.com/~rsc/regexp/regexp1.html

Cheers,
Franck

Repeating assertions in regular expression	3	Jan 3, 2012
Help with regular expression in python	1	Aug 18, 2011
Regular Expression for Finding and Deleting comments	1	Jan 4, 2011
Regular expression to match a #	19	Aug 11, 2005
Regular Expression question	12	Aug 21, 2006
regular expression for nested parentheses	5	Dec 9, 2007
regular expression extracting groups	3	Aug 10, 2008
problem with regular expression?	0	May 14, 2004

Pathological regular expression

David Liang

David Liang

David Liang

Steven D'Aprano

MRAB

John Machin

Dotan Cohen

MRAB

Aaron Brady

Steven D'Aprano

Dotan Cohen

Aaron Brady

MRAB

John Machin

John Machin

Steven D'Aprano

Aaron Brady

John Machin

franck

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads