re bug

K

Klaus Neuner

Hello,

it seems that there is a bug in the re module. Below there is a sample
program that illustrates it.(Sorry for its length. I was not able to
write a simpler regex that produced the same effect.)

The regex rx compiles. It can be used to search the strings str1 and
str2. Yet, when used with str3, the program will not terminate for
days.

I had already had this problem some months ago in another program. I
had the program run for several days. In the end, the regex search
terminated and returned the right result. As I could avoid regular
expressions of the kind that led to the problem, I did avoid them.

At the time being, I can neither avoid these regular expressions, nor
can I afford to wait days for the result of my program. And, even if
the search on str3 would terminate some day, I would still consider
this behaviour a serious bug. I am searching tons of texts with
regular expressions. And in tons of texts there will always be
something of the type of str3. This means that my program just doesn't
terminate, although it has no (home made) bug.

I am REALLY reluctant to change my program, because it is very complex
and it runs fine apart from the regex-search-does-not-terminate
problem. But, if changing my program was the most efficient way to
make the problem go away, I would do it. (I cannot afford to wait for
some future version of the re module that doesn't have the problem.)

Is there perhaps a way to tell Python something like

"Try the search for 5 minutes. If you don't find anything, continue."

Or are there any other solutions?


Klaus



#########################################################################

import re, sys

rx = re.compile("(?:^| )(?P<ALLES>(?:[^/ ]*/[^/ ]*/(?:cn(?::[^/ #]+)*)
)*[^/ ]*/[^/ ]*/(?:cn(?::[^: /#]+)*:2(?::[^: /#]+)*:3(?::[^:
/#]+)*))(?=( |$))")


str1 = "mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:2:3"

test1 = rx.search(str1)
if test1:
print test1.group("ALLES")

str2 = "//bos Innit/no/no ?/?/pun:? Mm/mm/adj:4 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:2:3 ././pun:."

test2 = rx.search(str2)

if test2:
print test2.group("ALLES")

# Part below will not terminate for days

# str3 = "//bos Innit/no/no ?/?/pun:? Mm/mm/adj:4 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 ././pun:."

# test3 = rx.search(str3)

# if test3:
# print test3.group("ALLES")
 
A

Alex Martelli

Klaus Neuner said:
Is there perhaps a way to tell Python something like

"Try the search for 5 minutes. If you don't find anything, continue."

You could spawn a watchdog thread that will raise an exception in the
main thread 300 seconds from now (unless previously annulled). If that
doesn't work (because the main thread is deep into some operation which
doesn't let it see exceptions) you could spawn a separate process for
the job and kill that process if it hasn't emitted results within 300
seconds. Not _good_ solutions but maybe better than nothing...


Alex
 
M

Michael Hoffman

Klaus said:
it seems that there is a bug in the re module. Below there is a sample
program that illustrates it.(Sorry for its length. I was not able to
write a simpler regex that produced the same effect.)

It might help if you told us what this is supposed to do in real terms...
Is there perhaps a way to tell Python something like

"Try the search for 5 minutes. If you don't find anything, continue."

Yes. Use signal.alarm(). See:

http://www.python.org/doc/current/lib/node304.html
 
T

Thomas Rast

The regex rx compiles. It can be used to search the strings str1 and
str2. Yet, when used with str3, the program will not terminate for
days.

Here's a more readable version of the regexp in question (I assume the
newlines were introduced by your mail program and not part of the
original string), which should be equivalent to your form if compiled
with the re.VERBOSE flag:

'''
(?:^|\ )
(?P<ALLES>
(?:
[^/ ]*/[^/ ]*/
(?: # why a group?
cn
(?: :[^/ #]+ )*
)
)*
[^/ ]*/[^/ ]*/
(?: # why a group?
cn
(?: :[^: /#]+ )*
:2
(?: :[^: /#]+ )*
:3
(?: :[^: /#]+ )*
)
)
(?=(\ |$))
'''

Judging from the number of '*' and '+' quantifiers, the long search
time may be due to excessive backtracking as the regexp engine tries
to find a match.

[Now that I'm looking this through for the N-th time over, maybe it's
not, maybe you really found a bug. But I've already written the part
below and don't want to waste the time I invested in analysing your
regexp by not posting it...]
At the time being, I can neither avoid these regular expressions, nor
can I afford to wait days for the result of my program.

Your example imposes a very strong structure on a possible match: It
consists of small "tokens" (for lack of a better word) delimited by
'/' and ':'. Try simplifying the regexp to something like (re.VERBOSE
again):

'''
(
[^ /]*/[^ /]*/ # two "tokens" followed by a slash each
cn
:)[^ /:]+)* # "tokens" that start with a colon
)*
'''

then .split() the string you're trying to search (this is a far more
efficient way of expressing your "boundary conditions") and .match()
against each of the words. If you have a match, check if it fits the
rest of the requirements, i.e. look for the :2 and :3 tokens etc.

HTH
Thomas
 
G

Gustavo Niemeyer

'''
(?:^|\ )
(?P<ALLES>
(?:
[^/ ]*/[^/ ]*/
(?: # why a group?
cn
(?: :[^/ #]+ )*
[...]

That's not the same regular expression. You must escape whitespaces
when using re.VERBOSE.
Judging from the number of '*' and '+' quantifiers, the long search
time may be due to excessive backtracking as the regexp engine tries
to find a match.

The problem is not just the number of repeating qualifiers, but
the nesting of them. Nesting repeating qualifiers deeply is a good
way to kill regular expression engines.

Try this example:

re.search("a(((.)*c)*d)*e", "abcdf"*20)

Also, if you're curious enough, try to replace 20 by 10, and
increase it one at a time.

Btw, the only reason that the OP's expression didn't got stuck
in the first two test cases is because the expression matched.
 
T

Thomas Rast

Gustavo Niemeyer said:
That's not the same regular expression. You must escape whitespaces
when using re.VERBOSE.

As far as I can see, I did, except in character classes, and quoting
from the 're' manual:

] X
] VERBOSE
] This flag allows you to write regular expressions that look
] nicer. Whitespace within the pattern is ignored, except when in a
] character class or preceded by an unescaped backslash [...]
The problem is not just the number of repeating qualifiers, but
the nesting of them. Nesting repeating qualifiers deeply is a good
way to kill regular expression engines.
[...]

I actually tried the example stated in 'man perlre', translated to
python:
import re
r=re.compile('((a{0,5}){0,5})*[c]')
r.match('a'*10)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
RuntimeError: maximum recursion limit exceeded

but apparently there are ways to nest the backtracking without hitting
recursion limits.

- Thomas
 
G

Gustavo Niemeyer

That's not the same regular expression. You must escape whitespaces
As far as I can see, I did, except in character classes, and quoting
from the 're' manual:

'''
(?P<ALLES>
(?:
[^/ ]*/[^/ ]*/
(?: # why a group?
cn
(?: :[^/ #]+ )*
)
^
There's one space here. Your version doesn't expose the issue
mentioned by the OP, as you may see by uncommenting the third
test case.

[...]
I actually tried the example stated in 'man perlre', translated to
python: [...]
but apparently there are ways to nest the backtracking without hitting
recursion limits.

This is fixed in Python 2.4:
import re
r = re.compile("((a{0,5}){0,5})*[c]")
r.match("a"*10)
 
T

Thomas Rast

Gustavo Niemeyer said:
There's one space here. Your version doesn't expose the issue
mentioned by the OP, as you may see by uncommenting the third
test case.

For whatever reason, it doesn't show up in the newspost I see.
There's a newline at that position, but no space, and as I stated I
assumed the newlines were an artifact (since he used single quotes).
Maybe the gateway between news and mail dropped it? In any case, I
stand corrected. Unfortunately that also makes splitting the string a
moot point.

- Thomas
 
K

Klaus Neuner

Thanks to you all. I will try to answer some of the questions in your
answers.

(Michael Hoffman:)
It might help if you told us what this is supposed to do in real terms...

I invented a kind of linguistic metalanguage. My program is an
interpreter for this language. It translates statements of the
linguistic metalanguage into cascades of transducers. Therefore, I
cannot answer questions like the following one:

(Thomas Rast:)
(?: # why a group?

There is a group, because the translation algorithm has introduced it.
It is hard to tell why the algorithm has decided to do so. Maybe the
group doesn't make sense here, but I think it will make sense in some
similar case. Even if I re-read all of my program now in order to
re-understand my translation algorithm, it would not make much sense
to try to explain why I designed it that way and not another. I would
need dozens of pages for this.

I have been testing my program now since months and on gigabytes of
data. It has not failed for a long time. It did never produce
unwellformed regular expressions. I was so naive to think that it
would be sufficient to make sure that all regular expressions be
wellformed. (Althoug I was also concerned with efficiency issues, of
course.) Then I saw that wellformedness was not sufficient and I was
afraid that I would have to change my algorithm.

Luckily, I may safely assume that any pair of a regex and a string
such that the regex cannot be matched on the string in a few seconds,
is a case that can be disregarded. Therefore the signal-module
solution is perfect for me.

(Gustavo Niemeyer:)
Btw, the only reason that the OP's expression didn't got stuck
in the first two test cases is because the expression matched.

I don't really understand what you mean. You can easily change str3
such that rx matches. Nevertheless, you will have to wait a long time
to get a result.
 
G

Gustavo Niemeyer

Btw, the only reason that the OP's expression didn't got stuck
I don't really understand what you mean. You can easily change str3
such that rx matches. Nevertheless, you will have to wait a long time
to get a result.

If you're waiting a long time, your expression isn't matching. Try to
use "mm/mm/cn:nom:akk:2:3" as the last term of your third expression.
It executes in 0.045 seconds here.
 
K

Klaus Neuner

If you're waiting a long time, your expression isn't matching. Try to
use "mm/mm/cn:nom:akk:2:3" as the last term of your third expression.
It executes in 0.045 seconds here.

You are right. Sorry, I thaught I had tried it with
"mm/mm/cn:nom:akk:2:3" as last term, too.

Klaus
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top