re bug

Discussion in 'Python' started by Klaus Neuner, Oct 5, 2004.

  1. Klaus Neuner

    Klaus Neuner Guest

    Hello,

    it seems that there is a bug in the re module. Below there is a sample
    program that illustrates it.(Sorry for its length. I was not able to
    write a simpler regex that produced the same effect.)

    The regex rx compiles. It can be used to search the strings str1 and
    str2. Yet, when used with str3, the program will not terminate for
    days.

    I had already had this problem some months ago in another program. I
    had the program run for several days. In the end, the regex search
    terminated and returned the right result. As I could avoid regular
    expressions of the kind that led to the problem, I did avoid them.

    At the time being, I can neither avoid these regular expressions, nor
    can I afford to wait days for the result of my program. And, even if
    the search on str3 would terminate some day, I would still consider
    this behaviour a serious bug. I am searching tons of texts with
    regular expressions. And in tons of texts there will always be
    something of the type of str3. This means that my program just doesn't
    terminate, although it has no (home made) bug.

    I am REALLY reluctant to change my program, because it is very complex
    and it runs fine apart from the regex-search-does-not-terminate
    problem. But, if changing my program was the most efficient way to
    make the problem go away, I would do it. (I cannot afford to wait for
    some future version of the re module that doesn't have the problem.)

    Is there perhaps a way to tell Python something like

    "Try the search for 5 minutes. If you don't find anything, continue."

    Or are there any other solutions?


    Klaus



    #########################################################################

    import re, sys

    rx = re.compile("(?:^| )(?P<ALLES>(?:[^/ ]*/[^/ ]*/(?:cn(?::[^/ #]+)*)
    )*[^/ ]*/[^/ ]*/(?:cn(?::[^: /#]+)*:2(?::[^: /#]+)*:3(?::[^:
    /#]+)*))(?=( |$))")


    str1 = "mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:2:3"

    test1 = rx.search(str1)
    if test1:
    print test1.group("ALLES")

    str2 = "//bos Innit/no/no ?/?/pun:? Mm/mm/adj:4 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:2:3 ././pun:."

    test2 = rx.search(str2)

    if test2:
    print test2.group("ALLES")

    # Part below will not terminate for days

    # str3 = "//bos Innit/no/no ?/?/pun:? Mm/mm/adj:4 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3
    mm/mm/cn:nom:akk:3 mm/mm/cn:nom:akk:3 ././pun:."

    # test3 = rx.search(str3)

    # if test3:
    # print test3.group("ALLES")
     
    Klaus Neuner, Oct 5, 2004
    #1
    1. Advertising

  2. Klaus Neuner <> wrote:
    ...
    > Is there perhaps a way to tell Python something like
    >
    > "Try the search for 5 minutes. If you don't find anything, continue."


    You could spawn a watchdog thread that will raise an exception in the
    main thread 300 seconds from now (unless previously annulled). If that
    doesn't work (because the main thread is deep into some operation which
    doesn't let it see exceptions) you could spawn a separate process for
    the job and kill that process if it hasn't emitted results within 300
    seconds. Not _good_ solutions but maybe better than nothing...


    Alex
     
    Alex Martelli, Oct 5, 2004
    #2
    1. Advertising

  3. Klaus Neuner wrote:

    > it seems that there is a bug in the re module. Below there is a sample
    > program that illustrates it.(Sorry for its length. I was not able to
    > write a simpler regex that produced the same effect.)


    It might help if you told us what this is supposed to do in real terms...

    > Is there perhaps a way to tell Python something like
    >
    > "Try the search for 5 minutes. If you don't find anything, continue."


    Yes. Use signal.alarm(). See:

    http://www.python.org/doc/current/lib/node304.html
    --
    Michael Hoffman
     
    Michael Hoffman, Oct 5, 2004
    #3
  4. Klaus Neuner

    Thomas Rast Guest

    (Klaus Neuner) writes:

    > The regex rx compiles. It can be used to search the strings str1 and
    > str2. Yet, when used with str3, the program will not terminate for
    > days.


    Here's a more readable version of the regexp in question (I assume the
    newlines were introduced by your mail program and not part of the
    original string), which should be equivalent to your form if compiled
    with the re.VERBOSE flag:

    '''
    (?:^|\ )
    (?P<ALLES>
    (?:
    [^/ ]*/[^/ ]*/
    (?: # why a group?
    cn
    (?: :[^/ #]+ )*
    )
    )*
    [^/ ]*/[^/ ]*/
    (?: # why a group?
    cn
    (?: :[^: /#]+ )*
    :2
    (?: :[^: /#]+ )*
    :3
    (?: :[^: /#]+ )*
    )
    )
    (?=(\ |$))
    '''

    Judging from the number of '*' and '+' quantifiers, the long search
    time may be due to excessive backtracking as the regexp engine tries
    to find a match.

    [Now that I'm looking this through for the N-th time over, maybe it's
    not, maybe you really found a bug. But I've already written the part
    below and don't want to waste the time I invested in analysing your
    regexp by not posting it...]

    > At the time being, I can neither avoid these regular expressions, nor
    > can I afford to wait days for the result of my program.


    Your example imposes a very strong structure on a possible match: It
    consists of small "tokens" (for lack of a better word) delimited by
    '/' and ':'. Try simplifying the regexp to something like (re.VERBOSE
    again):

    '''
    (
    [^ /]*/[^ /]*/ # two "tokens" followed by a slash each
    cn
    :)[^ /:]+)* # "tokens" that start with a colon
    )*
    '''

    then .split() the string you're trying to search (this is a far more
    efficient way of expressing your "boundary conditions") and .match()
    against each of the words. If you have a match, check if it fits the
    rest of the requirements, i.e. look for the :2 and :3 tokens etc.

    HTH
    Thomas

    --
    If you want to reply by mail, substitute my first and last name for
    'foo' and 'bar', respectively, and remove '.invalid'.
     
    Thomas Rast, Oct 5, 2004
    #4
  5. > '''
    > (?:^|\ )
    > (?P<ALLES>
    > (?:
    > [^/ ]*/[^/ ]*/
    > (?: # why a group?
    > cn
    > (?: :[^/ #]+ )*

    [...]

    That's not the same regular expression. You must escape whitespaces
    when using re.VERBOSE.

    > Judging from the number of '*' and '+' quantifiers, the long search
    > time may be due to excessive backtracking as the regexp engine tries
    > to find a match.


    The problem is not just the number of repeating qualifiers, but
    the nesting of them. Nesting repeating qualifiers deeply is a good
    way to kill regular expression engines.

    Try this example:

    re.search("a(((.)*c)*d)*e", "abcdf"*20)

    Also, if you're curious enough, try to replace 20 by 10, and
    increase it one at a time.

    Btw, the only reason that the OP's expression didn't got stuck
    in the first two test cases is because the expression matched.

    --
    Gustavo Niemeyer
    http://niemeyer.net
     
    Gustavo Niemeyer, Oct 5, 2004
    #5
  6. Klaus Neuner

    Thomas Rast Guest

    Gustavo Niemeyer <> writes:

    > That's not the same regular expression. You must escape whitespaces
    > when using re.VERBOSE.


    As far as I can see, I did, except in character classes, and quoting
    from the 're' manual:

    ] X
    ] VERBOSE
    ] This flag allows you to write regular expressions that look
    ] nicer. Whitespace within the pattern is ignored, except when in a
    ] character class or preceded by an unescaped backslash [...]

    > The problem is not just the number of repeating qualifiers, but
    > the nesting of them. Nesting repeating qualifiers deeply is a good
    > way to kill regular expression engines.

    [...]

    I actually tried the example stated in 'man perlre', translated to
    python:

    >>> import re
    >>> r=re.compile('((a{0,5}){0,5})*[c]')
    >>> r.match('a'*10)

    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    RuntimeError: maximum recursion limit exceeded

    but apparently there are ways to nest the backtracking without hitting
    recursion limits.

    - Thomas

    --
    If you want to reply by mail, substitute my first and last name for
    'foo' and 'bar', respectively, and remove '.invalid'.
     
    Thomas Rast, Oct 5, 2004
    #6
  7. > > That's not the same regular expression. You must escape whitespaces
    > > when using re.VERBOSE.

    >
    > As far as I can see, I did, except in character classes, and quoting
    > from the 're' manual:


    '''
    (?P<ALLES>
    (?:
    [^/ ]*/[^/ ]*/
    (?: # why a group?
    cn
    (?: :[^/ #]+ )*
    )
    ^
    There's one space here. Your version doesn't expose the issue
    mentioned by the OP, as you may see by uncommenting the third
    test case.

    [...]
    > I actually tried the example stated in 'man perlre', translated to
    > python:

    [...]
    > but apparently there are ways to nest the backtracking without hitting
    > recursion limits.


    This is fixed in Python 2.4:

    >>> import re
    >>> r = re.compile("((a{0,5}){0,5})*[c]")
    >>> r.match("a"*10)
    >>>


    --
    Gustavo Niemeyer
    http://niemeyer.net
     
    Gustavo Niemeyer, Oct 5, 2004
    #7
  8. Klaus Neuner

    Thomas Rast Guest

    Gustavo Niemeyer <> writes:

    > There's one space here. Your version doesn't expose the issue
    > mentioned by the OP, as you may see by uncommenting the third
    > test case.


    For whatever reason, it doesn't show up in the newspost I see.
    There's a newline at that position, but no space, and as I stated I
    assumed the newlines were an artifact (since he used single quotes).
    Maybe the gateway between news and mail dropped it? In any case, I
    stand corrected. Unfortunately that also makes splitting the string a
    moot point.

    - Thomas

    --
    If you want to reply by mail, substitute my first and last name for
    'foo' and 'bar', respectively, and remove '.invalid'.
     
    Thomas Rast, Oct 5, 2004
    #8
  9. Klaus Neuner

    Klaus Neuner Guest

    Klaus Neuner, Oct 6, 2004
    #9
  10. Klaus Neuner

    Klaus Neuner Guest

    Thanks to you all. I will try to answer some of the questions in your
    answers.

    (Michael Hoffman:)

    > It might help if you told us what this is supposed to do in real terms...


    I invented a kind of linguistic metalanguage. My program is an
    interpreter for this language. It translates statements of the
    linguistic metalanguage into cascades of transducers. Therefore, I
    cannot answer questions like the following one:

    (Thomas Rast:)

    > (?: # why a group?


    There is a group, because the translation algorithm has introduced it.
    It is hard to tell why the algorithm has decided to do so. Maybe the
    group doesn't make sense here, but I think it will make sense in some
    similar case. Even if I re-read all of my program now in order to
    re-understand my translation algorithm, it would not make much sense
    to try to explain why I designed it that way and not another. I would
    need dozens of pages for this.

    I have been testing my program now since months and on gigabytes of
    data. It has not failed for a long time. It did never produce
    unwellformed regular expressions. I was so naive to think that it
    would be sufficient to make sure that all regular expressions be
    wellformed. (Althoug I was also concerned with efficiency issues, of
    course.) Then I saw that wellformedness was not sufficient and I was
    afraid that I would have to change my algorithm.

    Luckily, I may safely assume that any pair of a regex and a string
    such that the regex cannot be matched on the string in a few seconds,
    is a case that can be disregarded. Therefore the signal-module
    solution is perfect for me.

    (Gustavo Niemeyer:)

    > Btw, the only reason that the OP's expression didn't got stuck
    > in the first two test cases is because the expression matched.


    I don't really understand what you mean. You can easily change str3
    such that rx matches. Nevertheless, you will have to wait a long time
    to get a result.
     
    Klaus Neuner, Oct 6, 2004
    #10
  11. > > Btw, the only reason that the OP's expression didn't got stuck
    > > in the first two test cases is because the expression matched.

    >
    > I don't really understand what you mean. You can easily change str3
    > such that rx matches. Nevertheless, you will have to wait a long time
    > to get a result.


    If you're waiting a long time, your expression isn't matching. Try to
    use "mm/mm/cn:nom:akk:2:3" as the last term of your third expression.
    It executes in 0.045 seconds here.

    --
    Gustavo Niemeyer
    http://niemeyer.net
     
    Gustavo Niemeyer, Oct 6, 2004
    #11
  12. Klaus Neuner

    Klaus Neuner Guest

    > If you're waiting a long time, your expression isn't matching. Try to
    > use "mm/mm/cn:nom:akk:2:3" as the last term of your third expression.
    > It executes in 0.045 seconds here.


    You are right. Sorry, I thaught I had tried it with
    "mm/mm/cn:nom:akk:2:3" as last term, too.

    Klaus
     
    Klaus Neuner, Oct 7, 2004
    #12
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. John

    Re: BUG? OR NOT A BUG?

    John, Sep 20, 2005, in forum: ASP .Net
    Replies:
    2
    Views:
    587
  2. RedEye
    Replies:
    2
    Views:
    616
    Jason Kester
    Dec 13, 2005
  3. Michel Joly de Lotbiniere

    Bug Parade Bug 4953793

    Michel Joly de Lotbiniere, Nov 30, 2003, in forum: Java
    Replies:
    4
    Views:
    672
    Michel
    Dec 2, 2003
  4. DarkSpy
    Replies:
    4
    Views:
    917
    tom_usenet
    Jun 27, 2003
  5. Steve Holden
    Replies:
    1
    Views:
    422
    Behrang Dadsetan
    Jul 2, 2003
Loading...

Share This Page