Pathological regular expression

Discussion in 'Python' started by David Liang, Apr 9, 2009.

  1. David Liang

    David Liang Guest

    Hi all,
    I'm having a weird problem with a regular expression (tested in 2.6
    and 3.0):

    Basically, any of these:
    _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

    followed by for example,
    line = r'~/.[m]ozilla/firefox/*.default/chrome'
    print(_re_comments.sub(r'\1', line))

    ....hangs the interpreter. For reference, if the first command had been
    _re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

    (off by one character z) it works fine, and all the equivalent
    operations work in sed and awk. Am I missing something about Python
    RE's?

    -David
    David Liang, Apr 9, 2009
    #1
    1. Advertising

  2. David Liang

    David Liang Guest

    On Apr 9, 2:56 am, David Liang <> wrote:
    > Hi all,
    > I'm having a weird problem with a regular expression (tested in 2.6
    > and 3.0):
    >
    > Basically, any of these:
    > _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >
    > followed by for example,
    > line = r'~/.[m]ozilla/firefox/*.default/chrome'
    > print(_re_comments.sub(r'\1', line))
    >
    > ...hangs the interpreter. For reference, if the first command had been
    > _re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >
    > (off by one character z) it works fine, and all the equivalent
    > operations work in sed and awk. Am I missing something about Python
    > RE's?
    >
    > -David


    The problem was the redundant +'s; the fixed RE is

    _re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')
    David Liang, Apr 9, 2009
    #2
    1. Advertising

  3. David Liang

    David Liang Guest

    On Apr 9, 2:56 am, David Liang <> wrote:
    > Hi all,
    > I'm having a weird problem with a regular expression (tested in 2.6
    > and 3.0):
    >
    > Basically, any of these:
    > _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >
    > followed by for example,
    > line = r'~/.[m]ozilla/firefox/*.default/chrome'
    > print(_re_comments.sub(r'\1', line))
    >
    > ...hangs the interpreter. For reference, if the first command had been
    > _re_comments = re.compile(r'^(([^z]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >
    > (off by one character z) it works fine, and all the equivalent
    > operations work in sed and awk. Am I missing something about Python
    > RE's?
    >
    > -David


    The problem was the redundant +'s; the fixed RE is

    _re_comments = re.compile(r'^(([^#"\\]|\\.|"([^"\\]|\\.)*")*)#.*')
    David Liang, Apr 9, 2009
    #3
  4. On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:

    > Hi all,
    > I'm having a weird problem with a regular expression (tested in 2.6 and
    > 3.0):
    >
    > Basically, any of these:
    > _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >
    > followed by for example,
    > line = r'~/.[m]ozilla/firefox/*.default/chrome'
    > print(_re_comments.sub(r'\1', line))
    >
    > ...hangs the interpreter.



    I can confirm the first one hangs the interpreter in Python 2.5 as well.
    I haven't tested the other two.

    To my mind, this is a bug in the RE engine. Is there any reason to not
    treat it as a bug?


    --
    Steven
    Steven D'Aprano, Apr 11, 2009
    #4
  5. David Liang

    MRAB Guest

    Steven D'Aprano wrote:
    > On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:
    >
    >> Hi all,
    >> I'm having a weird problem with a regular expression (tested in 2.6 and
    >> 3.0):
    >>
    >> Basically, any of these:
    >> _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >> _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >> _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    >>
    >> followed by for example,
    >> line = r'~/.[m]ozilla/firefox/*.default/chrome'
    >> print(_re_comments.sub(r'\1', line))
    >>
    >> ...hangs the interpreter.

    >
    >
    > I can confirm the first one hangs the interpreter in Python 2.5 as well.
    > I haven't tested the other two.
    >
    > To my mind, this is a bug in the RE engine. Is there any reason to not
    > treat it as a bug?
    >

    It's not a bug but one of those pathological cases where there are a LOT
    of possibilities to try (we're talking about exponential growth here).

    Hmm, maybe the re module needs to let the user set a timeout control
    that raises an exception if it takes longer than n seconds (it can
    already be interrupted by ^C)...
    MRAB, Apr 11, 2009
    #5
  6. David Liang

    John Machin Guest

    On Apr 12, 1:07 am, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:
    > > Hi all,
    > > I'm having a weird problem with a regular expression (tested in 2.6 and
    > > 3.0):

    >
    > > Basically, any of these:
    > > _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > > _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > > _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

    >
    > > followed by for example,
    > > line = r'~/.[m]ozilla/firefox/*.default/chrome'
    > > print(_re_comments.sub(r'\1', line))

    >
    > > ...hangs the interpreter.

    >
    > I can confirm the first one hangs the interpreter in Python 2.5 as well.
    > I haven't tested the other two.
    >
    > To my mind, this is a bug in the RE engine. Is there any reason to not
    > treat it as a bug?


    IMHO it's not a bug -- s/hang/takes a long time to compute/

    Just look at it: 2 + operators and 3 * operators ... It's one of those
    "come back after lunch" REs.
    John Machin, Apr 11, 2009
    #6
  7. David Liang

    Dotan Cohen Guest

    > IMHO it's not a bug -- s/hang/takes a long time to compute/
    >


    ‎That is quite what a hang is, and why the timeout was invented. The
    real bug is that there is no timeout mechanism.

    > Just look at it: 2 + operators and 3 * operators ... It's one of those
    > "come back after lunch" REs.
    >


    Some users would not be able to tell that from the beginning. While a
    timeout override would be necessary in these cases, a timeout should
    be set.


    --
    Dotan Cohen

    http://what-is-what.com
    http://gibberish.co.il
    Dotan Cohen, Apr 11, 2009
    #7
  8. David Liang

    MRAB Guest

    Dotan Cohen wrote:
    >> IMHO it's not a bug -- s/hang/takes a long time to compute/
    >>

    >
    > ‎That is quite what a hang is, and why the timeout was invented. The
    > real bug is that there is no timeout mechanism.
    >

    I wouldn't call it a "hang" because it is actually doing work. If it was
    'stuck' on a certain part of the text and not progressing then it would
    be a hang.

    >> Just look at it: 2 + operators and 3 * operators ... It's one of those
    >> "come back after lunch" REs.
    >>

    >
    > Some users would not be able to tell that from the beginning. While a
    > timeout override would be necessary in these cases, a timeout should
    > be set.
    >
    MRAB, Apr 11, 2009
    #8
  9. David Liang

    Aaron Brady Guest

    On Apr 11, 10:07 am, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Thu, 09 Apr 2009 02:56:00 -0700, David Liang wrote:
    > > Hi all,
    > > I'm having a weird problem with a regular expression (tested in 2.6 and
    > > 3.0):

    >
    > > Basically, any of these:
    > > _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > > _re_comments = re.compile(r'^(([^#]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    > > _re_comments = re.compile(r'^(([^"]+|\\.|"([^"\\]+|\\.)*")*)#.*$')

    >
    > > followed by for example,
    > > line = r'~/.[m]ozilla/firefox/*.default/chrome'
    > > print(_re_comments.sub(r'\1', line))

    >
    > > ...hangs the interpreter.

    >
    > I can confirm the first one hangs the interpreter in Python 2.5 as well.
    > I haven't tested the other two.
    >
    > To my mind, this is a bug in the RE engine. Is there any reason to not
    > treat it as a bug?
    >
    > --
    > Steven


    How long does it take on a short example?
    Aaron Brady, Apr 11, 2009
    #9
  10. On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:

    >> To my mind, this is a bug in the RE engine. Is there any reason to not
    >> treat it as a bug?

    >
    > IMHO it's not a bug -- s/hang/takes a long time to compute/
    >
    > Just look at it: 2 + operators and 3 * operators ... It's one of those
    > "come back after lunch" REs.


    Well, it's been running now for about two and a half hours, that's a
    rather long lunch. And despite MRAB's assertion, it *cannot* be
    interrupted by ctrl-C. That means that to all intents and purposes, the
    interpreter has locked up for the duration of the calculation, which may
    be days or weeks for all I know.



    --
    Steven
    Steven D'Aprano, Apr 11, 2009
    #10
  11. David Liang

    Dotan Cohen Guest

    Dotan Cohen, Apr 11, 2009
    #11
  12. David Liang

    Aaron Brady Guest

    On Apr 11, 12:40 pm, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:
    > >> To my mind, this is a bug in the RE engine. Is there any reason to not
    > >> treat it as a bug?

    >
    > > IMHO it's not a bug -- s/hang/takes a long time to compute/

    >
    > > Just look at it: 2 + operators and 3 * operators ... It's one of those
    > > "come back after lunch" REs.

    >
    > Well, it's been running now for about two and a half hours, that's a
    > rather long lunch. And despite MRAB's assertion, it *cannot* be
    > interrupted by ctrl-C. That means that to all intents and purposes, the
    > interpreter has locked up for the duration of the calculation, which may
    > be days or weeks for all I know.


    While beyond the limits of a long lunch, that would make a nice
    vacation?
    Aaron Brady, Apr 11, 2009
    #12
  13. David Liang

    MRAB Guest

    Steven D'Aprano wrote:
    > On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:
    >
    >>> To my mind, this is a bug in the RE engine. Is there any reason to not
    >>> treat it as a bug?

    >> IMHO it's not a bug -- s/hang/takes a long time to compute/
    >>
    >> Just look at it: 2 + operators and 3 * operators ... It's one of those
    >> "come back after lunch" REs.

    >
    > Well, it's been running now for about two and a half hours, that's a
    > rather long lunch. And despite MRAB's assertion, it *cannot* be
    > interrupted by ctrl-C. That means that to all intents and purposes, the
    > interpreter has locked up for the duration of the calculation, which may
    > be days or weeks for all I know.
    >

    I've just tried all 3 regexes in Python 2.5.2. All could be interrupted
    by ctrl-C.
    MRAB, Apr 11, 2009
    #13
  14. David Liang

    John Machin Guest

    On Apr 12, 3:40 am, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:
    > >> To my mind, this is a bug in the RE engine. Is there any reason to not
    > >> treat it as a bug?

    >
    > > IMHO it's not a bug -- s/hang/takes a long time to compute/

    >
    > > Just look at it: 2 + operators and 3 * operators ... It's one of those
    > > "come back after lunch" REs.

    >
    > Well, it's been running now for about two and a half hours, that's a
    > rather long lunch. And despite MRAB's assertion, it *cannot* be
    > interrupted by ctrl-C. That means that to all intents and purposes, the
    > interpreter has locked up for the duration of the calculation, which may
    > be days or weeks for all I know.


    If you don't know, experiment!

    import re, time
    _re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
    for end in ("# yadda", ""):
    for start in ("", '"#"'):
    for size in xrange(25):
    line = start + "x" * size + end
    print line
    t0 = time.clock()
    result = _re_comments_nc.sub(r"\1", line)
    t1 = time.clock()
    print "%s (%.6f secs)" % (result, t1 - t0)

    Observations:
    1. Runs at light speed in first two cases (genuine comemnt to be
    removed)
    2. Takes approx O(2**N) in the 3rd case (no # in sight)
    3. Takes even longer to produce a *WRONG* result in the 4th case (# in
    a string literal)
    4. The RE assumes that only " can quote a string literal; what about '
    ''' and """?

    Cluesticks:
    1. if "#" not in line:
    dont_need_no_stinking_regexes()
    2. try timing on short pieces of input
    3. Test the RE, get it correct
    John Machin, Apr 12, 2009
    #14
  15. David Liang

    John Machin Guest

    On Apr 12, 9:46 am, John Machin <> wrote:

    >             result = _re_comments_nc.sub(r"\1", line)


    s/_nc// ... that's an artifact of timing it with Non-Capture groups
    (?:blahblah) on the two internal groups that don't need to be
    capturing (results identical, and no perceptible effect on running
    time)
    John Machin, Apr 12, 2009
    #15
  16. On Sat, 11 Apr 2009 16:46:20 -0700, John Machin wrote:

    > On Apr 12, 3:40 am, Steven D'Aprano <st...@REMOVE-THIS-
    > cybersource.com.au> wrote:
    >> On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:
    >> >> To my mind, this is a bug in the RE engine. Is there any reason to
    >> >> not treat it as a bug?

    >>
    >> > IMHO it's not a bug -- s/hang/takes a long time to compute/

    >>
    >> > Just look at it: 2 + operators and 3 * operators ... It's one of
    >> > those "come back after lunch" REs.

    >>
    >> Well, it's been running now for about two and a half hours, that's a
    >> rather long lunch. And despite MRAB's assertion, it *cannot* be
    >> interrupted by ctrl-C. That means that to all intents and purposes, the
    >> interpreter has locked up for the duration of the calculation, which
    >> may be days or weeks for all I know.

    >
    > If you don't know, experiment!


    My original test has now been running for close to ten hours now, and
    still can't be interrupted with ctrl-C. However that's in Python 2.5,
    having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
    that this bug of "regex hangs the interpreter" is not worth reporting, as
    it effects only older versions.

    [...]

    > 3. Test the RE, get it correct


    They're not my regexes. I don't care whether they are correct or not, I'm
    more concerned about them locking up the interpreter.


    --
    Steven
    Steven D'Aprano, Apr 12, 2009
    #16
  17. David Liang

    Aaron Brady Guest

    On Apr 11, 7:31 pm, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    _
    > My original test has now been running for close to ten hours now, and
    > still can't be interrupted with ctrl-C. However that's in Python 2.5,
    > having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
    > that this bug of "regex hangs the interpreter" is not worth reporting, as
    > it effects only older versions.
    >
    > [...]
    >
    > > 3. Test the RE, get it correct

    >
    > They're not my regexes. I don't care whether they are correct or not, I'm
    > more concerned about them locking up the interpreter.
    >
    > --
    > Steven


    I don't believe in second consoles or kill(1)... or telnet. I wonder
    if the hanging is symptomatic of a wider-spread bug, that shorter
    operations would never reveal. ...
    Aaron Brady, Apr 12, 2009
    #17
  18. David Liang

    John Machin Guest

    On Apr 12, 10:31 am, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Sat, 11 Apr 2009 16:46:20 -0700, John Machin wrote:
    > > On Apr 12, 3:40 am, Steven D'Aprano <st...@REMOVE-THIS-
    > > cybersource.com.au> wrote:
    > >> On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:
    > >> >> To my mind, this is a bug in the RE engine. Is there any reason to
    > >> >> not treat it as a bug?

    >
    > >> > IMHO it's not a bug -- s/hang/takes a long time to compute/

    >
    > >> > Just look at it: 2 + operators and 3 * operators ... It's one of
    > >> > those "come back after lunch" REs.

    >
    > >> Well, it's been running now for about two and a half hours, that's a
    > >> rather long lunch. And despite MRAB's assertion, it *cannot* be
    > >> interrupted by ctrl-C. That means that to all intents and purposes, the
    > >> interpreter has locked up for the duration of the calculation, which
    > >> may be days or weeks for all I know.

    >
    > > If you don't know, experiment!

    >
    > My original test has now been running for close to ten hours now, and
    > still can't be interrupted with ctrl-C. However that's in Python 2.5,


    What platform are you running on? It works OK for me back to 2.1 on
    Windows XP SP3:

    Traceback (most recent call last):
    File "weirdre.py", line 34, in ?
    result = _re_comments.sub(r"\1", line)
    File "C:\python21\lib\sre.py", line 164, in _sub
    return _subn(pattern, template, string, count)[0]
    File "C:\python21\lib\sre.py", line 179, in _subn
    m = c.search()
    KeyboardInterrupt


    > having tried it in Python 2.6.2 they can be interrupted, so I'm satisfied
    > that this bug of "regex hangs the interpreter" is not worth reporting, as
    > it effects only older versions.
    >
    > [...]
    >
    > > 3. Test the RE, get it correct

    >
    > They're not my regexes. I don't care whether they are correct or not, I'm
    > more concerned about them locking up the interpreter.


    That's understood -- it was a shotgun reply to the OP and most
    responders; each can take their own pellets ;-)
    John Machin, Apr 12, 2009
    #18
  19. David Liang

    franck Guest

    > To my mind, this is a bug in the RE engine. Is there any reason to not
    > treat it as a bug?


    It's not a bug, it's a feature! ;-)
    Indeed, in order to handle constructs like (?P=name), RE engines have
    to use inefficient algorithms. In my opinion, this is not the problem
    of using a pathological regular expression, but rather a problem of
    the RE engine that is not optimized to use the faster approach when
    possible.

    This is well known problem very well explained on:
    http://swtch.com/~rsc/regexp/regexp1.html

    Cheers,
    Franck
    franck, Apr 17, 2009
    #19
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. VSK
    Replies:
    2
    Views:
    2,269
  2. =?iso-8859-1?B?bW9vcJk=?=

    Matching abitrary expression in a regular expression

    =?iso-8859-1?B?bW9vcJk=?=, Dec 1, 2005, in forum: Java
    Replies:
    8
    Views:
    831
    Alan Moore
    Dec 2, 2005
  3. GIMME
    Replies:
    3
    Views:
    11,921
    vforvikash
    Dec 29, 2008
  4. Just Another Victim of the Ambient Morality

    inject's pathological case...

    Just Another Victim of the Ambient Morality, Mar 30, 2008, in forum: Ruby
    Replies:
    22
    Views:
    215
    Brian Adkins
    Apr 2, 2008
  5. Noman Shapiro
    Replies:
    0
    Views:
    219
    Noman Shapiro
    Jul 17, 2013
Loading...

Share This Page