Re: Cancel or timeout a long running regular expression

Discussion in 'Python' started by Terry Reedy, Sep 15, 2011.

  1. Terry Reedy

    Terry Reedy Guest

    On 9/15/2011 1:19 AM, wrote:
    > Is there a way to cancel or timeout a long running regular expression?
    > I have a program that accepts regular expressions from users and I'm
    > concerned about how to handle worst case regular expressions that seem
    > to run forever. Ideally I'm looking for a way to evaluate a regular
    > expression and timeout after a specified time period if the regular
    > expression hasn't completed yet. Or a way for a user to cancel a long
    > running regular expression.


    This is a general problem when evaluating *any* expression from the
    outside. [0]*10000*10000 will eat space as well as time. At least, as
    far as I know, an re cannot cause a disk reformat ;-).

    There have been previous discussions on this generally topic.

    > I was thinking there might be a technique I could use to evaluate
    > regular expressions in a thread or another process launched via
    > multiprocessing module and then kill the thread/process after a
    > specified timeout period.


    Only solution I remember ever seen posted. I wonder if there are any
    heuristics for detecting exponential time re's.

    > My concern about the multiprocessing module
    > technique is that launching a process for every regex evaluation sounds
    > pretty inefficient. And I don't think the threading module supports the
    > ability to kill threads from outside a thread itself.


    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 15, 2011
    #1
    1. Advertising

  2. Terry Reedy

    Nobody Guest

    On Thu, 15 Sep 2011 14:54:57 -0400, Terry Reedy wrote:

    >> I was thinking there might be a technique I could use to evaluate
    >> regular expressions in a thread or another process launched via
    >> multiprocessing module and then kill the thread/process after a
    >> specified timeout period.

    >
    > Only solution I remember ever seen posted. I wonder if there are any
    > heuristics for detecting exponential time re's.


    Exponential growth results from non-determinism, i.e. if there are
    multiple transitions for a given character from a given state.

    Common patterns include:

    ...(a...)?a...
    ...(a...)*a...
    ...(a...)+a...

    with a choice between matching the "a" at the start of the bracketed
    pattern or the "a" following it.

    (xxxa...|xxxb...|xxxc...)

    with a choice between branches which cannot be resolved until more data
    has been read.

    For re.search (as opposed to re.match):

    axxxa...

    When axxx has been read, a following "a" gives a choice between
    continuing the existing match with the second "a", or aborting and
    matching against the first "a". For patterns which contain many copies of
    the initial character, each copy creates another branch.

    Also, using back-references in a regexp [sic] can be a significant
    performance killer, as it rules out the use of a DFA (which, IIRC, Python
    doesn't do anyhow) and can require brute-forcing many combinations. A
    particularly bad case is:

    (a*)\1

    matching against "aaaaaaaa...".

    If the performance issue is with re.match/re.search rather than with
    re.compile, one option is to use ctypes to access libc's regexp functions.
    Those are likely to provide better searching throughput at the expense of
    potentially increased compilation time.
     
    Nobody, Sep 16, 2011
    #2
    1. Advertising

  3. Terry Reedy

    Terry Reedy Guest

    On 9/16/2011 9:57 AM, Nobody wrote:
    >>I wonder if there are any
    >> heuristics for detecting exponential time re's.

    >
    > Exponential growth results from non-determinism, i.e. if there are
    > multiple transitions for a given character from a given state.
    >
    > Common patterns include:
    >
    > ...(a...)?a...
    > ...(a...)*a...
    > ...(a...)+a...
    >
    > with a choice between matching the "a" at the start of the bracketed
    > pattern or the "a" following it.
    >
    > (xxxa...|xxxb...|xxxc...)
    >
    > with a choice between branches which cannot be resolved until more data
    > has been read.
    >
    > For re.search (as opposed to re.match):
    >
    > axxxa...
    >
    > When axxx has been read, a following "a" gives a choice between
    > continuing the existing match with the second "a", or aborting and
    > matching against the first "a". For patterns which contain many copies of
    > the initial character, each copy creates another branch.
    >
    > Also, using back-references in a regexp [sic] can be a significant
    > performance killer, as it rules out the use of a DFA (which, IIRC, Python
    > doesn't do anyhow) and can require brute-forcing many combinations. A
    > particularly bad case is:
    >
    > (a*)\1
    >
    > matching against "aaaaaaaa...".
    >
    > If the performance issue is with re.match/re.search rather than with
    > re.compile, one option is to use ctypes to access libc's regexp functions.
    > Those are likely to provide better searching throughput at the expense of
    > potentially increased compilation time.


    Now, can you write that as a heuristic *algorithm* def
    dangerous_re(some_re):?

    --
    Terry Jan Reedy
     
    Terry Reedy, Sep 16, 2011
    #3
  4. Terry Reedy

    Nobody Guest

    On Fri, 16 Sep 2011 18:01:27 -0400, Terry Reedy wrote:

    > Now, can you write that as a heuristic *algorithm*
    > def dangerous_re(some_re):?


    return re.search(r'\\\d', some_re) is not None

    That will handle the most extreme case ;)

    If the OP is serious about analysing regexps, sre_parse.parse() will
    decompose a regexp to a more convenient form.

    However, I wouldn't rely upon being able to catch every possible bad case.
    The only robust solution is to use a separate process (threads won't
    suffice, as they don't have a .kill() method).
     
    Nobody, Sep 17, 2011
    #4
  5. Terry Reedy

    Guest

    Thanks for everyone's comments - much appreciated!

    Malcolm (the OP)
     
    , Sep 18, 2011
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. VSK
    Replies:
    2
    Views:
    2,303
  2. John Nagle
    Replies:
    0
    Views:
    497
    John Nagle
    Jan 3, 2008
  3. Chris Angelico
    Replies:
    2
    Views:
    277
    Antoon Pardon
    Sep 21, 2011
  4. Stupid48
    Replies:
    6
    Views:
    151
    Ken Schaefer
    Mar 18, 2005
  5. Mark Probert

    Timeout::timeout and Socket timeout

    Mark Probert, Oct 6, 2004, in forum: Ruby
    Replies:
    1
    Views:
    1,290
    Brian Candler
    Oct 6, 2004
Loading...

Share This Page