Python RegExp

Discussion in 'Python' started by davidchipping@gmail.com, Mar 22, 2005.

  1. Guest

    I have a string which I wish to match using RE, however when I run my
    comparison (using the match method) on my machine it never returns,
    using the CPU fully.

    The code is (very simply):

    ------------------------------
    import re

    buffer = r"#1 1 550 111 SYNC_PEER RES
    <YP>syncpeers=(id=54325432;add=10." \

    "0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port=543;up=543)," \
    "(id=54325432;add=10.0.0.1;port=89;up=8"

    p = re.compile(r'#(?P<tran>[0-9]+\s)(?P<sess>[0-9]+\s)' \
    + '(?P<ndto>[0-9]+\s)(?P<ndfr>[0-9]+\s)' \
    + '(?P<cmd>[a-zA-Z_]+\s)(?P<dir>[a-zA-Z_]+)' \
    + '(?P<parm>(\s<[a-zA-Z]+>(\s*[a-zA-Z]+=[a-zA' \
    + '-Z0-9.,=();]+\s*)+<[a-zA-Z]+>)*)#(?P<left>.*$)')

    print "starting the match"
    rst = (p.match (buffer) != None)
    print "finishing the match"
    --------------------------------

    Should this every actually happen? Regardless of whether the statment
    matches or not, surely it should do this?
    , Mar 22, 2005
    #1
    1. Advertising

  2. wrote:

    >I have a string which I wish to match using RE, however when I run my
    > comparison (using the match method) on my machine it never returns,
    > using the CPU fully.
    >
    > The code is (very simply):
    >
    > ------------------------------
    > import re
    >
    > buffer = r"#1 1 550 111 SYNC_PEER RES
    > <YP>syncpeers=(id=54325432;add=10." \
    >
    > "0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port=543;up=543)," \
    > "(id=54325432;add=10.0.0.1;port=89;up=8"
    >
    > p = re.compile(r'#(?P<tran>[0-9]+\s)(?P<sess>[0-9]+\s)' \
    > + '(?P<ndto>[0-9]+\s)(?P<ndfr>[0-9]+\s)' \
    > + '(?P<cmd>[a-zA-Z_]+\s)(?P<dir>[a-zA-Z_]+)' \
    > + '(?P<parm>(\s<[a-zA-Z]+>(\s*[a-zA-Z]+=[a-zA' \
    > + '-Z0-9.,=();]+\s*)+<[a-zA-Z]+>)*)#(?P<left>.*$)')
    >
    > print "starting the match"
    > rst = (p.match (buffer) != None)
    > print "finishing the match"
    > --------------------------------
    >
    > Should this every actually happen? Regardless of whether the statment
    > matches or not, surely it should do this?


    surely the following should never be slow?

    n = len(s)
    for a in range(n):
    for b in range(n):
    for c in range(n):
    for d in range(n):
    for e in range(n):
    ... etc ...

    (your RE contains several nested and/or related repeat operators,
    which forces the poor engine to test zillions of combinations before
    giving up. see Friedl's RE book for more information on this)

    I suggest using a real parser for this task (or using the RE engine as
    a tokenizer, and doing the rest in code; see
    http://effbot.org/zone/xml-scanner.htm for more on this)

    </F>
    Fredrik Lundh, Mar 22, 2005
    #2
    1. Advertising

  3. Jeff Epler Guest

    On my machine the program finishes in 30 seconds. (it's a 1.5GHz machine)
    If the 'parm' group is removed, or if the buffer is shortened, the time
    is reduced considerably.

    There are "pathological cases" for regular expressions which can take
    quite a long time. In the case of your expression, it's happening for
    the group 'parm'. I think, but don't know, that each time a candidate
    for 'parm' is found, the following '#' (or maybe the second '<'?) is not
    found, and it backtracks to try to match 'parm' in a different way,
    which involves considering many different combinations (basically, each
    'name=' could be the start of a new instance of the first parenthsized
    subgroup of <parm>, or it could be part of the character class that
    includes [a-zA-Z=])

    You may wish to consider using other approaches for parsing this text
    than regular expressions.

    Jeff

    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.2.6 (GNU/Linux)

    iD8DBQFCQK+2Jd01MZaTXX0RApByAJ4hIU1RVFWwdOeX28RLwQH1rSk5awCeJXrE
    mhBvK2/gMBghI/VamQfVFYc=
    =8+da
    -----END PGP SIGNATURE-----
    Jeff Epler, Mar 22, 2005
    #3
  4. Jeff,

    Thanks very much for that, I didn't even consider the option of it
    finishing, considering I'm using a much slower machine it was running
    for over 2 minutes when I just killed it! I think I get the rest now.

    Cheers again,

    -David


    On Tue, 22 Mar 2005 17:52:22 -0600, Jeff Epler <> wrote:
    > On my machine the program finishes in 30 seconds. (it's a 1.5GHz machine)
    > If the 'parm' group is removed, or if the buffer is shortened, the time
    > is reduced considerably.
    >
    > There are "pathological cases" for regular expressions which can take
    > quite a long time. In the case of your expression, it's happening for
    > the group 'parm'. I think, but don't know, that each time a candidate
    > for 'parm' is found, the following '#' (or maybe the second '<'?) is not
    > found, and it backtracks to try to match 'parm' in a different way,
    > which involves considering many different combinations (basically, each
    > 'name=' could be the start of a new instance of the first parenthsized
    > subgroup of <parm>, or it could be part of the character class that
    > includes [a-zA-Z=])
    >
    > You may wish to consider using other approaches for parsing this text
    > than regular expressions.
    >
    > Jeff
    >
    >
    >
    David Chipping, Mar 23, 2005
    #4
  5. D H Guest

    wrote:
    > I have a string which I wish to match using RE, however when I run my
    > comparison (using the match method) on my machine it never returns,
    > using the CPU fully.


    In your case it may be simpler to just split the string into groups.
    You don't even need regular expressions or a parser.

    buff = r"#1 1 550 111 SYNC_PEER RES <YP>syncpeers=(id=54325432;add=10." \
    "0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port=543;up=543)," \
    "(id=54325432;add=10.0.0.1;port=89;up=8)"

    tran, sess, ndto, ndf, cmd, dirr, rest = buff.split()

    eq = rest.find("=")
    parmname = rest[0:eq]
    parms = rest[eq+1:].split(",")

    for parm in parms:
    parmitems = parm[1:-1].split(";")
    for item in parmitems:
    name, val = item.split("=")
    print name, val
    print "---"
    D H, Mar 23, 2005
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Greg Hurrell
    Replies:
    4
    Views:
    148
    James Edward Gray II
    Feb 14, 2007
  2. Mikel Lindsaar
    Replies:
    0
    Views:
    461
    Mikel Lindsaar
    Mar 31, 2008
  3. Joao Silva
    Replies:
    16
    Views:
    337
    7stud --
    Aug 21, 2009
  4. Uldis  Bojars
    Replies:
    2
    Views:
    182
    Janwillem Borleffs
    Dec 17, 2006
  5. Matìj Cepl

    new RegExp().test() or just RegExp().test()

    Matìj Cepl, Nov 24, 2009, in forum: Javascript
    Replies:
    3
    Views:
    168
    Matěj Cepl
    Nov 24, 2009
Loading...

Share This Page