Python RegExp

davidchipping · Mar 22, 2005

I have a string which I wish to match using RE, however when I run my
comparison (using the match method) on my machine it never returns,
using the CPU fully.

The code is (very simply):

------------------------------
import re

buffer = r"#1 1 550 111 SYNC_PEER RES
<YP>syncpeers=(id=54325432;add=10." \

"0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port=543;up=543)," \
"(id=54325432;add=10.0.0.1;port=89;up=8"

p = re.compile(r'#(?P<tran>[0-9]+\s)(?P<sess>[0-9]+\s)' \
+ '(?P<ndto>[0-9]+\s)(?P<ndfr>[0-9]+\s)' \
+ '(?P<cmd>[a-zA-Z_]+\s)(?P<dir>[a-zA-Z_]+)' \
+ '(?P<parm>(\s<[a-zA-Z]+>(\s*[a-zA-Z]+=[a-zA' \
+ '-Z0-9.,=();]+\s*)+<[a-zA-Z]+>)*)#(?P<left>.*$)')

print "starting the match"
rst = (p.match (buffer) != None)
print "finishing the match"

Fredrik Lundh · Mar 22, 2005

I have a string which I wish to match using RE, however when I run my
comparison (using the match method) on my machine it never returns,
using the CPU fully.

The code is (very simply):

------------------------------
import re

buffer = r"#1 1 550 111 SYNC_PEER RES
<YP>syncpeers=(id=54325432;add=10." \

"0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port=543;up=543)," \
"(id=54325432;add=10.0.0.1;port=89;up=8"

p = re.compile(r'#(?P<tran>[0-9]+\s)(?P<sess>[0-9]+\s)' \
+ '(?P<ndto>[0-9]+\s)(?P<ndfr>[0-9]+\s)' \
+ '(?P<cmd>[a-zA-Z_]+\s)(?P<dir>[a-zA-Z_]+)' \
+ '(?P<parm>(\s<[a-zA-Z]+>(\s*[a-zA-Z]+=[a-zA' \
+ '-Z0-9.,=();]+\s*)+<[a-zA-Z]+>)*)#(?P<left>.*$)')

print "starting the match"
rst = (p.match (buffer) != None)
print "finishing the match"

surely the following should never be slow?

n = len(s)
for a in range(n):
for b in range(n):
for c in range(n):
for d in range(n):
for e in range(n):
... etc ...

(your RE contains several nested and/or related repeat operators,
which forces the poor engine to test zillions of combinations before
giving up. see Friedl's RE book for more information on this)

I suggest using a real parser for this task (or using the RE engine as
a tokenizer, and doing the rest in code; see
http://effbot.org/zone/xml-scanner.htm for more on this)

</F>

Jeff Epler · Mar 22, 2005

On my machine the program finishes in 30 seconds. (it's a 1.5GHz machine)
If the 'parm' group is removed, or if the buffer is shortened, the time
is reduced considerably.

There are "pathological cases" for regular expressions which can take
quite a long time. In the case of your expression, it's happening for
the group 'parm'. I think, but don't know, that each time a candidate
for 'parm' is found, the following '#' (or maybe the second '<'?) is not
found, and it backtracks to try to match 'parm' in a different way,
which involves considering many different combinations (basically, each
'name=' could be the start of a new instance of the first parenthsized
subgroup of <parm>, or it could be part of the character class that
includes [a-zA-Z=])

You may wish to consider using other approaches for parsing this text
than regular expressions.

Jeff

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFCQK+2Jd01MZaTXX0RApByAJ4hIU1RVFWwdOeX28RLwQH1rSk5awCeJXrE
mhBvK2/gMBghI/VamQfVFYc=
=8+da
-----END PGP SIGNATURE-----

David Chipping · Mar 23, 2005

Jeff,

Thanks very much for that, I didn't even consider the option of it
finishing, considering I'm using a much slower machine it was running
for over 2 minutes when I just killed it! I think I get the rest now.

Cheers again,

-David

D H · Mar 23, 2005

I have a string which I wish to match using RE, however when I run my
comparison (using the match method) on my machine it never returns,
using the CPU fully.

In your case it may be simpler to just split the string into groups.
You don't even need regular expressions or a parser.

buff = r"#1 1 550 111 SYNC_PEER RES <YP>syncpeers=(id=54325432;add=10." \
"0.0.1;port=89;up=89),(id=97899;add=10.0.0.1;port=543;up=543)," \
"(id=54325432;add=10.0.0.1;port=89;up=8)"

tran, sess, ndto, ndf, cmd, dirr, rest = buff.split()

eq = rest.find("=")
parmname = rest[0:eq]
parms = rest[eq+1:].split(",")

for parm in parms:
parmitems = parm[1:-1].split(";")
for item in parmitems:
name, val = item.split("=")
print name, val
print "---"

Named regexp variables, an extension proposal.	8	May 13, 2006
Python code problem	2	Apr 23, 2023
Adding adressing of IPv6 to program	1	Feb 16, 2023
My first Python program -- a lexer	25	Nov 8, 2008
Help with code	0	Jun 12, 2022
Working on mobile css menu with plenty of frustration!	2	Dec 29, 2022
HOWTO: Parsing email using Python part1	2	Jul 3, 2011
Help please	8	Jul 7, 2023

Python RegExp

davidchipping

Fredrik Lundh

Jeff Epler

David Chipping

D H

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads