using re: hitting recursion limit

E

Erik Johnson

I have done a fair amount of regular expression text processing in Perl,
and am currently trying to convert a running Perl script into Python (for a
number of reasons I won't go into here). I have not had any problems with
memory limits using Perl, but in trying to clip out a particular table from
a web page, I am hitting Python's recursion limit.

The RE is pretty simple:

pat = '(<table.*?%s.*?</table>)' % magic_string

This seems about as simple as a "real-world" RE can get. If I cut down
the web page to about 150 lines, this works, but that's not practical - the
table I need to parse can easily gro to over 1000 lines. I found the
following bit in the reference manual from section 4.2.6:

#------------------------------------------------------------------------
Avoiding recursion
If you create regular expressions that require the engine to perform a lot
of recursion, you may encounter a RuntimeError exception with the message
maximum recursion limit exceeded. For example,

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/local/lib/python2.3/sre.py", line 132, in match
return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
You can often restructure your regular expression to avoid recursion.

Starting with Python 2.3, simple uses of the *? pattern are special-cased to
avoid recursion. Thus, the above regular expression can avoid recursion by
being recast as Begin [a-zA-Z0-9_ ]*?end. As a further benefit, such regular
expressions will run faster than their recursive equivalents.

#------------------------------------------------------------------------
Yeah, so this is a known problem, and sure enough, if you run this
example, it will crap out as shown above. I tried to build 2.3 from sources
before and ran into other funky problems. When I get a standard SuSE
distribution, I will upgrade to 2.3. In the meantime, I'm running 2.2.

If you take the advice and recast the RE as they suggest...
pat = r'Begin [a-zA-Z0-9_ ]*?end'
re.match(pat, s)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.2/sre.py", line 132, in match
return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
Ummm... wow... big improvement. :( And what if I am trying to match
'.*?' instead of '\w*?', is trying to build the equivalent mess for dot
going to help me? Didn't help for \w here anyway. So, there may be various
ways I can hack my way around this, but this is not making me a happy
camper. There are a lot of things I like about Python, but running into this
sort of thing so early is definitely not giving me a warm fuzzy feeling
about using Python as a general replacement for Perl.

One question is, "Does getting this error actually mean I have hit my
process' stack memory limit or is there possibly some artificial recursion
limit set in front of that?" If there is some number I can push up, I'd be
happy to let it run until the OS axes it for exceeding it's process limits
(i.e., I'm not clear on whether this RuntimeError imples that's what has
already happened). I'm not finding much of anything trying to Google "Python
recursion limit" (though I obviously haven't read all of the 10,000+ pages
Google will hit.) It hits this error basically "immediately", so it seems
unlikely to me that it has actually use all of the machiens memory, or even
all of my process' memory in under a tenth of a second:

ej@sand:~/src/python> time ./find_table.py

pat = '(<table.*?Unit No.*?</table>)'

Traceback (most recent call last):
File "./find_table.py", line 22, in ?
match = re.search(pat, html, re.DOTALL)
File "/usr/lib/python2.2/sre.py", line 137, in search
return _compile(pattern, flags).search(string)
RuntimeError: maximum recursion limit exceeded

real 0m0.049s
user 0m0.050s
sys 0m0.000s

I can probably hack my way around this and just use substrings to pick
out my table but there are other situations where there are multiple tables
before, after, and nested around the table that I want, all of which change
dynamically. I can't be very confident that a magic string alone inside
table tags is going to be unique. RE definitely seems like the fast and
painless (or rather, less painful) way to go here. So, short of telling me
to use 2.3 or a way I can reconfigure what the recursion limit is,
suggestions about smarter ways to pick out a table under those conditions
are welcomed.

As sort of an aside, I am aware of the ulimit command, as I poked around
with it a bit years ago, but am no systems guru. I remember it being a
stand-alone program under /sbin, but on my SuSE 8.2 system it appears to be
incorporated into bash itself? I do have root privs, so if you think you
have a solution via ulimit, I'm happy to hear, but it would appear I am not
being restricted by that:

ej@sand:~/src/python> ulimit
unlimited
ej@sand:~/src/python>

ej@sand:~/src/python> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 2047
virtual memory (kbytes, -v) unlimited



Thanks for taking the time to read my post! :)

-ej
 
T

Terry Reedy

Erik Johnson said:
One question is, "Does getting this error actually mean I have hit my
process' stack memory limit or is there possibly some artificial
recursion
limit set in front of that?" If there is some number I can push up, I'd
be
happy to let it run until the OS axes it for exceeding it's process
limits

Perhaps you are looking fo
sys.getrecursionlimit()
sys.setrecursionlimit(n)

Terry J. Reedy
 
A

Andrew Dalke

Erik said:
pat = '(<table.*?%s.*?</table>)' % magic_string

This seems about as simple as a "real-world" RE can get. If I cut down
the web page to about 150 lines, this works, but that's not practical - the
table I need to parse can easily gro to over 1000 lines.

Another option is to use assertion checks. The following seems to work
....
.... Blah blah blah.
.... <table id="foo">
.... asdfasdfadfasdfasdfaf asdfasdfadfasdfasdfaf asdfasdfadfasdfasdfaf
.... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
.... q u f jkh gertwerkjshk klajw h jhkrjesdfoisdjf sjhdfkjhwer
.... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
.... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
.... werwlkejr wljr wklerwlerwerwuoi dofg dif poritertkle ggkdfjgdlg
.... werwlkejr wljr wklerwlerwerwuoi hello dif poritertkle ggkdfjgdlg
.... said:
>>> import re
>>> magic_string = re.escape("hello")
>>> pat = re.compile(
.... r said:
>>> m = pat.search(text)>>> m.start(0) 18
>>> m.end(0) 490
>>> text[m.end(0)-10:m.end(0)]
'g\n said:

The .(?! ... ) pattern consumes all characters that don't
have the ... immediately afterwards. So the first big group
skips all characters that don't have magic_string or </table>
immediately afterwards, then it requires the magic_string,
then skips all characters that don't have '</table>' immediately
afterwards, then consumsed the </table>'

The .) is to consume the character which does immediately
follow with the desired pattern.

The (| is for the case when, say, "</table>" immediately
follows magic_string.

One question is, "Does getting this error actually mean I have hit my
process' stack memory limit or is there possibly some artificial recursion
limit set in front of that?"

It's an artificial recursion limit set to make sure you don't bust your
system's stack. I think the re engine respects sys.setrecursionlimit().

In your case you have one stack frame for every character, so if
your current limit is 150 characters then you'll need to double the
stack limit to hit 300 character, etc. You should consider some
other solution.
It hits this error basically "immediately", so it seems
unlikely to me that it has actually use all of the machiens memory, or even
all of my process' memory in under a tenth of a second:

The stack space is a lot smaller than the heap space. On my
OS X box I only have 8M of stack by default.

% limit
cputime unlimited
filesize unlimited
datasize 6144 kbytes
stacksize 8192 kbytes
coredumpsize 0 kbytes
memoryuse unlimited
descriptors 256
memorylocked unlimited
maxproc 100
%
> RE definitely seems like the fast and
painless (or rather, less painful) way to go here. So, short of telling me
to use 2.3 or a way I can reconfigure what the recursion limit is,
suggestions about smarter ways to pick out a table under those conditions
are welcomed.

An HTML parser is a better bet. HTML is just over the border of
what's parsable by HTML -- it works, but only if you can make some
strong assertions about the orgianization of your HTML. (Eg, that
characters aren't represented by character entities, that whitespace
will be as you expect, and more.)

Have you considered BeautifulSoup, ElementTree, or using
Python's sgmllib? See the writing of Mark Pilgrim for using
the last in his FeedParser
http://sourceforge.net/projects/feedparser/

Andrew
(e-mail address removed)
 
E

Erik Johnson

ayup....

sys.getrecursionlimit() and
sys.setrecursionlimit()

is the sort of thing I was thinking about, but wasn't aware of exactly what
or where the functions were. :) Thanks for that, Terry.

And I will have to study Andrew's reply for a while to fully understand
it, but I got the general idea. I have read about look-ahead assertions in
Perl before, but never actually used them. I think I can work with this.

Thank you both for the fast and helpful replies! :)

-ej
 
D

Dennis Benzinger

Erik Johnson said:
[...]
Yeah, so this is a known problem, and sure enough, if you run this
example, it will crap out as shown above. I tried to build 2.3 from sources
before and ran into other funky problems. When I get a standard SuSE
distribution, I will upgrade to 2.3. In the meantime, I'm running 2.2.
[...]

Perhaps you can wait a bit and directly upgrade to 2.4 with the new
non-recursive
regular expression engine.
Read more on http://www.python.org/2.4/highlights.html in the "New or
upgraded modules and packages" section.
 
O

Oliver Fromme

Erik Johnson said:
> I have done a fair amount of regular expression text processing in Perl,
> and am currently trying to convert a running Perl script into Python (for a
> number of reasons I won't go into here). I have not had any problems with
> memory limits using Perl, but in trying to clip out a particular table from
> a web page, I am hitting Python's recursion limit.
>
> The RE is pretty simple:
>
> pat = '(<table.*?%s.*?</table>)' % magic_string

Maybe the most efficient way is not ot use REs at all.

For example, one way would be to split the whole page on
the string "<table", strip off the closing "</table>" and
any stuff behind it, then look for the one containing your
magic_string.

tables = [t.split("</table>", 1)[0] for t in page.split("<table")]
for t in tables:
if t.find(magic_string) >= 0:
print t

There's one disadvantage: If you have nested tables, this
approach will only handle the innermost tables correctly
(i.e. those which don't contain further tables).

Best regards
Oliver
 
H

Helmut Jarausch

Erik said:
I have done a fair amount of regular expression text processing in Perl,
and am currently trying to convert a running Perl script into Python (for a
number of reasons I won't go into here). I have not had any problems with
memory limits using Perl, but in trying to clip out a particular table from
a web page, I am hitting Python's recursion limit.

The RE is pretty simple:

pat = '(<table.*?%s.*?</table>)' % magic_string

This seems about as simple as a "real-world" RE can get. If I cut down
the web page to about 150 lines, this works, but that's not practical - the
table I need to parse can easily gro to over 1000 lines. I found the
following bit in the reference manual from section 4.2.6:

You can do without REs. There is an (even non-recursive) version of
mxTextTools see http://simpleparse.sourceforge.net/
and it's really fast.

--
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,065
Latest member
OrderGreenAcreCBD

Latest Threads

Top