Recursion limit of pickle?

Victor Lin

Hi,

I encountered a problem with pickle.
I downloaded an HTML page from:

http://www.amazon.com/Magellan-Maes...2?ie=UTF8&s=electronics&qid=1202541889&sr=1-2

and parsed it with BeautifulSoup.
The page is huge.
When I pickle the resulting soup, a "RuntimeError: maximum recursion
depth exceeded" occurs.
At first I thought it was caused by this issue:

http://bugs.python.org/issue1757062

But then I changed my mind, because I logged pickle's recursive calls
to a file and found that the recursion limit is exceeded midway through
expanding the whole BeautifulSoup object, not by repeated calls to the
same few methods.
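
(For reference, this is roughly how I logged the recursion depth: a
sketch that subclasses the pure-Python Pickler; the log file name is
only an example.)

import pickle
from cStringIO import StringIO

log = open('pickle_trace.log', 'w')

class TracingPickler(pickle.Pickler):
    # pickle's save() calls itself once per sub-object, so its
    # call depth shows how deep the object graph really is
    depth = 0
    def save(self, obj):
        TracingPickler.depth += 1
        log.write('%d %s\n' % (TracingPickler.depth, type(obj)))
        try:
            pickle.Pickler.save(self, obj)
        finally:
            TracingPickler.depth -= 1

# usage: TracingPickler(StringIO()).dump(soup)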

Here is the test code:

import sys
import urllib
import pickle
from BeautifulSoup import BeautifulSoup

# the soup seems very deep, so raise the recursion limit a lot
sys.setrecursionlimit(40000)

doc = urllib.urlopen('http://www.amazon.com/Magellan-Maestro-4040-Widescreen-Navigator/dp/B000NMKHW6/ref=sr_1_2?ie=UTF8&s=electronics&qid=1202541889&sr=1-2')

soup = BeautifulSoup(doc)
print pickle.dumps(soup)

-------------------
What I want to ask is: is this caused by the recursion limit and the
stack size?

I tried cPickle first, then pickle; cPickle just stopped the program
without any message. I think it is also implemented recursively, and it
overflows the stack when dumping the soup.
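
(One workaround I may try: do the dump in a thread with a bigger C
stack, since sys.setrecursionlimit only lifts Python's own guard while
the C stack can still overflow. Just a sketch; the sizes are guesses.)

import sys
import pickle
import threading

def dump_in_big_stack_thread(obj, filename):
    def worker():
        f = open(filename, 'wb')
        pickle.dump(obj, f, -1)
        f.close()
    # give the worker thread a larger C stack, so deep recursion
    # inside pickle does not overflow it (sizes are guesses)
    threading.stack_size(64 * 1024 * 1024)
    sys.setrecursionlimit(100000)
    t = threading.Thread(target=worker)
    t.start()
    t.join()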

Is there a version of pickle implemented in a non-recursive way?

Thanks.

Victor Lin.
 
Gabriel Genellina

I encountered a problem with pickle.
I downloaded an HTML page from:

http://www.amazon.com/Magellan-Maes...2?ie=UTF8&s=electronics&qid=1202541889&sr=1-2

and parsed it with BeautifulSoup.
The page is huge.
When I pickle the resulting soup, a "RuntimeError: maximum recursion
depth exceeded" occurs.

BeautifulSoup objects usually aren't pickleable, independently of your
recursion error.

py> import pickle
py> import BeautifulSoup
py> soup = BeautifulSoup.BeautifulSoup("<html><body>Hello, world!</html>")
py> print pickle.dumps(soup)
Traceback (most recent call last):
....
TypeError: 'NoneType' object is not callable
py>

Why do you want to pickle it? Store the downloaded page instead, and
rebuild the BeautifulSoup object later when needed.
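
Something along these lines (a minimal sketch; the cache file name is
only an example):

<code>
import os
import urllib
from BeautifulSoup import BeautifulSoup

def get_soup(url, cache_file='page_cache.html'):
    # store the raw HTML on disk and re-parse it when needed,
    # instead of pickling the parsed tree
    if os.path.exists(cache_file):
        html = open(cache_file, 'rb').read()
    else:
        html = urllib.urlopen(url).read()
        open(cache_file, 'wb').write(html)
    return BeautifulSoup(html)
</code>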
 
Victor Lin

BeautifulSoup objects usually aren't pickleable, independently of your
recursion error.
But I have pickled and unpickled other soup objects successfully.
Only this object seems too deep to pickle.

Why do you want to pickle it? Store the downloaded page instead, and
rebuild the BeautifulSoup object later when needed.

Because parsing HTML costs a lot of CPU time, I want to cache the soup
object in a file. If I need the same page again, I can load the
already-parsed soup from the cache file. My program's bottleneck is
parsing HTML, so if I can parse each page once and just unpickle it
later, it will save a lot of time.
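
(What I have in mind is a URL-keyed cache along these lines; a sketch
only, and it is exactly the pickle.dump() call that blows up on huge
pages.)

import os
import md5
import pickle
import urllib
from BeautifulSoup import BeautifulSoup

def cached_soup(url, cache_dir='soup_cache'):
    # one pickle file per URL, keyed by the URL's md5 hash
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
    path = os.path.join(cache_dir, md5.new(url).hexdigest())
    if os.path.exists(path):
        return pickle.load(open(path, 'rb'))
    soup = BeautifulSoup(urllib.urlopen(url))
    pickle.dump(soup, open(path, 'wb'))  # this is what fails on huge pages
    return soup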
 
Gabriel Genellina

Yes, I could reproduce the error. Worse, using cPickle instead of pickle,
Python just aborts (no exception traceback, no error printed, no Application
Error popup...). This is with Python 2.5.1 on Windows XP.

<code>
import urllib
import BeautifulSoup
import cPickle

doc = urllib.urlopen('http://www.amazon.com/Magellan-Maestro-4040-Widescreen-Navigator/dp/B000NMKHW6/ref=sr_1_2?ie=UTF8&s=electronics&qid=1202541889&sr=1-2')
soup = BeautifulSoup.BeautifulSoup(doc)
#print len(cPickle.dumps(soup,-1))
</code>

That page has an insane SELECT containing 1000 OPTIONs. Removing some of
them makes cPickle happy:

<code>
div = soup.find("div", id="buyboxDivId")
select = div.find("select", attrs={"name": "quantity"})
for i in range(200):  # remove 200 options out of 1000
    select.contents[5].extract()
print len(cPickle.dumps(soup, -1))
</code>

I don't know whether this is an error in BeautifulSoup or in pickle. That
SELECT with many OPTIONs is big, but not recursive (and I think that
BeautifulSoup uses weak references to build its links); in any case,
pickle is supposed to handle recursive structures well. The longest chain
of nested tags has length 32; in principle BS should have a similar
nesting complexity, so the "recursion limit exceeded" error is surprising.
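
One possible explanation (just a guess on my part): if the tree links
each node to the next one with ordinary references, then 1000 sibling
OPTIONs form a reference chain 1000 links long, and pickle recurses
along reference chains even when the markup itself isn't deeply nested.
A toy demonstration:

<code>
import pickle

class Node(object):
    pass

# build a flat "sibling chain": each node points at the next one
head = node = Node()
for i in range(10000):
    node.next = Node()
    node = node.next

pickle.dumps(head)  # RuntimeError: maximum recursion depth exceeded
</code>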
But I have pickled and unpickled other soup objects successfully.
Only this object seems too deep to pickle.

Yes, sorry, I was using an older version of BeautifulSoup.
 
