XML document causes pickle to go into infinite recursion


Orest Kozyar

I'm working on a CGI script that pulls XML data from a public database
(Medline) and caches this data using shelve to minimize load on the
database. In general, the script works quite well, but it crashes
every time I try to pickle one particular XML document. Below is a
script that illustrates the problem, followed by the stack trace that
is generated (thanks to Kent Johnson, who helped me refine the
script). I'd appreciate any advice for solving this particular
problem. Someone on Python-Tutor suggested that the XML document has
a circular reference, but I'm not sure exactly what that means, or why
the document would have a reference to itself.

import urllib
from pickle import Pickler
from cStringIO import StringIO
from xml.dom import minidom

baseurl = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'

params = {
    'db': 'pubmed',
    'retmode': 'xml',
    'rettype': 'medline',
}

badkey = '16842422'

params['id'] = badkey
url = baseurl + urllib.urlencode(params)
doc = minidom.parseString(urllib.urlopen(url).read())
print 'Successfully retrieved and parsed XML document with ID %s' % badkey

f = StringIO()
p = Pickler(f, 0)
p.dump(doc)

# Will fail on the above line
print 'Successfully shelved XML document with ID %s' % badkey

Here is the top of the stack trace:

  File "BadShelve.py", line 35, in <module>
    p.dump(doc)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 224, in dump
    self.save(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
    save(stuff)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
    save(stuff)
 

Stefan Behnel

Orest said:
I'm working on a CGI script that pulls XML data from a public database
(Medline) and caches this data using shelve to minimize load on the
database. In general, the script works quite well, but it crashes
every time I try to pickle one particular XML document. Below is a
script that illustrates the problem, followed by the stack trace that
is generated (thanks to Kent Johnson, who helped me refine the
script). I'd appreciate any advice for solving this particular
problem. Someone on Python-Tutor suggested that the XML document has
a circular reference, but I'm not sure exactly what that means, or why
the document would have a reference to itself.

minidom creates a pretty complete tree data structure, with loads of backlinks
to parent elements etc. That's where your circular references come from.
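
A quick way to see those backlinks (a minimal sketch, not from the
original post) is to parse a trivial document and follow parentNode:

from xml.dom import minidom

doc = minidom.parseString('<a><b/></a>')
root = doc.documentElement          # the <a> element
child = root.firstChild             # the <b> element

# Every node keeps a reference back to its parent, so the tree is full
# of cycles: doc -> root -> doc, root -> child -> root, and so on.
print root.parentNode is doc        # True
print child.parentNode is root      # True
print root.ownerDocument is doc     # True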

I don't know why you want to use pickle here (and not serialised XML or the
plain in-memory tree), but if memory consumption is an issue, try
cElementTree, which comes with Python 2.5 (or as an external module for older
versions). It's faster, more memory-friendly and easier to use than minidom.
There's also lxml.objectify, in case you can't live without pickling.

http://effbot.org/zone/celementtree.htm
http://codespeak.net/lxml
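
As a rough sketch of what the cElementTree route could look like for
this script (the 'ArticleTitle' tag is an assumption about the efetch
response, not something verified here):

import urllib
import xml.etree.cElementTree as ET   # bundled with Python 2.5

baseurl = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
params = {'db': 'pubmed', 'retmode': 'xml', 'rettype': 'medline', 'id': '16842422'}

xml_text = urllib.urlopen(baseurl + urllib.urlencode(params)).read()

# Parse into a lightweight element tree; there are no parent backlinks,
# so there is nothing circular to trip over.
root = ET.fromstring(xml_text)

# Hypothetical example of pulling data out of the response.
for title in root.findall('.//ArticleTitle'):
    print title.text

# To store the document, serialise it back to a string rather than
# pickling the tree.
raw = ET.tostring(root)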

Stefan
 

Stefan Behnel

Orest said:
I'm working on a CGI script that pulls XML data from a public database

Ah, I missed that bit on first read. Consider using something other than
CGI here if you want to do caching. FCGI would allow you to do in-memory
caching, for example, as would mod_python and a lot of other solutions.
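
In a long-running process the cache can be as simple as a module-level
dict (a minimal sketch; fetch_record is a hypothetical helper that does
the actual efetch call, not something from this thread):

_cache = {}   # kept alive between requests under FCGI, mod_python, etc.

def get_record(pmid):
    # Fetch once, then serve every later request from memory.
    if pmid not in _cache:
        _cache[pmid] = fetch_record(pmid)   # hypothetical fetch helper
    return _cache[pmid]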

Stefan
 

Orest Kozyar

minidom creates a pretty complete tree data structure, with loads of backlinks
to parent elements etc. That's where your circular references come from.

I don't know why you want to use pickle here (and not serialised XML or the
plain in-memory tree), but if memory consumption is an issue, try
cElementTree, which comes with Python 2.5 (or as an external module for older
versions). It's faster, more memory-friendly and easier to use than minidom.
There's also lxml.objectify, in case you can't live without pickling.

I wasn't aware of cElementTree. When I was looking for examples of
how to parse XML documents, most of the tutorials I came across used
minidom. Thanks for pointing this out. I've switched over, and I
like ElementTree much better than minidom (the structure returned by
the minidom parser always seemed overly complex).

I've also stopped pickling and am storing the XML files as raw text in
directories, which also solved my other problem with shelve.
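
The raw-text cache can be something along these lines (a sketch only;
the one-file-per-ID layout and the names are my own choices, not taken
from the original post):

import os
import urllib

BASEURL = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
PARAMS = {'db': 'pubmed', 'retmode': 'xml', 'rettype': 'medline'}
CACHE_DIR = 'medline_cache'   # one .xml file per PubMed ID

def cache_path(pmid):
    return os.path.join(CACHE_DIR, '%s.xml' % pmid)

def get_record_xml(pmid):
    # Return the cached raw XML for a record, fetching it on a miss.
    path = cache_path(pmid)
    if os.path.exists(path):
        return open(path).read()
    query = dict(PARAMS, id=pmid)
    xml_text = urllib.urlopen(BASEURL + urllib.urlencode(query)).read()
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    open(path, 'w').write(xml_text)
    return xml_text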
 

Orest Kozyar

I'm working on a CGI script that pulls XML data from a public database
Ah, I missed that bit on first read. Consider using something other than
CGI here if you want to do caching. FCGI would allow you to do in-memory
caching, for example, as would mod_python and a lot of other solutions.

What I'm aiming for is sort of a "permanent" disk cache/mirror that
adds records as needed. The main issue is that the database (PubMed)
asks that we limit requests to one every three seconds. I often
need to access data for hundreds of records, so I've set up a cron job
to cache the needed records to disk overnight. I was using shelve
before, but the file grew to about 500 MB and I started having
performance problems with shelve. So I've stopped using shelve and am
storing each record separately on disk now.
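
The overnight cron job then only has to fetch whatever is missing and
pause between requests (a sketch, reusing the hypothetical cache_path
and get_record_xml helpers from the sketch above):

import os
import time

def cache_overnight(pmids):
    for pmid in pmids:
        if not os.path.exists(cache_path(pmid)):
            get_record_xml(pmid)
            time.sleep(3)   # stay under one request every three seconds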

Thanks!
Orest
 

Paul Boddie

What I'm aiming for is sort of a "permanent" disk cache/mirror that
adds records as needed. The main issue is that the database (PubMed)
asks that we limit requests to one every three seconds.

Drifting off-topic slightly, it is possible to download archives of
PubMed, but you'll need a lot of disk space; you probably know this
already. Putting the data into something rapidly searchable can demand
even more space, depending on what you use and what you want to be
able to search for. I guess it depends on whether you're retrieving
relatively few documents across the whole of PubMed or whether your
searches are concentrated in particular sections of the whole archive.

Paul
 
