XML document causes pickle to go into infinite recursion


Orest Kozyar

I'm working on a CGI script that pulls XML data from a public database
(Medline) and caches this data using shelve to minimize load on the
database. In general, the script works quite well, but it crashes
every time I try to pickle one particular XML document. Below is a
script that illustrates the problem, followed by the stack trace that
is generated (thanks to Kent Johnson, who helped me refine the
script). I'd appreciate any advice for solving this particular
problem. Someone on Python-Tutor suggested that the XML document has
a circular reference, but I'm not sure exactly what that means, or why
the document would have a reference to itself.

import urllib
from pickle import Pickler
from cStringIO import StringIO
from xml.dom import minidom

baseurl = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'

params = {
    'db': 'pubmed',
    'retmode': 'xml',
    'rettype': 'medline',
}

badkey = '16842422'

params['id'] = badkey
url = baseurl + urllib.urlencode(params)
doc = minidom.parseString(urllib.urlopen(url).read())
print 'Successfully retrieved and parsed XML document with ID %s' % badkey

f = StringIO()
p = Pickler(f, 0)
p.dump(doc)

# Will fail on the above line
print 'Successfully shelved XML document with ID %s' % badkey

Here is the top of the stack trace:

  File "BadShelve.py", line 35, in <module>
    p.dump(doc)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 224, in dump
    self.save(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
    save(stuff)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 663, in _batch_setitems
    save(v)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
    save(stuff)
 

Stefan Behnel

Orest said:
I'm working on a CGI script that pulls XML data from a public database
(Medline) and caches this data using shelve to minimize load on the
database. In general, the script works quite well, but it crashes
every time I try to pickle one particular XML document. Below is a
script that illustrates the problem, followed by the stack trace that
is generated (thanks to Kent Johnson, who helped me refine the
script). I'd appreciate any advice for solving this particular
problem. Someone on Python-Tutor suggested that the XML document has
a circular reference, but I'm not sure exactly what that means, or why
the document would have a reference to itself.

minidom creates a pretty complete tree data structure, with loads of backlinks
to parent elements etc. That's where your circular references come from.
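
A quick way to see those backlinks (a minimal sketch, not from the
original post) is to parse a trivial document and follow parentNode:

from xml.dom import minidom

doc = minidom.parseString('<a><b/></a>')
root = doc.documentElement          # the <a> element
child = root.firstChild             # the <b> element

# Every node keeps a reference back to its parent, so the tree is full
# of cycles: doc -> root -> doc, root -> child -> root, and so on.
print root.parentNode is doc        # True
print child.parentNode is root      # True
print root.ownerDocument is doc     # True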

I don't know why you want to use pickle here (and not serialised XML or the
plain in-memory tree), but if memory consumption is an issue, try
cElementTree, which comes with Python 2.5 (or as an external module for older
versions). It's faster, more memory-friendly and easier to use than minidom.
There's also lxml.objectify, in case you can't live without pickling.

http://effbot.org/zone/celementtree.htm
http://codespeak.net/lxml
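
As a rough sketch of what the cElementTree route could look like for
this script (the 'ArticleTitle' tag is an assumption about the efetch
response, not something verified here):

import urllib
import xml.etree.cElementTree as ET   # bundled with Python 2.5

baseurl = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
params = {'db': 'pubmed', 'retmode': 'xml', 'rettype': 'medline', 'id': '16842422'}

xml_text = urllib.urlopen(baseurl + urllib.urlencode(params)).read()

# Parse into a lightweight element tree; there are no parent backlinks,
# so there is nothing circular to trip over.
root = ET.fromstring(xml_text)

# Hypothetical example of pulling data out of the response.
for title in root.findall('.//ArticleTitle'):
    print title.text

# To store the document, serialise it back to a string rather than
# pickling the tree.
raw = ET.tostring(root)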

Stefan
 

Stefan Behnel

Orest said:
I'm working on a CGI script that pulls XML data from a public database

Ah, I missed that bit on first read. Consider using something other than
CGI here if you want to do caching. FCGI would allow you to do in-memory
caching, for example, as would mod_python and a lot of other solutions.
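
In a long-running process the cache can be as simple as a module-level
dict (a minimal sketch; fetch_record is a hypothetical helper that does
the actual efetch call, not something from this thread):

_cache = {}   # kept alive between requests under FCGI, mod_python, etc.

def get_record(pmid):
    # Fetch once, then serve every later request from memory.
    if pmid not in _cache:
        _cache[pmid] = fetch_record(pmid)   # hypothetical fetch helper
    return _cache[pmid]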

Stefan
 

Orest Kozyar

minidom creates a pretty complete tree data structure, with loads of backlinks
to parent elements etc. That's where your circular references come from.

I don't know why you want to use pickle here (and not serialised XML or the
plain in-memory tree), but if memory consumption is an issue, try
cElementTree, which comes with Python 2.5 (or as an external module for older
versions). It's faster, more memory-friendly and easier to use than minidom.
There's also lxml.objectify, in case you can't live without pickling.

I wasn't aware of cElementTree. When I was looking for examples of
how to parse XML documents, most of the tutorials I came across used
minidom. Thanks for pointing this out. I've switched over, and I
like ElementTree much better than minidom (the structure returned by
the minidom parser always seemed overly complex).

I've also stopped pickling and am storing the XML files as raw text in
directories, which also solved my other problem with shelve.
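
The raw-text cache can be something along these lines (a sketch only;
the one-file-per-ID layout and the names are my own choices, not taken
from the original post):

import os
import urllib

BASEURL = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'
PARAMS = {'db': 'pubmed', 'retmode': 'xml', 'rettype': 'medline'}
CACHE_DIR = 'medline_cache'   # one .xml file per PubMed ID

def cache_path(pmid):
    return os.path.join(CACHE_DIR, '%s.xml' % pmid)

def get_record_xml(pmid):
    # Return the cached raw XML for a record, fetching it on a miss.
    path = cache_path(pmid)
    if os.path.exists(path):
        return open(path).read()
    query = dict(PARAMS, id=pmid)
    xml_text = urllib.urlopen(BASEURL + urllib.urlencode(query)).read()
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    open(path, 'w').write(xml_text)
    return xml_text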
 

Orest Kozyar

I'm working on a CGI script that pulls XML data from a public database
Ah, I missed that bit on first read. Consider using something other than
CGI here if you want to do caching. FCGI would allow you to do in-memory
caching, for example, as would mod_python and a lot of other solutions.

What I'm aiming for is sort of a "permanent" disk cache/mirror that
adds records as needed. The main issue is that the database (PubMed)
asks that we limit requests to one every three seconds. I often
need to access data for hundreds of records, so I've set up a cron job
to cache the needed records to disk overnight. I was using shelve
before, but the file grew to about 500 MB and I started having
performance problems with shelve. So I've stopped using shelve and am
storing each record separately on disk now.
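
The overnight cron job then only has to fetch whatever is missing and
pause between requests (a sketch, reusing the hypothetical cache_path
and get_record_xml helpers from the sketch above):

import os
import time

def cache_overnight(pmids):
    for pmid in pmids:
        if not os.path.exists(cache_path(pmid)):
            get_record_xml(pmid)
            time.sleep(3)   # stay under one request every three seconds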

Thanks!
Orest
 

Paul Boddie

What I'm aiming for is sort of a "permanent" disk cache/mirror that
adds records as needed. The main issue is that the database (PubMed)
asks that we limit requests to one every three seconds.

Drifting off-topic slightly, it is possible to download archives of
PubMed, but you'll need a lot of disk space; you probably know this
already. Putting the data into something rapidly searchable can demand
even more space, depending on what you use and what you want to be
able to search for. I guess it depends on whether you're retrieving
relatively few documents across the whole of PubMed or whether your
searches are concentrated in particular sections of the whole archive.

Paul
 
