XML document causes pickle to go into infinite recursion

Discussion in 'Python' started by Orest Kozyar, Nov 1, 2007.

  1. Orest Kozyar

    Orest Kozyar Guest

    I'm working on a CGI script that pulls XML data from a public database
    (Medline) and caches this data using shelve to minimize load on the
    database. In general, the script works quite well, but keeps crashing
    every time I try to pickle a particular XML document. Below is a
    script that illustrates the problem, followed by the stack trace that
    is generated (thanks to Kent Johnson who helped me refine the
    script). I'd appreciate any advice for solving this particular
    problem. Someone on Python-Tutor suggested that the XML document has
    a circular reference, but I'm not sure exactly what this means, or why
    the document would have a reference to itself.

    import urllib
    from pickle import Pickler
    from cStringIO import StringIO
    from xml.dom import minidom

    baseurl = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?'

    params = {
    'db': 'pubmed',
    'retmode': 'xml',
    'rettype': 'medline',
    }

    badkey = '16842422'

    params['id'] = badkey
    url = baseurl + urllib.urlencode(params)
    doc = minidom.parseString(urllib.urlopen(url).read())
    print 'Successfully retrieved and parsed XML document with ID %s' % badkey

    f = StringIO()
    p = Pickler(f, 0)
    p.dump(doc)

    #Will fail on the above line
    print 'Successfully shelved XML document with ID %s' % badkey

    Here is the top of the stack trace:
      File "BadShelve.py", line 35, in <module>
        p.dump(doc)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 224, in dump
        self.save(obj)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
        f(self, obj) # Call unbound method with explicit self
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
        save(stuff)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
        f(self, obj) # Call unbound method with explicit self
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 649, in save_dict
        self._batch_setitems(obj.iteritems())
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 663, in _batch_setitems
        save(v)
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
        f(self, obj) # Call unbound method with explicit self
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
        save(stuff)
     
    Orest Kozyar, Nov 1, 2007
    #1

  2. Orest Kozyar wrote:
    > I'm working on a CGI script that pulls XML data from a public database
    > (Medline) and caches this data using shelve to minimize load on the
    > database. In general, the script works quite well, but keeps crashing
    > every time I try to pickle a particular XML document. Below is a
    > script that illustrates the problem, followed by the stack trace that
    > is generated (thanks to Kent Johnson who helped me refine the
    > script). I'd appreciate any advice for solving this particular
    > problem. Someone on Python-Tutor suggested that the XML document has
    > a circular reference, but I'm not sure exactly what this means, or why
    > the document would have a reference to itself.


    minidom creates a pretty complete tree data structure, with loads of backlinks
    to parent elements etc. That's where your circular references come from.
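
    The backlinks are easy to see on a small tree. The following is a
    minimal sketch written for current Python 3 (the thread itself uses
    Python 2.5); the sample document and its element names are made up for
    illustration. It only makes the links visible and does not by itself
    reproduce the recursion from the stack trace.

```python
from xml.dom import minidom

# A tiny made-up document; the element names are illustrative only.
doc = minidom.parseString(
    "<article><title>Example</title><year>2007</year></article>"
)

root = doc.documentElement
title = root.firstChild       # the <title> element
year = title.nextSibling      # the <year> element

# Each node points back at its parent, while the parent lists the node
# among its children, so the object graph contains reference cycles.
root_points_to_doc = root.parentNode is doc
title_points_to_root = title.parentNode is root
siblings_linked_both_ways = year.previousSibling is title
node_knows_document = title.ownerDocument is doc

print(root_points_to_doc, title_points_to_root,
      siblings_linked_both_ways, node_knows_document)
```

    Following parentNode, childNodes and the sibling attributes from any
    node eventually leads back to the same node, which is what "circular
    reference" means here.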

    I don't know why you want to use pickle here (and not serialised XML or the
    plain in-memory tree), but if memory consumption is an issue, try
    cElementTree, which comes with Python 2.5 (or as an external module for older
    versions). It's faster, more memory friendly and easier to use than minidom.
    There's also lxml.objectify, in case you can't live without pickling.

    http://effbot.org/zone/celementtree.htm
    http://codespeak.net/lxml
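
    As a rough sketch of the "serialised XML" route, assuming current
    Python 3, where the accelerated parser is reached through
    xml.etree.ElementTree rather than a separate cElementTree module: the
    cache stores the serialised text instead of the parsed tree, and the
    tree is rebuilt on load. The helper names and the sample record are
    hypothetical.

```python
import pickle
import xml.etree.ElementTree as ET

# Made-up sample record; the element names are illustrative only.
sample = "<record><id>16842422</id><title>Example</title></record>"


def to_cache_bytes(xml_text):
    """Parse the XML to validate it, then pickle the serialised text."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    serialised = ET.tostring(root, encoding="unicode")
    return pickle.dumps(serialised)


def from_cache_bytes(blob):
    """Unpickle the serialised text and rebuild an element tree from it."""
    return ET.fromstring(pickle.loads(blob))


blob = to_cache_bytes(sample)
restored = from_cache_bytes(blob)
print(restored.findtext("id"))
```

    Only a plain string ever reaches pickle, so the shape of the parsed
    tree no longer matters for caching.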

    Stefan
     
    Stefan Behnel, Nov 1, 2007
    #2

  3. Orest Kozyar wrote:
    > I'm working on a CGI script that pulls XML data from a public database


    Ah, I missed that bit on first read. Consider using something different than
    CGI here if you want to do caching. FCGI would allow you to do in-memory
    caching, for example, as would mod_python and a lot of other solutions.
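
    A plain CGI script starts a new process for every request, so anything
    held in memory is lost between requests; a long-running process keeps
    it. The following minimal sketch illustrates in-memory caching in
    Python 3. fetch_record is a hypothetical stand-in for the slow remote
    request and is not part of any library.

```python
from functools import lru_cache

calls = []  # records how often the "remote" fetch actually runs


def fetch_record(record_id):
    """Hypothetical stand-in for a slow request to the remote database."""
    calls.append(record_id)
    return "<record id='%s'/>" % record_id


@lru_cache(maxsize=1024)
def cached_record(record_id):
    """Return the record, fetching it only on the first request per ID."""
    return fetch_record(record_id)


first = cached_record("16842422")
second = cached_record("16842422")  # served from memory, no second fetch
print(len(calls))
```

    The cache lives only as long as the process does, which is why this
    helps under FCGI or mod_python but not under plain CGI.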

    Stefan
     
    Stefan Behnel, Nov 1, 2007
    #3
  4. Orest Kozyar

    Orest Kozyar Guest


    > minidom creates a pretty complete tree data structure, with loads of backlinks
    > to parent elements etc. That's where your circular references come from.
    >
    > I don't know why you want to use pickle here (and not serialised XML or the
    > plain in-memory tree), but if memory consumption is an issue, try
    > cElementTree, which comes with Python 2.5 (or as an external module for older
    > versions). It's faster, more memory friendly and easier to use than minidom.
    > There's also lxml.objectify, in case you can't live without pickling.


    I wasn't aware of cElementTree. When I was looking for examples of
    how to parse XML documents, most of the tutorials I came across used
    minidom. Thanks for pointing this out. I've switched over, and I
    like ElementTree much better than minidom (the structure returned by
    the minidom parser always seemed overly complex).

    I've also gotten rid of the pickling and am storing the XML files as
    raw text in directories, so this got rid of my other problem with
    shelve.
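
    A minimal sketch of that kind of file-per-record storage, written for
    Python 3; the function names and the "<id>.xml" naming scheme are
    assumptions made for illustration, not the layout actually used here.

```python
import tempfile
from pathlib import Path


def cache_path(cache_dir, record_id):
    """Map a record ID to its file (hypothetical '<id>.xml' naming)."""
    return Path(cache_dir) / ("%s.xml" % record_id)


def save_record(cache_dir, record_id, xml_text):
    """Write the raw XML text for one record to its own file."""
    path = cache_path(cache_dir, record_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(xml_text, encoding="utf-8")


def load_record(cache_dir, record_id):
    """Return the cached XML text, or None if the record is not cached."""
    path = cache_path(cache_dir, record_id)
    if path.exists():
        return path.read_text(encoding="utf-8")
    return None


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    save_record(tmp, "16842422", "<record/>")
    hit = load_record(tmp, "16842422")
    miss = load_record(tmp, "0")

print(hit, miss)
```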
     
    Orest Kozyar, Nov 1, 2007
    #4
  5. Orest Kozyar

    Orest Kozyar Guest

    > > I'm working on a CGI script that pulls XML data from a public database
    >
    > Ah, I missed that bit on first read. Consider using something different than
    > CGI here if you want to do caching. FCGI would allow you to do in-memory
    > caching, for example, as would mod_python and a lot of other solutions.


    What I'm aiming for is sort of a "permanent" disk cache/mirror that
    adds records as needed. The main issue is that the database (PubMed)
    requests that we limit requests to once every three seconds. I often
    need to access data for hundreds of records, so I've set up a cron job
    to cache the needed records to disk overnight. I was using shelve
    before, but the size of the file grew to about 500 MB and I started
    having issues with shelve performance. So, I just stopped using
    shelve and am storing each record separately on disk now.
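
    The overnight job described above can be sketched roughly as follows
    in Python 3. fill_cache, its parameters and the dict used as a cache
    are hypothetical; the fetch and sleep functions are passed in so the
    example runs without network access or real waiting.

```python
import time


def fill_cache(record_ids, cache, fetch, delay=3.0, sleep=time.sleep):
    """Fetch only records missing from `cache`, pausing between requests."""
    fetched = []
    for record_id in record_ids:
        if record_id in cache:
            continue  # already cached, no remote request needed
        if fetched:
            sleep(delay)  # wait between consecutive remote requests
        cache[record_id] = fetch(record_id)
        fetched.append(record_id)
    return fetched


# Demonstration with a fake fetch and a recorded (not real) sleep.
cache = {"1": "<record id='1'/>"}
pauses = []
fetched = fill_cache(
    ["1", "2", "3"],
    cache,
    fetch=lambda rid: "<record id='%s'/>" % rid,
    sleep=pauses.append,
)
print(fetched, pauses)
```

    With the default sleep, the loop waits three seconds between remote
    requests and skips anything already on disk or in the cache.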

    Thanks!
    Orest
     
    Orest Kozyar, Nov 1, 2007
    #5
  6. Paul Boddie

    Paul Boddie Guest

    On 1 Nov, 22:25, Orest Kozyar <> wrote:
    > > > I'm working on a CGI script that pulls XML data from a public database

    >
    > > Ah, I missed that bit on first read. Consider using something different than
    > > CGI here if you want to do caching. FCGI would allow you to do in-memory
    > > caching, for example, as would mod_python and a lot of other solutions.

    >
    > What I'm aiming for is sort of a "permanent" disk cache/mirror that
    > adds records as needed. The main issue is that the database (PubMed)
    > requests that we limit requests to once every three seconds.


    Drifting off-topic slightly, it is possible to download archives of
    PubMed, but you'll need a lot of disk space; you probably know this
    already. Putting the data into something rapidly searchable can demand
    even more space, depending on what you use and what you want to be
    able to search for. I guess it depends on whether you're retrieving
    relatively few documents across the whole of PubMed or whether your
    searches are concentrated in particular sections of the whole archive.

    Paul
     
    Paul Boddie, Nov 1, 2007
    #6