When to clear elements using cElementTree

Discussion in 'Python' started by Ben Temperton, Oct 19, 2012.

  1. Hi there, I am parsing some huge XML files (1.8 GB) that look like this:
    <scan num='1'>
      <peaks>some data</peaks>
      <scan num='2'>
        <peaks>some data</peaks>
      </scan>
      <scan num='3'>
        <peaks>some data</peaks>
      </scan>
    </scan>

    What I am trying to do is build up a dictionary of lists where the key is the parent scan num and the members of the list are the child scan nums.

    I have created an iterator:

    for event, elem in cElementTree.iterparse(filename):
        if elem.tag == self.XML_SPACE + "scan":
            parentId = int(elem.get('num'))
            for child in elem.findall(self.XML_SPACE + 'scan'):
                try:
                    indexes = scans[parentId]
                except KeyError:
                    indexes = []
                    scans[parentId] = indexes
                childId = int(child.get('num'))
                indexes.append(childId)
                # choice 1 - child.clear()
            # choice 2 - elem.clear()
        # choice 3 - elem.clear()

    If I don't use any of the clear functions, the method works fine but is very slow (presumably because nothing is getting cleared from memory). But if I add any of the clear() calls shown, then

    childId = int(child.get('num'))

    fails because child.get('num') returns None. If you dump the child element using cElementTree.dump(child), all of the attributes on the child scans are missing, even though the clear() calls are made after the assignment of the childId.

    What I don't understand is why the assignment fails when the clear() calls come after it, yet succeeds when clear() is not called.
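A small sketch that seems to reproduce the symptom on the sample above. It assumes xml.etree.ElementTree as a stand-in for cElementTree (which is no longer available in recent Python versions) and drops the namespace prefix; the helper name is made up for illustration. By default iterparse only reports "end" events, so each nested scan is yielded, and can be cleared, in an earlier loop iteration than its parent. By the time the parent scan arrives, its children have already lost their attributes.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stands in for cElementTree

XML = b"""<scan num='1'>
  <peaks>some data</peaks>
  <scan num='2'><peaks>some data</peaks></scan>
  <scan num='3'><peaks>some data</peaks></scan>
</scan>"""


def child_nums_seen_by_parent(clear_each_scan):
    """Collect the child 'num' values visible when each parent scan ends.

    Hypothetical helper, not part of the original code.
    """
    seen = []
    # Default events: only "end", so inner scans are yielded before outer ones.
    for event, elem in ET.iterparse(io.BytesIO(XML)):
        if elem.tag != "scan":
            continue
        for child in elem.findall("scan"):
            seen.append(child.get("num"))
        if clear_each_scan:
            # Also wipes this element's attributes, which its parent
            # will want to read in a later iteration.
            elem.clear()
    return seen


without_clear = child_nums_seen_by_parent(False)
with_clear = child_nums_seen_by_parent(True)
print(without_clear)
print(with_clear)
```

With clearing enabled, every child.get('num') comes back as None, which is what makes int(...) fail.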

    When should I be calling clear() in this case to maximize speed?

    Many thanks,

    Ben
    Ben Temperton, Oct 19, 2012
    #1

  2. I managed to solve this using the following method:

    """Returns a dictionary of indexes of spectra for which there are secondary scans, along with the indexes of those scans
    """
    scans = dict()

    # get an iterable
    context = cElementTree.iterparse(self.info['filename'], events=("end",))

    # turn it into an iterator
    context = iter(context)

    # take the first (event, element) pair; with only "end" events requested,
    # this is the first element to finish parsing, which need not be the document root
    event, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == self.XML_SPACE + "scan":
            parentId = int(elem.get('num'))
            for child in elem.findall(self.XML_SPACE + 'scan'):
                # read the attribute before the child is cleared
                childId = int(child.get('num'))
                try:
                    indexes = scans[parentId]
                except KeyError:
                    indexes = []
                    scans[parentId] = indexes
                indexes.append(childId)
                child.clear()
            root.clear()
    return scans

    I think the trick is using the 'end' event to determine how much data your iterparse is taking in, but I'm still not quite clear on whether this is the best way to do it.
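One possible alternative, as a rough sketch rather than a tested solution for the real files: ask iterparse for both "start" and "end" events, record each scan's num when its start tag is seen (attributes are already available at that point), and clear the element when it ends. The function name, the scan_tag parameter, and the use of xml.etree.ElementTree instead of cElementTree are assumptions for illustration; with a namespace, scan_tag would be something like self.XML_SPACE + "scan".

```python
import io
from collections import defaultdict
import xml.etree.ElementTree as ET  # assumption: stands in for cElementTree

SAMPLE = b"""<scan num='1'>
  <peaks>some data</peaks>
  <scan num='2'><peaks>some data</peaks></scan>
  <scan num='3'><peaks>some data</peaks></scan>
</scan>"""


def child_scans(source, scan_tag="scan"):
    """Map each parent scan num to the nums of its direct child scans.

    Hypothetical helper: 'source' is a filename or a binary file object.
    """
    scans = defaultdict(list)
    open_scans = []  # nums of scans whose end tag has not been seen yet
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != scan_tag:
            continue
        if event == "start":
            num = int(elem.get("num"))
            if open_scans:
                # The innermost open scan is this scan's direct parent.
                scans[open_scans[-1]].append(num)
            open_scans.append(num)
        else:
            open_scans.pop()
            # Safe here: the num was recorded at the start event, so
            # dropping attributes, text and children loses nothing we need.
            elem.clear()
    return dict(scans)


result = child_scans(io.BytesIO(SAMPLE))
print(result)
```

Cleared elements remain attached to their parent as empty shells until the parent itself is cleared, so memory use should stay small but may still grow slowly on very large files.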
    Ben Temperton, Oct 19, 2012
    #2