When to clear elements using cElementTree

Ben Temperton · Oct 19, 2012

Hi there, I am parsing some huge XML files (1.8 GB) that look like this:
<scan num='1'>
<peaks>some data</peaks>
<scan num='2'>
<peaks>some data</peaks>
</scan>
<scan num='3'>
<peaks>some data</peaks>
</scan>
</scan>

What I am trying to do is build up a dictionary of lists where the key is the parent scan num and the members of the list are the child scan nums.
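
For the sample above, the target structure would be {1: [2, 3]}. A minimal sketch of building that shape, assuming the parent/child pairs are already known (the `pairs` list is made-up illustration data, not parser output, and the variable names are hypothetical):

```python
from collections import defaultdict

# Hypothetical (parent, child) scan numbers matching the sample XML above.
pairs = [(1, 2), (1, 3)]

# defaultdict(list) creates the empty list on first access, so no
# try/except KeyError is needed before appending.
scans = defaultdict(list)
for parent_id, child_id in pairs:
    scans[parent_id].append(child_id)

print(dict(scans))  # {1: [2, 3]}
```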

I have created an iterator:

for event, elem in cElementTree.iterparse(filename):
    if elem.tag == self.XML_SPACE + "scan":
        parentId = int(elem.get('num'))
        for child in elem.findall(self.XML_SPACE + 'scan'):
            try:
                indexes = scans[parentId]
            except KeyError:
                indexes = []
                scans[parentId] = indexes
            childId = int(child.get('num'))
            indexes.append(childId)
            # choice 1 - child.clear()
        # choice 2 - elem.clear()
    # choice 3 - elem.clear()

If I don't use any of the clear functions, the method works fine, but is very slow (presumably because nothing is getting cleared from memory). But if I implement any of the clear functions shown, then

childId = int(child.get('num'))

fails because child.get('num') returns None. If you dump the child element using cElementTree.dump(child), all of the attributes on the child scans are missing, even though the clear() calls are made after the assignment of the childId.

What I don't understand is why the assignment fails when clear() is called, given that the calls are made after the assignment, yet succeeds when clear() is not called.
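
The behaviour described here is consistent with the order in which iterparse reports elements: by default it yields only "end" events, so an inner scan is reported before the scan that encloses it. If every scan is cleared when it is reported, the child scans have already lost their attributes by the time the parent's "end" event arrives. The sketch below reproduces this on the sample above. It is an illustration under assumptions: it uses xml.etree.ElementTree from Python 3 in place of cElementTree, ignores the namespace prefix, and the names `SAMPLE`, `order` and `seen_from_parent` are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<scan num='1'>
<peaks>some data</peaks>
<scan num='2'>
<peaks>some data</peaks>
</scan>
<scan num='3'>
<peaks>some data</peaks>
</scan>
</scan>"""

# Order in which <scan> elements are reported with the default "end" events.
order = []
# The 'num' attribute of each child, as seen from its parent's "end" event.
seen_from_parent = []

for event, elem in ET.iterparse(io.BytesIO(SAMPLE)):
    if elem.tag == "scan":
        order.append(elem.get("num"))
        for child in elem.findall("scan"):
            seen_from_parent.append(child.get("num"))
        # clear() wipes this element's attributes, text and children.
        # An inner scan is therefore emptied before its parent is reported.
        elem.clear()

print(order)             # ['2', '3', '1']
print(seen_from_parent)  # [None, None]
```

The cleared child elements stay attached to their parent, which is why findall still returns them, only without attributes.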

When should I be calling clear() in this case to maximize speed?

Many thanks,

Ben

Ben Temperton · Oct 19, 2012

I managed to solve this using the following method:

"""Returns a dictionary of indexes of spectra for which there are secondary scans, along with the indexes of those scans
"""
scans = dict()

# get an iterable that reports only "end" events
context = cElementTree.iterparse(self.info['filename'], events=("end",))

# turn it into an iterator
context = iter(context)

# take the first reported element; with only "end" events this is the
# first element to close, which is not necessarily the document root
event, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag == self.XML_SPACE + "scan":
        parentId = int(elem.get('num'))
        for child in elem.findall(self.XML_SPACE + 'scan'):
            childId = int(child.get('num'))
            try:
                indexes = scans[parentId]
            except KeyError:
                indexes = []
                scans[parentId] = indexes
            indexes.append(childId)
            # the child's number has been recorded, so its contents can be freed
            child.clear()
        root.clear()
return scans

I think the trick is using the 'end' event to determine how much data your iterparse is taking in, but I'm still not quite clear on whether this is the best way to do it.
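
One possible alternative, offered as a sketch and not as the definitive way to do this, is to request both "start" and "end" events and track the currently open scans on a stack. The parent/child relation is then recorded when a scan opens, so each scan can be cleared as soon as it closes without anything needing to read it again. The function name `child_scans`, the `tag` parameter and the `open_scans` list are hypothetical; the sketch assumes xml.etree.ElementTree on Python 3 and that every scan carries an integer 'num' attribute. A namespaced tag such as self.XML_SPACE + "scan" could be passed as `tag`.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_scans(source, tag="scan"):
    """Map each parent scan number to the numbers of its direct child scans."""
    scans = defaultdict(list)
    open_scans = []  # numbers of the <scan> elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != tag:
            continue
        if event == "start":
            # Attributes are available at "start"; text and children are not yet.
            num = int(elem.get("num"))
            if open_scans:
                scans[open_scans[-1]].append(num)
            open_scans.append(num)
        else:
            open_scans.pop()
            # Nothing reads this element again, so its contents can be dropped.
            elem.clear()
    return dict(scans)


sample = (b"<scan num='1'><peaks>some data</peaks>"
          b"<scan num='2'><peaks>some data</peaks></scan>"
          b"<scan num='3'><peaks>some data</peaks></scan></scan>")
print(child_scans(io.BytesIO(sample)))  # {1: [2, 3]}
```

Cleared elements remain attached to their parents as empty shells, so memory use can still grow slowly with the number of scans; whether that matters for a 1.8 GB file would need to be measured.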
