iterate over a series of nodes in an XML file

  • Thread starter Diez B. Roggisch

Diez B. Roggisch

Hi, I have an XML file which contains entries of the form:

<idlist>
<myID>1</myID>
<myID>2</myID>
....
<myID>10000</myID>
</idlist>

Currently, I have written a SAX based handler that will read in all the
<myID></myID> entries and return a list of the contents of these
entries. However this is not scalable and for my purposes it would be
better if I could iterate over the list of <myID> nodes. Something
like:

for myid in getMyIDList(document):
    print(myid)

I realize that I can do this with generators, but I can't see how I can
incorporate generators into my handler class (which is a subclass of
xml.sax.ContentHandler).

Any pointers would be appreciated

Use ElementTree. Or one of the other packages that implement its very
pythonic interface, lxml or cElementTree.

Otherwise, you don't have much chance of using SAX to create a generator,
short of reading the whole document into memory (which rather defeats the
purpose of SAX in the first place) or creating a separate thread that
communicates with an iterable over a queue.
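
A minimal sketch of the thread-and-queue idea, assuming Python 3 and the <myID> layout from the question. The names _QueueHandler, iter_my_ids and the maxsize parameter are made up for illustration; they are not part of xml.sax.

```python
import queue
import threading
import xml.sax


class _QueueHandler(xml.sax.ContentHandler):
    """Push the text of every <myID> element onto a queue."""

    def __init__(self, out_queue):
        super().__init__()
        self.out_queue = out_queue
        self.parts = None  # None while we are outside a <myID> element

    def startElement(self, name, attrs):
        if name == "myID":
            self.parts = []

    def characters(self, content):
        if self.parts is not None:
            self.parts.append(content)

    def endElement(self, name):
        if name == "myID":
            self.out_queue.put("".join(self.parts))
            self.parts = None


_DONE = object()  # sentinel that marks the end of the document


def iter_my_ids(source, maxsize=100):
    """Yield myID texts while a worker thread runs the SAX parser."""
    # Bounded queue: the parser blocks instead of running far ahead.
    q = queue.Queue(maxsize=maxsize)

    def worker():
        try:
            xml.sax.parse(source, _QueueHandler(q))
            q.put(_DONE)
        except Exception as exc:  # hand parse errors to the consumer
            q.put(exc)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

With this, "for myid in iter_my_ids('idlist.xml')" reads one ID at a time. If the caller stops iterating early, the worker stays blocked on the full queue, which is why the thread is marked as a daemon here.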

Alternatively, there are parsers out there that implement a PULL style of
parsing instead of the PUSH style that SAX uses. But before you start with
these, try ElementTree.
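
One pull-style parser ships with the standard library as xml.dom.pulldom. A rough sketch of how it could be wrapped in a generator, assuming Python 3 and the <myID> layout from the question (iter_my_ids_pull is a made-up name):

```python
from xml.dom import pulldom


def iter_my_ids_pull(source):
    """Yield the text of each <myID> element using the stdlib pull parser."""
    events = pulldom.parse(source)  # accepts a filename or a file object
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "myID":
            events.expandNode(node)  # read just this element's subtree
            yield "".join(
                child.data
                for child in node.childNodes
                if child.nodeType == child.TEXT_NODE
            )
```

The caller asks for events when it wants them, so no handler class or extra thread is needed.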

Diez
 
rajarshi.guha

Hi, I have an XML file which contains entries of the form:

<idlist>
<myID>1</myID>
<myID>2</myID>
.....
<myID>10000</myID>
</idlist>

Currently, I have written a SAX based handler that will read in all the
<myID></myID> entries and return a list of the contents of these
entries. However this is not scalable and for my purposes it would be
better if I could iterate over the list of <myID> nodes. Something
like:

for myid in getMyIDList(document):
    print(myid)

I realize that I can do this with generators, but I can't see how I can
incorporate generators into my handler class (which is a subclass of
xml.sax.ContentHandler).

Any pointers would be appreciated

Thanks,
Rajarshi
 
Stefan Behnel

I have an XML file which contains entries of the form:

<idlist>
<myID>1</myID>
<myID>2</myID>
....
<myID>10000</myID>
</idlist>

Currently, I have written a SAX based handler that will read in all the
<myID></myID> entries and return a list of the contents of these
entries. However this is not scalable and for my purposes it would be
better if I could iterate over the list of <myID> nodes. Something
like:

for myid in getMyIDList(document):
    print(myid)

You can try lxml 1.1.

http://cheeseshop.python.org/pypi/lxml/1.1alpha

Some documentation is here:
http://codespeak.net/svn/lxml/trunk/doc/api.txt

I haven't tested it, but you should be able to do this:

from lxml.etree import iterparse

last = None
for event, myid in iterparse(document_url, tag="myID"):
    print(myid.text)
    if last is not None:
        last.getparent().remove(last)
    last = myid

Internally, iterparse builds up a tree, so the last three lines are there to
remove the myid elements from the tree that were already handled. This saves a
lot of memory for large documents.

Stefan
 
rajarshi.guha

Thanks to everybody for the pointers. ElementTree is what I ended up
using, and my code looks like this (based on the ElementTree tutorial code):

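import xml.etree.ElementTree as ET  # bundled with Python 2.5 and later

def extractIds(filename):
    with open(filename, 'rb') as f:
        context = iter(ET.iterparse(f, events=('start', 'end')))
        event, root = next(context)  # first event is the start of the root element

        for event, elem in context:
            if event == 'end' and elem.tag == 'Id':
                yield elem.text
                root.clear()  # free the elements that were already handled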

As a result I can do:

for id in extractIds(someFileName):
    pass  # do something with id
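
For anyone who wants to try this without a file on disk, here is a small self-contained variant. It assumes Python 3, the myID tag from the original question, and a generated in-memory document; iter_ids is a made-up name. It also shows that the caller can stop early, for example with itertools.islice.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_ids(source, tag="myID"):
    """Yield the text of each matching element, clearing the tree as we go."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.text
            root.clear()  # drop children that were already handled


# Build a sample document with 10000 entries in memory.
body = b"".join(b"<myID>%d</myID>" % i for i in range(1, 10001))
sample = io.BytesIO(b"<idlist>" + body + b"</idlist>")

# Only ask for the first three IDs; the generator is not driven any further.
first_three = list(itertools.islice(iter_ids(sample), 3))
print(first_three)
```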
 
Steve M

I see you've had success with elementtree, but in case you are still
thinking about SAX, here is an approach that might interest you. The
idea is basically to turn your program inside-out by writing a
standalone function to process one myID node. This function has nothing
to do with SAX or parsing the XML tree. This function becomes a
callback that you pass to your SAX handler to call on each node.

import xml.sax

def myID_callback(data):
    """Process the text of one myID node - boil it, mash it, stick it
    in a list..."""
    print(data)

class MyHandler(xml.sax.ContentHandler):
    def __init__(self, myID_callback):
        xml.sax.ContentHandler.__init__(self)
        # a buffer to collect text data that may or may not be needed later
        self.current_text_data = []
        self.myID_callback = myID_callback

    def characters(self, data):
        """Accumulate characters. startElement("myID") resets it."""
        self.current_text_data.append(data)

    def startElement(self, name, attributes):
        if name == 'myID':
            self.current_text_data = []

    def endElement(self, name):
        if name == 'myID':
            data = "".join(self.current_text_data)
            self.myID_callback(data)

filename = 'idlist.xml'
xml.sax.parse(filename, MyHandler(myID_callback))
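
Since the callback is just a callable, it does not have to print anything. As a rough illustration, a bound method such as list.append can be passed directly. The sketch below assumes Python 3 and an in-memory document; CallbackHandler is a condensed stand-in for the handler above, not a new API.

```python
import xml.sax


class CallbackHandler(xml.sax.ContentHandler):
    """Condensed version of the handler above: call `callback` per myID."""

    def __init__(self, callback):
        super().__init__()
        self.callback = callback
        self.parts = []

    def startElement(self, name, attrs):
        if name == "myID":
            self.parts = []  # start a fresh buffer for this element

    def characters(self, content):
        self.parts.append(content)

    def endElement(self, name):
        if name == "myID":
            self.callback("".join(self.parts))


doc = b"<idlist><myID>1</myID><myID>2</myID><myID>3</myID></idlist>"
collected = []
# list.append serves as the callback, so each ID lands in `collected`.
xml.sax.parseString(doc, CallbackHandler(collected.append))
print(collected)
```

The parser still drives the loop, so this is not a generator, but the per-node logic stays out of the handler class.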
 
