xml.parsers.expat loading xml into a dict and whitespace

kaens

Hey everyone, this may be a stupid question, but I noticed the
following and as I'm pretty new to using xml and python, I was
wondering if I could get an explanation.

Let's say I write a simple xml parser, for an xml file that just loads
the content of each tag into a dict (the xml file doesn't have
multiple hierarchies in it, it's flat other than the parent node)

so we have
<parent>
<option1>foo</option1>
<option2>bar</option2>
. . .
</parent>

(I'm using xml.parsers.expat)
the parser sets a flag that says it's in the parent, and sets the
value of the current tag it's processing in the start tag handler.
The character data handler sets a dictionary value like so:

dictName[curTag] = data

after I'm done processing the file, I print out the dict, and the first value is
<a few bits of whitespace> : <a whole bunch of whitespace>

There are comments in the xml file - is this what is causing this?
There are also blank lines. . .but I don't see how a blank line would
be interpreted as a tag. Comments though, I could see that happening.

Actually, I just did a test on an xml file that had no comments or
whitespace and got the same behaviour.

If I feed it the following xml file:

<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>

it prints out:
" :

three : eff
two : bee
one : hey"

wtf.

For reference, here's the handler functions:

def handleCharacterData(self, data):
    if self.inOptions and self.curTag != "options":
        self.options[self.curTag] = data

def handleStartElement(self, name, attributes):
    if name == "options":
        self.inOptions = True
    if self.inOptions:
        self.curTag = name


def handleEndElement(self, name):
    if name == "options":
        self.inOptions = False
    self.curTag = ""

Sorry if the whitespace in the code got mangled (fingers crossed...)
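
A likely explanation for the odd first entry, going only by the handlers above: expat also reports the newlines between tags as character data. By then handleEndElement has reset curTag to "", so that whitespace gets stored under the empty key. Below is a minimal sketch of one way around it, written for current Python 3 rather than 2.5. The class name OptionsParser, the snake_case method names, and the chunks buffer are assumptions made for illustration, not part of the original code.

```python
import xml.parsers.expat


class OptionsParser:
    """Collects the text of each child of <options> into a dict."""

    def __init__(self):
        self.options = {}
        self.in_options = False
        self.cur_tag = ""   # empty string means "not inside a child tag"
        self.chunks = []    # expat may deliver one text node in several pieces
        self.parser = xml.parsers.expat.ParserCreate()
        self.parser.StartElementHandler = self.handle_start
        self.parser.EndElementHandler = self.handle_end
        self.parser.CharacterDataHandler = self.handle_data

    def handle_start(self, name, attributes):
        if name == "options":
            self.in_options = True
        elif self.in_options:
            self.cur_tag = name
            self.chunks = []

    def handle_data(self, data):
        # Ignore text that arrives between tags (cur_tag is empty there).
        if self.cur_tag:
            self.chunks.append(data)

    def handle_end(self, name):
        if name == "options":
            self.in_options = False
        elif self.in_options and name == self.cur_tag:
            self.options[name] = "".join(self.chunks).strip()
        self.cur_tag = ""

    def parse(self, text):
        self.parser.Parse(text, True)
        return self.options


SAMPLE = """<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>"""

options = OptionsParser().parse(SAMPLE)
for key, value in options.items():
    print(key, ":", value)
```

Buffering in the character data handler and storing the value in the end handler also covers the case where expat splits one element's text across several callbacks.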
 
Steven Bethard

kaens said:
Let's say I write a simple xml parser, for an xml file that just loads
the content of each tag into a dict (the xml file doesn't have
multiple hierarchies in it, it's flat other than the parent node) [snip]
<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>

it prints out:
" :

three : eff
two : bee
one : hey"

I don't have a good answer for your expat code, but if you're not
married to that, I strongly suggest you look into ElementTree[1]::
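    >>> from xml.etree import ElementTree as etree
    >>> xml = '''\
    ... <options>
    ... <one>hey</one>
    ... <two>bee</two>
    ... <three>eff</three>
    ... </options>
    ... '''
    >>> d = {}
    >>> for child in etree.fromstring(xml):
    ...     d[child.tag] = child.text
    ...
    >>> d
    {'three': 'eff', 'two': 'bee', 'one': 'hey'}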


[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions

STeVe
 
kaens

[1] ElementTree is in the 2.5 standard library, but if you're stuck with
an earlier python, just Google for it -- there are standalone versions

I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
 
kaens

Now the code looks like this:

import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

for child in optionsXML.getiterator():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

freaking easy. Compare with making a generic xml parser class, and
inheriting from it for doing different things with different xml
files. This does exactly the right thing. I'm sure it's not perfect
for all cases, and I'm sure there will be times when I want something
closer to expat, but this is PERFECT for what I need to do right now.

That settles it, I'm addicted to python now. I swear I had a little
bit of a nerdgasm. This is orders of magnitude smaller than what I had
before, way easier to read and way easier to maintain.

Thanks again for the point in the right direction, Steve.
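
For anyone reading this on a newer interpreter: the loop above relies on getiterator(), which was removed in Python 3.9, and on the Python 2 print statement. A rough present-day equivalent might look like the sketch below. It assumes the same flat <options> layout and parses an inline SAMPLE string (an illustrative stand-in for options.xml) so that it runs on its own.

```python
import xml.etree.ElementTree as etree

# Inline stand-in for options.xml; with a real file, use
# etree.parse("options.xml").getroot() instead of fromstring().
SAMPLE = """<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>"""

root = etree.fromstring(SAMPLE)

# Iterating over an element yields its direct children only,
# so the root tag never needs to be filtered out.
options = {child.tag: child.text for child in root}

for key, value in options.items():
    print(key, ":", value)
```

The whitespace between tags ends up in each child's tail attribute rather than in its text, which is why no blank entry appears here.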

 
Steven Bethard

kaens said:
Now the code looks like this:
[snip ElementTree code]

freaking easy. Compare with making a generic xml parser class, and
inheriting from it for doing different things with different xml
files. This does exactly the right thing. I'm sure it's not perfect
for all cases, and I'm sure there will be times when I want something
closer to expat, but this is PERFECT for what I need to do right now.

That settles it, I'm addicted to python now. I swear I had a little
bit of a nerdgasm. This is orders of magnitude smaller than what I had
before, way easier to read and way easier to maintain.

Thanks again for the point in the right direction, Steve.

You're welcome. In return, you've helped me to augment my vocabulary
with an important new word "nerdgasm". ;-)

STeVe
 
Stefan Behnel

kaens said:
Now the code looks like this:

import xml.etree.ElementTree as etree

optionsXML = etree.parse("options.xml")
options = {}

for child in optionsXML.getiterator():
    if child.tag != optionsXML.getroot().tag:
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value

Three things to add:

Importing cElementTree instead of ElementTree should speed this up pretty
heavily, but:

Consider using iterparse():

http://effbot.org/zone/element-iterparse.htm

*untested*:

from xml.etree import cElementTree as etree

iterevents = etree.iterparse("options.xml")
options = {}

for event, child in iterevents:
    if child.tag != "parent":
        options[child.tag] = child.text

for key, value in options.items():
    print key, ":", value
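
As a hedged follow-up to the iterparse() snippet above: it hardcodes the root tag as "parent", while the sample file posted earlier uses "options". On Python 3.9 and later the cElementTree module is also gone, and plain ElementTree is used in its place. The sketch below picks up the root tag from the first "start" event instead of hardcoding it. SAMPLE and root_tag are illustrative names, and the data is read from an in-memory buffer rather than options.xml.

```python
import io
import xml.etree.ElementTree as etree

# In-memory stand-in for options.xml so the sketch runs on its own.
SAMPLE = b"""<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>"""

options = {}
root_tag = None

for event, elem in etree.iterparse(io.BytesIO(SAMPLE), events=("start", "end")):
    if event == "start":
        # The first element to start is the root; remember its tag.
        if root_tag is None:
            root_tag = elem.tag
    elif elem.tag != root_tag:
        # On "end" the element is complete, so its text is available.
        options[elem.tag] = elem.text
        elem.clear()  # release the element's content once it has been read

for key, value in options.items():
    print(key, ":", value)
```

For a file this small the incremental approach is unlikely to matter much; it mainly pays off when the document is too large to hold comfortably in memory.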


Note that this also works with lxml.etree. But using lxml.objectify is maybe
actually what you want:

http://codespeak.net/lxml/dev/objectify.html

*untested*:

from lxml import etree, objectify

# setup
parser = etree.XMLParser(remove_blank_text=True)
lookup = objectify.ObjectifyElementClassLookup()
parser.setElementClassLookup(lookup)

# parse
# etree.parse() returns a tree object; getroot() gives the root element
parent = etree.parse("options.xml", parser).getroot()

# get to work
option1 = parent.option1
...

# or, if you prefer dictionaries:
options = vars(parent)
for key, value in options.items():
    print key, ":", value


Have fun,

Stefan
 
