xml.parsers.expat loading xml into a dict and whitespace

Discussion in 'Python' started by kaens, May 23, 2007.

  1. kaens

    kaens Guest

    Hey everyone, this may be a stupid question, but I noticed the
    following and as I'm pretty new to using xml and python, I was
    wondering if I could get an explanation.

    Let's say I write a simple xml parser, for an xml file that just loads
    the content of each tag into a dict (the xml file doesn't have
    multiple hierarchies in it, it's flat other than the parent node)

    so we have
    <parent>
    <option1>foo</option1>
    <option2>bar</option2>
    . . .
    </parent>

    (I'm using xml.parsers.expat)
    the parser sets a flag that says it's in the parent, and sets the
    value of the current tag it's processing in the start tag handler.
    The character data handler sets a dictionary value like so:

    dictName[curTag] = data

    after I'm done processing the file, I print out the dict, and the first value is
    <a few bits of whitespace> : <a whole bunch of whitespace>

    There are comments in the xml file - is this what is causing this?
    There are also blank lines. . .but I don't see how a blank line would
    be interpreted as a tag. Comments though, I could see that happening.

    Actually, I just did a test on an xml file that had no comments or
    whitespace and got the same behaviour.

    If I feed it the following xml file:

    <options>
    <one>hey</one>
    <two>bee</two>
    <three>eff</three>
    </options>

    it prints out:
    " :

    three : eff
    two : bee
    one : hey"

    wtf.

    For reference, here's the handler functions:

    def handleCharacterData(self, data):
        if self.inOptions and self.curTag != "options":
            self.options[self.curTag] = data

    def handleStartElement(self, name, attributes):
        if name == "options":
            self.inOptions = True
        if self.inOptions:
            self.curTag = name


    def handleEndElement(self, name):
        if name == "options":
            self.inOptions = False
        self.curTag = ""

    Sorry if the whitespace in the code got mangled (fingers crossed...)
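
A likely explanation, offered as a sketch rather than a tested diagnosis of the original class (which is not shown in full): expat reports the newlines between elements as character data too, and by then the end-tag handler has reset curTag to "", so the whitespace ends up stored under an empty key. The version below assumes Python 3 and a flat document; the names OptionsParser, cur_tag, chunks and sample are made up for illustration. It collects text only while a child element is open and stores it when that element closes, which also copes with expat delivering one element's text in several pieces.

```python
import xml.parsers.expat


class OptionsParser(object):
    """Collects the text of each child of <options> into a dict."""

    def __init__(self):
        self.options = {}
        self.cur_tag = None   # name of the child element currently open
        self.chunks = []      # text pieces seen inside that element

    def start(self, name, attributes):
        if name != "options":
            self.cur_tag = name
            self.chunks = []

    def chars(self, data):
        # Whitespace between elements arrives here as well; ignore
        # anything that is not inside an open child element.
        if self.cur_tag is not None:
            self.chunks.append(data)

    def end(self, name):
        if name == self.cur_tag:
            self.options[name] = "".join(self.chunks)
        self.cur_tag = None

    def parse(self, text):
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = self.start
        parser.CharacterDataHandler = self.chars
        parser.EndElementHandler = self.end
        parser.Parse(text, True)
        return self.options


sample = """<options>
<one>hey</one>
<two>bee</two>
<three>eff</three>
</options>
"""

result = OptionsParser().parse(sample)
for key, value in result.items():
    print(key, ":", value)
```

With this arrangement the dict should contain only the three option names, with no empty key.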
    kaens, May 23, 2007
    #1

  2. kaens wrote:
    > Let's say I write a simple xml parser, for an xml file that just loads
    > the content of each tag into a dict (the xml file doesn't have
    > multiple hierarchies in it, it's flat other than the parent node)

    [snip]
    > <options>
    > <one>hey</one>
    > <two>bee</two>
    > <three>eff</three>
    > </options>
    >
    > it prints out:
    > " :
    >
    > three : eff
    > two : bee
    > one : hey"


    I don't have a good answer for your expat code, but if you're not
    married to that, I strongly suggest you look into ElementTree[1]::

    >>> xml = '''\
    ... <options>
    ... <one>hey</one>
    ... <two>bee</two>
    ... <three>eff</three>
    ... </options>
    ... '''
    >>> import xml.etree.cElementTree as etree
    >>> tree = etree.fromstring(xml)
    >>> d = {}
    >>> for child in tree:
    ...     d[child.tag] = child.text
    ...
    >>> d
    {'three': 'eff', 'two': 'bee', 'one': 'hey'}


    [1] ElementTree is in the 2.5 standard library, but if you're stuck with
    an earlier python, just Google for it -- there are standalone versions

    STeVe
    Steven Bethard, May 23, 2007
    #2

  3. kaens

    kaens Guest

    > [1] ElementTree is in the 2.5 standard library, but if you're stuck with
    > an earlier python, just Google for it -- there are standalone versions


    I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
    kaens, May 23, 2007
    #3
  4. kaens

    kaens Guest

    Now the code looks like this:

    import xml.etree.ElementTree as etree

    optionsXML = etree.parse("options.xml")
    options = {}

    for child in optionsXML.getiterator():
        if child.tag != optionsXML.getroot().tag:
            options[child.tag] = child.text

    for key, value in options.items():
        print key, ":", value

    freaking easy. Compare with making a generic xml parser class, and
    inheriting from it for doing different things with different xml
    files. This does exactly the right thing. I'm sure it's not perfect
    for all cases, and I'm sure there will be times when I want something
    closer to expat, but this is PERFECT for what I need to do right now.

    That settles it, I'm addicted to python now. I swear I had a little
    bit of a nerdgasm. This is orders of magnitude smaller than what I had
    before, way easier to read and way easier to maintain.

    Thanks again for the point in the right direction, Steve.

    On 5/23/07, kaens wrote:
    > > [1] ElementTree is in the 2.5 standard library, but if you're stuck with
    > > an earlier python, just Google for it -- there are standalone versions

    >
    > I've got 2.5, and I'm not attached to expat at all. I'll check it out, thanks.
    >
    kaens, May 23, 2007
    #4
  5. kaens wrote:
    > Now the code looks like this:
    >

    [snip ElementTree code]
    >
    > freaking easy. Compare with making a generic xml parser class, and
    > inheriting from it for doing different things with different xml
    > files. This does exactly the right thing. I'm sure it's not perfect
    > for all cases, and I'm sure there will be times when I want something
    > closer to expat, but this is PERFECT for what I need to do right now.
    >
    > That settles it, I'm addicted to python now. I swear I had a little
    > bit of a nerdgasm. This is orders of magnitude smaller than what I had
    > before, way easier to read and way easier to maintain.
    >
    > Thanks again for the point in the right direction, Steve.


    You're welcome. In return, you've helped me to augment my vocabulary
    with an important new word "nerdgasm". ;-)

    STeVe
    Steven Bethard, May 23, 2007
    #5
  6. kaens wrote:
    > Now the code looks like this:
    >
    > import xml.etree.ElementTree as etree
    >
    > optionsXML = etree.parse("options.xml")
    > options = {}
    >
    > for child in optionsXML.getiterator():
    >     if child.tag != optionsXML.getroot().tag:
    >         options[child.tag] = child.text
    >
    > for key, value in options.items():
    >     print key, ":", value


    Three things to add:

    Importing cElementTree instead of ElementTree should speed this up pretty
    heavily, but:

    Consider using iterparse():

    http://effbot.org/zone/element-iterparse.htm

    *untested*:

    from xml.etree import cElementTree as etree

    iterevents = etree.iterparse("options.xml")
    options = {}

    for event, child in iterevents:
        if child.tag != "parent":
            options[child.tag] = child.text

    for key, value in options.items():
        print key, ":", value
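
For anyone who wants to try the iterparse() idea without an options.xml on disk, here is a self-contained sketch. It assumes Python 3 and a flat document like the one in the original question; load_options and sample are hypothetical names, not part of ElementTree. Rather than hard-coding the root tag, it treats the first "start" event as the root and skips it.

```python
import io
import xml.etree.ElementTree as etree


def load_options(source):
    """Return {tag: text} for every element below the root."""
    options = {}
    root = None
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if root is None:
            # The first event is the "start" of the root element.
            root = elem
            continue
        # An element's text is only complete once its "end" event fires.
        if event == "end" and elem is not root:
            options[elem.tag] = elem.text
    return options


# In-memory stand-in for options.xml (an assumption for this example).
sample = io.BytesIO(
    b"<options>\n"
    b"<one>hey</one>\n"
    b"<two>bee</two>\n"
    b"<three>eff</three>\n"
    b"</options>\n"
)

options = load_options(sample)
for key, value in options.items():
    print(key, ":", value)
```

Passing a filename instead of the BytesIO object should work the same way, since iterparse() accepts either.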


    Note that this also works with lxml.etree. But lxml.objectify may
    actually be what you want:

    http://codespeak.net/lxml/dev/objectify.html

    *untested*:

    from lxml import etree, objectify

    # setup
    parser = etree.XMLParser(remove_blank_text=True)
    lookup = objectify.ObjectifyElementClassLookup()
    parser.set_element_class_lookup(lookup)

    # parse
    parent = etree.parse("options.xml", parser).getroot()

    # get to work
    option1 = parent.option1
    ...

    # or, if you prefer dictionaries:
    options = vars(parent)
    for key, value in options.items():
        print key, ":", value


    Have fun,

    Stefan
    Stefan Behnel, May 23, 2007
    #6