parsing nested unbounded XML fields with ElementTree

Larry.Martell · Nov 25, 2013

I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:

<Node Name="A">
  <Node Name="B">
    <Node Name="C">
      <Node Name="D">
        <Node Name="E">

When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:

nodes = []

def parseChild(c):
    if c.tag == 'Node':
        if 'Name' in c.attrib:
            nodes.append(c.attrib['Name'])
        for c1 in c:
            parseChild(c1)
    else:
        for node in nodes:
            print node,
        print c.tag

for parent in tree.getiterator():
    for child in parent:
        for x in child:
            parseChild(x)

My problem is that I don't know when I'm done with a node and I should remove a level of nesting. I would think this is a fairly common situation, but I could not find any examples of parsing a file like this. Perhaps I'm going about it completely wrong.

Chris Angelico · Nov 25, 2013

First off, please clarify: Are there five corresponding </Node> tags
later on? If not, it's not XML, and nesting will have to be defined
some other way.

Secondly, please get off Google Groups. Your initial post is
malformed, and unless you specifically fight the software, your
replies will be even more malformed, to the point of being quite
annoying. There are many other ways to read a newsgroup, or you can
subscribe to the mailing list (e-mail address removed), which carries
the same content.

ChrisA

Stefan Behnel · Nov 26, 2013

This seems hugely redundant. tree.getiterator() already returns a recursive
iterable, and then, for each node in your document, you are running
recursively over its entire subtree. Meaning that you'll visit each node as
many times as its depth in the tree.
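
The redundancy can be made concrete with a small counting sketch. It assumes Python 3, a five-level sample document with the closing tags filled in, and iter() in place of getiterator() (which was removed in Python 3.9). The names parse_child and visits are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed sample: the five nested nodes from the question, properly closed.
root = ET.fromstring(
    '<Node Name="A"><Node Name="B"><Node Name="C">'
    '<Node Name="D"><Node Name="E"/></Node></Node></Node></Node>'
)

visits = Counter()

def parse_child(c):
    # Count every time the recursive function reaches this element.
    visits[c.get("Name")] += 1
    for c1 in c:
        parse_child(c1)

# Same loop shape as the original: a recursive iterator on the outside,
# and a recursive function started again at every grandchild.
for parent in root.iter():
    for child in parent:
        for x in child:
            parse_child(x)

print(dict(visits))
```

With this sample the deeper nodes are counted more than once, and the count grows with depth, while A and B are never reached by parse_child at all.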

Your recursive traversal function tells you when you're done. If you drop
the getiterator() bit, reaching the end of parseChild() means that you're
done with the element and start backing up. So you can simply pass down a
list of element names that you append() at the beginning of the function
and pop() at the end, i.e. a stack. That list will then always give you the
current path from the root node.

Alternatively, if you want to use lxml.etree instead of ElementTree, you
can use its iterwalk() function, which gives you the same thing but
without recursion, as a plain iterator.

http://lxml.de/parsing.html#iterparse-and-iterwalk

Stefan
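
For readers who want to stay with the standard library, a similar non-recursive approach can be sketched with ElementTree's iterparse() and its "start"/"end" events. This is a substitute for, not an example of, lxml's iterwalk(): iterparse() works while the file is being parsed rather than on an already built tree. The sample document and the name node_paths are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Assumed sample document with properly closed tags.
SAMPLE_XML = """<Node Name="A">
  <Node Name="B">
    <Node Name="C">
      <Node Name="D">
        <Node Name="E"/>
      </Node>
    </Node>
  </Node>
</Node>"""

def node_paths(source):
    """Return the slash-joined path of every named Node, in document order."""
    path = []    # stack of the names of the currently open Node elements
    paths = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != "Node" or elem.get("Name") is None:
            continue
        if event == "start":
            # Opening tag seen: push the name and record the current path.
            path.append(elem.get("Name"))
            paths.append("/".join(path))
        else:
            # Closing tag seen: this element is done, drop one level.
            path.pop()
    return paths

paths = node_paths(io.BytesIO(SAMPLE_XML.encode("utf-8")))
print(paths)
```

The "end" event answers the original question of when a node is finished: that is the point at which one level of nesting is removed.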

Larry Martell · Nov 26, 2013

Thanks for the reply. How can I remove getiterator()? Then I won't be
traversing the nodes of the tree. I can't iterate over tree. I am also
unclear on where to do the pop(). I tried putting it just after the
recursive call to parseChild() and I tried putting as the very last
statement in parseChild() - neither one gave the desired result. Can
you show me in code what you mean?

Thanks!
-larry

Stefan Behnel · Nov 26, 2013

untested:

nodes = []

def process_subtree(c, path):
    name = c.get('Name') if c.tag == 'Node' else None
    if name:
        path.append(name)
        nodes.append('/'.join(path))

    for c1 in c:
        process_subtree(c1, path)

    if name:
        path.pop()

process_subtree(tree.getroot(), [])

Stefan
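
Since the snippet above is marked untested, here is a self-contained sketch of the same append/pop idea that can be run as is. It assumes Python 3 and a small made-up document with a sibling node, so that the effect of pop() is visible; collect_paths and the sample XML are illustrative names, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Assumed sample: B and D are siblings under A, and C is nested in B.
SAMPLE_XML = """<Node Name="A">
  <Node Name="B">
    <Node Name="C"/>
  </Node>
  <Node Name="D"/>
</Node>"""

def collect_paths(elem, path, found):
    """Depth-first walk that records the path of every named Node."""
    name = elem.get("Name") if elem.tag == "Node" else None
    if name:
        path.append(name)              # entering this node
        found.append("/".join(path))

    for child in elem:
        collect_paths(child, path, found)

    if name:
        path.pop()                     # all children handled, leave this node
    return found

root = ET.fromstring(SAMPLE_XML)
all_paths = collect_paths(root, [], [])
print(all_paths)
```

Because B is popped before D is visited, D is reported as A/D rather than A/B/D. The pop() sits after the loop over the children, not inside it.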

Neil Cerutti · Nov 26, 2013

I'm also an ElementTree user, but it's fairly heavy-duty for simple
jobs. I use sax for those. In fact, I'm kind of a saxophone.
This is basically the same idea as others have posted.

the_xml = """<?xml version="1.0" encoding="ISO-8859-1"?>
<Node Name="A">
  <Node Name="B">
    <Node Name="C">
      <Node Name="D">
        <Node Name="E">
</Node></Node></Node></Node></Node>"""
import io
import sys
import xml.sax as sax

class NodeHandler(sax.handler.ContentHandler):
    def startDocument(self):
        self.title = ''
        self.names = []  # stack of the names of the currently open elements

    def startElement(self, name, attrs):
        # Process first, so self.names holds only the ancestors at that point.
        self.process(attrs['Name'])
        self.names.append(attrs['Name'])

    def process(self, name):
        print("Node {} Nest {}".format(name, '/'.join(self.names)))
        # Do your stuff.

    def endElement(self, name):
        # The element is finished: remove one level of nesting.
        self.names.pop()

print(sys.version_info)
handler = NodeHandler()
sax.parse(io.StringIO(the_xml), handler)

Output:
sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
Node A Nest
Node B Nest A
Node C Nest A/B
Node D Nest A/B/C
Node E Nest A/B/C/D

Larry Martell · Nov 27, 2013

Thanks! This was extremely helpful and I've used these concepts to
write a script that successfully parses my file.
