parsing nested unbounded XML fields with ElementTree

Larry.Martell

I have an XML file that has an element called "Node". These can be nested to any depth and the depth of the nesting is not known to me. I need to parse the file and preserve the nesting. For example, if the XML file had:

<Node Name="A">
<Node Name="B">
<Node Name="C">
<Node Name="D">
<Node Name="E">

When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:

nodes = []

def parseChild(c):
    if c.tag == 'Node':
        if 'Name' in c.attrib:
            nodes.append(c.attrib['Name'])
        for c1 in c:
            parseChild(c1)
    else:
        for node in nodes:
            print node,
        print c.tag

for parent in tree.getiterator():
    for child in parent:
        for x in child:
            parseChild(x)

My problem is that I don't know when I'm done with a node and I should remove a level of nesting. I would think this is a fairly common situation, but I could not find any examples of parsing a file like this. Perhaps I'm going about it completely wrong.
 
Chris Angelico

<Node Name="A">
<Node Name="B">
<Node Name="C">
<Node Name="D">
<Node Name="E">

First off, please clarify: Are there five corresponding </Node> tags
later on? If not, it's not XML, and nesting will have to be defined
some other way.
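A quick way to check this point is to hand the text to a parser. The following is a minimal sketch, assuming the standard library's xml.etree.ElementTree; the two sample strings and the helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: the first has no closing tags, the second does.
UNCLOSED = '<Node Name="A"><Node Name="B">'
CLOSED = '<Node Name="A"><Node Name="B"></Node></Node>'


def is_well_formed(text):
    """Return True if ElementTree can parse text as XML, else False."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(UNCLOSED))
print(is_well_formed(CLOSED))
```

If the first form is what the file really contains, ElementTree cannot load it at all, and the nesting would have to be recovered some other way.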

Secondly, please get off Google Groups. Your initial post is
malformed, and unless you specifically fight the software, your
replies will be even more malformed, to the point of being quite
annoying. There are many other ways to read a newsgroup, or you can
subscribe to the mailing list (e-mail address removed), which carries
the same content.

ChrisA
 
Stefan Behnel

(e-mail address removed), 25.11.2013 23:22:
When I'm parsing Node "E" I need to know I'm in A/B/C/D/E. Problem is I don't know how deep this can be. This is the code I have so far:

nodes = []

def parseChild(c):
    if c.tag == 'Node':
        if 'Name' in c.attrib:
            nodes.append(c.attrib['Name'])
        for c1 in c:
            parseChild(c1)
    else:
        for node in nodes:
            print node,
        print c.tag

for parent in tree.getiterator():
    for child in parent:
        for x in child:
            parseChild(x)

This seems hugely redundant. tree.getiterator() already returns a recursive
iterable, and then, for each node in your document, you are running
recursively over its entire subtree. That means you'll visit each node as
many times as its depth in the tree.
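To make the redundancy concrete, here is a minimal sketch that counts how often each element gets visited. It assumes a small made-up document (the Root wrapper and the names A, B, C are illustrative), uses iter(), which is the current spelling of getiterator(), and simplifies the loop from the question to two levels.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document, not the poster's real file.
SAMPLE = '<Root><Node Name="A"><Node Name="B"><Node Name="C"/></Node></Node></Root>'
root = ET.fromstring(SAMPLE)


def walk(elem, visits):
    """Recursively visit elem and all of its descendants."""
    visits[elem.get("Name", elem.tag)] += 1
    for child in elem:
        walk(child, visits)


# Pattern from the question (simplified): start a recursive walk from the
# children of every element that iter() yields.
repeated = Counter()
for parent in root.iter():
    for child in parent:
        walk(child, repeated)

# A single recursive walk from the root visits each element exactly once.
single = Counter()
walk(root, single)

print(dict(repeated))
print(dict(single))
```

In this sample, C sits three levels below the root and is visited three times by the combined loops, while the single walk sees it once.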

My problem is that I don't know when I'm done with a node and I should
remove a level of nesting. I would think this is a fairly common
situation, but I could not find any examples of parsing a file like
this. Perhaps I'm going about it completely wrong.

Your recursive traversal function tells you when you're done. If you drop
the getiterator() bit, reaching the end of parseChild() means that you're
done with the element and start backing up. So you can simply pass down a
list of element names that you append() at the beginning of the function
and pop() at the end, i.e. a stack. That list will then always give you the
current path from the root node.

Alternatively, if you want to use lxml.etree instead of ElementTree, you
can use its iterwalk() function, which gives you the same thing but
without recursion, as a plain iterator.

http://lxml.de/parsing.html#iterparse-and-iterwalk
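
For readers who want to stay with the standard library, a similar event-driven loop can be sketched with ElementTree.iterparse(). Note that this swaps in iterparse() (which reads from a file) for lxml's iterwalk() (which walks an already parsed tree); the sample document and the helper name node_paths are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document with properly closed Node elements.
SAMPLE = (b'<Root><Node Name="A"><Node Name="B"><Node Name="C"/></Node>'
          b'<Node Name="D"/></Node></Root>')


def node_paths(source):
    """Return the slash-joined path of every named Node, in document order."""
    path = []    # stack of Name values for the currently open Node elements
    paths = []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != "Node" or elem.get("Name") is None:
            continue
        if event == "start":
            # Attributes are already available when the start event fires.
            path.append(elem.get("Name"))
            paths.append("/".join(path))
        else:
            # The end event marks the point where a nesting level is left.
            path.pop()
    return paths


paths = node_paths(io.BytesIO(SAMPLE))
print(paths)
```

The "end" event answers the original question directly: it is the signal that a node is finished and one level of nesting can be removed.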

Stefan
 
Larry Martell

Thanks for the reply. How can I remove getiterator()? Then I won't be
traversing the nodes of the tree. I can't iterate over tree. I am also
unclear on where to do the pop(). I tried putting it just after the
recursive call to parseChild() and I tried putting as the very last
statement in parseChild() - neither one gave the desired result. Can
you show me in code what you mean?

Thanks!
-larry
 
Stefan Behnel

Larry Martell, 26.11.2013 13:23:
Thanks for the reply. How can I remove getiterator()? Then I won't be
traversing the nodes of the tree. I can't iterate over tree. I am also
unclear on where to do the pop(). I tried putting it just after the
recursive call to parseChild() and I tried putting as the very last
statement in parseChild() - neither one gave the desired result. Can
you show me in code what you mean?

untested:

nodes = []

def process_subtree(c, path):
    name = c.get('Name') if c.tag == 'Node' else None
    if name:
        path.append(name)
        nodes.append('/'.join(path))

    for c1 in c:
        process_subtree(c1, path)

    if name:
        path.pop()

process_subtree(tree.getroot(), [])
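
Since the nesting depth is unbounded, the recursive version could in principle run into Python's recursion limit on an extremely deep document. The following is a minimal sketch of the same idea with an explicit stack instead of recursion; the function name iter_node_paths and the sample document are assumptions for illustration, not part of the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not the poster's real file.
SAMPLE = ('<Root><Node Name="A"><Node Name="B"><Leaf/></Node>'
          '<Node Name="C"/></Node></Root>')


def iter_node_paths(root):
    """Yield (path, element) for every named Node, in document order."""
    stack = [(root, [])]    # pairs of (element, names of its Node ancestors)
    while stack:
        elem, parent_path = stack.pop()
        name = elem.get("Name") if elem.tag == "Node" else None
        # Build a new list per named Node so sibling paths stay independent.
        path = parent_path + [name] if name else parent_path
        if name:
            yield "/".join(path), elem
        # Push children in reverse so the first child is popped first.
        for child in reversed(list(elem)):
            stack.append((child, path))


paths = [p for p, _ in iter_node_paths(ET.fromstring(SAMPLE))]
print(paths)
```

Because each stack entry carries its own path, no explicit pop() of names is needed here; the cost is copying the path once per named Node.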


Stefan
 
Neil Cerutti

I'm also an ElementTree user, but it's fairly heavy-duty for simple
jobs. I use sax for those. In fact, I'm kind of a saxophone.
This is basically the same idea as others have posted.

the_xml = """<?xml version="1.0" encoding="ISO-8859-1"?>
<Node Name="A">
<Node Name="B">
<Node Name="C">
<Node Name="D">
<Node Name="E">
</Node></Node></Node></Node></Node>"""
import io
import sys
import xml.sax as sax


class NodeHandler(sax.handler.ContentHandler):
    def startDocument(self):
        self.title = ''
        self.names = []

    def startElement(self, name, attrs):
        # Process first, so the element's own name is not yet in the path.
        self.process(attrs['Name'])
        self.names.append(attrs['Name'])

    def process(self, name):
        print("Node {} Nest {}".format(name, '/'.join(self.names)))
        # Do your stuff.

    def endElement(self, name):
        self.names.pop()


print(sys.version_info)
handler = NodeHandler()
# sax.parse() returns None, so there is nothing useful to assign.
sax.parse(io.StringIO(the_xml), handler)

Output:
sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
Node A Nest
Node B Nest A
Node C Nest A/B
Node D Nest A/B/C
Node E Nest A/B/C/D
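
The handler above expects every element to carry a Name attribute. If the file also contains other elements, as the code in the original question suggests, one possible adaptation is to remember per element whether a name was pushed. This is a minimal sketch under that assumption; the class name PathHandler, its attributes, and the sample document are made up for illustration.

```python
import xml.sax

# Hypothetical sample document mixing Node elements with other tags.
SAMPLE = b'<Root><Node Name="A"><Item/><Node Name="B"><Item/></Node></Node></Root>'


class PathHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []     # Name values of the enclosing Node elements
        self.pushed = []    # one flag per open element: did it push a name?
        self.seen = []      # (path, tag) for each element that is not a named Node

    def startElement(self, name, attrs):
        node_name = attrs.get("Name") if name == "Node" else None
        if node_name is not None:
            self.names.append(node_name)
        else:
            self.seen.append(("/".join(self.names), name))
        self.pushed.append(node_name is not None)

    def endElement(self, name):
        # Only pop a name if this element pushed one when it started.
        if self.pushed.pop():
            self.names.pop()


handler = PathHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.seen)
```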
 
Larry Martell

Thanks! This was extremely helpful and I've used these concepts to
write a script that successfully parses my file.
 
