"Full" element tag listing possible with Elementtree?

jaime.dyson · Sep 5, 2008

Hello all,

I have the unenviable task of turning about 20K strangely formatted
XML documents from different sources into something resembling a
clean, standard, uniform format. I like ElementTree and have been
using it to step through the documents to get a feel for their
structure. .getiterator() gives me a depth-first traversal, but it
flattens the hierarchy of the elements. What I'd like is to
traverse the elements while keeping track of ancestors, and print out
the full chain of ancestor tags as I arrive at each
node. So, for example, if I had a document that looked like this:

<a>
 this is node b 
<c> this is node c
<d />
<e> this is node e </e>
</c>
<f> this is node f </f>
</a>

I would want to print the following:

<a>
<a> 
<a> text: this is node b
<a> <c>
<a> <c> text: this is node c
<a> <c> <d>
<a> <c> <e>
<a> <c> <e> text: this is node e
<a> <f>
<a> <f> text: this is node f

Is there a simple way to do this? Any help would be appreciated.
Thanks..

Fredrik Lundh · Sep 5, 2008

in stock ET, using a parent map is probably the easiest way to do this:

http://effbot.org/zone/element.htm#accessing-parents

that is, for a given ET structure "tree", you can do

    parent_map = dict((c, p) for p in tree.getiterator() for c in p)

    def get_parents(elem):
        parents = []
        while 1:
            elem = parent_map.get(elem)
            if elem is None:
                break
            parents.append(elem)
        return reversed(parents)

    for elem in tree.getiterator():
        print list(get_parents(elem)), elem

</F>
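
For readers on Python 3, where the print statement is gone and getiterator() has been removed in favour of iter(), the parent-map idea can be sketched roughly as below. This is an adaptation, not Fredrik's code: the helper names build_parent_map, ancestor_tags and tag_paths, and the SAMPLE string, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document modelled on the one in the question (whitespace trimmed).
SAMPLE = (
    "<a> this is node b <c> this is node c <d />"
    "<e> this is node e </e></c><f> this is node f </f></a>"
)

def build_parent_map(root):
    # Map every child element to its parent; the root gets no entry.
    return {child: parent for parent in root.iter() for child in parent}

def ancestor_tags(elem, parent_map):
    # Walk upwards from elem, then reverse so the root comes first.
    tags = []
    parent = parent_map.get(elem)
    while parent is not None:
        tags.append(parent.tag)
        parent = parent_map.get(parent)
    tags.reverse()
    return tags

def tag_paths(root):
    # One "<a> <c> <e>"-style string per element, in document order.
    parent_map = build_parent_map(root)
    return [
        " ".join("<%s>" % tag for tag in ancestor_tags(elem, parent_map) + [elem.tag])
        for elem in root.iter()
    ]

for line in tag_paths(ET.fromstring(SAMPLE)):
    print(line)
```

The map is built once per tree, so each ancestor lookup costs only as much as the depth of the element.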

Stefan Behnel · Sep 5, 2008

Try lxml.etree. It's an extended re-implementation of ElementTree based on
libxml2. Amongst tons of other features, it provides its Elements with a
getparent() method and allows you to iterate over their ancestors (and other
XPath axes), or to iterate over a parsed document in an iterparse-like fashion
(called iterwalk).

http://codespeak.net/lxml/

Stefan
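
A minimal sketch of the lxml route, assuming lxml is installed. It uses getparent() and iterancestors(), which lxml elements provide; the function name lxml_tag_paths and the SAMPLE string are invented for this example.

```python
from lxml import etree

# Sample document modelled on the one in the question (whitespace trimmed).
SAMPLE = (
    "<a> this is node b <c> this is node c <d/>"
    "<e> this is node e </e></c><f> this is node f </f></a>"
)

def lxml_tag_paths(root):
    paths = []
    # tag=etree.Element restricts the walk to elements (no comments or PIs).
    for elem in root.iter(tag=etree.Element):
        # iterancestors() yields the parent first and the root last,
        # so reverse it to get a root-first chain.
        chain = [ancestor.tag for ancestor in elem.iterancestors()]
        chain.reverse()
        chain.append(elem.tag)
        paths.append(" ".join("<%s>" % tag for tag in chain))
    return paths

root = etree.fromstring(SAMPLE)
for line in lxml_tag_paths(root):
    print(line)

# getparent() returns None for the root element.
print(root.getparent() is None)
```

No separate parent map is needed here, because each lxml element keeps a link to its parent.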

jaime.dyson · Sep 5, 2008

Fantastic. Thank you very much, Fredrik! And thanks for ET!

Carl Banks · Sep 5, 2008

Fredrik Lundh wrote ElementTree, so he'd know the best solution, but
I'd like to point out that this is also trivially easy with recursion:

    def print_nodes(element, ancestors = []):
        s = hierarchy = ancestors + ["<" + element.tag + ">"]
        if element.text is not None:
            s = s + [element.text]
        print " ".join(s)
        for subelement in element:
            print_nodes(subelement,hierarchy)

Carl Banks
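
A Python 3 variation on the recursive approach, sketched to produce something close to the listing asked for in the question: one line for each tag path, plus a separate "text:" line when an element carries non-whitespace text. It is not Carl's code. The name path_lines is made up, and reporting tail text (text that follows a child's closing tag) against the parent's path is an assumption about what the poster would want.

```python
import xml.etree.ElementTree as ET

# The sample document from the question.
SAMPLE = """<a>
 this is node b
<c> this is node c
<d />
<e> this is node e </e>
</c>
<f> this is node f </f>
</a>"""

def path_lines(element, ancestors=()):
    # ancestors is an immutable tuple, so sharing the default is safe.
    path = ancestors + ("<%s>" % element.tag,)
    prefix = " ".join(path)
    lines = [prefix]
    text = (element.text or "").strip()
    if text:
        lines.append("%s text: %s" % (prefix, text))
    for child in element:
        lines.extend(path_lines(child, path))
        # ElementTree stores text after a child's end tag in child.tail;
        # it belongs to the current element, so report it under this path.
        tail = (child.tail or "").strip()
        if tail:
            lines.append("%s text: %s" % (prefix, tail))
    return lines

for line in path_lines(ET.fromstring(SAMPLE)):
    print(line)
```

Returning a list instead of printing inside the recursion makes the result easy to compare across many documents.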
