"Full" element tag listing possible with Elementtree?

jaime.dyson

Hello all,

I have the unenviable task of turning about 20K strangely formatted
XML documents from different sources into something resembling a
clean, standard, uniform format. I like ElementTree and have been
using it to step through the documents to get a feel for their
structure. .getiterator() gives me a depth-first traversal, but it
flattens the hierarchy of the elements. What I'd like is to traverse
the elements while keeping track of ancestors, and print the full
chain of ancestor tags as I arrive at each node. So, for example, if
I had a document that looked like this:

<a>
  <b att="atttag" content="b"> this is node b </b>
  <c> this is node c
    <d />
    <e> this is node e </e>
  </c>
  <f> this is node f </f>
</a>

I would want to print the following:

<a>
<a> <b>
<a> <b> text: this is node b
<a> <c>
<a> <c> text: this is node c
<a> <c> <d>
<a> <c> <e>
<a> <c> <e> text: this is node e
<a> <f>
<a> <f> text: this is node f


Is there a simple way to do this? Any help would be appreciated.
Thanks..
 
Fredrik Lundh

in stock ET, using a parent map is probably the easiest way to do this:

http://effbot.org/zone/element.htm#accessing-parents

that is, for a given ET structure "tree", you can do

parent_map = dict((c, p) for p in tree.getiterator() for c in p)

def get_parents(elem):
    parents = []
    while 1:
        elem = parent_map.get(elem)
        if elem is None:
            break
        parents.append(elem)
    return reversed(parents)

for elem in tree.getiterator():
    print list(get_parents(elem)), elem

</F>
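
For readers on current Python, here is a minimal sketch of the same parent-map idea that prints the listing in the format asked for in the question. It assumes Python 3 and the standard-library xml.etree.ElementTree, where getiterator() was removed in 3.9 and iter() takes its place. The names SAMPLE, build_parent_map, ancestor_path and describe are made up for this illustration and are not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# The sample document from the question.
SAMPLE = """<a>
  <b att="atttag" content="b"> this is node b </b>
  <c> this is node c
    <d />
    <e> this is node e </e>
  </c>
  <f> this is node f </f>
</a>"""


def build_parent_map(root):
    # Map every child element to its parent; the root gets no entry.
    return {child: parent for parent in root.iter() for child in parent}


def ancestor_path(elem, parent_map):
    # Walk upward from elem to the root, then reverse to root-first order.
    path = [elem]
    while path[-1] in parent_map:
        path.append(parent_map[path[-1]])
    return list(reversed(path))


def describe(root):
    # Return one line per element, plus a "text:" line when the element
    # has text that is not just whitespace.
    parent_map = build_parent_map(root)
    lines = []
    for elem in root.iter():
        prefix = " ".join("<%s>" % e.tag for e in ancestor_path(elem, parent_map))
        lines.append(prefix)
        text = (elem.text or "").strip()
        if text:
            lines.append("%s text: %s" % (prefix, text))
    return lines


if __name__ == "__main__":
    for line in describe(ET.fromstring(SAMPLE)):
        print(line)
```

Building the map costs one pass over the tree, and each lookup then walks only as far as the element's depth. Stripping the text is a choice made here so that whitespace-only text between tags is not printed.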
 
Stefan Behnel

Try lxml.etree. It's an extended re-implementation of ElementTree based on
libxml2. Amongst tons of other features, it provides its Elements with a
getparent() method and allows you to iterate over their ancestors (and other
XPath axes), or to iterate over a parsed document in an iterparse-like fashion
(called iterwalk).

http://codespeak.net/lxml/

Stefan
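
As a rough illustration of the lxml route, the sketch below uses iterancestors() to build the tag path for each element. It assumes the third-party lxml package is installed. The names SAMPLE, tag_path and describe are invented for this example, and the output format is taken from the question rather than from anything lxml provides.

```python
from lxml import etree  # third-party package, assumed to be installed

# The sample document from the question.
SAMPLE = """<a>
  <b att="atttag" content="b"> this is node b </b>
  <c> this is node c
    <d />
    <e> this is node e </e>
  </c>
  <f> this is node f </f>
</a>"""


def tag_path(elem):
    # iterancestors() yields the parent first and the root last,
    # so reverse it to get a root-first path and append elem itself.
    chain = list(reversed(list(elem.iterancestors()))) + [elem]
    return " ".join("<%s>" % e.tag for e in chain)


def describe(root):
    lines = []
    # tag=etree.Element restricts the walk to elements, skipping
    # comments and processing instructions.
    for elem in root.iter(tag=etree.Element):
        prefix = tag_path(elem)
        lines.append(prefix)
        text = (elem.text or "").strip()
        if text:
            lines.append("%s text: %s" % (prefix, text))
    return lines


if __name__ == "__main__":
    for line in describe(etree.fromstring(SAMPLE)):
        print(line)
```

No separate parent map is needed here, because lxml elements keep a reference to their parent.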
 
jaime.dyson

Fantastic. Thank you very much, Fredrik! And thanks for ET!
 
Carl Banks

Fredrik Lundh wrote ElementTree, so he'd know the best solution, but
I'd like to point out that this is also trivially easy with recursion:


def print_nodes(element, ancestors=()):
    # Tag path from the root down to this element, e.g. ["<a>", "<c>"].
    hierarchy = list(ancestors) + ["<" + element.tag + ">"]
    s = hierarchy
    if element.text is not None:
        s = s + [element.text]
    print(" ".join(s))
    for subelement in element:
        print_nodes(subelement, hierarchy)



Carl Banks
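
The recursive version above prints an element's text on the same line as its tag path and also prints whitespace-only text. A minimal variation, assuming Python 3, is sketched below: it strips the text and puts it on its own "text:" line, as in the listing the question asked for. The generator name walk and the short sample document are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def walk(element, ancestors=()):
    # ancestors is a tuple of tag names from the root down to the parent.
    path = ancestors + (element.tag,)
    prefix = " ".join("<%s>" % tag for tag in path)
    yield prefix
    # Emit a separate line for text, ignoring whitespace-only text.
    text = (element.text or "").strip()
    if text:
        yield "%s text: %s" % (prefix, text)
    for child in element:
        yield from walk(child, path)


if __name__ == "__main__":
    root = ET.fromstring("<a><b> x </b><c><d /></c></a>")
    for line in walk(root):
        print(line)
```

Passing the path down as an immutable tuple avoids the shared mutable default argument. Recursion depth equals the nesting depth of the document, which is unlikely to be a problem for ordinary XML but is worth keeping in mind for pathologically deep input.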
 
