XML parser that sorts elements?

jmike

Hi everyone,

I am a total newbie to XML parsing. I've written a couple of toy
examples under the instruction of tutorials available on the web.

The problem I want to solve is this. I have an XML snippet (in a
string) that looks like this:

<booga foo="1" bar="2">
<well>hello</well>
<blah>goodbye</blah>
</booga>

and I want to alphabetize not only the attributes of an element, but I
also want to alphabetize the elements in the same scope:

<booga bar="2" foo="1">
<blah>goodbye</blah>
<well>hello</well>
</booga>

I've found a "Canonizer" class, that subclasses saxlib.HandlerBase, and
played around with it and vaguely understand what it's doing. But what
I get out of it is

<booga bar="2" foo="1">
<well>hello</well>
<blah>goodbye</blah>
</booga>

in other words it sorts the attributes of each element, but doesn't
touch the order of the elements.

How can I sort the elements? I think I want to subclass the parser, to
present the elements to the content handler in different order, but I
couldn't immediately find any examples of the parser being subclassed.

Thanks for any pointers!
--JMike
 
Diez B. Roggisch

You can sort them by obtaining them as a tree of nodes, e.g. using
ElementTree or minidom.

But you should be aware that this will change the structure of your document,
and that isn't always desirable - e.g. HTML pages would look funny, to say
the least, if sorted that way.

Diez
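
A minimal sketch of the tree-based approach Diez describes, using the standard library's xml.etree.ElementTree. The helper name sort_tree is made up for illustration, and the code is written for Python 3 rather than the Python 2 used elsewhere in this thread. It assumes that sibling order carries no meaning (as in the original question) and that the document contains no mixed content.

```python
import xml.etree.ElementTree as ET


def sort_tree(elem):
    """Recursively sort attributes and child elements by name, in place.

    sort_tree is a hypothetical helper, not part of ElementTree.
    """
    # Rebuild the attribute dict in alphabetical order; recent Python 3
    # versions serialize attributes in insertion order.
    attrs = sorted(elem.attrib.items())
    elem.attrib.clear()
    elem.attrib.update(attrs)
    # Slice assignment replaces the children with the sorted list.
    # sorted() is stable, so same-named siblings keep their relative order.
    elem[:] = sorted(elem, key=lambda child: child.tag)
    for child in elem:
        sort_tree(child)


xmlsrc = """<booga foo="1" bar="2">
<well>hello</well>
<blah>goodbye</blah>
</booga>"""

root = ET.fromstring(xmlsrc)
sort_tree(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

ElementTree keeps the whitespace between elements in each element's tail, so it moves along with the element when the children are reordered. In this sample every tail is a single newline, which is why the output stays tidy.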
 
jmike

In this particular case, I need to sort the elements, and the specific
application I'm testing guarantees that the order of the elements "in
the same scope" (this may not be the right term in XML semantics, but
it's what I know how to say) does not matter. That probably means that
the specific application I'm testing is not using XML in a standard
way, but so be it.

I'm looking at minidom now and I think maybe there's enough
documentation there that I can get a handle on it and do what I need to
do. Thanks. (But if anyone else has a specific example I can crib
from, that'd be great.)

--JMike
 
Paul McGuire

I suspect that Canonizer doesn't sort nested elements because some schemas
require elements to be in a particular order, and not necessarily an
alphabetical one.

Here is a snippet from an interactive Python session, working with the
"batteries included" xml.dom.minidom. The solution is not necessarily in
the parser; it may instead lie in what you do with the parsed document
object.

This is not a solution to your actual problem, but I hope it gives you
enough to work with to find your own solution.

HTH,
-- Paul

>>> xmlsrc = """<booga foo="1" bar="2">
... <well>hello</well>
... <blah>goodbye</blah>
... </booga>"""
>>> import xml.dom.minidom
>>> doc = xml.dom.minidom.parseString(xmlsrc)
>>> doc.childNodes
[<DOM Element: booga at 0x...>]
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">


<well>
hello
</well>


<blah>
goodbye
</blah>


</booga>
>>> [n.nodeName for n in doc.childNodes]
[u'booga']
>>> [n.nodeName for n in doc.childNodes[0].childNodes]
['#text', u'well', '#text', u'blah', '#text']
>>> [n.nodeName for n in doc.childNodes[0].childNodes if n.nodeType == doc.ELEMENT_NODE]
[u'well', u'blah']
>>> doc.childNodes[0].childNodes = sorted(doc.childNodes[0].childNodes,key=lambda n:n.nodeName)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">






<blah>
goodbye
</blah>
<well>
hello
</well>
</booga>
>>> doc.childNodes[0].childNodes = sorted([n for n in doc.childNodes[0].childNodes if n.nodeType == doc.ELEMENT_NODE],key=lambda n:n.nodeName)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">
<blah>
goodbye
</blah>
<well>
hello
</well>
</booga>
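
For readers on Python 3, here is a rough equivalent of the session above, offered as a sketch rather than a drop-in replacement. The function name sort_child_elements is invented for this example. Instead of assigning a plain list to childNodes, it detaches and re-appends nodes through the DOM methods, which should keep the sibling links consistent. Like the last step of the session, it discards the whitespace text nodes, so it is only meant for a node whose direct children are elements plus whitespace.

```python
import xml.dom.minidom


def sort_child_elements(node):
    """Reorder node's element children by tag name.

    Hypothetical helper: every non-element child (such as the whitespace
    text nodes between elements) is dropped, so do not call it on a node
    whose own text content matters.
    """
    elements = [n for n in node.childNodes if n.nodeType == n.ELEMENT_NODE]
    # Detach all children first, iterating over a copy of the list.
    for child in list(node.childNodes):
        node.removeChild(child)
    # Re-attach only the elements, in alphabetical order of their names.
    for elem in sorted(elements, key=lambda n: n.nodeName):
        node.appendChild(elem)


xmlsrc = """<booga foo="1" bar="2">
<well>hello</well>
<blah>goodbye</blah>
</booga>"""

doc = xml.dom.minidom.parseString(xmlsrc)
root = doc.documentElement
sort_child_elements(root)
names = [n.nodeName for n in root.childNodes]
print(names)
print(doc.toprettyxml(indent="  "))
```

The text inside <well> and <blah> survives because the helper is applied to the root only. A recursive version would need to skip whitespace-only text nodes instead of dropping every text node.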
 
jmike

Whoa. Outstanding. Excellent. Thank you!
--JMike
 
Paul McGuire

Paul McGuire said:
>>> doc.childNodes[0].childNodes = sorted([n for n in
>>> doc.childNodes[0].childNodes if n.nodeType ==
>>> doc.ELEMENT_NODE],key=lambda n:n.nodeName)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">
<blah>
goodbye
</blah>
<well>
hello
</well>
</booga>

This is what I posted, but it's not what I typed. I entered some very long
lines at the console, and the newsgroup software, when wrapping the text,
prefixed the wrapped lines with '>>>', not '...'. So this looks like
something that won't run.

Here's the console session, with '...' continuation lines:

>>> xmlsrc = """<booga foo="1" bar="2">
... <well>hello</well>
... <blah>goodbye</blah>
... </booga>"""
>>> import xml.dom.minidom
>>> doc = xml.dom.minidom.parseString(xmlsrc)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">


<well>
hello
</well>


<blah>
goodbye
</blah>


</booga>
>>> [n.nodeName for n in doc.childNodes]
[u'booga']
>>> [n.nodeName for n in doc.childNodes[0].childNodes]
['#text', u'well', '#text', u'blah', '#text']
>>> [n.nodeName for n in doc.childNodes[0].childNodes
... if n.nodeType == doc.ELEMENT_NODE]
[u'well', u'blah']
>>> doc.childNodes[0].childNodes = sorted(
... doc.childNodes[0].childNodes,key=lambda n:n.nodeName)
>>> [n.nodeName for n in doc.childNodes[0].childNodes
... if n.nodeType == doc.ELEMENT_NODE]
[u'blah', u'well']
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">






<blah>
goodbye
</blah>
<well>
hello
</well>
</booga>
>>> doc.childNodes[0].childNodes = sorted(
... [n for n in doc.childNodes[0].childNodes
... if n.nodeType==doc.ELEMENT_NODE],
... key=lambda n:n.nodeName)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">
<blah>
goodbye
</blah>
<well>
hello
</well>
</booga>
 
jmike

Paul said:
>>> doc.childNodes[0].childNodes = sorted(
... [n for n in doc.childNodes[0].childNodes
... if n.nodeType==doc.ELEMENT_NODE],
... key=lambda n:n.nodeName)
>>> print doc.toprettyxml()
<?xml version="1.0" ?>
<booga bar="2" foo="1">
<blah>
goodbye
</blah>
<well>
hello
</well>
</booga>

My requirements changed a bit, so now I'm sorting second-level elements
by the value of a specific attribute (where the specific attribute
can be chosen). But the solution is still mainly what you posted here.
It was just a matter of supplying a different function for 'key'.
It's up and running live now and all is well. Thanks again!

(A bonus side effect of this is that it let me sneak "sorted()" into
our test infrastructure, which gave me reason to get our IT guys to
upgrade a mishmash of surprisingly old Python versions up to Python 2.5
everywhere.)

--JMike
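
JMike's actual key function was not posted, so the following is only a guess at what "supplying a different function for 'key'" might look like, written for Python 3. The helper attribute_key, the sample document, and the attribute name "id" are all invented for illustration.

```python
import xml.dom.minidom


def attribute_key(attr_name):
    """Return a key function that orders elements by the value of attr_name.

    Hypothetical helper; the attribute to sort on is chosen by the caller.
    """
    def key(node):
        # getAttribute returns "" when the attribute is missing, so
        # elements without it sort first.
        return node.getAttribute(attr_name)
    return key


# Made-up sample data, written without whitespace between the elements
# so the root has no text-node children.
xmlsrc = '<items><item id="c"/><item id="a"/><item id="b"/></items>'

doc = xml.dom.minidom.parseString(xmlsrc)
root = doc.documentElement
elements = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
for elem in sorted(elements, key=attribute_key("id")):
    # appendChild first detaches a node that is already in the tree, so
    # appending in sorted order leaves the children sorted.
    root.appendChild(elem)

ids = [n.getAttribute("id") for n in root.childNodes]
print(ids)
```

Attribute values compare as strings here, so "10" sorts before "9". Converting inside the key function (for example with int) would be needed for numeric ordering.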
 
