XML parser that sorts elements?

Discussion in 'Python' started by jmike@alum.mit.edu, Sep 22, 2006.

  1. Guest

    Hi everyone,

    I am a total newbie to XML parsing. I've written a couple of toy
    examples under the instruction of tutorials available on the web.

    The problem I want to solve is this. I have an XML snippet (in a
    string) that looks like this:

    <booga foo="1" bar="2">
    <well>hello</well>
    <blah>goodbye</blah>
    </booga>

    and I want to alphabetize not only the attributes of an element, but I
    also want to alphabetize the elements in the same scope:

    <booga bar="2" foo="1">
    <blah>goodbye</blah>
    <well>hello</well>
    </booga>

    I've found a "Canonizer" class, that subclasses saxlib.HandlerBase, and
    played around with it and vaguely understand what it's doing. But what
    I get out of it is

    <booga bar="2" foo="1">
    <well>hello</well>
    <blah>goodbye</blah>
    </booga>

    in other words it sorts the attributes of each element, but doesn't
    touch the order of the elements.

    How can I sort the elements? I think I want to subclass the parser, to
    present the elements to the content handler in different order, but I
    couldn't immediately find any examples of the parser being subclassed.

    Thanks for any pointers!
    --JMike
    , Sep 22, 2006
    #1
    1. Advertising

  2. wrote:

    > Hi everyone,
    >
    > I am a total newbie to XML parsing. I've written a couple of toy
    > examples under the instruction of tutorials available on the web.
    >
    > The problem I want to solve is this. I have an XML snippet (in a
    > string) that looks like this:
    >
    > <booga foo="1" bar="2">
    > <well>hello</well>
    > <blah>goodbye</blah>
    > </booga>
    >
    > and I want to alphabetize not only the attributes of an element, but I
    > also want to alphabetize the elements in the same scope:
    >
    > <booga bar="2" foo="1">
    > <blah>goodbye</blah>
    > <well>hello</well>
    > </booga>
    >
    > I've found a "Canonizer" class, that subclasses saxlib.HandlerBase, and
    > played around with it and vaguely understand what it's doing. But what
    > I get out of it is
    >
    > <booga bar="2" foo="1">
    > <well>hello</well>
    > <blah>goodbye</blah>
    > </booga>
    >
    > in other words it sorts the attributes of each element, but doesn't
    > touch the order of the elements.
    >
    > How can I sort the elements? I think I want to subclass the parser, to
    > present the elements to the content handler in different order, but I
    > couldn't immediately find any examples of the parser being subclassed.


    You can sort them by obtaining them as tree of nodes, e.g. using element
    tree or minidom.

    But you should be aware that this will change the structure of your document
    and it isn't always desirable to do so - e.g. html pages would look funny
    to say the least if sorted in that way.

    Diez
    Diez B. Roggisch, Sep 22, 2006
    #2
    1. Advertising

  3. Guest

    Diez B. Roggisch wrote:

    > You can sort them by obtaining them as tree of nodes, e.g. using element
    > tree or minidom.
    >
    > But you should be aware that this will change the structure of your document
    > and it isn't always desirable to do so - e.g. html pages would look funny
    > to say the least if sorted in that way.
    >
    > Diez


    In this particular case, I need to sort the elements, and the specific
    application I'm testing guarantees that the order of the elements "in
    the same scope" (this may not be the right term in XML semantics, but
    it's what I know how to say) does not matter. That probably means that
    the specific application I'm testing is not using XML in a standard
    way, but so be it.

    I'm looking at minidom now and I think maybe there's enough
    documentation there that I can get a handle on it and do what I need to
    do. Thanks. (But if anyone else has a specific example I can crib
    from, that'd be great.)

    --JMike
    , Sep 22, 2006
    #3
  4. Paul McGuire Guest

    <> wrote in message
    news:...
    > Hi everyone,
    >
    > I am a total newbie to XML parsing. I've written a couple of toy
    > examples under the instruction of tutorials available on the web.
    >
    > The problem I want to solve is this. I have an XML snippet (in a
    > string) that looks like this:
    >
    > <booga foo="1" bar="2">
    > <well>hello</well>
    > <blah>goodbye</blah>
    > </booga>
    >
    > and I want to alphabetize not only the attributes of an element, but I
    > also want to alphabetize the elements in the same scope:
    >
    > <booga bar="2" foo="1">
    > <blah>goodbye</blah>
    > <well>hello</well>
    > </booga>
    >
    > I've found a "Canonizer" class, that subclasses saxlib.HandlerBase, and
    > played around with it and vaguely understand what it's doing. But what
    > I get out of it is
    >
    > <booga bar="2" foo="1">
    > <well>hello</well>
    > <blah>goodbye</blah>
    > </booga>
    >
    > in other words it sorts the attributes of each element, but doesn't
    > touch the order of the elements.
    >
    > How can I sort the elements? I think I want to subclass the parser, to
    > present the elements to the content handler in different order, but I
    > couldn't immediately find any examples of the parser being subclassed.
    >


    I suspect that Canonizer doesn't sort nested elements because some schemas
    require elements to be in a particular order, and not necessarily an
    alphabetical one.

    Here is a snippet from an interactive Python session, working with the
    "batteries included" xml.dom.minidom. The solution is not necessarily in
    the parser, it may be instead in what you do with the parsed document
    object.

    This is not a solution to your actual problem, but I hope it gives you
    enough to work with to find your own solution.

    HTH,
    -- Paul


    >>> xmlsrc = """<booga foo="1" bar="2">

    .... <well>hello</well>
    .... <blah>goodbye</blah>
    .... </booga>
    .... """
    >>> import xml.dom.minidom
    >>> doc = xml.dom.minidom.parseString(xmlsrc)
    >>> doc.childNodes

    [<DOM Element: booga at 0x9e8508>]
    >>> print doc.toprettyxml()

    <?xml version="1.0" ?>
    <booga bar="2" foo="1">


    <well>
    hello
    </well>


    <blah>
    goodbye
    </blah>


    </booga>

    >>> [n.nodeName for n in doc.childNodes]

    [u'booga']
    >>> [n.nodeName for n in doc.childNodes[0].childNodes]

    ['#text', u'well', '#text', u'blah', '#text']
    >>> [n.nodeName for n in doc.childNodes[0].childNodes if n.nodeType ==
    >>> doc.ELEMENT_NODE]

    [u'well', u'blah']
    >>> doc.childNodes[0].childNodes =
    >>> sorted(doc.childNodes[0].childNodes,key=lambda n:n.nodeName)
    >>> print doc.toprettyxml()

    <?xml version="1.0" ?>
    <booga bar="2" foo="1">






    <blah>
    goodbye
    </blah>
    <well>
    hello
    </well>
    </booga>

    >>> doc.childNodes[0].childNodes = sorted([n for n in
    >>> doc.childNodes[0].childNodes if n.nodeType ==
    >>> doc.ELEMENT_NODE],key=lambda n:n.nodeName)
    >>> print doc.toprettyxml()

    <?xml version="1.0" ?>
    <booga bar="2" foo="1">
    <blah>
    goodbye
    </blah>
    <well>
    hello
    </well>
    </booga>

    >>>
    Paul McGuire, Sep 22, 2006
    #4
  5. Guest

    Paul McGuire wrote:

    ....
    > Here is a snippet from an interactive Python session, working with the
    > "batteries included" xml.dom.minidom. The solution is not necessarily in
    > the parser, it may be instead in what you do with the parsed document
    > object.
    >
    > This is not a solution to your actual problem, but I hope it gives you
    > enough to work with to find your own solution.
    >
    > HTH,
    > -- Paul


    Whoa. Outstanding. Excellent. Thank you!
    --JMike
    , Sep 22, 2006
    #5
  6. Paul McGuire Guest

    "Paul McGuire" <._bogus_.com> wrote in message
    news:_OTQg.4326$...
    > <> wrote in message
    > news:...

    <snip>
    >

    This is what I posted, but it's not what I typed. I entered some very long
    lines at the console, and the newsgroup software, when wrapping the text,
    prefixed it with '>>>', not '...'. So this looks like something that wont
    run.
    >>>> doc.childNodes[0].childNodes = sorted([n for n in
    >>>> doc.childNodes[0].childNodes if n.nodeType ==
    >>>> doc.ELEMENT_NODE],key=lambda n:n.nodeName)
    >>>> print doc.toprettyxml()

    > <?xml version="1.0" ?>
    > <booga bar="2" foo="1">
    > <blah>
    > goodbye
    > </blah>
    > <well>
    > hello
    > </well>
    > </booga>
    >
    >>>>


    Here's the console session, with '...' continuation lines:

    >>> xmlsrc = """<booga foo="1" bar="2">

    .... <well>hello</well>
    .... <blah>goodbye</blah>
    .... </booga>
    .... """
    >>> import xml.dom.minidom
    >>> doc = xml.dom.minidom.parseString(xmlsrc)
    >>> print doc.toprettyxml()

    <?xml version="1.0" ?>
    <booga bar="2" foo="1">


    <well>
    hello
    </well>


    <blah>
    goodbye
    </blah>


    </booga>

    >>> [n.nodeName for n in doc.childNodes]

    [u'booga']
    >>> [n.nodeName for n in doc.childNodes[0].childNodes]

    ['#text', u'well', '#text', u'blah', '#text']
    >>> [n.nodeName for n in doc.childNodes[0].childNodes

    .... if n.nodeType == doc.ELEMENT_NODE]
    [u'well', u'blah']
    >>> doc.childNodes[0].childNodes = sorted(

    .... doc.childNodes[0].childNodes,key=lambda n:n.nodeName)
    >>> [n.nodeName for n in doc.childNodes[0].childNodes

    .... if n.nodeType == doc.ELEMENT_NODE]
    [u'blah', u'well']
    >>> print doc.toprettyxml()

    <?xml version="1.0" ?>
    <booga bar="2" foo="1">






    <blah>
    goodbye
    </blah>
    <well>
    hello
    </well>
    </booga>

    >>> doc.childNodes[0].childNodes = sorted(

    .... [n for n in doc.childNodes[0].childNodes
    .... if n.nodeType==doc.ELEMENT_NODE],
    .... key=lambda n:n.nodeName)
    >>> print doc.toprettyxml()

    <?xml version="1.0" ?>
    <booga bar="2" foo="1">
    <blah>
    goodbye
    </blah>
    <well>
    hello
    </well>
    </booga>

    >>>
    Paul McGuire, Sep 22, 2006
    #6
  7. Guest

    Paul McGuire wrote:
    > >>> doc.childNodes[0].childNodes = sorted(

    > ... [n for n in doc.childNodes[0].childNodes
    > ... if n.nodeType==doc.ELEMENT_NODE],
    > ... key=lambda n:n.nodeName)
    > >>> print doc.toprettyxml()

    > <?xml version="1.0" ?>
    > <booga bar="2" foo="1">
    > <blah>
    > goodbye
    > </blah>
    > <well>
    > hello
    > </well>
    > </booga>


    My requirements changed a bit, so now I'm sorting second level elements
    by their values of a specific attribute (where the specific attribute
    can be chosen). But the solution is still mainly what you posted here.
    It was just a matter of supplying a different function for 'key'.
    It's up and running live now and all is well. Thanks again!

    (A bonus side effect of this is that it let me sneak "sorted()" into
    our test infrastructure, which gave me reason to get our IT guys to
    upgrade a mismash of surprisingly old Python versions up to Python 2.5
    everywhere.)

    --JMike
    , Sep 28, 2006
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Srikanth Mandava
    Replies:
    1
    Views:
    382
    Michael Hudson
    Feb 19, 2004
  2. Ron Adam

    Alphabetical sorts

    Ron Adam, Oct 16, 2006, in forum: Python
    Replies:
    7
    Views:
    456
    Ron Adam
    Oct 17, 2006
  3. CMOS
    Replies:
    1
    Views:
    312
    Jack Klein
    Aug 29, 2006
  4. Jonathan Wood
    Replies:
    2
    Views:
    314
    Jonathan Wood
    Jun 18, 2008
  5. python_newbie

    insertion sorts...

    python_newbie, Jun 23, 2008, in forum: Python
    Replies:
    8
    Views:
    331
Loading...

Share This Page