Problem round-tripping with xml.dom.minidom pretty-printer

Discussion in 'Python' started by Ben Butler-Cole, Feb 29, 2008.

  1. Hello

    I have run into a problem using minidom. I have an HTML file that I
    want to make occasional, automated changes to (adding new links). My
    strategy is to parse it with minidom, add a node, pretty print it and
    write it back to disk.

    However I find that every time I do a round trip minidom's pretty
    printer puts extra blank lines around every element, so my file grows
    without limit. I have found that normalizing the document doesn't make
    any difference. Obviously I can fix the problem by doing without the
    pretty-printing, but I don't really like producing non-human readable
    HTML.

    Here is some code that shows the behaviour:

    import xml.dom.minidom as dom

    def p(t):
        d = dom.parseString(t)
        d.normalize()
        t2 = d.toprettyxml()
        print(t2)
        p(t2)  # deliberately unbounded: re-parse and re-print the output

    p('<a><b><c/></b></a>')

    Does anyone know how to fix this behaviour? If not, can anyone
    recommend an alternative XML tool for simple tasks like this?

    Thanks
    Ben
     
    Ben Butler-Cole, Feb 29, 2008
    #1

  2. Ben Butler-Cole

    Robert Bossy Guest

    Ben Butler-Cole wrote:
    > Hello
    >
    > I have run into a problem using minidom. I have an HTML file that I
    > want to make occasional, automated changes to (adding new links). My
    > strategy is to parse it with minidom, add a node, pretty print it and
    > write it back to disk.
    >
    > However I find that every time I do a round trip minidom's pretty
    > printer puts extra blank lines around every element, so my file grows
    > without limit. I have found that normalizing the document doesn't make
    > any difference. Obviously I can fix the problem by doing without the
    > pretty-printing, but I don't really like producing non-human readable
    > HTML.
    >
    > Here is some code that shows the behaviour:
    >
    > import xml.dom.minidom as dom
    > def p(t):
    >     d = dom.parseString(t)
    >     d.normalize()
    >     t2 = d.toprettyxml()
    >     print(t2)
    >     p(t2)
    > p('<a><b><c/></b></a>')
    >
    > Does anyone know how to fix this behaviour? If not, can anyone
    > recommend an alternative XML tool for simple tasks like this?

    Hi,

    The last line of p() calls itself: it is an unconditional recursive call
    so, no matter what it does, it will never stop. And since p() also
    prints something, calling it will print endlessly. By removing this
    line, you get something like:

    <?xml version="1.0" ?>
    <a>
        <b>
            <c/>
        </b>
    </a>

    That seems sensible, imo. Was that what you wanted?

    An additional thing to keep in mind is that toprettyxml does not print
    an XML identical to the original DOM tree: it adds newlines and tabs.
    When parsed again these blank characters are inserted in the DOM tree as
    character nodes. If you toprettyxml an XML document twice in a row, then
    the second one will also add newlines and tabs around the newlines and
    tabs added by the first. Since you call toprettyxml an infinite number
    of times, it is expected that lots of blank characters appear.
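
    The growth described above can be reproduced without the endless
    recursion. This is only a minimal sketch, assuming Python 3; the helper
    names round_trip and count_text_nodes are made up for illustration and
    are not part of minidom.

```python
import xml.dom.minidom as dom

def round_trip(text):
    """Parse text and pretty-print it again, with no whitespace cleanup."""
    return dom.parseString(text).toprettyxml()

def count_text_nodes(node):
    """Count every text node in the subtree rooted at node."""
    total = 1 if node.nodeType == node.TEXT_NODE else 0
    for child in node.childNodes:
        total += count_text_nodes(child)
    return total

source = '<a><b><c/></b></a>'
once = round_trip(source)
twice = round_trip(once)

# The compact source has no text nodes; the pretty-printed copy does,
# because the added indentation is parsed back in as character data.
print(count_text_nodes(dom.parseString(source)))
print(count_text_nodes(dom.parseString(once)))

# Each further round trip wraps that whitespace in more whitespace.
print(len(once), len(twice))
```

    The second pretty-print comes out longer than the first, which is the
    "file grows without limit" effect from the original post.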

    Finally, normalize() is supposed to merge consecutive sibling character
    nodes, but it never removes character content, even if it is blank.
    That means several adjacent character nodes are replaced by a single
    one whose content is the concatenation of the contents of the original
    nodes. Clear enough?
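
    To illustrate the merging, here is a small sketch, assuming Python 3.
    It builds two adjacent blank text nodes by hand (something a parser
    would not normally hand you) and shows that normalize() joins them
    rather than dropping them.

```python
import xml.dom.minidom as dom

doc = dom.parseString('<a/>')
root = doc.documentElement

# Two adjacent, whitespace-only text nodes under <a>.
root.appendChild(doc.createTextNode('\n'))
root.appendChild(doc.createTextNode('\t'))
before = len(root.childNodes)

doc.normalize()
after = len(root.childNodes)
merged = root.firstChild.data

print(before, after, repr(merged))  # prints: 2 1 '\n\t'
```

    The blank content survives normalization; only the node boundaries
    change.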

    Cheers,
    RB
     
    Robert Bossy, Feb 29, 2008
    #2

  3. > The last line of p() calls itself: it is an unconditional recursive call
    > so, no matter what it does, it will never stop. And since p() also
    > prints something, calling it will print endlessly.


    Sorry, I wasn't clear. I realize that this recurses endlessly. The
    problem is that it also adds blank lines endlessly.

    > By removing this line, you get something like:
    >
    > <?xml version="1.0" ?>
    > <a>
    >     <b>
    >         <c/>
    >     </b>
    > </a>
    >
    > That seems sensible, imo. Was that what you wanted?


    Sure. That's fine unless you then re-parse this output and print it
    again, in which case you get the behaviour you describe:

    > An additional thing to keep in mind is that toprettyxml does not print
    > an XML identical to the original DOM tree: it adds newlines and tabs.
    > When parsed again these blank characters are inserted in the DOM tree as
    > character nodes. If you toprettyxml an XML document twice in a row, then
    > the second one will also add newlines and tabs around the newlines and
    > tabs added by the first. Since you call toprettyxml an infinite number
    > of times, it is expected that lots of blank characters appear.


    Right. That's the behaviour I'm asking about, which I consider to be
    problematic. I would expect a module providing a parser and pretty-
    printer (not just for XML parsers) to be able to conservatively round-
    trip.

    As far as I can see (and your comments back this up) minidom doesn't
    have this property. Unless anyone knows how to get it to behave that
    way...

    Ben
     
    Ben Butler-Cole, Feb 29, 2008
    #3
  4. Ben Butler-Cole

    Robert Bossy Guest

    Ben Butler-Cole wrote:
    >> An additional thing to keep in mind is that toprettyxml does not print
    >> an XML identical to the original DOM tree: it adds newlines and tabs.
    >> When parsed again these blank characters are inserted in the DOM tree as
    >> character nodes. If you toprettyxml an XML document twice in a row, then
    >> the second one will also add newlines and tabs around the newlines and
    >> tabs added by the first. Since you call toprettyxml an infinite number
    >> of times, it is expected that lots of blank characters appear.
    >>

    >
    > Right. That's the behaviour I'm asking about, which I consider to be
    > problematic. I would expect a module providing a parser and pretty-
    > printer (not just for XML parsers) to be able to conservatively round-
    > trip.
    >
    > As far as I can see (and your comments back this up) minidom doesn't
    > have this property. Unless anyone knows how to get it to behave that
    > way...
    >

    minidom (like any DOM parser, btw) has no means of knowing whether a
    blank character is a pretty-print artefact or actual blank content from
    the original XML.

    You could write a function that recursively strips all-blank nodes from
    the element tree. Before doing so, I suggest you take a look at section
    2.10 of http://www.w3.org/TR/REC-xml/.
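
    A minimal sketch of such a stripping function, assuming Python 3. The
    names strip_blank_text and pretty are hypothetical helpers, not minidom
    API. It treats every whitespace-only text node as a pretty-print
    artefact, which is an assumption: it would be wrong for mixed content
    or for elements where whitespace is significant (see section 2.10).

```python
import xml.dom.minidom as dom

def strip_blank_text(node):
    """Remove whitespace-only text nodes below node, in place."""
    # Iterate over a copy because children are removed during the loop.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        else:
            strip_blank_text(child)

def pretty(text):
    """Parse, drop blank text nodes, then pretty-print."""
    doc = dom.parseString(text)
    strip_blank_text(doc)
    return doc.toprettyxml()

first = pretty('<a><b><c/></b></a>')
second = pretty(first)

# With the blank nodes stripped before printing, a second round trip
# reproduces the first output instead of growing.
print(first == second)
```

    Text nodes that contain anything other than whitespace are left
    untouched, so this only addresses the blank lines between elements.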

    RB
     
    Robert Bossy, Feb 29, 2008
    #4