toprettyxml messes up whitespace

Discussion in 'Python' started by Jorgen Bodde, Oct 2, 2007.

  1. Jorgen Bodde

    Jorgen Bodde Guest

    Hi all,

    I parse an XML file, replace a node with a new one (like updating
    cache) and write it back. Every write, new spaces are added. For
    example, first read - update - write cycle;

    <var name="APPNAME" status="undefined">
    My First App
    </var>

    Second cycle:

    <var name="APPNAME" status="undefined">
    My First App
    </var>

    Third cycle:

    <var name="APPNAME" status="undefined">
    My First App
    </var>


    And this goes on. The node is one that is not touched in the XML, it
    is simply written back after reading. I have the same with void spaces
    in between the nodes, I managed to compensate that by stripping the
    lines.

    I would like to use toprettyxml to make it user editable and viewable.
    But this is really weird. How can I circumvent this behaviour?
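
A minimal sketch of one read/write cycle with xml.dom.minidom, assuming a small made-up document (the `root` wrapper and the helper name `whitespace_nodes` are illustrative, not from the original file). It suggests where the extra whitespace comes from: the indentation between elements is kept in the DOM as whitespace-only text nodes, and toprettyxml() writes its own indentation around them instead of replacing them.

```python
from xml.dom import minidom, Node

SOURCE = (
    '<root>\n'
    '  <var name="APPNAME" status="undefined">My First App</var>\n'
    '</root>'
)

def whitespace_nodes(xml_text):
    """Return the data of each whitespace-only text node directly under the root."""
    root = minidom.parseString(xml_text).documentElement
    return [
        child.data
        for child in root.childNodes
        if child.nodeType == Node.TEXT_NODE and not child.data.strip()
    ]

before = whitespace_nodes(SOURCE)
# One read -> toprettyxml -> read cycle.
rewritten = minidom.parseString(SOURCE).toprettyxml(indent="  ")
after = whitespace_nodes(rewritten)

# Each whitespace node comes back longer than it went in.
for old, new in zip(before, after):
    print(repr(old), "->", repr(new))
```

Note that in current CPython releases the text inside a leaf element such as `var` may no longer grow; the whitespace between elements still does.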

    regards,
    - Jorgen
     
    Jorgen Bodde, Oct 2, 2007
    #1

  2. Jorgen Bodde

    Guest

    On Oct 2, 11:43 am, "Jorgen Bodde" <> wrote:
    > Hi all,
    >
    > I parse an XML file, replace a node with a new one (like updating
    > cache) and write it back. Every write, new spaces are added. For
    > example, first read - update - write cycle;
    >
    > <var name="APPNAME" status="undefined">
    > My First App
    > </var>
    >
    > Second cycle:
    >
    > <var name="APPNAME" status="undefined">
    > My First App
    > </var>
    >
    > Third cycle:
    >
    > <var name="APPNAME" status="undefined">
    > My First App
    > </var>
    >
    > And this goes on. The node is one that is not touched in the XML, it
    > is simply written back after reading. I have the same with void spaces
    > in between the nodes, I managed to compensate that by stripping the
    > lines.
    >
    > I would like to use toprettyxml to make it user editable and viewable.
    > But this is really weird. How can I circumvent this behaviour?
    >
    > regards,
    > - Jorgen


    I had similar problems and ended up switching to the lxml package to
    solve the issue. I think you can do it with ElementTree too. Maybe
    somebody with more experience with the xml / minidom modules will show
    up soon.
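
A possible ElementTree version of this, as a sketch only: it assumes Python 3.9 or later, where `xml.etree.ElementTree.indent()` exists, and uses a made-up document. `indent()` overwrites whitespace-only text between elements rather than adding to it, so repeated cycles should give the same output.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    '<root>\n'
    '  <var name="APPNAME" status="undefined">My First App</var>\n'
    '</root>'
)

def pretty(xml_text):
    """Parse, re-indent in place, and serialise back to a string."""
    root = ET.fromstring(xml_text)
    ET.indent(root, space="  ")  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")

first = pretty(SOURCE)
second = pretty(first)  # a second cycle should change nothing
print(first)
```

Text that contains non-whitespace characters is left alone by `indent()`, so the content of `var` is not touched.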

    Mike
     
    , Oct 2, 2007
    #2

  3. Jorgen Bodde

    Jorgen Bodde Guest

    Hi there,

    Thank you for confirming this, I did manage a work around. When
    reading back the XML file, I strip it of its whitespace before I
    parse it. Then when writing it back no excessive whitespaces are
    appended. My best guess is that toprettyxml is not intelligently
    handling whitespaces that are already there, and bluntly appends more
    whitespaces to it, making it grow exponentially.

    This is the snippet;

    xmlstr = ""
    with open(filename, "rt") as f:
        for line in f:
            s = line.strip(' \t\n')
            if s:
                xmlstr += s + ' '  # space needed for spanning text nodes

    And then I simply use parseString instead of parse. But honestly, I
    think it is a bug, because the XML standard also says that whitespaces
    before normal text should be ignored, and I do not see it back as text
    when I read the node, so why preserve it and mess up the formatting in
    the end?

    Regards,
    - Jorgen




     
    Jorgen Bodde, Oct 3, 2007
    #3
  4. Paul Boddie

    Paul Boddie Guest

    On 3 Oct, 11:30, "Jorgen Bodde" <> wrote:
    >
    > Thank you for confirming this, I did manage a work around. When
    > reading back the XML file, I strip it of its whitespace before I
    > parse it. Then when writing it back no excessive whitespaces are
    > appended. My best guess is that toprettyxml is not intelligently
    > handling whitespaces that are already there, and bluntly appends more
    > whitespaces to it, making it grow exponentially.


    This seems like a reasonable explanation without having looked at the
    source code myself.

    [...]

    > And then I simply use parseString instead of parse. But honestly, I
    > think it is a bug, because the XML standard also says that whitespaces
    > before normal text should be ignored, and I do not see it back as text
    > when I read the node, so why preserve it and mess up the formatting in
    > the end?


    Which part of the standard is this? Here's the XML 1.0 specification's
    section on whitespace:

    http://www.w3.org/TR/2006/REC-xml-20060816/#sec-white-space

    It seems to me that applications (and the libraries which serve them)
    can choose what to do unless xml:space is set to "preserve". It does
    seem odd that the toprettyxml method chooses to respect existing
    whitespace whilst also disrupting it by adding more, however.

    Paul
     
    Paul Boddie, Oct 3, 2007
    #4
  5. Jorgen Bodde

    Jorgen Bodde Guest

    Hi Paul,

    > This seems like a reasonable explanation without having looked at the
    > source code myself.


    It's by thorough investigation ;-)

    > Which part of the standard is this? Here's the XML 1.0 specification's
    > section on whitespace:
    >
    > http://www.w3.org/TR/2006/REC-xml-20060816/#sec-white-space


    Well 2.10 if I quote:

    <quote>
    Such white space is typically not intended for inclusion in the
    delivered version of the document. On the other hand, "significant"
    white space that should be preserved in the delivered version is
    common, for example in poetry and source code.
    </quote>

    I interpret "significant" whitespaces as the ones between the words,
    if whitespaces occur at the beginning of a line due to an indent like

    <value>
    This is indented text
    </value>

    We can assume that the spaces in front of it are not significant
    whitespaces. Because when I read the text node in python and it is not
    included, I see no reason why it should be preserved. And if it is
    preserved in the xml DOM, toprettyxml should first investigate how
    many whitespaces are already there before adding more to indent the
    text.

    Also this happens. First the nodes are properly shown:

    <value>
    <a> ... </a>
    </value>
    <value>
    <a> ... </a>
    </value>

    When writing back, this sometimes happens (mind the blank lines):

    <value>
    <a> ... </a>
    </value>

    <value>
    <a> ... </a>
    </value>

    And the next time, the space between the nodes is expanded again:

    <value>
    <a> ... </a>
    </value>


    <value>
    <a> ... </a>
    </value>

    (etc) .. so when reading, modifying, writing XML files, the empty
    blank lines will grow exponentially.

    > It seems to me that applications (and the libraries which serve them)
    > can choose what to do unless xml:space is set to "preserve". It does
    > seem odd that the toprettyxml method chooses to respect existing
    > whitespace whilst also disrupting it by adding more, however.


    I would think (simplistic I'm sure) that if spaces are that important,
    you can always use a CDATA tag which should treat the text inside as
    raw data without any formatting and whitespace changes.

    Should I file this as a bug to be solved? I have my workaround now,
    but I read online that more people seem to have run into this.

    Regards,
    - Jorgen
     
    Jorgen Bodde, Oct 3, 2007
    #5
  6. Jim

    Jim Guest

    On Oct 3, 6:18 am, "Jorgen Bodde" <> wrote:
    > Should I file this as a bug to be solved? I have my workaround now,
    > but I read online that more people seem to have run into this.

    Perhaps it is not a bug in that it does not violate the standard. But
    I know that I have been annoyed by it any number of times. I think it
    is fair to say that it violates the principle of least surprise.

    IMHO "<action><p>Then a shot rang out.\nHe shouted.</p></action>"
    should be pretty-printed as

    <action>
    <p>Then a shot rang out.
    He shouted.</p>
    </action>

    That is, I perceive that the "right" behavior is to not add white
    space to the textual data.
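
As far as I can tell, newer CPython releases behave this way for exactly this case: when an element's only child is a text node, minidom's writer emits the text without added indentation or newlines. A minimal check, assuming a current Python 3 interpreter (the 2007-era minidom discussed in this thread behaved differently):

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<action><p>Then a shot rang out.\nHe shouted.</p></action>"
)
pretty = doc.toprettyxml(indent="  ")

# The text of <p> should appear unchanged, embedded newline included.
print(pretty)
```

This only covers elements with a single text child; whitespace-only text nodes between elements are still written with extra indentation.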

    No doubt this is a matter of taste and of intended audience (and maybe
    there are complications that I don't see). But let me urge you to
    send the maintainers something.

    Jim Hefferon
     
    Jim, Oct 3, 2007
    #6
  7. On Wed, 03 Oct 2007 12:18:45 +0200, Jorgen Bodde wrote:

    >> Which part of the standard is this? Here's the XML 1.0 specification's
    >> section on whitespace:
    >>
    >> http://www.w3.org/TR/2006/REC-xml-20060816/#sec-white-space

    >
    > Well 2.10 if I quote:
    >
    > <quote>
    > Such white space is typically not intended for inclusion in the
    > delivered version of the document. On the other hand, "significant"
    > white space that should be preserved in the delivered version is
    > common, for example in poetry and source code.
    > </quote>
    >
    > I interpret "significant" whitespaces as the ones between the words,
    > if whitespaces occur at the beginning of a line due to an indent like


    Significant whitespace is all whitespace in nodes that may contain text.
    You need a DTD or schema to decide this, that's why all pretty printing
    without a DTD or schema is broken IMHO. Because you then simply don't
    know if it is safe to strip or add whitespace.

    > <value>
    > This is indented text
    > </value>
    >
    > We can assume that the spaces in front of it are not significant
    > whitespaces.


    I can't. You are just guessing.

    > Because when I read the text node in python and it is not
    > included, I see no reason why it should be preserved.


    But it should be included.
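
A quick way to see that minidom itself does include it, sketched with the example element from the quoted post (the four-space indent is an assumption, since the original indentation did not survive):

```python
from xml.dom import minidom

doc = minidom.parseString("<value>\n    This is indented text\n</value>")
text = doc.documentElement.firstChild.data

# repr() makes the leading newline and spaces visible.
print(repr(text))
```

If the leading whitespace is missing when the node is read, something in the application code is stripping it, not the parser.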

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Oct 3, 2007
    #7
  8. > <quote>
    > Such white space is typically not intended for inclusion in the
    > delivered version of the document. On the other hand, "significant"
    > white space that should be preserved in the delivered version is
    > common, for example in poetry and source code.
    > </quote>
    >
    > I interpret "significant" whitespaces as the ones between the words,


    This interpretation is incorrect. It's not really possible to tell what
    whitespace is significant from looking just at the document; the
    classification into "significant" and "insignificant" is up to the
    application, not the XML processor.

    There is also the concept of "ignorable" white space in SAX (and other
    APIs); by this, white space in element content is meant. This is
    supported by the XML recommendation with the sentence
    "A validating XML processor MUST also inform the application which of
    these characters constitute white space appearing in element content."
    (you can only know if it's in element content if you validate)

    > We can assume that the spaces in front of it are not significant
    > whitespaces.


    No, we cannot. Maybe your application can assume that; the XML
    processor cannot. In fact, the XML recommendation FORBIDS the XML processor
    from stripping white space.

    > (etc) .. so when reading, modifying, writing XML files, the empty
    > blank lines will grow exponentially.


    Not sure why you keep saying that growth is exponential; I believe
    it's linear (with the number of read-write-cycles), not exponential.
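
A rough way to check this, assuming a small made-up document and the default toprettyxml() arguments: run several parse/write cycles and compare the output sizes. If every cycle adds the same number of characters, the growth is linear in the number of cycles.

```python
from xml.dom import minidom

xml_text = "<root>\n  <a>x</a>\n</root>"
sizes = []
for _ in range(6):
    # One read -> pretty-print cycle; feed the output back in next time.
    xml_text = minidom.parseString(xml_text).toprettyxml()
    sizes.append(len(xml_text))

# Difference in size between consecutive cycles.
deltas = [b - a for a, b in zip(sizes, sizes[1:])]
print(sizes)
print(deltas)
```

Each whitespace-only text node picks up a fixed amount of indentation and newline per cycle, which is why the increments stay constant instead of doubling.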

    > I would think (simplistic I'm sure) that if spaces are that important,
    > you can always use a CDATA tag which should treat the text inside as
    > raw data without any formatting and whitespace changes.


    That is definitely simplistic. CDATA has no significance on formatting.
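
To illustrate with a small sketch (the element name and text are made up): a CDATA section only changes how markup characters are written in the file. The parser hands the application the same character data either way, whitespace included.

```python
import xml.etree.ElementTree as ET

plain = ET.fromstring("<value>  indented text  </value>")
cdata = ET.fromstring("<value><![CDATA[  indented text  ]]></value>")

# Both elements carry identical text, leading and trailing spaces intact.
print(repr(plain.text), repr(cdata.text))
```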

    > Should I file this as a bug to be solved? I have my workaround now,
    > but I read online that more people seem to have run into this.


    Feel free to come up with a patch. It is questionable whether a bug
    report will help; there is a good chance that it stays open for several
    years.

    Regards,
    Martin
     
    Martin v. Löwis, Oct 3, 2007
    #8
  9. Legrandin

    Legrandin Guest

    Hi Jorgen,

    > I parse an XML file, replace a node with a new one (like updating cache)
    > and write it back. Every write, new spaces are added.

    [ ... ]
    > And this goes on. The node is one that is not touched in the XML, it is
    > simply written back after reading. I have the same with void spaces in
    > between the nodes, I managed to compensate that by stripping the lines.


    Before calling toxml/toprettyxml, I strip (with rstrip and lstrip) all
    text nodes and take care of removing all the empty ones.

    Of course, this is feasible only if whitespace (space, tab, newline) is
    not meaningful for the application.
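
The approach above could look roughly like this. It is a sketch, not Legrandin's actual code: the helper names `strip_text_nodes` and `pretty` and the sample document are assumptions made for illustration.

```python
from xml.dom import minidom, Node

def strip_text_nodes(node):
    """Strip every text node below `node` and drop the ones left empty."""
    for child in list(node.childNodes):  # copy, since we remove while iterating
        if child.nodeType == Node.TEXT_NODE:
            child.data = child.data.strip()
            if not child.data:
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_text_nodes(child)

def pretty(xml_text):
    """Parse, remove formatting whitespace, then pretty-print."""
    doc = minidom.parseString(xml_text)
    strip_text_nodes(doc.documentElement)
    return doc.toprettyxml(indent="  ")

first = pretty("<root>\n  <a>  x  </a>\n\n  <b/>\n</root>")
second = pretty(first)  # a second cycle should leave the output unchanged
print(first)
```

This is lossy by design: in mixed content such as `<p>hello <b>world</b></p>` the space after "hello" would be stripped, so it only suits documents where that whitespace carries no meaning.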

    Legrandin
     
    Legrandin, Oct 3, 2007
    #9
  10. Jorgen Bodde

    Jorgen Bodde Guest

    Dear list,

    Thanks for the suggestions and clarification. After playing with XML
    for a while I noticed whitespace can indeed be more important than I
    thought. I came to the following conclusions:

    1. Removing whitespaces was done by my code, not by the
    xml.dom.minidom so I regret the fact I said that it removed
    whitespaces automatically
    2. toprettyxml() should however be smarter with outputting the XML. If
    it adds whitespace for the sake of formatting, it should check how
    many of the whitespaces are already there. Consecutive read / modify /
    write actions should not cause an explosive growth of whitespaces.
    When I use toprettyxml() I am obviously not interested in whitespaces
    in front of the text in the nodes, or else I would have outputted it
    differently.

    Thanks all for the feedback,
    - Jorgen
     
    Jorgen Bodde, Oct 9, 2007
    #10
