xml parsing escape characters

Discussion in 'Python' started by Luis P. Mendes, Jan 19, 2005.

  1. Hi,

    I only know a little bit of XML and I'm trying to parse an XML document
    in order to save its elements in a file (dictionaries inside a list).

    When I access a URL from Python 2.3.3 running on Linux with the
    following lines:
    resposta = urllib.urlopen(url)
    xmldoc = minidom.parse(resposta)
    resposta.close()

    I get the following result:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    (... others ...)
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>
    _____________________________________________________________

    In the lines below, I try to get all the child nodes from string, first
    by counting them, and then ignoring the \n ones:

    stringNode = xmldoc.childNodes[0]
    print stringNode.toxml()
    dataSetNode = stringNode.childNodes[0]
    numNos = len(dataSetNode.childNodes)
    todosNos={}
    for no in range(numNos):
        todosNos[no] = dataSetNode.childNodes[no].toxml()
    posicaoXml = [no for no in todosNos.keys() if len(todosNos[no])>4]
    print posicaoXml

    (I'm almost sure there's a simpler way to do this...)
    _____________________________________________________________

    I don't get any elements. But, if I access the same url via a browser,
    the result in the browser window is something like:

    <string xmlns="http://www......">
    ~ <DataSet>
    ~ <Order>
    ~ <Customer>439</Customer>
    (... others ...)
    ~ </Order>
    ~ </DataSet>
    </string>

    and the lines I posted work as intended.

    I already browsed the web, I know it's about the escape characters, but
    I didn't find a simple solution for this.

    I tried to use LL2XML.py and an unescape function with a simple replace
    text = text.replace("&lt;", "<")
    but I had to convert the XML document to a string, and then I could not (or
    don't know how to) convert it back to an XML object.

    How can I solve this? Please explain it keeping in mind that I'm just
    beginning with XML and I'm not very experienced in Python, either.


    Luis
    Luis P. Mendes, Jan 19, 2005
    #1

  2. Luis P. Mendes wrote:
    > I get the following result:
    >
    > <?xml version="1.0" encoding="utf-8"?>
    > <string xmlns="http://www......">&lt;DataSet&gt;
    > ~ &lt;Order&gt;


    Most likely, this result is correct, and your document
    really does contain

    &lt;Order&gt;


    > I don't get any elements. But, if I access the same url via a browser,
    > the result in the browser window is something like:
    >
    > <string xmlns="http://www......">
    > ~ <DataSet>


    Most likely, your browser is incorrect (or at least confusing), and
    renders &lt; as "<", even though this is not markup.

    > I already browsed the web, I know it's about the escape characters, but
    > I didn't find a simple solution for this.


    Not sure what "this" is. AFAICT, everything works correctly.

    Regards,
    Martin
    Martin v. Löwis, Jan 19, 2005
    #2

  3. This is the XML document:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    (... others ...)
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>

    When I do:

    print xmldoc.toxml()

    it prints:
    <?xml version="1.0" ?>
    <string xmlns="http://www...">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;

    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>

    __________________________________________________________
    with: stringNode = xmldoc.childNodes[0]
    print stringNode.toxml()
    I get:
    <string xmlns="http://www.......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;

    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>
    ______________________________________________________________________

    with: DataSetNode = stringNode.childNodes[0]
    print DataSetNode.toxml()

    I get:

    &lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;

    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;
    _______________________________________________________________-

    so far so good, but when I issue the command:

    print DataSetNode.childNodes[0]

    I get:
    IndexError: tuple index out of range

    Why the error, and why does it return a tuple?
    Why doesn't it return:
    &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;

    &lt;/Order&gt;
    ??
    Luis P. Mendes, Jan 20, 2005
    #3
  4. Luis P. Mendes wrote:
    > this is the xml document:
    >
    > <?xml version="1.0" encoding="utf-8"?>
    > <string xmlns="http://www......">&lt;DataSet&gt;
    > ~ &lt;Order&gt;
    > ~ &lt;Customer&gt;439&lt;/Customer&gt;
    > (... others ...)
    > ~ &lt;/Order&gt;
    > &lt;/DataSet&gt;</string>


    This is an XML document containing a single tag, <string>, whose content is text containing
    entity-escaped XML.

    This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

    All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
    <string> tag to be able to treat it as structured XML.

    Kent
    Kent Johnson, Jan 20, 2005
    #4
  5. Kent Johnson wrote:
    [...]
    > This is an XML document containing a single tag, <string>, whose content
    > is text containing entity-escaped XML.
    >
    > This is *not* an XML document containing tags <DataSet>, <Order>,
    > <Customer>, etc.
    >
    > All the behaviour you are seeing is a consequence of this. You need to
    > unescape the contents of the <string> tag to be able to treat it as
    > structured XML.


    The unescaping is usually done for you by the xml parser that you use.

    --Irmen
    Irmen de Jong, Jan 20, 2005
    #5
  6. Irmen de Jong wrote:
    > Kent Johnson wrote:
    > [...]
    >
    >> This is an XML document containing a single tag, <string>, whose
    >> content is text containing entity-escaped XML.
    >>
    >> This is *not* an XML document containing tags <DataSet>, <Order>,
    >> <Customer>, etc.
    >>
    >> All the behaviour you are seeing is a consequence of this. You need to
    >> unescape the contents of the <string> tag to be able to treat it as
    >> structured XML.

    >
    >
    > The unescaping is usually done for you by the xml parser that you use.


    Yes, so if your XML contains for example
    <stuff>&lt;not a tag&gt;</stuff>

    and you parse this and ask for the *text* content of the <stuff> tag, you will get the string
    "<not a tag>"

    but it's still *not* a tag. If you try to get child elements of the <stuff> element there will be none.

    This is exactly the confusion the OP has.
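
The distinction can be checked directly. This is a minimal sketch, assuming Python 3 and the standard-library xml.dom.minidom (the thread itself used Python 2.3); the variable names are illustrative only:

```python
from xml.dom import minidom

doc = minidom.parseString("<stuff>&lt;not a tag&gt;</stuff>")
stuff = doc.documentElement

# The parser has already unescaped the entities in the text content...
text = stuff.firstChild.data
print(text)

# ...but the result is character data, not markup: <stuff> has no child elements.
child_elements = [n for n in stuff.childNodes if n.nodeType == n.ELEMENT_NODE]
print(len(child_elements))
```

The only child of <stuff> is a single Text node, which is why asking for child elements finds nothing.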

    >
    > --Irmen
    Kent Johnson, Jan 20, 2005
    #6
  7. Luis P. Mendes wrote:
    > with: DataSetNode = stringNode.childNodes[0]
    > print DataSetNode.toxml()
    >
    > I get:
    >
    > &lt;DataSet&gt;
    > ~ &lt;Order&gt;
    > ~ &lt;Customer&gt;439&lt;/Customer&gt;
    >
    > ~ &lt;/Order&gt;
    > &lt;/DataSet&gt;
    > _______________________________________________________________-
    >
    > so far so good, but when I issue the command:
    >
    > print DataSetNode.childNodes[0]
    >
    > I get:
    > IndexError: tuple index out of range
    >
    > Why the error, and why does it return a tuple?


    The DataSetNode has no children, because it is not
    an Element node, but a Text node. In XML, an element
    is denoted by

    <DataSet>...</DataSet>

    and *not* by

    &lt;DataSet&gt;...&lt;/DataSet&gt;

    The latter is just a single string, represented
    in XML as a Text node. It does not give you any
    hierarchy whatsoever.

    As a text node does not have any children, its
    childNodes member is an empty tuple; indexing
    that tuple gives you an IndexError.
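
A small sketch of how to check the node type before descending, assuming Python 3 and xml.dom.minidom. The namespace URI below is a placeholder, since the real one is elided in the thread:

```python
from xml.dom import minidom

# Placeholder namespace; the original URI is not shown in the thread.
doc = minidom.parseString(
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;&lt;/DataSet&gt;</string>'
)
node = doc.documentElement.childNodes[0]

# A Text node reports TEXT_NODE and has an empty childNodes sequence.
is_text = node.nodeType == node.TEXT_NODE
print(is_text, len(node.childNodes))

raised = False
try:
    node.childNodes[0]
except IndexError:
    # Same failure as in the original post: nothing to index into.
    raised = True
print(raised)
```

Testing nodeType first avoids indexing blindly into a node that cannot have children.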

    Regards,
    Martin
    Martin v. Löwis, Jan 20, 2005
    #7
  8. Irmen de Jong wrote:
    > The unescaping is usually done for you by the xml parser that you use.


    Usually, but not in this case. If you have a text that looks like
    XML, and you want to put it into an XML element, the XML file uses
    &lt; and &gt;. The XML parser unescapes that as < and >. However, it
    does not then consider the < and > as markup, and it shouldn't.

    Regards,
    Martin
    Martin v. Löwis, Jan 20, 2005
    #8
  9. Martin v. Löwis wrote:
    > Irmen de Jong wrote:
    >
    >> The unescaping is usually done for you by the xml parser that you use.

    >
    >
    > Usually, but not in this case. If you have a text that looks like
    > XML, and you want to put it into an XML element, the XML file uses
    > &lt; and &gt;. The XML parser unescapes that as < and >. However, it
    > does not then consider the < and > as markup, and it shouldn't.


    That's also what I said?

    The unescaping of the XML entities in the contents of the OP's
    <string> element is done for you by the parser,
    so you will get a text node with the <,>,&,whatever in there.
    The OP probably wants to feed that to a new xml parser instance
    to process it as markup.
    Or perhaps the way the original XML document is constructed is
    flawed.

    --Irmen
    Irmen de Jong, Jan 20, 2005
    #9
  10. Irmen de Jong wrote:
    >> Usually, but not in this case. If you have a text that looks like
    >> XML, and you want to put it into an XML element, the XML file uses
    >> &lt; and &gt;. The XML parser unescapes that as < and >. However, it
    >> does not then consider the < and > as markup, and it shouldn't.

    >
    >
    > That's also what I said?


    You said it in response to

    >>> All the behaviour you are seeing is a consequence of this. You need
    >>> to unescape the contents of the <string> tag to be able to treat it
    >>> as structured XML.


    In that context, I interpreted

    >> The unescaping is usually done for you by the xml parser that you
    >> use.


    as "The parser should have done what you want; if the parser didn't,
    that is a bug in the parser".

    > The OP probably wants to feed that to a new xml parser instance
    > to process it as markup.
    > Or perhaps the way the original XML document is constructed is
    > flawed.


    Either of these, indeed - probably the latter.

    Regards,
    Martin
    Martin v. Löwis, Jan 20, 2005
    #10
  11. I would like to thank everyone for your answers, but I'm not seeing the
    light yet!

    When I access the url via the Firefox browser and look into the source
    code, I also get:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http................">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>

    Should I take the contents of the string tag, which is text, and replace
    all '&lt;' with '<' and '&gt;' with '>' and then read it with xml.dom.minidom?
    How do I do it?

    Or should I use another parser that accomplishes the task with no need
    to replace the escaped characters?
    Luis P. Mendes, Jan 20, 2005
    #11
  12. Luis P. Mendes wrote:
    > When I access the url via the Firefox browser and look into the source
    > code, I also get:
    >
    > <?xml version="1.0" encoding="utf-8"?>
    > <string xmlns="http................">&lt;DataSet&gt;
    > ~ &lt;Order&gt;
    > ~ &lt;Customer&gt;439&lt;/Customer&gt;
    > ~ &lt;/Order&gt;
    > &lt;/DataSet&gt;</string>


    Please do try to understand what you are seeing. This is crucial for
    understanding what happens.

    You may have the understanding that XML can be represented as a tree.
    This would be good - if not, please read a book that explains why
    XML can be considered as a tree.

    In the tree, you have inner nodes, and leaf nodes. For example,
    the document

    <a>
    <b>Hello</b>
    <c>World</c>
    </a>

    has 5 nodes (ignoring whitespace content):

    Element:a ---- Element:b ---- Text:"Hello"
               |
               \-- Element:c ---- Text:"World"

    So the leaf nodes are typically Text nodes (unless you
    have an empty element). Your document has this structure:

    Element:string ---- Text:"""<DataSet>
    <Order>
    <Customer>439</Customer>
    </Order>
    </DataSet>"""

    So the ***TEXT*** contains the letter "<", just like it contains
    the letters "O" and "r". There IS no element Order in your document,
    no matter how hard you look.

    If you want a DataSet *element* in your document, it should
    read

    <string xmlns="...">
    <DataSet>
    <Order>
    <Customer>439</Customer>
    </Order>
    </DataSet>
    </string>

    As this is the document you apparently want to process, complain
    to whoever gave you that other document.

    > should I take the contents of the string tag that is text and replace
    > all '&lt' with '<' and '&gt' with '>' and then read it with xml.minidom?


    No. We still don't know what you want to achieve, so it is difficult to
    advise you what to do. My best advice is that whoever generates the XML
    document should fix it.

    > or should I use another parser that accomplishes the task with no need
    > to replace the escaped characters?


    No. The parser is working correctly.

    The document you got can also be interpreted as containing another
    XML document as a text. This is evil, but apparently people are doing
    it, anyway. If you really want that embedded document, you need
    first to extract it.

    To see what I mean, do

    print DataSetNode.data

    The .data attribute gives you the string contents of
    a text node. You could use this as an XML document, and
    pass it again to an XML parser. This would be ugly,
    but might be your only choice if the producer of the
    document is unwilling to adjust.
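
The two-stage approach described here might look like the following sketch. It assumes Python 3 and xml.dom.minidom, and it uses an inline string with a placeholder namespace as a stand-in for the real service response:

```python
from xml.dom import minidom

# Stand-in for the service response; the namespace URI is a placeholder.
outer_xml = (
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    '&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

# First parse: <string> holds only character data, already unescaped.
outer = minidom.parseString(outer_xml)
embedded = "".join(
    n.data for n in outer.documentElement.childNodes
    if n.nodeType == n.TEXT_NODE
)

# Second parse: the extracted text is now treated as markup.
inner = minidom.parseString(embedded)
customers = [c.firstChild.data for c in inner.getElementsByTagName("Customer")]
print(customers)
```

No manual replace of &lt; or &gt; is needed; the first parse already removed that level of escaping.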

    Regards,
    Martin
    Martin v. Löwis, Jan 20, 2005
    #12
  13. On Thu, 20 Jan 2005 21:54:30 +0100, Martin v. Löwis wrote:

    > Luis P. Mendes wrote:
    >> When I access the url via the Firefox browser and look into the source
    >> code, I also get:
    >>
    >> <?xml version="1.0" encoding="utf-8"?> <string
    >> xmlns="http................">&lt;DataSet&gt; ~ &lt;Order&gt;
    >> ~ &lt;Customer&gt;439&lt;/Customer&gt; ~ &lt;/Order&gt;
    >> &lt;/DataSet&gt;</string>

    >
    > Please do try to understand what you are seeing. This is crucial for
    > understanding what happens.


    From extremely painful and lengthy personal experience, Luis, I
    ***extremely*** strongly recommend taking the time to nail this down until
    you really, really, really understand what is going on. Until you can
    explain it to somebody else coherently, ideally.

    Mixing escaping levels like this absolutely, positively *must* be done
    correctly, or extremely-painful-to-debug problems will result.

    (My painful experience was layering an RPC implementation in plain text on
    top of IM messages, where I was dealing with everything from the socket
    level up except the XML parser. Ultimately it turned out there was a
    problem in the XML parser, it rendered "&amp;amp;" as "&", which is wrong
    wrong wrong. But that took a *long* time to find, especially as I had
    other bugs in the way.)

    Since you're layering XML in XML, test &amp;amp; and &amp;amp;amp; to make
    sure they work correctly; those usually show encoding errors. And, given
    your current understanding of the issue, do not write your own decoding
    function unless you absolutely can't avoid it.
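
A rough way to run the suggested check, assuming Python 3 with the standard-library xml.sax.saxutils and xml.dom.minidom. The sample payload is made up for illustration:

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, unescape

payload = "a & b"          # illustrative data only
once = escape(payload)     # one level of escaping
twice = escape(once)       # escaped again when embedded in an outer document
print(once)
print(twice)

# Each unescape pass must remove exactly one level, never two.
assert unescape(twice) == once
assert unescape(unescape(twice)) == payload

# A correct parser behaves the same way: one parse strips one level.
parsed = minidom.parseString("<x>&amp;amp;</x>").documentElement.firstChild.data
print(parsed)
```

A parser that turned &amp;amp; into a bare & here would be collapsing two levels at once, which is the bug described above.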
    Jeremy Bowers, Jan 21, 2005
    #13
  14. From your experience, do you think that if this wrong XML code were
    meant to be read only by some kind of Microsoft parser, the error would
    not occur?

    I'll try to explain:

    The XML producer writes the code on a Windows platform and 'thinks' that every
    client will read/parse the code with a specific Windows parser. Could
    that (wrong) XML code parse correctly in that kind of specific Windows
    client?

    Or in other words:

    Do you know any Windows parser that could turn that erroneous encoding
    into an XML tree, with four or five inner levels of tags?

    I'd like to thank everyone for taking the time to answer me.


    Luis
    Luis P. Mendes, Jan 21, 2005
    #14
  16. Luis P. Mendes wrote:

    > xml producer writes the code in Windows platform and 'thinks' that every
    > client will read/parse the code with a specific Windows parser. Could
    > that (wrong) XML code parse correctly in that kind of specific Windows
    > client?


    not if it's an XML parser.

    > Do you know any windows parser that could turn that erroneous encoding
    > to a xml tree, with four or five inner levels of tags?


    any parser *can* do that, but I doubt many parsers will do it unless
    you ask it to (by extracting the string and parsing it again). here's the
    elementtree version:

    import urllib
    from elementtree.ElementTree import parse, XML

    # the root element is the <string> wrapper; its text is the embedded document
    wrapper = parse(urllib.urlopen(url))
    dataset = XML(wrapper.getroot().text)
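
For readers on Python 3, the same idea can be sketched with the standard-library xml.etree.ElementTree. The response body and namespace below are placeholders, not the real service output, and the list-of-dictionaries shape is only an assumption about what the original poster wanted:

```python
import xml.etree.ElementTree as ET

# Placeholder response; the real URL and namespace are elided in the thread.
response_body = (
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    '&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

wrapper = ET.fromstring(response_body)  # root is the <string> element
dataset = ET.fromstring(wrapper.text)   # parse the embedded document

# One dictionary per <Order>, keyed by child tag name.
orders = [
    {child.tag: child.text for child in order}
    for order in dataset.findall("Order")
]
print(orders)
```

The embedded document carries no namespace of its own in this sample, so plain tag names such as "Order" match.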

    </F>
    Fredrik Lundh, Jan 21, 2005
    #16
  17. Luis P. Mendes wrote:
    > From your experience, do you think that if this wrong XML code could be
    > meant to be read only by somekind of Microsoft parser, the error will
    > not occur?


    This is very unlikely. MSXML would never do this incorrectly.

    Regards,
    Martin
    Martin v. Löwis, Jan 22, 2005
    #17
