xml parsing escape characters

Luis P. Mendes

Hi,

I only know a little bit of XML and I'm trying to parse an XML document
in order to save its elements in a file (dictionaries inside a list).

When I access a url from python 2.3.3 running in Linux with the
following lines:
import urllib
from xml.dom import minidom
resposta = urllib.urlopen(url)
xmldoc = minidom.parse(resposta)
resposta.close()

I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>
_____________________________________________________________

In the lines below, I try to get all the child nodes of string, first
by counting them, and then ignoring the ones that are just \n:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
dataSetNode = stringNode.childNodes[0]
numNos = len(dataSetNode.childNodes)
todosNos={}
for no in range(numNos):
    todosNos[no] = dataSetNode.childNodes[no].toxml()
posicaoXml = [no for no in todosNos.keys() if len(todosNos[no])>4]
print posicaoXml

(I'm almost sure there's a simpler way to do this...)
_____________________________________________________________

I don't get any elements. But, if I access the same url via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
  <DataSet>
    <Order>
      <Customer>439</Customer>
(... others ...)
    </Order>
  </DataSet>
</string>

and the lines I posted work as intended.

I already browsed the web, I know it's about the escape characters, but
I didn't find a simple solution for this.

I tried to use LL2XML.py and unescape function with a simple replace
text = text.replace("&lt;", "<")
but I had to convert the XML document to a string, and then I could not
work out how to convert it back to an XML object.

How can I solve this? Please explain it bearing in mind that I'm just
beginning with XML and I'm not very experienced in Python either.


Luis
 
Martin v. Löwis

Luis said:
I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
  &lt;Order&gt;

Most likely, this result is correct, and your document
really does contain

&lt;Order&gt;

I don't get any elements. But, if I access the same url via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
  <DataSet>

Most likely, your browser is incorrect (or at least confusing), and
renders &lt; as "<", even though this is not markup.

I already browsed the web, I know it's about the escape characters, but
I didn't find a simple solution for this.

Not sure what "this" is. AFAICT, everything works correctly.

Regards,
Martin
 
Luis P. Mendes

this is the xml document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

When I do:

print xmldoc.toxml()

it prints:
<?xml version="1.0" ?>
<string xmlns="http://www...">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;

  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

__________________________________________________________
with: stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
I get:
<string xmlns="http://www.......">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;

  &lt;/Order&gt;
&lt;/DataSet&gt;</string>
______________________________________________________________________

with: DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;

  &lt;/Order&gt;
&lt;/DataSet&gt;
_______________________________________________________________-

so far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple?
Why doesn't it return:
&lt;Order&gt;
&lt;Customer&gt;439&lt;/Customer&gt;

&lt;/Order&gt;
??
 
Kent Johnson

Luis said:
this is the xml document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

This is an XML document containing a single tag, <string>, whose content is text containing
entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
<string> tag to be able to treat it as structured XML.

Kent
 
Irmen de Jong

Kent Johnson wrote:
[...]
This is an XML document containing a single tag, <string>, whose content
is text containing entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>,
<Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to
unescape the contents of the <string> tag to be able to treat it as
structured XML.

The unescaping is usually done for you by the xml parser that you use.

--Irmen
 
Kent Johnson

Irmen said:
Kent Johnson wrote:
[...]
This is an XML document containing a single tag, <string>, whose
content is text containing entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>,
<Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to
unescape the contents of the <string> tag to be able to treat it as
structured XML.


The unescaping is usually done for you by the xml parser that you use.

Yes, so if your XML contains for example
<stuff>&lt;not a tag&gt;</stuff>

and you parse this and ask for the *text* content of the <stuff> tag, you will get the string
"<not a tag>"

but it's still *not* a tag. If you try to get child elements of the <stuff> element there will be none.

This is exactly the confusion the OP has.
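
Kent's point can be checked directly. The following is a minimal sketch using modern Python 3 and the standard-library xml.dom.minidom (the thread itself uses Python 2.3); the <stuff> document is the one from his message.

```python
from xml.dom import minidom

doc = minidom.parseString("<stuff>&lt;not a tag&gt;</stuff>")
stuff = doc.documentElement

# The parser has already unescaped the entities inside the text content.
text = "".join(node.data for node in stuff.childNodes
               if node.nodeType == node.TEXT_NODE)
print(repr(text))

# No child *elements* were created from that text, though.
child_elements = [node for node in stuff.childNodes
                  if node.nodeType == node.ELEMENT_NODE]
print(child_elements)
```

The text comes back with real angle brackets in it, yet the list of child elements stays empty, which matches the behaviour the OP reported.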
 
Martin v. Löwis

Luis said:
with: DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;

  &lt;/Order&gt;
&lt;/DataSet&gt;
_______________________________________________________________-

so far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple?

The DataSetNode has no children, because it is not
an Element node, but a Text node. In XML, an element
is denoted by

<DataSet>...</DataSet>

and *not* by

&lt;DataSet&gt;...&lt;/DataSet&gt;

The latter is just a single string, represented
in XML as a Text node. It does not give you any
hierarchy whatsoever.

As a text node does not have any children, its
childNodes member is an empty tuple; indexing
that tuple gives you an IndexError.
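
A small sketch of how one might check the node type before indexing, assuming Python 3 and the standard-library minidom. The shortened <string> document (without a namespace) is made up for illustration.

```python
from xml.dom import minidom, Node

xml_text = "<string>&lt;DataSet&gt;&lt;/DataSet&gt;</string>"
string_node = minidom.parseString(xml_text).documentElement
first_child = string_node.childNodes[0]

# The only child of <string> is a Text node, not an Element node.
is_text = first_child.nodeType == Node.TEXT_NODE
child_count = len(first_child.childNodes)
print(is_text, child_count)

# Indexing the empty childNodes sequence is what raises the IndexError.
try:
    first_child.childNodes[0]
    raised = False
except IndexError:
    raised = True
print(raised)
```

Checking nodeType (or len(childNodes)) first avoids the exception and makes it obvious which kind of node is being handled.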

Regards,
Martin
 
Martin v. Löwis

Irmen said:
The unescaping is usually done for you by the xml parser that you use.

Usually, but not in this case. If you have a text that looks like
XML, and you want to put it into an XML element, the XML file uses
&lt; and &gt;. The XML parser unescapes that as < and >. However, it
does not then consider the < and > as markup, and it shouldn't.

Regards,
Martin
 
Irmen de Jong

Martin said:
Usually, but not in this case. If you have a text that looks like
XML, and you want to put it into an XML element, the XML file uses
&lt; and &gt;. The XML parser unescapes that as < and >. However, it
does not then consider the < and > as markup, and it shouldn't.

That's also what I said?

The unescaping of the XML entities in the contents of the OP's
<string> element is done for you by the parser,
so you will get a text node with the <,>,&,whatever in there.
The OP probably wants to feed that to a new xml parser instance
to process it as markup.
Or perhaps the way the original XML document is constructed is
flawed.

--Irmen
 
Martin v. Löwis

Irmen said:
That's also what I said?

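You said it in response to

"You need to unescape the contents of the <string> tag to be able to
treat it as structured XML."

In that context, I interpreted

"The unescaping is usually done for you by the xml parser that you use."

as "The parser should have done what you want; if the parser didn't,
that is a bug in the parser".

The OP probably wants to feed that to a new xml parser instance
to process it as markup.
Or perhaps the way the original XML document is constructed is
flawed.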

Either of these, indeed - probably the latter.

Regards,
Martin
 
Luis P. Mendes

I would like to thank everyone for your answers, but I'm not seeing the
light yet!

When I access the url via the Firefox browser and look into the source
code, I also get:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http................">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

Should I take the contents of the string tag, which is text, replace
all '&lt;' with '<' and '&gt;' with '>', and then read it with xml.dom.minidom?
How would I do that?

Or should I use another parser that accomplishes the task with no need
to replace the escaped characters?
 
Martin v. Löwis

Luis said:
When I access the url via the Firefox browser and look into the source
code, I also get:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http................">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

Please do try to understand what you are seeing. This is crucial for
understanding what happens.

You may have the understanding that XML can be represented as a tree.
This would be good - if not, please read a book that explains why
XML can be considered as a tree.

In the tree, you have inner nodes, and leaf nodes. For example,
the document

<a>
<b>Hello</b>
<c>World</c>
</a>

has 5 nodes (ignoring whitespace content):

Element:a ---- Element:b ---- Text:"Hello"
|
\-- Element:c ---- Text:"World"

So the leaf nodes are typically Text nodes (unless you
have an empty element). Your document has this structure:

Element:string ---- Text:"""<DataSet>
<Order>
<Customer>439</Customer>
</Order>
</DataSet>"""

So the ***TEXT*** contains the letter "<", just like it contains
the letters "O" and "r". There IS no element Order in your document,
no matter how hard you look.

If you want a DataSet *element* in your document, it should
read

<string xmlns="...">
<DataSet>
<Order>
<Customer>439</Customer>
</Order>
</DataSet>
</string>

As this is the document you apparently want to process, complain
to whoever gave you that other document.

should I take the contents of the string tag that is text and replace
all '&lt;' with '<' and '&gt;' with '>' and then read it with xml.minidom?

No. We still don't know what you want to achieve, so it is difficult to
advise you what to do. My best advice is that whoever generates the XML
document should fix it.

or should I use another parser that accomplishes the task with no need
to replace the escaped characters?

No. The parser is working correctly.

The document you got can also be interpreted as containing another
XML document as a text. This is evil, but apparently people are doing
it, anyway. If you really want that embedded document, you need
first to extract it.

To see what I mean, do

print DataSetNode.data

The .data attribute gives you the string contents of
a text node. You could use this as an XML document, and
pass it to an XML parser again. This would be ugly,
but might be your only choice if the producer of the
document is unwilling to adjust.
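
The two-step approach Martin describes might look roughly like this. It is a sketch in modern Python 3 with the standard-library minidom, not the OP's Python 2.3 setup; the namespace URI "http://example.com/ns" is a placeholder, because the real one is elided in the thread, and the sample document is typed in rather than downloaded.

```python
from xml.dom import minidom

# Stand-in for the downloaded document (placeholder namespace URI).
outer_xml = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<string xmlns="http://example.com/ns">&lt;DataSet&gt;\n'
    '  &lt;Order&gt;\n'
    '    &lt;Customer&gt;439&lt;/Customer&gt;\n'
    '  &lt;/Order&gt;\n'
    '&lt;/DataSet&gt;</string>'
)

# Step 1: parse the wrapper and collect the text inside <string>.
outer = minidom.parseString(outer_xml.encode("utf-8"))
string_node = outer.documentElement
embedded = "".join(node.data for node in string_node.childNodes
                   if node.nodeType == node.TEXT_NODE)

# Step 2: parse that text as a second, separate XML document.
inner = minidom.parseString(embedded)

# Build one dictionary per <Order>, as the OP wanted.
orders = []
for order in inner.getElementsByTagName("Order"):
    record = {}
    for child in order.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            record[child.tagName] = child.firstChild.data if child.firstChild else ""
    orders.append(record)

print(orders)
```

No manual replace of &lt; or &gt; is needed: the first parse already unescapes the text, and the second parse turns it into elements.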

Regards,
Martin
 
Jeremy Bowers

Martin said:
Please do try to understand what you are seeing. This is crucial for
understanding what happens.

From extremely painful and lengthy personal experience, Luis, I
***extremely*** strongly recommend taking the time to nail this down until
you really, really, really understand what is going on. Until you can
explain it to somebody else coherently, ideally.

Mixing escaping levels like this absolutely, positively *must* be done
correctly, or extremely-painful-to-debug problems will result.

(My painful experience was layering an RPC implementation in plain text on
top of IM messages, where I was dealing with everything from the socket
level up except the XML parser. Ultimately it turned out there was a
problem in the XML parser, it rendered "&amp;amp;" as "&", which is wrong
wrong wrong. But that took a *long* time to find, especially as I had
other bugs in the way.)

Since you're layering XML in XML, test &amp;amp; and &amp;amp;amp; to make
sure they work correctly; those usually show encoding errors. And, given
your current understanding of the issue, do not write your own decoding
function unless you absolutely can't avoid it.
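
As a rough illustration of the layering test suggested above, here is a sketch using the standard library's xml.sax.saxutils helpers and minidom (Python 3). The helper name text_of and the <a> wrapper element are made up for the example.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, unescape

# Each escaping layer turns every "&" into "&amp;".
once = escape("&")
twice = escape(once)
thrice = escape(twice)
print(once, twice, thrice)

# Each unescaping layer must remove exactly one level, never more.
back_one = unescape(thrice)
back_two = unescape(back_one)

def text_of(xml_text):
    """Return the text content of the root element after one parse."""
    root = minidom.parseString(xml_text).documentElement
    return "".join(node.data for node in root.childNodes
                   if node.nodeType == node.TEXT_NODE)

# A correct parser removes one level per parse as well.
parsed_twice = text_of("<a>" + twice + "</a>")
parsed_thrice = text_of("<a>" + thrice + "</a>")
print(parsed_twice, parsed_thrice)
```

A parser or decoding function that collapses two levels in one pass would fail these checks, which is the kind of bug described in the message above.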
 
Luis P. Mendes

From your experience, do you think that, if this wrong XML code was
meant to be read only by some kind of Microsoft parser, the error would
not occur?

I'll try to explain:

The XML producer writes the code on a Windows platform and 'thinks' that every
client will read/parse the code with a specific Windows parser. Could
that (wrong) XML code parse correctly in that kind of specific Windows
client?

Or in other words:

Do you know any Windows parser that could turn that erroneous encoding
into an XML tree, with four or five inner levels of tags?

I'd like to thank everyone for taking the time to answer me.


Luis
 
Fredrik Lundh

Luis said:
xml producer writes the code in Windows platform and 'thinks' that every
client will read/parse the code with a specific Windows parser. Could
that (wrong) XML code parse correctly in that kind of specific Windows
client?

not if it's an XML parser.

Do you know any windows parser that could turn that erroneous encoding
to a xml tree, with four or five inner levels of tags?

any parser *can* do that, but I doubt many parsers will do it unless
you ask them to (by extracting the string and parsing it again). here's the
elementtree version:

import urllib
from elementtree.ElementTree import parse, XML

wrapper = parse(urllib.urlopen(url))
# <string> is the root element here, so take its text directly
# (findtext() only searches below the root)
dataset = XML(wrapper.getroot().text)

</F>
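
For readers on current Python, the elementtree package has since been included in the standard library as xml.etree.ElementTree. A sketch of the same extract-and-reparse idea with it follows; the namespace URI is a placeholder (the real one is elided in the thread) and the sample document is typed in rather than fetched from a URL.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # hypothetical; the thread elides the real URI

# Stand-in for the downloaded response body.
outer_xml = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<string xmlns="' + NS + '">&lt;DataSet&gt;\n'
    '  &lt;Order&gt;\n'
    '    &lt;Customer&gt;439&lt;/Customer&gt;\n'
    '  &lt;/Order&gt;\n'
    '&lt;/DataSet&gt;</string>'
)

# First parse: the root element is <string>, and its text is the embedded XML.
wrapper = ET.fromstring(outer_xml.encode("utf-8"))

# Second parse: turn that text into a real element tree.
dataset = ET.fromstring(wrapper.text)

customers = [customer.text for customer in dataset.iter("Customer")]
print(wrapper.tag, dataset.tag, customers)
```

The embedded document carries no namespace of its own in this sample, so its tags can be matched by plain name.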
 
Guest

Luis said:
From your experience, do you think that, if this wrong XML code was
meant to be read only by some kind of Microsoft parser, the error would
not occur?

This is very unlikely. MSXML would never do this incorrectly.

Regards,
Martin
 
