elementtree XML() unicode

Kee Nethery

I'm having an issue with ElementTree's XML() in Python 2.6.4.

This code works fine:

from xml.etree import ElementTree as et
getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
<customer><shipping><state>bobble</state><city>head</city><street>city</street></shipping></customer>'''
theResponseXml = et.XML(getResponse)

This code errors out when it calls et.XML():

from xml.etree import ElementTree as et
getResponse = u'''<?xml version="1.0" encoding="UTF-8"?>
<customer><shipping><state>\ue58d83\ue89189\ue79c8C</state><city>\ue69f8f\ue5b882</city><street>\ue9ab98\ue58d97\ue58fb03</street></shipping></customer>'''
theResponseXml = et.XML(getResponse)

In my real code, I'm pulling the getResponse data from a web page that
returns XML, and when I display it in the browser you can see the
Japanese characters in the data. I've removed everything else from my
code and tried to distill it down to just what is failing. Hopefully I
have not removed something essential.

Why is this not working, and what do I need to do to use ElementTree
with unicode?

Thanks, Kee Nethery
 
John Machin

What you need to do is NOT feed it unicode. You feed it a str object
and it gets decoded according to the encoding declaration found in the
first line. So take the str object that you get from the web (should
be UTF8-encoded already unless the header is lying), and throw that at
ET ... like this:

| Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> from xml.etree import ElementTree as et
| >>> ucode = u'''<?xml version="1.0" encoding="UTF-8"?>
| ... <customer><shipping>
| ... <state>\ue58d83\ue89189\ue79c8C</state>
| ... <city>\ue69f8f\ue5b882</city>
| ... <street>\ue9ab98\ue58d97\ue58fb03</street>
| ... </shipping></customer>'''
| >>> xml = et.XML(ucode)
| Traceback (most recent call last):
|   File "<stdin>", line 1, in <module>
|   File "C:\python26\lib\xml\etree\ElementTree.py", line 963, in XML
|     parser.feed(text)
|   File "C:\python26\lib\xml\etree\ElementTree.py", line 1245, in feed
|     self._parser.Parse(data, 0)
| UnicodeEncodeError: 'ascii' codec can't encode character u'\ue58d' in position 69: ordinal not in range(128)
| # as expected
| >>> strg = ucode.encode('utf8')
| # encoding as utf8 is for DEMO purposes.
| # i.e. use the original web str object, don't convert it to unicode
| # and back to utf8.
| >>> xml2 = et.XML(strg)
| >>> xml2.tag
| 'customer'
| >>> for c in xml2.getchildren():
| ...     print c.tag, repr(c.text)
| ...
| shipping '\n'
| >>> for c in xml2[0].getchildren():
| ...     print c.tag, repr(c.text)
| ...
| state u'\ue58d83\ue89189\ue79c8C'
| city u'\ue69f8f\ue5b882'
| street u'\ue9ab98\ue58d97\ue58fb03'
| >>>
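
The same idea as a short script, for anyone who wants to paste and run it. This is only a sketch: raw_bytes is a made-up stand-in for the undecoded body of the web response, and the city name is written with ordinary \u escapes. It should behave the same on later Python versions, but that is an assumption, not something shown in the session above.

```python
from xml.etree import ElementTree as et

# Hypothetical stand-in for the raw (undecoded) bytes of the web response.
raw_bytes = (u'<?xml version="1.0" encoding="UTF-8"?>'
             u'<customer><shipping><city>\u67cf\u5e02</city></shipping></customer>'
             ).encode('utf-8')

# Hand the bytes straight to the parser; it decodes them according to
# the encoding named in the XML declaration.
root = et.XML(raw_bytes)

# Element text comes back as a unicode string, not as bytes.
city = root.find('shipping/city').text
print(repr(city))
```

The point is the same as in the session: do not decode the response to unicode yourself before parsing.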

By the way: (1) it usually helps to be more explicit than "errors
out", preferably with the exact copied/pasted output as shown above;
this is one of the rare cases where the error message is predictable.
(2) PLEASE don't start a new topic in a reply in somebody else's thread.
 
Kee Nethery

What you need to do is NOT feed it unicode. You feed it a str object
and it gets decoded according to the encoding declaration found in the
first line.

That it uses "the encoding declaration found in the first line" is the
nugget of data that is not in the documentation that has stymied me
for days. Thank you!

The other thing that has been confusing is that I've been using "dump"
to view what is in the elementtree instance, and the non-ASCII
characters have been displayed as "numbered
entities" (<city>&#26575;&#24066;</city>), and I know that is not the
representation I want the data to be in. A co-worker suggested that
instead of "dump" I use "et.tostring(theResponseXml,
encoding='utf-8')" and then print that to see the characters. That
process causes the non-ASCII characters to display as the glyphs I
know them to be.
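
A small sketch of the difference described above, using a hypothetical one-element tree rather than the real response. It assumes that tostring() with no encoding argument falls back to US-ASCII and therefore writes numeric character references, which matches what "dump" showed in the Python 2.6 setup discussed here.

```python
from xml.etree import ElementTree as et

# Hypothetical element holding the two characters of the city name.
city = et.Element('city')
city.text = u'\u67cf\u5e02'

# No encoding given: the serializer uses US-ASCII, so the non-ASCII
# characters are written as numeric character references.
ascii_xml = et.tostring(city)

# Asking for UTF-8 writes the characters themselves as UTF-8 bytes.
utf8_xml = et.tostring(city, encoding='utf-8')

print(repr(ascii_xml))
print(repr(utf8_xml))
```

Both byte strings are the same XML document as far as a parser is concerned; only the on-the-wire spelling of the characters differs.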

If there was a place in the official docs for me to append these
nuggets of information to the sections for
"xml.etree.ElementTree.XML(text)" and
"xml.etree.ElementTree.dump(elem)" I would absolutely do so.

Thank you!
Kee Nethery

John Machin

That it uses "the encoding declaration found in the first line" is the  
nugget of data that is not in the documentation that has stymied me  
for days. Thank you!

And under the "don't repeat" principle, it shouldn't be in the
Elementtree docs; it's nothing special about ET -- it's part of the
definition of an XML document (which for universal loss-free
transportability naturally must be encoded somehow, and the document
must state what its own encoding is (if it's not the default
(UTF-8))).
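
To illustrate the rule, here is a minimal sketch with made-up documents: one has no declaration and is encoded as UTF-8 (the default), the other is encoded as ISO-8859-1 and says so in its declaration. ISO-8859-1 is used only because it keeps the example short; it is not part of the original question.

```python
from xml.etree import ElementTree as et

text = u'caf\u00e9'

# No declaration: the parser assumes the XML default, UTF-8.
utf8_doc = (u'<name>%s</name>' % text).encode('utf-8')

# Non-default encoding: the declaration has to name it, and the bytes
# have to actually be in that encoding.
latin1_doc = (u'<?xml version="1.0" encoding="ISO-8859-1"?>'
              u'<name>%s</name>' % text).encode('iso-8859-1')

# Different bytes on the wire, same characters after parsing.
from_utf8 = et.XML(utf8_doc).text
from_latin1 = et.XML(latin1_doc).text
print(from_utf8 == from_latin1)
```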

The other thing that has been confusing is that I've been using "dump"
to view what is in the elementtree instance, and the non-ASCII
characters have been displayed as "numbered
entities" (<city>&#26575;&#24066;</city>), and I know that is not the
representation I want the data to be in. A co-worker suggested that
instead of "dump" I use "et.tostring(theResponseXml,
encoding='utf-8')" and then print that to see the characters. That
process causes the non-ASCII characters to display as the glyphs I
know them to be.

If there was a place in the official docs for me to append these  
nuggets of information to the sections for  
"xml.etree.ElementTree.XML(text)" and  
"xml.etree.ElementTree.dump(elem)" I would absolutely do so.

I don't understand ... tostring() is in the same section as dump(),
about two screen-heights away. You want to include the tostring() docs
in the dump() docs? The usual idea is not to get bogged down in the
first function that looks at first glance like it might do what you
want ("look at the glyphs") but doesn't (it writes a (transportable)
XML stream) but press on to the next plausible candidate.
 
Kee Nethery

http://bugs.python.org/ applies to documentation too.

I've submitted documentation bugs in the past and no action was taken
on them, the bugs were closed. I'm guessing that information "that
everyone knows" not being in the documentation is not a bug. It's my
fault I'm a newbie and I accept that. Thanks to you two for helping me
get past this block.

Kee
 
