small inconsistency in ElementTree (1.2.6)

Damjan · Dec 9, 2005

Attached is the smallest test case that shows that ElementTree returns a
string object if the text in the tree is only ASCII, but returns a unicode
object otherwise.

This would make sense if the string object and unicode object were
interchangeable... but they are not. One example: the translate method is
completely different.

I've tested with cElementTree (1.0.2) too, it has the same behaviour.

Any suggestions?
Do I need to check the output of ElementTree every time, or is there some
hidden switch to change this behaviour?

from elementtree import ElementTree

xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<root>
<p1> ascii </p1>
<p2> \xd0\xba\xd0\xb8\xd1\x80\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x86\xd0\xb0 </p2>
</root>
"""

tree = ElementTree.fromstring(xml)
p1, p2 = tree.getchildren()
print "type(p1.text):", type(p1.text)
print "type(p2.text):", type(p2.text)

Fredrik Lundh · Dec 9, 2005

Damjan said:
Attached is the smallest test case that shows that ElementTree returns
a string object if the text in the tree is only ASCII, but returns a unicode
object otherwise.

This would make sense if the string object and unicode object were
interchangeable... but they are not. One example: the translate method
is completely different.

I've tested with cElementTree (1.0.2) too, it has the same behaviour.

Any suggestions?

this is documented behaviour.

Do I need to check the output of ElementTree every time, or is there some
hidden switch to change this behaviour?

no.

ascii strings and unicode strings are perfectly interchangeable, with some
minor exceptions. if you find yourself using translate all the time (why?),
add an explicit conversion to the translate code.

(fwiw, I'd say this is a bug in translate rather than in elementtree)

</F>
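
A minimal sketch of the "explicit conversion" suggested above, written so
that it runs on Python 2.6+ and Python 3. On Python 2, str.translate expects
a 256-character table while unicode.translate expects a mapping keyed by code
point, which is why one translate call cannot serve both types. The helper
name translate_text and the sample table are illustrative only; they are not
part of ElementTree.

```python
def translate_text(text, table):
    """Translate `text` with a {code point: replacement} table.

    ASCII byte strings (what ElementTree returns on Python 2 for
    ASCII-only text) are decoded first, so the same dict-based table
    works for every value.
    """
    if isinstance(text, bytes):      # `bytes` is an alias of `str` on Python 2
        text = text.decode("ascii")
    return text.translate(table)


# Illustrative table: replace '{' with '(' and delete '~'.
table = {ord(u"{"): u"(", ord(u"~"): None}

from_bytes = translate_text(b"a{b~", table)
from_text = translate_text(u"a{b~", table)
```

Both calls go through the unicode form of translate, so the caller no longer
has to care which type ElementTree handed back.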

Damjan · Dec 9, 2005

Do I need to check the output of ElementTree every time, or there's some

no.

ascii strings and unicode strings are perfectly interchangeable, with some
minor exceptions.

It's not only translate, it's decode too... probably other methods and
behaviour differ too.
And in the bigger picture, string objects are really only byte sequences,
while text consists of characters, and that's what unicode strings are for:
strings made of characters.

It seems more logical to me for et.text to always be a unicode object.
It's text, right!

if you find yourself using translate all the time
(why?), add an explicit conversion to the translate code.

I'm using translate because I need it

I'm currently just wrapping anything from ElementTree in unicode(), but this
seems like an ugly step.

(fwiw, I'd say this is a bug in translate rather than in elementtree)

I wonder what the python devels will say?

Fredrik Lundh · Dec 9, 2005

Damjan said:
It's not only translate, it's decode too...

why would you use decode on the strings you get back from ET ?

probably other methods and behaviour differ too.

And in the bigger picture, string objects are really only byte sequences

not if they contain ASCII characters.

while text consists of characters, and that's what unicode strings
are for: strings made of characters.

It seems more logical to me for et.text to always be a unicode object.
It's text, right!

I'm using translate because I need it

I'm currently just wrapping anything from ElementTree in unicode(), but
this seems like an ugly step.

I wonder what the python devels will say?

well, you're talking to the developer who wrote the original Unicode
implementation...

</F>

Damjan · Dec 10, 2005

ascii strings and unicode strings are perfectly interchangeable, with

why would you use decode on the strings you get back from ET ?

Long story... Some time ago, when computers didn't support charsets, people
invented so-called "cyrillic fonts", i.e. fonts that have Cyrillic glyphs
mapped onto the Latin positions. Since our Cyrillic alphabet has 31
characters, some characters in those fonts were mapped to { or ~, etc. Of
course this "solution" is awful, but it was the only one at the time.

So I'm making a Python script that takes an OpenDocument file and translates
it to UTF-8...
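
A rough sketch of what such a conversion pass might look like, assuming the
document body has already been extracted (in an OpenDocument file it lives in
content.xml inside the zip archive). The FONT_MAP entries are placeholders,
not the real layout of any particular "cyrillic font", and convert_tree is a
made-up name. The sketch uses the standard library's xml.etree.ElementTree;
with ElementTree 1.2.6, getiterator() plays the role of iter().

```python
import xml.etree.ElementTree as ET

# Placeholder mapping from Latin positions to Cyrillic letters.
FONT_MAP = {
    ord(u"a"): u"\u0430",   # CYRILLIC SMALL LETTER A
    ord(u"b"): u"\u0431",   # CYRILLIC SMALL LETTER BE
    ord(u"{"): u"\u0448",   # CYRILLIC SMALL LETTER SHA
}


def _convert(value, table):
    """Return `value` as unicode text, translated through `table`."""
    if value is None:
        return None
    if isinstance(value, bytes):     # ASCII-only str on Python 2
        value = value.decode("ascii")
    return value.translate(table)


def convert_tree(root, table):
    """Translate the text and tail of every element, in place."""
    for elem in root.iter():
        elem.text = _convert(elem.text, table)
        elem.tail = _convert(elem.tail, table)
    return root


root = ET.fromstring("<doc><p>ab{<b>a</b>b</p></doc>")
convert_tree(root, FONT_MAP)
utf8_bytes = ET.tostring(root, encoding="utf-8")
```

A real document would also need to restrict the translation to text runs that
actually use the legacy font; this sketch translates everything.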

ps. I use translate now, but I was making a general note that unicode and
string objects are not 100% interchangeable. translate, encode, decode are
especially problematic.

anyway, I wrap the output of ET in unicode() now... I don't see
another, better, solution.
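
For what it's worth, the unicode() wrapping can at least be kept in one place
with a small accessor. This is a sketch, assuming xml.etree.ElementTree and a
hypothetical helper named text_of. On Python 3 the parser always returns str
for text, so the decode branch never runs there.

```python
import xml.etree.ElementTree as ET


def text_of(elem, default=u""):
    """Return elem.text as unicode text, whatever type the parser produced."""
    text = elem.text
    if text is None:                 # empty element
        return default
    if isinstance(text, bytes):      # ASCII-only str on Python 2
        return text.decode("ascii")
    return text


root = ET.fromstring(
    b"<root><p1> ascii </p1><p2> \xd0\xba\xd0\xb8 </p2><p3/></root>"
)
texts = [text_of(child) for child in root]
```

Every entry in texts has the same type, so later calls such as translate
behave the same for the ASCII and the non-ASCII element.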
