small inconsistency in ElementTree (1.2.6)

Damjan

Attached is the smallest test case that shows that ElementTree returns a
string object if the text in the tree is only ASCII, but returns a unicode
object otherwise.

This would make sense if the string object and unicode object were
interchangeable... but they are not. One example: the translate method is
completely different.

I've tested with cElementTree (1.0.2) too; it has the same behaviour.

Any suggestions?
Do I need to check the output of ElementTree every time, or is there some
hidden switch to change this behaviour?

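# Python 2 script; needs the standalone elementtree package.
from elementtree import ElementTree

# UTF-8 encoded document: <p1> is pure ASCII, <p2> contains Cyrillic text.
xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<root>
<p1> ascii </p1>
<p2> \xd0\xba\xd0\xb8\xd1\x80\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x86\xd0\xb0 </p2>
</root>
"""

tree = ElementTree.fromstring(xml)
p1, p2 = tree.getchildren()
# Compare the types ElementTree hands back for the two text values.
print "type(p1.text):", type(p1.text)
print "type(p2.text):", type(p2.text)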
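
To make the translate difference concrete, here is a minimal sketch written for Python 3, where `bytes` and `str` play roughly the roles that `str` and `unicode` play in the Python 2 test case. That correspondence between versions is an assumption of this example, and the sample strings are made up.

```python
# Byte strings: translate() takes a 256-byte lookup table.
byte_table = bytes.maketrans(b"a", b"x")
byte_result = b"banana".translate(byte_table)

# Text strings: translate() takes a mapping keyed by code point.
text_table = {ord("a"): "x"}
text_result = "banana".translate(text_table)

# The table formats cannot be swapped: a dict is not a valid byte table.
try:
    b"banana".translate(text_table)
    swapped_ok = True
except TypeError:
    swapped_ok = False

print(byte_result, text_result, swapped_ok)
```

So code that calls translate on a value has to know which of the two types it was given, which is the inconvenience described in the post.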
 
Fredrik Lundh

Damjan said:
> Attached is the smallest test case that shows that ElementTree returns
> a string object if the text in the tree is only ASCII, but returns a
> unicode object otherwise.
>
> This would make sense if the string object and unicode object were
> interchangeable... but they are not. One example: the translate method
> is completely different.
>
> I've tested with cElementTree (1.0.2) too; it has the same behaviour.
>
> Any suggestions?

this is documented behaviour.

> Do I need to check the output of ElementTree every time, or is there some
> hidden switch to change this behaviour?

no.

ascii strings and unicode strings are perfectly interchangeable, with some
minor exceptions. if you find yourself using translate all the time (why?),
add an explicit conversion to the translate code.

(fwiw, I'd say this is a bug in translate rather than in elementtree)

</F>
 
Damjan

>> Do I need to check the output of ElementTree every time, or is there some
>> hidden switch to change this behaviour?
>
> no.
>
> ascii strings and unicode strings are perfectly interchangeable, with some
> minor exceptions.

It's not only translate, it's decode too... probably other methods and
behaviour differ too.
And the bigger picture: string objects are really only byte sequences, while
text consists of characters, and that's what unicode strings are for:
strings made of characters.

It seems more logical to me for et.text to always be a unicode object.
It's text, right?
> if you find yourself using translate all the time
> (why?), add an explicit conversion to the translate code.

I'm using translate because I need it :)

I'm currently just wrapping anything from ElementTree in unicode(), but
this seems like an ugly step.

> (fwiw, I'd say this is a bug in translate rather than in elementtree)

I wonder what the python devels will say? ;)
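
The explicit conversion discussed in this exchange could look roughly like the sketch below. The helper names `to_text` and `translate_text` are hypothetical, and the ASCII default is an assumption based on the observation that ElementTree hands back plain strings only for ASCII-only text. The sketch is written for Python 3; the `text_type` line is only there to name the text type on either major version.

```python
import sys

# Python 2 calls its text type `unicode`; Python 3 calls it `str`.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_text(value, encoding="ascii"):
    """Hypothetical helper: return `value` as a text string, or None."""
    if value is None or isinstance(value, text_type):
        return value
    # Anything else is assumed to be a byte string that needs decoding.
    return value.decode(encoding)


def translate_text(value, table):
    """Hypothetical helper: coerce to text, then translate by code point."""
    text = to_text(value)
    return None if text is None else text.translate(table)


table = {ord(u"a"): u"x"}
print(translate_text(b"banana", table))  # byte string input
print(translate_text(u"banana", table))  # text string input
```

Putting the conversion inside the translate wrapper keeps the rest of the script free of type checks, at the cost of one extra function call per value.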
 
Fredrik Lundh

Damjan said:
> It's not only translate, it's decode too...

why would you use decode on the strings you get back from ET ?

> probably other methods and behaviour differ too.
>
> And the bigger picture, string objects are really only byte sequences

not if they contain ASCII characters.

> while text consists of characters and that's what unicode strings
> are for, strings-made-of-characters.
>
> It seems more logical to me for et.text to always be a unicode object.
> It's text, right!
>
> I'm using translate because I need it :)
>
> I'm currently just wrapping anything from ElementTree in unicode(), but
> this seems like an ugly step.
>
> I wonder what the python devels will say? ;)

well, you're talking to the developer who wrote the original Unicode
implementation...

</F>
 
Damjan

> ascii strings and unicode strings are perfectly interchangeable, with
> some minor exceptions.
>
> why would you use decode on the strings you get back from ET ?

Long story... some time ago, when computers didn't support charsets, people
invented so-called "cyrillic fonts", i.e. a font that has Cyrillic glyphs
mapped onto the Latin positions. Since our Cyrillic alphabet has 31
characters, some characters in those fonts were mapped to { or ~ etc.
Of course this "solution" is awful, but it was the only one at the time.

So I'm making a Python script that takes an OpenDocument file and translates
it to UTF-8...

PS: I use translate now, but I was making a general note that unicode and
string objects are not 100% interchangeable. translate, encode and decode
are especially problematic.

Anyway, I wrap the output of ET in unicode() now... I don't see
another, better solution.
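
For readers with a similar task, the approach described in this post might be sketched as follows. This is an illustration only: the three pairs in `FONT_MAP` are invented (the real font layout is not given in the thread), `remap` and `remap_tree` are hypothetical names, and a one-line XML string stands in for the content of an OpenDocument file. It uses the standard-library `xml.etree.ElementTree` and is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Invented sample pairs: Latin-position character -> Cyrillic letter.
# The real font layout is not given in the thread.
FONT_MAP = {
    ord(u"a"): u"\u0430",  # CYRILLIC SMALL LETTER A
    ord(u"{"): u"\u0448",  # CYRILLIC SMALL LETTER SHA
    ord(u"~"): u"\u0447",  # CYRILLIC SMALL LETTER CHE
}


def remap(value):
    """Translate one text/tail value, leaving None untouched."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # Plain byte strings are assumed to be ASCII-only.
        value = value.decode("ascii")
    return value.translate(FONT_MAP)


def remap_tree(root):
    """Apply remap() to the text and tail of every element under root."""
    for elem in root.iter():
        elem.text = remap(elem.text)
        elem.tail = remap(elem.tail)
    return root


root = remap_tree(ET.fromstring("<root><p>a{~</p></root>"))
result = root.find("p").text
print(result)
```

Doing the coercion once inside `remap` means every value that reaches translate is already a text string, so the dict-based table works regardless of what the parser returned.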