Treating a unicode string as latin-1

Simon Willison

Hello,

I'm using ElementTree to parse an XML file which includes some data
encoded as cp1252, for example:

<name>Bob\x92s Breakfast</name>

If this was a regular bytestring, I would convert it to utf8 using the
following:

>>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Bob's Breakfast

But ElementTree gives me back a unicode string, so I get the following
error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/
python2.5/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in
position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Thanks,

Simon Willison
 
Paul Hankin

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

u'Bob\x92s Breakfast'.encode('latin-1')
 
Duncan Booth

Simon Willison said:
How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Can you not just fix your xml file so that it uses the same encoding as it
claims to use? If the xml says it contains utf8 encoded data then it should
not contain cp1252 encoded data, period.

If you really must, then try encoding with latin1 and then decoding with
cp1252:

>>> print u'Bob\x92s Breakfast'.encode('latin1').decode('cp1252')
Bob’s Breakfast

The latin1 codec will convert unicode characters in the range 0-255 to the
same single-byte value.
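
The round trip described above can be sketched as follows. This is written in Python 3 syntax as an assumption for readers running it today (the thread itself uses Python 2, where the literal would be u'...'); the variable names are illustrative only.

```python
# Python 3 sketch; in Python 2 the first literal would be u'Bob\x92s Breakfast'.
mis_decoded = 'Bob\x92s Breakfast'   # contains U+0092, a C1 control character

# latin-1 maps code points 0-255 to the identical byte values,
# so this recovers the bytes as they were in the file.
raw = mis_decoded.encode('latin-1')

# Decoding those bytes as cp1252 turns byte 0x92 into U+2019,
# the right single quotation mark.
repaired = raw.decode('cp1252')

print(ascii(repaired))
```

Note that this only works when every character is below U+0100; anything higher makes the latin-1 encode step raise UnicodeEncodeError. Python's cp1252 codec also leaves a few bytes (0x81, for example) undefined, so the decode step can fail on those.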
 
Diez B. Roggisch

Simon said:
Hello,

I'm using ElementTree to parse an XML file which includes some data
encoded as cp1252, for example:

<name>Bob\x92s Breakfast</name>

If this was a regular bytestring, I would convert it to utf8 using the
following:

>>> print 'Bob\x92s Breakfast'.decode('cp1252').encode('utf8')
Bob's Breakfast

But ElementTree gives me back a unicode string, so I get the following
error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/
python2.5/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in
position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

I don't get your problem. You get a unicode-object. Which means that it got
decoded by ET for you, as any XML-parser must do.

So - why don't you get rid of that .decode('cp1252') and happily encode it
to utf-8?

Diez
 
Jeroen Ruigrok van der Werven

On [20080103 14:36], Simon Willison said:
How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

Although this does not address the exact question, it does raise the issue
of how you are using ElementTree. When I use the following:

test.xml

<entry>
<name>Bob\x92s Breakfast</name>
</entry>

parse.py

from xml.etree.ElementTree import ElementTree

xmlfile = open('test.xml')

tree = ElementTree()
tree.parse(xmlfile)
elem = tree.find('name')

print type(elem.text)

I get a string type back and not a unicode string.

However, if you are mixing encodings within the same file, e.g. cp1252 in a
UTF-8 encoded file, then you are creating a ton of problems.

--
Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/
When moved to complain about others, remember that karma is endless and it
is loving that leads to love...
 
Fredrik Lundh

Simon said:
But ElementTree gives me back a unicode string, so I get the following
error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.5/lib/
python2.5/encodings/cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in
position 3: ordinal not in range(128)

How can I tell Python "I know this says it's a unicode string, but I
need you to treat it like a bytestring"?

ET has already decoded the CP1252 data for you. If you want UTF-8, all
you need to do is to encode it:

>>> u'Bob\x92s Breakfast'.encode('utf8')
'Bob\xc2\x92s Breakfast'

</F>
 
Duncan Booth

Fredrik Lundh said:
ET has already decoded the CP1252 data for you. If you want UTF-8, all
you need to do is to encode it:

'Bob\xc2\x92s Breakfast'

I think he is claiming that the encoding information in the file is
incorrect and therefore it has been decoded incorrectly.

I would think it more likely that he wants to end up with u'Bob\u2019s
Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
seems a probable consequence.
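
To make the difference concrete, here is a small sketch (Python 3 syntax, an assumption; the thread uses Python 2) comparing the UTF-8 bytes of the two strings. The variable names are illustrative only.

```python
# What the parser handed back: U+0092, an invisible C1 control character.
as_parsed = 'Bob\x92s Breakfast'

# What was presumably intended: U+2019, the right single quotation mark.
intended = 'Bob\u2019s Breakfast'

print(as_parsed.encode('utf-8'))   # b'Bob\xc2\x92s Breakfast'
print(intended.encode('utf-8'))    # b'Bob\xe2\x80\x99s Breakfast'
```

Encoding the string exactly as parsed gives valid UTF-8, but it still encodes the control character rather than the apostrophe.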
 
Diez B. Roggisch

Duncan said:
I think he is claiming that the encoding information in the file is
incorrect and therefore it has been decoded incorrectly.

I would think it more likely that he wants to end up with u'Bob\u2019s
Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner'
seems a probable consequence.

If that's the case, he should read the file as string, de- and encode it
(probably into a StringIO) and then feed it to the parser.

Diez
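
A minimal sketch of that approach, in Python 3 syntax (an assumption; there io.BytesIO plays the role of StringIO). The raw bytes stand in for the contents of a real file, and the sketch assumes the document carries no XML declaration naming a conflicting encoding; if it did, the declaration would need rewriting too.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for open('test.xml', 'rb').read(); byte 0x92 is the cp1252
# encoding of the right single quotation mark.
raw = b'<entry><name>Bob\x92s Breakfast</name></entry>'

# Decode with the encoding the data really uses, then re-encode as UTF-8,
# which the parser assumes when no declaration says otherwise.
fixed = raw.decode('cp1252').encode('utf-8')

# Feed the corrected bytes to the parser through an in-memory file.
tree = ET.parse(io.BytesIO(fixed))
name = tree.find('name').text

print(ascii(name))
```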
 
Fredrik Lundh

Diez said:
If that's the case, he should read the file as string, de- and encode it
(probably into a StringIO) and then feed it to the parser.

some alternatives:

- clean up the offending strings:

http://effbot.org/zone/unicode-gremlins.htm

- turn the offending strings back to iso-8859-1, and decode them again:

u = u'Bob\x92s Breakfast'
u = u.encode("iso-8859-1").decode("cp1252")

- upgrade to ET 1.3 (available in alpha) and use the parser's encoding
option to override the file's encoding:

parser = ET.XMLParser(encoding="cp1252")
tree = ET.parse(source, parser)

</F>
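
For the last alternative, a sketch using the ElementTree that ships with current Python 3 (an assumption; the post refers to the ET 1.3 alpha). The raw bytes are a hypothetical stand-in for the file contents, with no XML declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document bytes; 0x92 is cp1252 for the right single quote.
raw = b'<entry><name>Bob\x92s Breakfast</name></entry>'

# Tell the parser which encoding to use instead of letting it decide
# from the document itself.
parser = ET.XMLParser(encoding='cp1252')
root = ET.fromstring(raw, parser=parser)
name = root.find('name').text

print(ascii(name))
```

The override avoids touching the file or post-processing every string, at the cost of hard-coding the encoding in the calling code.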
 
