Encoding/decoding: Still don't get it :-/

Gilles Ganault · Mar 13, 2009

Hello

I must be dense, but I still don't understand 1) why Python sometimes
barfs out this type of error when displaying text that might not be
Unicode-encoded, 2) whether I should use encode() or decode() to solve
the issue, or even 3) whether this is a Python issue or due to the APSW
SQLite wrapper that I'm using:

======
sql = 'SELECT id,address FROM companies'
rows = list(cursor.execute(sql))

for row in rows:
    id = row[0]

    # could be 'utf-8', 'iso8859-1' or 'cp1252'
    try:
        address = row[1]
    except UnicodeDecodeError:
        try:
            address = row[1].decode('iso8859-1')
        except UnicodeDecodeError:
            address = row[1].decode('cp1252')

    print id,address
======
152 Traceback (most recent call last):
  File "C:\zip.py", line 28, in <module>
    print id,address
  File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc8' in
position 24: character maps to <undefined>
======

Thank you for any tip.

Peter Otten · Mar 13, 2009

Gilles said:
I must be dense, but I still don't understand 1) why Python sometimes
barfs out this type of error when displaying text that might not be
Unicode-encoded, 2) whether I should use encode() or decode() to solve
the issue, or even 3) whether this is a Python issue or due to the APSW
SQLite wrapper that I'm using:

======
sql = 'SELECT id,address FROM companies'
rows=list(cursor.execute(sql))

for row in rows:
    id = row[0]

    # could be 'utf-8', 'iso8859-1' or 'cp1252'
    try:
        address = row[1]

Assuming row is a tuple with len(row) >= 2 the above line can never fail.
Therefore you can rewrite the loop as

for row in rows:
    id, address = row[:2]
    print id, address

    except UnicodeDecodeError:
        try:
            address = row[1].decode('iso8859-1')
        except UnicodeDecodeError:
            address = row[1].decode('cp1252')

    print id,address
======
152 Traceback (most recent call last):
  File "C:\zip.py", line 28, in <module>
    print id,address
  File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc8' in
position 24: character maps to <undefined>

It seems the database gives you the strings as unicode. When a unicode
string is printed, Python tries to encode it using sys.stdout.encoding
before writing it to stdout. As you run your script on the Windows command
line, that encoding seems to be cp437. Unfortunately your database contains
characters that cannot be expressed in that encoding. One workaround is to
replace these characters with "?":

import sys

encoding = sys.stdout.encoding or "ascii"
for row in rows:
    id, address = row[:2]
    print id, address.encode(encoding, "replace")

Example:
u"ähnlich lölich üblich".encode("ascii", "replace")
'?hnlich l?lich ?blich'
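
The encode step that print performs can also be reproduced by hand, which makes the failure easier to inspect than a traceback from the console. The following is only a sketch: it assumes the console code page is cp437, as the traceback suggests, and the sample address is made up for illustration. It is written so that it runs on both Python 2 and Python 3.

```python
# Reproduce by hand the encode step that print performs on a unicode string.
# Assumption: the console code page is cp437, as in the traceback.
text = u"RUE \xc8ve"  # hypothetical address containing U+00C8

try:
    text.encode("cp437")      # strict error handling, the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True      # cp437 has no byte for U+00C8

# "replace" substitutes "?" for every character the code page lacks
safe = text.encode("cp437", "replace")
```

Here strict_failed ends up True and safe holds the bytes for "RUE ?ve", so the loss is limited to the one character the code page cannot represent.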

Peter

Johannes Bauer · Mar 13, 2009

Peter said:
encoding = sys.stdout.encoding or "ascii"
for row in rows:
    id, address = row[:2]
    print id, address.encode(encoding, "replace")

Example:
u"ähnlich lölich üblich".encode("ascii", "replace")
'?hnlich l?lich ?blich'

A very good tip, Peter - I've also had this problem before and didn't
know about your solution.

Thanks,
Johannes

Gilles Ganault · Mar 16, 2009

Peter said:
It seems the database gives you the strings as unicode. When a unicode
string is printed, Python tries to encode it using sys.stdout.encoding
before writing it to stdout. As you run your script on the Windows command
line, that encoding seems to be cp437. Unfortunately your database contains
characters that cannot be expressed in that encoding.

Thank you very much for the help.

I hadn't thought about the code page used to display data in the DOS box
in XP.

It turns out that the HTML page from which I was trying to extract
data using regexes was encoded in ISO-8859-1 instead of UTF-8. The SQLite
wrapper expects Unicode only, so it had a problem with some
characters.

For those interested, here's how I solved it, although there's likely
a smarter way to do it:

============
data = re_data.search(response)
if data:
    name = data.group(1).strip()
    address = data.group(2).strip()

    # The page declares: content="text/html; charset=iso-8859-1">
    name = name.decode('iso8859-1')
    address = address.decode('iso8859-1')

    sql = 'BEGIN;'
    sql = sql + 'UPDATE companies SET name=?,address=? WHERE id=?;'
    sql = sql + "COMMIT"

    try:
        cursor.execute(sql, (name,address,id) )
    except:
        print "Failed UPDATING"
        raise
else:
    print "Pattern not found"
============
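
When the charset of a page is not known in advance, the decode step can be wrapped in a small helper that tries several encodings in turn. This is only a sketch: decode_bytes is a hypothetical name, and the candidate order is an assumption. One detail matters for the order: iso8859-1 maps every possible byte to a character, so decoding with it never raises UnicodeDecodeError and anything listed after it is never tried. That is why it comes last here.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252", "iso8859-1")):
    """Return raw decoded with the first candidate encoding that accepts it."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding rejected the bytes, try the next one
    raise ValueError("none of the candidate encodings matched")

from_utf8 = decode_bytes(b"caf\xc3\xa9")   # valid UTF-8, decoded as such
from_cp1252 = decode_bytes(b"caf\xe9")     # invalid UTF-8, falls through to cp1252
from_latin1 = decode_bytes(b"\x81")        # undefined in cp1252, falls through to iso8859-1
```

A successful decode does not prove the guess was right; it only means the bytes were valid in that encoding. If the page declares its charset, as in the meta tag above, using that declaration is the safer choice.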

Thanks again.

Antoon Pardon · Mar 16, 2009

Johannes said:
Peter said:

encoding = sys.stdout.encoding or "ascii"
for row in rows:
    id, address = row[:2]
    print id, address.encode(encoding, "replace")

Example:

u"ähnlich lölich üblich".encode("ascii", "replace")
'?hnlich l?lich ?blich'

A very good tip, Peter - I've also had this problem before and didn't
know about your solution.

If you know beforehand that you will be using ASCII, you can strip
the accents, so that you get the unaccented letter (followed by
a question mark if you prefer) instead of a bare question mark:
'ahnlich lolich ublich'
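
The code that produced this output is not preserved in the thread. One common way to get the same result with the standard library is to decompose each character with unicodedata.normalize and drop the combining marks. The sketch below assumes that approach, and strip_accents is a hypothetical name.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks (accents, umlauts) removed."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_accents(u"\xe4hnlich l\xf6lich \xfcblich")
# Characters with no decomposition still need "replace" when encoding
ascii_bytes = plain.encode("ascii", "replace")
```

Letters that have no decomposition, such as the German sharp s, pass through unchanged, so the final encode with "replace" is still needed as a safety net.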
