Encoding/decoding: Still don't get it :-/


Gilles Ganault

Hello

I must be dense, but I still don't understand 1) why Python sometimes
barfs out this type of error when displaying text that might not be
Unicode-encoded, 2) whether I should use encode() or decode() to solve
the issue, or even 3) if this is a Python issue or due to the APSW SQLite
wrapper that I'm using:

======
sql = 'SELECT id,address FROM companies'
rows=list(cursor.execute(sql))

for row in rows:
    id = row[0]

    #could be 'utf-8', 'iso8859-1' or 'cp1252'
    try:
        address = row[1]
    except UnicodeDecodeError:
        try:
            address = row[1].decode('iso8859-1')
        except UnicodeDecodeError:
            address = row[1].decode('cp1252')

    print id,address
======
152 Traceback (most recent call last):
  File "C:\zip.py", line 28, in <module>
    print id,address
  File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc8' in position 24: character maps to <undefined>
======

Thank you for any tip.
 

Peter Otten

Gilles said:
I must be dense, but I still don't understand 1) why Python sometimes
barfs out this type of error when displaying text that might not be
Unicode-encoded, 2) whether I should use encode() or decode() to solve
the issue, or even 3) if this is a Python issue or due to the APSW SQLite
wrapper that I'm using:

======
sql = 'SELECT id,address FROM companies'
rows=list(cursor.execute(sql))

for row in rows:
        id = row[0]

        #could be 'utf-8', 'iso8859-1' or 'cp1252'
        try:
                address = row[1]

Assuming row is a tuple with len(row) >= 2, the above line can never fail:
indexing a tuple does no decoding, so the except clauses are never reached.
Therefore you can rewrite the loop as

for row in rows:
    id, address = row[:2]
    print id, address
Gilles said:
        except UnicodeDecodeError:
                try:
                        address = row[1].decode('iso8859-1')
                except UnicodeDecodeError:
                        address = row[1].decode('cp1252')

        print id,address
======
152 Traceback (most recent call last):
  File "C:\zip.py", line 28, in <module>
    print id,address
  File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc8' in position 24: character maps to <undefined>

It seems the database gives you the strings as unicode. When a unicode
string is printed, Python tries to encode it using sys.stdout.encoding
before writing it to stdout. As you run your script on the Windows command
line, that encoding seems to be cp437. Unfortunately your database contains
characters that cannot be expressed in that encoding. One workaround is to
replace these characters with "?":

import sys

# Fall back to ASCII when stdout has no encoding (e.g. when piped).
encoding = sys.stdout.encoding or "ascii"
for row in rows:
    id, address = row[:2]
    print id, address.encode(encoding, "replace")


Example:
u"ähnlich lölich üblich".encode("ascii", "replace")
'?hnlich l?lich ?blich'
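
To see why the print fails, it can help to run the encoding step by hand. The following is only a minimal sketch, written for Python 3 (where `str` plays the role of the `unicode` type discussed here). The sample address and the helper name `encode_for_console` are made up for illustration; only the character u'\xc8' and the cp437 code page come from the traceback.

```python
# Sketch (Python 3): reproduce the encode step that print performs.
# The sample text and helper name are hypothetical; only U+00C8 and
# cp437 are taken from the traceback in the thread.
text = "RUE DE L'\u00c8RE"   # contains U+00C8, capital E with grave accent


def encode_for_console(s, encoding="cp437", errors="strict"):
    """Encode s the way print would for a console using `encoding`."""
    return s.encode(encoding, errors)


try:
    encode_for_console(text)
    strict_failed = False
except UnicodeEncodeError:
    # cp437 has no slot for U+00C8, so strict encoding raises.
    strict_failed = True

# With errors="replace", unencodable characters become "?" instead.
replaced = encode_for_console(text, errors="replace")
print(strict_failed, replaced)
```

The "replace" handler loses information, so it suits console output only, not data that will be stored again.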

Peter
 

Johannes Bauer

Peter said:
encoding = sys.stdout.encoding or "ascii"
for row in rows:
    id, address = row[:2]
    print id, address.encode(encoding, "replace")

Example:
u"ähnlich lölich üblich".encode("ascii", "replace")
'?hnlich l?lich ?blich'

A very good tip, Peter - I've also had this problem before and didn't
know about your solution.

Thanks,
Johannes
 

Gilles Ganault

Peter said:
It seems the database gives you the strings as unicode. When a unicode
string is printed, Python tries to encode it using sys.stdout.encoding
before writing it to stdout. As you run your script on the Windows command
line, that encoding seems to be cp437. Unfortunately your database contains
characters that cannot be expressed in that encoding.

Vielen Dank for the help :) I hadn't thought about the code page used
to display data in the DOS box in XP.

It turns out that the HTML page from which I was trying to extract
data using regexes was encoded in ISO-8859-1 instead of UTF-8. The SQLite
wrapper expects Unicode only, so it had a problem with some
characters.

For those interested, here's how I solved it, although there's likely
a smarter way to do it:

============
data = re_data.search(response)
if data:
    name = data.group(1).strip()
    address = data.group(2).strip()

    #content="text/html; charset=iso-8859-1">
    name = name.decode('iso8859-1')
    address = address.decode('iso8859-1')

    sql = 'BEGIN;'
    sql = sql + 'UPDATE companies SET name=?,address=? WHERE id=?;'
    sql = sql + "COMMIT"

    try:
        cursor.execute(sql, (name,address,id) )
    except:
        print "Failed UPDATING"
        raise
else:
    print "Pattern not found"
============
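
When the page encoding is not known in advance, the try/except chain from the first post can be turned into a small helper. This is only a sketch, written for Python 3 (bytes in, str out). The name `decode_best_effort` and the order of candidate encodings are assumptions, not something from the thread. Note that ISO-8859-1 maps every byte to a character, so decoding with it never raises, which is why it has to come last.

```python
def decode_best_effort(raw, encodings=("utf-8", "cp1252", "iso8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate order is an assumption. UTF-8 goes
    first because it is the strictest; ISO-8859-1 goes last because it
    accepts any byte sequence and therefore never fails.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


print(decode_best_effort(b"caf\xc3\xa9"))   # valid UTF-8
print(decode_best_effort(b"caf\xe9"))       # not UTF-8, falls through
```

A successful decode does not prove the guess was right, so an explicit charset (as in the meta tag above) is preferable whenever the page declares one.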

Thanks again.
 

Antoon Pardon

Peter said:
encoding = sys.stdout.encoding or "ascii"
for row in rows:
    id, address = row[:2]
    print id, address.encode(encoding, "replace")

Example:
u"ähnlich lölich üblich".encode("ascii", "replace")
'?hnlich l?lich ?blich'

Johannes said:
A very good tip, Peter - I've also had this problem before and didn't
know about your solution.

If you know beforehand that you will be using ASCII, you can eliminate
the accents, so that you get the unaccented letter (followed by
a question mark if you prefer) instead of just a question mark:
'ahnlich lolich ublich'
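
One way to get output like the above is to decompose the accented characters and then drop everything outside ASCII. The following is a minimal sketch, written for Python 3 and using the standard `unicodedata` module. It is not necessarily the approach Antoon had in mind, and the function name `strip_accents` is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks so accented letters fall back to ASCII letters."""
    # NFKD splits e.g. U+00E4 into "a" + U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" then discards everything outside ASCII,
    # including the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("\u00e4hnlich l\u00f6lich \u00fcblich"))
```

Characters that have no decomposition, such as the German sharp s, are dropped entirely by this approach, so it only suits text where that loss is acceptable.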
 
