Encoding/decoding: Still don't get it :-/

Discussion in 'Python' started by Gilles Ganault, Mar 13, 2009.

  1. Hello

    I must be dense, but I still don't understand 1) why Python sometimes
    barfs out this type of error when displaying text that might not be
    Unicode-encoded, 2) whether I should use encode() or decode() to solve
    the issue, or even 3) if this is a Python issue or due to the APSW SQLite
    wrapper that I'm using:

    ======
    sql = 'SELECT id,address FROM companies'
    rows=list(cursor.execute(sql))

    for row in rows:
        id = row[0]

        #could be 'utf-8', 'iso8859-1' or 'cp1252'
        try:
            address = row[1]
        except UnicodeDecodeError:
            try:
                address = row[1].decode('iso8859-1')
            except UnicodeDecodeError:
                address = row[1].decode('cp1252')

        print id,address
    ======
    152 Traceback (most recent call last):
      File "C:\zip.py", line 28, in <module>
        print id,address
      File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\xc8' in
    position 24: character maps to <undefined>
    ======

    Thank you for any tip.
    Gilles Ganault, Mar 13, 2009
    #1

  2. Peter Otten

    Gilles Ganault wrote:

    > I must be dense, but I still don't understand 1) why Python sometimes
    > barfs out this type of error when displaying text that might not be
    > Unicode-encoded, 2) whether I should use encode() or decode() to solve
    > the issue, or even 3) if this is a Python issue or due to the APSW SQLite
    > wrapper that I'm using:
    >
    > ======
    > sql = 'SELECT id,address FROM companies'
    > rows=list(cursor.execute(sql))
    >
    > for row in rows:
    >         id = row[0]
    >
    >         #could be 'utf-8', 'iso8859-1' or 'cp1252'
    >         try:
    >                 address = row[1]


    Assuming row is a tuple with len(row) >= 2 the above line can never fail.
    Therefore you can rewrite the loop as

    for row in rows:
        id, address = row[:2]
        print id, address

    >         except UnicodeDecodeError:
    >                 try:
    >                         address = row[1].decode('iso8859-1')
    >                 except UnicodeDecodeError:
    >                         address = row[1].decode('cp1252')
    >
    >         print id,address
    > ======
    > 152 Traceback (most recent call last):
    >   File "C:\zip.py", line 28, in <module>
    >     print id,address
    >   File "C:\Python25\lib\encodings\cp437.py", line 12, in encode
    >     return codecs.charmap_encode(input,errors,encoding_map)
    > UnicodeEncodeError: 'charmap' codec can't encode character u'\xc8' in
    > position 24: character maps to <undefined>


    It seems the database gives you the strings as unicode. When a unicode
    string is printed, Python tries to encode it using sys.stdout.encoding
    before writing it to stdout. As you run your script on the Windows command
    line, that encoding seems to be cp437. Unfortunately your database contains
    characters that cannot be expressed in that encoding. One workaround is to
    replace these characters with "?":

    import sys

    # sys.stdout.encoding can be None (e.g. when output is piped)
    encoding = sys.stdout.encoding or "ascii"
    for row in rows:
        id, address = row[:2]
        print id, address.encode(encoding, "replace")


    Example:

    >>> u"ähnlich lölich üblich".encode("ascii", "replace")
    '?hnlich l?lich ?blich'
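
    For anyone reading this on Python 3, where print is a function and str
    is already Unicode, the same workaround might look roughly like the
    sketch below. The helper name console_safe and the sample address are
    made up for illustration; only encode() with the "replace" error handler
    comes from the post above.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Fall back to ASCII when stdout reports no encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # "replace" turns every unmappable character into "?"; decoding again
    # gives back text rather than bytes, so it can be passed to print().
    return text.encode(encoding, "replace").decode(encoding)


# Hypothetical sample value; U+00C8 is the character from the traceback.
address = u"12 RUE DE L'\xc8RE"
print(console_safe(address, "cp437"))
```

    Passing the encoding explicitly, as in the last line, makes the result
    independent of the terminal the script happens to run in.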

    Peter
    Peter Otten, Mar 13, 2009
    #2

  3. Peter Otten wrote:

    > encoding = sys.stdout.encoding or "ascii"
    > for row in rows:
    >     id, address = row[:2]
    >     print id, address.encode(encoding, "replace")
    >
    > Example:
    >
    > >>> u"ähnlich lölich üblich".encode("ascii", "replace")
    > '?hnlich l?lich ?blich'


    A very good tip, Peter - I've also had this problem before and didn't
    know about your solution.

    Thanks,
    Johannes

    --
    "My counterclaim against you will then be for deliberate mendacity,
    slander of God, the Bible and me, and deliberate blasphemy."
    -- Prophet and visionary Hans Joss aka HJP in de.sci.physik
    <48d8bf1d$0$7510$>
    Johannes Bauer, Mar 13, 2009
    #3
  4. On Fri, 13 Mar 2009 14:24:52 +0100, Peter Otten wrote:
    >It seems the database gives you the strings as unicode. When a unicode
    >string is printed, Python tries to encode it using sys.stdout.encoding
    >before writing it to stdout. As you run your script on the Windows command
    >line, that encoding seems to be cp437. Unfortunately your database contains
    >characters that cannot be expressed in that encoding.


    Many thanks for the help :) I hadn't thought about the code page used
    to display data in the DOS box in XP.

    It turns out that the HTML page from which I was trying to extract
    data using regexes was encoded in ISO-8859-1 instead of UTF-8. The SQLite
    wrapper expects Unicode only, so it had a problem with some
    characters.

    For those interested, here's how I solved it, although there's likely
    a smarter way to do it:

    ============
    data = re_data.search(response)
    if data:
        name = data.group(1).strip()
        address = data.group(2).strip()

        #content="text/html; charset=iso-8859-1">
        name = name.decode('iso8859-1')
        address = address.decode('iso8859-1')

        sql = 'BEGIN;'
        sql = sql + 'UPDATE companies SET name=?,address=? WHERE id=?;'
        sql = sql + "COMMIT"

        try:
            cursor.execute(sql, (name,address,id) )
        except:
            print "Failed UPDATING"
            raise
    else:
        print "Pattern not found"
    ============
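
    The decode-at-the-boundary idea above can be tried in isolation with
    the sketch below. It is only an illustration under a few assumptions:
    it uses the standard library's sqlite3 module instead of the wrapper
    from this thread, the table layout and sample byte strings are invented,
    and the transaction is handled by the connection's context manager
    rather than by an explicit BEGIN/COMMIT string.

```python
import sqlite3

# Invented sample bytes, as they might arrive from a page that declares
# charset=iso-8859-1 (0xE9 is "é" and 0xC9 is "É" in that encoding).
raw_name = b"Caf\xe9 des Amis"
raw_address = b"3 rue de l'\xc9glise"

# Decode once, using the encoding the page declares, so that everything
# handed to the database afterwards is Unicode text.
name = raw_name.decode("iso8859-1")
address = raw_address.decode("iso8859-1")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)
conn.execute("INSERT INTO companies (id) VALUES (?)", (152,))

# Used as a context manager, the connection commits if the block succeeds
# and rolls back if it raises.
with conn:
    conn.execute(
        "UPDATE companies SET name=?, address=? WHERE id=?",
        (name, address, 152),
    )

stored = conn.execute(
    "SELECT name, address FROM companies WHERE id=?", (152,)
).fetchone()
```

    Decoding with the wrong codec is where the trouble starts: the same
    bytes are not valid UTF-8, so decoding them as "utf-8" would raise a
    UnicodeDecodeError instead of returning text.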

    Thanks again.
    Gilles Ganault, Mar 16, 2009
    #4
  5. On 2009-03-13, Johannes Bauer wrote:
    > Peter Otten wrote:
    >
    >> encoding = sys.stdout.encoding or "ascii"
    >> for row in rows:
    >>     id, address = row[:2]
    >>     print id, address.encode(encoding, "replace")
    >>
    >> Example:
    >>
    >> >>> u"ähnlich lölich üblich".encode("ascii", "replace")
    >> '?hnlich l?lich ?blich'

    >
    > A very good tip, Peter - I've also had this problem before and didn't
    > know about your solution.


    If you know beforehand that you will be using ASCII, you can eliminate
    the accents, so that you get the unaccented letter (followed by
    a question mark if you prefer) instead of just a question mark:

    >>> from unicodedata import normalize, combining
    >>> example = u"ähnlich lölich üblich"
    >>> normalised = normalize('NFKD', example)
    >>> normalised.encode("ascii", "replace")
    'a?hnlich lo?lich u?blich'
    >>> eliminated = u''.join(l for l in normalised if not combining(l))
    >>> eliminated.encode("ascii", "replace")
    'ahnlich lolich ublich'
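
    The session above can be wrapped into a small reusable function, as in
    this sketch. The name strip_accents is just an illustrative choice; the
    normalize/combining calls are the ones already shown.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks so accented letters become their base letters."""
    # NFKD splits e.g. "ä" into "a" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns a non-zero class for combining marks; keep the rest.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


example = u"\xe4hnlich l\xf6lich \xfcblich"  # "ähnlich lölich üblich"
# Encode with "replace" afterwards to catch anything that is still non-ASCII.
print(strip_accents(example).encode("ascii", "replace").decode("ascii"))
```

    Letters that have no decomposition, such as "ß", pass through unchanged
    and would still end up as "?" in the final ASCII encoding step.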

    --
    Antoon Pardon
    Antoon Pardon, Mar 16, 2009
    #5