Long way around UnicodeDecodeError, or 'ascii' codec can't decode byte

Discussion in 'Python' started by Oleg Parashchenko, Mar 29, 2007.

  1. Hello,

    I'm working on a Unicode-aware application. I like to use "print" to
    debug programs, but in this case it was a nightmare. The most common
    result of "print" was:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
    0: ordinal not in range(128)

    I spent two hours fixing it, and I hope it's done. The solution is one
    of the ugliest hacks I have ever written, but it solves the pain. The
    full story and the code are on my blog:

    http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/

    --
    Oleg Parashchenko olpa@ http://uucode.com/
    http://uucode.com/blog/ Generative Programming, XML, TeX, Scheme
    http://tohtml.com/ Online syntax highlighting
     
    Oleg Parashchenko, Mar 29, 2007
    #1

  2. Oleg  Parashchenko

    Paul Boddie Guest

    On 29 Mar, 06:26, "Oleg Parashchenko" <> wrote:
    > Hello,
    >
    > I'm working on a Unicode-aware application. I like to use "print" to
    > debug programs, but in this case it was a nightmare. The most common
    > result of "print" was:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
    > 0: ordinal not in range(128)


    What does sys.stdout.encoding say?

    > I spent two hours fixing it, and I hope it's done. The solution is one
    > of the ugliest hacks I have ever written, but it solves the pain. The
    > full story and the code are on my blog:
    >
    > http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/


    Calling sys.setdefaultencoding might not even help in this case, and
    the consensus is that it may be harmful to your code's portability
    [1]. Writing output to a terminal may be influenced by your locale,
    but I'm not convinced that going through all the locale settings and
    setting the character set is the best approach (or even the right
    one).

    What do you get if you do this...?

    import locale
    locale.setlocale(locale.LC_ALL, "")
    print locale.getlocale()

    What is your terminal encoding?

    Usually, if I'm wanting to print Unicode objects, I explicitly encode
    them into something I know the terminal will support. The codecs
    module can help with writing Unicode to streams in different
    encodings, too.
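
A minimal sketch of the explicit-encoding approach described above, written in Python 3 syntax (the thread itself uses Python 2). The helper name `encode_for_terminal` and the sample strings are assumptions made for illustration, and KOI8-R is only an example target encoding.

```python
import sys


def encode_for_terminal(text, encoding=None, errors="replace"):
    """Encode text to bytes for a terminal (hypothetical helper)."""
    # Use the stream's reported encoding, falling back to ASCII when
    # none is reported (for example, when output is redirected).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # errors="replace" substitutes "?" for characters the target
    # encoding cannot represent instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors)


greeting = u"\u043f\u0440\u0438\u0432\u0435\u0442"  # Russian "privet"
data = encode_for_terminal(greeting, "koi8-r")
fallback = encode_for_terminal(u"\u00e6", "koi8-r")  # U+00E6 is not in KOI8-R
```

In Python 3 the resulting bytes could be written with `sys.stdout.buffer.write(data)`; in Python 2 a plain `print` of the byte string would do.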

    Paul

    [1] http://groups.google.com/group/comp.lang.python/msg/431017a4cb4bb8ea
     
    Paul Boddie, Mar 29, 2007
    #2

  3. Hello,

    On Mar 29, 4:53 pm, "Paul Boddie" <> wrote:
    > On 29 Mar, 06:26, "Oleg Parashchenko" <> wrote:
    >
    > > Hello,
    > >
    > > I'm working on a Unicode-aware application. I like to use "print" to
    > > debug programs, but in this case it was a nightmare. The most common
    > > result of "print" was:
    > >
    > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
    > > 0: ordinal not in range(128)
    >
    > What does sys.stdout.encoding say?


    'KOI8-R'

    >
    > > I spent two hours fixing it, and I hope it's done. The solution is one
    > > of the ugliest hacks I have ever written, but it solves the pain. The
    > > full story and the code are on my blog:
    > >
    > > http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/

    >
    > Calling sys.setdefaultencoding might not even help in this case, and
    > the consensus is that it may be harmful to your code's portability
    > [1].


    Yes, but I think UTF-8 is now everywhere.

    > Writing output to a terminal may be influenced by your locale,
    > but I'm not convinced that going through all the locale settings and
    > setting the character set is the best approach (or even the right
    > one).
    >
    > What do you get if you do this...?
    >
    > import locale
    > locale.setlocale(locale.LC_ALL, "")
    > print locale.getlocale()


    ('ru_RU', 'koi8-r')

    >
    > What is your terminal encoding?


    koi8-r

    >
    > Usually, if I'm wanting to print Unicode objects, I explicitly encode
    > them into something I know the terminal will support. The codecs
    > module can help with writing Unicode to streams in different
    > encodings, too.


    As long as input/output is the only place where this is needed, it's
    fine to encode explicitly. But I also had problems, for example, with
    the md5 module, and I don't know the whole list of potentially
    problematic places. Therefore, I'd rather go with my brutal utf8ization.

    >
    > Paul
    >
    > [1] http://groups.google.com/group/comp.lang.python/msg/431017a4cb4bb8ea


    --
    Oleg Parashchenko olpa@ http://uucode.com/
    http://uucode.com/blog/ Generative Programming, XML, TeX, Scheme
    http://tohtml.com/ Online syntax highlighting
     
    Oleg Parashchenko, Mar 31, 2007
    #3
  4. Oleg  Parashchenko

    Jarek Zgoda Guest

    Re: Long way around UnicodeDecodeError, or 'ascii' codec can't decode byte

    Oleg Parashchenko wrote:

    >>> I spent two hours fixing it, and I hope it's done. The solution is one
    >>> of the ugliest hacks I have ever written, but it solves the pain. The
    >>> full story and the code are on my blog:
    >>> http://uucode.com/blog/2007/03/23/shut-up-you-dummy-7-bit-python/

    >> Calling sys.setdefaultencoding might not even help in this case, and
    >> the consensus is that it may be harmful to your code's portability
    >> [1].

    >
    > Yes, but I think UTF-8 is now everywhere.


    No, it is not. Your own system is "not ready for UTF-8", as you stated
    somewhere in this blog entry. How can you expect everybody else's system
    to be UTF-8 while "you are not ready for transition"?

    It would be better to write your programs in an encoding-agnostic way,
    using byte streams only for input and output (yes, printing a debug
    statement on a terminal *is* a kind of output). And, oh, you can't
    encode or decode text without knowing the encoding...
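
To illustrate that last point, here is a small sketch (in Python 3 syntax, whereas the thread uses Python 2) that decodes the same bytes under two different assumptions about their encoding. The byte values and variable names are chosen for illustration only.

```python
# Six bytes whose meaning depends entirely on the assumed encoding.
raw = b"\xd0\xd2\xc9\xd7\xc5\xd4"

# Interpreted as KOI8-R, the bytes are Cyrillic text.
as_koi8 = raw.decode("koi8-r")

# Interpreted as ISO-8859-15, the very same bytes are accented Latin letters.
as_latin9 = raw.decode("iso-8859-15")

# Neither call fails, so nothing warns you when the guess is wrong.
same_text = (as_koi8 == as_latin9)
```

Both calls succeed, which is why the encoding has to be known (or at least declared) at the point where bytes enter the program.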

    --
    Jarek Zgoda
    http://jpa.berlios.de/
     
    Jarek Zgoda, Mar 31, 2007
    #4
  5. Oleg  Parashchenko

    Paul Boddie Guest

    Oleg Parashchenko wrote:
    > On Mar 29, 4:53 pm, "Paul Boddie" <> wrote:
    > > On 29 Mar, 06:26, "Oleg Parashchenko" <> wrote:
    > > >
    > > > I'm working on a Unicode-aware application. I like to use "print" to
    > > > debug programs, but in this case it was a nightmare. The most common
    > > > result of "print" was:
    > > >
    > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xXX in position
    > > > 0: ordinal not in range(128)


    I think I've found the actual source of this, and it isn't the print
    statement. UnicodeDecodeError relates to the construction of Unicode
    objects, not the encoding of such objects as byte strings. The
    terminology is explained using this simple diagram (which hopefully
    won't be ruined in transmission):

    byte string in XYZ encoding
    |
    (decode from XYZ) --> possible UnicodeDecodeError
    |
    V
    Unicode object
    |
    (encode to ABC) --> possible UnicodeEncodeError
    |
    V
    byte string in ABC encoding
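
The two failure points in the diagram can be reproduced as follows. This is a sketch in Python 3 syntax (the thread uses Python 2), with ISO-8859-15 standing in for XYZ and UTF-8 for ABC; both choices are assumptions for the example.

```python
raw = b"\xe6\xf8\xe5"  # the three letters ae, o-slash, a-ring in ISO-8859-15

# Decode step with the wrong codec: bytes -> text raises UnicodeDecodeError.
decode_error = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# Decode step with the right codec succeeds and yields a text object.
text = raw.decode("iso-8859-15")

# Encode step to a codec that lacks the characters: text -> bytes
# raises UnicodeEncodeError.
encode_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

# Encode step to a codec that covers the characters succeeds.
utf8_bytes = text.encode("utf-8")
```

The error class tells you which direction the conversion was going: a decode error means bytes were being turned into text, even if no decode call appears in your own code.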

    > > What does sys.stdout.encoding say?

    >
    > 'KOI8-R'


    [...]

    > > What do you get if you do this...?
    > >
    > > import locale
    > > locale.setlocale(locale.LC_ALL, "")
    > > print locale.getlocale()

    >
    > ('ru_RU', 'koi8-r')
    >
    > >
    > > What is your terminal encoding?

    >
    > koi8-r


    Here's a transcript on my system answering the same questions:

    Python 2.4.1 (#2, Oct 4 2006, 16:53:35)
    [GCC 3.3.5 (Debian 1:3.3.5-8ubuntu2.1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.getlocale()
    (None, None)
    >>> locale.setlocale(locale.LC_ALL, "")
    'en_US.ISO-8859-15'
    >>> locale.getlocale()
    ('en_US', 'iso-8859-15')

    So Python knows about the locale. Note that neither of us uses UTF-8 as
    a system encoding.

    >>> import sys
    >>> sys.stdout.encoding
    'ISO-8859-15'
    >>> sys.stdin.encoding
    'ISO-8859-15'

    This tells us that Python could know things about writing Unicode
    objects out in the appropriate encoding. I wasn't sure whether Python
    was so smart about this, so let's see what happens...

    >>> print unicode("æøå")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

    Now this isn't anything to do with the print operation: what's
    happening here is that I'm explicitly making a Unicode object but
    haven't said what the encoding of my byte string is. The default
    encoding is 'ascii' as stated in the error message. None of the
    characters provided belong to the ASCII character set.

    We can check this by not printing anything out:

    >>> s = unicode("æøå")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

    So, let's try again and provide an encoding...

    >>> print unicode("æøå", sys.stdin.encoding)
    æøå

    Here, we've mentioned the encoding and even though the print statement
    is acting on a Unicode object, it seems to be happy to work out the
    resulting encoding.

    >>> print u"æøå"
    æøå

    Here, we've skipped the explicit Unicode object construction by using
    a Unicode literal, which works in this simple case.

    Of course, if your system encoding (along with the terminal) isn't
    capable of displaying every Unicode character, you'll experience
    problems doing the above. Frequently, it's interesting to encode
    things as UTF-8 and look at them in applications that are capable of
    displaying the text. Thus, you'd do something like this:

    import unicodedata

    (This gets an interesting function to help us look up characters in
    the Unicode database.)

    somefile = open("somefile.txt", "wb")
    print >>somefile, unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR").encode("utf-8")

    Or even this:

    import codecs
    somefile = codecs.open("somefile.txt", "wb", encoding="utf-8")
    print >>somefile, unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR")

    Here, we only specified the encoding once when opening the file. The
    file object accepts Unicode objects thereafter.
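
In Python 3 the same idea is usually expressed with `io.open` (which is the built-in `open` there), since it takes an `encoding` argument directly. The sketch below assumes Python 3 and writes to a temporary directory rather than the current one; the path handling is illustrative only.

```python
import io
import os
import tempfile
import unicodedata

# Look the character up by name in the Unicode database (U+180E).
char = unicodedata.lookup("MONGOLIAN VOWEL SEPARATOR")

# Hypothetical location for the example file.
path = os.path.join(tempfile.mkdtemp(), "somefile.txt")

# The encoding is stated once, when the file is opened; after that the
# file object accepts text and encodes it on the way out.
with io.open(path, "w", encoding="utf-8", newline="\n") as somefile:
    somefile.write(char + u"\n")

# Reading the file back in binary mode shows the UTF-8 bytes on disk.
with io.open(path, "rb") as somefile:
    raw = somefile.read()
```

Opening the file in text mode with an explicit encoding keeps the encode step in one place, which is the same design choice as the `codecs.open` version above.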

    > > Usually, if I'm wanting to print Unicode objects, I explicitly encode
    > > them into something I know the terminal will support. The codecs
    > > module can help with writing Unicode to streams in different
    > > encodings, too.

    >
    > As long as input/output is the only place where this is needed, it's
    > fine to encode explicitly. But I also had problems, for example, with
    > the md5 module, and I don't know the whole list of potentially
    > problematic places. Therefore, I'd rather go with my brutal utf8ization.


    It's best to decode (i.e. construct Unicode objects) upon receiving
    data as input, and to encode (i.e. convert Unicode objects to byte
    strings) upon producing output. You'd have to post example code for us
    to help you with the md5 module, but the problem may be that it assumes
    byte strings and doesn't work properly with Unicode objects. I can't
    say for sure, because I usually present byte strings to md5 module
    functions on the rare occasions I do anything with them. Note that one
    would usually calculate MD5 checksums on raw data, so it doesn't
    necessarily make much sense to present those functions with Unicode
    data, although I can imagine a hypothetical (if perhaps unrealistic)
    need to do so on Unicode text.
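
One way to make the md5 case explicit is to choose an encoding before hashing. The sketch below uses `hashlib`, which superseded the `md5` module in Python 2.5, and is written in Python 3 syntax; the sample text and the choice of encodings are assumptions for the example.

```python
import hashlib

text = u"\u00e6\u00f8\u00e5"  # the same three letters used earlier in the thread

# In Python 3, hashing a text object directly is rejected outright.
hash_error = None
try:
    hashlib.md5(text)
except TypeError as exc:
    hash_error = exc

# Encoding first makes the input unambiguous: the digest is computed
# over bytes, so the chosen encoding is part of what gets hashed.
digest_utf8 = hashlib.md5(text.encode("utf-8")).hexdigest()
digest_latin9 = hashlib.md5(text.encode("iso-8859-15")).hexdigest()
```

The two digests differ because the underlying bytes differ, so both sides of a checksum comparison need to agree on the encoding as well as on the text.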

    Paul
     
    Paul Boddie, Mar 31, 2007
    #5
