Re: 'ascii' codec can't encode character u'\xf3'

Discussion in 'Python' started by Martin Slouf, Aug 17, 2004.

  1. Martin Slouf

    Martin Slouf Guest

    i had similar errors:

    Traceback (most recent call last):
      File "/home/martin/skripty/accounts.py", line 125, in ?
        main(sys.argv)
      File "/home/martin/skripty/accounts.py", line 119, in main
        print_accounts(accounts, url_part)
      File "/home/martin/skripty/accounts.py", line 94, in print_accounts
        print str(i).encode("utf-8", "replace")
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 151-152: ordinal not in range(128)

    - - - -

    the solution seems to be:

    0. string is not in unicode encoding (assumption)
    1. before printing out, convert the string to unicode
    2. when printing, convert to whatever charset you like

    though i dont understand much why (ive solved it a minute ago :) the
    code should be:

    str = "any nonunicode string"
    print unicode(str).encode("iso-8859-2", "replace")

    comments:

    1. why the string is not in unicode can have several reasons -- i guess:
    - does ogg stores tags in unicode?
    - you have parsed an xml file with encoding attribute set (that
    is what i do)
    - etc

    2. "replace" parameter in encode causes non-printable chars to be
    replaced with '?' (you can use "ignore" or "strict", see your python
    doc)

    3. the above will work _only_ _if_ the 'str' encoding is "iso-8859-2" --
    a funny thing -- first line of code converts from unknown (but the
    programmer must know it) to unicode and the second one converts it back
    from unicode to unknown (now the programmer tells that secret to python
    :)

    4. i would like to know from any python expert whether/why/why not:

    * my assumptions are right

    * why is that behaviour? -- if you search google you get
    thousands of errors like this -- with no proper solutions i must add

    * is there an easier portable way (no sitecustomize.py changes)
    to do it

    * i was looking in site.py and there is deleted the
    sys.setdefaultencoding() function, but from the comments i do
    not know why -- you know it? why is user not allowed to change the
    default encoding? it seems reasonable to me if he/she could do that.

    thx.

    m.
    Martin Slouf, Aug 17, 2004
    #1

  2. Martin Slouf wrote:
    > the solution seems to be:
    >
    > 0. string is not in unicode encoding (assumption)
    > 1. before printing out, convert the string to unicode
    > 2. when printing, convert to whatever charset you like


    There is an alternative, if the print is a debug print:

    - print a repr() of the unicode object instead of
    the unicode object itself. This will work on all
    terminals, and show hex escapes of non-ASCII characters.

    > 1. why the string is not in unicode can have several reasons -- i guess:
    > - does ogg stores tags in unicode?
    > - you have parsed an xml file with encoding attribute set (that
    > is what i do)
    > - etc


    Correct.

    > 2. "replace" parameter in encode causes non-printable chars to be
    > replaced with '?' (you can use "ignore" or "strict", see your python
    > doc)


    Correct.
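A minimal sketch of the three error handlers, written for Python 3 as an assumption (the thread itself is about Python 2.3; Python 3's `str` plays the role of the `unicode` object here). Strictly speaking, the handler applies to characters the target codec cannot represent rather than to "non-printable" ones:

```python
# Python 3 sketch: str corresponds to the thread's unicode object.
text = "L\u00f6wis"  # "Löwis"; the o-umlaut has no ASCII equivalent

# "replace" substitutes '?' for every character the codec cannot encode.
replaced = text.encode("ascii", "replace")

# "ignore" silently drops those characters.
ignored = text.encode("ascii", "ignore")

# "strict" (the default) raises UnicodeEncodeError instead.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(replaced, ignored, strict_failed)
```

Both "replace" and "ignore" lose information, so they probably fit display or logging better than data that gets stored.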

    > 3. the above will work _only_ _if_ the 'str' encoding is "iso-8859-2" --
    > a funny thing -- first line of code converts from unknown (but the
    > programmer must know it) to unicode and the second one converts it back
    > from unicode to unknown (now the programmer tells that secret to python
    > :)


    No. unicode(text) uses the system default encoding
    (sys.getdefaultencoding()) which normally is ASCII.
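To illustrate why a bare unicode(text) fails on non-ASCII bytes, here is a rough Python 3 sketch (an assumption; in Python 3 there is no implicit conversion, so the ASCII decode that Python 2 performs behind the scenes is spelled out explicitly):

```python
# A byte string whose real encoding (ISO-8859-1) only the programmer knows.
raw = "L\u00f6wis".encode("iso-8859-1")  # b'L\xf6wis'

# Roughly what Python 2's unicode(raw) attempts when the default is ASCII.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the actual encoding makes the conversion succeed.
decoded = raw.decode("iso-8859-1")

print(ascii_ok, decoded == "L\u00f6wis")
```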

    Printing a Unicode string to a terminal should work fine if the terminal
    is properly configured. What that means depends on your operating
    system.

    > * my assumptions are right


    Most of them.

    >
    > * why is that behaviour? -- if you search google you get
    > thousands of errors like this -- with no proper solutions i must add


    There is a proper solution. Unfortunately, very similar yet different
    problems cause the same error message, and each problem has a different
    proper solution:

    - A Unicode error is raised when trying to combine a Unicode string
    and a byte string, if the byte string contains non-ASCII characters,
    e.g.

    u"Martin v. " + "Löwis"

    The proper solution is to convert the second string into a Unicode
    object, e.g. through

    unicode("Löwis", "iso-8859-1")

    - A unicode error is raised when a Unicode string is printed to
    a terminal. The proper solution is that the system administrator
    or the user should properly administer the locale, so that Python
    knows what characters the terminal can print. For characters that
    are then still non-printable, repr() is the proper solution.

    - A unicode error is raised when a library does not support Unicode
    for some reason. The proper solution is to fix the library. A
    proper work-around is to explicitly convert Unicode strings into
    the encoding that the library expects.
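The first and third cases above can be sketched as follows, in Python 3 terms (an assumption; `bytes` stands in for the Python 2 byte string and `str` for the unicode object). The helper name `to_text` is hypothetical and not part of any library:

```python
def to_text(value, encoding):
    """Return value as text, decoding byte strings with the stated encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Case 1: convert the byte string before combining it with text.
surname_bytes = b"L\xf6wis"  # ISO-8859-1 bytes, as in the example above
full_name = "Martin v. " + to_text(surname_bytes, "iso-8859-1")

# Case 3: a bytes-only library gets an explicit encode at the boundary.
for_library = full_name.encode("iso-8859-1")

print(ascii(full_name), for_library)
```

The idea is to keep text as text inside the program and convert only where bytes enter or leave it.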

    > * is there an easier portable way (no sitecustomize.py changes)
    > to do it


    Yes, see above.

    > * i was looking in site.py and there is deleted the
    > sys.setdefaultencoding() function, but from the comments i do
    > not know why -- you know it? why is user not allowed to change the
    > default encoding? it seems reasonable to me if he/she could do that.


    Yes, but that would not be a proper solution. It would mean that your
    script now only works on your system, and fails on a system where
    the default encoding has not been changed, or has been changed to
    something else. Users should use a proper solution instead.

    Regards,
    Martin
    "Martin v. Löwis", Aug 17, 2004
    #2

  3. Martin Slouf

    Martin Slouf Guest

    thank you for the reply, great info! it helped me to better understand it;
    but of course, some additional questions have arisen.

    maybe some of those question/comments may seem stupid (ie. clear), but
    im new to python and i want to assure myself i get it right; thx for
    patience.

    > There is an alternative, if the print is a debug print:
    >
    > - print a repr() of the unicode object instead of
    > the unicode object itself. This will work on all
    > terminals, and show hex escapes of non-ASCII characters.


    just to make sure:

    override the object's __repr__(self) method to something like:

    class my_string(string):
        def __repr__(self):
            tmp = unicode(self.attribute1 + " " + self.attribute2)
            return tmp

    and use 'my_string' class without any worries instead of classical
    string?

    >
    > No. unicode(text) uses the system default encoding
    > (sys.getdefaultencoding()) which normally is ASCII.
    >
    > Printing a Unicode string to a terminal should work fine if the terminal
    > is properly configured. What that means depends on your operating
    > system.


    my system is debian GNU/Linux stable, im using it for a very, very long
    time, though i did not change any terminal settings but the very
    basics. My locales are properly set, im using LC_* environment
    variables to set default locale to czech environment with ISO-8859-2
    charset. Terminal is capable of displaying 8bit charsets, im not sure
    about unicode charsets -- never tried, never needed. All other
    locale-sensitive programs are satisfied. (ie. java interpreter -- this
    should be much like python :)

    guess in germany it is quite the same, maybe ISO-8859-1 is preferred

    example output from my system:

    >>> import locale
    >>> loc = locale.getdefaultlocale()
    >>> loc
    ['cs_CZ', 'ISO8859-2']

    so i guess this is ok.

    but the problem may be in my 'site.py' where setting encoding
    according to my locale is done in a code like this:

    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]

    so i guess it is never done :(

    did you yourself change it? do you think this is the 'portable
    solution'? i guess not -- another system, another locale, maybe being in
    ascii is the best.

    >
    > >
    > > * why is that behaviour? -- if you search google you get
    > >thousands of errors like this -- with no proper solutions i must add

    >
    > There is a proper solution. Unfortunately, very similar yet different
    > problems cause the same error message, and each problem has a different
    > proper solution:
    >


    well, if a piece of information like the one you gave me was contained in
    the standard python documentation, there would probably be less
    misunderstanding about this issue.

    > - A Unicode error is raised when trying to combine a Unicode string
    > and a byte string, if the byte string contains non-ASCII characters,
    > e.g.
    >
    > u"Martin v. " + "Löwis"
    >
    > The proper solution is to convert the second string into a Unicode
    > object, e.g. through
    >
    > unicode("Löwis", "iso-8859-1")
    >


    if i use
    #! /usr/bin/env python
    # -*- coding: UTF-8 -*-
    at the beginning of my every script, the example above still has to
    be converted -- because of the iso-8859-1 you use in "Löwis"?

    what would change if i use
    #! /usr/bin/env python
    # -*- coding: ISO-8859-1 -*-
    ?

    can i omit the conversion (ie. is it done automatically for me as if
    i write
    u"Martin v. " + unicode("Löwis", "ISO-8859-1")
    )?

    > - A unicode error is raised when a Unicode string is printed to
    > a terminal. The proper solution is that the system administrator
    > or the user should properly administer the locale, so that Python
    > knows what characters the terminal can print. For characters that
    > are then still non-printable, repr() is the proper solution.


    see above for comments on my setting. if you have done such a
    customization (and it differs from mine) and you have experience with
    linux, may i ask you for recommendations?

    >
    > - A unicode error is raised when a library does not support Unicode
    > for some reason. The proper solution is to fix the library. A
    > proper work-around is to explicitly convert Unicode strings into
    > the encoding that the library expects.
    >


    dont understand -- which library? you meant for example the ogg vorbis
    c-library when used with python bindings? -- in that case, what can be
    done by me as a developer? -- to know what encoding is used and do the
    tricky things i did -- now properly understood:

    1. convert from "unknown" to unicode
    tmp = unicode("string", "library-charset-specification")

    2. print it like
    print tmp.encode("my-terminal-charset-specification")

    question:

    library-charset-specification can be omitted if i specify it in a
    comment at the very beginning of a script (as i guessed above) -- or
    my-terminal-charset-specification can be omitted if specified in comment
    -- or can i omit both if equal?

    if im about to use the __repr__(self) method, i would do the conversion
    inside that method and return tmp, as i tried above, right?

    >
    > > * i was looking in site.py and there is deleted the
    > >sys.setdefaultencoding() function, but from the comments i do
    > >not know why -- you know it? why is user not allowed to change the
    > >default encoding? it seems reasonable to me if he/she could do that.

    >
    > Yes, but that would not be a proper solution. It would mean that your
    > script now only works on your system, and fails on a system where
    > the default encoding has not been changed, or has been changed to
    > something else. Users should use a proper solution instead.


    i thought that every programmer could call his
    sys.setdefaultencoding() method at the start of the script to set it to
    whatever he needs. it should work on every system that has proper
    encoding files. (though in site.py is a comment on MS Windows -- it
    breaks that rule:)

    >
    > Regards,
    > Martin


    once again, thank you a lot.

    Regards,
    Martin (also :)
    Martin Slouf, Aug 17, 2004
    #3
  4. Paul Prescod

    Paul Prescod Guest

    Martin Slouf wrote:

    > thank you for reply, great info! it helped me to better understand it;
    > but of course, some additional questions have risen.
    >
    > maybe some of those question/comments may seem stupid (ie. clear), but
    > im new to python and i want to assure myself i get it right; thx for
    > patience.
    >
    >
    >>There is an alternative, if the print is a debug print:
    >>
    >>- print a repr() of the unicode object instead of
    >> the unicode object itself. This will work on all
    >> terminals, and show hex escapes of non-ASCII characters.

    >
    >
    > just to make sure:
    >
    > override the object's __repr__(self) method to st. like:


    No, he means that instead of

    print foo

    you write

    print repr(foo)

    Paul Prescod
    Paul Prescod, Aug 17, 2004
    #4
  5. John Roth

    John Roth Guest

    "Martin Slouf" <> wrote in message
    news:...
    > i had similar errors:
    >
    > Traceback (most recent call last):
    > File "/home/martin/skripty/accounts.py", line 125, in ?
    > main(sys.argv)
    > File "/home/martin/skripty/accounts.py", line 119, in main
    > print_accounts(accounts, url_part)
    > File "/home/martin/skripty/accounts.py", line 94, in print_accounts
    > print str(i).encode("utf-8", "replace")
    > UnicodeEncodeError: 'ascii' codec can't encode characters in position
    > 151-152: ordinal not in range(128)
    >
    > - - - -
    >
    > the solution seems to be:
    >
    > 0. string is not in unicode encoding (assumption)
    > 1. before printing out, convert the string to unicode
    > 2. when printing, convert to whatever charset you like
    >
    > though i dont understand much why (ive solved it a minute ago :) the
    > code should be:
    >
    > str = "any nonunicode string"
    > print unicode(str).encode("iso-8859-2", "replace")


    I think the terminology is backwards. If you use a unicode string
    (that is, u"foo") that string will be in unicode. That's what Python
    does with unicode strings. However,
    it can't be read or written as such - it has to be decoded
    from something else (utf-8, iso-8859-2, whatever)
    after being read, and encoded to something (utf-8, iso-8859-1,
    whatever) to be written.

    A string on disk isn't in "unicode"; it's always in some
    encoded format, which is usually utf-8. Or it's in some
    single-byte format such as iso-8859-1. Or a far eastern
    multi-byte format. A string only winds up in unicode
    when it's comfortably ensconced in a unicode string.
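A small sketch of that decode-on-input, encode-on-output pattern, in Python 3 (an assumption; there `bytes` is the encoded form and `str` is the unicode string). The function name `transcode` and the sample word are made up for illustration:

```python
def transcode(raw, source_encoding, target_encoding):
    """Decode bytes on the way in, encode the text again on the way out."""
    text = raw.decode(source_encoding)             # bytes -> unicode text
    return text, text.encode(target_encoding)      # unicode text -> bytes

# Czech sample word, stored on "disk" as ISO-8859-2 bytes.
original = "P\u0159\u00edli\u0161"
incoming = original.encode("iso-8859-2")

text, outgoing = transcode(incoming, "iso-8859-2", "utf-8")

# Same text, different byte sequences: 6 bytes in ISO-8859-2, 9 in UTF-8.
print(len(incoming), len(outgoing), text == original)
```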

    > comments:
    >
    > 1. why the string is not in unicode can have several reasons -- i guess:
    > - does ogg stores tags in unicode?
    > - you have parsed an xml file with encoding attribute set (that
    > is what i do)
    > - etc
    >
    > 2. "replace" parameter in encode causes non-printable chars to be
    > replaced with '?' (you can use "ignore" or "strict", see your python
    > doc)
    >
    > 3. the above will work _only_ _if_ the 'str' encoding is "iso-8859-2" --
    > a funny thing -- first line of code converts from unknown (but the
    > programmer must know it) to unicode and the second one converts it back
    > from unicode to unknown (now the programmer tells that secret to python
    > :)


    Well, the encoding declaration tells Python what to do with unicode
    string literals that it finds in the Python source. It doesn't do anything
    else.

    > 4. i would like to know from any python expert whether/why/why not:
    >
    > * my assumptions are right


    As I said above, the terminology is backwards. "Pure"
    unicode only exists in unicode strings. Everything else
    is some encoded character set or other in regular single
    byte strings, ***including unicode encoded as utf-8.***

    > * why is that behaviour? -- if you search google you get
    > thousands of errors like this -- with no proper solutions i must add


    There's a lot of confusion out there. Lots of people are under
    the impression that the encoding declaration somehow does
    something magical with unicode, when all (and I need to
    emphasize that, ALL) it does is decode the text inside unicode
    literals into unicode, using the declared encoding.
    Everything outside of unicode literals is treated as a stream
    of 8-bit bytes, regardless of the programmer's intentions.

    Before the encoding declaration, if you wanted to
    include unicode characters in your program you had
    to use an editor that encoded in utf-8 and put them
    in single byte strings, and then decode those strings
    into unicode strings. This was fairly error-prone since
    you could drop utf-8 encoded characters somewhere
    they didn't belong, causing very difficult to find bugs.

    > * is there an easier portable way (no sitecustomize.py changes)
    > to do it


    The best thing is to ignore the encoding declaration and
    write the program as if it wasn't there. On input you need
    to somehow determine the encoding of the data and then
    decode that into a unicode string; on output you need
    to do the reverse and encode the unicode string into a
    single byte string before writing it.

    You can simplify some of this by using the open
    function in the codecs module. That lets you
    declare the encoding on open so that the
    encoding and decoding happens transparently.
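A hedged sketch of the codecs.open approach, run here under Python 3 as an assumption (where the built-in open() also accepts an encoding argument, and where very recent releases may flag codecs.open as deprecated). The file name and sample word are invented for the example:

```python
import codecs
import os
import tempfile

text = "P\u0159\u00edli\u0161"  # sample Czech word (hypothetical data)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")

    # Encoding happens transparently on write ...
    with codecs.open(path, "w", encoding="iso-8859-2") as handle:
        handle.write(text)

    # ... and decoding happens transparently on read.
    with codecs.open(path, "r", encoding="iso-8859-2") as handle:
        read_back = handle.read()

    # The file itself holds plain ISO-8859-2 bytes.
    with open(path, "rb") as handle:
        raw = handle.read()

print(read_back == text, raw)
```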

    > * i was looking in site.py and there is deleted the
    > sys.setdefaultencoding() function, but from the comments i do
    > not know why -- you know it? why is user not allowed to change the
    > default encoding? it seems reasonable to me if he/she could do that.


    That's someone else's answer. I'm not going to get into
    the politics behind that, other than to say that there are
    very serious release to release compatibility considerations
    here.

    John Roth

    John Roth, Aug 17, 2004
    #5
  6. Martin Slouf wrote:
    >>- print a repr() of the unicode object instead of
    >> the unicode object itself. This will work on all
    >> terminals, and show hex escapes of non-ASCII characters.

    >
    >
    > just to make sure:
    >
    > override the object's __repr__(self) method to st. like:
    >
    > class my_string(string):
    >     def __repr__(self):
    >         tmp = unicode(self.attribute1 + " " + self.attribute2)
    >         return tmp
    >
    > and use 'my_string' class without any worries instead of classical
    > string?


    No. Assume yyy is a Unicode object which potentially contains
    non-printable characters. Instead of doing

    print yyy

    do

    print repr(yyy)
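For readers on Python 3 (an assumption; the advice above is for Python 2), repr() leaves printable non-ASCII characters as they are, so the built-in ascii() is probably the closest match to what print repr(yyy) does in the thread:

```python
yyy = "L\u00f6wis"  # a text object with one non-ASCII character

# ascii() returns a repr-style string with non-ASCII characters escaped,
# so the result can be printed on any terminal.
debug_form = ascii(yyy)

print(debug_form)
```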

    > my system is debian GNU/Linux stable, im using it for a very, very long
    > time, though i did not changed any terminal settings but the very
    > basics. My locales are properly set, im using LC_* environment
    > variables to set default locale to czech environment with ISO-8859-2
    > charset. Terminal is capable of displaying 8bit charsets, im not sure
    > about unicode charsets -- never tried, never needed.


    I see. Could it be that you are using Python 2.1, then? Because in
    Python 2.3, printing Czech characters to the terminal should work
    just fine. Please do

    Python 2.3.4 (#2, Aug 5 2004, 09:33:45)
    [GCC 3.3.4 (Debian 1:3.3.4-7)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.stdout.encoding

    'ISO-8859-15'

    > if 0:
    >     # Enable to support locale aware default string encodings.
    >     import locale
    >     loc = locale.getdefaultlocale()
    >     if loc[1]:
    >         encoding = loc[1]
    >
    > so i guess it is never done :(


    You don't need to change the default encoding. Instead,
    sys.stdout.encoding is used for printing to the terminal (in 2.3 and
    later).
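If you still want to encode by hand, a sketch along these lines uses the stream's own encoding rather than the default encoding. This is Python 3 code under assumptions: the helper name `encode_for_terminal` is hypothetical, and falling back to ASCII when the stream reports no encoding is just one possible choice:

```python
import sys

def encode_for_terminal(text, stream=None):
    """Encode text for a stream, replacing what its encoding cannot express."""
    if stream is None:
        stream = sys.stdout
    # Streams that are not terminals may report no encoding at all.
    encoding = getattr(stream, "encoding", None) or "ascii"
    return text.encode(encoding, "replace")

terminal_bytes = encode_for_terminal("P\u0159\u00edli\u0161")
print(sys.stdout.encoding, terminal_bytes)
```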

    > did you yourself changed it?


    No. It will work out of the box.

    > well, if a piece of information like you gave to me was contained in
    > standard python documentation, probably there will be less
    > misunderstanding about this issue.


    What piece specifically are you referring to? It is all mentioned
    in the standard Python documentation.

    > #! /usr/bin/env python
    > # -*- coding: UTF-8 -*-
    > at the beginning of my every script, the example above still has to
    > be converted -- because of the iso-8859-1 you use in "Löwis"?


    Yes, and no. Yes, it still has to be converted. UTF-8 is *not*
    Unicode; it is a byte encoding, and you cannot mix Unicode
    strings and byte strings. No, if I use UTF-8 in my source code,
    then "Löwis" will be encoded in UTF-8, not in ISO-8859-1.
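To make the difference concrete, a short Python 3 sketch (an assumption; the thread is about Python 2) shows that the same name becomes different byte strings under the two encodings, and what happens when UTF-8 bytes are decoded with the wrong one:

```python
name = "L\u00f6wis"

as_latin1 = name.encode("iso-8859-1")  # one byte for the o-umlaut
as_utf8 = name.encode("utf-8")         # two bytes for the o-umlaut

# Decoding UTF-8 bytes as ISO-8859-1 raises no error but yields garbage.
misread = as_utf8.decode("iso-8859-1")

print(as_latin1, as_utf8, ascii(misread))
```

Neither byte string is "Unicode"; each is one encoded form of the same text, and only the matching decode recovers it.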

    > can i omit the conversion (ie. is it done automatically for me as if
    > i write
    > u"Martin v. " + unicode("Löwis", "ISO-8859-1")
    > )?


    You can, but you shouldn't. So I won't tell you how you could do that.

    > dont understand -- which library?


    The ODBC library, for example, or PyQt.

    Regards,
    Martin
    "Martin v. Löwis", Aug 17, 2004
    #6
  7. Martin Slouf

    Martin Slouf Guest

    ok, thanks for your time while answering my questions.

    my python is

    Python 2.3.3 (#1, May 1 2004, 16:13:07)
    [GCC 3.2.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.stdout.encoding

    'ISO-8859-2'

    so im fine with it -- just a strange thing that it has used ascii, if
    sys default is ISO-8859-2.

    on the other hand: no matter now -- im 'overencoded' -- and i will
    explicitly call conversion function from now on in my python scripts
    (those are not programs :) to ensure myself everything is fine

    i see that the solution i came with was quite right, though i didnt
    much understand it. now i know how it works and im satisfied.

    thanks to all of you.

    martin.

    Martin Slouf, Aug 18, 2004
    #7
