print() and unicode strings (python 3.1)

Discussion in 'Python' started by 7stud, Aug 24, 2009.

  1. 7stud

    7stud Guest

    ======python 2.6 ======
    import sys

    print sys.getdefaultencoding()

    s = u"\u20ac"
    print s.encode("utf-8")


    $ python2.6 1test.py
    ascii
    €


    =====python 3.1 =======
    import sys

    print(sys.getdefaultencoding())

    s = "€"
    print(s.encode("utf-8"))
    print(s)


    $ python3.1 1test.py
    utf-8
    b'\xe2\x82\xac'

    Traceback (most recent call last):
    File "1test.py", line 7, in <module>
    print(s)
    UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
    position 0: ordinal not in range(128)


    I don't understand why I'm getting an encode error in python 3.1.
    7stud, Aug 24, 2009
    #1

  2. > I don't understand why I'm getting an encode error in python 3.1.

    The default encoding is not relevant here at all. Look at
    sys.stdout.encoding.

    Regards,
    Martin
    Martin v. Löwis, Aug 24, 2009
    #2
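To make Martin's distinction concrete, here is a minimal sketch (not from the thread). It prints both encodings and then simulates an ASCII-configured stdout with an in-memory stream; the name `fake_stdout` is purely illustrative, and the sketch assumes a stream built with the ascii codec behaves like the one Python 3.1 creates under the C locale.

```python
import io
import sys

# The "default encoding" governs implicit str/bytes conversions inside the
# interpreter; on Python 3 it is always 'utf-8'.
print(sys.getdefaultencoding())

# The encoding print() actually uses is the one attached to the stream.
print(sys.stdout.encoding)

# Simulate a stdout opened with the ascii codec by wrapping an in-memory
# byte buffer. `fake_stdout` is an illustrative name, not a real API.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

try:
    print("\u20ac", file=fake_stdout)
    fake_stdout.flush()
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("print() failed:", exc)
```

The euro sign is written as the escape `"\u20ac"` so the example does not depend on the source file's encoding.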

  3. 7stud

    7stud Guest

    On Aug 24, 9:56 am, "Martin v. Löwis" wrote:
    > > I don't understand why I'm getting an encode error in python 3.1.

    >
    > The default encoding is not relevant here at all. Look at
    > sys.stdout.encoding.
    >
    > Regards,
    > Martin


    Hi,

    Thanks for the response. I get US-ASCII for both 2.6 and 3.1:

    ===python 3.1======
    import sys

    print(sys.stdout.encoding)


    $ python3.1 1test.py
    US-ASCII

    I can't figure out a way to programmatically set the encoding for
    sys.stdout. So where does that leave me? python 3.1 won't let me
    explicitly encode my unicode string, and python 3.1 implicitly does
    the encoding with the wrong codec. And why would any programmer rely
    on python 3.1's implicit encoding of unicode strings anyway?
    Presumably, different systems will have different encodings for
    sys.stdout, and some encodings might cause encode errors.
    7stud, Aug 24, 2009
    #3
  4. 7stud wrote:
    > python 3.1 won't let me
    > explicitly encode my unicode string


    Sure it does. But encoding a non-ASCII string to ASCII will necessarily fail.


    > and python 3.1 implicitly does
    > the encoding with the wrong codec.


    That's not a Python problem, though. Your terminal is configured for
    US-ASCII, so you can't output anything but US-ASCII characters.

    Change your terminal setup to e.g. UTF-8 and see how things start working.

    Stefan
    Stefan Behnel, Aug 24, 2009
    #4
  5. 7stud

    7stud Guest

    On Aug 24, 12:19 pm, Stefan Behnel wrote:
    > 7stud wrote:
    > > python 3.1 won't let me
    > > explicitly encode my unicode string

    >
    > Sure it does. But encoding a non-ASCII string to ASCII will necessarily fail.
    >


    As you should be able to see in the python 3.1 example I posted, I did
    not encode the string using the ascii codec. I encoded it with the
    utf-8 codec, and unfortunately in python 3.1 that creates a "bytes
    string", and print()'ing a bytes string does not produce human
    readable text.


    > > and python 3.1 implicitly does
    > > the encoding with the wrong codec.

    >
    > That's not a Python problem, though. Your terminal is configured for
    > US-ASCII, so you can't output anything but US-ASCII characters.
    >


    My terminal is configured for utf-8, and from the output of the python
    2.6 example I posted, it should be apparent that my terminal is
    capable of rendering the euro character.
    7stud, Aug 24, 2009
    #5
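The behaviour 7stud describes in this post can be reproduced in isolation. The sketch below (not from the thread) shows that encoding yields a bytes object, that the text form of a bytes object is its repr rather than the characters it encodes, and that decoding with the same codec recovers the original string.

```python
s = "\u20ac"                  # the euro sign, written as an escape
encoded = s.encode("utf-8")   # a bytes object, not text

# print() converts its arguments with str(); for bytes that is the repr.
shown = str(encoded)
print(shown)

# Decoding with the same codec gives back the original text.
round_tripped = encoded.decode("utf-8")
```

So `print(s.encode("utf-8"))` is working as designed: the bytes are correct UTF-8, but print() is showing a description of them instead of sending them to the terminal.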
  6. > I can't figure out a way to programmatically set the encoding for
    > sys.stdout. So where does that leave me?


    You should be setting the terminal encoding administratively, not
    programmatically.

    Regards,
    Martin
    Martin v. Löwis, Aug 24, 2009
    #6
  7. 7stud

    7stud Guest

    On Aug 24, 2:41 pm, "Martin v. Löwis" wrote:
    > > I can't figure out a way to programmatically set the encoding for
    > > sys.stdout.  So where does that leave me?

    >
    > You should be setting the terminal encoding administratively, not
    > programmatically.
    >


    The terminal encoding has always been utf-8. It was not set
    programmatically.

    It seems to me that python 3.1's string handling is broken.
    Apparently, in python 3.1 I am unable to explicitly set the encoding
    of a string and print() it out with the result being human readable
    text. On the other hand, if I let python do the encoding implicitly,
    python uses a codec I don't want it to.
    7stud, Aug 25, 2009
    #7
  8. Ned Deily

    Ned Deily Guest

    7stud wrote:

    > On Aug 24, 2:41 pm, "Martin v. Löwis" wrote:
    > > > I can't figure out a way to programmatically set the encoding for
    > > > sys.stdout.  So where does that leave me?

    > >
    > > You should be setting the terminal encoding administratively, not
    > > programmatically.
    > >

    >
    > The terminal encoding has always been utf-8. It was not set
    > programmatically.
    >
    > It seems to me that python 3.1's string handling is broken.
    > Apparently, in python 3.1 I am unable to explicitly set the encoding
    > of a string and print() it out with the result being human readable
    > text. On the other hand, if I let python do the encoding implicitly,
    > python uses a codec I don't want it to.


    If you are running on a Unix-y system, check your locale settings (LANG,
    LC.*, et al). I think you'll likely find that your locale is really not
    UTF-8. The following was on Python 3.1 on OS X 10.5, similar results
    on Debian Linux:

    $ cat t3.py
    import sys
    print(sys.stdout.encoding)
    s = "€"
    print(s.encode("utf-8"))
    print(s)

    $ export LANG=en_US.UTF-8
    $ python3.1 t3.py
    UTF-8
    b'\xe2\x82\xac'
    €

    $ export LANG=C
    $ python3.1 t3.py
    US-ASCII
    b'\xe2\x82\xac'
    Traceback (most recent call last):
    File "t3.py", line 7, in <module>
    print(s)
    UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
    position 0: ordinal not in range(128)

    --
    Ned Deily,
    Ned Deily, Aug 25, 2009
    #8
  9. 7stud

    7stud Guest

    On Aug 24, 10:09 pm, Ned Deily wrote:
    > 7stud wrote:
    > > On Aug 24, 2:41 pm, "Martin v. Löwis" wrote:
    > > > > I can't figure out a way to programmatically set the encoding for
    > > > > sys.stdout. So where does that leave me?

    >
    > > > You should be setting the terminal encoding administratively, not
    > > > programmatically.

    >
    > > The terminal encoding has always been utf-8. It was not set
    > > programmatically.

    >
    > > It seems to me that python 3.1's string handling is broken.
    > > Apparently, in python 3.1 I am unable to explicitly set the encoding
    > > of a string and print() it out with the result being human readable
    > > text. On the other hand, if I let python do the encoding implicitly,
    > > python uses a codec I don't want it to.

    >
    > If you are running on a Unix-y system, check your locale settings (LANG,
    > LC.*, et al). I think you'll likely find that your locale is really not
    > UTF-8. The following was on Python 3.1 on OS X 10.5, similar results
    > on Debian Linux:
    >
    > $ cat t3.py
    > import sys
    > print(sys.stdout.encoding)
    > s = "€"
    > print(s.encode("utf-8"))
    > print(s)
    >
    > $ export LANG=en_US.UTF-8
    > $ python3.1 t3.py
    > UTF-8
    > b'\xe2\x82\xac'
    > €
    >
    > $ export LANG=C
    > $ python3.1 t3.py
    > US-ASCII
    > b'\xe2\x82\xac'
    > Traceback (most recent call last):
    > File "t3.py", line 7, in <module>
    > print(s)
    > UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
    > position 0: ordinal not in range(128)
    >
    > --
    > Ned Deily,
    >


    Hi,

    Thanks for the response. My OS is mac osx 10.4.11. I'm not really
    sure how to check my locale settings. Here is some stuff I tried:

    $ echo $LANG

    $ echo $LC_ALL

    $ echo $LC_CTYPE

    $ locale
    LANG=
    LC_COLLATE="C"
    LC_CTYPE="C"
    LC_MESSAGES="C"
    LC_MONETARY="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_ALL="C"

    $ man locale
    ....
    ....
    ....

    ENVIRONMENT:
    LANG
    Used as a substitute for any unset LC_* variable. If LANG is unset it
    will act as if set to "C". If any of LANG or LC_* are set to invalide
    values locale acts as if they are all unset.

    ===========

    As in your last example, my 'C' settings mean that an ascii codec is
    used somewhere to encode() the unicode string.

    --
    The locale C or POSIX is a portable locale; its LC_CTYPE part
    corresponds to the 7-bit ASCII character set.

    http://linux.about.com/library/cmd/blcmdl3_setlocale.htm
    --


    Is this the way it works:


    1) python sets the codec for sys.stdout to the LANG environment
    variable.
    2) It doesn't matter that my terminal's encoding is set to utf-8
    because output has to pass through sys.stdout first.

    So:

    a) My terminal's environment is telling python (and all other programs
    running in the terminal) that output sent to sys.stdout must be
    encoded in ascii.
    b) The solution is to set a LANG environment variable.


    Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?


    Previously, I've set environment variables that I want to be
    permanent, e.g PATH, in ~/.bash_profile, so I did this:

    ~/.bash_profile:
    --------------
    ....
    ....
    LANG="en_US.UTF-8"
    export LANG

    and now python 3.1 acts like I expect it to:

    -------
    import locale
    import sys

    print(locale.getlocale(locale.LC_CTYPE))
    print(sys.stdout.encoding)


    s = "€"
    print(s)

    print(s.encode("utf-8"))

    --output:--
    ('en_US', 'UTF8')
    UTF-8
    €
    b'\xe2\x82\xac'
    ----------

    In conclusion, as far as I can tell, if your python 3.1 program tries
    to output a unicode string, and the unicode string cannot be encoded
    by the codec specified in the user's LANG environment variable**, then
    the user will get an encode error. Just because the programmer's
    system can handle the output doesn't mean that another user's system
    can. I guess that's the way it goes: if a user's environment is
    telling all programs that it only wants ascii output to go to the
    screen (sys.stdout), you can't (or shouldn't) do anything about it.

    **Or if the LANG environment variable is not present, then the codec
    corresponding to the locale settings ('C' corresponds to ascii).

    some good locale info:
    http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html
    7stud, Aug 25, 2009
    #9
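One way a program might degrade gracefully instead of crashing on an ASCII-only stream is to give the stream a lenient error handler. This is a sketch of an option the thread does not discuss, using the standard `errors` argument of `io.TextIOWrapper`; the in-memory buffer stands in for a real stdout and the variable names are illustrative.

```python
import io

# An in-memory stand-in for a stdout whose locale only allows ASCII.
buf = io.BytesIO()
lenient = io.TextIOWrapper(
    buf,
    encoding="ascii",
    errors="backslashreplace",  # escape unencodable characters instead of raising
    newline="\n",               # keep "\n" as-is on every platform
)

print("price: \u20ac5", file=lenient)
lenient.flush()
written = buf.getvalue()
print(written)
```

The user sees an escape sequence in place of the euro sign, which respects the environment's request for ASCII while still producing output.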
  10. Nobody

    Nobody Guest

    On Tue, 25 Aug 2009 03:41:54 -0700, 7stud wrote:

    > Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?


    Because the variables aren't set.

    The default locale for a particular category (e.g. LC_CTYPE) is taken from
    $LC_ALL if that is set, otherwise $LC_CTYPE, otherwise $LANG, otherwise
    "C" is used.

    Normally, you would either set LANG (and possibly some individual LC_*
    variables), or LC_ALL. There's no point in setting all of them.
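The lookup order just described can be modelled as a small function. This is only a sketch of the precedence rule, not how the C library or Python implements it; `effective_ctype_locale` is a hypothetical helper, and treating an empty variable the same as an unset one is an assumption based on the usual POSIX behaviour.

```python
def effective_ctype_locale(environ):
    """Return the locale name that applies to LC_CTYPE for the given
    environment mapping, checking LC_ALL, then LC_CTYPE, then LANG,
    and falling back to "C"."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # unset or empty variables are skipped (assumption)
            return value
    return "C"


nothing_set = effective_ctype_locale({})
lang_only = effective_ctype_locale({"LANG": "en_US.UTF-8"})
lc_all_wins = effective_ctype_locale({"LANG": "en_US.UTF-8", "LC_ALL": "C"})
```

With nothing set, as on 7stud's machine, the result is "C", which is why `locale` reported "C" for every category even though the variables themselves echoed as blank.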

    > In conclusion, as far as I can tell, if your python 3.1 program tries
    > to output a unicode string, and the unicode string cannot be encoded
    > by the codec specified in the user's LANG environment variable**, then
    > the user will get an encode error. Just because the programmer's
    > system can handle the output doesn't mean that another user's system
    > can. I guess that's the way it goes: if a user's environment is
    > telling all programs that it only wants ascii output to go to the
    > screen (sys.stdout), you can't (or shouldn't) do anything about it.
    >
    > **Or if the LANG environment variable is not present, then the codec
    > corresponding to the locale settings ('C' corresponds to ascii).


    The underlying OS primitive can only handle bytes. If you read or write a
    (unicode) string, Python needs to know which encoding is used. For Python
    file objects created by the user (via open() etc), you can specify the
    encoding; for those created by the runtime (e.g. sys.stdin), Python uses
    the locale's LC_CTYPE category to select an encoding.

    Data written to or read from text streams is encoded or decoded using the
    stream's encoding. Filenames are encoded and decoded using the
    filesystem encoding (sys.getfilesystemencoding()). Anything else uses the
    default encoding (sys.getdefaultencoding()).

    In Python 3, text streams are handled using io.TextIOWrapper:

    http://docs.python.org/3.1/library/io.html#text-i-o

    This implements a stream which can read and/or write text data on top of
    one which can read and/or write binary data. The sys.std{in,out,err}
    streams are instances of TextIOWrapper. You can get the underlying
    binary stream from the "buffer" attribute, e.g.:

    sys.stdout.buffer.write(b'hello world\n')

    If you need to force a specific encoding (e.g. if the user has specified
    an encoding via a command-line option), you can detach the existing
    wrapper and create a new one, e.g.:

    sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = new_encoding)
    Nobody, Aug 25, 2009
    #10
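The detach-and-rewrap technique from this post can be tried safely on an in-memory stream before applying it to sys.stdout. In the sketch below, `raw` stands in for the binary layer under stdout and `stream` for sys.stdout itself; both names are illustrative.

```python
import io

raw = io.BytesIO()  # stands in for the binary stream under sys.stdout
stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")

# Swap the codec: detach() hands back the binary layer, and a new wrapper
# is built on top of it with the desired encoding.
new_encoding = "utf-8"
stream = io.TextIOWrapper(stream.detach(), encoding=new_encoding, newline="\n")

stream.write("\u20ac\n")
stream.flush()
written = raw.getvalue()
print(written)
```

Once detach() has been called the old wrapper is unusable, so the new one should replace it immediately. On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` changes the encoding in place without rebuilding the wrapper.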
  11. >>>>> 7stud (7) wrote:

    >7> Thanks for the response. My OS is mac osx 10.4.11. I'm not really
    >7> sure how to check my locale settings. Here is some stuff I tried:


    >7> $ echo $LANG


    >7> $ echo $LC_ALL


    >7> $ echo $LC_CTYPE


    >7> $ locale
    >7> LANG=
    >7> LC_COLLATE="C"
    >7> LC_CTYPE="C"
    >7> LC_MESSAGES="C"
    >7> LC_MONETARY="C"
    >7> LC_NUMERIC="C"
    >7> LC_TIME="C"
    >7> LC_ALL="C"


    IIRC, Mac OS X 10.4 does not set LANG or LC_* automatically. In 10.5
    Terminal has an option in the preferences to set LANG according to the
    encoding chosen (and presumably the language of the user).

    --
    Piet van Oostrum
    URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
    Piet van Oostrum, Aug 26, 2009
    #11
  12. 7stud

    7stud Guest

    On Aug 25, 6:34 am, Nobody wrote:
    > The underlying OS primitive can only handle bytes. If you read or write a
    > (unicode) string, Python needs to know which encoding is used. For Python
    > file objects created by the user (via open() etc), you can specify the
    > encoding; for those created by the runtime (e.g. sys.stdin), Python uses
    > the locale's LC_CTYPE category to select an encoding.
    >
    > Data written to or read from text streams is encoded or decoded using the
    > stream's encoding. Filenames are encoded and decoded using the
    > filesystem encoding (sys.getfilesystemencoding()). Anything else uses the
    > default encoding (sys.getdefaultencoding()).
    >
    > In Python 3, text streams are handled using io.TextIOWrapper:
    >
    >        http://docs.python.org/3.1/library/io.html#text-i-o
    >
    > This implements a stream which can read and/or write text data on top of
    > one which can read and/or write binary data. The sys.std{in,out,err}
    > streams are instances of TextIOWrapper. You can get the underlying
    > binary stream from the "buffer" attribute, e.g.:
    >
    >         sys.stdout.buffer.write(b'hello world\n')
    >
    > If you need to force a specific encoding (e.g. if the user has specified
    > an encoding via a command-line option), you can detach the existing
    > wrapper and create a new one, e.g.:
    >
    >         sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = new_encoding)


    Thanks for the details.
    7stud, Aug 26, 2009
    #12
  13. Dave P

    Dave P

    Joined:
    Oct 26, 2010
    Messages:
    1
    PYTHONIOENCODING

    You may want to try setting the environment variable PYTHONIOENCODING to "utf_8". I have written a webpage, daveagp.wordpress.com/what-a-character/, with some details on my ordeal with this problem.
    Dave P, Oct 26, 2010
    #13
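The effect of PYTHONIOENCODING can be checked without touching the shell profile by launching a child interpreter with the variable set. This is a sketch under the assumption that the child is the same Python 3 interpreter that runs the script; `run_with_io_encoding` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def run_with_io_encoding(io_encoding):
    """Run a child interpreter that prints a euro sign, with PYTHONIOENCODING
    set to `io_encoding`. Returns (exit status, raw stdout bytes)."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "print('\\u20ac')"],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode, result.stdout


utf8_status, utf8_out = run_with_io_encoding("utf-8")
ascii_status, ascii_out = run_with_io_encoding("ascii")
print(utf8_status, utf8_out)
print(ascii_status, ascii_out)
```

With "utf-8" the child exits normally and emits the three UTF-8 bytes of the euro sign; with "ascii" it exits with a non-zero status because print() raises UnicodeEncodeError, the same failure reported at the top of the thread.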
