print u"\u0432": why is this so hard? UnicodeEncodeError

Discussion in 'Python' started by Nelson Minar, Apr 8, 2004.

  1. Nelson Minar

    Nelson Minar Guest

    I have a simple goal. I want the following Python program to work:
    print u"\u0432"

    This program fails on my US Debian machine:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)

    Actually, I have a complex goal: I want my SOAPpy program to work when
    SOAPpy is in debug mode and is printing XML messages out to stdout.
    Solving the simple problem will solve the complex one. Since I'm using
    third party code, I can't go modify every print statement to call
    encode() explicitly.


    The simplest solution I've come up with is this:
    $ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"'

    That seems to work reasonably well in Python 2.3 (but not 2.2!). But
    then for some obscure reason if I redirect stdout in my shell it fails.
    $ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

    Why is that?


    The only solution I've found that really works is reassigning
    sys.stdout at the top of the script. That's an awful lot of work, but
    it's the best I can do for now.

    Why is Python not respecting my locale?


    Here's my test program:

    ----------------------------------------------------------------------

    #!/bin/bash -x

    # Obliterate locale
    for e in LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL; do
        unset "$e"
    done

    # Doing the obvious thing has nonobvious effects
    python2.3 -c 'print u"\u0432"' # fails, OK.
    LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"' # works!
    LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"' > /dev/null # fails, huh?

    # These both work, but what a pain!
    python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
    python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"' > /dev/null

    ----------------------------------------------------------------------

    And sample output:

    ----------------------------------------------------------------------

    ~/src/python/testUnicode.sh
    + unset LANG
    + unset LC_CTYPE
    + unset LC_NUMERIC
    + unset LC_TIME
    + unset LC_COLLATE
    + unset LC_MONETARY
    + unset LC_MESSAGES
    + unset LC_PAPER
    + unset LC_NAME
    + unset LC_ADDRESS
    + unset LC_TELEPHONE
    + unset LC_MEASUREMENT
    + unset LC_IDENTIFICATION
    + unset LC_ALL
    + python2.3 -c 'print u"\u0432"'
    Traceback (most recent call last):
    File "<string>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
    + LC_ALL=en_US.utf8
    + python2.3 -c 'print u"\u0432"'
    в
    + LC_ALL=en_US.utf8
    + python2.3 -c 'print u"\u0432"'
    Traceback (most recent call last):
    File "<string>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
    + python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
    в
    + python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
    Nelson Minar, Apr 8, 2004
    #1

  2. Jon Willeke

    Jon Willeke Guest

    Nelson Minar wrote:
    > I have a simple goal. I want the following Python program to work:
    > print u"\u0432"
    >
    > This program fails on my US Debian machine:
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)


    Try playing with sys.setdefaultencoding(). You'll need to reload(sys)
    to call it. If it solves your problem, you can make it permanent by
    modifying site.py.
    Jon Willeke, Apr 8, 2004
    #2

  3. Nelson Minar wrote:
    > I have a simple goal. I want the following Python program to work:
    > print u"\u0432"


    As you have discovered, this is not so simple. Printing this character
    might not be possible at all: If you have a terminal that just cannot
    display CYRILLIC SMALL LETTER VE, then there is absolutely no way to
    print the character - unless you change the terminal you use.

    > Actually, I have a complex goal: I want my SOAPpy program to work when
    > SOAPpy is in debug mode and is printing XML messages out to stdout.
    > Solving the simple problem will solve the complex one. Since I'm using
    > third party code, I can't go modify every print statement to call
    > encode() explicitly.


    This shows the real source of the problem. SOAPpy should not print the
    strings, but repr them. For debugging, repr is more reliable than str,
    as it can render virtually every object.
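
    A minimal sketch of the repr-for-debugging idea, assuming Python 3, where ascii() gives the escaped rendering that repr() gave for unicode objects in the Python 2 of this thread. The helper name debug_repr is hypothetical and is not part of SOAPpy:

```python
def debug_repr(value):
    """Return an ASCII-only rendering of value for debug output."""
    # ascii() escapes every non-ASCII character, so the result can be
    # written to any stream regardless of that stream's encoding.
    return ascii(value)


message = "\u0432"  # CYRILLIC SMALL LETTER VE
rendered = debug_repr(message)

# Encoding the rendering as ASCII cannot raise UnicodeEncodeError.
rendered.encode("ascii")
print(rendered)
```

    The character itself is no longer readable in the output, which is the trade-off: the debug line always prints, at the cost of showing an escape sequence.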

    > That seems to work reasonably well in Python 2.3 (but not 2.2!). But
    > then for some obscure reason if I redirect stdout in my shell it fails.
    > $ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null
    >
    > Why is that?


    Python 2.3 discovers the encoding of your terminal, and will display
    Unicode characters if the terminal supports them. Python 2.2 did not do
    that, and the new feature is mainly useful in interactive mode.

    When you redirect the output to a file, it is not a terminal anymore,
    and Python cannot guess the encoding.
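
    The terminal-versus-file distinction can be inspected from inside a script. This is a sketch assuming Python 3 (in Python 2.3 the encoding attribute was only set when the stream was a terminal); describe_stream is a hypothetical helper name:

```python
import sys


def describe_stream(stream):
    """Return (is_terminal, encoding) for a file-like object."""
    # Objects without isatty() are treated as "not a terminal".
    is_tty = hasattr(stream, "isatty") and stream.isatty()
    # Objects without an encoding attribute report None.
    encoding = getattr(stream, "encoding", None)
    return is_tty, encoding


is_tty, encoding = describe_stream(sys.stdout)
print("stdout is a terminal: %s, encoding: %r" % (is_tty, encoding))
```

    Running the script once directly and once with "> out.txt" shows the first value flip, which is the condition the explanation above refers to.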

    > The only solution I've found that really works is reassigning
    > sys.stdout at the top of the script. That's an awful lot of work, but
    > it's the best I can do for now.
    >
    > Why is Python not respecting my locale?


    It is: however, your locale only tells Python the encoding of your
    terminal, not the encoding of an arbitrary file you may write to.

    Assigning sys.stdout is the right thing to do. I'm uncertain why
    that could be an awful lot of work, as you do this only once...
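
    A sketch of the reassignment, assuming Python 3 and using an in-memory byte buffer in place of the real stdout so that the result can be inspected. In a real script the wrapped object would be sys.stdout.buffer (or sys.__stdout__ in the Python 2 of this thread); utf8_writer is a hypothetical helper name:

```python
import codecs
import io


def utf8_writer(binary_stream, errors="strict"):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream, errors)


# BytesIO stands in for the process's real byte-oriented stdout.
sink = io.BytesIO()
out = utf8_writer(sink)
out.write("\u0432\n")

encoded = sink.getvalue()
print(repr(encoded))
```

    Because the wrapper encodes explicitly, the result no longer depends on whether the output goes to a terminal or a file.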

    Regards,
    Martin
    Martin v. Löwis, Apr 8, 2004
    #3
  4. In article <c52lu9$j8c$03$-online.com>,
    "Martin v. Lowis" <> wrote:

    > > That seems to work reasonably well in Python 2.3 (but not 2.2!). But
    > > then for some obscure reason if I redirect stdout in my shell it fails.
    > > $ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null
    > >
    > > Why is that?

    >
    > Python 2.3 discovers the encoding of your terminal, and will display
    > Unicode characters if the terminal supports them. Python 2.2 did not do
    > that, and the new feature is mainly useful in interactive mode.


    Py2.3 sure doesn't discover the encoding of my terminal automatically:

    hyperbolic ~: python
    Python 2.3 (#1, Sep 13 2003, 00:49:11)
    [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print u"\u0432".encode('utf8')

    2
    >>> print u"\u0432"

    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character '\u432' in
    position 0: ordinal not in range(128)

    (that 2 was something else looking like an angular capital B before I
    copied and pasted it into my newsreader...and yes, utf8 is the correct
    encoding.)

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Apr 8, 2004
    #4
  5. Nelson Minar

    Nelson Minar Guest

    Thanks for your answers.

    "Martin v. Löwis" <> writes:
    >As you have discovered, this is not so simple. Printing this character
    >might not be possible at all


    I know. I'm just trying to figure out an expedient way to get Python
    to make a best effort.

    > When you redirect the output to a file, it is not a terminal anymore,
    > and Python cannot guess the encoding.


    So when Python can't guess the encoding, it assumes that ASCII is the
    best it can do? Even as an American that annoys me; what do folks who
    need non-ASCII do in practice? Martin, what do you do when you write a
    Python script that prints your own name?

    I guess what I'd like is a way to set Python's default encoding and
    have that respected for files, terminals, etc. I'd also like some way
    to override the Unicode error mode. 'strict' is the right default, but
    I'd like the option to do 'ignore' or 'replace' globally.
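
    For reference, this is what the three error modes do on a single encode() call. It is only a per-call sketch, assuming Python 3, and not the global switch asked for above:

```python
text = "L\u00f6wis \u0432"

# 'strict' is the default: the first unencodable character raises.
strict_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops unencodable characters.
ignored = text.encode("ascii", "ignore")

# 'replace' substitutes a question mark for each of them.
replaced = text.encode("ascii", "replace")

print(strict_failed, ignored, replaced)
```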

    > Assigning sys.stdout is the right thing to do. I'm uncertain why
    > that could be an awful lot of work, as you do this only once...


    Now that I know the trick I can do it. But part of the joy of Python
    is that it makes simple things simple. For a beginner to the language
    having to learn about the difference between sys.stdout and
    sys.__stdout__ seems a bit much.
    Nelson Minar, Apr 8, 2004
    #5
  6. David Eppstein wrote:
    > Py2.3 sure doesn't discover the encoding of my terminal automatically:
    >
    > hyperbolic ~: python
    > Python 2.3 (#1, Sep 13 2003, 00:49:11)
    > [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin


    Ah, Darwin. You lose.

    If anybody can tell me how to programmatically discover the encoding
    of Terminal.App, I'll happily incorporate a change into a 2.3.x release.

    Regards,
    Martin
    Martin v. Löwis, Apr 8, 2004
    #6
  7. Nelson Minar wrote:
    > So when Python can't guess the encoding, it assumes that ASCII is the
    > best it can do? Even as an American that annoys me;


    In general, it uses the default encoding which, by default, is ASCII.

    This has been chosen after long discussions, which discovered that any
    other guess is wrong under likely circumstances (the specific
    circumstances depending on what the guess is). If the guess is wrong,
    you end up with mojibake (nonsense characters), which is very hard
    to trace back to its source.

    In the face of ambiguity, refuse the temptation to guess.

    ASCII is the only guess that has no significant risk of ambiguity:
    if something encodes successfully as ASCII, it would encode to the
    very same byte sequence in nearly any other encoding.
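
    The claim can be checked directly. This sketch assumes Python 3; the list of encodings is an arbitrary sample, and UTF-16 is included as one of the exceptions that "nearly" allows for:

```python
sample = "print"

# A handful of ASCII-compatible encodings, chosen only for illustration.
encodings = ["ascii", "utf-8", "latin-1", "cp1251", "koi8-r", "euc-kr"]
encoded = {name: sample.encode(name) for name in encodings}

# Pure-ASCII text yields identical bytes under each of them.
all_same = all(data == b"print" for data in encoded.values())

# UTF-16 is not ASCII-compatible, so the bytes differ there.
utf16 = sample.encode("utf-16-le")

print(all_same, utf16)
```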

    > what do folks who
    > need non-ASCII do in practice? Martin, what do you do when you write a
    > Python script that prints your own name?


    It depends. If I print to the terminal, I use Unicode. If I print to
    XML, I use Unicode, and expect that the XML writer will pick some
    encoding, using XML character references if the o-umlaut cannot be
    encoded. If I print to HTML, I make sure an explicit META tag has
    been added to denote the document as Latin-1, or I use &ouml;.
    If I print to a log file, I explicitly use Latin-1, unless I know
    that the encoding of that log file is meant to be UTF-8. And so on.
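
    A sketch of those per-destination choices, assuming Python 3. The 'xmlcharrefreplace' error handler is a standard codec error handler; the variable names are only for illustration:

```python
name = "Martin v. L\u00f6wis"

# XML with an ASCII-only encoding: fall back to a character reference.
as_xml_ascii = name.encode("ascii", "xmlcharrefreplace")

# A log file known to be Latin-1: one byte for the o-umlaut.
as_latin1 = name.encode("latin-1")

# A log file known to be UTF-8: two bytes for the same character.
as_utf8 = name.encode("utf-8")

print(as_xml_ascii, as_latin1, as_utf8)
```

    The same text produces three different byte strings, so the destination has to be known before the bytes are written.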

    It is not that Python is making that complicated, it is complicated
    by nature - until everybody switches to UTF-8, which may take another
    20 years or so.

    > I guess what I'd like is a way to set Python's default encoding and
    > have that respected for files, terminals, etc. I'd also like some way
    > to override the Unicode error mode. 'strict' is the right default, but
    > I'd like the option to do 'ignore' or 'replace' globally.


    Submit a patch that does that. I very much prefer to fix errors instead
    of ignoring them.

    Regards,
    Martin
    Martin v. Löwis, Apr 8, 2004
    #7
  8. On Thu, Apr 08, 2004 at 08:51:18PM +0200, "Martin v. Löwis" wrote:
    > David Eppstein wrote:
    > >Py2.3 sure doesn't discover the encoding of my terminal automatically:
    > >
    > >hyperbolic ~: python
    > >Python 2.3 (#1, Sep 13 2003, 00:49:11)
    > >[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin

    >
    > Ah, Darwin. You lose.
    >
    > If anybody can tell me how to programmatically discover the encoding
    > of Terminal.App, I'll happily incorporate a change into a 2.3.x release.
    >


    The encoding of the Darwin terminal can be discovered by the routine
    we currently have.

    perky$ LC_ALL=ko_KR.UTF-8 python
    Python 2.3 (#1, Sep 13 2003, 00:49:11)
    [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys;sys.stdin.encoding

    'UTF-8'
    >>> ^D

    perky$ LC_ALL=ko_KR.eucKR python
    Python 2.3 (#1, Sep 13 2003, 00:49:11)
    [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys;sys.stdin.encoding

    'eucKR'


    Regards,
    Hye-Shik
    Hye-Shik Chang, Apr 9, 2004
    #8
  9. In article <>,
    Hye-Shik Chang <> wrote:

    > The encoding of the Darwin terminal can be discovered by the routine
    > we currently have.
    >
    > perky$ LC_ALL=ko_KR.UTF-8 python
    > Python 2.3 (#1, Sep 13 2003, 00:49:11)
    > [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import sys;sys.stdin.encoding

    > 'UTF-8'
    > >>> ^D

    > perky$ LC_ALL=ko_KR.eucKR python
    > Python 2.3 (#1, Sep 13 2003, 00:49:11)
    > [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> import sys;sys.stdin.encoding

    > 'eucKR'


    Well, no.

    Python 2.3 (#1, Sep 13 2003, 00:49:11)
    [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys;sys.stdin.encoding

    'US-ASCII'

    But, in fact, the encoding of my Terminal.App is utf8 (the usual
    default), as set in the Terminal Inspector (Terminal->Window
    Settings...) Display pane, Character Set Encoding menu.

    Possibly you can find Terminal's preferred encoding in
    Library/Preferences/com.apple.Terminal.plist (I just looked there, and
    don't see it, unless it's maybe the StringEncoding:4 line) but it can be
    changed from the default on a per-window basis and, like Martin, I don't
    know how to find out its current setting.

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Apr 9, 2004
    #9
  10. Hye-Shik Chang wrote:
    > The encoding of the Darwin terminal can be discovered by the routine
    > we currently have.
    >
    > perky$ LC_ALL=ko_KR.UTF-8 python


    But that requires the user to set LC_ALL correctly. I'd prefer
    that the standard installation of the system be supported,
    where LC_ALL is not set. In particular, Terminal.App supports
    changing its encoding through Settings, and I would like Python
    to detect the current settings at startup time.
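
    For comparison, this is roughly what a script can learn from the locale alone. It is a sketch assuming Python 3, and it only reflects the locale settings the interpreter sees, not whatever is chosen in the terminal's Settings dialog; locale_encoding is a hypothetical helper name:

```python
import locale


def locale_encoding():
    """Return the encoding Python associates with the current locale settings."""
    # Passing False asks Python not to call setlocale() itself.
    return locale.getpreferredencoding(False)


detected = locale_encoding()
print("locale suggests:", detected)
```

    The printed value depends on the environment and on the Python version, so no particular output should be expected.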

    Regards,
    Martin
    Martin v. Löwis, Apr 9, 2004
    #10
  11. David Eppstein wrote:
    > Well, no.
    >
    > Python 2.3 (#1, Sep 13 2003, 00:49:11)
    > [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    > Type "help", "copyright", "credits" or "license" for more information.
    >
    > >>> import sys;sys.stdin.encoding

    >
    > 'US-ASCII'


    You did not follow Hye-Shik's instructions closely enough. You
    failed to set the LANG environment variable before starting Python.

    Regards,
    Martin
    Martin v. Löwis, Apr 9, 2004
    #11
  12. In article <c55tkv$3s4$05$-online.com>,
    "Martin v. Lowis" <> wrote:

    > David Eppstein wrote:
    > > Well, no.
    > >
    > > Python 2.3 (#1, Sep 13 2003, 00:49:11)
    > > [GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
    > > Type "help", "copyright", "credits" or "license" for more information.
    > >
    > > >>> import sys;sys.stdin.encoding

    > >
    > > 'US-ASCII'

    >
    > You did not follow Hye-Shik's instructions closely enough. You
    > failed to set the LANG environment variable before starting Python.


    A system that requires me to manually set the LANG environment variable
    is no better than a system that requires me to manually define an
    encoding once in Python. It doesn't seem to answer your question about
    automatic determination of the Terminal's encoding.

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Apr 9, 2004
    #12
  13. David Eppstein wrote:
    > A system that requires me to manually set the LANG environment variable
    > is no better than a system that requires me to manually define an
    > encoding once in Python.


    Perhaps: depending on your criteria to judge "no better".
    It does demonstrate that Python can be made to determine the terminal's
    encoding, without changing any kind of Python or C source code.

    Regards,
    Martin
    Martin v. Löwis, Apr 10, 2004
    #13
  14. "Martin v. Löwis" <> writes:

    > David Eppstein wrote:
    > > Py2.3 sure doesn't discover the encoding of my terminal automatically:
    > > hyperbolic ~: python
    > > Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC 3.3 20030304 (Apple
    > > Computer, Inc. build 1495)] on darwin

    >
    > Ah, Darwin. You lose.
    >
    > If anybody can tell me how to programmatically discover the encoding
    > of Terminal.App, I'll happily incorporate a change into a 2.3.x release.


    The user can set it per-terminal, at runtime, with no notification to
    the running process that it has done so!

    You can't even find out the encoding in use via Apple Events, which is
    a bit of a surprise. Filing a bug with Apple might get that changed for
    10.4...

    Cheers,
    mwh

    --
    Reading Slashdot can [...] often be worse than useless, especially
    to young and budding programmers: it can give you exactly the wrong
    idea about the technical issues it raises.
    -- http://www.cs.washington.edu/homes/klee/misc/slashdot.html#reasons
    Michael Hudson, Apr 10, 2004
    #14
  15. Paul Prescod

    Paul Prescod Guest

    Nelson Minar wrote:
    >...
    >
    > So when Python can't guess the encoding, it assumes that ASCII is the
    > best it can do? Even as an American that annoys me; what do folks who
    > need non-ASCII do in practice? Martin, what do you do when you write a
    > Python script that prints your own name?
    >
    > I guess what I'd like is a way to set Python's default encoding and
    > have that respected for files, terminals, etc. I'd also like some way
    > to override the Unicode error mode. 'strict' is the right default, but
    > I'd like the option to do 'ignore' or 'replace' globally.


    The Python community has traditionally discouraged machine-specific
    configuration. The more you depend on the machine configuration the more
    likely you are to have problems when you move your program from one
    computer to another.

    The bug is in the third-party module that does not deal properly with
    Unicode data!

    >...
    > Now that I know the trick I can do it. But part of the joy of Python
    > is that it makes simple things simple. For a beginner to the language
    > having to learn about the difference between sys.stdout and
    > sys.__stdout__ seems a bit much.


    Agreed: the module should handle it for you.

    Paul Prescod
    Paul Prescod, Apr 11, 2004
    #15
  16. Michael Hudson wrote:
    > The user can set it per-terminal, at runtime, with no notification to
    > the running process that it has done so!


    I would find it acceptable to say "don't do that, then", here. However,
    changing the encoding should be visible to new programs running in the
    terminal.

    I wonder how much would break if Python would assume the terminal
    encoding is UTF-8 on Darwin. Do people use different terminal
    encodings?

    Regards,
    Martin
    Martin v. Löwis, Apr 11, 2004
    #16
  17. Martin> I wonder how much would break if Python would assume the
    Martin> terminal encoding is UTF-8 on Darwin. Do people use different
    Martin> terminal encodings?

    I generally use xterm instead of Terminal.app. I think its encoding is
    latin-1.

    Skip
    Skip Montanaro, Apr 12, 2004
    #17
  18. In article <c5cgq3$tps$00$-online.com>,
    "Martin v. Lowis" <> wrote:

    > Michael Hudson wrote:
    > > The user can set it per-terminal, at runtime, with no notification to
    > > the running process that it has done so!

    >
    > I would find it acceptable to say "don't do that, then", here. However,
    > changing the encoding should be visible to new programs running in the
    > terminal.
    >
    > I wonder how much would break if Python would assume the terminal
    > encoding is UTF-8 on Darwin. Do people use different terminal
    > encodings?


    Now I'm curious -- how do you even find out it's a Terminal window
    you're looking at, rather than say an xterm?

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Apr 12, 2004
    #18
  19. In article <>,
    David Eppstein <> wrote:

    > > I wonder how much would break if Python would assume the terminal
    > > encoding is UTF-8 on Darwin. Do people use different terminal
    > > encodings?

    >
    > Now I'm curious -- how do you even find out it's a Terminal window
    > you're looking at, rather than say an xterm?


    Never mind, I just did a printenv and saw
    TERM_PROGRAM=Apple_Terminal

    But I also saw
    __CF_USER_TEXT_ENCODING=0x1F5:0:0
    ...I wonder who's putting that there?

    --
    David Eppstein http://www.ics.uci.edu/~eppstein/
    Univ. of California, Irvine, School of Information & Computer Science
    David Eppstein, Apr 12, 2004
    #19
  20. "Martin v. Löwis" <> writes:

    > Michael Hudson wrote:
    > > The user can set it per-terminal, at runtime, with no notification to
    > > the running process that it has done so!

    >
    > I would find it acceptable to say "don't do that, then", here. However,
    > changing the encoding should be visible to new programs running in the
    > terminal.
    >
    > I wonder how much would break if Python would assume the terminal
    > encoding is UTF-8 on Darwin. Do people use different terminal
    > encodings?


    How long is a piece of string? *I* don't change the encoding very often.

    You could consider the output of 'defaults read com.apple.Terminal
    StringEncoding' or equivalent API calls but it's fairly opaque (it's
    '4' on my machine).

    I don't think $__CF_USER_TEXT_ENCODING has anything to do with
    terminals (CoreFoundation, more likely).

    Cheers,
    mwh

    --
    [1]For those of you who aren't aware "tossing" is a euphamism for,
    well, vigourously rubbing your love pole. You understand?
    Flogging the dolphin. Stretching the chicken's neck. Waving your
    magic wand. Basically, wanking. -- Just another Morfans SDA update
    Michael Hudson, Apr 12, 2004
    #20
