print u"\u0432": why is this so hard? UnicodeEncodeError

Nelson Minar · Apr 7, 2004

I have a simple goal. I want the following Python program to work:
print u"\u0432"

This program fails on my US Debian machine:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)

Actually, I have a complex goal: I want my SOAPpy program to work when
SOAPpy is in debug mode and is printing XML messages out to stdout.
Solving the simple problem will solve the complex one. Since I'm using
third party code, I can't go modify every print statement to call
encode() explicitly.

The simplest solution I've come up with is this:
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"'

That seems to work reasonably well in Python 2.3 (but not 2.2!). But
then for some obscure reason if I redirect stdout in my shell it fails.
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

Why is that?

The only solution I've found that really works is reassigning
sys.stdout at the top of the script. That's an awful lot of work, but
it's the best I can do for now.

Why is Python not respecting my locale?

Here's my test program:

----------------------------------------------------------------------

#!/bin/bash -x

# Obliterate locale
for e in LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL; do
unset $e
done

# Doing the obvious thing has nonobvious effects
python2.3 -c 'print u"\u0432"' # fails, OK.
LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"' # works!
LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"' > /dev/null # fails, huh?

# These both work, but what a pain!
python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"' > /dev/null

----------------------------------------------------------------------

And sample output:

----------------------------------------------------------------------

~/src/python/testUnicode.sh
+ unset LANG
+ unset LC_CTYPE
+ unset LC_NUMERIC
+ unset LC_TIME
+ unset LC_COLLATE
+ unset LC_MONETARY
+ unset LC_MESSAGES
+ unset LC_PAPER
+ unset LC_NAME
+ unset LC_ADDRESS
+ unset LC_TELEPHONE
+ unset LC_MEASUREMENT
+ unset LC_IDENTIFICATION
+ unset LC_ALL
+ python2.3 -c 'print u"\u0432"'
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
+ LC_ALL=en_US.utf8
+ python2.3 -c 'print u"\u0432"'
в
+ LC_ALL=en_US.utf8
+ python2.3 -c 'print u"\u0432"'
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
+ python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
в
+ python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'

Jon Willeke · Apr 7, 2004

Nelson said:
I have a simple goal. I want the following Python program to work:
print u"\u0432"

This program fails on my US Debian machine:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)

Try playing with sys.setdefaultencoding(). You'll need to reload( sys )
to call it. If it solves your problem, you can make it permanent by
modifying site.py.

Martin v. Löwis · Apr 8, 2004

Nelson said:
I have a simple goal. I want the following Python program to work:
print u"\u0432"

As you have discovered, this is not so simple. Printing this character
might not be possible at all: If you have a terminal that just cannot
display CYRILLIC SMALL LETTER VE, then there is absolutely no way to
print the character - unless you change the terminal you use.

Actually, I have a complex goal: I want my SOAPpy program to work when
SOAPpy is in debug mode and is printing XML messages out to stdout.
Solving the simple problem will solve the complex one. Since I'm using
third party code, I can't go modify every print statement to call
encode() explicitly.

This shows the real source of the problem. SOAPpy should not print the
strings, but repr them. For debugging, repr is more reliable than str,
as it can render virtually every object.
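
A minimal sketch of the str-versus-repr point, written for Python 3 rather than the Python 2.3 discussed in this thread. In Python 2, repr() of a unicode string escapes every non-ASCII character; in Python 3 the built-in ascii() plays that role. The helper name debug_text is made up for illustration and is not part of SOAPpy.

```python
def debug_text(value):
    """Return an ASCII-only description of value, safe for any stdout encoding."""
    # ascii() escapes every non-ASCII character, the way repr() did for
    # unicode strings in Python 2.
    return ascii(value)


message = "\u0432"  # CYRILLIC SMALL LETTER VE
safe = debug_text(message)

# The escaped form survives even a strict ASCII encoder.
safe.encode("ascii")
print(safe)
```

Because the escaped text contains only ASCII, printing it cannot raise UnicodeEncodeError, whatever the terminal or redirect target is.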

That seems to work reasonably well in Python 2.3 (but not 2.2!). But
then for some obscure reason if I redirect stdout in my shell it fails.
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

Why is that?

Python 2.3 discovers the encoding of your terminal, and will display
Unicode characters if the terminal supports them. Python 2.2 did not do
that, and the new feature is mainly useful in interactive mode.

When you redirect the output to a file, it is not a terminal anymore,
and Python cannot guess the encoding.
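
The terminal-versus-file distinction can be inspected directly. The sketch below is written for Python 3, and describe_stream is a hypothetical helper name. In Python 2.3 the encoding attribute of sys.stdout is None once output is redirected; later Python versions choose an encoding for redirected output differently, so the values printed depend on the interpreter and environment.

```python
import io
import sys


def describe_stream(stream):
    """Return (is_a_terminal, encoding) for a file-like object."""
    is_tty = bool(getattr(stream, "isatty", lambda: False)())
    encoding = getattr(stream, "encoding", None)
    return is_tty, encoding


# Report on the real stdout; write to stderr so the report stays visible
# even when stdout is redirected.
sys.stderr.write("stdout: tty=%s encoding=%s\n" % describe_stream(sys.stdout))

# An in-memory stream is never a terminal.
print(describe_stream(io.StringIO()))
```

Running the script once normally and once with "> /dev/null" shows how the first field changes between the two cases.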

The only solution I've found that really works is reassigning
sys.stdout at the top of the script. That's an awful lot of work, but
it's the best I can do for now.

Why is Python not respecting my locale?

It is: however, your locale only tells Python the encoding of your
terminal, not the encoding of an arbitrary file you may write to.

Assigning sys.stdout is the right thing to do. I'm uncertain why
that could be an awful lot of work, as you do this only once...
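
For reference, a sketch of that one-time setup, written for Python 3 and demonstrated on an in-memory buffer so it can be run anywhere. The function name utf8_writer is an illustrative assumption; codecs.getwriter is the same call used in the test script earlier in the thread.

```python
import codecs
import io


def utf8_writer(byte_stream, errors="strict"):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream, errors=errors)


# Demonstrate on an in-memory buffer standing in for a redirected stdout.
buffer = io.BytesIO()
out = utf8_writer(buffer)
out.write("\u0432\n")
print(repr(buffer.getvalue()))

# In a real script the one-time setup would be something like:
#     sys.stdout = utf8_writer(sys.stdout.buffer)   # Python 3
#     sys.stdout = utf8_writer(sys.__stdout__)      # Python 2, as in the thread
```

The wrapper encodes explicitly, so it behaves the same whether stdout is a terminal or a file.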

Regards,
Martin

David Eppstein · Apr 8, 2004

That seems to work reasonably well in Python 2.3 (but not 2.2!). But
then for some obscure reason if I redirect stdout in my shell it fails.
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

Why is that?

Python 2.3 discovers the encoding of your terminal, and will display
Unicode characters if the terminal supports them. Python 2.2 did not do
that, and the new feature is mainly useful in interactive mode.

Py2.3 sure doesn't discover the encoding of my terminal automatically:

hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\u432' in
position 0: ordinal not in range(128)

(that 2 was something else looking like an angular capital B before I
copied and pasted it into my newsreader...and yes, utf8 is the correct
encoding.)

Nelson Minar · Apr 8, 2004

Thanks for your answers.

Martin v. Löwis said:
As you have discovered, this is not so simple. Printing this character
might not be possible at all

I know. I'm just trying to figure out an expedient way to get Python
to make a best effort.

When you redirect the output to a file, it is not a terminal anymore,
and Python cannot guess the encoding.

So when Python can't guess the encoding, it assumes that ASCII is the
best it can do? Even as an American that annoys me; what do folks who
need non-ASCII do in practice? Martin, what do you do when you write a
Python script that prints your own name?

I guess what I'd like is a way to set Python's default encoding and
have that respected for files, terminals, etc. I'd also like some way
to override the Unicode error mode. 'strict' is the right default, but
I'd like the option to do 'ignore' or 'replace' globally.
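
The thread does not show a global switch, but the per-call form of these error modes is the second argument to encode(). A small sketch in Python 3 syntax, with variable names chosen only for illustration:

```python
text = "\u0432 and ASCII"

# 'strict' (the default) raises on the first character it cannot encode.
try:
    text.encode("ascii")
    strict_result = "encoded"
except UnicodeEncodeError as exc:
    strict_result = "failed at position %d" % exc.start

# 'replace' substitutes a question mark; 'ignore' silently drops the character.
replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")

print(strict_result)
print(replaced)
print(ignored)
```

Both lenient modes lose information, which is part of why 'strict' is the default.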

Assigning sys.stdout is the right thing to do. I'm uncertain why
that could be an awful lot of work, as you do this only once...

Now that I know the trick I can do it. But part of the joy of Python
is that it makes simple things simple. For a beginner to the language
having to learn about the difference between sys.stdout and
sys.__stdout__ seems a bit much.

Martin v. Löwis · Apr 8, 2004

David said:
Py2.3 sure doesn't discover the encoding of my terminal automatically:

hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin

Ah, Darwin. You lose.

If anybody can tell me how to programmatically discover the encoding
of Terminal.App, I'll happily incorporate a change into a 2.3.x release.

Regards,
Martin

Martin v. Löwis · Apr 8, 2004

Nelson said:
So when Python can't guess the encoding, it assumes that ASCII is the
best it can do? Even as an American that annoys me;

In general, it uses the default encoding which, by default, is ASCII.

This has been chosen after long discussions, which discovered that any
other guess is wrong under likely circumstances (the specific
circumstances depending on what the guess is). If the guess is wrong,
you end up with mojibake (nonsense characters), which is very hard
to trace back to its source.

In the face of ambiguity, refuse the temptation to guess.

ASCII is the only guess that has no significant risk of ambiguity:
if something encodes successfully as ASCII, it would encode to the
very same byte sequence in nearly any other encoding.
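
A short sketch of that argument in Python 3 syntax. The list of codecs is an arbitrary sample of ASCII-compatible encodings, picked for illustration.

```python
ENCODINGS = ["ascii", "utf-8", "latin-1", "cp1252", "koi8_r"]

# Pure ASCII text encodes to the same bytes in each ASCII-compatible codec.
ascii_forms = set("Martin".encode(name) for name in ENCODINGS)

# A non-ASCII character does not: the bytes depend on the guess.
as_latin1 = "\u00f6".encode("latin-1")
as_utf8 = "\u00f6".encode("utf-8")

# Decoding with the wrong guess produces mojibake rather than an error.
mojibake = as_utf8.decode("latin-1")

print(ascii_forms, as_latin1, as_utf8, ascii(mojibake))
```

The wrong guess at the end fails silently: the two UTF-8 bytes of the o-umlaut come back as two unrelated Latin-1 characters.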

what do folks who
need non-ASCII do in practice? Martin, what do you do when you write a
Python script that prints your own name?

It depends. If I print to the terminal, I use Unicode. If I print to
XML, I use Unicode, and expect that the XML writer will pick some
encoding, using XML character references if the o-umlaut cannot be
encoded. If I print to HTML, I make sure an explicit META tag has
been added to denote the document as Latin-1, or I use &ouml;.
If I print to a log file, I explicitly use Latin-1, unless I know
that the encoding of that log file is meant to be UTF-8. And so on.
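
Two of those per-destination choices, sketched in Python 3 syntax. The temporary log file and the variable names are assumptions made for the example; the 'xmlcharrefreplace' error handler and the encoding argument of io.open are standard library features.

```python
import io
import os
import tempfile

name = "Martin v. L\u00f6wis"

# XML or HTML that must stay ASCII: fall back to character references.
xml_bytes = name.encode("ascii", "xmlcharrefreplace")

# Log file: choose the encoding explicitly instead of relying on a default.
handle, log_path = tempfile.mkstemp(suffix=".log")
os.close(handle)
with io.open(log_path, "w", encoding="latin-1", newline="\n") as log:
    log.write(name + "\n")
with io.open(log_path, "rb") as log:
    log_bytes = log.read()
os.remove(log_path)

print(xml_bytes)
print(log_bytes)
```

In both cases the writer decides the encoding; nothing is left for the interpreter to guess.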

It is not that Python is making that complicated, it is complicated
by nature - until everybody switches to UTF-8, which may take another
20 years or so.

I guess what I'd like is a way to set Python's default encoding and
have that respected for files, terminals, etc. I'd also like some way
to override the Unicode error mode. 'strict' is the right default, but
I'd like the option to do 'ignore' or 'replace' globally.

Submit a patch that does that. I very much prefer to fix errors instead
of ignoring them.

Regards,
Martin

Hye-Shik Chang · Apr 8, 2004

Martin v. Löwis said:
David said:

Py2.3 sure doesn't discover the encoding of my terminal automatically:

hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin

Ah, Darwin. You lose.

If anybody can tell me how to programmatically discover the encoding
of Terminal.App, I'll happily incorporate a change into a 2.3.x release.

The encoding of the Darwin terminal can be discovered by the routine
we currently have.

perky$ LC_ALL=ko_KR.UTF-8 python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
perky$ LC_ALL=ko_KR.eucKR python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
'eucKR'

Regards,
Hye-Shik

David Eppstein · Apr 9, 2004

Hye-Shik Chang said:
The encoding of the Darwin terminal can be discovered by the routine
we currently have.

perky$ LC_ALL=ko_KR.UTF-8 python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
perky$ LC_ALL=ko_KR.eucKR python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
'eucKR'

Well, no.

Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
'US-ASCII'

But, in fact, the encoding of my Terminal.App is utf8 (the usual
default), as set in the Terminal Inspector (Terminal->Window
Settings...) Display pane, Character Set Encoding menu.

Possibly you can find Terminal's preferred encoding in
Library/Preferences/com.apple.Terminal.plist (I just looked there, and
don't see it, unless it's maybe the StringEncoding:4 line) but it can be
changed from the default on a per-window basis and, like Martin, I don't
know how to find out its current setting.

Martin v. Löwis · Apr 9, 2004

Hye-Shik Chang said:
The encoding of the Darwin terminal can be discovered by the routine
we currently have.

perky$ LC_ALL=ko_KR.UTF-8 python

But that requires the user to set LC_ALL correctly. I'd prefer it
if the standard installation of the system were supported,
where LC_ALL is not set. In particular, Terminal.App supports
changing its encoding through Settings, and I would like Python
to detect the current settings at startup time.
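
For comparison, a sketch of asking the locale which encoding it implies, in Python 3 syntax. This is not claimed to be the routine Python 2.3 uses internally, and locale_encoding is a hypothetical helper name. It only reflects LC_ALL, LC_CTYPE and LANG, so it cannot see a setting changed inside Terminal.app.

```python
import locale


def locale_encoding():
    """Return the character encoding implied by LC_ALL / LC_CTYPE / LANG."""
    try:
        # Adopt the user's locale from the environment, as a C program would.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown or uninstalled locale name: keep the current locale.
        pass
    return locale.getpreferredencoding(False)


print(locale_encoding())
```

Running it with and without LC_ALL set in the environment shows how much the answer depends on variables a default installation may leave unset.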

Regards,
Martin

Martin v. Löwis · Apr 9, 2004

David said:
Well, no.

Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

'US-ASCII'

You did not follow Hye-Shik's instructions closely enough. You
failed to set the LANG environment variable before starting Python.

Regards,
Martin

David Eppstein · Apr 9, 2004

Martin v. Löwis said:
David said:

Well, no.

Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import sys;sys.stdin.encoding
'US-ASCII'

You did not follow Hye-Shik's instructions closely enough. You
failed to set the LANG environment variable before starting Python.

A system that requires me to manually set the LANG environment variable
is no better than a system that requires me to manually define an
encoding once in Python. It doesn't seem to answer your question about
automatic determination of the Terminal's encoding.

Martin v. Löwis · Apr 10, 2004

David said:
A system that requires me to manually set the LANG environment variable
is no better than a system that requires me to manually define an
encoding once in Python.

Perhaps: depending on your criteria to judge "no better".
It does demonstrate that Python can be made to determine the terminal's
encoding, without changing any kind of Python or C source code.

Regards,
Martin

Michael Hudson · Apr 10, 2004

Martin v. Löwis said:
David said:

Py2.3 sure doesn't discover the encoding of my terminal automatically:
hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC 3.3 20030304 (Apple
Computer, Inc. build 1495)] on darwin

Ah, Darwin. You lose.

If anybody can tell me how to programmatically discover the encoding
of Terminal.App, I'll happily incorporate a change into a 2.3.x release.

The user can set it per-terminal, at runtime, with no notification to
the running process that it has done so!

You can't even find out the encoding in use via Apple events, which is
a bit of a surprise. Filing a bug with Apple might get that changed for
10.4...

Cheers,
mwh

Paul Prescod · Apr 10, 2004

Nelson said:
...

So when Python can't guess the encoding, it assumes that ASCII is the
best it can do? Even as an American that annoys me; what do folks who
need non-ASCII do in practice? Martin, what do you do when you write a
Python script that prints your own name?

I guess what I'd like is a way to set Python's default encoding and
have that respected for files, terminals, etc. I'd also like some way
to override the Unicode error mode. 'strict' is the right default, but
I'd like the option to do 'ignore' or 'replace' globally.

The Python community has traditionally discouraged machine-specific
configuration. The more you depend on the machine configuration the more
likely you are to have problems when you move your program from one
computer to another.

The bug is in the third-party module that does not deal properly with
Unicode data!

...
Now that I know the trick I can do it. But part of the joy of Python
is that it makes simple things simple. For a beginner to the language
having to learn about the difference between sys.stdout and
sys.__stdout__ seems a bit much.

Agreed: the module should handle it for you.

Paul Prescod

Guest · Apr 11, 2004

Michael said:
The user can set it per-terminal, at runtime, with no notification to
the running process that it has done so!

I would find it acceptable to say "don't do that, then", here. However,
changing the encoding should be visible to new programs running in the
terminal.

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

Regards,
Martin

Skip Montanaro · Apr 11, 2004

Martin> I wonder how much would break if Python would assume the
Martin> terminal encoding is UTF-8 on Darwin. Do people use different
Martin> terminal encodings?

I generally use xterm instead of Terminal.app. I think its encoding is
latin-1.

Skip

David Eppstein · Apr 11, 2004

Martin v. Löwis said:
I would find it acceptable to say "don't do that, then", here. However,
changing the encoding should be visible to new programs running in the
terminal.

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

Now I'm curious -- how do you even find out it's a Terminal window
you're looking at, rather than say an xterm?

David Eppstein · Apr 11, 2004

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

Now I'm curious -- how do you even find out it's a Terminal window
you're looking at, rather than say an xterm?

Never mind, I just did a printenv and saw
TERM_PROGRAM=Apple_Terminal

But I also saw
__CF_USER_TEXT_ENCODING=0x1F5:0:0
....I wonder who's putting that there?
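
The TERM_PROGRAM check can be written as a small helper. This is a heuristic sketch in Python 3 syntax; is_apple_terminal is a made-up name, and since environment variables are inherited, a program started from inside Terminal may see the same value even if it runs somewhere else.

```python
import os


def is_apple_terminal(environ=None):
    """Guess whether the process was started from Apple's Terminal.app."""
    if environ is None:
        environ = os.environ
    return environ.get("TERM_PROGRAM") == "Apple_Terminal"


print(is_apple_terminal())
```

Passing a mapping explicitly, for example is_apple_terminal({"TERM_PROGRAM": "Apple_Terminal"}), makes the check easy to try on any platform.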

Michael Hudson · Apr 12, 2004

Martin v. Löwis said:
I would find it acceptable to say "don't do that, then", here. However,
changing the encoding should be visible to new programs running in the
terminal.

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

How long is a piece of string? *I* don't change the encoding very often.

You could consider the output of 'defaults read com.apple.Terminal
StringEncoding' or equivalent API calls but it's fairly opaque (it's
'4' on my machine).
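
One possible reading of that '4', offered as an assumption rather than something the thread confirms: if the preference stores an NSStringEncoding constant from Apple's Foundation framework, then 4 is NSUTF8StringEncoding. A sketch of the lookup in Python 3 syntax, with an illustrative function name and only a handful of constants:

```python
# A few NSStringEncoding constants from Apple's Foundation framework,
# mapped to Python codec names. ASSUMPTION: Terminal's StringEncoding
# preference stores one of these values; the thread does not confirm it.
NS_STRING_ENCODINGS = {
    1: "ascii",       # NSASCIIStringEncoding
    4: "utf-8",       # NSUTF8StringEncoding
    5: "latin-1",     # NSISOLatin1StringEncoding
    30: "mac_roman",  # NSMacOSRomanStringEncoding
}


def codec_for_string_encoding(raw_value):
    """Translate the text printed by `defaults read` into a codec name, or None."""
    try:
        number = int(str(raw_value).strip())
    except ValueError:
        return None
    return NS_STRING_ENCODINGS.get(number)


print(codec_for_string_encoding("4"))
```

Even if the mapping holds, it describes the saved default only; a per-window change made at runtime would still be invisible to it.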

I don't think $__CF_USER_TEXT_ENCODING has anything to do with
terminals (CoreFoundation, more likely).

Cheers,
mwh
