print u"\u0432": why is this so hard? UnicodeEncodeError

Nelson Minar

I have a simple goal. I want the following Python program to work:
print u"\u0432"

This program fails on my US Debian machine:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)

Actually, I have a complex goal: I want my SOAPpy program to work when
SOAPpy is in debug mode and is printing XML messages out to stdout.
Solving the simple problem will solve the complex one. Since I'm using
third party code, I can't go modify every print statement to call
encode() explicitly.


The simplest solution I've come up with is this:
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"'

That seems to work reasonably well in Python 2.3 (but not 2.2!). But
then for some obscure reason if I redirect stdout in my shell it fails.
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

Why is that?


The only solution I've found that really works is reassigning
sys.stdout at the top of the script. That's an awful lot of work, but
it's the best I can do for now.
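
For readers who want to see what that reassignment does, here is a minimal sketch. It is written for a current Python 3 interpreter rather than the 2.3 discussed in this thread, and it uses an in-memory byte stream as a stand-in for a redirected stdout (both are assumptions made for illustration only).

```python
import codecs
import io

# Stand-in for a redirected stdout: a raw byte stream with no known encoding.
raw = io.BytesIO()

# codecs.getwriter("utf-8") returns a StreamWriter class. Calling it on a
# byte stream gives a file-like object that encodes text on every write().
out = codecs.getwriter("utf-8")(raw)
out.write(u"\u0432\n")

# The Cyrillic letter arrives as its two-byte UTF-8 form.
encoded = raw.getvalue()
print(repr(encoded))  # b'\xd0\xb2\n'
```

The one-liners in the test script below apply the same wrapper to sys.__stdout__, so every later print goes through the UTF-8 encoder whether or not stdout is a terminal.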

Why is Python not respecting my locale?


Here's my test program:

----------------------------------------------------------------------

#!/bin/bash -x

# Obliterate locale
for e in LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL; do
unset $e
done

# Doing the obvious thing has nonobvious effects
python2.3 -c 'print u"\u0432"' # fails, OK.
LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"' # works!
LC_ALL=en_US.utf8 python2.3 -c 'print u"\u0432"' > /dev/null # fails, huh?

# These both work, but what a pain!
python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"' > /dev/null

----------------------------------------------------------------------

And sample output:

----------------------------------------------------------------------

~/src/python/testUnicode.sh
+ unset LANG
+ unset LC_CTYPE
+ unset LC_NUMERIC
+ unset LC_TIME
+ unset LC_COLLATE
+ unset LC_MONETARY
+ unset LC_MESSAGES
+ unset LC_PAPER
+ unset LC_NAME
+ unset LC_ADDRESS
+ unset LC_TELEPHONE
+ unset LC_MEASUREMENT
+ unset LC_IDENTIFICATION
+ unset LC_ALL
+ python2.3 -c 'print u"\u0432"'
Traceback (most recent call last):
File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
+ LC_ALL=en_US.utf8
+ python2.3 -c 'print u"\u0432"'
в
+ LC_ALL=en_US.utf8
+ python2.3 -c 'print u"\u0432"'
Traceback (most recent call last):
File "<string>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)
+ python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
в
+ python2.3 -c 'import sys, codecs; sys.stdout = codecs.getwriter("utf-8")(sys.__stdout__); print u"\u0432"'
 
Jon Willeke

Nelson said:
I have a simple goal. I want the following Python program to work:
print u"\u0432"

This program fails on my US Debian machine:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0432' in position 0: ordinal not in range(128)

Try playing with sys.setdefaultencoding(). You'll need to reload( sys )
to call it. If it solves your problem, you can make it permanent by
modifying site.py.
 
Martin v. Löwis

Nelson said:
I have a simple goal. I want the following Python program to work:
print u"\u0432"

As you have discovered, this is not so simple. Printing this character
might not be possible at all: If you have a terminal that just cannot
display CYRILLIC SMALL LETTER VE, then there is absolutely no way to
print the character - unless you change the terminal you use.

Actually, I have a complex goal: I want my SOAPpy program to work when
SOAPpy is in debug mode and is printing XML messages out to stdout.
Solving the simple problem will solve the complex one. Since I'm using
third party code, I can't go modify every print statement to call
encode() explicitly.

This shows the real source of the problem. SOAPpy should not print the
strings, but repr them. For debugging, repr is more reliable than str,
as it can render virtually every object.
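
A small sketch of that suggestion, for anyone following along on a modern interpreter. In the Python 2 of this thread, repr() of a unicode object already yields pure-ASCII escapes. On Python 3, repr() keeps printable non-ASCII characters, so ascii() is the closer counterpart and is what this example uses (that substitution is mine, not part of the original advice).

```python
text = u"\u0432"

# ascii() escapes every non-ASCII character, so the result can be written
# to any stream without raising UnicodeEncodeError.
debug_form = ascii(text)
print(debug_form)  # '\u0432'

# Encoding the debug form as ASCII succeeds where encoding the text would not.
debug_bytes = debug_form.encode("ascii")
```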
That seems to work reasonably well in Python 2.3 (but not 2.2!). But
then for some obscure reason if I redirect stdout in my shell it fails.
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

Why is that?

Python 2.3 discovers the encoding of your terminal, and will display
Unicode characters if the terminal supports them. Python 2.2 did not do
that, and the new feature is mainly useful in interactive mode.

When you redirect the output to a file, it is not a terminal anymore,
and Python cannot guess the encoding.
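
The terminal-versus-file distinction can be sketched roughly as follows. The helper name pick_output_encoding, the FakeTerminal class and the "utf-8" fallback are all hypothetical choices for illustration. The interpreter's own logic is more involved, and the example is written for Python 3.

```python
import io


def pick_output_encoding(stream, fallback="utf-8"):
    """Use the stream's own encoding only when it is a terminal.

    For anything else (a file, a pipe, /dev/null) the encoding cannot be
    discovered, so the caller-supplied fallback is returned instead.
    """
    is_terminal = hasattr(stream, "isatty") and stream.isatty()
    encoding = getattr(stream, "encoding", None)
    if is_terminal and encoding:
        return encoding
    return fallback


class FakeTerminal(object):
    """Hypothetical stand-in for a terminal that reports KOI8-R."""
    encoding = "koi8-r"

    def isatty(self):
        return True


print(pick_output_encoding(FakeTerminal()))  # koi8-r
print(pick_output_encoding(io.BytesIO()))    # utf-8
```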
The only solution I've found that really works is reassigning
sys.stdout at the top of the script. That's an awful lot of work, but
it's the best I can do for now.

Why is Python not respecting my locale?

It is: however, your locale only tells Python the encoding of your
terminal, not the encoding of an arbitrary file you may write to.

Assigning sys.stdout is the right thing to do. I'm uncertain why
that could be an awful lot of work, as you do this only once...

Regards,
Martin
 
David Eppstein

That seems to work reasonably well in Python 2.3 (but not 2.2!). But
then for some obscure reason if I redirect stdout in my shell it fails.
$ LANG=en_US.UTF-8 python2.3 -c 'print u"\u0432"' > /dev/null

Why is that?

Python 2.3 discovers the encoding of your terminal, and will display
Unicode characters if the terminal supports them. Python 2.2 did not do
that, and the new feature is mainly useful in interactive mode.

Py2.3 sure doesn't discover the encoding of my terminal automatically:

hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\u432' in
position 0: ordinal not in range(128)

(that 2 was something else looking like an angular capital B before I
copied and pasted it into my newsreader...and yes, utf8 is the correct
encoding.)
 
Nelson Minar

Thanks for your answers.

Martin v. Löwis said:
As you have discovered, this is not so simple. Printing this character
might not be possible at all

I know. I'm just trying to figure out an expedient way to get Python
to make a best effort.

When you redirect the output to a file, it is not a terminal anymore,
and Python cannot guess the encoding.

So when Python can't guess the encoding, it assumes that ASCII is the
best it can do? Even as an American that annoys me; what do folks who
need non-ASCII do in practice? Martin, what do you do when you write a
Python script that prints your own name?

I guess what I'd like is a way to set Python's default encoding and
have that respected for files, terminals, etc. I'd also like some way
to override the Unicode error mode. 'strict' is the right default, but
I'd like the option to do 'ignore' or 'replace' globally.
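
For reference, the error modes mentioned here can already be chosen per call, even though nothing in this thread offers a global switch. A short sketch (Python 3 syntax):

```python
ve = u"\u0432"

# 'strict' is the default: an unencodable character raises an exception.
strict_failed = False
try:
    ve.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes a question mark, 'ignore' drops the character,
# and 'backslashreplace' keeps it recoverable as an escape sequence.
replaced = ve.encode("ascii", "replace")
ignored = ve.encode("ascii", "ignore")
escaped = ve.encode("ascii", "backslashreplace")

print(strict_failed, replaced, ignored, escaped)
```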
Assigning sys.stdout is the right thing to do. I'm uncertain why
that could be an awful lot of work, as you do this only once...

Now that I know the trick I can do it. But part of the joy of Python
is that it makes simple things simple. For a beginner to the language
having to learn about the difference between sys.stdout and
sys.__stdout__ seems a bit much.
 
Martin v. Löwis

David said:
Py2.3 sure doesn't discover the encoding of my terminal automatically:

hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin

Ah, Darwin. You lose.

If anybody can tell me how to programmatically discover the encoding
of Terminal.App, I'll happily incorporate a change into a 2.3.x release.

Regards,
Martin
 
Martin v. Löwis

Nelson said:
So when Python can't guess the encoding, it assumes that ASCII is the
best it can do? Even as an American that annoys me;

In general, it uses the default encoding which, by default, is ASCII.

This has been chosen after long discussions, which discovered that any
other guess is wrong under likely circumstances (the specific
circumstances depending on what the guess is). If the guess is wrong,
you end up with mojibake (nonsense characters), which are very hard
to track back to their source.

In the face of ambiguity, refuse the temptation to guess.

ASCII is the only guess that has no significant risk of ambiguity:
if something encodes successfully as ASCII, it would encode to the
very same byte sequence in nearly any other encoding.
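
Both halves of that argument are easy to check. The sketch below (Python 3 syntax, with an arbitrary sample of encodings chosen for illustration) first shows an ASCII-only string encoding to identical bytes everywhere, then shows what a wrong guess does to the letter from the original post. Note the "nearly": UTF-16, for one, would not produce the same bytes.

```python
plain = u"print"
encodings = ["ascii", "utf-8", "latin-1", "koi8-r", "cp1251", "euc-kr"]

# Every ASCII-compatible encoding yields the same bytes for ASCII text,
# so the set collapses to a single element.
ascii_results = set(plain.encode(name) for name in encodings)
print(ascii_results)

# A wrong guess: UTF-8 bytes for CYRILLIC SMALL LETTER VE read as Latin-1
# silently become two unrelated characters (mojibake) instead of failing.
ve = u"\u0432"
garbled = ve.encode("utf-8").decode("latin-1")
print(ascii(garbled))  # '\xd0\xb2'
```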
what do folks who
need non-ASCII do in practice? Martin, what do you do when you write a
Python script that prints your own name?

It depends. If I print to the terminal, I use Unicode. If I print to
XML, I use Unicode, and expect that the XML writer will pick some
encoding, using XML character references if the o-umlaut cannot be
encoded. If I print to HTML, I make sure an explicit META tag has
been added to denote the document as Latin-1, or I use &ouml;.
If I print to a log file, I explicitly use Latin-1, unless I know
that the encoding of that log file is meant to be UTF-8. And so on.
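
As a rough illustration of those cases, here is the same name encoded three ways (Python 3 syntax; the variable names are only for this sketch):

```python
name = u"L\u00f6wis"

# Log file known to be Latin-1: the o-umlaut is a single byte.
as_latin1 = name.encode("latin-1")

# Log file known to be UTF-8: the o-umlaut takes two bytes.
as_utf8 = name.encode("utf-8")

# ASCII-only XML or HTML output: fall back to a numeric character reference.
as_xml_ascii = name.encode("ascii", "xmlcharrefreplace")

print(as_latin1, as_utf8, as_xml_ascii)
```

The point of the passage stands: the bytes differ per destination, so the program has to know where the text is going.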

It is not that Python is making that complicated, it is complicated
by nature - until everybody switches to UTF-8, which may take another
20 years or so.

I guess what I'd like is a way to set Python's default encoding and
have that respected for files, terminals, etc. I'd also like some way
to override the Unicode error mode. 'strict' is the right default, but
I'd like the option to do 'ignore' or 'replace' globally.

Submit a patch that does that. I very much prefer to fix errors instead
of ignoring them.

Regards,
Martin
 
Hye-Shik Chang

David said:
Py2.3 sure doesn't discover the encoding of my terminal automatically:

hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin

Ah, Darwin. You lose.

If anybody can tell me how to programmatically discover the encoding
of Terminal.App, I'll happily incorporate a change into a 2.3.x release.

The encoding of darwin terminal can be discovered by the routine
currently we have.

perky$ LC_ALL=ko_KR.UTF-8 python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
perky$ LC_ALL=ko_KR.eucKR python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
'eucKR'
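
To make it clearer where the 'eucKR' comes from, here is a simplified sketch of how a codeset can be read out of the locale environment variables. The helper locale_codeset is hypothetical. The interpreter asks the C library instead of parsing strings itself, so treat this only as an approximation of the POSIX lookup order (LC_ALL, then LC_CTYPE, then LANG).

```python
def locale_codeset(environ):
    """Return the codeset part of the effective locale name, or None.

    Locale names look like language_TERRITORY.codeset@modifier. The first
    non-empty variable in POSIX precedence order decides the result.
    """
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(variable)
        if value:
            value = value.split("@", 1)[0]  # drop any @modifier
            if "." in value:
                return value.split(".", 1)[1]
            return None  # e.g. "C" or "en_US": no codeset given
    return None


print(locale_codeset({"LC_ALL": "ko_KR.eucKR"}))  # eucKR
print(locale_codeset({"LANG": "en_US.UTF-8"}))    # UTF-8
print(locale_codeset({}))                         # None
```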


Regards,
Hye-Shik
 
David Eppstein

Hye-Shik Chang said:
The encoding of darwin terminal can be discovered by the routine
currently we have.

perky$ LC_ALL=ko_KR.UTF-8 python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
perky$ LC_ALL=ko_KR.eucKR python
Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
'eucKR'

Well, no.

Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
'US-ASCII'

But, in fact, the encoding of my Terminal.App is utf8 (the usual
default), as set in the Terminal Inspector (Terminal->Window
Settings...) Display pane, Character Set Encoding menu.

Possibly you can find Terminal's preferred encoding in
Library/Preferences/com.apple.Terminal.plist (I just looked there, and
don't see it, unless it's maybe the StringEncoding:4 line) but it can be
changed from the default on a per-window basis and, like Martin, I don't
know how to find out its current setting.
 
Martin v. Löwis

Hye-Shik Chang said:
The encoding of darwin terminal can be discovered by the routine
currently we have.

perky$ LC_ALL=ko_KR.UTF-8 python

But that requires the user to set LC_ALL correctly. I'd rather
prefer if the standard installation of the system is supported,
where LC_ALL is not set. In particular, Terminal.App supports
changing its encoding through Settings, and I would like Python
to detect the current settings at startup time.

Regards,
Martin
 
Martin v. Löwis

David said:
Well, no.

Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

'US-ASCII'

You did not follow Hye-Shik's instructions closely enough. You
failed to set the LANG environment variable before starting Python.

Regards,
Martin
 
David Eppstein

Martin v. Löwis said:
David said:
Well, no.

Python 2.3 (#1, Sep 13 2003, 00:49:11)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
import sys;sys.stdin.encoding

'US-ASCII'

You did not follow Hye-Shik's instructions closely enough. You
failed to set the LANG environment variable before starting Python.

A system that requires me to manually set the LANG environment variable
is no better than a system that requires me to manually define an
encoding once in Python. It doesn't seem to answer your question about
automatic determination of the Terminal's encoding.
 
Martin v. Löwis

David said:
A system that requires me to manually set the LANG environment variable
is no better than a system that requires me to manually define an
encoding once in Python.

Perhaps: depending on your criteria to judge "no better".
It does demonstrate that Python can be made to determine the terminal's
encoding, without changing any kind of Python or C source code.

Regards,
Martin
 
Michael Hudson

Martin v. Löwis said:
David said:
Py2.3 sure doesn't discover the encoding of my terminal automatically:
hyperbolic ~: python
Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC 3.3 20030304 (Apple
Computer, Inc. build 1495)] on darwin

Ah, Darwin. You lose.

If anybody can tell me how to programmatically discover the encoding
of Terminal.App, I'll happily incorporate a change into a 2.3.x release.

The user can set it per-terminal, at runtime, with no notification to
the running process that it has done so!

You can't even find out the encoding in use via Apple events, which is
a bit of a surprise. Filing a bug with Apple might get that changed for
10.4...

Cheers,
mwh
 
Paul Prescod

Nelson said:
...

So when Python can't guess the encoding, it assumes that ASCII is the
best it can do? Even as an American that annoys me; what do folks who
need non-ASCII do in practice? Martin, what do you do when you write a
Python script that prints your own name?

I guess what I'd like is a way to set Python's default encoding and
have that respected for files, terminals, etc. I'd also like some way
to override the Unicode error mode. 'strict' is the right default, but
I'd like the option to do 'ignore' or 'replace' globally.

The Python community has traditionally discouraged machine-specific
configuration. The more you depend on the machine configuration the more
likely you are to have problems when you move your program from one
computer to another.

The bug is in the third-party module that does not deal properly with
Unicode data!
...
Now that I know the trick I can do it. But part of the joy of Python
is that it makes simple things simple. For a beginner to the language
having to learn about the difference between sys.stdout and
sys.__stdout__ seems a bit much.

Agreed: the module should handle it for you.

Paul Prescod
 
Guest

Michael said:
The user can set it per-terminal, at runtime, with no notification to
the running process that it has done so!

I would find it acceptable to say "don't do that, then", here. However,
changing the encoding should be visible to new programs running in the
terminal.

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

Regards,
Martin
 
Skip Montanaro

Martin> I wonder how much would break if Python would assume the
Martin> terminal encoding is UTF-8 on Darwin. Do people use different
Martin> terminal encodings?

I generally use xterm instead of Terminal.app. I think its encoding is
latin-1.

Skip
 
David Eppstein

Martin v. Löwis said:
I would find it acceptable to say "don't do that, then", here. However,
changing the encoding should be visible to new programs running in the
terminal.

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

Now I'm curious -- how do you even find out it's a Terminal window
you're looking at, rather than say an xterm?
 
David Eppstein

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

Now I'm curious -- how do you even find out it's a Terminal window
you're looking at, rather than say an xterm?

Never mind, I just did a printenv and saw
TERM_PROGRAM=Apple_Terminal

But I also saw
__CF_USER_TEXT_ENCODING=0x1F5:0:0
....I wonder who's putting that there?
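
A check along those lines could look roughly like this. The function name is hypothetical, and the only fact relied on is the TERM_PROGRAM=Apple_Terminal value seen in the printenv output above. Whether that variable is always present is not something this thread establishes.

```python
import os


def running_in_apple_terminal(environ=None):
    """Guess whether the process was started from Terminal.app.

    Relies solely on the TERM_PROGRAM variable. An xterm session would
    not be expected to carry this value.
    """
    if environ is None:
        environ = os.environ
    return environ.get("TERM_PROGRAM") == "Apple_Terminal"


print(running_in_apple_terminal({"TERM_PROGRAM": "Apple_Terminal"}))  # True
print(running_in_apple_terminal({"TERM": "xterm"}))                   # False
```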
 
Michael Hudson

Martin v. Löwis said:
I would find it acceptable to say "don't do that, then", here. However,
changing the encoding should be visible to new programs running in the
terminal.

I wonder how much would break if Python would assume the terminal
encoding is UTF-8 on Darwin. Do people use different terminal
encodings?

How long is a piece of string? *I* don't change the encoding very often.

You could consider the output of 'defaults read com.apple.Terminal
StringEncoding' or equivalent API calls but it's fairly opaque (it's
'4' on my machine).
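
One possible reading of that '4', offered as an assumption rather than something confirmed in this thread: if StringEncoding holds an NSStringEncoding constant, then 4 is NSUTF8StringEncoding, which would match the UTF-8 default mentioned earlier. A sketch of a lookup under that assumption follows. The table is deliberately tiny, and the helper name is made up for illustration.

```python
# ASSUMPTION: the preference value is an NSStringEncoding number.
# Only a few well-known constants are listed here.
NSSTRING_ENCODING_TO_CODEC = {
    1: "ascii",    # NSASCIIStringEncoding
    4: "utf-8",    # NSUTF8StringEncoding
    5: "latin-1",  # NSISOLatin1StringEncoding
}


def codec_for_string_encoding(raw_value):
    """Map the text printed by `defaults read` to a Python codec name.

    Returns None for values that are not numbers or not in the table.
    """
    try:
        number = int(str(raw_value).strip())
    except ValueError:
        return None
    return NSSTRING_ENCODING_TO_CODEC.get(number)


print(codec_for_string_encoding("4\n"))  # utf-8
```

Even if the reading is right, this only recovers the saved default. As noted above, the setting can be changed per window without the running process being told.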

I don't think $__CF_USER_TEXT_ENCODING has anything to do with
terminals (CoreFoundation, more likely).

Cheers,
mwh
 
