If you are running on a Unix-y system, check your locale settings (LANG,
LC_*, et al.). I think you'll likely find that your locale is not really
UTF-8. The following was run on Python 3.1 on OS X 10.5, with similar
results on Debian Linux:
$ cat t3.py
import sys
print(sys.stdout.encoding)
s = "€"
print(s.encode("utf-8"))
print(s)
$ export LANG=en_US.UTF-8
$ python3.1 t3.py
UTF-8
b'\xe2\x82\xac'
€
$ export LANG=C
$ python3.1 t3.py
US-ASCII
b'\xe2\x82\xac'
Traceback (most recent call last):
  File "t3.py", line 7, in <module>
    print(s)
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)
Hi,
Thanks for the response. My OS is Mac OS X 10.4.11. I'm not really
sure how to check my locale settings. Here is some stuff I tried:
$ echo $LANG
$ echo $LC_ALL
$ echo $LC_CTYPE
$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"
$ man locale
....
....
....
ENVIRONMENT:
LANG
Used as a substitute for any unset LC_* variable. If LANG is unset it
will act as if set to "C". If any of LANG or LC_* are set to invalid
values locale acts as if they are all unset.
===========
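For what it's worth, the lookup rule in that man page excerpt can be
sketched in a few lines of Python. This is only an illustration, not
how locale(1) is actually implemented: effective_ctype is a made-up
helper name, the "LC_ALL wins over everything" part comes from POSIX
rather than from the excerpt above, and the sketch ignores the
"invalid values" case entirely.

```python
import os


def effective_ctype(env):
    """Return the locale name that would apply to LC_CTYPE.

    Assumed precedence: LC_ALL, then LC_CTYPE, then LANG, then "C".
    Unset and empty variables are skipped.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# With nothing set, the fallback is the "C" locale.
print(effective_ctype({}))
# LANG alone fills in for the unset LC_* variables.
print(effective_ctype({"LANG": "en_US.UTF-8"}))
# Whatever the current shell environment resolves to.
print(effective_ctype(os.environ))
```

With all three variables unset, as in my shell, the helper falls
through to "C", which matches what the locale command reports.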
As in your last example, my 'C' settings mean that an ascii codec is
used somewhere to encode() the unicode string.
--
The locale C or POSIX is a portable locale; its LC_CTYPE part
corresponds to the 7-bit ASCII character set.
http://linux.about.com/library/cmd/blcmdl3_setlocale.htm
--
Is this the way it works:
1) Python picks the codec for sys.stdout based on the LANG environment
variable.
2) It doesn't matter that my terminal's encoding is set to utf-8
because output has to pass through sys.stdout first.
So:
a) My terminal's environment is telling Python (and all other programs
running in the terminal) that output sent to sys.stdout must be
encoded in ascii.
b) The solution is to set a LANG environment variable.
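Point 2) can be tried out without involving a terminal at all. Here is
a minimal sketch, assuming that sys.stdout behaves like an
io.TextIOWrapper wrapped around a byte stream. write_through is a
hypothetical helper I made up for the test, not something Python
provides.

```python
import io


def write_through(text, encoding):
    """Write text to an in-memory text stream that uses the given codec.

    Returns the bytes that would be handed on to the terminal.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


euro = "\u20ac"  # the euro sign

# A utf-8 stream turns the character into three bytes.
print(write_through(euro, "utf-8"))

# An ascii stream refuses before any bytes reach the terminal.
try:
    write_through(euro, "ascii")
except UnicodeEncodeError as exc:
    print("ascii stream failed:", exc.reason)
```

The terminal's own setting never enters into it: the error is raised
by the stream's codec, before any bytes are produced.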
Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?
Previously, I've set environment variables that I want to be
permanent, e.g. PATH, in ~/.bash_profile, so I did this:
~/.bash_profile:
--------------
....
....
LANG="en_US.UTF-8"
export LANG
and now python 3.1 acts like I expect it to:
-------
import locale
import sys
print(locale.getlocale(locale.LC_CTYPE))
print(sys.stdout.encoding)
s = "€"
print(s)
print(s.encode("utf-8"))
--output:--
('en_US', 'UTF8')
UTF-8
€
b'\xe2\x82\xac'
----------
In conclusion, as far as I can tell, if your Python 3.1 program tries
to output a unicode string, and the unicode string cannot be encoded
by the codec specified in the user's LANG environment variable**, then
the user will get an encode error. Just because the programmer's
system can handle the output doesn't mean that another user's system
can. I guess that's the way it goes: if a user's environment is
telling all programs that it only wants ascii output to go to the
screen (sys.stdout), you can't (or shouldn't) do anything about it.
**Or if the LANG environment variable is not present, then the codec
corresponding to the locale settings ('C' corresponds to ascii).
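That said, a program that would rather degrade than crash could escape
whatever the user's codec cannot handle before printing. A rough
sketch, not a recommendation: printable is a hypothetical helper name,
and it relies on the standard "backslashreplace" error handler of
str.encode().

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the codec cannot encode escaped."""
    # Fall back to ascii if the stream does not report an encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    data = text.encode(encoding, errors="backslashreplace")
    return data.decode(encoding)


message = "price: 5 \u20ac"  # ends with the euro sign

# Safe for whatever codec sys.stdout is using right now.
print(printable(message))

# Forcing ascii shows what a user in the C locale would see.
print(printable(message, "ascii"))
```

Under a UTF-8 locale the string passes through unchanged. Under the C
locale the euro sign comes out as the escape \u20ac instead of a
traceback.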
Some good locale info:
http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html