print() and unicode strings (python 3.1)

7stud

======python 2.6 ======
import sys

print sys.getdefaultencoding()

s = u"\u20ac"
print s.encode("utf-8")


$ python2.6 1test.py
ascii
€


=====python 3.1 =======
import sys

print(sys.getdefaultencoding())

s = "€"
print(s.encode("utf-8"))
print(s)


$ python3.1 1test.py
utf-8
b'\xe2\x82\xac'

Traceback (most recent call last):
File "1test.py", line 7, in <module>
print(s)
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
position 0: ordinal not in range(128)


I don't understand why I'm getting an encode error in python 3.1.
 
Martin v. Löwis

I don't understand why I'm getting an encode error in python 3.1.

The default encoding is not relevant here at all. Look at
sys.stdout.encoding.

Regards,
Martin
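A minimal sketch of the distinction being drawn here, assuming Python 3. The helper name `describe_encodings` is hypothetical and not from the thread; it only collects the values that the interpreter already reports.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python reports, keyed by where they apply."""
    return {
        # Default for str.encode() / bytes.decode() called with no argument;
        # this is always 'utf-8' in Python 3.
        "default": sys.getdefaultencoding(),
        # Used when print() writes text to standard output.
        # May be None if stdout was replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        # What the locale settings (LC_ALL, LC_CTYPE, LANG) suggest.
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, "->", value)
```

Only the "stdout" entry decides whether print() can emit a given character; the "default" entry stays the same regardless of the terminal or locale.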
 
7stud

The default encoding is not relevant here at all. Look at
sys.stdout.encoding.

Regards,
Martin

Hi,

Thanks for the response. I get US-ASCII for both 2.6 and 3.1:

===python 3.1======
import sys

print(sys.stdout.encoding)


$ python3.1 1test.py
US-ASCII

I can't figure out a way to programmatically set the encoding for
sys.stdout. So where does that leave me? python 3.1 won't let me
explicitly encode my unicode string, and python 3.1 implicitly does
the encoding with the wrong codec. And why would any programmer rely
on python 3.1's implicit encoding of unicode strings anyway?
Presumably, different systems will have different encodings for
sys.stdout, some encodings might cause encode errors.
 
Stefan Behnel

7stud said:
python 3.1 won't let me
explicitly encode my unicode string

Sure it does. But encoding a non-ASCII string to ASCII will necessarily fail.

and python 3.1 implicitly does
the encoding with the wrong codec.

That's not a Python problem, though. Your terminal is configured for
US-ASCII, so you can't output anything but US-ASCII characters.

Change your terminal setup to e.g. UTF-8 and see how things start working.

Stefan
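To make the point about ASCII concrete, here is a small sketch, assuming Python 3. The variable names are illustrative only. It shows that a strict encode to ASCII fails for the euro sign, and what the standard error handlers produce instead.

```python
euro = "\u20ac"  # the euro sign

# UTF-8 can represent the character, so this succeeds.
utf8_bytes = euro.encode("utf-8")

# ASCII cannot, so a strict encode raises UnicodeEncodeError.
try:
    euro.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# An error handler trades fidelity for robustness.
replaced = euro.encode("ascii", errors="replace")
escaped = euro.encode("ascii", errors="backslashreplace")

print(utf8_bytes, strict_failed, replaced, escaped)
```

This is the same failure that print() hits when sys.stdout.encoding is US-ASCII: the string itself is fine, but the target codec has no byte sequence for it.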
 
7stud

Sure it does. But encoding a non-ASCII string to ASCII will necessarily fail.

As you should be able to see in the python 3.1 example I posted, I did
not encode the string using the ascii codec. I encoded it with the
utf-8 codec, and unfortunately in python 3.1 that creates a "bytes
string", and print()'ing a bytes string does not produce human
readable text.
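A short sketch of why print() on the encoded value is not readable, assuming Python 3. The helper `write_raw` is a hypothetical name; it relies on sys.stdout having a `buffer` attribute, which holds for the standard streams but not for every replacement object.

```python
import sys

euro = "\u20ac"
data = euro.encode("utf-8")

# print() shows a bytes object's repr, not the text those bytes encode.
shown = repr(data)

# Decoding with the same codec recovers the original string.
roundtrip = data.decode("utf-8")


def write_raw(payload):
    """Write already-encoded bytes to stdout, bypassing the text layer."""
    sys.stdout.flush()
    sys.stdout.buffer.write(payload)
    sys.stdout.buffer.flush()
```

If the terminal really does interpret its input as UTF-8, calling `write_raw(data + b"\n")` should display the euro sign even when sys.stdout.encoding is US-ASCII, because no implicit encoding step is involved.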

That's not a Python problem, though. Your terminal is configured for
US-ASCII, so you can't output anything but US-ASCII characters.

My terminal is configured for utf-8, and from the output of the python
2.6 example I posted, it should be apparent that my terminal is
capable of rendering the euro character.
 
Martin v. Löwis

I can't figure out a way to programmatically set the encoding for
sys.stdout. So where does that leave me?

You should be setting the terminal encoding administratively, not
programmatically.

Regards,
Martin
 
7stud

You should be setting the terminal encoding administratively, not
programmatically.

The terminal encoding has always been utf-8. It was not set
programmatically.

It seems to me that python 3.1's string handling is broken.
Apparently, in python 3.1 I am unable to explicitly set the encoding
of a string and print() it out with the result being human readable
text. On the other hand, if I let python do the encoding implicitly,
python uses a codec I don't want it to.
 
Ned Deily

7stud said:
The terminal encoding has always been utf-8. It was not set
programmatically.

It seems to me that python 3.1's string handling is broken.
Apparently, in python 3.1 I am unable to explicitly set the encoding
of a string and print() it out with the result being human readable
text. On the other hand, if I let python do the encoding implicitly,
python uses a codec I don't want it to.

If you are running on a Unix-y system, check your locale settings (LANG,
LC_*, et al.). I think you'll likely find that your locale is really not
UTF-8. The following was on Python 3.1 on OS X 10.5, similar results
on Debian Linux:

$ cat t3.py
import sys
print(sys.stdout.encoding)
s = "€"
print(s.encode("utf-8"))
print(s)

$ export LANG=en_US.UTF-8
$ python3.1 t3.py
UTF-8
b'\xe2\x82\xac'
€

$ export LANG=C
$ python3.1 t3.py
US-ASCII
b'\xe2\x82\xac'
Traceback (most recent call last):
File "t3.py", line 7, in <module>
print(s)
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in
position 0: ordinal not in range(128)
 
7stud

Hi,

Thanks for the response. My OS is Mac OS X 10.4.11. I'm not really
sure how to check my locale settings. Here is some stuff I tried:

$ echo $LANG

$ echo $LC_ALL

$ echo $LC_CTYPE

$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

$ man locale
....
....
....

ENVIRONMENT:
LANG
Used as a substitute for any unset LC_* variable. If LANG is unset it
will act as if set to "C". If any of LANG or LC_* are set to invalid
values, locale acts as if they are all unset.

===========

As in your last example, my 'C' settings mean that an ascii codec is
used somewhere to encode() the unicode string.

--
The locale C or POSIX is a portable locale; its LC_CTYPE part
corresponds to the 7-bit ASCII character set.

http://linux.about.com/library/cmd/blcmdl3_setlocale.htm
--


Is this the way it works:


1) python sets the codec for sys.stdout to the LANG environment
variable.
2) It doesn't matter that my terminal's encoding is set to utf-8
because output has to pass through sys.stdout first.

So:

a) My terminal's environment is telling python (and all other programs
running in the terminal) that output sent to sys.stdout must be
encoded in ascii.
b) The solution is to set a LANG environment variable.


Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?


Previously, I've set environment variables that I want to be
permanent, e.g. PATH, in ~/.bash_profile, so I did this:

~/.bash_profile:
--------------
....
....
LANG="en_US.UTF-8"
export LANG

and now python 3.1 acts like I expect it to:

-------
import locale
import sys

print(locale.getlocale(locale.LC_CTYPE))
print(sys.stdout.encoding)


s = "€"
print(s)

print(s.encode("utf-8"))

--output:--
('en_US', 'UTF8')
UTF-8
€
b'\xe2\x82\xac'
----------

In conclusion, as far as I can tell, if your python 3.1 program tries
to output a unicode string, and the unicode string cannot be encoded
by the codec specified in the user's LANG environment variable**, then
the user will get an encode error. Just because the programmer's
system can handle the output doesn't mean that another user's system
can. I guess that's the way it goes: if a user's environment is
telling all programs that it only wants ascii output to go to the
screen (sys.stdout), you can't (or shouldn't) do anything about it.

**Or if the LANG environment variable is not present, then the codec
corresponding to the locale settings ('C' corresponds to ascii).

some good locale info:
http://www.chemie.fu-berlin.de/chemnet/use/info/libc/libc_19.html
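One way a program can cope with a user whose environment only accepts ASCII is to degrade the output instead of raising. The following is a sketch under that assumption, for Python 3. The names `encodable`, `safe_text` and `safe_print` are hypothetical helpers, not part of any library.

```python
import sys


def encodable(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def safe_text(text, encoding):
    """Return text unchanged if encodable, else with escapes for the rest."""
    if encodable(text, encoding):
        return text
    # Round-trip through the target codec, escaping what it cannot represent.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


def safe_print(text):
    """print() that falls back to escape sequences instead of raising."""
    # Assumption: treat a missing or empty stdout encoding as ASCII.
    encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    print(safe_text(text, encoding))


if __name__ == "__main__":
    safe_print("price: \u20ac")
```

Whether escaping is acceptable depends on the program; the alternative, as the post says, is to respect the environment and let the error surface.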
 
Nobody

Why does echoing $LC_ALL or $LC_CTYPE just give me a blank string?

Because the variables aren't set.

The default locale for a particular category (e.g. LC_CTYPE) is taken from
$LC_ALL if that is set, otherwise $LC_CTYPE, otherwise $LANG, otherwise
"C" is used.
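The lookup order just described can be sketched as a small function, assuming Python 3. The name `effective_ctype` is hypothetical, and the sketch only mirrors the precedence rule; it does not check whether the resulting locale is actually installed.

```python
import os


def effective_ctype(env=None):
    """Return the locale name that governs LC_CTYPE for the given environment."""
    if env is None:
        env = os.environ
    # Highest priority first; an empty value counts as unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    print(effective_ctype())
```

Passing a plain dict makes it easy to try combinations, for example `effective_ctype({"LANG": "en_US.UTF-8", "LC_ALL": "C"})`, where LC_ALL wins.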

Normally, you would either set LANG (and possibly some individual LC_*
variables), or LC_ALL. There's no point in setting all of them.

In conclusion, as far as I can tell, if your python 3.1 program tries
to output a unicode string, and the unicode string cannot be encoded
by the codec specified in the user's LANG environment variable**, then
the user will get an encode error. Just because the programmer's
system can handle the output doesn't mean that another user's system
can. I guess that's the way it goes: if a user's environment is
telling all programs that it only wants ascii output to go to the
screen (sys.stdout), you can't (or shouldn't) do anything about it.

**Or if the LANG environment variable is not present, then the codec
corresponding to the locale settings ('C' corresponds to ascii).

The underlying OS primitive can only handle bytes. If you read or write a
(unicode) string, Python needs to know which encoding is used. For Python
file objects created by the user (via open() etc), you can specify the
encoding; for those created by the runtime (e.g. sys.stdin), Python uses
the locale's LC_CTYPE category to select an encoding.

Data written to or read from text streams is encoded or decoded using the
stream's encoding. Filenames are encoded and decoded using the
filesystem encoding (sys.getfilesystemencoding()). Anything else uses the
default encoding (sys.getdefaultencoding()).
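For file objects you create yourself, the encoding can be passed to open() directly, so the locale does not matter. A minimal sketch, assuming Python 3 and using a temporary file so that no real data is touched:

```python
import os
import tempfile

euro = "\u20ac"

# Create a scratch file so the example does not touch real data.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Text mode: the encoding argument decides which bytes reach the disk.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(euro)

    # Binary mode: read back the raw bytes that were written.
    with open(path, "rb") as handle:
        raw = handle.read()

    # Text mode again: decoding with the same codec recovers the string.
    with open(path, "r", encoding="utf-8") as handle:
        text = handle.read()
finally:
    os.remove(path)

print(raw, text == euro)
```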

In Python 3, text streams are handled using io.TextIOWrapper:

http://docs.python.org/3.1/library/io.html#text-i-o

This implements a stream which can read and/or write text data on top of
one which can read and/or write binary data. The sys.std{in,out,err}
streams are instances of TextIOWrapper. You can get the underlying
binary stream from the "buffer" attribute, e.g.:

sys.stdout.buffer.write(b'hello world\n')

If you need to force a specific encoding (e.g. if the user has specified
an encoding via a command-line option), you can detach the existing
wrapper and create a new one, e.g.:

sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding = new_encoding)
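The wrapper can be tried out without touching the real sys.stdout by layering it over an in-memory binary stream. This is a sketch assuming Python 3; `encode_via_wrapper` is a hypothetical helper, and `newline="\n"` is passed only to keep the output identical across platforms.

```python
import io


def encode_via_wrapper(text, encoding, errors="strict"):
    """Write text through a TextIOWrapper and return the bytes it produced."""
    binary = io.BytesIO()
    wrapper = io.TextIOWrapper(
        binary, encoding=encoding, errors=errors, newline="\n"
    )
    wrapper.write(text)
    wrapper.flush()
    data = binary.getvalue()
    wrapper.detach()  # separate the layers; the wrapper is unusable afterwards
    return data


if __name__ == "__main__":
    print(encode_via_wrapper("\u20ac", "utf-8"))
    print(encode_via_wrapper("\u20ac", "ascii", errors="replace"))
```

The same constructor call, applied to `sys.stdout.detach()` instead of a BytesIO, is the replacement described above. The `errors` argument is optional there too and controls what happens to characters the chosen codec cannot represent.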
 
Piet van Oostrum

7stud said:
7> Thanks for the response. My OS is mac osx 10.4.11. I'm not really
7> sure how to check my locale settings. Here is some stuff I tried:
7> $ echo $LANG
7> $ echo $LC_ALL
7> $ echo $LC_CTYPE
7> $ locale
7> LANG=
7> LC_COLLATE="C"
7> LC_CTYPE="C"
7> LC_MESSAGES="C"
7> LC_MONETARY="C"
7> LC_NUMERIC="C"
7> LC_TIME="C"
7> LC_ALL="C"

IIRC, Mac OS X 10.4 does not set LANG or LC_* automatically. In 10.5
Terminal has an option in the preferences to set LANG according to the
encoding chosen (and presumably the language of the user).
 
7stud

Thanks for the details.
 
PYTHONIOENCODING

You may want to try setting the environment variable PYTHONIOENCODING to "utf_8". I have written a webpage, daveagp.wordpress.com/what-a-character/, with some details on my ordeal with this problem.
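The effect of PYTHONIOENCODING can be checked by starting a child interpreter with the variable set and asking it what sys.stdout.encoding is. This is a sketch assuming Python 3.5 or later (for subprocess.run); `child_stdout_encoding` is a hypothetical helper name.

```python
import codecs
import os
import subprocess
import sys

# Program run by the child: report its own stdout encoding.
CHILD = "import sys; sys.stdout.write(sys.stdout.encoding)"


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return what it reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    reported = result.stdout.decode("ascii").strip()
    # Normalise spellings such as 'utf_8' and 'UTF-8' to one canonical name.
    return codecs.lookup(reported).name


if __name__ == "__main__":
    for name in ("utf_8", "ascii"):
        print(name, "->", child_stdout_encoding(name))
```

Because the variable is read at interpreter startup, it has to be set before Python starts; assigning to os.environ inside an already running script does not change that script's own streams.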
 
