How does Python get the value for sys.stdin.encoding?

RG

I thought it was hard-coded into the Python executable at compile time,
but that is apparently not the case:

[ron@mickey:~]$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;print sys.stdin.encoding
UTF-8
>>> ^D
[ron@mickey:~]$ echo 'import sys;print sys.stdin.encoding' | python
None
[ron@mickey:~]$

And indeed, trying to pipe unicode into Python doesn't work, even though
it works fine when Python runs interactively. So how can I make this
work?

Thanks,
rg
 
Benjamin Kaplan

sys.stdin and sys.stdout are files, just like any other. There's nothing
special about them at compile time. When the interpreter starts, it
checks to see if they are ttys. If they are, then it tries to figure
out the terminal's encoding based on the environment. The code for
this is in pythonrun.c if you want to see exactly what it's doing. If
stdout and stdin aren't ttys, then their encoding stays as None and
the interpreter will use sys.getdefaultencoding() if you try printing
Unicode strings.
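
The fallback described above can be sketched in a few lines. This is only a rough illustration, not the interpreter's actual logic from pythonrun.c. The helper name stream_info is made up for this example, and io.StringIO is used as a stand-in for a stream that is not a tty and has no encoding. Note that on Python 3 a piped sys.stdin usually reports a real encoding rather than None, so the None case mostly applies to Python 2.

```python
import io
import sys


def stream_info(stream):
    """Hypothetical helper: report whether `stream` is a tty, its own
    encoding (possibly None), and the encoding that would effectively apply
    if we fall back to sys.getdefaultencoding() as described above."""
    is_tty = stream.isatty()
    own_encoding = getattr(stream, "encoding", None)
    effective = own_encoding or sys.getdefaultencoding()
    return is_tty, own_encoding, effective


if __name__ == "__main__":
    # io.StringIO stands in for a non-tty stream without an encoding.
    print(stream_info(io.StringIO()))
```

Calling stream_info(sys.stdin) once interactively and once with input piped in should show the difference the original post ran into.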

By the way, there is no such thing as piping Unicode into Python.
Unicode is an abstract concept where each character maps to a
code point. Pipes can only deal with bytes. You may be using one of the
encodings capable of holding the entire range of Unicode characters
(such as UTF-8, UTF-16 LE, UTF-16 BE, UTF-32 LE, or UTF-32 BE), but
that's not the same thing as Unicode. You really have to watch your
encodings when you pass data around between programs. There's no way to
avoid it.
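
To make the bytes-versus-Unicode point concrete, here is a small sketch that encodes one character with each of the encodings named above. The function name encode_all and the sample character are chosen for illustration only.

```python
def encode_all(text):
    """Encode `text` with several encodings that cover all of Unicode and
    return a dict mapping the encoding name to the resulting bytes."""
    names = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
    return dict((name, text.encode(name)) for name in names)


if __name__ == "__main__":
    sample = u"\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
    for name, data in sorted(encode_all(sample).items()):
        print("%-10s %r" % (name, data))
```

The same code point turns into five different byte sequences, so the program on the receiving end of a pipe has to know which encoding the sender used. Decoding the UTF-8 bytes as Latin-1, for example, yields two wrong characters instead of one correct one.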
 
RG

Benjamin Kaplan said:
I thought it was hard-coded into the Python executable at compile time,
but that is apparently not the case:

[ron@mickey:~]$ python
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys;print sys.stdin.encoding
UTF-8
>>> ^D
[ron@mickey:~]$ echo 'import sys;print sys.stdin.encoding' | python
None
[ron@mickey:~]$

And indeed, trying to pipe unicode into Python doesn't work, even though
it works fine when Python runs interactively.  So how can I make this
work?

sys.stdin and sys.stdout are files, just like any other. There's nothing
special about them at compile time. When the interpreter starts, it
checks to see if they are ttys. If they are, then it tries to figure
out the terminal's encoding based on the environment. The code for
this is in pythonrun.c if you want to see exactly what it's doing.

Thanks. Looks like the magic incantation is:

export PYTHONIOENCODING='utf-8'
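
For anyone who wants to check the effect without changing their shell, the following is a hedged sketch that starts a child interpreter with stdin connected to a pipe and PYTHONIOENCODING set in its environment. The function name child_stdin_encoding is invented for this example, and it assumes sys.executable points at a working interpreter (2.6 or later, where the variable is supported).

```python
import os
import subprocess
import sys


def child_stdin_encoding(io_encoding):
    """Run a child Python whose stdin is a pipe, with PYTHONIOENCODING set
    to `io_encoding`, and return the sys.stdin.encoding it reports."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    child_code = "import sys; sys.stdout.write(str(sys.stdin.encoding))"
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        env=env,
    )
    out, _ = proc.communicate(b"")  # empty input; only the report matters
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print(child_stdin_encoding("utf-8"))
```

The child should report the encoding that was passed in, even though its stdin is a pipe rather than a tty.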

Benjamin Kaplan said:
By the way, there is no such thing as piping Unicode into Python.

Yeah, I know. I should have said "piping UTF-8 encoded unicode" or
something like that.

Benjamin Kaplan said:
You really have to watch your encodings
when you pass data around between programs. There's no way to avoid
it.

Yeah, I keep re-learning that lesson again and again.

rg
 
Anssi Saari

Benjamin Kaplan said:
sys.stdin and sys.stdout are files, just like any other. There's nothing
special about them at compile time. When the interpreter starts, it
checks to see if they are ttys. If they are, then it tries to figure
out the terminal's encoding based on the environment.

Just a related question: is looking at sys.stdin.encoding the proper
way of doing things? I've been working on a script to display some
email headers, some of which are MIME-encoded in various charsets.

Until now I have used the encoding that locale.getdefaultlocale()
returns as the target encoding, since "it seemed to work", although on
one computer the call returns ISO-8859-15 and I don't quite understand
why.
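
For reference, a script of the kind described might look roughly like the sketch below. It is not a settled answer to the question. It assumes that the stream being written to (typically sys.stdout rather than sys.stdin) is the one whose encoding matters, and that the locale's preferred encoding is an acceptable fallback when the stream has none. The names header_to_text and output_encoding are made up for this example.

```python
import locale
from email.header import decode_header


def header_to_text(raw_header):
    """Decode a MIME-encoded (RFC 2047) header into one Unicode string."""
    pieces = []
    for value, charset in decode_header(raw_header):
        if isinstance(value, bytes):
            # Parts without a declared charset are treated as ASCII here,
            # which is an assumption of this sketch.
            value = value.decode(charset or "ascii", "replace")
        pieces.append(value)
    return u"".join(pieces)


def output_encoding(stream):
    """Prefer the stream's own encoding; otherwise fall back to the
    locale's preferred encoding."""
    own = getattr(stream, "encoding", None)
    return own or locale.getpreferredencoding(False)


if __name__ == "__main__":
    print(repr(header_to_text("=?iso-8859-1?q?caf=E9?=")))
```

The decoded text can then be encoded for display with something like text.encode(output_encoding(sys.stdout), "replace"), so that characters the terminal cannot show are replaced instead of raising an error.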
 
