Yet another unicode WTF

Ron Garret

Python 2.6.2 on OS X 10.5.7:

[ron@mickey:~]$ echo $LANG
en_US.UTF-8
[ron@mickey:~]$ cat frob.py
#!/usr/bin/env python
print u'\u03BB'

[ron@mickey:~]$ ./frob.py
ª
[ron@mickey:~]$ ./frob.py > foo
Traceback (most recent call last):
  File "./frob.py", line 2, in <module>
    print u'\u03BB'
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03bb' in position 0: ordinal not in range(128)


(That's supposed to be a small greek lambda, but I'm using a
brain-damaged news reader that won't let me set the character encoding.
It shows up correctly in my terminal.)

According to what I thought I knew about unix (and I had fancied myself
a bit of an expert until just now) this is impossible. Python is
obviously picking up a different default encoding when its output is
being piped to a file, but I always thought one of the fundamental
invariants of unix processes was that there's no way for a process to
know what's on the other end of its stdout.

Clues appreciated. Thanks.

rg
 
Lawrence D'Oliveiro

Ron said:
Python 2.6.2 on OS X 10.5.7:

Same result, Python 2.6.1-3 on Debian Unstable. My $LANG is en_NZ.UTF-8.
... I always thought one of the fundamental
invariants of unix processes was that there's no way for a process to
know what's on the other end of its stdout.

Well, there have long been functions like isatty(3). That's probably what's
involved here.
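
For illustration, here is a minimal sketch of that kind of check, written for Python 3 rather than the 2.6 used in this thread. `describe_fd` is a hypothetical helper name chosen for this example; it is not something Python itself calls. It uses `os.isatty` first and then falls back to `os.fstat` to tell pipes from regular files.

```python
import os
import stat
import tempfile


def describe_fd(fd):
    """Classify what a file descriptor is connected to (hypothetical helper)."""
    if os.isatty(fd):
        return "tty"
    mode = os.fstat(fd).st_mode
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


# A pipe created in-process stands in for `./frob.py | something`.
read_end, write_end = os.pipe()
pipe_kind = describe_fd(write_end)
os.close(read_end)
os.close(write_end)

# A temporary file stands in for `./frob.py > foo`.
with tempfile.TemporaryFile() as tmp:
    file_kind = describe_fd(tmp.fileno())

print(pipe_kind, file_kind)
```

Calling `describe_fd(1)` in a script and running it with and without redirection should show the same difference Ron ran into.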
 
Gabriel Genellina

Ron said:
According to what I thought I knew about unix (and I had fancied myself
a bit of an expert until just now) this is impossible. Python is
obviously picking up a different default encoding when its output is
being piped to a file, but I always thought one of the fundamental
invariants of unix processes was that there's no way for a process to
know what's on the other end of its stdout.

It may be hard to know *who* is at the other end of the pipe, but it's
easy to know *what* kind of file it is.
Lots of programs detect whether stdout is a tty or not (using isatty(3))
and adapt their output accordingly; ls is one example.

Python knows the terminal encoding (or at least can make a good guess),
but a file may use *any* encoding you want, completely unrelated to your
terminal settings. So when stdout is redirected, Python refuses to guess
its encoding; see the PYTHONIOENCODING environment variable.
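
A rough sketch of how PYTHONIOENCODING changes what ends up in a redirected stdout, written for Python 3 (which picks a default for redirected output instead of refusing, but honours the same variable). `run_with_io_encoding` is a hypothetical helper for this example; the child process writes a small lambda to a pipe, and the parent inspects the raw bytes.

```python
import os
import subprocess
import sys

# The child writes U+03BB (small Greek lambda) to its stdout, which is a pipe.
CHILD_CODE = "import sys; sys.stdout.write(u'\\u03bb')"


def run_with_io_encoding(encoding):
    """Run the child with PYTHONIOENCODING set; return the bytes it wrote."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = encoding
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


utf8_bytes = run_with_io_encoding("utf-8")
greek_bytes = run_with_io_encoding("iso-8859-7")
print(repr(utf8_bytes), repr(greek_bytes))
```

The same character comes out as two bytes under UTF-8 and one byte under ISO 8859-7, which is why the interpreter cannot infer the right answer from the fact of redirection alone.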
 
Ned Deily

Ron said:
According to what I thought I knew about unix (and I had fancied myself
a bit of an expert until just now) this is impossible. Python is
obviously picking up a different default encoding when its output is
being piped to a file, but I always thought one of the fundamental
invariants of unix processes was that there's no way for a process to
know what's on the other end of its stdout.

Clues appreciated. Thanks.

$ python2.6 -c 'import sys; print sys.stdout.encoding, \
sys.stdout.isatty()'
UTF-8 True
$ python2.6 -c 'import sys; print sys.stdout.encoding, \
sys.stdout.isatty()' > foo ; cat foo
None False
 
Lawrence D'Oliveiro

Gabriel said:
Python knows the terminal encoding (or at least can make a good guess),
but a file may use *any* encoding you want, completely unrelated to your
terminal settings.

It should still respect your localization settings, though.
 
Ned Deily

So shouldn't the second case also detect UTF-8? The filesystem knows
it's UTF-8, the shell knows it too. Why doesn't Python know it?

The filesystem knows what is UTF-8? While the setting of the locale
environment variables may influence how the file system interprets the
*name* of a file, it has no direct influence on what the *contents* of a
file are or are supposed to be. Remember that in Python 2.x, a file is just
a sequence of bytes. If you want to write Unicode to the file, you
need to use something like codecs.open to wrap the file object with the
proper StreamWriter encoder.
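
As a minimal sketch of that approach (the temporary path is only for this example), `codecs.open` returns a wrapper that encodes Unicode text on the way out, so the file ends up holding plain UTF-8 bytes regardless of where stdout points:

```python
import codecs
import os
import tempfile

# A throwaway file, standing in for the "foo" of the original post.
path = os.path.join(tempfile.mkdtemp(), "foo")

# The wrapper encodes each Unicode string to UTF-8 before writing.
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(u"\u03bb\n")

# Reading the file back in binary mode shows the encoded bytes.
with open(path, "rb") as raw:
    data = raw.read()

print(repr(data))
```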

What confuses matters in 2.x is the print statement's under-the-covers
implicit Unicode encoding for files connected to a terminal:

http://bugs.python.org/issue612627
http://bugs.python.org/issue4947
http://wiki.python.org/moin/PrintFails

>>> x = u'\u0430\u0431\u0432'
>>> print x
[nice looking characters here]
>>> sys.stdout.write(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
'UTF-8'

In python 3.x, of course, the encoding happens automatically but you
still have to tell python, via the "encoding" argument to open, what the
encoding of the file's content is (or accept python's default which may
not be very useful):
'mac-roman'

WTF, indeed.
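
A small Python 3 sketch of the two cases described above; the file names are placeholders for this example. Passing `encoding` pins the on-disk bytes, while omitting it leaves the choice to the interpreter, and the file object's `encoding` attribute shows what was picked.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()

# Explicit encoding: the bytes on disk are predictable.
explicit_path = os.path.join(workdir, "explicit.txt")
with open(explicit_path, "w", encoding="utf-8") as out:
    out.write("\u03bb")
    explicit_encoding = out.encoding

with open(explicit_path, "rb") as raw:
    explicit_bytes = raw.read()

# No encoding argument: Python chooses one, which varies by platform and version.
default_path = os.path.join(workdir, "default.txt")
with open(default_path, "w") as out:
    default_encoding = out.encoding

print(explicit_encoding, repr(explicit_bytes), default_encoding)
```

Whatever `default_encoding` prints on a given machine is the value Python would silently use, which is the surprise being pointed out here.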
 
Paul Boddie

Ron said:
According to what I thought I knew about unix (and I had fancied myself
a bit of an expert until just now) this is impossible.  Python is
obviously picking up a different default encoding when its output is
being piped to a file, but I always thought one of the fundamental
invariants of unix processes was that there's no way for a process to
know what's on the other end of its stdout.

The only way to think about this (in Python 2.x, at least) is to
consider stream and file objects as things which only understand plain
byte strings. Consequently, use of the codecs module is required if
receiving/sending Unicode objects from/to streams and files.
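
On the receiving side, a minimal sketch might look like this, with an in-memory `io.BytesIO` standing in for a real byte-oriented stream such as a file or pipe:

```python
import codecs
import io

# The stream itself only knows about bytes.
raw = io.BytesIO(b"\xce\xbb\n")

# codecs.getreader returns a StreamReader class; wrapping the byte stream
# with it yields Unicode text on read.
reader = codecs.getreader("utf-8")(raw)
text = reader.read()

print(repr(text))
```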

Paul
 
Paul Boddie

Actually strings in Python 2.4 or later have the ‘encode’ method, with
no need for importing extra modules:

=====
$ python -c 'import sys; sys.stdout.write(u"\u03bb\n".encode("utf-8"))'
λ

$ python -c 'import sys; sys.stdout.write(u"\u03bb\n".encode("utf-8"))' > foo ; cat foo
λ
=====

Those are Unicode objects, not traditional Python strings. Although
strings do have decode and encode methods, even in Python 2.3, the
former is shorthand for the construction of a Unicode object using the
stated encoding whereas the latter seems to rely on the error-prone
automatic encoding detection in order to create a Unicode object and
then encode the result - in effect, recoding the string.

As I noted, if one wants to remain sane and not think about encoding
everything everywhere, creating a stream using a codecs module
function or class will permit the construction of something which
deals with Unicode objects satisfactorily.
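
A minimal sketch of that idea, again with `io.BytesIO` standing in for a byte stream such as a redirected stdout:

```python
import codecs
import io

# Stand-in for a byte-oriented stream (a file, a pipe, a redirected stdout).
raw = io.BytesIO()

# codecs.getwriter returns a StreamWriter class; the wrapped stream accepts
# Unicode objects and encodes them before they reach the underlying bytes.
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"\u03bb\n")

encoded = raw.getvalue()
print(repr(encoded))
```

In Python 2 the same wrapper can be applied to `sys.stdout` itself; in Python 3 the byte stream to wrap would be `sys.stdout.buffer`.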

Paul
 
Ned Deily

In python 3.x, of course, the encoding happens automatically but you
still have to tell python, via the "encoding" argument to open, what the
encoding of the file's content is (or accept python's default which may
not be very useful):

'mac-roman'

WTF, indeed.

BTW, I've opened a 3.1 release blocker issue about 'mac-roman' as a
default on OS X. Hard to believe none of us has noticed this up to now!

http://bugs.python.org/issue6202
 
