Trying to understand this moji-bake

Steven D'Aprano

I have an unexpected display error when dealing with Unicode strings, and
I cannot understand where the error is occurring. I suspect it's not
actually a Python issue, but I thought I'd ask here to start.

Using Python 3.3, if I print a unicode string from the command line, it
displays correctly. I'm using the KDE 3.5 Konsole application, with the
encoding set to the default (which ought to be UTF-8, I believe, although
I'm not completely sure). This displays correctly:

[steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
ñøλπйж


Likewise for Python 3.2:

[steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
ñøλπйж


But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
ñøλÃùö


However, interactively it works fine:

[steve@ando ~]$ python2.7 -E
Python 2.7.2 (default, May 18 2012, 18:25:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
ñøλπйж


This occurs on at least two different machines, one using Centos and the
other Debian.

Anyone have any idea what's going on? I can replicate the display error
using Python 3 like this:

py> s = 'ñøλπйж'
py> print(s.encode('utf-8').decode('latin-1'))
ñøλÃùö

but I'm not sure why it's happening at the command line. Anyone have any
ideas?
 
Cameron Simpson

I have an unexpected display error when dealing with Unicode strings, and
I cannot understand where the error is occurring. I suspect it's not
actually a Python issue, but I thought I'd ask here to start.

Using Python 3.3, if I print a unicode string from the command line, it
displays correctly. I'm using the KDE 3.5 Konsole application, with the
encoding set to the default (which ought to be UTF-8, I believe, although
I'm not completely sure).

There are at least 2 layers: the encoding python is using for
transcription to the terminal and the decoding the terminal is
making of the byte stream to decide what to display.

The former can be printed with:

import sys
print(sys.stdout.encoding)

The latter depends on your desktop settings and KDE settings I
guess. I would hope the Konsole will decide based on your environment
settings. Running the shell command:

locale

will print the settings derived from that. Provided your environment
matches that which invoked the Konsole, that should be informative.

But I expect the Konsole is decoding using UTF-8 because so much
else works for you already.
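As a minimal sketch of checking the first layer (what Python itself will use), the snippet below gathers the relevant values in one place. It assumes Python 3, and the helper name `describe_encodings` is made up for illustration. Note that the second layer, the terminal's own decoding, cannot be queried from Python this way.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python would use on this machine."""
    return {
        # Encoding used when writing text to stdout (None if stdout has
        # been replaced by an object without an encoding attribute).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used when reading text from stdin.
        "stdin": getattr(sys.stdin, "encoding", None),
        # Encoding implied by the locale settings (LANG, LC_ALL, LC_CTYPE).
        "locale": locale.getpreferredencoding(False),
        # Default encoding for str <-> bytes conversions.
        "default": sys.getdefaultencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_encodings().items()):
        print("{}: {}".format(name, value))
```

If "stdout" and "locale" both report UTF-8 and the display is still wrong, the mismatch is probably somewhere other than the output layer.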

I would point out that you could perhaps debug with something like this:

python2.7 ..... | od -c

which will print the output bytes. By printing to the terminal,
you're letting the terminal's decoding get in your way. It is fine
for seeing correct/incorrect results, but not so fine for seeing
the bytes causing them.
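
The same idea can be sketched from inside Python 3: encode the text yourself and look at the bytes as hex, so that the terminal never has to decode anything beyond ASCII. The helper name `show_bytes` is hypothetical.

```python
def show_bytes(text, encoding):
    """Return the bytes of text in the given encoding as space-separated hex."""
    data = text.encode(encoding)
    return " ".join("{:02x}".format(byte) for byte in data)


sample = "\u00f1\u00f8"  # the first two characters of the test string

print(show_bytes(sample, "utf-8"))    # c3 b1 c3 b8
print(show_bytes(sample, "latin-1"))  # f1 f8
```

Two bytes per character means UTF-8 went out; one byte per character means Latin-1 did.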

This displays correctly:
[steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
ñøλπйж


Likewise for Python 3.2:
[steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
ñøλπйж

But using Python 2.7, I get a really bad case of moji-bake:
[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
ñøλÃùö

However, interactively it works fine:
[...]

Debug by printing sys.stdout.encoding at this point.

I do recall getting different output encodings depending on how
Python was invoked; I forget the pattern, but I also remember writing
some ghastly hack to work around it, which I can't find at the
moment...

Also see "man python2.7" in particular the PYTHONIOENCODING environment
variable. That might let you exert more control.
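
A small sketch of what PYTHONIOENCODING does, assuming Python 3 and a child interpreter started via `subprocess`. The helper name `child_output_bytes` is made up for illustration. The child prints a single "ñ" (U+00F1) and the parent captures the raw bytes, so no terminal decoding is involved.

```python
import os
import subprocess
import sys


def child_output_bytes(io_encoding):
    """Run a child Python that prints U+00F1 and return its raw stdout bytes."""
    # Copy the current environment and override only PYTHONIOENCODING.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "print('\\u00f1')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Strip the trailing newline so only the character's bytes remain.
    return result.stdout.strip()


print(child_output_bytes("utf-8"))    # b'\xc3\xb1'
print(child_output_bytes("latin-1"))  # b'\xf1'
```

The same text produces different bytes depending on the variable, which is why it is worth checking when the output encoding seems to depend on how Python was invoked.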

Cheers,
--
Cameron Simpson

ASCII n s. [from the Greek] Those people who, at certain times of the year,
have no shadow at noon; such are the inhabitants of the torrid zone.
- 1837 copy of Johnson's Dictionary
 
Chris Angelico

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
ñøλÃùö

What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.

ChrisA
 
wxjmfauth

On Saturday, 25 January 2014 05:37:34 UTC+1, Steven D'Aprano wrote:
[...]

The basic problem is neither Python, nor the system (OS), nor
the terminal, nor the GUI console. The basic problem is that
all these elements [*] are not "speaking" the same language.

The second problem lies in Python itself. Python attempts
to solve this problem by doing its own "cooking" based on the
elements I pointed out above [*], with the side effect that the
situation may become more confused and/or simply not work
properly (sys.std***.encoding, print, GUI/terminal, source
coding, ...).

The third problem is more *x specific. In many cases,
the Python "distribution" is tweaked to make it work on
a specific *x version/distribution
(sys.getdefaultencoding(), site.py, sitecustomize.py),
which finally results in a Python that does not work properly.

Fourth problem. GUI applications that are supposed to mimic the
"real" terminal do so by adding their own "recipes".

Fifth problem. The user who has to understand all this
stuff.

n-th problem, ...
jmf

PS I already understood all this stuff ten years ago!
 
Peter Pearson

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
ñøλÃùö

What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.

ChrisA

I seem to be getting the same behavior as Steven:

$ python2.7 -c "print u'ñøλπйж'"
ñøλÀùö
$ python2.7 -c "import sys; print(sys.stdout.encoding)"
UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ python2.7 -c "import sys; print(sys.stdin.encoding)"
UTF-8

Also, my GNOME Terminal 3.4.1.1 character encoding is "Unicode (UTF-8)".

HTH
 
Oscar Benjamin

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
ñøλÃùö

However, interactively it works fine:

[steve@ando ~]$ python2.7 -E
Python 2.7.2 (default, May 18 2012, 18:25:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
ñøλπйж

This occurs on at least two different machines, one using Centos and the
other Debian.

Same for me. It's to do with using a u literal:

$ python2.7 -c "print('ñøλπйж')"
ñøλπйж
$ python2.7 -c "print(u'ñøλπйж')"
ñøλÀùö
$ python2.7 -c "print(repr('ñøλπйж'))"
'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
$ python2.7 -c "print(repr(u'ñøλπйж'))"
u'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
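
The repr of the u literal above has twelve characters, one per UTF-8 byte, which is what you get when UTF-8 bytes are decoded as Latin-1. A minimal Python 3 sketch of that effect, assuming the shell hands `python -c` the UTF-8 bytes typed in the terminal:

```python
original = "\u00f1\u00f8\u03bb\u03c0\u0439\u0436"  # the six-character test string

utf8_bytes = original.encode("utf-8")   # 12 bytes, two per character
misread = utf8_bytes.decode("latin-1")  # one character per byte

print(len(original), len(utf8_bytes), len(misread))  # 6 12 12
print(ascii(misread))

# Latin-1 maps every byte value to a character, so the damage is reversible.
repaired = misread.encode("latin-1").decode("utf-8")
assert repaired == original
```

The `ascii(misread)` line prints the same escape sequence as the 2.7 repr above, minus the `u` prefix.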

$ python2.7
Python 2.7.5+ (default, Sep 19 2013, 13:49:51)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
'ascii'

It works in the interactive prompt:
ñøλπйж

But the interactive prompt has an associated encoding:
'UTF-8'

If I put it into a utf-8 file with no encoding declared I get a SyntaxError:
$ cat tmp.py
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
File "tmp.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file tmp.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

If I add the encoding declaration it works:

oscar@tonis-laptop:~$ vim tmp.py
oscar@tonis-laptop:~$ cat tmp.py
# -*- coding: utf-8 -*-
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
ñøλπйж
oscar@tonis-laptop:~$

So I'd say that your original example should be a SyntaxError with
Python 2.7 but instead it implicitly uses latin-1.
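
To illustrate how the declared source encoding changes what a literal means, here is a rough Python 3 sketch that compiles the same source bytes under two different PEP 263 cookies. The helper name `literal_under_cookie` is hypothetical, and this only demonstrates the effect of the declaration; it is not a claim about what `python2.7 -c` does internally.

```python
def literal_under_cookie(codec_name, literal_bytes):
    """Compile byte source declaring codec_name and return the literal's value."""
    cookie = "# -*- coding: {} -*-\n".format(codec_name).encode("ascii")
    source = cookie + b"s = '" + literal_bytes + b"'\n"
    namespace = {}
    # compile() accepts bytes and honours the coding declaration on line 1.
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


# The bytes of the test string as typed in a UTF-8 terminal.
raw = "\u00f1\u00f8\u03bb\u03c0\u0439\u0436".encode("utf-8")

as_utf8 = literal_under_cookie("utf-8", raw)
as_latin1 = literal_under_cookie("latin-1", raw)

print(len(as_utf8), len(as_latin1))
print(as_latin1 == raw.decode("latin-1"))
```

With the utf-8 cookie the literal has six characters; with the latin-1 cookie the identical bytes become twelve characters, matching the garbled 2.7 result.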


Oscar
 
Steven D'Aprano

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
ñøλÃùö

What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.

I think that's it! Python 2.7 ought to raise a SyntaxError, since there's
no source encoding declared, while Python 3.3 defaults to UTF-8 which is
the same as my terminal. If there's a bug, it is that Python 2.7 doesn't
raise SyntaxError when called with -c and there are non-ASCII literals in
the source. Instead, it seems to be defaulting to Latin-1, hence the
moji-bake.

Thanks to everyone who responded!
 
Chris Angelico

If there's a bug, it is that Python 2.7 doesn't
raise SyntaxError when called with -c and there are non-ASCII literals in
the source. Instead, it seems to be defaulting to Latin-1, hence the
moji-bake.

That might well be a bug! I was reading the PEP, which was pretty
clear about it needing to be ASCII by default. It's not so clear about
-c but I would expect it to do the same.

ChrisA
 
Terry Reedy

$ python2.7 -c "import sys; print(sys.stdin.encoding)"
UTF-8

This isn't from stdin, though, it's about the interpretation of the
bytes of source code without a magic cookie.

According to PEP 263 [1], the default encoding should have become
"ascii" as of Python 2.5. That's what puzzles me.

I believe it is actually (but unofficially) latin-1 so that latin-1
accented chars can be used in identifiers even though only ascii is
officially supported.
 
