Trying to understand this moji-bake

Steven D'Aprano · Jan 24, 2014

I have an unexpected display error when dealing with Unicode strings, and
I cannot understand where the error is occurring. I suspect it's not
actually a Python issue, but I thought I'd ask here to start.

Using Python 3.3, if I print a unicode string from the command line, it
displays correctly. I'm using the KDE 3.5 Konsole application, with the
encoding set to the default (which ought to be UTF-8, I believe, although
I'm not completely sure). This displays correctly:

[steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
ñøλπйж

Likewise for Python 3.2:

[steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
ñøλπйж

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶

However, interactively it works fine:

[steve@ando ~]$ python2.7 -E
Python 2.7.2 (default, May 18 2012, 18:25:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
ñøλπйж

This occurs on at least two different machines, one using Centos and the
other Debian.

Anyone have any idea what's going on? I can replicate the display error
using Python 3 like this:

py> s = 'ñøλπйж'
py> print(s.encode('utf-8').decode('latin-1'))
Ã±Ã¸Î»ÏÐ¹Ð¶

but I'm not sure why it's happening at the command line. Anyone have any
ideas?
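
A minimal Python 3 sketch of the round trip described above. It assumes the garbling is exactly "UTF-8 bytes misread as Latin-1", which is what the encode/decode reproduction suggests; the variable names are illustrative only. The string is written with escapes so the example does not depend on the editor's or terminal's encoding.

```python
# The six characters from the post, written as escapes (n-tilde, o-slash,
# lambda, pi, and two Cyrillic letters).
original = "\u00f1\u00f8\u03bb\u03c0\u0439\u0436"

# Reproduce the garbling: encode as UTF-8, then misread the bytes as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")

# Each of the six characters needs two UTF-8 bytes, so the garbled
# string has twelve characters.
print(len(original), len(garbled))

# Latin-1 maps every byte to exactly one code point, so the damage is
# reversible: undo the wrong decode, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

If the reverse step fails with a UnicodeDecodeError on real data, the garbling probably went through some other codec (for example cp1252) rather than Latin-1.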

Cameron Simpson · Jan 25, 2014

I have an unexpected display error when dealing with Unicode strings, and
I cannot understand where the error is occurring. I suspect it's not
actually a Python issue, but I thought I'd ask here to start.

Using Python 3.3, if I print a unicode string from the command line, it
displays correctly. I'm using the KDE 3.5 Konsole application, with the
encoding set to the default (which ought to be UTF-8, I believe, although
I'm not completely sure).

There are at least 2 layers: the encoding python is using for
transcription to the terminal and the decoding the terminal is
making of the byte stream to decide what to display.

The former can be printed with:

import sys
print(sys.stdout.encoding)

The latter depends on your desktop settings and KDE settings I
guess. I would hope the Konsole will decide based on your environment
settings. Running the shell command:

locale

will print the settings derived from that. Provided your environment
matches that which invoked the Konsole, that should be informative.

But I expect the Konsole is decoding using UTF-8 because so much
else works for you already.

I would point out that you could perhaps debug with something like this:

python2.7 ..... | od -c

which will print the output bytes. By printing to the terminal,
you're letting the terminal's decoding get in your way. It is fine
for seeing correct/incorrect results, but not so fine for seeing
the bytes causing them.
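
In the same spirit as piping through od, here is a small Python 3 sketch that looks at the bytes instead of the rendered glyphs. The helper name hex_bytes is made up for this example, and it prints hex pairs rather than od's octal/character format.

```python
def hex_bytes(data):
    """Return the bytes as space-separated hex pairs."""
    return " ".join(format(b, "02x") for b in data)


text = "\u00f1\u00f8\u03bb\u03c0\u0439\u0436"

# Bytes a correctly configured UTF-8 pipeline would send to the terminal.
good = text.encode("utf-8")

# Bytes sent when the UTF-8 bytes are first misread as Latin-1 and the
# resulting string is then encoded as UTF-8 again.
bad = good.decode("latin-1").encode("utf-8")

print(hex_bytes(good))
print(hex_bytes(bad))
```

The first line shows 12 bytes and the second 24, which is the signature of a double encoding rather than a terminal decoding problem.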

This displays correctly:
[steve@ando ~]$ python3.3 -c "print(u'ñøλπйж')"
ñøλπйж

Likewise for Python 3.2:
[steve@ando ~]$ python3.2 -c "print('ñøλπйж')"
ñøλπйж

But using Python 2.7, I get a really bad case of moji-bake:
[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶

However, interactively it works fine:

[...]

Debug by printing sys.stdout.encoding at this point.

I do recall getting different output encodings depending on how
Python was invoked; I forget the pattern, but I also remember writing
some ghastly hack to work around it, which I can't find at the
moment...

Also see "man python2.7" in particular the PYTHONIOENCODING environment
variable. That might let you exert more control.
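
A rough sketch of what PYTHONIOENCODING does, assuming a Python 3 interpreter is available as sys.executable (the thread is about 2.7, but the variable works the same way in principle). The helper child_stdout_bytes is hypothetical; it starts a child interpreter that prints U+00F1 and returns the raw bytes the child wrote.

```python
import os
import subprocess
import sys


def child_stdout_bytes(io_encoding):
    """Run a child interpreter that prints U+00F1; return the raw bytes."""
    # Copy the current environment and override only the I/O encoding.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", r"print('\xf1')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Drop the trailing newline so only the encoded character remains.
    return result.stdout.strip()


print(child_stdout_bytes("utf-8"))
print(child_stdout_bytes("latin-1"))
```

The same character comes out as two bytes under UTF-8 and as one byte under Latin-1, which is the layer sys.stdout.encoding reports on.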

Cheers,
--
Cameron Simpson

ASCII n s. [from the greek] Those people who, at certain times of the year,
have no shadow at noon; such are the inhabitatants of the torrid zone.
- 1837 copy of Johnson's Dictionary

Chris Angelico · Jan 25, 2014

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶

What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.

ChrisA

wxjmfauth · Jan 25, 2014

On Saturday, 25 January 2014 05:37:34 UTC+1, Steven D'Aprano wrote:

[...]

The basic problem is neither Python, nor the system (OS), nor
the terminal, nor the GUI console. The basic problem is that
all these elements [*] are not "speaking" the same language.

The second problem lies in Python itself. Python attempts
to solve this problem by doing its own "cooking" based on the
elements I pointed to above [*], with the side effect that the
situation may just become more confused and/or simply not work
properly (sys.std***.encoding, print, GUI/terminal, source
coding, ...)

The third problem is more *x specific. In many cases,
the Python "distribution" is tweaked in such a way as to
make it work on a specific *x version/distribution
(sys.getdefaultencoding(), site.py, sitecustomize.py),
finally resulting in a Python that does not work properly.

Fourth problem. GUI applications are supposed to mimic the
"real" terminal by doing and adding their own "recipes".

Fifth problem. The user who has to understand all this
stuff.

n-th problem, ...
jmf

PS I already understood all this stuff ten years ago!

Peter Pearson · Jan 25, 2014

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶

What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.

ChrisA

I seem to be getting the same behavior as Steven:

$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶
$ python2.7 -c "import sys; print(sys.stdout.encoding)"
UTF-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
$ python2.7 -c "import sys; print(sys.stdin.encoding)"
UTF-8

Also, my GNOME Terminal 3.4.1.1 character encoding is "Unicode (UTF-8)".

HTH

Chris Angelico · Jan 25, 2014

$ python2.7 -c "import sys; print(sys.stdin.encoding)"
UTF-8

This isn't from stdin, though, it's about the interpretation of the
bytes of source code without a magic cookie.

According to PEP 263 [1], the default encoding should have become
"ascii" as of Python 2.5. That's what puzzles me.

ChrisA

[1] http://www.python.org/dev/peps/pep-0263/
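
For comparison, a small Python 3 sketch of how the "magic cookie" is detected. It uses tokenize.detect_encoding from the standard library; note that Python 3 falls back to UTF-8 when no cookie is present, not to the ASCII default that PEP 263 describes for Python 2. The helper name declared_encoding is made up for the example.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding Python 3's tokenizer would pick."""
    # detect_encoding reads at most two lines looking for a BOM or cookie.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


with_cookie = b"# -*- coding: latin-1 -*-\nx = 1\n"
without_cookie = b"x = 1\n"

print(declared_encoding(with_cookie))     # normalized name for latin-1
print(declared_encoding(without_cookie))  # Python 3 default
```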

Oscar Benjamin · Jan 25, 2014

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶

However, interactively it works fine:

[steve@ando ~]$ python2.7 -E
Python 2.7.2 (default, May 18 2012, 18:25:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
ñøλπйж

This occurs on at least two different machines, one using Centos and the
other Debian.

Same for me. It's to do with using a u literal:

$ python2.7 -c "print('ñøλπйж')"
ñøλπйж
$ python2.7 -c "print(u'ñøλπйж')"
Ã±Ã¸Î»Ï€Ð¹Ð¶
$ python2.7 -c "print(repr('ñøλπйж'))"
'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'
$ python2.7 -c "print(repr(u'ñøλπйж'))"
u'\xc3\xb1\xc3\xb8\xce\xbb\xcf\x80\xd0\xb9\xd0\xb6'

$ python2.7
Python 2.7.5+ (default, Sep 19 2013, 13:49:51)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
'ascii'

It works in the interactive prompt:
ñøλπйж

But the interactive prompt has an associated encoding:
'UTF-8'

If I put it into a utf-8 file with no encoding declared I get a SyntaxError:
$ cat tmp.py
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
File "tmp.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file tmp.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

If I add the encoding declaration it works:

oscar@tonis-laptop:~$ vim tmp.py
oscar@tonis-laptop:~$ cat tmp.py
# -*- coding: utf-8 -*-
s = u'ñøλπйж'
print(s)
oscar@tonis-laptop:~$ python2.7 tmp.py
ñøλπйж
oscar@tonis-laptop:~$

So I'd say that your original example should be a SyntaxError with
Python 2.7 but instead it implicitly uses latin-1.
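
The difference between the two literals can be modelled in Python 3. This is only a sketch under the assumption stated above, namely that 2.7 treats the bytes of a -c command as Latin-1 when building a u'' literal; it does not run Python 2 itself, and the variable names are illustrative.

```python
# Bytes the shell hands to the interpreter for the six-character string
# (UTF-8, matching the terminal in the thread).
source_bytes = "\u00f1\u00f8\u03bb\u03c0\u0439\u0436".encode("utf-8")

# A Python 2 byte-string literal keeps the bytes untouched, so printing
# sends the original UTF-8 bytes straight back to the terminal.
plain_literal = source_bytes
shown_for_plain = plain_literal.decode("utf-8")  # terminal decodes UTF-8

# A u'' literal under the assumed Latin-1 source encoding turns every
# byte into one code point below U+0100.
unicode_literal = source_bytes.decode("latin-1")

# Printing encodes those code points as UTF-8; the terminal decodes them
# faithfully, so the wrong characters are what gets displayed.
shown_for_unicode = unicode_literal.encode("utf-8").decode("utf-8")

print(ascii(unicode_literal))
print(shown_for_plain == source_bytes.decode("utf-8"))
print(shown_for_unicode == unicode_literal)
```

The first line printed matches the repr of the u'' literal shown above (minus the u prefix), which is consistent with the Latin-1 explanation.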

Oscar

Steven D'Aprano · Jan 25, 2014

But using Python 2.7, I get a really bad case of moji-bake:

[steve@ando ~]$ python2.7 -c "print u'ñøλπйж'"
Ã±Ã¸Î»ÏÐ¹Ð¶

What's 2.7's default source code encoding? I thought it was ascii, but
maybe it's assuming (in the absence of a magic cookie) that it's
Latin-1.

I think that's it! Python 2.7 ought to raise a SyntaxError, since there's
no source encoding declared, while Python 3.3 defaults to UTF-8 which is
the same as my terminal. If there's a bug, it is that Python 2.7 doesn't
raise SyntaxError when called with -c and there are non-ASCII literals in
the source. Instead, it seems to be defaulting to Latin-1, hence the
moji-bake.

Thanks to everyone who responded!

Chris Angelico · Jan 25, 2014

If there's a bug, it is that Python 2.7 doesn't
raise SyntaxError when called with -c and there are non-ASCII literals in
the source. Instead, it seems to be defaulting to Latin-1, hence the
moji-bake.

That might well be a bug! I was reading the PEP, which was pretty
clear about it needing to be ASCII by default. It's not so clear about
-c but I would expect it to do the same.

ChrisA

Terry Reedy · Jan 25, 2014

$ python2.7 -c "import sys; print(sys.stdin.encoding)"
UTF-8

This isn't from stdin, though, it's about the interpretation of the
bytes of source code without a magic cookie.

According to PEP 263 [1], the default encoding should have become
"ascii" as of Python 2.5. That's what puzzles me.

I believe it is actually (but unofficially) latin-1 so that latin-1
accented chars can be used in identifiers even though only ascii is
officially supported.
