Python 3.0 crashes displaying Unicode at interactive prompt

John Machin · Dec 13, 2008

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\u9876'
>>> x
u'\u9876'

# As expected

Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = '\u9876'
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\python30\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 1: character maps to <undefined>

# *NOT* as expected (by me, that is)

Is this the intended outcome?

Vlastimil Brom · Dec 13, 2008

I also found this a bit surprising, but it seems to be the intended
behaviour (on a non-Unicode console):

http://docs.python.org/3.0/whatsnew/3.0.html
"PEP 3138: The repr() of a string no longer escapes non-ASCII
characters. It still escapes control characters and code points with
non-printable status in the Unicode standard, however."

I get the same error in the Windows cmd console (IDLE prints the
respective glyph correctly).
To get the old behaviour of repr(), one can use ascii(), I suppose.

Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\Python30\lib\encodings\cp852.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position
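
The PEP 3138 change quoted above can be checked without involving the console at all, since repr() and ascii() only build strings. A minimal sketch, assuming Python 3; the variable names are illustrative only:

```python
x = '\u9876'

r = repr(x)    # quote + the character itself + quote
a = ascii(x)   # quote + a backslash escape + quote

# Report lengths and code points instead of echoing r directly, so the
# sketch also runs on a console that cannot encode the character.
print(len(r), [hex(ord(c)) for c in r])
print(len(a), a)

# Control characters are still escaped by repr(), as the quoted text says.
print(repr('\n') == "'\\n'")
```

The string returned by repr() is three characters long here, which is why the traceback reports the unencodable character at position 1.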

Chris Rebert · Dec 13, 2008

When Python tries to display the character, it must first encode it
because IO is done in bytes, not Unicode codepoints. When it tries to
encode it in CP850 (apparently your system's default encoding judging
by the traceback), it unsurprisingly fails (CP850 is an old Western
Europe codec, which obviously can't encode an Asian character like the
one in question). To signal that failure, it raises an exception, thus
the error you see.
This is intended behavior. Either change your default system/terminal
encoding to one that can handle such characters or explicitly encode
the string and use one of the provided options for dealing with
unencodable characters.
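
The failure described above can be reproduced on any platform by encoding explicitly, which is roughly what the console stream does on every write. A minimal sketch, assuming Python 3 with the standard cp850 and utf-8 codecs; the helper name try_encode is hypothetical:

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or the UnicodeEncodeError if encoding fails."""
    try:
        return text.encode(encoding)   # errors='strict' is the default
    except UnicodeEncodeError as exc:
        return exc

x = '\u9876'

failed = try_encode(x, 'cp850')   # cp850 has no mapping for U+9876
ok = try_encode(x, 'utf-8')       # UTF-8 can represent this character

# The exception records where encoding stopped.
print(type(failed).__name__, failed.start, failed.end)
print(ok)
```

Nothing here depends on the terminal: the exception comes from the codec, so the same string fails the same way whenever the target encoding lacks the character.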

Also, please don't call it a "crash" as that's very misleading. The
Python interpreter didn't dump core, an exception was merely thrown.
There's a world of difference.

Cheers,
Chris

John Machin · Dec 13, 2008

When Python tries to display the character, it must first encode it
because IO is done in bytes, not Unicode codepoints. When it tries to
encode it in CP850 (apparently your system's default encoding judging
by the traceback), it unsurprisingly fails (CP850 is an old Western
Europe codec, which obviously can't encode an Asian character like the
one in question). To signal that failure, it raises an exception, thus
the error you see.
This is intended behavior.

I see. Does that mean that the behaviour in Python 1.6 to 2.6 (i.e.
encoding the text using the repr() function, as then defined) was not
intended behaviour?

Either change your default system/terminal
encoding to one that can handle such characters or explicitly encode
the string and use one of the provided options for dealing with
unencodable characters.

You are missing the point. I don't care about the visual
representation. What I care about is an unambiguous representation
that can be used when communicating about problems across cultures/
networks/mail-clients/news-readers ... the sort of problems that are
initially advised as "I got this UnicodeEncodeError" and accompanied
by no data or garbled data.

Also, please don't call it a "crash" as that's very misleading. The
Python interpreter didn't dump core, an exception was merely thrown.

"spew nonsense on the screen and then stop" is about as useful and as
astonishing as "dump core".

Core? You mean like ferrite doughnuts on a wire trellis? I thought
that went out of fashion before cp850 was invented.

Martin v. Löwis · Dec 13, 2008

This is intended behavior.

I see. Does that mean that the behaviour in Python 1.6 to 2.6 (i.e.
encoding the text using the repr() function, as then defined) was not
intended behaviour?

Sure. This behavior has not changed. It still uses repr().

Of course, the string type has changed in 3.0, and now uses a different
definition of repr.

Regards,
Martin

John Machin · Dec 13, 2008

Sure.

"Sure" as in "sure, it was not intended behaviour"?

This behavior has not changed. It still uses repr().

Of course, the string type has changed in 3.0, and now uses a different
definition of repr.

So was the above-reported non-crash consequence of the change of
definition of repr intended?

Lie Ryan · Dec 14, 2008

You are missing the point. I don't care about the visual representation.
What I care about is an unambiguous representation that can be used when
communicating about problems across cultures/
networks/mail-clients/news-readers ... the sort of problems that are
initially advised as "I got this UnicodeEncodeError" and accompanied by
no data or garbled data.

Python defaults to strict encoding, which raises an error on
unencodable characters, but that is not the only behaviour available: you
can change it to "replace with a placeholder character" or "ignore any
errors and discard unencodable characters".

| errors can be 'strict', 'replace' or 'ignore' and defaults
| to 'strict'.

If you don't like the default behavior or you want another kind of
behavior, you're welcome to file a bug report at http://bugs.python.org
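
The three handlers named in that docstring can be compared side by side. A minimal sketch, assuming Python 3 and cp850 as the target encoding; the sample string and the results dict are illustrative only:

```python
# 'caf' + e-acute (present in cp850) + space + U+9876 (absent from cp850)
x = 'caf\xe9 \u9876'

results = {}
for handler in ('strict', 'replace', 'ignore'):
    try:
        results[handler] = x.encode('cp850', errors=handler)
    except UnicodeEncodeError as exc:
        results[handler] = exc   # 'strict' raises instead of returning bytes

for handler, outcome in results.items():
    # ascii() keeps the report printable on any console.
    print(handler, '->', ascii(outcome))
```

Both 'replace' and 'ignore' lose information about which character was there, so they suit display better than bug reports; the standard codecs also accept 'backslashreplace', which keeps the code point visible as an escape.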

"spew nonsense on the screen and then stop" is about as useful and as
astonishing as "dump core".

That's an interesting definition of crash. It's like saying "C has
crashed because I made a bug in my program". In this context it is your
program that crashes, not Python or C, so it is misleading to say so.

It would be Python's crash if:
1. Python segfaulted
2. The Python interpreter exited before there was an instruction to exit
(either implicit, e.g. falling off the last line of the script, or
explicit, e.g. sys.exit or raise SystemExit)
3. Python dumped core
4. Python did something that is not documented

Martin v. Löwis · Dec 14, 2008

"Sure" as in "sure, it was not intended behaviour"?

It was intended behavior, and still is in 3.0.

So was the above-reported non-crash consequence of the change of
definition of repr intended?

Yes. If you want a display that is guaranteed to work on your terminal,
use the ascii() builtin function.

py> x = '\u9876'
py> ascii(x)
"'\\u9876'"
py> print(ascii(x))
'\u9876'

Regards,
Martin

Paul Boddie · Dec 14, 2008

Yes. If you want a display that is guaranteed to work on your terminal,
use the ascii() builtin function.

But shouldn't the production of an object's representation via repr be
a "safe" operation? That is, the operation should always produce a
result, regardless of environmental factors like the locale or
terminal's encoding support. If John were printing the object, it
would be a different matter, but he apparently just wants to see a
sequence of characters which represents the object.

Paul

Martin v. Löwis · Dec 14, 2008

But shouldn't the production of an object's representation via repr be
a "safe" operation?

It's a trade-off. It should also be legible.

Regards,
Martin

Fuzzyman · Dec 14, 2008

That's an interesting definition of crash. It's like saying "C has
crashed because I made a bug in my program". In this context it is your
program that crashes, not Python or C, so it is misleading to say so.

It would be Python's crash if:
1. Python segfaulted
2. The Python interpreter exited before there was an instruction to exit
(either implicit, e.g. falling off the last line of the script, or
explicit, e.g. sys.exit or raise SystemExit)
3. Python dumped core
4. Python did something that is not documented

It seems to me to be a generally accepted term when an application
stops due to an unhandled error to say that it crashed.

Michael Foord
http://www.ironpythoninaction.com/

James Mills · Dec 14, 2008

It seems to me to be a generally accepted term when an application
stops due to an unhandled error to say that it crashed.

it == application
Yes.

--------------------

#!/usr/bin/env python

from traceback import format_exc

def foo():
    print("Hello World!")

def main():
    # Catch any error so the application reports it instead of stopping.
    try:
        foo()
    except Exception as error:
        print("ERROR: %s" % error)
        print(format_exc())

if __name__ == "__main__":
    main()

Paul Boddie · Dec 14, 2008

It's a trade-off. It should also be legible.

Right. I can understand that unlike Python 2.x, a representation of a
string in Python 3.x (whose equivalent in Python 2.x would be a
Unicode object) must also be a string (as opposed to a byte string in
Python 2.x), and that no decision can be taken to choose "safe"
representations for characters which cannot be displayed in a
terminal. For example, in Python 2.x...
"u'\\xe6\\xf8\\xe5'"

...and in Python 3.x...
"'æøå'"

...with an ISO-8859-15 terminal. Python 2.x could conceivably be
smarter about encoding representations, but chooses not to be since
the smarter behaviour would need to involve knowing that an "output
situation" was imminent. Python 3.x, on the other hand, leaves issues
of encoding to the generic I/O pipeline, causing the described
problem.

Of course, repr will always work if its output does not get sent to
sys.stdout or an insufficiently capable output stream, but I suppose
usage of repr for debugging purposes, where one may wish to inspect
character values, must be superseded by usage of the ascii function,
as you point out. It's unfortunate that the default behaviour isn't
optimal at the interactive prompt for some configurations, though.

Paul

Martin v. Löwis · Dec 14, 2008

It's unfortunate that the default behaviour isn't
optimal at the interactive prompt for some configurations, though.

As I said, it's a trade-off. The alternative, if it was the default,
wouldn't be optimal at the interactive prompt for some other
configurations.

In particular, users of non-Latin scripts have been complaining that
they can't read their strings - hence the change, which now actually
allows these users to read the text that is stored in the strings.

The question really is why John Machin has a string that contains
'\u9876' (which is a Chinese character), yet his terminal is incapable
of displaying that character. More likely, people will typically
encounter only characters in their data that their terminals are
also capable of displaying (or else the terminal would be pretty
useless).

In the long run, it might be useful to have an error handler on
sys.stdout in interactive mode, which escapes characters that
cannot be encoded (perhaps in a different color, if the terminal
supports colors, to make it clear that it is an escape sequence).
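
Something close to the error handler suggested here can be approximated by giving the output stream errors='backslashreplace'. A minimal sketch, assuming Python 3; an in-memory io.BytesIO stands in for a cp850 console so the bytes can be inspected, the names raw and console are illustrative, and the colouring idea is not attempted:

```python
import io

raw = io.BytesIO()   # stand-in for the byte-level console
console = io.TextIOWrapper(
    raw,
    encoding='cp850',
    errors='backslashreplace',   # escape what cp850 cannot encode
    newline='\n',                # keep line endings identical on every OS
)

console.write(repr('\u9876') + '\n')   # what the prompt would echo
console.flush()

written = raw.getvalue()
print(written)
```

On Python 3.7 and later the same handler can be applied to the real stream with sys.stdout.reconfigure(errors='backslashreplace'), though whether that is desirable outside interactive use is a judgment call.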

Regards,
Martin

jhermann · Dec 17, 2008

Assuming those survived the switch to 3.0, you can use site.py and
sys.displayhook to restore the old behaviour (i.e. change it to a
version using ascii() instead of repr()). Since this only affects
interactive use, it poses no problem for the portability of code, unlike
"solutions" such as forcing the default encoding.
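
A minimal sketch of such a hook, assuming Python 3; the name ascii_displayhook is hypothetical, and loading it from sitecustomize.py or a PYTHONSTARTUP file is one possible setup rather than a recommendation from this thread:

```python
import builtins
import sys

def ascii_displayhook(value):
    """Echo interactive results with ascii() instead of repr()."""
    if value is None:
        return                  # the default hook prints nothing for None
    builtins._ = None           # mirror the default hook: clear, print, store
    sys.stdout.write(ascii(value) + '\n')
    builtins._ = value          # keep the "_" shortcut working

# Only the interactive echo changes; print() and repr() are untouched.
sys.displayhook = ascii_displayhook
```

Because the hook is consulted only when the interactive prompt echoes an expression, scripts behave exactly as before.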
