Python 3.0 crashes displaying Unicode at interactive prompt

John Machin

Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u'\u9876'
>>> x
u'\u9876'

# As expected

Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = '\u9876'
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\python30\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 1: character maps to <undefined>

# *NOT* as expected (by me, that is)

Is this the intended outcome?
 
Vlastimil Brom

2008/12/13 John Machin said:
> [...]
> Is this the intended outcome?

I also found this a bit surprising, but it seems to be the intended
behaviour (on a non-Unicode console).

http://docs.python.org/3.0/whatsnew/3.0.html
"PEP 3138: The repr() of a string no longer escapes non-ASCII
characters. It still escapes control characters and code points with
non-printable status in the Unicode standard, however."

I get the same error in the Windows cmd console (IDLE prints the
respective glyph correctly).
To get the old behaviour of repr, one can use ascii(), I suppose.

Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python30\lib\io.py", line 1491, in write
    b = encoder.encode(s)
  File "C:\Python30\lib\encodings\cp852.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position
 
Chris Rebert

> [...]
> Is this the intended outcome?

When Python tries to display the character, it must first encode it
because IO is done in bytes, not Unicode codepoints. When it tries to
encode it in CP850 (apparently your system's default encoding judging
by the traceback), it unsurprisingly fails (CP850 is an old Western
Europe codec, which obviously can't encode an Asian character like the
one in question). To signal that failure, it raises an exception, thus
the error you see.
This is intended behavior. Either change your default system/terminal
encoding to one that can handle such characters or explicitly encode
the string and use one of the provided options for dealing with
unencodable characters.
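
For anyone who wants to reproduce this without a cp850 console, here is a minimal sketch (not part of the original post). It assumes only the standard cp850 codec and the character from the report; the variable names are made up for illustration. It also suggests why the traceback says "position 1": what gets encoded is the repr of the string, and its first character is the opening quote.

```python
# Sketch: encode by hand what the console writer would have to encode.
text = '\u9876'      # the character from the report
shown = repr(text)   # in Python 3: quote, the character itself, quote

try:
    shown.encode('cp850')   # 'strict' error handling is the default
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records where encoding stopped and why.
if error is not None:
    print(error.start, error.reason)
```

Catching the exception like this is only for inspection; at the interactive prompt the same error simply propagates and is printed as the traceback shown in the report.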

Also, please don't call it a "crash" as that's very misleading. The
Python interpreter didn't dump core, an exception was merely thrown.
There's a world of difference.

Cheers,
Chris
 
John Machin

>> Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> x = u'\u9876'
>> >>> x
>> u'\u9876'
>>
>> # As expected
>>
>> Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> x = '\u9876'
>> >>> x
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "C:\python30\lib\io.py", line 1491, in write
>>     b = encoder.encode(s)
>>   File "C:\python30\lib\encodings\cp850.py", line 19, in encode
>>     return codecs.charmap_encode(input,self.errors,encoding_map)[0]
>> UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 1: character maps to <undefined>
>>
>> # *NOT* as expected (by me, that is)
>>
>> Is this the intended outcome?

> When Python tries to display the character, it must first encode it
> because IO is done in bytes, not Unicode codepoints. When it tries to
> encode it in CP850 (apparently your system's default encoding judging
> by the traceback), it unsurprisingly fails (CP850 is an old Western
> Europe codec, which obviously can't encode an Asian character like the
> one in question). To signal that failure, it raises an exception, thus
> the error you see.
> This is intended behavior.

I see. That means that the behaviour in Python 1.6 to 2.6 (i.e.
encoding the text using the repr() function (as then defined)) was not
intended behaviour?

> Either change your default system/terminal
> encoding to one that can handle such characters or explicitly encode
> the string and use one of the provided options for dealing with
> unencodable characters.

You are missing the point. I don't care about the visual
representation. What I care about is an unambiguous representation
that can be used when communicating about problems across cultures/
networks/mail-clients/news-readers ... the sort of problems that are
initially advised as "I got this UnicodeEncodeError" and accompanied
by no data or garbled data.

> Also, please don't call it a "crash" as that's very misleading. The
> Python interpreter didn't dump core, an exception was merely thrown.

"spew nonsense on the screen and then stop" is about as useful and as
astonishing as "dump core".

core? You mean like ferrite doughnuts on a wire trellis? I thought
that went out of fashion before cp850 was invented :)
 
Martin v. Löwis

>> This is intended behavior.
> I see. That means that the behaviour in Python 1.6 to 2.6 (i.e.
> encoding the text using the repr() function (as then defined)) was not
> intended behaviour?

Sure. This behavior has not changed. It still uses repr().

Of course, the string type has changed in 3.0, and now uses a different
definition of repr.

Regards,
Martin
 
John Machin


"Sure" as in "sure, it was not intended behaviour"?

> This behavior has not changed. It still uses repr().
>
> Of course, the string type has changed in 3.0, and now uses a different
> definition of repr.

So was the above-reported non-crash consequence of the change of
definition of repr intended?
 
Lie Ryan

>>> [...]
>>> Is this the intended outcome?

>> When Python tries to display the character, it must first encode it
>> because IO is done in bytes, not Unicode codepoints. When it tries to
>> encode it in CP850 (apparently your system's default encoding judging
>> by the traceback), it unsurprisingly fails (CP850 is an old Western
>> Europe codec, which obviously can't encode an Asian character like the
>> one in question). To signal that failure, it raises an exception, thus
>> the error you see.
>> This is intended behavior.

> I see. That means that the behaviour in Python 1.6 to 2.6 (i.e. encoding
> the text using the repr() function (as then defined)) was not intended
> behaviour?

>> Either change your default system/terminal encoding to one that can
>> handle such characters or explicitly encode the string and use one of
>> the provided options for dealing with unencodable characters.

> You are missing the point. I don't care about the visual representation.
> What I care about is an unambiguous representation that can be used when
> communicating about problems across cultures/networks/mail-clients/
> news-readers ... the sort of problems that are initially advised as
> "I got this UnicodeEncodeError" and accompanied by no data or garbled data.

Python defaults to strict encoding, which means it raises an error on
unencodable characters, but this is NOT the only behavior: you can change
it to "replace using placeholder character" or "ignore any errors and
discard unencodable characters".

| errors can be 'strict', 'replace' or 'ignore' and defaults
| to 'strict'.
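
To make the difference between those handlers concrete, here is a small sketch (not from the original post). It assumes the cp850 codec from the traceback and a made-up sample string. The 'backslashreplace' handler is not in the docstring quoted above, but it is a standard error handler as well and is the closest match to the old escaped output.

```python
# Sketch: the same text encoded to cp850 with different error handlers.
text = 'top: \u9876'   # hypothetical sample; only the last character is a problem

replaced = text.encode('cp850', 'replace')           # placeholder character
ignored = text.encode('cp850', 'ignore')             # silently dropped
escaped = text.encode('cp850', 'backslashreplace')   # Python-style escape

print(replaced)   # b'top: ?'
print(ignored)    # b'top: '
print(escaped)    # b'top: \\u9876'
```

Only the last form keeps the information about which character was there, which is what matters when reporting a problem to someone else.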

If you don't like the default behavior or you want another kind of
behavior, you're welcome to file a bug report at http://bugs.python.org

> "spew nonsense on the screen and then stop" is about as useful and as
> astonishing as "dump core".

That's an interesting definition of crash. You're just like saying: "C
has crashed because I made a bug in my program". In this context, it is
your program that crashes, not python nor C, it is misleading to say so.

It will be python's crash if:
1. Python 'segfault'ed
2. Python interpreter exits before there is instruction to exit (either
implicit (e.g. falling to the last line of the script) or explicit (e.g
sys.exit or raise SystemExit))
3. Python core dumped
4. Python does something that is not documented
 
Martin v. Löwis

> "Sure" as in "sure, it was not intended behaviour"?

It was intended behavior, and still is in 3.0.

> So was the above-reported non-crash consequence of the change of
> definition of repr intended?

Yes. If you want a display that is guaranteed to work on your terminal,
use the ascii() builtin function.

py> x = '\u9876'
py> ascii(x)
"'\\u9876'"
py> print(ascii(x))
'\u9876'

Regards,
Martin
 
Paul Boddie

> Yes. If you want a display that is guaranteed to work on your terminal,
> use the ascii() builtin function.

But shouldn't the production of an object's representation via repr be
a "safe" operation? That is, the operation should always produce a
result, regardless of environmental factors like the locale or
terminal's encoding support. If John were printing the object, it
would be a different matter, but he apparently just wants to see a
sequence of characters which represents the object.

Paul
 
Martin v. Löwis

> But shouldn't the production of an object's representation via repr be
> a "safe" operation?

It's a trade-off. It should also be legible.

Regards,
Martin
 
Fuzzyman

> That's an interesting definition of crash. You're just like saying: "C
> has crashed because I made a bug in my program". In this context, it is
> your program that crashes, not python nor C, it is misleading to say so.
>
> It will be python's crash if:
> 1. Python 'segfault'ed
> 2. Python interpreter exits before there is instruction to exit (either
> implicit (e.g. falling to the last line of the script) or explicit (e.g
> sys.exit or raise SystemExit))
> 3. Python core dumped
> 4. Python does something that is not documented

It seems to me to be a generally accepted term when an application
stops due to an unhandled error to say that it crashed.

Michael Foord
http://www.ironpythoninaction.com/
 
James Mills

> It seems to me to be a generally accepted term when an application
> stops due to an unhandled error to say that it crashed.

it == application
Yes.

--------------------

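#!/usr/bin/env python

from traceback import format_exc

def foo():
    print("Hello World!")

def main():
    # Catch anything foo() raises so the application reports it
    # instead of stopping with an unhandled error.
    try:
        foo()
    except Exception as error:
        print("ERROR: %s" % error)
        print(format_exc())

if __name__ == "__main__":
    main()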
 
Paul Boddie

> It's a trade-off. It should also be legible.

Right. I can understand that unlike Python 2.x, a representation of a
string in Python 3.x (whose equivalent in Python 2.x would be a
Unicode object) must also be a string (as opposed to a byte string in
Python 2.x), and that no decision can be taken to choose "safe"
representations for characters which cannot be displayed in a
terminal. In examples, for Python 2.x...
"u'\\xe6\\xf8\\xe5'"

...and for Python 3.x...
"'æøå'"

...with an ISO-8859-15 terminal. Python 2.x could conceivably be
smarter about encoding representations, but chooses not to be since
the smarter behaviour would need to involve knowing that an "output
situation" was imminent. Python 3.x, on the other hand, leaves issues
of encoding to the generic I/O pipeline, causing the described
problem.

Of course, repr will always work if its output does not get sent to
sys.stdout or an insufficiently capable output stream, but I suppose
usage of repr for debugging purposes, where one may wish to inspect
character values, must be superseded by usage of the ascii function,
as you point out. It's unfortunate that the default behaviour isn't
optimal at the interactive prompt for some configurations, though.
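
A quick way to see the two representations side by side in Python 3 is the sketch below (added for illustration, not from the original post). It uses the same three letters as the example above and assumes nothing beyond the repr() and ascii() builtins and two standard codecs.

```python
# Sketch: repr() keeps printable non-ASCII characters, ascii() escapes them.
s = '\u00e6\u00f8\u00e5'   # the three letters from the example above

readable = repr(s)   # quotes around the letters themselves
escaped = ascii(s)   # quotes around \x escapes

# The readable form needs an output encoding that contains these letters ...
readable.encode('iso-8859-15')
# ... while the escaped form survives even a plain ASCII stream.
escaped.encode('ascii')

print(escaped)
```

So for debugging output that has to survive any terminal or mail client, ascii() gives the Python 2 style of result, at the cost of legibility.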

Paul
 
Martin v. Löwis

> It's unfortunate that the default behaviour isn't
> optimal at the interactive prompt for some configurations, though.

As I said, it's a trade-off. The alternative, if it was the default,
wouldn't be optimal at the interactive prompt for some other
configurations.

In particular, users of non-latin scripts have been complaining that
they can't read their strings - hence the change, which now actually
allows these users to read the text that is stored in the strings.

The question really is why John Machin has a string that contains
'\u9876' (which is a Chinese character), yet his terminal is incapable
of displaying that character. More likely, people will typically
encounter only characters in their data that their terminals are
also capable of displaying (or else the terminal would be pretty
useless).

In the long run, it might be useful to have an error handler on
sys.stdout in interactive mode, which escapes characters that
cannot be encoded (perhaps in a different color, if the terminal
supports colors, to make it clear that it is an escape sequence).
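
The effect of such an error handler can be tried out today with a stand-in stream. The sketch below is only an illustration, not what the interpreter does: it assumes a cp850 "console" simulated with io.BytesIO, and the names raw and console are made up.

```python
import io

# Sketch: a text stream that encodes to cp850 but escapes what it cannot encode.
raw = io.BytesIO()   # stands in for the byte-oriented console
console = io.TextIOWrapper(raw, encoding='cp850',
                           errors='backslashreplace', newline='\n')

# Writing the repr no longer raises; the character comes out escaped.
print(repr('\u9876'), file=console)
console.flush()

output = raw.getvalue()
print(output)
```

On Python 3.7 and later, sys.stdout.reconfigure(errors='backslashreplace') should give a similar result on the real stream; the coloring idea is left out here.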

Regards,
Martin
 
jhermann

Assuming those survived the switch to 3.0, you can use site.py and
sys.displayhook to customize to the old behaviour (i.e. change it to a
version using ascii instead of repr). Since this only affects
interactive use, it's also no problem for portability of code, unlike
"solutions" like forcing the defaultencoding etc.
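
A minimal sketch of that idea, assuming Python 3 where sys.displayhook is still available. The function name ascii_displayhook is made up, and where the code gets loaded from (site.py, sitecustomize, or a startup file) is left open.

```python
import builtins
import sys

def ascii_displayhook(value):
    """Echo interactive results with ascii() instead of repr()."""
    if value is None:
        return
    # Mirror the default hook: remember the last result in the name "_".
    builtins._ = None
    sys.stdout.write(ascii(value) + '\n')
    builtins._ = value

# Only the interactive prompt calls this hook; scripts are unaffected.
sys.displayhook = ascii_displayhook
```

With the hook installed, typing x at the prompt after x = '\u9876' would echo the escaped form, much like the 2.6 session at the top of the thread.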
 
