Python 3.0 crashes displaying Unicode at interactive prompt

Discussion in 'Python' started by John Machin, Dec 13, 2008.

  1. John Machin

    John Machin Guest

    Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
    (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> x = u'\u9876'
    >>> x
    u'\u9876'

    # As expected

    Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit
    (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> x = '\u9876'
    >>> x
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\python30\lib\io.py", line 1491, in write
        b = encoder.encode(s)
      File "C:\python30\lib\encodings\cp850.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 1: character maps to <undefined>

    # *NOT* as expected (by me, that is)

    Is this the intended outcome?
     
    John Machin, Dec 13, 2008
    #1

  2. 2008/12/13 John Machin <>:
    >
    > Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
    > (Intel)] on win32
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> x = u'\u9876'
    >>>> x

    > u'\u9876'
    >
    > # As expected
    >
    > Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit
    > (Intel)] on win 32
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> x = '\u9876'
    >>>> x

    > Traceback (most recent call last):
    > File "<stdin>", line 1, in <module>
    > File "C:\python30\lib\io.py", line 1491, in write
    > b = encoder.encode(s)
    > File "C:\python30\lib\encodings\cp850.py", line 19, in encode
    > return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    > UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in
    > position
    > 1: character maps to <undefined>
    >
    > # *NOT* as expected (by me, that is)
    >
    > Is this the intended outcome?
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >


    I also found this a bit surprising, but it seems to be the intended
    behaviour (on a non-unicode console)

    http://docs.python.org/3.0/whatsnew/3.0.html
    "PEP 3138: The repr() of a string no longer escapes non-ASCII
    characters. It still escapes control characters and code points with
    non-printable status in the Unicode standard, however."

    I get the same error in the Windows cmd console (IDLE prints the
    respective glyph correctly).
    To get the old behaviour of repr, one can use ascii(), I suppose.

    Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> repr('\u9876')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python30\lib\io.py", line 1491, in write
        b = encoder.encode(s)
      File "C:\Python30\lib\encodings\cp852.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in position 2: character maps to <undefined>
    >>> '\u9876'.encode("unicode-escape")
    b'\\u9876'
    >>> ascii('\u9876')
    "'\\u9876'"
    >>>
     
    Vlastimil Brom, Dec 13, 2008
    #2

  3. Chris Rebert

    Chris Rebert Guest

    On Sat, Dec 13, 2008 at 12:28 PM, John Machin <> wrote:
    >
    > Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
    > (Intel)] on win32
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> x = u'\u9876'
    >>>> x

    > u'\u9876'
    >
    > # As expected
    >
    > Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit
    > (Intel)] on win 32
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> x = '\u9876'
    >>>> x

    > Traceback (most recent call last):
    > File "<stdin>", line 1, in <module>
    > File "C:\python30\lib\io.py", line 1491, in write
    > b = encoder.encode(s)
    > File "C:\python30\lib\encodings\cp850.py", line 19, in encode
    > return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    > UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in
    > position
    > 1: character maps to <undefined>
    >
    > # *NOT* as expected (by me, that is)
    >
    > Is this the intended outcome?


    When Python tries to display the character, it must first encode it
    because IO is done in bytes, not Unicode codepoints. When it tries to
    encode it in CP850 (apparently your system's default encoding judging
    by the traceback), it unsurprisingly fails (CP850 is an old Western
    Europe codec, which obviously can't encode an Asian character like the
    one in question). To signal that failure, it raises an exception, thus
    the error you see.
    This is intended behavior. Either change your default system/terminal
    encoding to one that can handle such characters or explicitly encode
    the string and use one of the provided options for dealing with
    unencodable characters.
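
The failure described above can be reproduced without a CP850 console by encoding the same string explicitly. This is only a minimal sketch: the variable names are illustrative, and `backslashreplace` is just one of the standard error handlers one might pick.

```python
text = '\u9876'  # the character from the original report

# Writing to the console fails for the same reason this explicit
# encode does: cp850 has no byte for U+9876.
try:
    text.encode('cp850')
    failed = False
except UnicodeEncodeError:
    failed = True

# Encoding explicitly with an error handler avoids the exception.
escaped = text.encode('cp850', errors='backslashreplace')
print(failed, escaped)
```

The escaped bytes are plain ASCII, so they can be written to any console or pasted into a message without further loss.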

    Also, please don't call it a "crash" as that's very misleading. The
    Python interpreter didn't dump core, an exception was merely thrown.
    There's a world of difference.

    Cheers,
    Chris

    --
    Follow the path of the Iguana...
    http://rebertia.com
     
    Chris Rebert, Dec 13, 2008
    #3
  4. John Machin

    John Machin Guest

    On Dec 14, 8:07 am, "Chris Rebert" <> wrote:
    > On Sat, Dec 13, 2008 at 12:28 PM, John Machin <> wrote:
    >
    > > Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
    > > (Intel)] on win32
    > > Type "help", "copyright", "credits" or "license" for more information.
    > >>>> x = u'\u9876'
    > >>>> x

    > > u'\u9876'

    >
    > > # As expected

    >
    > > Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit
    > > (Intel)] on win 32
    > > Type "help", "copyright", "credits" or "license" for more information.
    > >>>> x = '\u9876'
    > >>>> x

    > > Traceback (most recent call last):
    > >  File "<stdin>", line 1, in <module>
    > >  File "C:\python30\lib\io.py", line 1491, in write
    > >    b = encoder.encode(s)
    > >  File "C:\python30\lib\encodings\cp850.py", line 19, in encode
    > >    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    > > UnicodeEncodeError: 'charmap' codec can't encode character '\u9876' in
    > > position
    > > 1: character maps to <undefined>

    >
    > > # *NOT* as expected (by me, that is)

    >
    > > Is this the intended outcome?

    >
    > When Python tries to display the character, it must first encode it
    > because IO is done in bytes, not Unicode codepoints. When it tries to
    > encode it in CP850 (apparently your system's default encoding judging
    > by the traceback), it unsurprisingly fails (CP850 is an old Western
    > Europe codec, which obviously can't encode an Asian character like the
    > one in question). To signal that failure, it raises an exception, thus
    > the error you see.
    > This is intended behavior.


    I see. That means that the behaviour in Python 1.6 to 2.6 (i.e.
    encoding the text using the repr() function (as then defined)) was not
    intended behaviour?

    > Either change your default system/terminal
    > encoding to one that can handle such characters or explicitly encode
    > the string and use one of the provided options for dealing with
    > unencodable characters.


    You are missing the point. I don't care about the visual
    representation. What I care about is an unambiguous representation
    that can be used when communicating about problems across cultures/
    networks/mail-clients/news-readers ... the sort of problems that are
    initially advised as "I got this UnicodeEncodeError" and accompanied
    by no data or garbled data.

    > Also, please don't call it a "crash" as that's very misleading. The
    > Python interpreter didn't dump core, an exception was merely thrown.


    "spew nonsense on the screen and then stop" is about as useful and as
    astonishing as "dump core".

    core? You mean like ferrite doughnuts on a wire trellis? I thought
    that went out of fashion before cp850 was invented :)
     
    John Machin, Dec 13, 2008
    #4
  5. >> This is intended behavior.
    >
    > I see. That means that the behaviour in Python 1.6 to 2.6 (i.e.
    > encoding the text using the repr() function (as then defined) was not
    > intended behaviour?


    Sure. This behavior has not changed. It still uses repr().

    Of course, the string type has changed in 3.0, and now uses a different
    definition of repr.

    Regards,
    Martin
     
    Martin v. Löwis, Dec 13, 2008
    #5
  6. John Machin

    John Machin Guest

    On Dec 14, 9:20 am, "Martin v. Löwis" <> wrote:
    > >> This is intended behavior.

    >
    > > I see. That means that the behaviour in Python 1.6 to 2.6 (i.e.
    > > encoding the text using the repr() function (as then defined) was not
    > > intended behaviour?

    >
    > Sure.


    "Sure" as in "sure, it was not intended behaviour"?

    > This behavior has not changed. It still uses repr().
    >
    > Of course, the string type has changed in 3.0, and now uses a different
    > definition of repr.


    So was the above-reported non-crash consequence of the change of
    definition of repr intended?
     
    John Machin, Dec 13, 2008
    #6
  7. Lie Ryan

    Lie Ryan Guest

    On Sat, 13 Dec 2008 14:09:04 -0800, John Machin wrote:

    > On Dec 14, 8:07 am, "Chris Rebert" <> wrote:
    >> On Sat, Dec 13, 2008 at 12:28 PM, John Machin <>
    >> wrote:
    >>
    >> > Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
    >> > (Intel)] on win32
    >> > Type "help", "copyright", "credits" or "license" for more
    >> > information.
    >> >>>> x = u'\u9876'
    >> >>>> x
    >> > u'\u9876'

    >>
    >> > # As expected

    >>
    >> > Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit
    >> > (Intel)] on win 32
    >> > Type "help", "copyright", "credits" or "license" for more
    >> > information.
    >> >>>> x = '\u9876'
    >> >>>> x
    >> > Traceback (most recent call last):
    >> >  File "<stdin>", line 1, in <module>
    >> >  File "C:\python30\lib\io.py", line 1491, in write
    >> >    b = encoder.encode(s)
    >> >  File "C:\python30\lib\encodings\cp850.py", line 19, in encode
    >> >    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    >> > UnicodeEncodeError: 'charmap' codec can't encode character '\u9876'
    >> > in position
    >> > 1: character maps to <undefined>

    >>
    >> > # *NOT* as expected (by me, that is)

    >>
    >> > Is this the intended outcome?

    >>
    >> When Python tries to display the character, it must first encode it
    >> because IO is done in bytes, not Unicode codepoints. When it tries to
    >> encode it in CP850 (apparently your system's default encoding judging
    >> by the traceback), it unsurprisingly fails (CP850 is an old Western
    >> Europe codec, which obviously can't encode an Asian character like the
    >> one in question). To signal that failure, it raises an exception, thus
    >> the error you see.
    >> This is intended behavior.

    >
    > I see. That means that the behaviour in Python 1.6 to 2.6 (i.e. encoding
    > the text using the repr() function (as then defined) was not intended
    > behaviour?
    >
    >> Either change your default system/terminal encoding to one that can
    >> handle such characters or explicitly encode the string and use one of
    >> the provided options for dealing with unencodable characters.

    >
    > You are missing the point. I don't care about the visual representation.
    > What I care about is an unambiguous representation that can be used when
    > communicating about problems across cultures/
    > networks/mail-clients/news-readers ... the sort of problems that are
    > initially advised as "I got this UnicodeEncodeError" and accompanied by
    > no data or garbled data.


    Python defaults to the 'strict' error handler, which raises an error on
    unencodable characters, but this is NOT the only behavior: you can change
    it to "replace with a placeholder character" or "ignore any errors and
    discard unencodable characters".

    | errors can be 'strict', 'replace' or 'ignore' and defaults
    | to 'strict'.
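
As a rough illustration of the three handlers quoted above, the following sketch encodes one string to cp850 with each of them. The sample string and the `results` dictionary are made up for this example.

```python
text = 'caf\u00e9 \u9876'  # 'é' exists in cp850, U+9876 does not

results = {}
for handler in ('strict', 'replace', 'ignore'):
    try:
        results[handler] = text.encode('cp850', errors=handler)
    except UnicodeEncodeError:
        # 'strict' raises instead of returning bytes
        results[handler] = None

for handler, encoded in results.items():
    print(handler, encoded)
```

Note that 'replace' and 'ignore' both lose information, which is why they do not help much when the goal is an unambiguous representation.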

    If you don't like the default behavior or you want another kind of
    behavior, you're welcome to file a bug report at http://bugs.python.org

    >> Also, please don't call it a "crash" as that's very misleading. The
    >> Python interpreter didn't dump core, an exception was merely thrown.

    >
    > "spew nonsense on the screen and then stop" is about as useful and as
    > astonishing as "dump core".


    That's an interesting definition of crash. It's like saying: "C has
    crashed because I made a bug in my program". In this context, it is
    your program that crashes, not Python nor C; it is misleading to say so.

    It would be Python's crash if:
    1. Python segfaulted
    2. The Python interpreter exits before there is an instruction to exit
    (either implicit (e.g. falling off the last line of the script) or
    explicit (e.g. sys.exit or raise SystemExit))
    3. Python dumped core
    4. Python does something that is not documented
     
    Lie Ryan, Dec 14, 2008
    #7
  8. > "Sure" as in "sure, it was not intended behaviour"?

    It was intended behavior, and still is in 3.0.

    >> This behavior has not changed. It still uses repr().
    >>
    >> Of course, the string type has changed in 3.0, and now uses a different
    >> definition of repr.

    >
    > So was the above-reported non-crash consequence of the change of
    > definition of repr intended?


    Yes. If you want a display that is guaranteed to work on your terminal,
    use the ascii() builtin function.

    py> x = '\u9876'
    py> ascii(x)
    "'\\u9876'"
    py> print(ascii(x))
    '\u9876'

    Regards,
    Martin
     
    Martin v. Löwis, Dec 14, 2008
    #8
  9. Paul Boddie

    Paul Boddie Guest

    On 14 Dec, 05:46, "Martin v. Löwis" <> wrote:
    >
    > Yes. If you want a display that is guaranteed to work on your terminal,
    > use the ascii() builtin function.


    But shouldn't the production of an object's representation via repr be
    a "safe" operation? That is, the operation should always produce a
    result, regardless of environmental factors like the locale or
    terminal's encoding support. If John were printing the object, it
    would be a different matter, but he apparently just wants to see a
    sequence of characters which represents the object.

    Paul
     
    Paul Boddie, Dec 14, 2008
    #9
  10. > But shouldn't the production of an object's representation via repr be
    > a "safe" operation?


    It's a trade-off. It should also be legible.

    Regards,
    Martin
     
    Martin v. Löwis, Dec 14, 2008
    #10
  11. Fuzzyman

    Fuzzyman Guest


    > That's an interesting definition of crash. You're just like saying: "C
    > has crashed because I made a bug in my program". In this context, it is
    > your program that crashes, not python nor C, it is misleading to say so.
    >
    > It will be python's crash if:
    > 1. Python 'segfault'ed
    > 2. Python interpreter exits before there is instruction to exit (either
    > implicit (e.g. falling to the last line of the script) or explicit (e.g
    > sys.exit or raise SystemExit))
    > 3. Python core dumped
    > 4. Python does something that is not documented


    It seems to me to be a generally accepted term when an application
    stops due to an unhandled error to say that it crashed.

    Michael Foord
    http://www.ironpythoninaction.com/
     
    Fuzzyman, Dec 14, 2008
    #11
  12. James Mills

    James Mills Guest

    On Mon, Dec 15, 2008 at 9:03 AM, Fuzzyman <> wrote:
    > It seems to me to be a generally accepted term when an application
    > stops due to an unhandled error to say that it crashed.


    it == application
    Yes.

    --------------------

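    #!/usr/bin/env python

    from traceback import format_exc

    def foo():
        print("Hello World!")

    def main():
        # Catch any unhandled error so the application reports it
        # itself instead of stopping with a bare traceback.
        try:
            foo()
        except Exception as error:
            print("ERROR: %s" % error)
            print(format_exc())

    if __name__ == "__main__":
        main()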

    --------------------

    --JamesMills
     
    James Mills, Dec 14, 2008
    #12
  13. Paul Boddie

    Paul Boddie Guest

    On 14 Dec, 22:13, "Martin v. Löwis" <> wrote:
    > > But shouldn't the production of an object's representation via repr be
    > > a "safe" operation?

    >
    > It's a trade-off. It should also be legible.


    Right. I can understand that unlike Python 2.x, a representation of a
    string in Python 3.x (whose equivalent in Python 2.x would be a
    Unicode object) must also be a string (as opposed to a byte string in
    Python 2.x), and that no decision can be taken to choose "safe"
    representations for characters which cannot be displayed in a
    terminal. In examples, for Python 2.x...

    >>> u"æøå"
    u'\xe6\xf8\xe5'
    >>> repr(u"æøå")
    "u'\\xe6\\xf8\\xe5'"

    ...and for Python 3.x...

    >>> "æøå"
    'æøå'
    >>> repr("æøå")
    "'æøå'"

    ...with an ISO-8859-15 terminal. Python 2.x could conceivably be
    smarter about encoding representations, but chooses not to be since
    the smarter behaviour would need to involve knowing that an "output
    situation" was imminent. Python 3.x, on the other hand, leaves issues
    of encoding to the generic I/O pipeline, causing the described
    problem.

    Of course, repr will always work if its output does not get sent to
    sys.stdout or an insufficiently capable output stream, but I suppose
    usage of repr for debugging purposes, where one may wish to inspect
    character values, must be superseded by usage of the ascii function,
    as you point out. It's unfortunate that the default behaviour isn't
    optimal at the interactive prompt for some configurations, though.
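
The distinction drawn above, that repr itself never fails and only the write does, can be sketched as follows. The `console` object is merely a stand-in for a cp850 terminal (a text stream over an in-memory byte buffer), not the real sys.stdout.

```python
import io

text = '\u9876'

# Building the representation never touches an encoder, so it cannot fail.
r = repr(text)   # three characters: quote, U+9876, quote
a = ascii(text)  # pure ASCII, with the character escaped

# Stand-in for a cp850 console: a text stream over a byte buffer.
console = io.TextIOWrapper(io.BytesIO(), encoding='cp850')

console.write(a)  # fine: every character is ASCII
console.flush()

try:
    console.write(r)  # the encoder cannot map U+9876
    console.flush()
    write_failed = False
except UnicodeEncodeError:
    write_failed = True

print(len(r), a, write_failed)
```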

    Paul
     
    Paul Boddie, Dec 14, 2008
    #13
  14. > It's unfortunate that the default behaviour isn't
    > optimal at the interactive prompt for some configurations, though.


    As I said, it's a trade-off. The alternative, if it was the default,
    wouldn't be optimal at the interactive prompt for some other
    configurations.

    In particular, users of non-latin scripts have been complaining that
    they can't read their strings - hence the change, which now actually
    allows these users to read the text that is stored in the strings.

    The question really is why John Machin has a string that contains
    '\u9876' (which is a Chinese character), yet his terminal is incapable
    of displaying that character. More likely, people will typically
    encounter only characters in their data that their terminals are
    also capable of displaying (or else the terminal would be pretty
    useless)

    In the long run, it might be useful to have an error handler on
    sys.stdout in interactive mode, which escapes characters that
    cannot be encoded (perhaps in a different color, if the terminal
    supports colors, to make it clear that it is an escape sequence)
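
A sketch of that idea, minus the coloring, using the standard 'backslashreplace' error handler. Again the cp850 "console" is simulated with an in-memory buffer, and the names `buffer`, `console` and `shown` are only for illustration.

```python
import io

# Stand-in for a cp850 console whose error handler escapes anything
# the encoding cannot represent instead of raising.
buffer = io.BytesIO()
console = io.TextIOWrapper(buffer, encoding='cp850',
                           errors='backslashreplace')

# 'é' is encodable in cp850, U+9876 is not.
console.write(repr('\u00e9\u9876'))
console.flush()

shown = buffer.getvalue()
print(shown)
```

Encodable characters stay legible and only the unencodable one is escaped. On Python 3.7 and later, `sys.stdout.reconfigure(errors='backslashreplace')` applies the same handler to the real stream.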

    Regards,
    Martin
     
    Martin v. Löwis, Dec 14, 2008
    #14
  15. jhermann

    jhermann Guest

    Assuming those survived the switch to 3.0, you can use site.py and
    sys.displayhook to customize to the old behaviour (i.e. change it to a
    version using ascii instead of repr). Since this only affects
    interactive use, it's also no problem for portability of code, unlike
    "solutions" like forcing the defaultencoding etc.
     
    jhermann, Dec 17, 2008
    #15
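
One possible shape for the sys.displayhook approach suggested in the last post, as a minimal sketch. The function name `ascii_displayhook` is hypothetical, and the handling of `builtins._` only loosely imitates what the default hook does.

```python
import builtins
import sys

def ascii_displayhook(value):
    """Echo interactive results with ascii() instead of repr()."""
    if value is None:
        return
    builtins._ = None  # reset first, as the default hook does
    sys.stdout.write(ascii(value) + '\n')
    builtins._ = value  # keep the "_" shortcut working

# Only affects values echoed at the interactive prompt.
sys.displayhook = ascii_displayhook
```

Placed in a startup file, this changes only what the interactive prompt echoes; print() and ordinary script output are unaffected.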