a question about Chinese characters in a Python Program

Discussion in 'Python' started by Liang Chen, Oct 20, 2008.

  1. Liang Chen

    Liang Chen Guest

    Hope you all had a nice weekend.

    I have a question that I hope someone can help me with. I want to run a Python program that uses Tkinter for the user interface (GUI). The program allows me to type Chinese characters, but nevertheless is unable to show them on screen. The following is part of the error message I received after I logged off the program:

    "Could not write output: <type 'exceptions.UnicodeEncodeError'>, 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)"

    Any suggestion will be appreciated.

    Sincerely,

    Liang


    Liang Chen,Ph.D.
    Assistant Professor
    University of Georgia
    Communication Sciences and Special Education
    542 Aderhold Hall
    Athens, GA 30602

    Phone: 706-542-4566
     
    Liang Chen, Oct 20, 2008
    #1

  2. Liang Chen

    est Guest

    Re: a question about Chinese characters in a Python Program


    Personally I call it a serious bug in Python, but sadly most Python
    community members do not agree.
    It may be an internal str() call that caused this issue.

    https://groups.google.com/group/comp.lang.python/t/ca6ade6b6f5f3052
    http://bugs.python.org/issue3648
     
    est, Oct 20, 2008
    #2

  3. Liang Chen

    Paul Boddie Guest

    Re: a question about Chinese characters in a Python Program

    On 20 Oct, 07:32, est <> wrote:
    >
    > Personally I call it a serious bug in python


    Normally I'd entertain the possibility of bugs in Python, but your
    reasoning is a bit thin (in http://bugs.python.org/issue3648): "Why
    cann't Python just define ascii to range(256)"

    I do accept that it can be awkward to output text to the console, for
    example, but you have to consider that the console might not be
    configured to display any character you can throw at it. My console is
    configured for ISO-8859-15 (something like your magical "ascii to
    range(256)" only where someone has to decide what those 256 characters
    actually are), but that isn't going to help me display CJK characters.
    A solution might be to generate UTF-8 and then get the user to display
    the output in an appropriately configured application, but even then
    someone has to say that it's UTF-8 and not some other encoding that's
    being used. As discussed in another recent thread, Python 2.x does
    make some reasonable guesses about such matters to the extent that
    it's possible automatically (without magical knowledge).

    There is also the problem about use of the "str" built-in function or
    any operation where some Unicode object may be converted to a plain
    string. It is now recommended that you only convert to plain strings
    when you need to produce a sequence of bytes (for output, for
    example), and that you indicate how the Unicode values are encoded as
    bytes (by specifying an encoding). Python 3.x doesn't really change
    this: it just makes the Unicode/text vs. bytes distinction more
    obvious.
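A minimal sketch of the advice above, written in Python 3 syntax (the thread concerns Python 2, where encoding a Unicode object with `.encode()` follows the same idea). The sample text and the choice of UTF-8 are assumptions for illustration only; the original poster's program is not shown in the thread.

```python
# Two Chinese characters, chosen only for illustration.
text = "\u4e2d\u6587"

# Encoding with ASCII fails because neither character is in range(128).
try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# Naming an encoding that can represent the text succeeds.
utf8_bytes = text.encode("utf-8")

print(error_message)
print(utf8_bytes)
```

The failing call reports positions 0-1 because both characters are outside ASCII, which matches the shape of the message in the original question.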

    Paul
     
    Paul Boddie, Oct 20, 2008
    #3
  4. Liang Chen

    est Guest

    Re: a question about Chinese characters in a Python Program


    Thanks for the long comment, Paul, but it didn't help with the massive
    errors in Python encoding.

    IMHO it's even better to output wrong encodings rather than halt the
    WHOLE damn program with an exception.

    When debugging encoding problems, the solution is simple. If
    characters display wrong, switch to another encoding, one of them must
    be right.

    But it's tiring in Python to deal with encodings. You have to wrap
    EVERY SINGLE character expression with try ... except ... Just imagine
    what a pain it is.

    Just like the example I gave in Google Groups, u'\ue863' can NEVER be
    encoded into '\xfe\x9f'. Not a chance, because Python REFUSES to handle
    a byte value outside range(128).

    Strangely the 'mbcs' encoding system can. Does 'mbcs' have magic or
    something? But it's Windows-specific.

    Dealing with character encodings is really simple. AFAIK early
    encodings before Unicode, although they have many names, are all based
    on hacks. Take Chinese characters as an example. The encoding is called
    GB2312, and in fact it is totally compatible with range(256)
    ANSI. (There are minor issues, like half of a wide character being
    displayed as a question mark, but at least it's readable.) If you just
    output the series of bytes, it IS GB2312. The same is true of BIG5, JIS,
    etc.


    Like I said, str() should NOT throw an exception BY DESIGN; it's a
    basic language standard. str() is not only a convert-to-string
    function, but also a serialization in most cases (e.g. socket). My
    simple suggestion is: if it's a unicode character, output it as UTF-8;
    otherwise just output the byte array. Please do not encode it with really
    stupid range(128) ASCII. It's not guessing, it's totally wrong.
     
    est, Oct 20, 2008
    #4
  5. Liang Chen

    Paul Boddie Guest

    Re: a question about Chinese characters in a Python Program

    On 20 Oct, 15:30, est <> wrote:
    >
    > Thanks for the long comment Paul, but it didn't help massive errors in
    > Python encoding.
    >
    > IMHO it's even better to output wrong encodings rather than halt the
    > WHOLE damn program by an exception


    I disagree. Maybe I'll now get round to uploading an amusing pictorial
    example of this strategy just to illustrate where it can lead. CJK
    characters may be more demanding to deal with than various European
    characters, but I've seen public advertisements (admittedly aimed at
    IT course applicants) which made jokes about stuff like "Ã¥" and "Ã¸"
    appearing in documents instead of the intended European characters, so
    it's fairly safe to say that people do care what gets written out from
    computer programs.

    > When debugging encoding problems, the solution is simple. If
    > characters display wrong, switch to another encoding, one of them must
    > be right.
    >
    > But it's tiring in python to deal with encodings, you have to wrap
    > EVERY SINGLE character expression with try ... except ... just imagine
    > what pain it is.


    If everything is in Unicode then you don't have to think about
    encodings. I recommend using things like codecs.open to ensure that
    input and output even produce and consume Unicode objects when dealing
    with files.
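A rough sketch of that recommendation, in Python 3 syntax. It uses `io.open` with an `encoding` argument rather than `codecs.open`, on the assumption that the two serve the same purpose here (in Python 3 the built-in `open` is the same function as `io.open`). The sample text and the temporary file are hypothetical.

```python
import io
import os
import tempfile

text = "\u4e2d\u6587 and \u00f8"  # sample text, an assumption

# Hypothetical scratch file; any writable path would do.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # The encoding is named once, when the file is opened for writing.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(text)

    # Reading with the same encoding yields text again, not raw bytes.
    with io.open(path, "r", encoding="utf-8") as src:
        round_tripped = src.read()
finally:
    os.remove(path)

print(round_tripped == text)
```

The rest of the program then only handles text; the encoding decision lives at the file boundary.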

    > Just like the example I gave in Google Groups, u'\ue863' can NEVER be
    > encoded into '\xfe\x9f'. Not a chance, because python REFUSE to handle
    > a byte that is greater than range(128).


    Aside from the matter of which encoding you'd need to use to convert
    u'\ue863' into '\xfe\x9f', it has nothing to do with any implicit byte
    value range. To get from a Unicode object to a sequence of bytes
    (since that is the external representation of the text for other
    programs), Python has to perform a conversion. As a safe (but
    obviously conservative) default, Python only attempts to convert each
    Unicode character to a byte value using the ASCII character value
    table which is only defined for characters 0 to 127 - there's no such
    thing as "8-bit ASCII".

    Python doesn't attempt to automatically convert using other character
    tables (encodings, in other words), since there is quite a large
    possibility that the result, if not produced for the correct encoding,
    will not produce the desired visual effect. If I start with, say,
    character "ø" and encode it using UTF-8, I get a sequence of bytes
    which, if interpreted by a program expecting ISO-8859-15, will appear
    as two unrelated characters (something like "Ã¸"). If I encode the character using ISO-8859-15 and then feed the
    resulting byte sequence to a program expecting UTF-8, it will probably
    either complain or produce an incorrect visual effect. The reason why
    ASCII is safer (although not entirely safe) is because many encodings
    support ASCII as a subset of themselves.
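The two mismatches described above can be tried out directly. This is only an illustrative sketch in Python 3 syntax; the variable names are made up for the example.

```python
original = "\u00f8"  # the character "ø"

# UTF-8 bytes read by a program that assumes ISO-8859-15:
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-15")

# ISO-8859-15 bytes read by a program that assumes UTF-8:
latin9_bytes = original.encode("iso-8859-15")
try:
    latin9_bytes.decode("utf-8")
    strict_utf8_failed = False
except UnicodeDecodeError:
    strict_utf8_failed = True

print(repr(misread), len(misread))
print(strict_utf8_failed)
```

In the first direction one character silently becomes two wrong ones; in the second a strict decoder complains, which is the "complain or produce an incorrect visual effect" outcome mentioned in the post.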

    > Strangely the 'mbcs' encoding system can. Does 'mbcs' have magic or
    > something? But it's Windows-specific


    I thought Microsoft used some UTF-16 variant. That would explain how
    it can handle more or less everything.

    > Dealing with character encodings is really simple. AFAIK early
    > encoding before Unicode, although they have many names, are all based
    > on hacks. Take Chinese characters as an example. They are called
    > GB2312 encoding, in fact it is totally compatible with range(256)
    > ANSI. (There are minor issues like display half of a wide-character in
    > a question mark ? but at least it's readable) If you just output
    > serials of byte array, it IS GB2312. The same is true with BIG5, JIS,
    > etc.


    From the Wikipedia page, it appears that you need to convert GB2312
    values to EUC-CN by a relatively straightforward process, and can then
    output the resulting byte sequence in an ASCII compatible way,
    provided that you filter out all the byte values greater than 127:
    these filtered bytes would produce nonsense for anyone using a program
    not expecting EUC-CN. UTF-8 has some similar properties, but as I
    noted above, you wouldn't want to read most of the output if your
    program wasn't expecting UTF-8.

    > Like I said, str() should NOT throw an exception BY DESIGN, it's a
    > basic language standard. str() is not only a convert to string
    > function, but also a serialization in most cases.(e.g. socket) My
    > simple suggestion is: If it's a unicode character, output as UTF-8;
    > otherwise just output byte array, please do not encode it with really
    > stupid range(128) ASCII. It's not guessing, it's totally wrong.


    I think it's unfortunate that "str" is now potentially unreliable for
    certain uses, but to just output an arbitrary byte sequence (unless by
    byte array you mean a representation of the numeric values) is the
    wrong thing to do unless you don't care about the output; in which
    case, you could just as well use "repr" instead. I think the output of
    "str" vs. "unicode" especially with regard to Unicode objects was
    discussed extensively on the python-dev mailing list at one point.

    I don't disagree that people sometimes miss a way of having Python or
    some library "do the right thing" when writing stuff out. I could
    imagine a wrapper for Python accepting UTF-8 whose purpose is to
    "blank out" characters which the console cannot handle, and people
    might use this wrapper explicitly because that is the "right thing"
    for them. Indeed, such a program may already exist for a more general
    audience since I imagine that it could be fairly useful.
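One possible shape for such a wrapper, as a sketch only and in Python 3 syntax. `make_lossy_writer` is a hypothetical helper name, the `BytesIO` object stands in for a real console stream, and substituting "?" is just one reading of "blank out".

```python
import io

def make_lossy_writer(byte_stream, encoding):
    """Hypothetical helper: accept text, emit bytes, use '?' for characters
    that the target encoding cannot represent."""
    return io.TextIOWrapper(byte_stream, encoding=encoding, errors="replace")

console = io.BytesIO()  # stand-in for a console that only handles ASCII
writer = make_lossy_writer(console, "ascii")
writer.write("\u4e2d\u6587 ok")
writer.flush()
written = console.getvalue()

print(written)
```

The caller opts in to the lossy behaviour explicitly, so nothing is dropped by default.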

    Paul
     
    Paul Boddie, Oct 20, 2008
    #5
  6. Re: a question about Chinese characters in a Python Program

    On Mon, 20 Oct 2008 06:30:09 -0700, est wrote:

    > Like I said, str() should NOT throw an exception BY DESIGN, it's a basic
    > language standard.


    int() is also a basic language standard, but it is perfectly acceptable
    for int() to raise an exception if you ask it to convert something into
    an integer that can't be converted:

    int("cat")

    What else would you expect int() to do but raise an exception?

    If you ask str() to convert something into a string which can't be
    converted, then what else should it do other than raise an exception?
    Whatever answer you give, somebody else will argue it should do another
    thing. Maybe I want failed characters replaced with '?'. Maybe Fred wants
    failed characters deleted altogether. Susan wants UTF-16. George wants
    Latin-1.
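Each of those preferences can already be requested explicitly when encoding. A small sketch in Python 3 syntax; the sample text is an assumption.

```python
text = "caf\u00e9 \u4e2d"  # sample text: one accented letter, one Chinese character

replaced = text.encode("ascii", errors="replace")  # unencodable characters become '?'
dropped = text.encode("ascii", errors="ignore")    # unencodable characters are removed
as_utf16 = text.encode("utf-16")                   # every character can be represented

# Latin-1 covers the accented letter but not the Chinese character.
latin1_lossy = text.encode("latin-1", errors="replace")

print(replaced, dropped, latin1_lossy)
```

None of these is chosen automatically, which is the point being made: the caller has to say which outcome is wanted.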

    The simple fact is that there is no 1:1 mapping from all 65,000+ Unicode
    characters to the 256 bytes used by byte strings, so there *must* be an
    encoding, otherwise you don't know which characters map to which bytes.

    ASCII has the advantage of being the lowest common denominator. Perhaps
    it doesn't make too many people very happy, but it makes everyone equally
    unhappy.



    > str() is not only a convert to string function, but
    > also a serialization in most cases.(e.g. socket) My simple suggestion
    > is: If it's a unicode character, output as UTF-8;


    Why UTF-8? That will never do. I want it output as UCS-4.


    > otherwise just output
    > byte array, please do not encode it with really stupid range(128) ASCII.
    > It's not guessing, it's totally wrong.


    If you start with a byte string, you can always get a byte string:

    >>> s = '\x96 \xa0 \xaa'  # not ASCII characters
    >>> s
    '\x96 \xa0 \xaa'
    >>> str(s)
    '\x96 \xa0 \xaa'



    --
    Steven
     
    Steven D'Aprano, Oct 20, 2008
    #6
  7. Liang Chen

    est Guest

    Re: a question about Chinese characters in a Python Program


    In fact Python handles characters better than most other open-source
    programming languages. But still:

    1. You can explain str() in 1000 ways, but there are 1001 more confusing
    errors in all kinds of Python apps. (Not only some of the scripts I've
    written, but also famous enough apps like Boa Constructor:
    http://i36.tinypic.com/1gqekh.jpg. This sucks hard, right?)


    2. Can anyone kindly tell me how I can define a customized encoding
    (namely 'ansi') which handles range(256), so I can call
    sys.setdefaultencoding('ansi') once and for all?
     
    est, Oct 20, 2008
    #7
  8. Liang Chen

    Lie Ryan Guest

    Re: a question about Chinese characters in a Python Program

    On Sun, 19 Oct 2008 22:32:20 -0700, est wrote:

    > Personally I call it a serious bug in python, but sadly most of python
    > community members do not agree.
    > It may be a internal str() that caused this issue.


    No, it's not a bug. It's the correct behavior, although some people
    might not immediately grasp the reasons why it is correct and why
    defining ascii as range(256) is plain wrong.

    Anyway, if you haven't noticed, str() is capable of emitting all
    characters in range(256), e.g. str('\xff'). ASCII, though, doesn't allow
    that, as ASCII is a 7-bit encoding. Latin-1, ANSI, and other ASCII
    extensions are 8-bit encodings, but not ASCII itself.
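The 7-bit versus 8-bit distinction can be checked directly. A small sketch in Python 3 syntax, using Latin-1 as one example of an encoding that assigns a character to every byte value; whether those are the characters the data was meant to contain is a separate question.

```python
all_bytes = bytes(range(256))

# Latin-1 assigns a character to every byte value, so decoding never fails.
as_text = all_bytes.decode("latin-1")
round_trips = as_text.encode("latin-1") == all_bytes

# ASCII stops at 127, so decoding fails at the first byte above that.
try:
    all_bytes.decode("ascii")
    first_bad = None
except UnicodeDecodeError as exc:
    first_bad = exc.start

print(round_trips, first_bad)
```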
     
    Lie Ryan, Oct 20, 2008
    #8
  9. Liang Chen

    John Machin Guest

    Re: a question about Chinese characters in a Python Program

    On Oct 21, 1:45 am, Paul Boddie <> wrote:
    > From the Wikipedia page, it appears that you need to convert GB2312
    > values to EUC-CN by a relatively straightforward process, and can then
    > output the resulting byte sequence in an ASCII compatible way,
    > provided that you filter out all the byte values greater than 127:
    > these filtered bytes would produce nonsense for anyone using a program
    > not expecting EUC-CN. UTF-8 has some similar properties, but as I
    > noted above, you wouldn't want to read most of the output if your
    > program wasn't expecting UTF-8.


    What the Wikipedia page doesn't say is that the number of people who
    grok the concept of a GB2312 codepoint is vanishingly small, and the
    number of people who would actually have GB2312 codepoints in a file
    is smaller still. When people say their data is GB2312, they mean
    "GB<something> encoded as EUC-CN". So the relatively straightforward
    process is not required in practice.

    I don't understand the point or value of filtering out all byte values
    greater than 127:

    If the data is really GB2312, this would throw out all the Chinese
    characters.

    If the GB<something> is, as is likely, really GBK aka cp936 (a
    superset of GB2312), then the second byte of a Chinese character may
    be in the ASCII range, and the result of the filter would comprise the
    true ASCII characters plus some garbage ASCII characters.
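Both cases can be sketched with Python's `gb2312` and `gbk` codecs (Python 3 syntax). The helper names are hypothetical, and the search simply scans the basic CJK block for any character whose second GBK byte falls in the ASCII range.

```python
def first_gbk_char_with_ascii_trail():
    """Find a CJK character whose second GBK byte is below 0x80."""
    for code_point in range(0x4E00, 0x9FA6):
        char = chr(code_point)
        try:
            encoded = char.encode("gbk")
        except UnicodeEncodeError:
            continue  # not representable in GBK; skip it
        if len(encoded) == 2 and encoded[1] < 0x80:
            return char, encoded
    return None

def keep_ascii(data):
    """The filter under discussion: drop every byte value above 127."""
    return bytes(b for b in data if b < 0x80)

# Case 1: genuine GB2312 text (as EUC-CN) loses every byte to the filter.
gb2312_sample = "\u4e2d\u6587".encode("gb2312")
print(keep_ascii(gb2312_sample))

# Case 2: a GBK character can leave a stray ASCII byte behind.
found = first_gbk_char_with_ascii_trail()
if found is not None:
    char, encoded = found
    print(repr(char), encoded, keep_ascii(encoded))
```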
     
    John Machin, Oct 21, 2008
    #9
  10. Liang Chen

    John Machin Guest

    Re: a question about Chinese characters in a Python Program

    On Oct 21, 11:03 pm, Ben Finney <>
    wrote:
    > John Machin <> writes:
    > > I don't understand the point or value of filtering out all byte values
    > > greater than 127

    >
    > That's only done if the encoding isn't otherwise specified. In which
    > case, ASCII is the documented default encoding. In which case, it
    > *must* be restricted to code points 0+IBM-127, otherwise it's not ASCII.
    >
    > The value of doing this is to make it rapidly and repeatably apparent
    > when the programmer's assumptions about character encoding are false,
    > allowing the programming error to be fixed early rather than late.


    "make it rapidly and repeatably apparent ..." is much better achieved
    by raising an exception.

    > This is, in my estimation, of more value than heuristic magic to
    > +IBw-guess+IB0- the encoding, and the resultant debugging nightmare when
    > that guesswork fails in unpredictable ways later in the program's
    > life.


    Was I suggesting "heuristic magic"?

    What is that 0+IBM-127 +IBw-guess+IB0- gibberish in your posting?
     
    John Machin, Oct 22, 2008
    #10
  11. Liang Chen

    John Machin Guest

    Re: a question about Chinese characters in a Python Program

    On Oct 22, 11:07 am, Ben Finney <>
    wrote:
    > John Machin <> writes:


    > > What is that 0+IBM-127 +IBw-guess+IB0- gibberish in your posting?

    >
    > It wasn't in my message as sent to my news server, nor as I read the
    > message in comp.lang.python. The message was encoded using UTF-8.
    > Perhaps it's since been munged in transit to your eyeballs by any of a
    > number of intermediaries.


    Would you believe:

    >>> '0+IBM-127 +IBw-guess+IB0-'.decode('utf7')
    u'0\u2013127 \u201cguess\u201d'
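For readers on Python 3, where `str` has no `.decode()` method, the same check starts from a bytes literal. This is only a restatement of the session above in newer syntax.

```python
garbled = b"0+IBM-127 +IBw-guess+IB0-"

# UTF-7 wraps non-ASCII characters in +...- runs of modified base64.
decoded = garbled.decode("utf-7")

print(ascii(decoded))
```

The en dash and the curly quotes are what the `+IBM-`, `+IBw-` and `+IB0-` runs stand for.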
     
    John Machin, Oct 22, 2008
    #11
  12. Liang Chen

    wmr Guest

    Re: a question about Chinese characters in a Python Program

    ****
     
    wmr, Nov 9, 2008
    #12