cp936 uses gbk codec, doesn't decode `\x80` as U+20AC EURO SIGN

Discussion in 'Python' started by John Machin, Oct 10, 2010.

  1. John Machin

    >>> '\x80'.decode('cp936')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 0: incomplete multibyte sequence

    However:

    Retrieved 2010-10-10 from
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

    # Name: cp936 to Unicode table
    # Unicode version: 2.0
    # Table version: 2.01
    # Table format: Format A
    # Date: 1/7/2000
    #
    # Contact:
    ...
    0x7F 0x007F #DELETE
    0x80 0x20AC #EURO SIGN
    0x81 #DBCS LEAD BYTE
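    For anyone who wants to check the table mechanically, lines in the
    format quoted above can be parsed with a few lines of Python. This is
    only a sketch (Python 3 syntax): SAMPLE holds just the three entries
    quoted here, parse_mapping is a made-up helper name, and it assumes
    every data line is "byte code, Unicode code point, # comment", with
    lead-byte lines (no Unicode column) skipped.

```python
# Three entries copied from the excerpt above; the real file is much longer.
SAMPLE = """\
0x7F 0x007F #DELETE
0x80 0x20AC #EURO SIGN
0x81 #DBCS LEAD BYTE
"""


def parse_mapping(text):
    """Return {byte code: character} for lines that carry both columns."""
    mapping = {}
    for line in text.splitlines():
        # Drop the trailing comment, then split on any whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) != 2:
            # Header comment, blank line, or a lead-byte marker with no target.
            continue
        code, code_point = (int(field, 16) for field in fields)
        mapping[code] = chr(code_point)
    return mapping


table = parse_mapping(SAMPLE)
print(hex(ord(table[0x80])))  # 0x20ac
```

    With the full CP936.TXT read from disk, the same function could be
    used to compare each entry against what the codec actually returns.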

    Retrieved 2010-10-10 from
    http://msdn.microsoft.com/en-us/goglobal/cc305153.aspx

    Windows Codepage 936
    [pictorial mapping; shows 80 mapping to 20AC]

    Retrieved 2010-10-10 from
    http://demo.icu-project.org/icu-bin/convexp?conv=windows-936-2000&s=ALL

    [pictorial mapping for converter "windows-936-2000" with aliases
    including GBK, CP936, MS936; shows 80 mapping to 20AC]

    So Microsoft appears to think that cp936 includes the euro,
    and the ICU project seems to think that GBK and cp936
    both include the euro.

    A couple of questions:

    Is this a bug or a shrug?

    Where can one find the mapping tables
    from which the various CJK codecs are derived?
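
    Whatever the answer turns out to be, one possible workaround is a
    custom error handler that maps a stray 0x80 byte to U+20AC and leaves
    every other error alone. The sketch below uses Python 3 syntax; the
    handler name "cp936_euro" and the helper decode_cp936 are invented
    for illustration and are not part of the standard library.

```python
import codecs


def _euro_fallback(exc):
    """Map an undecodable 0x80 byte to U+20AC; re-raise anything else."""
    if (isinstance(exc, UnicodeDecodeError)
            and exc.object[exc.start:exc.start + 1] == b"\x80"):
        # Emit the euro sign and resume decoding right after the 0x80 byte.
        return ("\u20ac", exc.start + 1)
    raise exc


# "cp936_euro" is a hypothetical handler name chosen for this example.
codecs.register_error("cp936_euro", _euro_fallback)


def decode_cp936(data):
    """Decode bytes as cp936, treating 0x80 as EURO SIGN per CP936.TXT."""
    return data.decode("cp936", errors="cp936_euro")


print(decode_cp936(b"\x805") == "\u20ac5")
```

    The handler only fires when the codec itself rejects the byte, so it
    should keep working unchanged if the codec is later fixed.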
     
    John Machin, Oct 10, 2010
    #1

  2. Bug, IMHO.

    Uli
     
    Ulrich Eckhardt, Oct 11, 2010
    #2
