cp936 uses gbk codec, doesn't decode `\x80` as U+20AC EURO SIGN

John Machin · Oct 10, 2010

|>>> '\x80'.decode('cp936')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 0: incomplete multibyte sequence

However:

Retrieved 2010-10-10 from
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

# Name: cp936 to Unicode table
# Unicode version: 2.0
# Table version: 2.01
# Table format: Format A
# Date: 1/7/2000
#
# Contact: (e-mail address removed)
...
0x7F 0x007F #DELETE
0x80 0x20AC #EURO SIGN
0x81 #DBCS LEAD BYTE

Retrieved 2010-10-10 from
http://msdn.microsoft.com/en-us/goglobal/cc305153.aspx

Windows Codepage 936
[pictorial mapping; shows 80 mapping to 20AC]

Retrieved 2010-10-10 from
http://demo.icu-project.org/icu-bin/convexp?conv=windows-936-2000&s=ALL

[pictorial mapping for converter "windows-936-2000", with aliases
including GBK, CP936, MS936; shows 80 mapping to 20AC]

So Microsoft appears to think that cp936 includes the euro,
and the ICU project seems to think that GBK and cp936 both include the euro.
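
For anyone who wants to reproduce the observation, here is a minimal sketch. It is written in Python 3 syntax (bytes literals), whereas the session above is Python 2, and the helper name `probe` is made up for illustration. It reports which codec an alias resolves to and whether the lone byte 0x80 decodes; it makes no assumption about which result a given Python version produces.

```python
import codecs


def probe(codec_name, raw=b"\x80"):
    """Return (canonical codec name, decoded text or None if decoding fails)."""
    info = codecs.lookup(codec_name)  # resolves aliases such as cp936
    try:
        text = raw.decode(codec_name)
    except UnicodeDecodeError:
        text = None  # the codec rejected the input
    return info.name, text


if __name__ == "__main__":
    # All three names are aliases of the same codec in CPython.
    for name in ("cp936", "gbk", "ms936"):
        print(name, probe(name))
```

A `None` in the second position means the codec rejected the byte, as in the traceback above; `'\u20ac'` would mean it follows the Microsoft table.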

A couple of questions:

Is this a bug or a shrug?

Where can one find the mapping tables
from which the various CJK codecs are derived?

Ulrich Eckhardt · Oct 11, 2010

John said:
|>>> '\x80'.decode('cp936')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 0: incomplete multibyte sequence
[...]
So Microsoft appears to think that cp936 includes the euro,
and the ICU project seems to think that GBK and cp936 both include the euro.

A couple of questions:

Is this a bug or a shrug?

Bug, IMHO.

Uli
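
Until the codec itself changes, one possible workaround is a custom decode error handler that substitutes U+20AC wherever the codec rejects a byte 0x80. This is only a sketch in Python 3 syntax, not how the gbk codec is implemented; the handler name `cp936_euro` and the function `decode_cp936` are hypothetical names chosen for this example.

```python
import codecs

EURO = "\u20ac"


def _euro_for_0x80(exc):
    """Map an undecodable byte 0x80 to U+20AC; re-raise any other error."""
    if (isinstance(exc, UnicodeDecodeError)
            and exc.object[exc.start:exc.start + 1] == b"\x80"):
        # Replace the single byte and resume decoding right after it.
        return EURO, exc.start + 1
    raise exc


# "cp936_euro" is a name invented here, not a standard error handler.
codecs.register_error("cp936_euro", _euro_for_0x80)


def decode_cp936(raw):
    """Decode cp936 bytes, treating a rejected 0x80 as the euro sign."""
    return raw.decode("cp936", errors="cp936_euro")


if __name__ == "__main__":
    print(repr(decode_cp936(b"\x80")))
```

Because the handler runs only on bytes the codec refuses, a 0x80 that the codec accepts as the second byte of a double-byte character is left alone, which a blanket `bytes.replace` would not guarantee.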
