latin1 and cp1252 inconsistent?

Discussion in 'Python' started by buck@yelp.com, Nov 16, 2012.

  1. Guest

    Latin1 has a block of 32 undefined characters.
    Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

    The byte 0x81 decoded with latin1 gives the Unicode code point U+0081.
    Decoding the same byte with windows-1252 raises `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>`.

    This seems inconsistent to me, given that this byte is equally undefined in the two standards.
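
    For example (Python 3):

    >>> b'\x81'.decode('latin1')
    '\x81'
    >>> b'\x81'.decode('cp1252')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>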

    Also, the html5 standard says:

    When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].

    http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0


    The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.
    , Nov 16, 2012
    #1

  2. Ian Kelly Guest

    On Fri, Nov 16, 2012 at 2:44 PM, <> wrote:
    > Latin1 has a block of 32 undefined characters.


    These characters are not undefined. 0x80-0x9f are the C1 control
    codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
    their Unicode mappings are well defined.

    http://tools.ietf.org/html/rfc1345

    > Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D


    In CP 1252, these codes are actually undefined.

    http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx

    > Also, the html5 standard says:
    >
    > When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].
    >
    > http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0
    >
    >
    > The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.


    You can use a non-strict error handling scheme to prevent the error.

    >>> b'hello \x81 world'.decode('cp1252')

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position
    6: character maps to <undefined>

    >>> b'hello \x81 world'.decode('cp1252', 'replace')

    'hello \ufffd world'
    >>> b'hello \x81 world'.decode('cp1252', 'ignore')

    'hello world'
    Ian Kelly, Nov 16, 2012
    #2

  3. Guest

    On Friday, November 16, 2012 2:34:32 PM UTC-8, Ian wrote:
    > On Fri, Nov 16, 2012 at 2:44 PM, <buck> wrote:
    >
    > > Latin1 has a block of 32 undefined characters.

    >
    >
    > These characters are not undefined. 0x80-0x9f are the C1 control
    > codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
    > their Unicode mappings are well defined.


    They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

    """ The shaded positions in the code table correspond
    to bit combinations that do not represent graphic
    characters. Their use is outside the scope of
    ISO/IEC 8859; it is specified in other International
    Standards, for example ISO/IEC 6429.


    However, it's reasonable for 0x81 to decode to U+0081 because the Unicode standard says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

    """ The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.


    > You can use a non-strict error handling scheme to prevent the error.
    > >>> b'hello \x81 world'.decode('cp1252', 'replace')

    > 'hello \ufffd world'


    This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.
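
    For instance, a round trip through 'replace' silently turns 0x81 into a question mark:

    >>> b'\x81'.decode('cp1252', 'replace').encode('cp1252', 'replace')
    b'?'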
    , Nov 16, 2012
    #3
  5. Dave Angel Guest

    On 11/16/2012 06:27 PM, wrote:
    > (doublespaced nonsense deleted. Google Groups strikes again.)
    > This creates a non-reversible encoding, and loss of data, which isn't
    > acceptable for my application.


    So tell us more about your application. If you have data which is
    invalid, and you encode it to some other form, you have to expect that
    it won't be reversible. But maybe your data isn't really characters at
    all, and you're just trying to manipulate bytes?

    Without a use case, we really can't guess. The fact that you are
    waffling between latin1 and 1252 indicates this isn't really character data.

    Also, while you're at it, please specify the Python version and OS
    you're on. You haven't given us any code to guess it from.

    --

    DaveA
    Dave Angel, Nov 17, 2012
    #5
  6. Ian Kelly Guest

    On Fri, Nov 16, 2012 at 4:27 PM, <> wrote:
    > They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf
    >
    > """ The shaded positions in the code table correspond
    > to bit combinations that do not represent graphic
    > characters. Their use is outside the scope of
    > ISO/IEC 8859; it is specified in other International
    > Standards, for example ISO/IEC 6429.


    It gets murkier than that. I don't want to spend time hunting down
    the relevant documents, so I'll just quote from Wikipedia:

    """
    In 1992, the IANA registered the character map ISO_8859-1:1987, more
    commonly known by its preferred MIME name of ISO-8859-1 (note the
    extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on
    the Internet. This map assigns the C0 and C1 control characters to the
    unassigned code values thus provides for 256 characters via every
    possible 8-bit value.
    """

    http://en.wikipedia.org/wiki/ISO/IEC_8859-1#History

    >> You can use a non-strict error handling scheme to prevent the error.
    >> >>> b'hello \x81 world'.decode('cp1252', 'replace')

    >> 'hello \ufffd world'

    >
    > This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.


    Well, what characters would you have these bytes decode to,
    considering that they're undefined? If the string is really CP-1252,
    then the presence of undefined characters in the document does not
    signify "data". They're just junk bytes, possibly indicative of data
    corruption. If on the other hand the string is really Latin-1, and
    you *know* that it is Latin-1, then you should probably forget the
    aliasing recommendation and just decode it as Latin-1.

    Apparently this Latin-1 -> CP-1252 encoding aliasing is already
    commonly performed by modern user agents. What do IE and Firefox do
    when presented with a Latin-1 encoding and undefined CP-1252 codings?
    Ian Kelly, Nov 17, 2012
    #6
  7. Nobody Guest

    On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:

    > When a user agent [browser] would otherwise use a character encoding given
    > in the first column [ISO-8859-1, aka latin1] of the following table to
    > either convert content to Unicode characters or convert Unicode characters
    > to bytes, it must instead use the encoding given in the cell in the second
    > column of the same row [windows-1252, aka cp1252].


    It goes on to say:

    The requirement to treat certain encodings as other encodings according
    to the table above is a willful violation of the W3C Character Model
    specification, motivated by a desire for compatibility with legacy
    content. [CHARMOD]

    IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
    successful and now we have to deal with it. If HTML content is tagged as
    using ISO-8859-1, it's more likely that it's actually Windows-1252 content
    generated by someone who doesn't know the difference.

    Given that the only differences between the two are for code points which
    are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
    ISO-8859-1 as Windows-1252 should be harmless.

    If you need to support either, you can parse it as ISO-8859-1 then
    explicitly convert C1 codes to their Windows-1252 equivalents as a
    post-processing step, e.g. using the .translate() method.
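
    A rough sketch of that approach (the helper name here is just illustrative):

    # Map the C1 code points that CP-1252 does define onto their
    # Windows-1252 characters; the five undefined bytes (0x81, 0x8D,
    # 0x8F, 0x90, 0x9D) are left alone as C1 controls.
    C1_TO_CP1252 = {}
    for byte in range(0x80, 0xA0):
        try:
            C1_TO_CP1252[byte] = bytes([byte]).decode('cp1252')
        except UnicodeDecodeError:
            pass

    def decode_lenient_cp1252(data):
        # decode as ISO-8859-1, then upgrade the C1 range via .translate()
        return data.decode('latin-1').translate(C1_TO_CP1252)

    print(ascii(decode_lenient_cp1252(b'\x93caf\xe9\x94 \x81')))
    # prints: '\u201ccaf\xe9\u201d \x81'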
    Nobody, Nov 17, 2012
    #7
  8. Ian Kelly Guest

    On Fri, Nov 16, 2012 at 5:33 PM, Nobody <> wrote:
    > If you need to support either, you can parse it as ISO-8859-1 then
    > explicitly convert C1 codes to their Windows-1252 equivalents as a
    > post-processing step, e.g. using the .translate() method.


    Or just create a custom codec by taking the one in
    Lib/encodings/cp1252.py and modifying it slightly.


    >>> import codecs
    >>> import cp1252a
    >>> codecs.register(lambda n: cp1252a.getregentry() if n == "cp1252a" else None)
    >>> b'\x81\x8d\x8f\x90\x9d'.decode('cp1252a')

    '♕♖♗♘♙'
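
    Something along these lines would do it without touching the stdlib file
    (sketch only; here the five bytes are mapped to their C1 code points
    rather than to chess pieces, and the codec name 'cp1252c' is made up):

    import codecs
    import encodings.cp1252 as cp1252

    # copy cp1252's decoding table, filling the five undefined slots with
    # the corresponding C1 control code points (as latin-1 would)
    table = list(cp1252.decoding_table)
    for byte in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
        table[byte] = chr(byte)
    decoding_table = ''.join(table)
    encoding_table = codecs.charmap_build(decoding_table)

    def decode(input, errors='strict'):
        return codecs.charmap_decode(input, errors, decoding_table)

    def encode(input, errors='strict'):
        return codecs.charmap_encode(input, errors, encoding_table)

    codecs.register(lambda name: codecs.CodecInfo(encode, decode, name='cp1252c')
                    if name == 'cp1252c' else None)

    print(ascii(b'\x81\x8d\x8f\x90\x9d'.decode('cp1252c')))
    # prints: '\x81\x8d\x8f\x90\x9d'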
    Ian Kelly, Nov 17, 2012
    #8
  9. Guest

    On Friday, November 16, 2012 4:33:14 PM UTC-8, Nobody wrote:
    > On Fri, 16 Nov 2012 13:44:03 -0800, buck wrote:
    > IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
    > successful and now we have to deal with it. If HTML content is tagged as
    > using ISO-8859-1, it's more likely that it's actually Windows-1252 content
    > generated by someone who doesn't know the difference.


    Yes that's exactly what it says.

    > Given that the only differences between the two are for code points which
    > are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
    > ISO-8859-1 as Windows-1252 should be harmless.


    "should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:

    http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt

    and here:

    ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

    This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

    > There are 65 code points set aside in the Unicode Standard for compatibility with the C0
    > and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
    > points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
    > controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
    > respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
    > codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
    > equal to its corresponding Unicode code point.


    IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the Unicode code point of equal value.

    This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (6.2, ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)
    , Nov 17, 2012
    #9
  10. Ian Kelly Guest

    On Sat, Nov 17, 2012 at 9:56 AM, <> wrote:
    > "should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I seethis as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:
    >
    > http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
    >
    > and here:
    >
    > ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt


    The README for the "BestFit" document states:

    """
    These tables include "best fit" behavior which is not present in the
    other files. Examples of best fit
    are converting fullwidth letters to their counterparts when converting
    to single byte code pages, and
    mapping the Infinity character to the number 8.
    """

    This does not sound like appropriate behavior for a generalized
    conversion scheme. It is also noted that the "BestFit" document is
    not authoritative at:

    http://www.iana.org/assignments/charset-reg/windows-1252


    > This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
    >
    >> There are 65 code points set aside in the Unicode Standard for compatibility with the C0
    >> and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
    >> points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
    >> controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
    >> respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
    >> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
    >> equal to its corresponding Unicode code point.

    >
    > IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the Unicode code point of equal value.
    >
    > This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (6.2, ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)


    But Latin-1 explicitly defers to the control codes for those
    characters. CP-1252 does not; the reason those characters are left
    undefined is to allow for future expansion, such as when Microsoft
    added the Euro sign at 0x80.

    Since we're talking about conversion from bytes to Unicode, I think
    the most authoritative source we could possibly reference would be the
    official ISO 10646 conversion tables for the character sets in
    question. I understand those are to be found here:

    http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

    and here:

    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

    Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas
    the cp1252 mapping leaves those five codes undefined. This would seem
    to indicate that Python is correctly decoding CP-1252 according to the
    Unicode standard.
    Ian Kelly, Nov 17, 2012
    #10
  11. Ian Kelly Guest

    On Sat, Nov 17, 2012 at 11:08 AM, Ian Kelly <> wrote:
    > On Sat, Nov 17, 2012 at 9:56 AM, <> wrote:
    >> "should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as isdone here:
    >>
    >> http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt
    >>
    >> and here:
    >>
    >> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

    >
    > The README for the "BestFit" document states:
    >
    > """
    > These tables include "best fit" behavior which is not present in the
    > other files. Examples of best fit
    > are converting fullwidth letters to their counterparts when converting
    > to single byte code pages, and
    > mapping the Infinity character to the number 8.
    > """
    >
    > This does not sound like appropriate behavior for a generalized
    > conversion scheme. It is also noted that the "BestFit" document is
    > not authoritative at:
    >
    > http://www.iana.org/assignments/charset-reg/windows-1252


    I meant to also comment on the first link, but forgot. As that
    document is published by the W3C, I understand it to be specific to
    the Web, which Python is not. Hence I think the more general Unicode
    specification is more appropriate for Python.
    Ian Kelly, Nov 17, 2012
    #11
  12. Nobody Guest

    On Sat, 17 Nov 2012 08:56:46 -0800, buck wrote:

    >> Given that the only differences between the two are for code points
    >> which are in the C1 range (0x80-0x9F), which should never occur in HTML,
    >> parsing ISO-8859-1 as Windows-1252 should be harmless.

    >
    > "should" is a wish. The reality is that documents (and especially URLs)
    > exist that can be decoded with latin1, but will backtrace with cp1252.


    In which case, they're probably neither ISO-8859-1 nor Windows-1252, but
    some other (unknown) encoding which has acquired the ISO-8859-1 label
    "by default".

    In that situation, if you still need to know the encoding, you need to
    resort to heuristics such as those employed by the chardet library.
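
    For example, with the third-party chardet package (pip install chardet;
    the exact guess and confidence will vary with the input):

    import chardet

    guess = chardet.detect(b'caf\xe9 au lait \x93quoted\x94')
    print(guess['encoding'], guess['confidence'])
    # prints the guessed encoding name and a confidence between 0 and 1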
    Nobody, Nov 17, 2012
    #12
  13. On Fri, 16 Nov 2012 15:27:54 -0800 (PST), declaimed the
    following in gmane.comp.python.general:

    > On Friday, November 16, 2012 2:34:32 PM UTC-8, Ian wrote:
    > > On Fri, Nov 16, 2012 at 2:44 PM, <buck> wrote:
    > >
    > > > Latin1 has a block of 32 undefined characters.

    > >
    > >
    > > These characters are not undefined. 0x80-0x9f are the C1 control
    > > codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
    > > their Unicode mappings are well defined.

    >
    > They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf
    >
    > """ The shaded positions in the code table correspond
    > to bit combinations that do not represent graphic
    > characters. Their use is outside the scope of
    > ISO/IEC 8859; it is specified in other International
    > Standards, for example ISO/IEC 6429.
    >

    This quote only states that those positions do not represent
    displayable glyphs, and indicates that 8859 is only concerned with
    codings for display. It does NOT say they are "undefined".
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, Nov 18, 2012
    #13
