latin1 and cp1252 inconsistent?

buck

Latin1 has a block of 32 undefined characters.
Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

The byte 0x81 decoded with latin1 gives the Unicode code point U+0081.
Decoding the same byte with windows-1252 yields a stack trace with `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>`

This seems inconsistent to me, given that this byte is equally undefined in the two standards.
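A minimal sketch of the asymmetry described above (Python 3; the byte value is the one from the traceback):

```python
raw = b'\x81'

# latin1 maps every byte, including the C1 range, to the code point
# of equal value:
assert raw.decode('latin1') == '\u0081'

# cp1252 leaves 0x81 undefined, so a strict decode raises:
try:
    raw.decode('cp1252')
except UnicodeDecodeError as exc:
    print(exc)  # 'charmap' codec can't decode byte 0x81 ...
```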

Also, the html5 standard says:

When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0


The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.
 
Ian Kelly

Latin1 has a block of 32 undefined characters.

These characters are not undefined. 0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

http://tools.ietf.org/html/rfc1345
Windows-1252 (aka cp1252) fills in 27 of these characters but leaves five undefined: 0x81, 0x8D, 0x8F, 0x90, 0x9D

In CP 1252, these codes are actually undefined.

http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx
Also, the html5 standard says:

When a user agent [browser] would otherwise use a character encoding given in the first column [ISO-8859-1, aka latin1] of the following table to either convert content to Unicode characters or convert Unicode characters to bytes, it must instead use the encoding given in the cell in the second column of the same row [windows-1252, aka cp1252].

http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0


The current implementation of windows-1252 isn't usable for this purpose (a replacement of latin1), since it will throw an error in cases that latin1 would succeed.

You can use a non-strict error handling scheme to prevent the error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python33\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input, errors, decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position

'hello world'
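For reference, the non-strict handlers are the standard `errors` argument to `bytes.decode`; the input bytes here are illustrative, not taken from the thread:

```python
data = b'hello \x81 world'

# 'replace' substitutes U+FFFD for the undecodable byte:
print(data.decode('cp1252', errors='replace'))  # hello \ufffd world

# 'ignore' drops the byte entirely:
print(data.decode('cp1252', errors='ignore'))
```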
 
buck

These characters are not undefined. 0x80-0x9f are the C1 control
codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
their Unicode mappings are well defined.

They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond
to bit combinations that do not represent graphic
characters. Their use is outside the scope of
ISO/IEC 8859; it is specified in other International
Standards, for example ISO/IEC 6429.


However, it's reasonable for 0x81 to decode to U+0081, because the Unicode standard says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

""" The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.

You can use a non-strict error handling scheme to prevent the error.
'hello \ufffd world'

This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.
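A short sketch of the data-loss point (illustrative bytes): distinct inputs collapse to the same output under `errors='replace'`, so the original bytes cannot be recovered.

```python
# Two different undefined bytes produce identical decoded strings ...
a = b'hello \x81 world'.decode('cp1252', errors='replace')
b = b'hello \x8d world'.decode('cp1252', errors='replace')
assert a == b  # both are 'hello \ufffd world'

# ... so re-encoding cannot reconstruct the original byte:
assert a.encode('cp1252', errors='replace') != b'hello \x81 world'
```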
 
Dave Angel

(doublespaced nonsense deleted. GoogleGroups strikes again.)
This creates a non-reversible encoding, and loss of data, which isn't
acceptable for my application.

So tell us more about your application. If you have data which is
invalid, and you encode it to some other form, you have to expect that
it won't be reversible. But maybe your data isn't really characters at
all, and you're just trying to manipulate bytes?

Without a use case, we really can't guess. The fact that you are
waffling between latin1 and 1252 indicates this isn't really character data.

Also, while you're at it, please specify the Python version and OS
you're on. You haven't given us any code to guess it from.
 
Ian Kelly

They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond
to bit combinations that do not represent graphic
characters. Their use is outside the scope of
ISO/IEC 8859; it is specified in other International
Standards, for example ISO/IEC 6429.

It gets murkier than that. I don't want to spend time hunting down
the relevant documents, so I'll just quote from Wikipedia:

"""
In 1992, the IANA registered the character map ISO_8859-1:1987, more
commonly known by its preferred MIME name of ISO-8859-1 (note the
extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on
the Internet. This map assigns the C0 and C1 control characters to the
unassigned code values thus provides for 256 characters via every
possible 8-bit value.
"""

http://en.wikipedia.org/wiki/ISO/IEC_8859-1#History
This creates a non-reversible encoding, and loss of data, which isn't acceptable for my application.

Well, what characters would you have these bytes decode to,
considering that they're undefined? If the string is really CP-1252,
then the presence of undefined characters in the document does not
signify "data". They're just junk bytes, possibly indicative of data
corruption. If on the other hand the string is really Latin-1, and
you *know* that it is Latin-1, then you should probably forget the
aliasing recommendation and just decode it as Latin-1.

Apparently this Latin-1 -> CP-1252 encoding aliasing is already
commonly performed by modern user agents. What do IE and Firefox do
when presented with a Latin-1 encoding and undefined CP-1252 codings?
 
Nobody

When a user agent [browser] would otherwise use a character encoding given
in the first column [ISO-8859-1, aka latin1] of the following table to
either convert content to Unicode characters or convert Unicode characters
to bytes, it must instead use the encoding given in the cell in the second
column of the same row [windows-1252, aka cp1252].

It goes on to say:

The requirement to treat certain encodings as other encodings according
to the table above is a willful violation of the W3C Character Model
specification, motivated by a desire for compatibility with legacy
content. [CHARMOD]

IOW: Microsoft's "embrace, extend, extinguish" strategy has been too
successful and now we have to deal with it. If HTML content is tagged as
using ISO-8859-1, it's more likely that it's actually Windows-1252 content
generated by someone who doesn't know the difference.

Given that the only differences between the two are for code points which
are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
ISO-8859-1 as Windows-1252 should be harmless.

If you need to support either, you can parse it as ISO-8859-1 then
explicitly convert C1 codes to their Windows-1252 equivalents as a
post-processing step, e.g. using the .translate() method.
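That post-processing step might look like the following sketch: build the C1-to-cp1252 table by probing each byte with the stdlib codec, then apply `str.translate` after a latin1 decode. The input bytes are illustrative:

```python
# Map each C1 byte to its cp1252 meaning; the five undefined bytes
# (0x81, 0x8D, 0x8F, 0x90, 0x9D) raise on probing, are skipped,
# and therefore stay as C1 control characters.
c1_to_cp1252 = {}
for byte in range(0x80, 0xA0):
    try:
        c1_to_cp1252[byte] = ord(bytes([byte]).decode('cp1252'))
    except UnicodeDecodeError:
        pass

# Decode permissively as latin1, then remap the C1 range:
text = b'caf\xe9 \x93quoted\x94'.decode('latin1').translate(c1_to_cp1252)
print(text)  # café “quoted”
```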
 
Ian Kelly

If you need to support either, you can parse it as ISO-8859-1 then
explicitly convert C1 codes to their Windows-1252 equivalents as a
post-processing step, e.g. using the .translate() method.

Or just create a custom codec by taking the one in
Lib/encodings/cp1252.py and modifying it slightly.

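A sketch of that idea. The codec name `cp1252x` and the search function are hypothetical, but the table surgery mirrors what Lib/encodings/cp1252.py itself does, using the stdlib's own charmap machinery:

```python
import codecs
from encodings import cp1252

# Copy the stdlib cp1252 table and give the five UNDEFINED positions
# their C1 control-code values, as the WHATWG index does.
table = list(cp1252.decoding_table)
for byte in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
    table[byte] = chr(byte)
decoding_table = ''.join(table)
encoding_table = codecs.charmap_build(decoding_table)

def _search(name):
    # Hypothetical codec name; returns None for any other lookup.
    if name == 'cp1252x':
        return codecs.CodecInfo(
            name='cp1252x',
            encode=lambda text, errors='strict':
                codecs.charmap_encode(text, errors, encoding_table),
            decode=lambda data, errors='strict':
                codecs.charmap_decode(data, errors, decoding_table),
        )

codecs.register(_search)

# The previously-undefined bytes now decode, and round-trip:
print(b'\x81\x8d\x8f\x90\x9d'.decode('cp1252x'))
```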
 
buck

Microsoft's "embrace, extend, extinguish" strategy has been too
successful and now we have to deal with it. If HTML content is tagged as
using ISO-8859-1, it's more likely that it's actually Windows-1252 content
generated by someone who doesn't know the difference.

Yes that's exactly what it says.
Given that the only differences between the two are for code points which
are in the C1 range (0x80-0x9F), which should never occur in HTML, parsing
ISO-8859-1 as Windows-1252 should be harmless.

"should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt

and here:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
There are 65 code points set aside in the Unicode Standard for compatibility with the C0
and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code
points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit
controls 0x00 to 0x1F (C0 controls), 0x7F (delete), and 0x80 to 0x9F (C1 controls),
respectively ... There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically
equal to its corresponding Unicode code point.

IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the Unicode code point of equal value.

This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (§6.2, ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)
 
Ian Kelly

"should" is a wish. The reality is that documents (and especially URLs) exist that can be decoded with latin1, but will backtrace with cp1252. I see this as a sign that a small refactorization of cp1252 is in order. The proposal is to change those "UNDEFINED" entries to "<control>" entries, as is done here:

http://dvcs.w3.org/hg/encoding/raw-file/tip/index-windows-1252.txt

and here:

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the
other files. Examples of best fit
are converting fullwidth letters to their counterparts when converting
to single byte code pages, and
mapping the Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized
conversion scheme. It is also noted that the "BestFit" document is
not authoritative at:

http://www.iana.org/assignments/charset-reg/windows-1252

This is in line with the unicode standard, which says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf


IOW: Bytes with undefined semantics in the C0/C1 range are "control codes", which decode to the Unicode code point of equal value.

This is exactly the section which allows latin1 to decode 0x81 to U+0081, even though ISO-8859-1 explicitly does not define semantics for that byte (§6.2, ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf)

But Latin-1 explicitly defers to the control codes for those
characters. CP-1252 does not; the reason those characters are left
undefined is to allow for future expansion, such as when Microsoft
added the Euro sign at 0x80.

Since we're talking about conversion from bytes to Unicode, I think
the most authoritative source we could possibly reference would be the
official ISO 10646 conversion tables for the character sets in
question. I understand those are to be found here:

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

and here:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

Note that the ISO-8859-1 mapping defines the C0 and C1 codes, whereas
the cp1252 mapping leaves those five codes undefined. This would seem
to indicate that Python is correctly decoding CP-1252 according to the
Unicode standard.
 
Ian Kelly

The README for the "BestFit" document states:

"""
These tables include "best fit" behavior which is not present in the
other files. Examples of best fit
are converting fullwidth letters to their counterparts when converting
to single byte code pages, and
mapping the Infinity character to the number 8.
"""

This does not sound like appropriate behavior for a generalized
conversion scheme. It is also noted that the "BestFit" document is
not authoritative at:

http://www.iana.org/assignments/charset-reg/windows-1252

I meant to also comment on the first link, but forgot. As that
document is published by the W3C, I understand it to be specific to
the Web, which Python is not. Hence I think the more general Unicode
specification is more appropriate for Python.
 
Nobody

"should" is a wish. The reality is that documents (and especially URLs)
exist that can be decoded with latin1, but will backtrace with cp1252.

In which case, they're probably neither ISO-8859-1 nor Windows-1252, but
some other (unknown) encoding which has acquired the ISO-8859-1 label
"by default".

In that situation, if you still need to know the encoding, you need to
resort to heuristics such as those employed by the chardet library.
 
Dennis Lee Bieber

They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond
to bit combinations that do not represent graphic
characters. Their use is outside the scope of
ISO/IEC 8859; it is specified in other International
Standards, for example ISO/IEC 6429.
This quote only states that those positions do not represent
displayable glyphs, and indicates that 8859 is only concerned with
codings for display. It does NOT say they are "undefined".
 
