newbie question about character encoding: what does 0xC0 0x8A have in common with 0xE0 0x80 0x8A?


Jake Barnes

I'm afraid the quoted text below is almost gibberish to me. What do these
five formulations have in common? Is it true that they all specify the same
character? How is that possible?


====================================

http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs

An important note for developers of UTF-8 decoding routines: For
security reasons, a UTF-8 decoder must not accept UTF-8 sequences that
are longer than necessary to encode a character. For example, the
character U+000A (line feed) must be accepted from a UTF-8 stream only
in the form 0x0A, but not in any of the following five possible
overlong forms:

0xC0 0x8A
0xE0 0x80 0x8A
0xF0 0x80 0x80 0x8A
0xF8 0x80 0x80 0x80 0x8A
0xFC 0x80 0x80 0x80 0x80 0x8A
 

Jukka K. Korpela

Jake Barnes said:
I'm afraid the quoted text below is almost gibberish to me.

Is it relevant to you? Is it an XML issue?
An important note for developers of UTF-8 decoding routines:

Are you developing a UTF-8 decoder? How does that relate to XML?
(XML can be UTF-8 encoded, and often is, but so what?)
 

Richard Tobin

Jake Barnes said:
I'm afraid the quoted text below is almost gibberish to me. What do these
five formulations have in common? Is it true that they all specify the same
character? How is that possible?

UTF-8 represents Unicode characters as variable-length sequences of
bytes, with smaller Unicode numbers having shorter sequences.

Characters below 0x80 (those requiring at most 7 bits, the ASCII
characters) are represented as a single byte, identical to their ASCII
representation. So the character in your example, the line feed, is
represented as the single byte 0x0A.
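You can check that claim with Python's built-in encoder (a quick
illustration added here, not part of the original post):

print('\u000A'.encode('utf-8'))  # b'\n'  -> the single byte 0x0A
print('A'.encode('utf-8'))       # b'A'   -> 0x41, same as ASCII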

Characters from 0x80 to 0x7FF (those requiring between 8 and 11 bits)
are represented by two bytes. In binary, the bytes are 110xxxxx
10xxxxxx, the 11 bits being distributed with the high-order 5 in the
first byte and the low-order 6 in the second byte.

To put it another way, a character c is represented as 0xC0 + (c >> 6)
followed by 0x80 + (c & 0x3F).
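Here is that arithmetic worked through in Python for a character in the
two-byte range, U+00E9 (a small sketch added for illustration):

c = 0x00E9                       # U+00E9, LATIN SMALL LETTER E WITH ACUTE
first = 0xC0 + (c >> 6)          # high-order 5 bits -> 0xC3
second = 0x80 + (c & 0x3F)       # low-order 6 bits  -> 0xA9
print(hex(first), hex(second))   # 0xc3 0xa9
print('\u00e9'.encode('utf-8'))  # b'\xc3\xa9', Python's encoder agrees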

Now you *could* represent 0x0A in this two-byte form, as 11000000
10001010 (0xC0 0x8A), but UTF-8 says that you must not do this: you
must use the single byte version. And a UTF-8 decoder must give an
error if it encounters a linefeed encoded as 0xC0 0x8A.
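A strict decoder really does behave that way. Python's built-in one,
for example, rejects the overlong pair outright (again just an added
illustration):

bytes([0xC0, 0x8A]).decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0:
# invalid start byte
# (0xC0 can never start a valid sequence, since anything it could encode
# already fits in one byte)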

Similarly, characters from 0x800 to 0xFFFF (those requiring between 12
and 16 bits) are represented by three bytes. In binary, the bytes are
1110xxxx 10xxxxxx 10xxxxxx, with 4 of the 16 bits in the first byte and
6 in each of the second and third.

Again you *could* represent 0x0A in this three-byte form, as 11100000
10000000 10001010 (0xE0 0x80 0x8A), but again UTF-8 says you must not.
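The same kind of arithmetic shows where that three-byte form comes from
(a sketch, mirroring the two-byte example above):

c = 0x000A
b1 = 0xE0 + (c >> 12)             # 1110xxxx: top 4 bits, all zero -> 0xE0
b2 = 0x80 + ((c >> 6) & 0x3F)     # 10xxxxxx: middle 6 bits, zero  -> 0x80
b3 = 0x80 + (c & 0x3F)            # 10xxxxxx: low 6 bits           -> 0x8A
print(hex(b1), hex(b2), hex(b3))  # 0xe0 0x80 0x8a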

And so on. Each length of UTF-8 sequence has enough bits to represent
all the characters from zero up to some limit, but it must only be used
for the characters that can't be represented by a shorter sequence.
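That rule amounts to a table of minimum values, one per sequence
length. A minimal sketch of the check a decoder performs (the names
here are made up for illustration; the 5- and 6-byte rows come from the
original pre-RFC 3629 design quoted above):

# Smallest code point that legitimately needs each sequence length.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}

def is_overlong(code_point, seq_len):
    """True if code_point was encoded with more bytes than necessary."""
    return code_point < MIN_FOR_LENGTH[seq_len]

print(is_overlong(0x0A, 2))  # True  -> 0xC0 0x8A must be rejected
print(is_overlong(0xE9, 2))  # False -> 0xC3 0xA9 is fine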

-- Richard
 

Ian Rastall

How does that relate to XML?

Jukka's being ornery, but he does have an excellent introduction to
character code issues here: http://www.cs.tut.fi/~jkorpela/chars.html

Hope that's helpful, although the word "newbie" doesn't usually describe
someone who is wondering how six different byte sequences can represent
the same character, so maybe the link is of no use. :)

Ian
 

Jake Barnes

Jukka said:
Is it relevant to you? Is it an XML issue?


Are you developing a UTF-8 decoder? How does that relate to XML?
(XML can be UTF-8 encoded, and often is, but so what?)

I wrote a PHP script to generate an RSS feed from some weblog entries,
but the XML breaks because there are garbage characters in the feed.
Some people using the weblog script have been typing their entries in
Microsoft Word or other word processors, and then copying and pasting
the text into the weblog. I was trying to figure out how to clean up
the feed. To do so, I've been forced to study character encoding
issues.
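One common way to attack that (sketched here in Python rather than PHP,
with a made-up file name, just to show the idea): text pasted from Word
is usually Windows-1252, so try a strict UTF-8 decode first and fall
back to that codec rather than letting the garbage through.

raw = open('entry.txt', 'rb').read()   # hypothetical weblog entry bytes
try:
    text = raw.decode('utf-8')         # already clean UTF-8
except UnicodeDecodeError:
    # Word pastes are usually Windows-1252 ("smart quotes" and friends);
    # decoding with that codec recovers them instead of emitting garbage.
    text = raw.decode('windows-1252')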
 
