newbie question about character encoding: what does 0xC0 0x8A have in common with 0xE0 0x80 0x8A?

Discussion in 'XML' started by Jake Barnes, Nov 17, 2005.

  1. Jake Barnes

    Jake Barnes Guest

    I'm afraid the passage below is almost gibberish to me. What do these
    five formulations have in common? Is it true that they all specify the
    same character? How is that possible?


    ====================================

    http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs

    An important note for developers of UTF-8 decoding routines: For
    security reasons, a UTF-8 decoder must not accept UTF-8 sequences that
    are longer than necessary to encode a character. For example, the
    character U+000A (line feed) must be accepted from a UTF-8 stream only
    in the form 0x0A, but not in any of the following five possible
    overlong forms:

    0xC0 0x8A
    0xE0 0x80 0x8A
    0xF0 0x80 0x80 0x8A
    0xF8 0x80 0x80 0x80 0x8A
    0xFC 0x80 0x80 0x80 0x80 0x8A
    Jake Barnes, Nov 17, 2005
    #1

  2. "Jake Barnes" wrote:

    > I'm afraid the below is almost gibberish to me.


    Is it relevant to you? Is it an XML issue?

    > An important note for developers of UTF-8 decoding routines:


    Are you developing a UTF-8 decoder? How does that relate to XML?
    (XML can be UTF-8 encoded, and often is, but so what?)

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Jukka K. Korpela, Nov 17, 2005
    #2

  3. Jake Barnes wrote:

    >I'm afraid the below is almost gibberish to me. What do these 5
    >formulations have in common? Is it true that they all specify the same
    >character? How is that possible?


    UTF-8 represents Unicode characters as variable-length sequences of
    bytes, with smaller Unicode code points having shorter sequences.

    Characters below 0x80 (those requiring at most 7 bits, the ASCII
    characters) are represented as a single byte, and are the same as in
    their ASCII representations. So the example you quote, the line feed
    character, is represented as 0x0A.

    Characters from 0x80 to 0x7FF (those requiring between 8 and 11 bits)
    are represented by two bytes. In binary, the bytes are 110xxxxx
    10xxxxxx, the 11 bits being distributed with the high-order 5 in the
    first byte and the low-order 6 in the second byte.

    To put it another way, a character c is represented as 0xC0 + (c >> 6)
    followed by 0x80 + (c & 0x3F).
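
    A minimal Python sketch of that formula, for illustration only. The
    function name encode_two_byte is made up here, and the result is
    compared against Python's built-in UTF-8 encoder:

```python
def encode_two_byte(c: int) -> bytes:
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= c <= 0x7FF:
        raise ValueError("the two-byte form is only valid for 0x80..0x7FF")
    lead = 0xC0 + (c >> 6)     # 110xxxxx: the high-order 5 bits
    trail = 0x80 + (c & 0x3F)  # 10xxxxxx: the low-order 6 bits
    return bytes([lead, trail])


# U+00E9 (e with acute accent) needs 8 bits, so it takes two bytes.
print(encode_two_byte(0xE9).hex())                          # c3a9
print(encode_two_byte(0xE9) == chr(0xE9).encode("utf-8"))   # True
```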

    Now you *could* represent 0x0A in this two-byte form, as 11000000
    10001010 (0xC0 0x8A), but UTF-8 says that you must not do this: you
    must use the single byte version. And a UTF-8 decoder must give an
    error if it encounters a linefeed encoded as 0xC0 0x8A.
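
    As a quick check, here is a small Python sketch that builds the
    overlong pair by applying the two-byte formula to 0x0A and then asks
    Python's own strict UTF-8 decoder about it. The helper name
    is_valid_utf8 is invented for this example:

```python
c = 0x0A  # line feed

# Blindly applying the two-byte formula to a character below 0x80
# produces the overlong form.
overlong = bytes([0xC0 + (c >> 6), 0x80 + (c & 0x3F)])
print(overlong.hex())  # c08a


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x0a"))   # True: the single-byte form is accepted
print(is_valid_utf8(overlong))  # False: the overlong form is rejected
```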

    Similarly, characters from 0x800 to 0xFFFF (those requiring between 12
    and 16 bits) are represented by three bytes. In binary, the bytes are
    1110xxxx 10xxxxxx 10xxxxxx, with 4 of the 16 bits in the first byte
    and 6 in each of the second and third.

    Again you *could* represent 0x0A in this three-byte form, as 11100000
    10000000 10001010 (0xE0 0x80 0x8A), but again UTF-8 says you must not.

    And so on. Each length of UTF-8 sequence has enough bits to represent
    all the characters from zero up to some limit, but it must only be
    used for representing the characters that can't be represented by a
    shorter sequence.
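
    The rule can be sketched as a toy decoder for a single sequence. This
    is only an illustration of the overlong check, not a complete
    decoder: the names MIN_FOR_LENGTH and decode_one are invented here,
    it does not check for surrogates or values above 0x10FFFF, and it
    handles only sequences of up to four bytes (current UTF-8, as defined
    in RFC 3629, no longer allows the five- and six-byte forms at all).

```python
# Smallest code point that legitimately needs each sequence length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(seq: bytes) -> int:
    """Decode one UTF-8 sequence to a code point, rejecting overlong forms."""
    if not seq:
        raise ValueError("empty sequence")
    first = seq[0]
    if first < 0x80:               # 0xxxxxxx
        length, value = 1, first
    elif first >> 5 == 0b110:      # 110xxxxx
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:     # 1110xxxx
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:    # 11110xxx
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes")
    for b in seq[1:]:
        if b >> 6 != 0b10:         # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    # The payload bits may be correct, but a shorter form must be used
    # whenever one exists.
    if value < MIN_FOR_LENGTH[length]:
        raise ValueError("overlong encoding")
    return value


print(hex(decode_one(b"\x0a")))  # 0xa
for seq in (b"\xc0\x8a", b"\xe0\x80\x8a", b"\xf0\x80\x80\x8a"):
    try:
        decode_one(seq)
    except ValueError as err:
        print(seq.hex(), "->", err)  # each one reports "overlong encoding"
```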

    -- Richard
    Richard Tobin, Nov 17, 2005
    #3
  4. Ian Rastall

    Ian Rastall Guest

    On Thu, 17 Nov 2005 23:11:55 +0000 (UTC), "Jukka K. Korpela" wrote:

    >How does that relate to XML?


    Jukka's being ornery, but he does have an excellent introduction to
    character code issues here: http://www.cs.tut.fi/~jkorpela/chars.html

    Hope that's helpful, although the word "newbie" doesn't usually refer
    to someone who is wondering how six different hexadecimal numbers can
    refer to the same character, so maybe the link is of no use. :)

    Ian
    Ian Rastall, Nov 18, 2005
    #4
  5. Jake Barnes

    Jake Barnes Guest

    Re: newbie question about character encoding: what does 0xC0 0x8A have in common with 0xE0 0x80 0x8A???

    Jukka K. Korpela wrote:
    > "Jake Barnes" wrote:
    >
    > > I'm afraid the below is almost gibberish to me.
    >
    > Is it relevant to you? Is it an XML issue?
    >
    > > An important note for developers of UTF-8 decoding routines:
    >
    > Are you developing a UTF-8 decoder? How does that relate to XML?
    > (XML can be UTF-8 encoded, and often is, but so what?)


    I wrote a PHP script to generate an RSS feed from some weblog entries,
    but the XML dies because there are garbage characters in the feed. Some
    people using the weblog script have been typing their entries in
    Microsoft Word or other word processors, and then copying and pasting
    the text into the weblogs. I was trying to figure out how to clean up
    the feed. To do so, I've been forced to study character encoding
    issues.
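
    One possible shape for such a cleanup step, sketched in Python rather
    than PHP and resting on assumptions the thread does not confirm: that
    the stored entries are raw bytes, that anything which is not valid
    UTF-8 was pasted as Windows-1252 (a common source of Word's curly
    quotes), and that the remaining problem is characters XML 1.0 does
    not allow. The names _XML_ILLEGAL and clean_for_xml are invented for
    this example.

```python
import re

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_for_xml(raw: bytes) -> str:
    """Decode raw entry bytes and drop characters XML 1.0 forbids."""
    try:
        # Strict decoding: malformed and overlong sequences raise an error.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is Windows-1252. Bytes that are
        # undefined in that code page become U+FFFD.
        text = raw.decode("cp1252", errors="replace")
    return _XML_ILLEGAL.sub("", text)


# Curly quotes as Windows-1252 bytes, plus a stray vertical tab.
print(clean_for_xml(b"\x93quoted\x94 text\x0b"))
```

    The result would still need the usual escaping of & and < before
    being written into the feed.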
    Jake Barnes, Dec 5, 2005
    #5