Confusion between UTF-8 and Unicode

Discussion in 'Java' started by Celia, Mar 16, 2005.

  1. Celia

    Celia Guest

    I've looked up UTF-8 and Unicode in the Wikipedia, and at Dictionary.com,
    but I'm not grokking it yet.

    From what I understand:

    Unicode:
    Every human language character 'a', '7', '*', etc. is converted into a
    16-bit number.

    UTF-8:
    Every human language character is converted into a 1- or 2-byte number
    to make it align with ASCII and be usable with non-Unicode-enabled apps.


    According to Wikipedia:
    UTF-8 (8-bit Unicode Transformation Format) is a lossless, variable-
    length character encoding for Unicode...


    If these are correct descriptions, then that would make UTF-8 _not_
    something which is on top of, or for, Unicode, but a variation of Unicode.
    I thought Unicode _is_ a character encoding.



    Please show me my ignorance.
    Non-technical analogies would be particularly helpful.



    -C
     
    Celia, Mar 16, 2005
    #1

  2. Celia

    shakah Guest

    I think Unicode is always two 8-bit bytes per character, while UTF-8 is
    either 1 or 2 bytes per character.
     
    shakah, Mar 16, 2005
    #2

  3. Celia

    Malte Guest

    Malte, Mar 16, 2005
    #3
  4. Rather than looking up somebody's definition, pay a visit to a solid
    source: <http://www.unicode.org/>. That should get you rolling.

    = Steve =
     
    Steve W. Jackson, Mar 16, 2005
    #4
  5. Celia

    Alan Moore Guest

    An encoding is a way to translate a stream of bits (on disk, in
    memory, etc.) into characters. Unicode is not an encoding, it's a
    character set: a way of assigning numeric values to characters, like
    ASCII. With ASCII, we never needed to make the distinction between a
    character set and an encoding, because each byte represents a
    character. But Unicode characters can have values up to 2^32, which
    means we would need four bytes to represent each character if we were
    to use the same approach to encoding as we do with ASCII.
    (Originally, Unicode character values only went up to 2^16, but they
    discovered that wasn't sufficient.) That was a pretty difficult pill
    for programmers and administrators to swallow, so UTF-8 was invented
    as a compromise. Characters in the 7-bit ASCII range only take one
    byte to encode, while two bytes can convey the extended ASCII
    characters plus many characters from other Western character sets. As
    the character values become larger, UTF-8 becomes less efficient than
    a simple numeric-value encoding, but if you're dealing mainly with
    ASCII, it works very well.

    This is all terribly simplified but I hope I've made the point:
    Unicode is not an encoding, it's a character set (THE character set,
    if you will). UTF-8 is an encoding of the Unicode character set.
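
    To make that concrete, here's a minimal Java sketch (the class name is
    mine, just for illustration) showing one Unicode string coming out as
    different bytes under different encodings:

        import java.io.UnsupportedEncodingException;

        public class CharsetVsEncoding {
            public static void main(String[] args) throws UnsupportedEncodingException {
                String s = "caf\u00e9";  // four Unicode characters, the last is 'é' (U+00E9)

                // Same abstract characters, three different byte representations:
                System.out.println(s.getBytes("UTF-8").length);    // 5 -- 'é' takes two bytes
                System.out.println(s.getBytes("UTF-16BE").length); // 8 -- two bytes per character
                System.out.println(s.getBytes("US-ASCII").length); // 4 -- but 'é' is replaced by '?'
            }
        }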
     
    Alan Moore, Mar 16, 2005
    #5
  6. Celia

    Edwin Martin Guest

    UTF-8 is an encoding of Unicode in such a way that a plain ASCII file is
    also a valid UTF-8 file (with the same contents, of course).
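
    A quick way to convince yourself of that in Java (just a sketch; it assumes
    the text really is 7-bit ASCII):

        import java.io.UnsupportedEncodingException;
        import java.util.Arrays;

        public class AsciiIsValidUtf8 {
            public static void main(String[] args) throws UnsupportedEncodingException {
                String text = "Hello, world!";                      // 7-bit ASCII only
                byte[] asAscii = text.getBytes("US-ASCII");
                byte[] asUtf8  = text.getBytes("UTF-8");

                // For pure ASCII, the two encodings produce identical bytes...
                System.out.println(Arrays.equals(asAscii, asUtf8)); // true

                // ...so an ASCII file decodes cleanly as UTF-8, with the same contents.
                System.out.println(new String(asAscii, "UTF-8"));   // Hello, world!
            }
        }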

    See also:

    The Absolute Minimum Every Software Developer Absolutely, Positively
    Must Know About Unicode and Character Sets (No Excuses!)

    http://www.joelonsoftware.com/articles/Unicode.html

    Edwin Martin
     
    Edwin Martin, Mar 16, 2005
    #6
  7. BTW, UTF-8 also produces 3 byte results. It has to in order to be lossless in encoding
    16-bit Unicode (think about it). Also, it doesn't encode every human language character
    because it only encodes 16-bit Unicode.
     
    Lee Fesperman, Mar 17, 2005
    #7
  8. Celia

    Aquila Deus Guest

    Ummmm, because many people and programs actually refer to the
    "Unicode-16" (UTF-16) encoding when they use the word "Unicode".
     
    Aquila Deus, Mar 17, 2005
    #8
  9. Celia

    blake.ong Guest

    Is it true that a webpage, for example, which uses Unicode can display
    different languages in the same webpage, while UTF-8 can't?
     
    blake.ong, Mar 17, 2005
    #9
  10. That is incorrect information.

    Unicode defines a _set_ _of_ _characters_.

    UTF-8 defines a way to represent the characters defined by
    Unicode in binary (== bytes). There are also other ways to
    represent Unicode characters in binary besides UTF-8
    (for example UTF-16BE and UTF-16LE).
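
    For example, here's a rough sketch (class name mine; needs Java 5) that
    prints the bytes the same character gets under three of those encodings:

        import java.io.UnsupportedEncodingException;

        public class ByteOrderDemo {
            public static void main(String[] args) throws UnsupportedEncodingException {
                String a = "A";  // U+0041
                for (String enc : new String[] {"UTF-8", "UTF-16BE", "UTF-16LE"}) {
                    StringBuilder hex = new StringBuilder(enc + ":");
                    for (byte b : a.getBytes(enc)) {
                        hex.append(String.format(" %02X", b));
                    }
                    // Prints: UTF-8: 41, then UTF-16BE: 00 41, then UTF-16LE: 41 00
                    System.out.println(hex);
                }
            }
        }

    Same character, three different byte sequences; the BE/LE pair differ only
    in byte order.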

    Web pages and different languages are most likely related
    to XML and its lang attribute.
     
    Antti S. Brax, Mar 17, 2005
    #10
  11. Celia

    Aquila Deus Guest

    Yes it can. UTF-8 is just one of the Unicode encodings.
    Exactly right; what's confusing is that many people use the term "Unicode"
    to refer to the "UTF-16BE" and/or "UTF-16LE" encoding.
    Save all files in UTF-8 and then you don't need to worry about
    languages anymore :)
     
    Aquila Deus, Mar 17, 2005
    #11
  12. Celia

    Bryce Guest

    Read this:
    http://www.joelonsoftware.com/articles/Unicode.html
    UTF-8 is a variable-length encoding. For normal US ASCII characters,
    only a single byte is required.

    No, Unicode is a "dictionary" of all characters. UTF-8 is an encoding
    of that dictionary; it's a way of representing a Unicode character in
    memory.

    For example:
    Let's take the Euro symbol. In Unicode, it's represented as:
    U+20AC

    It's represented in UTF-8 in memory as:
    E2 82 AC
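
    You can check that with a few lines of Java (class name mine; requires
    Java 5 for the for-each loop and printf):

        public class EuroInUtf8 {
            public static void main(String[] args) throws java.io.UnsupportedEncodingException {
                byte[] bytes = "\u20ac".getBytes("UTF-8");  // the Euro sign, U+20AC
                for (byte b : bytes) {
                    System.out.printf("%02X ", b);          // prints: E2 82 AC
                }
                System.out.println();
            }
        }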
    Read the article I posted above, and it should shed some light on the
    subject.
     
    Bryce, Mar 17, 2005
    #12
  13. Celia

    Bryce Guest

    Couldn't have said it any better.
     
    Bryce, Mar 17, 2005
    #13
  14. Celia

    Bryce Guest

    UTF-8 can use up to 6 bytes per character.
     
    Bryce, Mar 17, 2005
    #14
  15. Celia

    Bryce Guest

    A webpage can't "use" Unicode. It is either UTF-8, or some other
    encoding.
     
    Bryce, Mar 17, 2005
    #15
  16. Oops, I didn't realize that. I'm afraid my information had come from reverse
    engineering. After thinking on it later, I had come to the conclusion that the encoding
    could support 32-bit Unicode with 6 bytes. Thanks for the correction.
     
    Lee Fesperman, Mar 17, 2005
    #16
  17. Celia

    Chris Smith Guest

    It's actually even more complicated than that. Java, in all cases where
    it implements UTF-8, supports a kind of pseudo-UTF-8. This Java-
    specific encoding first encodes the Unicode text as UTF-16, and then
    uses only the 1-byte, 2-byte, and 3-byte forms of UTF-8. So it's
    *correct* to say that UTF-8 can be up to six bytes long, but it's
    perhaps misleading in the context of Java unless a disclaimer is added.
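
    A small sketch of the difference (class name mine; on a modern JRE this
    prints 4 and then 8):

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;

        public class ModifiedUtf8Sketch {
            public static void main(String[] args) throws IOException {
                // U+1D11E (a musical clef) lies outside the 16-bit range, so in Java
                // it is stored as a surrogate pair of two chars.
                String clef = "\uD834\uDD1E";

                // Standard UTF-8, via the "UTF-8" charset: a single 4-byte sequence.
                System.out.println(clef.getBytes("UTF-8").length);  // 4

                // Java's pseudo-UTF-8, via writeUTF(): each surrogate char is encoded
                // separately as 3 bytes, after a 2-byte length prefix, so 2 + 6 = 8.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                new DataOutputStream(buffer).writeUTF(clef);
                System.out.println(buffer.size());                  // 8
            }
        }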

    --
    www.designacourse.com
    The Easiest Way To Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
     
    Chris Smith, Mar 18, 2005
    #17
    Saving pages in UTF-8 only relieves me from worrying about
    encoding Å, Ä and Ö. Using UTF-8 won't magically give the
    English-speaking world a clue about how to pronounce them.
    :)
     
    Antti S. Brax, Mar 18, 2005
    #18
  19. Celia

    Chris Uppal Guest

    I hate to say it, but you are over-simplifying ;-)

    Unfortunately, the picture has become quite confused (and Sun, IMO, have
    unnecessarily and irresponsibly added to this). So here's my attempt to add to
    the confusion...

    Let's start with UTF-8. There are two "official" standards for the encoding
    known as UTF-8. One is in ISO/IEC 10646 (which I haven't read, btw, I'm going
    on hearsay here), and is summarised in RFC 2279. That defines an encoding of
    31-bit values in up to 6 bytes. I believe the same encoding would work
    perfectly well for the full 32-bit range, but it is artificially limited to
    31-bit values. The second "official" standard for UTF-8 is that of the
    Unicode consortium; their version of it is identical to the ISO version except
    that it is further limited (artificially) to the 24-bit range, and hence never
    requires more than 4 bytes to encode a value. IMO, this is a mistake on the
    part of the Unicode people -- implementations should be required to decode the
    full ISO range (including the extended private use area) rather than being
    required (as I understand it) to abort with an error if ISO-encoded >24-bit
    data is encountered. Still, in practice, for Unicode data (which is always 24
    bit or less) there is no difference between the formats.

    Now Sun enter the picture. Start with the situation before Java 5. Java (as
    of then) used Unicode internally. Not any /encoding/, just pure abstract
    Unicode data -- each String corresponds to a sequence of characters from the
    Unicode repertoire. That's all very nice and clear; unfortunately there are a
    couple of snakes in this Eden.

    One is that the primitive type 'char' is a 16-bit quantity, so most Unicode
    characters cannot be represented in Java. Fortunately those characters (the
    ones outside the 16-bit range) are used relatively infrequently, so we mostly
    managed to get along with Java the way it was. It's obviously a problem
    waiting to happen, though, especially if a Java program is receiving Unicode
    data from a source that is not hamstrung by a crippled Unicode implementation.
    (XML data is Unicode, for instance, and it'd be unfortunate if a Java XML
    implementation barfed when faced with perfectly valid XML).

    The second problem is less severe -- in fact it only causes confusion, not
    actual functional limitations. Sun decided to define their own encoding for
    Unicode data. I have no problems with that, it's a sensible encoding for its
    purpose(s). Where they displayed flabbergasting irresponsibility was to call
    it "UTF-8" too. Admittedly it's closely related to UTF-8, but it is neither
    upwardly nor downwardly compatible with it. That encoding (call it
    pseudo-UTF-8) can only encode values in the 16 bit range, and so never uses
    more than 3 bytes per "character" (however it uses 2 bytes for 0, whereas true
    UTF-8 uses only 1 byte). Sun blithely named various methods that
    manipulate data in this format with some variation on 'UTF' (e.g.
    ObjectOutputStream.writeUTF() or the JNI function GetStringUTFChars()), which
    has added to the confusion. OTOH, the CharsetEncoder called "UTF-8" does
    perform true UTF-8 encoding (not pseudo-UTF8), at least for the sequences of
    16-bit limited 'char's that could be fed to it prior to Java5.

    But Java programmers are rarely satisfied. We demand ever greater complexity,
    baroque over-engineering piled on confounding intricacy. So Sun, responding as
    ever to the needs of the community, decided to Act...

    Java 5 adds another layer of confusion. To Sun's credit, the misnamed
    references to "UTF-8" have been clearly documented as such (but not, alas,
    deprecated and renamed). However it was necessary to do something about the
    16-bit limit. To be honest, I don't think that Sun had any choice in the
    solution they've adopted, but that doesn't make it any less vile.

    Since Java 5, Strings (and similar) are no longer pure abstractions of Unicode
    character sequences. The 'char' datatype no longer represents (in any useful
    sense) a Unicode character. No, by fiat the objects that used to hold pure
    abstract Unicode, now contain an /encoded/ representation -- specifically
    UTF-16. The so-called 'char' datatype no longer holds pure Unicode characters,
    but instead is used to hold the 16-bit quantities that are used by the UTF-16
    encoding. String.charAt() no longer returns the nth character of the Unicode
    string, but returns the nth 16-bit value from the UTF-16 encoding of the
    Unicode string (and, as such, is useless in any context that is about the
    textual meaning of the string -- Character.isUpperCase(char) for instance no
    longer makes any sense at all). Actual semantic textual elements are now
    represented as 'int's. (Of course, Unicode makes it clear that the
    "characters" in a Unicode sequence do not necessarily map directly onto the
    "textual elements" that a human reader would perceive -- there are diacritical
    marks and so on -- but that's just another delicious layer of complexity in the
    cake...)
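
    A short sketch of what that means in practice (requires Java 5; the string
    and class name are just for illustration):

        public class SurrogatePairSketch {
            public static void main(String[] args) {
                // 'G' followed by U+1D11E, which Java stores as a surrogate pair.
                String s = "G\uD834\uDD1E";

                System.out.println(s.length());                      // 3 -- UTF-16 code units
                System.out.println(s.codePointCount(0, s.length())); // 2 -- Unicode characters

                // charAt(1) hands back an isolated high surrogate, not a usable character:
                System.out.println(Integer.toHexString(s.charAt(1)));      // d834
                // codePointAt(1) reassembles the pair into the real code point:
                System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
            }
        }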

    Incidentally, this means that some legal Java Strings are no longer legal
    Unicode. Not merely that they can (in principle) contain sequences that are
    meaningless when interpreted as UTF-16, but that they can contain sequences
    that conforming Unicode implementations are required to reject (for security
    reasons). I am reasonably hopeful that the Unicode CharsetEncoders will detect
    such malformed sequences and refuse to generate correspondingly malformed (and
    illegal) byte-sequences, but I haven't yet checked.

    All this is pretty unfortunate. We are left in a position where we can either
    do our own handling of the UTF-16 encoding (very error prone, especially as
    many mistaken assumptions about the textual meaning of 'char' values won't be
    caught by the compiler /or/ by unsophisticated testing), or switch over to
    using the newer APIs (which are unnecessarily clunky, IMO -- for instance, why
    is there no easy way to iterate over the logical elements of a String? They are
    also confusingly low-level and technical, with much talk of 'surrogate pairs'
    and so on). Or, I suppose, we could create our own Unicode-aware objects and use
    those in preference to the supplied 'char' and java.lang.String, but then what
    do we do with all the other software that expects to work with Strings (and
    similar) ?

    Oh yes, and what about quasi-UTF-8 ? Sun have seized the bull by the horns and
    /made no change/... An admittedly ingenious solution to a technical problem --
    arguably even quite elegant. But it does mean that the JVM communicates with
    the real world using data that is encoded twice; 24-bit Unicode data is first
    encoded into UTF-16, and then that is encoded again using the old quasi-UTF-8
    format. Thus a 24-bit character can require 1, 2, 3 or 6 bytes to encode.

    I love this stuff. Just love it...

    -- chris
     
    Chris Uppal, Mar 18, 2005
    #19
  20. Celia

    Alan Moore Guest

    We can tell. ;)

    I'm sure this was far more than the OP wanted to know, but it clears
    up some questions I've had for a while, so thanks, Chris. If only we
    could go back and make 'char' a 32-bit type...
     
    Alan Moore, Mar 18, 2005
    #20