Chris said:
It's actually even more complicated than that.
I hate to say it, but you are over-simplifying ;-)
Unfortunately, the picture has become quite confused (and Sun, IMO, have
unnecessarily and irresponsibly added to this). So here's my attempt to add to
the confusion...
Let's start with UTF-8. There are two "official" standards for the encoding
known as UTF-8. One is in ISO/IEC 10646 (which I haven't read, btw, I'm going
on hearsay here), and is summarised in RFC 2279. That defines an encoding of
31-bit values in up to 6 bytes. I believe the same encoding would work
perfectly well for the full 32-bit range, but it is artificially limited to
31-bit values. The second "official" standard for UTF-8 is that of the
Unicode consortium; their version of it is identical to the ISO version except
that it is further limited (artificially) to the Unicode range -- values up to
U+10FFFF, i.e. 21 bits -- and hence never requires more than 4 bytes to encode
a value. IMO, this is a mistake on the part of the Unicode people --
implementations should be required to decode the full ISO range (including the
extended private use area) rather than being required (as I understand it) to
abort with an error if ISO-encoded data beyond U+10FFFF is encountered. Still,
in practice, for Unicode data (which never goes beyond U+10FFFF) there is no
difference between the formats.
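To make the byte counts concrete, here's a quick sketch (my own illustration,
nothing official) of how many bytes an encoder needs per value under the
31-bit scheme:

    public class Utf8Length {
        // Bytes needed to encode one value under the 31-bit scheme
        // summarised in RFC 2279. Unicode's version of UTF-8 stops at
        // U+10FFFF, so the 5- and 6-byte forms never legally occur there.
        static int utf8Length(int value) {
            if (value < 0)         throw new IllegalArgumentException("31-bit only");
            if (value < 0x80)      return 1; //  7 bits
            if (value < 0x800)     return 2; // 11 bits
            if (value < 0x10000)   return 3; // 16 bits
            if (value < 0x200000)  return 4; // 21 bits (Unicode ends at U+10FFFF)
            if (value < 0x4000000) return 5; // 26 bits
            return 6;                        // up to 31 bits
        }
    }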
Now Sun enter the picture. Start with the situation before Java 5. Java (as
of then) used Unicode internally. Not any /encoding/, just pure abstract
Unicode data -- each String corresponds to a sequence of characters from the
Unicode repertoire. That's all very nice and clear; unfortunately there are a
couple of snakes in this Eden.
One is that the primitive type 'char' is a 16-bit quantity, so most Unicode
characters cannot be represented in Java. Fortunately those characters (the
ones outside the 16-bit range) are used relatively infrequently, so we mostly
managed to get along with Java the way it was. It's obviously a problem
waiting to happen, though, especially if a Java program is receiving Unicode
data from a source that is not hamstrung by a crippled Unicode implementation.
(XML data is Unicode, for instance, and it'd be unfortunate if a Java XML
implementation barfed when faced with perfectly valid XML).
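A toy illustration of the ceiling (my example, not Sun's):

    public class CharCeiling {
        public static void main(String[] args) {
            char biggest = '\uFFFF';   // the largest value a char can hold
            // char clef = 0x1D11E;    // U+1D11E: valid Unicode, but the
            //                         // compiler rejects it -- too big for char
            System.out.println((int) biggest);  // 65535
        }
    }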
The second problem is less severe -- in fact it only causes confusion, not
actual functional limitations. Sun decided to define their own encoding for
Unicode data. I have no problems with that, it's a sensible encoding for its
purpose(s). Where they displayed flabbergasting irresponsibility was to call
it "UTF-8" too. Admittedly it's closely related to UTF-8, but it is neither
upwardly nor downwardly compatible with it. That encoding (call it
pseudo-UTF-8) can only encode values in the 16-bit range, and so never uses
more than 3 bytes per "character" (however it uses 2 bytes for 0, whereas true
UTF-8 uses only 1 byte). Sun blithely named various methods that manipulate
data in this format with some variation on 'UTF' (e.g.
ObjectOutputStream.writeUTF() or the JNI function GetStringUTFChars()), which
has added to the confusion. OTOH, the Charset called "UTF-8" (and its
CharsetEncoder) does perform true UTF-8 encoding (not pseudo-UTF-8), at least
for the sequences of 16-bit limited 'char's that could be fed to it prior to
Java 5.
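The incompatibility around NUL is easy to demonstrate; here's a sketch of the
experiment (the byte counts are what the documented formats imply):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class PseudoUtf8Demo {
        public static void main(String[] args) throws Exception {
            String nul = "\u0000";

            // True UTF-8: NUL is a single 0x00 byte.
            byte[] real = nul.getBytes("UTF-8");

            // Pseudo-UTF-8 (via writeUTF): a 2-byte length prefix,
            // then NUL encoded as the two bytes 0xC0 0x80.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(nul);

            System.out.println(real.length);  // 1
            System.out.println(bos.size());   // 4 (2 length bytes + 2 data bytes)
        }
    }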
But Java programmers are rarely satisfied. We demand ever greater complexity,
baroque over-engineering piled on confounding intricacy. So Sun, responding as
ever to the needs of the community, decided to Act...
Java 5 adds another layer of confusion. To Sun's credit, the misnamed
references to "UTF-8" have been clearly documented as such (but not, alas,
deprecated and renamed). However it was necessary to do something about the
16-bit limit. To be honest, I don't think that Sun had any choice in the
solution they've adopted, but that doesn't make it any less vile.
Since Java 5, Strings (and similar) are no longer pure abstractions of Unicode
character sequences. The 'char' datatype no longer represents (in any useful
sense) a Unicode character. No, by fiat the objects that used to hold pure
abstract Unicode now contain an /encoded/ representation -- specifically
UTF-16. The so-called 'char' datatype no longer holds pure Unicode characters,
but instead is used to hold the 16-bit quantities that are used by the UTF-16
encoding. String.charAt() no longer returns the nth character of the Unicode
string, but returns the nth 16-bit value from the UTF-16 encoding of the
Unicode string (and, as such, is useless in any context that is about the
textual meaning of the string -- Character.isUpperCase(char) for instance no
longer makes any sense at all). Actual semantic textual elements are now
represented as 'int's. (Of course, Unicode makes it clear that the
"characters" in a Unicode sequence do not necessarily map directly onto the
"textual elements" that a human reader would perceive -- there are diacritical
marks and so on -- but that's just another delicious layer of complexity in the
cake...)
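For instance (a sketch using the new Java 5 methods):

    public class CharAtDemo {
        public static void main(String[] args) {
            // U+1D11E (MUSICAL SYMBOL G CLEF): one character, two 'char's.
            String clef = "\uD834\uDD1E";

            System.out.println(clef.length());                     // 2: UTF-16 units
            System.out.println(Integer.toHexString(clef.charAt(0)));      // d834: half
                                                                   // a surrogate pair
            System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e: the
                                                                   // actual character
            System.out.println(clef.codePointCount(0, clef.length()));    // 1
        }
    }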
Incidentally, this means that some legal Java Strings are no longer legal
Unicode. Not merely that they can (in principle) contain sequences that are
meaningless when interpreted as UTF-16, but that they can contain sequences
that conforming Unicode implementations are required to reject (for security
reasons). I am reasonably hopeful that the Unicode CharsetEncoders will detect
such malformed sequences and refuse to generate correspondingly malformed (and
illegal) byte-sequences, but I haven't yet checked.
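For anyone who wants to beat me to it, the check would look something like
this (I'm relying on a fresh encoder defaulting to the REPORT error action,
which is what the javadoc says):

    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    public class LoneSurrogate {
        public static void main(String[] args) {
            String bad = "\uD834";  // a lone high surrogate: not legal Unicode
            CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
            // A fresh encoder defaults to CodingErrorAction.REPORT, so
            // malformed input should surface as an exception, not bad bytes.
            try {
                enc.encode(CharBuffer.wrap(bad));
                System.out.println("encoded -- uh-oh");
            } catch (CharacterCodingException e) {
                System.out.println("rejected: " + e);  // MalformedInputException
            }
        }
    }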
All this is pretty unfortunate. We are left in a position where we can either
do our own handling of the UTF-16 encoding (very error prone, especially as
many mistaken assumptions about the textual meaning of 'char' values won't be
caught by the compiler /or/ by unsophisticated testing), or switch over to
using the newer APIs (which are unnecessarily clunky, IMO. For instance why
is there no easy way to iterate over the logical elements of a String ? They
are also confusingly low-level and technical, with much talk of 'surrogate
pairs' and so on). Or, I suppose, we could create our own Unicode-aware
objects and use those in preference to the supplied 'char' and
java.lang.String, but then what do we do with all the other software that
expects to work with Strings (and similar) ?
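For the record, the loop we're reduced to writing by hand looks something
like this (my own sketch; nothing readier-made ships with Java 5):

    public class CodePointLoop {
        // Iterate the logical elements (code points) of a String by hand.
        static void printCodePoints(String s) {
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);        // the next logical element
                System.out.println(Integer.toHexString(cp));
                i += Character.charCount(cp);     // 1 or 2 'char's per code point
            }
        }

        public static void main(String[] args) {
            printCodePoints("A\uD834\uDD1E!");  // three elements: 41, 1d11e, 21
        }
    }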
Oh yes, and what about pseudo-UTF-8 ? Sun have seized the bull by the horns
and /made no change/... An admittedly ingenious solution to a technical
problem -- arguably even quite elegant. But it does mean that the JVM
communicates with the real world using data that is encoded twice; Unicode
data is first encoded into UTF-16, and then that is encoded again using the
old pseudo-UTF-8 format. Thus a single Unicode character can require 1, 2, 3
or 6 bytes to encode.
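A sketch of the arithmetic for a character outside the 16-bit range:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class DoubleEncodingDemo {
        public static void main(String[] args) throws Exception {
            String clef = "\uD834\uDD1E";  // U+1D11E again

            // True UTF-8: 4 bytes for this character.
            System.out.println(clef.getBytes("UTF-8").length);  // 4

            // Pseudo-UTF-8: each UTF-16 surrogate gets 3 bytes -> 6 bytes,
            // plus writeUTF's 2-byte length prefix -> 8 written in total.
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(clef);
            System.out.println(bos.size() - 2);  // 6
        }
    }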
I love this stuff. Just love it...
-- chris