However, there is still one major error. It's near the bottom under
"Exploring Java's UTF Support". First off, it still isn't plain that 2
out of the four options you mention (1 and 3) have /nothing at all/ to
do with UTF-8. The so-called "modified UTF-8" format is not compatible
(upwards or downwards) with UTF-8. So I don't think you should mix
references to the two together, and certainly not intermingle them as
if they were all of comparable relevance. Specifically, the page
states (slightly further up, under "DataOutputStream.writeUTF()") that
the length is "followed by a standard UTF-8 byte encoding of the
String"; that is simply not true. You note already that Quasi-UTF-8
encodes 0x0 differently from UTF-8, which all by itself is enough to
make writeUTF() useless for interoperability with standards-compliant
encodings.
I disagree. The only difference for 16-bit characters is the way 0 is encoded,
and the Sun encoding comes out in the wash even when you decode making
no special provision for it. You are making a mountain out of a null.
They behave 99% the same way so it makes sense to discuss them both
under the
http://mindprod.com/jgloss/utf.html
It is even less of a difference from a practical point of view than
the presence or absence of BOMs.
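
For anyone who wants to see exactly what is being argued over, here is a
minimal sketch that puts the two encodings side by side. The class and
method names (ModifiedUtf8Demo, modifiedUtf8, standardUtf8, hex) are made up
for illustration; only DataOutputStream.writeUTF() and
String.getBytes(Charset) are real library calls. It assumes a JDK recent
enough to have StandardCharsets (Java 7 or later).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    /** Bytes produced by writeUTF(), with the 2-byte length prefix removed. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Bytes produced by the platform's standard UTF-8 encoder. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Space-separated upper-case hex dump, e.g. "C0 80". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'A', e-acute, euro sign, and the null character.
        String[] samples = { "A", "\u00e9", "\u20ac", "\0" };
        for (String s : samples) {
            System.out.println("standard: " + hex(standardUtf8(s))
                    + "   writeUTF: " + hex(modifiedUtf8(s)));
        }
    }
}
```

The first three samples come out byte-for-byte identical. Only the null
character differs: 00 in standard UTF-8 versus C0 80 from writeUTF().
Whether a given decoder accepts C0 80 depends on how strict it is, so test
against the decoder you actually have to interoperate with.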
Personally, I don't see the point of any great rush to support 32-bit
Unicode. The new symbols will be rarely used. Consider what's there.
The only ones I would conceivably use are musical symbols and
Mathematical Alphanumeric symbols (especially the German black letters
so favoured in real analysis). The rest I can't imagine ever using
unless I took up a career in anthropology, i.e. Linear B syllabary (I
have not a clue what it is), Linear B ideograms (looks like symbols
for categorising cave petroglyphs), Aegean Numbers (counting with
stones and sticks), Old Italic (looks like Phoenician), Gothic
(medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
(George Bernard Shaw's phonetic script), Osmanya (Somalian), Cypriot
syllabary, Byzantine music symbols (looks like Arabic), Musical
Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
extensions (Chinese Japanese Korean) and tags (letters with blank
“price tags”).
I think 32-bit Unicode becomes a matter of the tail wagging the dog,
spurred by the technical challenge rather than a practical necessity.
In the process, ordinary 16-bit character handling is turned into a
bleeding mess, for almost no benefit.
I think we should for the most part simply ignore 32-bit and continue
using the String class as we always have, presuming every character is
16 bits.
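
For reference, here is a rough sketch of what the 16-bit assumption means
in practice if one of those rare characters does turn up. It assumes Java 5
or later, where String.codePointCount() and Character.toChars() exist; the
class name SupplementaryDemo and its helpers are made up for illustration.
The sample character is U+1D504, a Fraktur capital A from the Mathematical
Alphanumeric Symbols block.

```java
class SupplementaryDemo {
    // U+1D504 MATHEMATICAL FRAKTUR CAPITAL A, stored as a surrogate pair.
    static final String FRAKTUR_A = new String(Character.toChars(0x1D504));

    /** Number of 16-bit char units, which is what length() reports. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points, counting a surrogate pair as one. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "x" + FRAKTUR_A;
        System.out.println("chars:       " + charCount(s));       // 3
        System.out.println("code points: " + codePointCount(s));  // 2
        // charAt(1) is only the first half of the Fraktur A.
        System.out.println("high surrogate at 1: "
                + Character.isHighSurrogate(s.charAt(1)));        // true
    }
}
```

Code that never indexes into the middle of such a string carries the pair
through untouched, so the 16-bit assumption only bites when you count,
slice, or reverse by char.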