Mike said:
"Encoding" in Java specifically means "way of representing 16-bit unicode
characters in 8-bit bytes". Characters in a Java string *are* 16-bit
unicode. In that sense, they're not encoded, because they're in their
native form.
I don't really like that way of looking at it -- I think it's misleading.
Here's how I see it:
There are two ways to think of Java Strings.
The first is the way that we are /supposed/ to be able to think about them, and
it is usually the best way. But, unfortunately, it is technically incorrect.
The second is technically correct but is harder to think about and may cause
confusion.
So here's the first way. Strings are collections of characters. Characters
are Unicode characters. And as such Strings and chars are pure Unicode data.
There is no "encoding" involved at all (since encoding is how you translate
pure Unicode data into sequences of bytes -- and Java's Strings are not
sequences of bytes). So you manipulate Strings and chars directly without
worrying about encodings (which are irrelevant). It's only when you want to
convert between Strings and sequences of bytes (e.g. writing to a file) that you
have to consider what encoding to use. And you always /do/ have to consider it,
since files don't hold Strings, only sequences of bytes. If you want to
put Strings into a file then you /have/ to choose an encoding -- if you don't
then the system will choose one for you, which often isn't what you want it to
do.
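
To make the "you have to choose an encoding" point concrete, here is a minimal
sketch. The class and method names (EncodingDemo, toUtf8, toLatin1) are made up
for illustration; the only library calls are String.getBytes(Charset) and the
String(byte[], Charset) constructor. The same four-char String turns into a
different number of bytes depending on which encoding is chosen:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: encode a String as bytes, naming the charset explicitly.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: same String, different encoding, different bytes.
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = "caf\u00e9";                  // 4 chars; the last is U+00E9
        System.out.println(s.length());          // 4
        System.out.println(toUtf8(s).length);    // 5 (U+00E9 takes two bytes in UTF-8)
        System.out.println(toLatin1(s).length);  // 4 (one byte per char in ISO-8859-1)

        // Decoding needs the same charset that was used for encoding.
        String roundTrip = new String(toUtf8(s), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(s)); // true
    }
}
```

Calling s.getBytes() with no argument would compile too, but then the charset
is whatever the runtime's default happens to be.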
That's the simple version of the story. Now the second version, which is
technically accurate, but much nastier.
Due to an unfortunate set of circumstances Java has hardwired the idea that
there are <= 2**16 Unicode characters. That assumption is incorrect. It is
unfortunate that Unicode didn't go public on that until a few months after Java
became set in stone (although there /must/ have been people working for Sun who
knew all about it long before that). It's even more unfortunate that the size
of a char /was/ set in stone; and very, very unfortunate that instead of
responding to the problem instantly, the Java designers spent about a decade
apparently hoping that the problem would just go away by itself. It didn't,
and instead the situation grew worse and worse...
Anyway, brickbats aside, what has happened is that since the 16-bit limit on a
char cannot be changed, Sun have been forced to redefine what a String /is/.
It is no longer considered to be "pure Unicode data", but is now formally
a sequence of 16-bit values which /encode/ a Unicode string using UTF-16: a
character that doesn't fit in 16 bits is represented by a pair of chars (a
"surrogate pair"). So, even though Strings are not sequences of bytes, it is
technically correct to say that Java's Strings are encoded in UTF-16.
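
A small sketch of what "encoded in UTF-16" looks like in practice. The class
name SurrogateDemo and the helper fromCodePoint are hypothetical; the calls
used (Character.toChars, String.codePointCount, Character.isHighSurrogate,
Character.isLowSurrogate) are standard since Java 5. U+1D11E (MUSICAL SYMBOL G
CLEF) is one Unicode character but does not fit in a single 16-bit char:

```java
public class SurrogateDemo {
    // Hypothetical helper: build a String holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E);

        // One Unicode character, but two 16-bit chars in the String.
        System.out.println(clef.length());                          // 2
        System.out.println(clef.codePointCount(0, clef.length()));  // 1

        // The two chars are the UTF-16 surrogate pair for U+1D11E.
        System.out.println(Integer.toHexString(clef.charAt(0)));    // d834
        System.out.println(Integer.toHexString(clef.charAt(1)));    // dd1e
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
    }
}
```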
Fortunately, for many purposes, we can still use the simpler picture ("Strings
are pure Unicode"), since that works perfectly well provided we are only using
characters in the 16-bit range of Unicode, U+0000 to U+FFFF (as the OP's
example did). But if we have to deal with characters outside that range, then
we have to use the second, more complicated, picture to understand what's
going on.
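
For code that does have to cope with characters outside the 16-bit range, one
common approach is to walk the String by code point rather than by char. This
is only a sketch: CodePointWalk and codePointsOf are illustrative names, while
String.codePointAt and Character.charCount are the standard methods it relies
on.

```java
public class CodePointWalk {
    // Hypothetical helper: return the Unicode code points of a String, in order.
    static int[] codePointsOf(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);    // reads one char, or a surrogate pair
            out[n++] = cp;
            i += Character.charCount(cp); // advance by 1 or 2 chars accordingly
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";      // 'a', U+1D11E, 'b': 4 chars, 3 characters
        System.out.println(s.length());   // 4
        for (int cp : codePointsOf(s)) {
            System.out.println(Integer.toHexString(cp)); // 61, then 1d11e, then 62
        }
    }
}
```

A plain charAt loop over the same String would see four values, two of which
are surrogate halves rather than characters.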
-- chris