Harald Kirsch said:
Given a ByteBuffer, I use a CharsetDecoder to
decode the bytes into a sequence of characters.
Assuming that there is no decoding error, will the
corresponding CharsetEncoder applied to the character
sequence (well, copying it to a CharBuffer)
result in the identical sequence of bytes I
started with?
Put another way: Are there UNICODE encodings where one
character can have more than one code sequence (in the
same encoding)?
Short answer: there are character sequences that have more than one
encoding, and there are byte sequences that have more than one character
representation.
Longer answer: there are Unicode character sequences that are supposed to be
semantically identical to other (single) Unicode characters. It would be
legal (but unlikely) for a character encoder to substitute a semantically
identical sequence of characters before doing the conversion to bytes.
Likewise, a decoder could decode a sequence of bytes, sense that one of
these character sequences had been produced, and substitute the single
character.
For example:
The character sequence:
0x0055 (LATIN CAPITAL LETTER U)
0x0308 (COMBINING DIAERESIS)
is supposed to be semantically identical to:
0x00DC (LATIN CAPITAL LETTER U WITH DIAERESIS)
It is legal to represent either of these in UTF-8 with:
0x55
0xCC
0x88
Or:
0xC3
0x9C
And likewise, to convert either sequence of bytes to either sequence of
characters. (Note that the first sequence of characters is trivially
converted to the first sequence of bytes; the second sequence of bytes is a
legal representation only because the spec says that the first character
sequence is identical to the second character sequence, not because there's
any magic bit-twiddling that will produce that second byte sequence from the
first character sequence.)
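To make the example concrete, here is a minimal sketch that prints the UTF-8 bytes the JDK produces for each character sequence, and checks that the two strings differ character-for-character but become equal after normalization. The class name CanonicalEquivalenceDemo and the helper hex are made up for illustration; String.getBytes and java.text.Normalizer are standard JDK APIs.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {
    // Hypothetical helper: formats bytes as "0x55 0xCC 0x88".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "\u0055\u0308"; // U followed by COMBINING DIAERESIS
        String precomposed = "\u00DC";      // U WITH DIAERESIS as one character

        // Plain bit-twiddling encoding, no substitution of equivalent sequences.
        System.out.println(hex(decomposed.getBytes(StandardCharsets.UTF_8)));  // 0x55 0xCC 0x88
        System.out.println(hex(precomposed.getBytes(StandardCharsets.UTF_8))); // 0xC3 0x9C

        // Not equal as char sequences...
        System.out.println(decomposed.equals(precomposed)); // false
        // ...but equal once both are in the same normalization form.
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(precomposed)); // true
    }
}
```

If you need to compare strings that may have taken different routes through encoders and decoders, normalizing both sides to the same form first is one way to compare them semantically rather than character-for-character.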
I would expect that most character encoders do not bother to substitute
semantically-identical sequences, instead relying on simple bit-twiddling to
do the encoding. But beware that these kinds of substitutions are perfectly
legal. Therefore, encoding a string and decoding it back is supposed to
yield a semantically identical string, but it may not be exactly the same
string, character-for-character, as the one you started with.
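If you want to know how a particular charset behaves for your data, you can simply test it. The following is a rough sketch, not a guarantee about any charset: it decodes a byte array with a CharsetDecoder, re-encodes the result with the corresponding CharsetEncoder, and compares the bytes. The class name RoundTripCheck and the method roundTrips are hypothetical names; the decoder and encoder calls are the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {
    // Returns true if decoding and re-encoding reproduces the input bytes exactly.
    // Throws IllegalArgumentException if the input cannot be decoded or encoded.
    static boolean roundTrips(byte[] original, Charset cs) {
        try {
            CharBuffer chars = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(original));
            ByteBuffer encoded = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] result = new byte[encoded.remaining()];
            encoded.get(result);
            return Arrays.equals(original, result);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("coding error", e);
        }
    }

    public static void main(String[] args) {
        byte[] decomposed = {0x55, (byte) 0xCC, (byte) 0x88}; // U + COMBINING DIAERESIS
        byte[] precomposed = {(byte) 0xC3, (byte) 0x9C};      // U WITH DIAERESIS
        System.out.println(roundTrips(decomposed, StandardCharsets.UTF_8));
        System.out.println(roundTrips(precomposed, StandardCharsets.UTF_8));
    }
}
```

With the JDK's UTF-8 implementation I would expect both calls to print true, since it does no substitution of equivalent sequences, but a check like this only tells you about the inputs and charsets you actually try.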
-- Adam Maass