Java and Unicode: is decode always reversible by encode?

Harald Kirsch

Given a ByteBuffer, I use a CharsetDecoder to
decode the bytes into a sequence of characters.

Assuming that there is no decoding error, will the
corresponding CharsetEncoder applied to the character
sequence (well, copying it to a CharBuffer)
result in the identical sequence of bytes I
started with?

Put another way: Are there UNICODE encodings where one
character can have more than one code sequence (in the
same encoding)?

Harald.
 
Adam Maass

Harald Kirsch said:
Given a ByteBuffer, I use a CharsetDecoder to
decode the bytes into a sequence of characters.

Assuming that there is no decoding error, will the
corresponding CharsetEncoder applied to the character
sequence (well, copying it to a CharBuffer)
result in the identical sequence of bytes I
started with?

Put another way: Are there UNICODE encodings where one
character can have more than one code sequence (in the
same encoding)?

Short answer: there are character sequences that have more than one encoding.
And there are encodings that have more than one representation of a character.

Longer answer: there are Unicode character sequences that are supposed to be
semantically identical to other (single) Unicode characters; it would be
legal (but unlikely) for a character encoder to substitute a semantically
identical sequence of characters before doing the conversion to bytes. Or to
decode a sequence of bytes, sense that one of these character sequences had
been produced, and substitute the single character.

For example:

The character sequence:

0x0055 (LATIN CAPITAL LETTER U)
0x0308 (COMBINING DIAERESIS)

is supposed to be semantically identical to:

0x00DC (LATIN CAPITAL LETTER U WITH DIAERESIS)


It is legal to represent either of these in UTF-8 with:

0x55
0xCC
0x88

Or:

0xC3
0x9C


And likewise, to convert either sequence of bytes to either sequence of
characters. (Note that the first sequence of characters is trivially
converted to the first sequence of bytes; the second sequence of bytes is a
legal representation only because the spec says that the first character
sequence is identical to the second character sequence, not because there's
any magic bit-twiddling that will produce that second byte sequence from the
first character sequence.)

I would expect that most character encoders do not bother to substitute
semantically-identical sequences, instead relying on simple bit-twiddling to
do the encoding. But beware that these kinds of substitutions are perfectly
legal. Therefore, encoding a string and decoding it back is supposed to
yield a semantically identical string, but it may not be exactly the same
string, character-for-character, as the one you started with.
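To make the example concrete, here is a minimal Java sketch that prints the UTF-8 bytes of both forms and checks that they are canonically equivalent using java.text.Normalizer. The class name CanonicalEquivalence and the helper utf8Hex are made up for this illustration. It only shows what the JDK's built-in UTF-8 encoder does with each string; it says nothing about what other encoders might do.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class CanonicalEquivalence {
    // Hypothetical helper: UTF-8 bytes of a string as upper-case hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "U\u0308"; // U followed by COMBINING DIAERESIS
        String composed = "\u00DC";    // LATIN CAPITAL LETTER U WITH DIAERESIS

        System.out.println(utf8Hex(decomposed)); // 55 CC 88
        System.out.println(utf8Hex(composed));   // C3 9C

        // The strings differ char-for-char but normalize to the same NFC form.
        System.out.println(decomposed.equals(composed)); // false
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));        // true
    }
}
```

If you need to compare strings that may have passed through a normalizing step, comparing their NFC (or NFD) forms rather than the raw chars is one way to do it.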

-- Adam Maass
 
Harald Kirsch

Roedy Green said:
There is another problem. If you choose a traditional single-byte 8-bit
encoding, say Cp437, there are only 256 encodings to go round for all
64K Unicode characters. Obviously, some are going to have to collapse
onto the same character, and so won't decode back to where they
started.

Further, some of these 8-bit encodings have a few strange characters
that don't exist in Unicode.

This describes an encode/decode pair. And as you say, problems have to
be expected when converting from Unicode into a character set with only
256 characters.

My problem however is a decode/encode pair. And indeed then it was a bit
of a surprise that, as Adam Maass said, X -decode-> Y -encode-> Z does not
necessarily result in X.equals(Z).
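A case that is easy to reproduce, and that is separate from the equivalent-sequence issue Adam describes, is the JDK's "UTF-16" charset: as documented, it accepts either byte order (or no byte-order mark) when decoding, but always writes a big-endian byte-order mark when encoding. The sketch below is only an illustration; the class name RoundTrip and the method decodeThenEncode are made up, and the decoder is left in its default mode, which reports malformed input as an exception.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    // X -decode-> Y -encode-> Z, using a fresh decoder and encoder of one charset.
    static byte[] decodeThenEncode(byte[] x, Charset cs) {
        try {
            CharBuffer y = cs.newDecoder().decode(ByteBuffer.wrap(x));
            ByteBuffer z = cs.newEncoder().encode(y);
            byte[] result = new byte[z.remaining()];
            z.get(result);
            return result;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input; wrapped so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8: these bytes come back unchanged.
        byte[] utf8 = {0x55, (byte) 0xCC, (byte) 0x88};
        System.out.println(Arrays.equals(utf8, decodeThenEncode(utf8, StandardCharsets.UTF_8)));

        // UTF-16, little-endian with BOM: "A" comes back big-endian with BOM.
        byte[] le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(Arrays.equals(le, decodeThenEncode(le, StandardCharsets.UTF_16)));

        // UTF-16 without BOM: the result is two bytes longer than the input.
        byte[] noBom = {0x00, 0x41};
        System.out.println(decodeThenEncode(noBom, StandardCharsets.UTF_16).length);
    }
}
```

Note that in the last case even the length changes, so re-encoding is not a reliable way to recover how many bytes a decoded character sequence originally occupied.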

Since I actually only wanted to recover the length of the
byte sequence that results in a given char sequence, I have now found
another solution. I watch the decoder working and keep a cache
of the length info.
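One possible reading of this approach, sketched in Java (this is not my actual code; the class name DecodeLengths and the method byteLengths are made up for illustration): feed the decoder one more byte at a time and note how many bytes it had consumed each time it emits output. The sketch assumes a decoder that does not consume bytes of an incomplete sequence and does not silently swallow prefix bytes, which holds for the JDK's UTF-8 decoder; a charset with byte-order marks or escape sequences would need extra bookkeeping.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class DecodeLengths {
    // Returns one entry per decoding step: the number of input bytes that
    // produced that step's output (one char, or a surrogate pair).
    static List<Integer> byteLengths(byte[] input, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(input);
        CharBuffer out = CharBuffer.allocate(2); // room for a surrogate pair
        List<Integer> lengths = new ArrayList<>();
        int start = 0; // byte offset where the current character begins
        try {
            for (int end = 1; end <= input.length; end++) {
                in.limit(end); // expose one more byte to the decoder
                out.clear();
                boolean last = (end == input.length);
                CoderResult result = decoder.decode(in, out, last);
                if (result.isError()) {
                    result.throwException();
                }
                if (out.position() > 0) { // the decoder emitted something
                    lengths.add(in.position() - start);
                    start = in.position();
                }
            }
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        return lengths;
    }

    public static void main(String[] args) {
        byte[] bytes = {0x55, (byte) 0xCC, (byte) 0x88}; // U + COMBINING DIAERESIS
        System.out.println(byteLengths(bytes, StandardCharsets.UTF_8)); // [1, 2]
    }
}
```

Calling the decoder once per byte is slow; it is only meant to show where the length information comes from.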

Thanks,
Harald.
 
