Java and Unicode: is decode always reversible by encode?

Harald Kirsch · Aug 27, 2003

Given a ByteBuffer, I use a CharsetDecoder to
decode the bytes into a sequence of characters.

Assuming that there is no decoding error, will the
corresponding CharsetEncoder, applied to the character
sequence (well, after copying it to a CharBuffer),
result in the identical sequence of bytes I
started with?

Put another way: Are there Unicode encodings where one
character can have more than one code sequence (in the
same encoding)?

Harald.

Adam Maass · Aug 28, 2003

Harald Kirsch said:
Given a ByteBuffer, I use a CharsetDecoder to
decode the bytes into a sequence of characters.

Assuming that there is no decoding error, will the
corresponding CharsetEncoder, applied to the character
sequence (well, after copying it to a CharBuffer),
result in the identical sequence of bytes I
started with?

Put another way: Are there Unicode encodings where one
character can have more than one code sequence (in the
same encoding)?

Short answer: there are character sequences that have more than one encoding,
and there are encodings that have more than one representation of a character.

Longer answer: there are Unicode character sequences that are supposed to be
semantically identical to other (single) Unicode characters; it would be
legal (but unlikely) for a character encoder to substitute a semantically
identical sequence of characters before doing the conversion to bytes. Or to
decode a sequence of bytes, sense that one of these character sequences had
been produced, and substitute the single character.

For example:

The character sequence:

0x0055 (LATIN CAPITAL LETTER U)
0x0308 (COMBINING DIAERESIS)

is supposed to be semantically identical to:

0x00DC (LATIN CAPITAL LETTER U WITH DIAERESIS)

It is legal to represent either of these in UTF-8 with:

0x55
0xCC
0x88

Or:

0xC3
0x9C

And likewise, to convert either sequence of bytes to either sequence of
characters. (Note that the first sequence of characters is trivially
converted to the first sequence of bytes; the second sequence of bytes is a
legal representation only because the spec says that the first character
sequence is identical to the second character sequence, not because there's
any magic bit-twiddling that will produce that second byte sequence from the
first character sequence.)
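
To make the example concrete, here is a small sketch that prints the UTF-8 bytes the JDK produces for each of the two character sequences and then compares them under canonical composition. It assumes a JDK with java.text.Normalizer (Java 6 or later); the class name and the hex helper are made up for illustration. It only shows what the stock String.getBytes conversion does, not what every encoder is obliged to do.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class NormalizationDemo {
    // Hypothetical helper: renders a byte array as "0xNN 0xNN ...".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "\u0055\u0308"; // U followed by COMBINING DIAERESIS
        String composed = "\u00DC";         // precomposed U WITH DIAERESIS

        // The JDK's UTF-8 conversion encodes code points as given, without normalizing.
        System.out.println(hex(decomposed.getBytes(StandardCharsets.UTF_8))); // 0x55 0xCC 0x88
        System.out.println(hex(composed.getBytes(StandardCharsets.UTF_8)));   // 0xC3 0x9C

        // The strings differ char-for-char but agree after canonical composition (NFC).
        System.out.println(decomposed.equals(composed)); // false
        System.out.println(
            Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(composed)); // true
    }
}
```

If semantic equality is what matters, comparing the normalized forms of both strings is the safer test.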

I would expect that most character encoders do not bother to substitute
semantically-identical sequences, instead relying on simple bit-twiddling to
do the encoding. But beware that these kinds of substitutions are perfectly
legal. Therefore, encoding a string and decoding it back is supposed to
yield a semantically identical string, but it may not be exactly the same
string, character-for-character, as the one you started with.

-- Adam Maass

Harald Kirsch · Aug 28, 2003

Roedy Green said:
There is another problem. If you choose a traditional single-byte 8-bit
encoding, say Cp437, there are only 256 byte values to go round for all
64K Unicode characters. Obviously, some are going to have to collapse
onto the same byte, and so won't decode back to where they
started.

Further, some of these 8-bit encodings have a few strange characters
that don't exist in Unicode.

This describes an encode/decode pair. And as you say, problems have to
be expected when converting from Unicode into a character set with only
256 characters.
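
A minimal sketch of that encode/decode loss follows. It uses ISO-8859-1 as a stand-in for Cp437, purely because ISO-8859-1 is guaranteed to exist on every Java platform; the class and method names are hypothetical. String.getBytes replaces characters the charset cannot represent with the charset's replacement byte, so different inputs collapse onto the same output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LossyEncodeDemo {
    // Hypothetical helper: encode text to bytes, then decode those bytes again.
    static String encodeThenDecode(String text, Charset charset) {
        // Unmappable characters are silently replaced during encoding.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        String original = "a\u20ACb"; // 'a', EURO SIGN, 'b'

        // The euro sign has no ISO-8859-1 byte, so the encoder cannot map it.
        System.out.println(latin1.newEncoder().canEncode('\u20AC')); // false

        // After the round trip the euro sign has become a question mark.
        System.out.println(encodeThenDecode(original, latin1)); // a?b
    }
}
```

Calling canEncode before converting is one way to detect the loss up front instead of discovering it afterwards.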

My problem however is a decode/encode pair. And indeed then it was a bit
of a surprise that, as Adam Maass said, X -decode-> Y -encode-> Z does not
necessarily result in X.equals(Z).
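
The X -decode-> Y -encode-> Z check can be sketched as below. The class and method names are hypothetical, and both coders are set to REPORT errors so that nothing is silently replaced. One case where the JDK itself fails the check is the "UTF-16" charset: according to the Charset documentation it honours a little-endian byte-order mark when decoding but always writes big-endian with a big-endian mark when encoding.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class RoundTripCheck {
    // Hypothetical helper: true if decoding and re-encoding reproduces the input bytes.
    static boolean survivesRoundTrip(byte[] input, Charset charset) {
        try {
            CharBuffer chars = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            ByteBuffer encoded = charset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            // ByteBuffer.equals compares the remaining bytes of both buffers.
            return encoded.equals(ByteBuffer.wrap(input));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid in " + charset, e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0x9C};                     // U WITH DIAERESIS in UTF-8
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};      // little-endian BOM, then 'A'

        System.out.println(survivesRoundTrip(utf8, StandardCharsets.UTF_8));     // true
        System.out.println(survivesRoundTrip(utf16le, StandardCharsets.UTF_16)); // false
    }
}
```

The second input decodes to "A" without error, yet re-encoding yields the bytes FE FF 00 41 rather than the original FF FE 41 00.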

Since I actually only wanted to recover the length of the
byte sequence that produced a given char sequence, I have now found
another solution: I watch the decoder as it works and keep a cache
of the length information.
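
One way such bookkeeping might look is sketched here; it is an illustration, not the code actually used. It decodes into a one-char output buffer and records how far the input position moved on each step, retrying with a two-char buffer when a character needs a surrogate pair. The class and method names are hypothetical, and the one-entry-per-character result is an assumption that holds for the JDK's UTF-8 decoder; stateful charsets may consume escape bytes together with a character.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class DecodeLengthTracker {
    // Hypothetical helper: number of input bytes consumed by each decoding step.
    static List<Integer> byteLengths(byte[] input, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(input);
        CharBuffer one = CharBuffer.allocate(1);
        CharBuffer two = CharBuffer.allocate(2);
        List<Integer> lengths = new ArrayList<>();

        while (in.hasRemaining()) {
            int start = in.position();
            one.clear();
            CoderResult result = decoder.decode(in, one, true);
            if (!result.isError() && one.position() == 0 && in.position() == start) {
                // No progress: the next character needs room for a surrogate pair.
                two.clear();
                result = decoder.decode(in, two, true);
            }
            if (result.isError()) {
                throw new IllegalArgumentException("cannot decode at byte " + start);
            }
            if (in.position() == start) {
                throw new IllegalStateException("decoder made no progress at byte " + start);
            }
            lengths.add(in.position() - start);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // 'a', U WITH DIAERESIS, EURO SIGN, and one character outside the BMP.
        byte[] utf8 = "a\u00DC\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(byteLengths(utf8, StandardCharsets.UTF_8)); // [1, 2, 3, 4]
    }
}
```

Summing a prefix of the returned list gives the byte offset at which a given character started, without ever re-encoding.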

Thanks,
Harald.
