Java and Unicode: is decode always reversible by encode?

Discussion in 'Java' started by Harald Kirsch, Aug 27, 2003.

  1. Given a ByteBuffer, I use a CharsetDecoder to
    decode the bytes into a sequence of characters.

    Assuming that there is no decoding error, will the
    corresponding CharsetEncoder applied to the character
    sequence (well, copying it to a CharBuffer)
    result in the identical sequence of bytes I
    started with?

    Put another way: Are there UNICODE encodings where one
    character can have more than one code sequence (in the
    same encoding)?

    Harald.
     
    Harald Kirsch, Aug 27, 2003
    #1

  2. "Harald Kirsch" <> wrote:
    > Given a ByteBuffer, I use a CharsetDecoder to
    > decode the bytes into a sequence of characters.
    >
    > Assuming that there is no decoding error, will the
    > corresponding CharsetEncoder applied to the character
    > sequence (well, copying it to a CharBuffer)
    > result in the identical sequence of bytes I
    > started with?
    >
    > Put another way: Are there UNICODE encodings where one
    > character can have more than one code sequence (in the
    > same encoding)?
    >


    Short answer: there are sequences that have more than one encoding, and
    there are encodings that have more than one representation for a character.

    Longer answer: there are Unicode character sequences that are supposed to be
    semantically identical to other (single) Unicode characters. It would be
    legal (but unlikely) for a character encoder to substitute a semantically
    identical sequence of characters before doing the conversion to bytes, or
    for a decoder to recognize that the bytes it read produced one of these
    character sequences and substitute the single character.

    For example:

    The character sequence:

    U+0055 (LATIN CAPITAL LETTER U)
    U+0308 (COMBINING DIAERESIS)

    is supposed to be semantically identical to:

    U+00DC (LATIN CAPITAL LETTER U WITH DIAERESIS)


    It is legal to represent either of these in UTF-8 with:

    0x55
    0xCC
    0x88

    Or:

    0xC3
    0x9C


    And likewise, to convert either sequence of bytes to either sequence of
    characters. (Note that the first sequence of characters is trivially
    converted to the first sequence of bytes; the second sequence of bytes is a
    legal representation only because the spec says that the first character
    sequence is identical to the second character sequence, not because there's
    any magic bit-twiddling that will produce that second byte sequence from the
    first character sequence.)
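As a rough illustration of the example above, here is a minimal Java sketch. It assumes the JDK's built-in UTF-8 charset and `java.text.Normalizer` (Java 6 or later); the class and method names are made up for this sketch. It prints the bytes the JDK produces for each character sequence and checks that the two strings only compare equal after normalization.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class CanonicalEquivalenceDemo {
    static final String DECOMPOSED = "\u0055\u0308"; // U + COMBINING DIAERESIS
    static final String PRECOMPOSED = "\u00DC";      // U WITH DIAERESIS

    // Formats bytes as space-separated upper-case hex, e.g. "55 CC 88".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the string with the JDK's UTF-8 encoder, without any normalization.
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Compares two strings after normalizing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(DECOMPOSED));                       // 55 CC 88
        System.out.println(utf8Hex(PRECOMPOSED));                      // C3 9C
        System.out.println(DECOMPOSED.equals(PRECOMPOSED));            // false
        System.out.println(canonicallyEqual(DECOMPOSED, PRECOMPOSED)); // true
    }
}
```

In this sketch the encoder itself never swaps one form for the other; any substitution has to be requested explicitly through the normalizer.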

    I would expect that most character encoders do not bother to substitute
    semantically-identical sequences, instead relying on simple bit-twiddling to
    do the encoding. But beware that these kinds of substitutions are perfectly
    legal. Therefore, encoding a string and decoding it back is supposed to
    yield a semantically identical string, but it may not be exactly the same
    string, character-for-character, as the one you started with.
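One way to check a particular charset rather than rely on expectation is to run the decode/encode pair and compare bytes. The following is only a sketch under stated assumptions: `RoundTripCheck` and `roundTrips` are hypothetical names, both coders are set to report errors instead of replacing, and a decoding error is turned into an `IllegalArgumentException` because the question assumes there is none.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripCheck {
    // Returns true if decoding 'original' and re-encoding the characters
    // with the same charset yields exactly the same bytes.
    static boolean roundTrips(byte[] original, Charset cs) {
        try {
            CharBuffer chars = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(original));
            ByteBuffer encoded = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] result = new byte[encoded.remaining()];
            encoded.get(result);
            return Arrays.equals(original, result);
        } catch (CharacterCodingException e) {
            // The question assumes error-free input, so treat this as misuse.
            throw new IllegalArgumentException("input is not valid for " + cs, e);
        }
    }

    public static void main(String[] args) {
        byte[] decomposed = { 0x55, (byte) 0xCC, (byte) 0x88 };        // U+0055 U+0308
        byte[] precomposed = { (byte) 0xC3, (byte) 0x9C };             // U+00DC
        byte[] utf16le = { (byte) 0xFF, (byte) 0xFE, 0x61, 0x00 };     // BOM + 'a'
        System.out.println(roundTrips(decomposed, StandardCharsets.UTF_8));  // true
        System.out.println(roundTrips(precomposed, StandardCharsets.UTF_8)); // true
        System.out.println(roundTrips(utf16le, StandardCharsets.UTF_16));    // false
    }
}
```

With the JDK's UTF-8 coder both byte sequences come back unchanged, since it does no normalization. The third case shows a different way the round trip can fail: the JDK's "UTF-16" decoder accepts a little-endian byte-order mark, while its encoder always writes big-endian with a mark.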

    -- Adam Maass
     
    Adam Maass, Aug 28, 2003
    #2

  3. Roedy Green <> wrote in message news:<>...
    > On Wed, 27 Aug 2003 17:55:57 -0700, "Adam Maass"
    > <> wrote or quoted :
    >
    > >Therefore, encoding a string and decoding it back is supposed to
    > >yield a semantically identical string, but it may not be exactly the same
    > >string, character-for-character, as the one you started with.

    >
    > There is another problem. If you choose a traditional single-byte 8-bit
    > encoding, say Cp437, there are only 256 encodings to go round for all
    > 64K Unicode characters. Obviously, some are going to have to collapse
    > onto the same character, and so won't decode back to where they
    > started.
    >
    > Further, some of these 8-bit encodings have a few strange characters
    > that don't exist in Unicode.


    This describes an encode/decode pair. And as you say, problems have to
    be expected when converting from Unicode into a character set with only
    256 characters.
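The encode/decode direction can be sketched in a few lines. This example uses ISO-8859-1 as a stand-in for Cp437, only because it is guaranteed to be present in every JDK; the class and method names are made up. `String.getBytes(Charset)` replaces every unmappable character with the charset's replacement byte, which for ISO-8859-1 is `?`.

```java
import java.nio.charset.StandardCharsets;

class LossyEncodeDemo {
    // Encodes to a single-byte charset and decodes back again.
    // Characters outside ISO-8859-1 are replaced with '?' during encoding.
    static String encodeThenDecode(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(encodeThenDecode("\u00DC").equals("\u00DC")); // true: U+00DC is in ISO-8859-1
        System.out.println(encodeThenDecode("\u20AC"));                  // ?  (EURO SIGN is not)
        System.out.println(encodeThenDecode("\u0100"));                  // ?  (collapses onto the same byte)
    }
}
```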

    My problem however is a decode/encode pair. And indeed then it was a bit
    of a surprise that, as Adam Maass said, X -decode-> Y -encode-> Z does not
    necessarily result in X.equals(Z).

    Since I actually only wanted to recover the length of the
    byte sequence that produced a given char sequence, I have now found
    another solution: I watch the decoder while it works and keep a cache
    of the length info.
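The post does not show the code, so the following is only a guess at what "watching the decoder" could look like. `DecodeLengthTracker` and `track` are hypothetical names. The sketch feeds the decoder one more byte at a time and, whenever characters come out, records how many chars were produced and how many bytes were consumed for them. It assumes a decoder that leaves incomplete sequences unconsumed in the input buffer (as the JDK's UTF-8 decoder does) and ignores anything a charset might emit only on `flush`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class DecodeLengthTracker {
    // Returns one {charCount, byteCount} pair per chunk the decoder emitted.
    static List<int[]> track(byte[] bytes, Charset cs) {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(16);
        List<int[]> lengths = new ArrayList<>();
        int lastPos = 0;
        try {
            // Expose one additional byte per step so each chunk can be measured.
            for (int limit = 1; limit <= bytes.length; limit++) {
                in.limit(limit);
                boolean endOfInput = (limit == bytes.length);
                CoderResult result = dec.decode(in, out, endOfInput);
                if (result.isError()) {
                    result.throwException();
                }
                if (out.position() > 0) {
                    lengths.add(new int[] { out.position(), in.position() - lastPos });
                    lastPos = in.position();
                    out.clear();
                }
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid for " + cs, e);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00DC (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes, 2 chars)
        byte[] utf8 = "a\u00DC\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        for (int[] pair : track(utf8, StandardCharsets.UTF_8)) {
            System.out.println(pair[0] + " char(s) from " + pair[1] + " byte(s)");
        }
    }
}
```

Feeding single bytes is slow but keeps the bookkeeping simple; it also avoids the encoder entirely, so the result does not depend on the decode/encode pair being an exact inverse.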

    Thanks,
    Harald.
     
    Harald Kirsch, Aug 28, 2003
    #3
