What is the best charset to choose for binary serialization

Discussion in 'Java' started by mtp, Mar 27, 2006.

  1. mtp

    mtp Guest

    Hello,

    I need to binary serialize some strings in a Java application. Since
    there is no restriction at all on the strings, I need to handle all the
    characters that java.lang.String handles.

    What is the "innner" charset of String class? Since Java must store
    characters in memory, it must use some kind of internal charset. If i
    use the same, i won't have any trouble, i believe... am i right?

    So what is the best charset?

    Thanks
    mtp, Mar 27, 2006
    #1

  2. mtp wrote:
    >
    > What is the "innner" charset of String class? Since Java must store
    > characters in memory, it must use some kind of internal charset. If i
    > use the same, i won't have any trouble, i believe... am i right?


    Read the API doc; the answer is there in plain sight.

    /tom
    tom fredriksen, Mar 27, 2006
    #2

  3. mtp

    Chris Smith Guest

    mtp <> wrote:
    > I need to binary serialize some strings in a Java application. Since
    > there is no restriction at all on the strings, I need to handle all the
    > characters that java.lang.String handles.
    >
    > What is the "innner" charset of String class? Since Java must store
    > characters in memory, it must use some kind of internal charset. If i
    > use the same, i won't have any trouble, i believe... am i right?


    There are actually a couple of character sets that meet your requirements.
    They include UTF-16BE, UTF-16LE, and UTF-8. The difference between the
    first two (which differ only in endianness) and the last is that UTF-8
    is optimized to reduce the size of files that contain mostly ASCII
    characters, while the UTF-16 encodings will be smaller when the file
    contains random characters chosen from throughout the entire Unicode
    character set, or if it contains mostly characters not in the ISO Latin
    1 (ISO8859-1) range, which is a superset of ASCII. It's worth noting
    that Java's UTF-8 is *not* the same as the UTF-8 used throughout the
    remainder of the computing world, so you shouldn't assume compatibility
    with UTF-8 character decoders written in other languages.

    Internally, Java Strings are stored logically in UTF-16. The endianness
    is unspecified, because the String class will use Java primitive data
    types, whose endianness is never observable by a Java application.
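
    For instance, here's a minimal sketch of the round trip you'd do when
    serializing: encode with getBytes() using an explicit charset name and
    decode with the matching String constructor (the class name and the
    sample text below are just for illustration):

        import java.io.UnsupportedEncodingException;

        public class CharsetRoundTrip {
            public static void main(String[] args) throws UnsupportedEncodingException {
                // Mixed Latin-1 and CJK characters, so nothing fits in plain ASCII.
                String original = "h\u00e9llo \u4e16\u754c";

                // Encode to bytes with an explicit charset name...
                byte[] utf8Bytes = original.getBytes("UTF-8");
                byte[] utf16Bytes = original.getBytes("UTF-16BE");

                // ...and decode back with the *same* charset to recover the string.
                System.out.println(original.equals(new String(utf8Bytes, "UTF-8")));     // true
                System.out.println(original.equals(new String(utf16Bytes, "UTF-16BE"))); // true
            }
        }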

    --
    www.designacourse.com
    The Easiest Way To Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
    Chris Smith, Mar 27, 2006
    #3
  4. Chris Smith wrote:
    > 1 (ISO8859-1) range, which is a superset of ASCII. It's worth noting
    > that Java's UTF-8 is *not* the same as the UTF-8 used throughout the
    > remainder of the computing world, so you shouldn't assume compatibility
    > with UTF-8 character decoders written in other languages.


    Doesn't the "modified UTF-8" only apply to DataOutputStream,
    DataInputStream and related classes plus some JNI related stuff. The
    encoding used by java.nio.charset classes should be the true UTF-8.
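
    A quick sketch of where the two visibly differ (purely illustrative;
    the easiest character to show it with is NUL, which modified UTF-8
    encodes as two bytes, and writeUTF also prepends a 2-byte length):

        import java.io.ByteArrayOutputStream;
        import java.io.DataOutputStream;
        import java.io.IOException;

        public class ModifiedUtf8Demo {
            public static void main(String[] args) throws IOException {
                // A NUL (U+0000) between two letters; the two schemes encode it differently.
                String s = "a\0b";

                // True UTF-8 from the charset machinery: 61 00 62 (3 bytes).
                byte[] real = s.getBytes("UTF-8");

                // Modified UTF-8 from writeUTF: a 2-byte length prefix, then NUL
                // becomes the two bytes C0 80: 00 04 61 C0 80 62 (6 bytes).
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                new DataOutputStream(bos).writeUTF(s);
                byte[] modified = bos.toByteArray();

                System.out.println(real.length);      // 3
                System.out.println(modified.length);  // 6
            }
        }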

    Mark Thornton
    Mark Thornton, Mar 27, 2006
    #4
  5. mtp

    Guest

    Hi,

    Short answer: you can use UTF-8 and you shouldn't have
    any problems.

    Now I'll try to answer your questions ;)


    mtp wrote:
    > Hello,
    >
    > I need to binary serialize some strings in a Java application. Since
    > there is no restriction at all on the strings, I need to handle all the
    > characters that java.lang.String handles.


    The characters handled by java.lang.String depend on the version
    of Java you're using... Up to Java 1.4 you'll "only" be able to
    handle Unicode 3.0 code points correctly.

    From Java 1.5, you can handle "all" the Unicode code points (and
    the String class got new methods to this effect, like
    codePointAt(...)).


    > What is the "innner" charset of String class?


    You shouldn't care. All you should care about is what encoding is
    available when serializing and deserializing your strings.

    That said, I'll try to answer your question.

    The String class is based on the underlying char primitive which,
    unfortunately, is 16 bits wide. Java was designed at a time when
    Unicode didn't have more than 65536 code points defined yet... And
    at that time a Java char was equivalent to a "Unicode code unit"
    (check the Character class's API doc for the terminology).

    This has very funny implications, like:

    "some Unicode 3.1 and above string".length()

    not returning the length in "Unicode code points" but in "Java chars".
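
    For example (a small sketch using an arbitrary supplementary
    character; the class name is made up, and codePointCount is one of
    those new 1.5 methods):

        public class CodePointDemo {
            public static void main(String[] args) {
                // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so it needs
                // a surrogate pair: two Java chars for a single Unicode code point.
                String clef = "\uD834\uDD1E";

                System.out.println(clef.length());                         // 2 (char units)
                System.out.println(clef.codePointCount(0, clef.length())); // 1 (code points, Java 1.5+)
            }
        }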


    > Since Java must store characters in memory, it must use
    > some kind of internal charset.


    Before Java 1.5 it was known that the internal representation of
    several JVMs was UCS-2 (UTF-16 without surrogates). But AFAIK
    this was not specified by the spec (though I may be wrong).

    I've read in this group, years ago, that people have used this fact
    to do very fast DB to/from JVM string exchanges (e.g. by configuring
    the DB to use UCS-2).

    In Java 1.5 both String's and Character's API docs mention that
    UTF-16 is used (with surrogate support).


    > So what is the best charset?


    There's not really an answer to that. UTF-8 is pretty common and is
    mandated by the spec to be present in every J(2)SE JVM (you'll
    still have to catch an exception that, by the spec, can never
    actually be thrown when calling getBytes("UTF-8")).

    So usually it's a safe bet to go with UTF-8 encoding.
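
    Something like this sketch, say (the class and helper names below are
    mine, not from any library):

        import java.io.UnsupportedEncodingException;

        public class Utf8Codec {
            // UTF-8 support is mandatory, but the checked exception still must be handled.
            static byte[] toUtf8(String s) {
                try {
                    return s.getBytes("UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new RuntimeException("JVM without mandatory UTF-8 support?", e);
                }
            }

            static String fromUtf8(byte[] bytes) {
                try {
                    return new String(bytes, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new RuntimeException("JVM without mandatory UTF-8 support?", e);
                }
            }
        }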
    , Mar 27, 2006
    #5
  6. mtp

    Guest

    Hi tom,

    tom fredriksen wrote:
    > mtp wrote:
    > >
    > > What is the "innner" charset of String class? Since Java must store
    > > characters in memory, it must use some kind of internal charset. If i
    > > use the same, i won't have any trouble, i believe... am i right?

    >
    > Read the API doc; the answer is there in plain sight.


    If I check the Java 1.5 String API doc I do indeed see that UTF-16
    is used.

    What if the OP is using Java 1.4? (Many in the real world are still
    stuck with pre-1.5 Java.) The answer certainly isn't "in plain sight"
    there the way it is in 1.5.

    What "answer" should he find? UTF-16? I'm 100% sure several
    JVM have used UCS-2 internally in the past. And UCS-2 is *not*
    identical to UTF-16 (even if they're very similar).

    AFAIK Java 1.4 only supports all "Unicode 3.0 code units", not all
    "Unicode 3.1+ code points". So a 1.4 JVM may very well use
    the UCS-2 encoding internally and still be compliant with the
    1.4 spec. This is *not* the case for a 1.5 JVM: the (older) UCS-2
    encoding isn't sufficient.

    In the part you quoted, I see two questions. How does your
    post explain whether the OP will have problems or not using that
    same encoding? (And what would that "same" encoding be?
    UTF-16? UCS-2?)

    I find the OP's post to be a legitimate question that deserves
    more than a "RTFM". I may have made mistakes in my
    explanation, but at least I tried to help him.

    And Chris Smith gave a very nice and gentle explanation,
    proposing, amongst other things, to use UTF-8 (like I did), and
    even explaining UTF-8 gotchas (which I wasn't aware of).

    Now that may be just me, but I find Chris Smith's answer
    to be gentle and insightful, not yours...

    Moreover, not so long ago on this group (thanks Google),
    you insisted that ASCII was an 8-bit encoding... So if I were
    the OP I'd take any advice coming from you regarding
    character sets/encodings/etc. with a huge grain of salt, for
    I wouldn't think you'd be the definitive authority on the
    subject.

    Good day to you, and sorry if I sound condescending (but note
    that I did find your answer to the OP condescending, and
    that certainly influenced the tone of my reply here).
    , Mar 27, 2006
    #6
  7. mtp

    Roedy Green Guest

    On Mon, 27 Mar 2006 13:03:49 +0200, mtp <> wrote,
    quoted or indirectly quoted someone who said :

    >What is the "innner" charset of String class? Since Java must store
    >characters in memory, it must use some kind of internal charset. If i
    >use the same, i won't have any trouble, i believe... am i right?


    UTF-16. See http://mindprod.com/jgloss/utf.html

    However, there is no way for you to get at that char array directly.
    You can of course use Java's serialisation, which will use writeUTF,
    which uses a bastardised UTF-8.
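
    If you go that route, writing and reading a string is just this (a
    rough sketch, nothing more; the class name is made up):

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.io.ObjectInputStream;
        import java.io.ObjectOutputStream;

        public class StringSerialization {
            public static void main(String[] args) throws IOException, ClassNotFoundException {
                String original = "any string at all";

                // Standard serialization handles the encoding internally,
                // so there is no charset to choose here.
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                ObjectOutputStream out = new ObjectOutputStream(bos);
                out.writeObject(original);
                out.close();

                // Reading it back preserves every char of the original.
                ObjectInputStream in = new ObjectInputStream(
                        new ByteArrayInputStream(bos.toByteArray()));
                String copy = (String) in.readObject();
                System.out.println(original.equals(copy)); // true
            }
        }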
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Mar 27, 2006
    #7
  8. mtp

    Roedy Green Guest

    On 27 Mar 2006 09:06:45 -0800, ""
    <> wrote, quoted or indirectly quoted someone who
    said :

    >
    >If I check the Java 1.5 String API doc I do indeed see that UTF-16
    >is used.
    >
    >What if the OP is using Java 1.4?


    then there is no 32-bit support. Strings are composed of 16-bit
    Unicode values; the lo/hi surrogates are just treated as ordinary characters.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Mar 27, 2006
    #8
  9. mtp

    mtp Guest

    Roedy Green wrote:
    > On Mon, 27 Mar 2006 13:03:49 +0200, mtp <> wrote,
    > quoted or indirectly quoted someone who said :
    >
    >
    >>What is the "innner" charset of String class? Since Java must store
    >>characters in memory, it must use some kind of internal charset. If i
    >>use the same, i won't have any trouble, i believe... am i right?

    >
    >
    > UTF-16. See http://mindprod.com/jgloss/utf.html
    >
    > However, there is no way for you to get at that char array directly.
    > You can of course use Java's serialisation, which will use writeUTF,
    > which uses a bastardised UTF-8.


    Thx to all for this valuable information. I will use UTF-8 since our
    company does not sell a lot in Japan right now ;)
    mtp, Mar 28, 2006
    #9
  10. mtp

    Alex Hunsley Guest

    mtp wrote:
    > Roedy Green wrote:
    >> On Mon, 27 Mar 2006 13:03:49 +0200, mtp <> wrote,
    >> quoted or indirectly quoted someone who said :
    >>
    >>
    >>> What is the "innner" charset of String class? Since Java must store
    >>> characters in memory, it must use some kind of internal charset. If i
    >>> use the same, i won't have any trouble, i believe... am i right?

    >>
    >>
    >> UTF-16. See http://mindprod.com/jgloss/utf.html
    >>
    >> However, there is no way for you to get at that char array directly.
    >> You can of course use Java's serialisation, which will use writeUTF,
    >> which uses a bastardised UTF-8.

    >
    > Thx to all for this valuable information. I will use UTF-8 since our
    > company does not sell a lot in Japan right now ;)


    Is there really any cost in just doing it correctly now and using
    UTF-16? Might save a headache later. Or maybe not, who knows? :]
    Alex Hunsley, Mar 28, 2006
    #10
  11. opalinski from opalpaweb, Mar 28, 2006
    #11
  12. mtp

    Oliver Wong Guest

    "Alex Hunsley" <> wrote in message
    news:gwbWf.285686$...
    > mtp wrote:
    >>
    >> Thx to all for this valuable information. I will use UTF-8 since our
    >> company does not sell a lot in Japan right now ;)

    >
    > Is there really any cost in just doing it correctly now and using
    > UTF-16? Might save a headache later. Or maybe not, who knows? :]


    Yes, there is a cost. If you use only ASCII characters in your document,
    then UTF-8 will use 1 byte per character. UTF-16 will use 2 bytes per
    character.

    If you mainly use Asian characters (for example), UTF-8 will use 3 bytes
    per character, UTF-16 will use 2 bytes per character.

    So the choice between UTF-8 and UTF-16 depends on what you expect to
    appear in your documents.
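
    A small sketch that shows the trade-off (the class name and sample
    strings are arbitrary):

        import java.io.UnsupportedEncodingException;

        public class EncodingSizes {
            public static void main(String[] args) throws UnsupportedEncodingException {
                String ascii = "hello world";        // ASCII only
                String cjk = "\u65e5\u672c\u8a9e";   // three kanji characters

                System.out.println(ascii.getBytes("UTF-8").length);    // 11 (1 byte per char)
                System.out.println(ascii.getBytes("UTF-16BE").length); // 22 (2 bytes per char)

                System.out.println(cjk.getBytes("UTF-8").length);      // 9 (3 bytes per char)
                System.out.println(cjk.getBytes("UTF-16BE").length);   // 6 (2 bytes per char)
            }
        }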

    - Oliver
    Oliver Wong, Mar 28, 2006
    #12
  13. opalinski from opalpaweb, Mar 28, 2006
    #13
  14. mtp

    Oliver Wong Guest

    <> wrote in message
    news:...
    > UTF-8 works well for Japanese too...


    UTF-16 "works better" though, if the metric used is size of bitstream.
    Characters with codepoints between \u0800 and \uFFFF take up 3 bytes in
    UTF-8, but only 2 bytes in UTF-16. This includes most Asian scripts
    (Chinese, Japanese, Korean, Yi, Mongolian, Tibetan, Thai, etc.).

    - Oliver
    Oliver Wong, Mar 28, 2006
    #14
