What is the best charset to choose for binary serialization

mtp

Hello,

I need to binary-serialize some strings in a Java application. Since
there is no restriction at all on the strings, I need to handle all the
characters that java.lang.String handles.

What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

So what is the best charset?

Thanks
 
tom fredriksen

mtp said:
What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

Read the API doc; the answer is there in plain sight.

/tom
 
Chris Smith

mtp said:
I need to binary-serialize some strings in a Java application. Since
there is no restriction at all on the strings, I need to handle all the
characters that java.lang.String handles.

What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

There are actually a couple of character sets that meet your requirements.
They include UTF-16BE, UTF-16LE, and UTF-8. The difference between the
first two (which differ only in endianness) and the last is that UTF-8
is optimized to reduce the size of files that contain mostly ASCII
characters, while the UTF-16 encodings will be smaller when the file
contains characters chosen at random from throughout the entire Unicode
character set, or when it contains mostly characters outside the ISO
Latin 1 (ISO 8859-1) range, which is a superset of ASCII. It's worth
noting that Java's UTF-8 is *not* the same as the UTF-8 used throughout
the remainder of the computing world, so you shouldn't assume
compatibility with UTF-8 character decoders written in other languages.

Internally, Java Strings are stored logically in UTF-16. The endianness
is unspecified, because the String class will use Java primitive data
types, whose endianness is never observable by a Java application.
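
For instance, here is a quick sketch (the class name and sample string
are just illustrative) showing that each of these encodings round-trips
an arbitrary String; all three charset names are required to be
supported by every Java platform implementation:

    import java.io.UnsupportedEncodingException;

    public class RoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String original = "caf\u00E9 \u65E5\u672C\u8A9E"; // mixed Latin-1 and CJK
            String[] charsets = { "UTF-8", "UTF-16BE", "UTF-16LE" };
            for (int i = 0; i < charsets.length; i++) {
                byte[] bytes = original.getBytes(charsets[i]);
                String decoded = new String(bytes, charsets[i]);
                // Each mandated encoding restores the original string losslessly.
                System.out.println(charsets[i] + ": " + bytes.length
                        + " bytes, ok=" + original.equals(decoded));
            }
        }
    }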

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Mark Thornton

Chris said:
It's worth noting that Java's UTF-8 is *not* the same as the UTF-8 used
throughout the remainder of the computing world, so you shouldn't assume
compatibility with UTF-8 character decoders written in other languages.

Doesn't the "modified UTF-8" only apply to DataOutputStream,
DataInputStream and related classes, plus some JNI-related stuff? The
encoding used by the java.nio.charset classes should be true UTF-8.
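
One quick way to check (a sketch; assumes Java 1.4+ for java.nio):
U+0000 encodes to a single zero byte in standard UTF-8, whereas
writeUTF emits the two-byte form 0xC0 0x80.

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class TrueUtf8 {
        public static void main(String[] args) {
            // java.nio.charset produces standard UTF-8: U+0000 is one 0x00 byte.
            ByteBuffer buf = Charset.forName("UTF-8").encode("\0");
            System.out.println(buf.remaining()); // prints 1
        }
    }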

Mark Thornton
 
lewmania942

Hi,

Short answer: you can use UTF-8 and you shouldn't have
any problems.

Now I'll try to answer your questions ;)

Hello,

I need to binary-serialize some strings in a Java application. Since
there is no restriction at all on the strings, I need to handle all the
characters that java.lang.String handles.

Which characters java.lang.String handles depends on the version
of Java you're using... Up to Java 1.4 you can "only" correctly
handle Unicode 3.0 code points.
From Java 1.5 on, you can handle "all" Unicode code points (and
the String class gained new methods to that effect, like
codePointAt(...)).

What is the "inner" charset of the String class?

You shouldn't care. All you should care about is which encodings are
available when serializing and deserializing your strings.

That said, I'll try to answer your question.

The String class is based on the underlying char primitive, which,
unfortunately, is 16 bits wide. Java was designed at a time when
Unicode didn't yet have more than 65536 code points defined... and
back then a Java char was equivalent to a "Unicode code unit"
(check the Character class's API doc for the terminology).

This has some very funny implications, like:

"some Unicode 3.1 and above string".length()

not returning the length in "Unicode code points" but in "Java chars".
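
For instance (a Java 1.5 sketch, since codePointCount() doesn't exist
before 1.5), a supplementary character such as U+1D11E (MUSICAL SYMBOL
G CLEF) is stored as a surrogate pair of two chars:

    String s = "\uD834\uDD1E"; // U+1D11E: one code point, two Java chars
    System.out.println(s.length());                      // prints 2
    System.out.println(s.codePointCount(0, s.length())); // prints 1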

Since Java must store characters in memory, it must use
some kind of internal charset.

Before Java 1.5 it was known that the internal representation of
several JVMs was UCS-2 (UTF-16 without surrogates). But AFAIK
this was not specified by the spec (though I may be wrong).

I've read in this group, years ago, that people have used this fact
to do very fast DB to/from JVM string exchanges (e.g. by configuring
the DB to use UCS-2).

In Java 1.5 both String and Character's API docs mention that
UTF-16 is used (with surrogates support).

So what is the best charset?

There's not really an answer to that. UTF-8 is pretty common and is
mandated by the spec to be present in every J(2)SE JVM (though you'll
still have to catch an exception that, per the spec, can never be
thrown when calling getBytes("UTF-8")).

So it's usually a safe bet to go with the UTF-8 encoding.
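
A minimal sketch of that (the class and method names are just
illustrative):

    import java.io.UnsupportedEncodingException;

    public class StringCodec {
        // UTF-8 is mandated by the spec, so the catch blocks are
        // unreachable in practice, but the compiler still requires them.
        public static byte[] serialize(String s) {
            try {
                return s.getBytes("UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new Error("UTF-8 unsupported: impossible per the spec", e);
            }
        }

        public static String deserialize(byte[] bytes) {
            try {
                return new String(bytes, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new Error("UTF-8 unsupported: impossible per the spec", e);
            }
        }
    }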
 
lewmania942

Hi tom,

tom said:
Read the API doc; the answer is there in plain sight.

If I check the Java 1.5 String API doc, I do indeed see that UTF-16
is used.

What if the OP is using Java 1.4? (Many in the real world are still
stuck with pre-1.5 Java.) It certainly isn't "in plain sight" there
the way it is in 1.5.

What "answer" should he find? UTF-16? I'm 100% sure several
JVM have used UCS-2 internally in the past. And UCS-2 is *not*
identical to UTF-16 (even if they're very similar).

AFAIK Java 1.4 only support all "Unicode 3.0 code units", not all
"Unicode 3.1+ code points". So an 1.4 JVM may very well use
the UCS-2 encoding internally and still be compliant to the
1.4 specs. This is *not* the case for an 1.5 JVM: the (older) UCS-2
encoding isn't sufficient.

In the part you quoted, I see two questions. How does your post
explain whether the OP will have problems using that same
encoding? (And what would that "same" encoding be? UTF-16? UCS-2?)

I find the OP's post to be a legitimate question that deserves
more than an "RTFM". I may have made mistakes in my explanation,
but at least I tried to help him.

And Chris Smith gave a very nice and gentle explanation, proposing,
amongst other things, to use UTF-8 (like I did), and even explaining
UTF-8 gotchas (which I wasn't aware of).

Now that may be just me, but I find Chris Smith's answer
to be gentle and insightful, not yours...

Moreover, not so long ago in this group (thanks Google), you
insisted that ASCII was an 8-bit encoding... So if I were the OP,
I'd take any advice coming from you regarding character
sets/encodings/etc. with a huge grain of salt, for I wouldn't
consider you the definitive authority on the subject.

Good day to you, and sorry if I come across as condescending (but
note that I did find your answer to the OP condescending, and that
certainly influenced the tone of my reply here).
 
Roedy Green

What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

UTF-16. See http://mindprod.com/jgloss/utf.html

However, there is no way for you to get at that char array directly.
You can of course use Java's serialisation, which uses writeUTF,
which in turn writes a bastardised UTF-8.
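
A sketch of that route (the file name is just an example):

    import java.io.*;

    public class WriteUtfDemo {
        public static void main(String[] args) throws IOException {
            // writeUTF emits Java's "modified UTF-8": U+0000 becomes the
            // two bytes 0xC0 0x80, supplementary characters are written as
            // two 3-byte surrogates, and each encoded string is limited to
            // 65535 bytes.
            DataOutputStream out =
                    new DataOutputStream(new FileOutputStream("strings.bin"));
            out.writeUTF("some string");
            out.close();

            DataInputStream in =
                    new DataInputStream(new FileInputStream("strings.bin"));
            System.out.println(in.readUTF()); // prints "some string"
            in.close();
        }
    }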
 
Roedy Green

If I check the Java 1.5 String API doc, I do indeed see that UTF-16
is used.

What if the OP is using Java 1.4?

Then there is no 32-bit support. Strings are composed of 16-bit
Unicode; the low/high surrogates are just treated as ordinary characters.
 
mtp

Roedy said:
UTF-16. see http://mindprod.com/jgloss/utf.html

However, there is no way for you to get at that char array directly.
You can of course use Java's serialisation, which uses writeUTF,
which in turn writes a bastardised UTF-8.

Thanks to all for this valuable information. I will use UTF-8, since
our company does not sell a lot in Japan right now ;)
 
Alex Hunsley

mtp said:
Thanks to all for this valuable information. I will use UTF-8, since
our company does not sell a lot in Japan right now ;)

Is there really any cost to just doing it correctly now and using
UTF-16? Might save a headache later. Or maybe not, who knows? :]
 
Oliver Wong

Alex Hunsley said:
mtp said:
Thanks to all for this valuable information. I will use UTF-8, since
our company does not sell a lot in Japan right now ;)

Is there really any cost to just doing it correctly now and using
UTF-16? Might save a headache later. Or maybe not, who knows? :]

Yes, there is a cost. If you use only ASCII characters in your document,
then UTF-8 will use 1 byte per character. UTF-16 will use 2 bytes per
character.

If you mainly use Asian characters (for example), UTF-8 will use 3 bytes
per character, UTF-16 will use 2 bytes per character.

So the choice between UTF-8 and UTF-16 depends on what you expect to
appear in your documents.
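
A quick sketch that checks those numbers (the sample strings are
arbitrary; UTF-16BE is used so no byte-order mark skews the counts):

    import java.io.UnsupportedEncodingException;

    public class SizeComparison {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String ascii = "hello";              // 5 ASCII characters
            String kanji = "\u65E5\u672C\u8A9E"; // 3 Japanese characters
            System.out.println(ascii.getBytes("UTF-8").length);    // 5
            System.out.println(ascii.getBytes("UTF-16BE").length); // 10
            System.out.println(kanji.getBytes("UTF-8").length);    // 9
            System.out.println(kanji.getBytes("UTF-16BE").length); // 6
        }
    }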

- Oliver
 
Oliver Wong

UTF-8 works well for Japanese too...

UTF-16 "works better" though, if the metric used is size of bitstream.
Characters with codepoints between \u0800 and \uFFFF take up 3 bytes in
UTF-8, but only 2 bytes in UTF-16. This includes most Asian scripts
(Chinese, Japanese, Korean, Yi, Mongolian, Tibetan, Thai, etc.).

- Oliver
 
