What is the best charset to choose for binary serialization

mtp

Hello,

I need to binary-serialize some strings in a Java application. Since
there is no restriction at all on the strings, I need to handle all the
characters that java.lang.String handles.

What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

So what is the best charset?

Thanks
 
tom fredriksen

mtp said:
What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

Read the API doc; the answer is there in plain sight.

/tom
 
Chris Smith

mtp said:
I need to binary-serialize some strings in a Java application. Since
there is no restriction at all on the strings, I need to handle all the
characters that java.lang.String handles.

What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

There are actually a couple of character sets that meet your requirements.
They include UTF-16BE, UTF-16LE, and UTF-8. The difference between the
first two (which differ only in endianness) and the last is that UTF-8
is optimized to reduce the size of files that contain mostly ASCII
characters, while the UTF-16 encodings will be smaller when the file
contains characters chosen at random from throughout the entire Unicode
character set, or when it contains mostly characters outside the ISO
Latin 1 (ISO 8859-1) range, which is a superset of ASCII. It's worth
noting that Java's UTF-8 is *not* the same as the UTF-8 used throughout
the remainder of the computing world, so you shouldn't assume
compatibility with UTF-8 character decoders written in other languages.

Internally, Java Strings are stored logically in UTF-16. The endianness
is unspecified, because the String class will use Java primitive data
types, whose endianness is never observable by a Java application.
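
For instance, here is a quick sketch (the class name and sample string
are just illustrative) showing that each of these encodings round-trips
an arbitrary String; all three charset names are required to be
supported by every Java platform implementation:

    import java.io.UnsupportedEncodingException;

    public class RoundTrip {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String original = "caf\u00E9 \u65E5\u672C\u8A9E"; // mixed Latin-1 and CJK
            String[] charsets = { "UTF-8", "UTF-16BE", "UTF-16LE" };
            for (int i = 0; i < charsets.length; i++) {
                byte[] bytes = original.getBytes(charsets[i]);
                String decoded = new String(bytes, charsets[i]);
                // Each mandated encoding restores the original string losslessly.
                System.out.println(charsets[i] + ": " + bytes.length
                        + " bytes, ok=" + original.equals(decoded));
            }
        }
    }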

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Mark Thornton

Chris said:
It's worth noting that Java's UTF-8 is *not* the same as the UTF-8 used
throughout the remainder of the computing world, so you shouldn't assume
compatibility with UTF-8 character decoders written in other languages.

Doesn't the "modified UTF-8" only apply to DataOutputStream,
DataInputStream and related classes, plus some JNI-related stuff? The
encoding used by the java.nio.charset classes should be true UTF-8.
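
One quick way to check (a sketch; assumes Java 1.4+ for java.nio):
U+0000 encodes to a single zero byte in standard UTF-8, whereas
writeUTF emits the two-byte form 0xC0 0x80.

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;

    public class TrueUtf8 {
        public static void main(String[] args) {
            // java.nio.charset produces standard UTF-8: U+0000 is one 0x00 byte.
            ByteBuffer buf = Charset.forName("UTF-8").encode("\0");
            System.out.println(buf.remaining()); // prints 1
        }
    }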

Mark Thornton
 
lewmania942

Hi,

Short answer: you can use UTF-8 and you shouldn't have
any problems.

Now I'll try to answer your questions ;)

Hello,

I need to binary-serialize some strings in a Java application. Since
there is no restriction at all on the strings, I need to handle all the
characters that java.lang.String handles.

Which characters java.lang.String handles depends on the version
of Java you're using... Up to Java 1.4 you can "only" correctly
handle Unicode 3.0 code points.
From Java 1.5 on, you can handle "all" Unicode code points (and
the String class gained new methods to that effect, like
codePointAt(...)).

What is the "inner" charset of the String class?

You shouldn't care. All you should care about is which encodings are
available when serializing and deserializing your strings.

That said, I'll try to answer your question.

The String class is based on the underlying char primitive, which,
unfortunately, is 16 bits wide. Java was designed at a time when
Unicode didn't yet have more than 65536 code points defined... and
back then a Java char was equivalent to a "Unicode code unit"
(check the Character class's API doc for the terminology).

This has some very funny implications, like:

"some Unicode 3.1 and above string".length()

not returning the length in "Unicode code points" but in "Java chars".
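
For instance (a Java 1.5 sketch, since codePointCount() doesn't exist
before 1.5), a supplementary character such as U+1D11E (MUSICAL SYMBOL
G CLEF) is stored as a surrogate pair of two chars:

    String s = "\uD834\uDD1E"; // U+1D11E: one code point, two Java chars
    System.out.println(s.length());                      // prints 2
    System.out.println(s.codePointCount(0, s.length())); // prints 1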

Since Java must store characters in memory, it must use
some kind of internal charset.

Before Java 1.5 it was known that the internal representation of
several JVMs was UCS-2 (UTF-16 without surrogates). But AFAIK
this was not specified by the spec (though I may be wrong).

I've read in this group, years ago, that people have used this fact
to do very fast DB to/from JVM string exchanges (e.g. by configuring
the DB to use UCS-2).

In Java 1.5 both String and Character's API docs mention that
UTF-16 is used (with surrogates support).

So what is the best charset?

There's not really an answer to that. UTF-8 is pretty common and is
mandated by the spec to be present in every J(2)SE JVM (though you'll
still have to catch an exception that, per the spec, can never be
thrown when calling getBytes("UTF-8")).

So it's usually a safe bet to go with the UTF-8 encoding.
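
A minimal sketch of that (the class and method names are just
illustrative):

    import java.io.UnsupportedEncodingException;

    public class StringCodec {
        // UTF-8 is mandated by the spec, so the catch blocks are
        // unreachable in practice, but the compiler still requires them.
        public static byte[] serialize(String s) {
            try {
                return s.getBytes("UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new Error("UTF-8 unsupported: impossible per the spec", e);
            }
        }

        public static String deserialize(byte[] bytes) {
            try {
                return new String(bytes, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new Error("UTF-8 unsupported: impossible per the spec", e);
            }
        }
    }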
 
lewmania942

Hi tom,

tom said:
Read the API doc; the answer is there in plain sight.

If I check the Java 1.5 String API doc, I do indeed see that UTF-16
is used.

What if the OP is using Java 1.4? (Many in the real world are still
stuck with pre-1.5 Java.) It certainly isn't "in plain sight" there
the way it is in 1.5.

What "answer" should he find? UTF-16? I'm 100% sure several
JVM have used UCS-2 internally in the past. And UCS-2 is *not*
identical to UTF-16 (even if they're very similar).

AFAIK Java 1.4 only support all "Unicode 3.0 code units", not all
"Unicode 3.1+ code points". So an 1.4 JVM may very well use
the UCS-2 encoding internally and still be compliant to the
1.4 specs. This is *not* the case for an 1.5 JVM: the (older) UCS-2
encoding isn't sufficient.

In the part you quoted, I see two questions. How does your post
explain whether the OP will have problems using that same
encoding? (And what would that "same" encoding be? UTF-16? UCS-2?)

I find the OP's post to be a legitimate question that deserves
more than an "RTFM". I may have made mistakes in my explanation,
but at least I tried to help him.

And Chris Smith gave a very nice and gentle explanation, proposing,
amongst other things, to use UTF-8 (like I did), and even explaining
UTF-8 gotchas (which I wasn't aware of).

Now that may be just me, but I find Chris Smith's answer
to be gentle and insightful, not yours...

Moreover, not so long ago in this group (thanks Google), you
insisted that ASCII was an 8-bit encoding... So if I were the OP,
I'd take any advice coming from you regarding character
sets/encodings/etc. with a huge grain of salt, for I wouldn't
consider you the definitive authority on the subject.

Good day to you, and sorry if I come across as condescending (but
note that I did find your answer to the OP condescending, and that
certainly influenced the tone of my reply here).
 
Roedy Green

What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same, I won't have any trouble, I believe... am I right?

UTF-16. See http://mindprod.com/jgloss/utf.html

However, there is no way for you to get at that char array directly.
You can of course use Java's serialisation, which uses writeUTF,
which in turn writes a bastardised UTF-8.
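
A sketch of that route (the file name is just an example):

    import java.io.*;

    public class WriteUtfDemo {
        public static void main(String[] args) throws IOException {
            // writeUTF emits Java's "modified UTF-8": U+0000 becomes the
            // two bytes 0xC0 0x80, supplementary characters are written as
            // two 3-byte surrogates, and each encoded string is limited to
            // 65535 bytes.
            DataOutputStream out =
                    new DataOutputStream(new FileOutputStream("strings.bin"));
            out.writeUTF("some string");
            out.close();

            DataInputStream in =
                    new DataInputStream(new FileInputStream("strings.bin"));
            System.out.println(in.readUTF()); // prints "some string"
            in.close();
        }
    }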
 
Roedy Green

If I check the Java 1.5 String API doc, I do indeed see that UTF-16
is used.

What if the OP is using Java 1.4?

Then there is no 32-bit support. Strings are composed of 16-bit
Unicode; the low/high surrogates are just treated as ordinary characters.
 
mtp

Roedy said:
UTF-16. see http://mindprod.com/jgloss/utf.html

However, there is no way for you to get at that char array directly.
You can of course use Java's serialisation, which uses writeUTF,
which in turn writes a bastardised UTF-8.

Thanks to all for this valuable information. I will use UTF-8, since
our company does not sell a lot in Japan right now ;)
 
Alex Hunsley

mtp said:
Thanks to all for this valuable information. I will use UTF-8, since
our company does not sell a lot in Japan right now ;)

Is there really any cost to just doing it correctly now and using
UTF-16? Might save a headache later. Or maybe not, who knows? :]
 
Oliver Wong

Alex Hunsley said:
mtp said:
Thanks to all for this valuable information. I will use UTF-8, since
our company does not sell a lot in Japan right now ;)

Is there really any cost to just doing it correctly now and using
UTF-16? Might save a headache later. Or maybe not, who knows? :]

Yes, there is a cost. If you use only ASCII characters in your document,
then UTF-8 will use 1 byte per character. UTF-16 will use 2 bytes per
character.

If you mainly use Asian characters (for example), UTF-8 will use 3 bytes
per character, UTF-16 will use 2 bytes per character.

So the choice between UTF-8 and UTF-16 depends on what you expect to
appear in your documents.
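
A quick sketch that checks those numbers (the sample strings are
arbitrary; UTF-16BE is used so no byte-order mark skews the counts):

    import java.io.UnsupportedEncodingException;

    public class SizeComparison {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String ascii = "hello";              // 5 ASCII characters
            String kanji = "\u65E5\u672C\u8A9E"; // 3 Japanese characters
            System.out.println(ascii.getBytes("UTF-8").length);    // 5
            System.out.println(ascii.getBytes("UTF-16BE").length); // 10
            System.out.println(kanji.getBytes("UTF-8").length);    // 9
            System.out.println(kanji.getBytes("UTF-16BE").length); // 6
        }
    }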

- Oliver
 
Oliver Wong

UTF-8 works well for Japanese too...

UTF-16 "works better" though, if the metric used is size of bitstream.
Characters with codepoints between \u0800 and \uFFFF take up 3 bytes in
UTF-8, but only 2 bytes in UTF-16. This includes most Asian scripts
(Chinese, Japanese, Korean, Yi, Mongolian, Tibetan, Thai, etc.).

- Oliver
 
