java string encoding

Sender

I have a string of Chinese characters encoded with "big5". I convert it to
Unicode and use a printBytes method to display the hex values on System.out.
Here is the code:


String cname ...... //this is the big5 string
if (cname != null) {
    printBytes(cname.getBytes(), "CName ");
    ch_name = new String(cname.getBytes(), "BIG5");
    printBytes(ch_name.getBytes(), "ch_name ");
    String bb = new String(ch_name.getBytes("BIG5"));
    printBytes(bb.getBytes(), "bb ");
}

public static void printBytes(byte[] array, String name) {
    System.out.print(name + " = ");
    for (int k = 0; k < array.length; k++) {
        System.out.print("0x" + Common.byteToHex(array[k]) + " ");
    }
    System.out.println();
}

And here is the output of printBytes:

CName = 0xb5 0xd8 0xb1 0xe1 0xa4 0xa4 0xb0 0xea
ch_name = 0x3f 0x3f 0x3f 0x3f
bb = 0xb5 0xd8 0xb1 0xe1 0xa4 0xa4 0xb0 0xea

As you can see, ch_name became "????". But the last 2 lines of code can
convert it back to the original big5 string. Why? I've tried inserting some
AWT frame display code in between (the ShowString from Sun) and found that
ch_name displays the Chinese correctly. Why? In fact, what I want to do
is convert the big5 string to Unicode and store it in a varchar column in
MySQL. While I can store ch_name, it is stored only as "????", and after
retrieving it, ShowString cannot display it correctly. In other words,
convert-and-display works, but convert-store-display doesn't.
Any help?
 
Boudewijn Dijkstra

Sender said:
I have string of Chinese characters encoded with "big5".
String cname ...... //this is the big5 string
if (cname != null) {
printBytes(cname.getBytes(), "CName ");
ch_name = new String(cname.getBytes(), "BIG5");
printBytes(ch_name.getBytes(), "ch_name ");
String bb = new String(ch_name.getBytes("BIG5"));
printBytes(bb.getBytes(), "bb ");
}

Before we can verify this code does what you want, some questions have to be
answered:
How exactly do you initialize cname?
What is your default platform encoding?
 
Niels Dybdahl

As you can see, ch_name became "????". But the last 2 lines of code can
convert it back to the original big5 string. Why?

The documentation for String.getBytes() states:
public byte[] getBytes()
Encodes this String into a sequence of bytes using the platform's default
charset, storing the result into a new byte array.

So ch_name is not "????". The output just means that the contents of ch_name
cannot be represented in your default charset, so getBytes() substitutes "?"
(0x3f) for each character it cannot encode.
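
Here is a minimal sketch of that effect. It assumes, purely for illustration, a default charset that behaves like ISO-8859-1 (the thread never says what the default actually is), and passes that charset explicitly so the result does not depend on the machine. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

final class UnmappableDemo {
    // Encodes s with ISO-8859-1, a charset that contains no Chinese characters.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file
        // encoding does not matter.
        String chinese = "\u4e2d\u570b";
        for (byte b : encodeLatin1(chinese)) {
            System.out.print("0x" + Integer.toHexString(b & 0xff) + " ");
        }
        System.out.println();
    }
}
```

The String still holds both characters. Only the byte array produced by the lossy encoding contains 0x3f.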

Niels Dybdahl
 
Jon A. Cruz

Sender said:
I have string of Chinese characters encoded with "big5". I convert it to

No, you don't.

I convert it to
unicode and use a printBytes method to display the hex value to System.out.

"convert it to unicode" is not actually going on, since Java chars are
*always* Unicode.

Also... System.out usually cannot be trusted to print non-ASCII characters
(that means characters outside the range from 0 through 127) correctly,
since it typically encodes its output with the platform default encoding.

Here is the coding:


String cname ...... //this is the big5 string

How do you get that cname?

At the point at which you have a Java String, it has 16-bit Unicode for
its contents. By definition.

if (cname != null) {
printBytes(cname.getBytes(), "CName ");

This says "take the sequence of 16-bit Unicode characters living in the
String 'cname' and convert it to a byte array using the default platform
conversion"

ch_name = new String(cname.getBytes(), "BIG5");

This says "convert the String 'cname' to a byte array using the default
platform conversion, then decode those bytes into a new String of 16-bit
Unicode characters, explicitly treating the bytes as 'BIG5'"

printBytes(ch_name.getBytes(), "ch_name ");
String bb = new String(ch_name.getBytes("BIG5"));

Ok. There's a *huge* problem.

You just converted a String from 16-bit Unicode chars to BIG5 bytes, and
then converted it *back* to a String of 16-bit Unicode chars *but* using
the default platform encoding for it.

That default encoding differs between machines, locales and configurations.
Never count on it.

public static void printBytes(byte[] array, String name) {
System.out.print(name + " = ");
for (int k = 0; k < array.length; k++) {
System.out.print("0x" + Common.byteToHex(array[k]) + " ");
}
System.out.println();
}

That's decent. *However*, you need a printChars also:

public static void printChars(String str, String name) {
    System.out.print(name + " = ");
    for (int k = 0; k < str.length(); k++) {
        System.out.print("0x" + Integer.toHexString(0x0ffff &
                str.charAt(k)) + " ");
    }
    System.out.println();
}
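
Since Common.byteToHex is not shown in the thread, here is a self-contained sketch of both dumps side by side. The class and method names are hypothetical, and String.format stands in for the unknown helper.

```java
import java.nio.charset.StandardCharsets;

final class HexDump {
    // Formats each byte as an unsigned two-digit hex value.
    static String bytesToHex(byte[] array) {
        StringBuilder sb = new StringBuilder();
        for (byte b : array) {
            sb.append(String.format("0x%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    // Formats each 16-bit char of the String as a four-digit hex value.
    static String charsToHex(String str) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < str.length(); k++) {
            sb.append(String.format("0x%04x ", (int) str.charAt(k)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u570b";
        // The chars are what the String actually contains.
        System.out.println("chars = " + charsToHex(s));
        // The bytes depend entirely on the charset chosen for the conversion.
        System.out.println("utf8  = "
                + bytesToHex(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Dumping the chars shows what is really in the String. Dumping bytes only shows the result of one particular conversion.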

And here is the output of printBytes:

CName = 0xb5 0xd8 0xb1 0xe1 0xa4 0xa4 0xb0 0xea
ch_name = 0x3f 0x3f 0x3f 0x3f
bb = 0xb5 0xd8 0xb1 0xe1 0xa4 0xa4 0xb0 0xea

As you can see, ch_name became "????". But the last 2 lines of code can
convert it back to the original big5 string. Why?

Because at one point you said "convert this Java char sequence to bytes
using the default local character encoding" yet at another you said
"convert this Java char sequence to bytes using 'BIG5'" to do the conversion.

I would draw the conclusion that you misunderstand Strings in Java.
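
To make the sequence concrete, here is a sketch that replays the three steps with every charset named explicitly. It rests on two assumptions that the thread does not confirm: that the default encoding behaves like ISO-8859-1 for these bytes, and that the runtime provides a Big5 charset (a full JDK normally does). The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Big5RoundTrip {
    // Assumption: stands in for the unknown platform default encoding.
    static final Charset ASSUMED_DEFAULT = StandardCharsets.ISO_8859_1;
    // Assumption: the runtime ships a Big5 charset.
    static final Charset BIG5 = Charset.forName("Big5");

    // The raw bytes from the "CName" line of the posted output.
    static final byte[] RAW = {
        (byte) 0xb5, (byte) 0xd8, (byte) 0xb1, (byte) 0xe1,
        (byte) 0xa4, (byte) 0xa4, (byte) 0xb0, (byte) 0xea
    };

    // Mirrors: ch_name = new String(cname.getBytes(), "BIG5")
    static String decodeAsBig5(String cname) {
        return new String(cname.getBytes(ASSUMED_DEFAULT), BIG5);
    }

    public static void main(String[] args) {
        // cname: each Big5 byte became one char, so 8 chars rather than 4.
        String cname = new String(RAW, ASSUMED_DEFAULT);
        // ch_name: the bytes are now decoded as Big5 into real characters.
        String chName = decodeAsBig5(cname);
        // The assumed default cannot represent those characters.
        byte[] shown = chName.getBytes(ASSUMED_DEFAULT);
        // Encoding as Big5 produces the original byte values again.
        byte[] big5 = chName.getBytes(BIG5);
        // bb: decoded with the default once more, so it is cname all over again.
        String bb = new String(big5, ASSUMED_DEFAULT);

        System.out.println(cname.length() + " chars vs " + chName.length());
        System.out.println(Arrays.toString(shown));
        System.out.println(Arrays.equals(big5, RAW) + " " + bb.equals(cname));
    }
}
```

Under those assumptions, ch_name is the only String of the three that holds Chinese characters, which would explain why it displays correctly in a frame while its default-encoded bytes are all 0x3f.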

They do *not* store things in bytes.
They *do* store things in 16-bit unsigned Unicode characters. Always.

String.getBytes() and String.getBytes(String) *convert* the contents to a
byte array. They do *not* 'access' some internal byte array.
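
A short sketch of what "convert" means in practice: the same one-char String yields byte arrays of different lengths depending on the charset, and changing a returned array leaves the String untouched. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetBytesIsAConversion {
    // Number of bytes produced when s is encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // a single 16-bit char

        // One char, but the byte count depends on the charset used.
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));

        // Each call builds a fresh array; overwriting it does not alter s.
        byte[] copy = s.getBytes(StandardCharsets.UTF_8);
        copy[0] = 0;
        System.out.println(s.charAt(0) == 0x4e2d);
    }
}
```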


Why? In fact, what I wanted to do
is to convert the big5 string to unicode and store it as a varchar column in
MySql. While I can store the ch_name, it only stored as "????" and
retrieving it cannot be displayed in ShowString correctly. In other words,
if convert-and-display, it works, if convert-store-display, it doesn't work.

Yes.

Java Strings are *always* Unicode.

Look to where you go in and out of Strings. Always use explicit
encodings. Never use String.getBytes() or new String(byte[]). Instead
use String.getBytes(String) and new String(byte[], String) exclusively.
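
As a sketch of that rule, here is a round trip where the encoding is named on both sides. It uses the Charset overloads, getBytes(Charset) and new String(byte[], Charset), which do the same job as the String-named overloads without the checked UnsupportedEncodingException. UTF-8 is just an example choice, and the class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitEncoding {
    // Going out of a String: always name the charset.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // Going into a String: name the same charset the bytes were written with.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u570b";
        byte[] stored = encode(original, StandardCharsets.UTF_8);
        String restored = decode(stored, StandardCharsets.UTF_8);
        // Same charset on both sides, so nothing is lost.
        System.out.println(restored.equals(original));
    }
}
```

The key point is that both sides agree on the charset, and that the charset can represent every character in the String.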



Oh, and on all modern MS Windows systems, the user can change the local
encoding with a click on the taskbar. So don't trust it to stay fixed
even on a single machine.
 
