Char encoding & decoding incorrect

terry

I find that encoding and decoding with EUC_CN does not work. I have written an example:

String big5ToGB(String big5)
{
    try
    {
        ByteArrayInputStream bais = new ByteArrayInputStream(big5.getBytes());
        InputStreamReader isr = new InputStreamReader(bais,"EUC_CN");
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        OutputStreamWriter osw = new OutputStreamWriter(baos,"EUC_CN");

        char[] cbuf=new char[80];
        int n;
        while((n=isr.read(cbuf))!=-1)
        {
            osw.write(cbuf,0,n);
        }
        osw.flush();
        osw.close();
        isr.close();
        return baos.toString();
    }
    catch(UnsupportedEncodingException e)
    {
        return null;
    }
    catch(IOException e)
    {
        return null;
    }
}

:
:
String s="欢迎";
debug.println(s);
String ss=ut.big5ToGB(s);
char[] c=s.toCharArray();
char[] cc=ss.toCharArray();
debug.println(c[0]);
debug.println(cc[0]);

I found that the two printouts are not equal.

If I change the encoding to Big5_HKSCS, the results match!

Could anyone tell me how to decode Chinese GB codes correctly?
 
Gordon Beaton

> I find that encoding and decoding with EUC_CN does not work. I have
> written an example:
>
> String big5ToGB(String big5)
> {
>     try
>     {
>         ByteArrayInputStream bais = new ByteArrayInputStream(big5.getBytes());
>         InputStreamReader isr = new InputStreamReader(bais,"EUC_CN");
>         ByteArrayOutputStream baos = new ByteArrayOutputStream();
>         OutputStreamWriter osw = new OutputStreamWriter(baos,"EUC_CN");
[...]

Your big5 String is *already* a Java String. There is absolutely no
need to convert it to a byte array just so you can read it into a
String again.

In fact, this is what your code does:

Unicode String -> byte array (big5.getBytes())
byte array -> Unicode String (isr.read())
Unicode String -> byte array (osw.write())
byte array -> Unicode String (baos.toString())

If everything had worked as you intended, the input and output Strings
would have been *identical*.

However, getBytes() and baos.toString() convert between byte array and
String using some default character encoding, so your data likely gets
corrupted by these methods.
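
To illustrate the kind of corruption a mismatched encoding causes, here is a minimal sketch. It assumes the JDK provides the GB2312 charset (the java.nio name for what the java.io API lists as EUC_CN) and uses UTF-8 as a stand-in for "some default encoding"; the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    // Assumption: this JDK ships GB2312 (known to java.io as EUC_CN).
    static final Charset GB = Charset.forName("GB2312");

    // Encode text with one charset, then decode the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u6b22\u8fce"; // the two characters from the question

        // Same charset both ways: the String survives.
        System.out.println(s.equals(roundTrip(s, GB, GB)));

        // Encoded as UTF-8 but decoded as GB2312: the String does not survive.
        System.out.println(s.equals(roundTrip(s, StandardCharsets.UTF_8, GB)));
    }
}
```

The first comparison should come out true and the second false, which is the same effect as in the original code whenever the platform default is not EUC_CN.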

The code I showed you earlier reads text from an external source using
one encoding, and writes it to an external destination using a
different encoding. You can't use it to change String -> String.
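
The earlier code is not reproduced in this thread, so the following is only a sketch of that kind of stream-to-stream conversion, not the original. The class name Transcoder and the method transcode are invented here, and the in-memory streams in main stand in for a real file or socket. It assumes the JDK supports both Big5 and GB2312.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

class Transcoder {
    // Read bytes in one encoding and write the same characters in another.
    static void transcode(InputStream in, String inEncoding,
                          OutputStream out, String outEncoding) throws IOException {
        Reader reader = new InputStreamReader(in, inEncoding);   // bytes -> chars
        Writer writer = new OutputStreamWriter(out, outEncoding); // chars -> bytes
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Two characters that exist in both Big5 and GB2312.
        String text = "\u4e2d\u6587";
        byte[] big5Bytes = text.getBytes("Big5");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(big5Bytes), "Big5", out, "GB2312");

        // Decoding the output with the output encoding gives back the same text.
        System.out.println(text.equals(out.toString("GB2312")));
    }
}
```

Note that this only changes how the characters are represented as bytes. It does not map traditional characters to simplified ones, and a character missing from the target encoding cannot be written faithfully.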

Internally, all Java Strings are in Unicode. The encoding *only*
becomes an issue when you want to read Strings from an external source
(such as a file or socket) or write them to an external destination,
i.e. when bytes are converted to Unicode or vice versa. For those things,
use InputStreamReader or OutputStreamWriter with an appropriate
encoding. If you need to convert between String and byte array in
other situations, always choose conversion methods that let you
specify which encoding to use.
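
As a rough sketch of what "conversion methods that let you specify the encoding" can look like: String.getBytes(String), new String(byte[], String) and ByteArrayOutputStream.toString(String) all take an encoding name. The class and helper names below are hypothetical, and the example assumes the JDK supports GB2312 (the java.nio name for EUC_CN).

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

class ExplicitEncoding {
    // String -> bytes, with the encoding named explicitly.
    static byte[] encode(String text, String encoding) throws UnsupportedEncodingException {
        return text.getBytes(encoding);
    }

    // bytes -> String, with the encoding named explicitly.
    static String decode(byte[] bytes, String encoding) throws UnsupportedEncodingException {
        return new String(bytes, encoding);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\u6b22\u8fce"; // the two characters from the question

        byte[] gb = encode(s, "GB2312");
        System.out.println(gb.length); // two bytes per character in this encoding

        // Decoding with the same encoding restores the original String.
        System.out.println(s.equals(decode(gb, "GB2312")));

        // The same applies to ByteArrayOutputStream: name the encoding in toString().
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(gb, 0, gb.length);
        System.out.println(s.equals(baos.toString("GB2312")));
    }
}
```

In the original big5ToGB method, the equivalent changes would be big5.getBytes("EUC_CN") and baos.toString("EUC_CN"), although the result would then simply be the input String again.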

/gordon
 
