Crouchez
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);
Why does this return 2?
bugbear said:
String.getBytes() uses the platform's default charset. See
http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()
"The behavior of this method when this string cannot be encoded in the default charset is unspecified."
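To illustrate the point about the default charset, here is a minimal sketch that names the charset explicitly instead of relying on the platform. The class name `ExplicitCharsetLengths` and the helper `encodedLength` are made up for this example, and it assumes a newer JDK than the 1.4.2 docs linked above (`StandardCharsets` arrived in Java 7).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetLengths {
    // Hypothetical helper: number of bytes the string occupies in the given charset.
    public static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";
        // Whatever the JVM picked up from the platform; this varies by machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        // ISO-8859-1 cannot represent either character, so each one is replaced by '?'.
        System.out.println(encodedLength(chinese, StandardCharsets.ISO_8859_1)); // 2
        // UTF-16BE uses one 16-bit unit for each of these characters.
        System.out.println(encodedLength(chinese, StandardCharsets.UTF_16BE));   // 4
        // UTF-8 needs three bytes for each of these characters.
        System.out.println(encodedLength(chinese, StandardCharsets.UTF_8));      // 6
    }
}
```

The same string gives 2, 4 or 6 bytes depending only on the charset, which is why the no-argument getBytes() is risky for anything that leaves the local machine.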
Roedy Green said:
I modified your code a little, so it will make the problem clear:
public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args )
    {
        // show the platform default encoding that getBytes() will use
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        byte[] b = chinese.getBytes();
        for ( int i=0; i<b.length; i++ )
        {
            System.out.println( b[i] );
        }
        // prints
        // Cp1252
        // 63
        // 63
        // in other words ??. Those two chars are not available in your
        // default encoding.
    }
}
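The silent substitution of ? (63) can be detected up front. The following is only a sketch: the class `UnmappableCheck` and its methods are invented for illustration, and ISO-8859-1 stands in for Cp1252 because it is guaranteed to be present on every JVM and likewise has no Chinese characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UnmappableCheck {
    // True if every character of s has a representation in cs.
    public static boolean fits(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    // Strict encoding: throws instead of substituting '?' for unmappable characters.
    public static byte[] encodeStrict(String s, Charset cs) throws CharacterCodingException {
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = enc.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";
        System.out.println(fits(chinese, StandardCharsets.ISO_8859_1)); // false
        System.out.println(fits(chinese, StandardCharsets.UTF_8));      // true
        try {
            encodeStrict(chinese, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException e) {
            // Reached because neither character exists in ISO-8859-1.
            System.out.println("cannot encode: " + e);
        }
    }
}
```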
I further modified your code to choose the encoding explicitly:
import java.io.UnsupportedEncodingException;

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws
            UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, designed to support Chinese.
        byte[] b = chinese.getBytes( "Big5-HKSCS" );
        for ( int i=0; i<b.length; i++ )
        {
            // mask with 0xff so each byte prints as 0..255 instead of signed
            System.out.println( 0xff & b[i] );
        }
        // prints
        // Cp1252
        // 164
        // 164
        // 164
        // 112
        // more like you would expect.
    }
}
Crouchez said:
Why have you done an AND on this?
System.out.println( 0xff & b[i] );
cheers.
If I do
byte[] b = chinese.getBytes( "UTF-8" );
b.length = 6. But why 6 when I thought Chinese characters take up 2 bytes per character?
bugbear said:
Or read the manual:
http://unicode.org/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
I'd always prefer a clear, definitive spec to the results of experiment.
Reverse engineering complex systems can be time-consuming and error-prone.
BugBear
Roedy Green said:
b.length = 6. But why 6 when I thought Chinese characters take up 2 bytes per character?
I suspect your parents punished you for curiosity as a toddler.
EXPERIMENT!
import java.io.UnsupportedEncodingException;

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws
            UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, UTF-8 supports everything
        // including Chinese.
        byte[] b = chinese.getBytes( "UTF-8" );
        for ( int i=0; i<b.length; i++ )
        {
            // mask with 0xff so the hex shows one unsigned byte
            System.out.println( Integer.toHexString( 0xff & b[i] ) );
        }
        // prints
        // Cp1252
        // e4
        // b8
        // ad
        // e5
        // b0
        // 8f
        // why those chars?
        // BOM is ef bb bf, so that is not it.
        // see http://mindprod.com/jgloss/utf.html#UTF8ENCODER
        // codes >= 0x800 take 3 bytes to encode.
    }
}
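The bytes e4 b8 ad and e5 b0 8f can be reproduced by hand from the three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx. A small sketch follows; `Utf8ByHand` and `encodeThreeByte` are names invented here, and the method deliberately handles only the 0x0800..0xFFFF range (it does not reject surrogates, which a real encoder must).

```java
public class Utf8ByHand {
    // Splits a code point in 0x0800..0xFFFF into 4 + 6 + 6 bits and adds
    // the UTF-8 marker bits: 1110xxxx 10xxxxxx 10xxxxxx.
    public static int[] encodeThreeByte(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        return new int[] {
            0xE0 | (cp >> 12),          // top 4 bits
            0x80 | ((cp >> 6) & 0x3F),  // middle 6 bits
            0x80 | (cp & 0x3F)          // low 6 bits
        };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x4e2d, 0x5c0f }) {
            StringBuilder sb = new StringBuilder();
            for (int b : encodeThreeByte(cp)) {
                sb.append(Integer.toHexString(b)).append(' ');
            }
            System.out.println(Integer.toHexString(cp) + " -> " + sb.toString().trim());
        }
        // prints
        // 4e2d -> e4 b8 ad
        // 5c0f -> e5 b0 8f
    }
}
```

Sixteen bits of payload do not fit in two bytes once the marker bits are added, which is where the third byte per character comes from.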
steve said:
b.length = 6. But why 6 when I thought Chinese characters take up 2 bytes per character?
Not always.
Steve
Roedy Green said:
Why have you done an AND on this?
System.out.println( 0xff & b[i] );
See http://mindprod.com/jgloss/unsigned.html
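In short, Java's byte is signed (-128..127), so any byte with the high bit set prints as a negative number unless it is masked. A minimal sketch, with `UnsignedByteDemo` and `toUnsigned` as made-up names:

```java
public class UnsignedByteDemo {
    // Widens to int and keeps only the low 8 bits, giving 0..255.
    public static int toUnsigned(byte b) {
        return 0xff & b;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xe4; // first UTF-8 byte of \u4e2d
        System.out.println(b);                                  // -28
        System.out.println(toUnsigned(b));                      // 228
        System.out.println(Integer.toHexString(b));             // ffffffe4 (sign-extended)
        System.out.println(Integer.toHexString(toUnsigned(b))); // e4
    }
}
```

On Java 8 and later, Byte.toUnsignedInt(b) does the same thing as the mask.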
Crouchez said:
steve said: "not always."
When is it not?
Thomas said:
You can find out yourself, either by experimenting
Oops, I meant
System.out.println( "\u0000".getBytes( "UTF-8" ).length );
System.out.println( "\u007F".getBytes( "UTF-8" ).length );
System.out.println( "\u0080".getBytes( "UTF-8" ).length );
System.out.println( "\u07FF".getBytes( "UTF-8" ).length );
System.out.println( "\u0800".getBytes( "UTF-8" ).length );
System.out.println( "\uFFFF".getBytes( "UTF-8" ).length );
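The boundary experiment can be packaged as a small runnable class. This is a sketch under assumptions: `Utf8Boundaries` and `utf8Length` are invented names, and `StandardCharsets.UTF_8` (Java 7+) is used instead of the string name to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Boundaries {
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One character on each side of the 1/2-byte and 2/3-byte boundaries.
        String[] samples = { "\u0000", "\u007F", "\u0080", "\u07FF", "\u0800", "\uFFFF" };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d%n", (int) s.charAt(0), utf8Length(s));
        }
        // prints
        // U+0000 -> 1
        // U+007F -> 1
        // U+0080 -> 2
        // U+07FF -> 2
        // U+0800 -> 3
        // U+FFFF -> 3
    }
}
```

So a character takes two bytes in UTF-8 only in the range U+0080 to U+07FF; both characters in the original string lie above U+0800 and take three bytes each.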