Unicode chinese

Crouchez

String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?
 
Knute Johnson

Crouchez said:
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?

The font on the console may not be able to draw it. Try it with an
appropriate font in a JComponent of some variety.
 
Thomas Fritsch

Crouchez said:
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?
String.getBytes() uses the platform's default charset. See
<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()>

If the platform's default charset is "Cp1252" (like on my system and maybe
on Crouchez's), then chinese.getBytes() returns 2 bytes. By the way:
the 2 bytes are {63,63}, which is just {'?','?'}, because Cp1252
can't encode these two characters.

If the platform's default charset is "UTF-8" (like probably on
sadiruddin's system), then chinese.getBytes() returns 6 bytes.
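
The difference described above can be reproduced on any machine by naming the charset explicitly instead of relying on the platform default. What follows is a minimal sketch, not code from the thread: the class name `DefaultCharsetDemo` and the helper `encodedLength` are made up for illustration, and ISO-8859-1 stands in for Cp1252 because every Java runtime is required to ship it and it likewise has no mapping for these two characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative, not from the thread.
class DefaultCharsetDemo {

    // Number of bytes produced when text is encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";

        // ISO-8859-1 (assumed stand-in for Cp1252) cannot represent either
        // character, so each one is replaced by '?' (byte value 63).
        System.out.println(Arrays.toString(chinese.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(encodedLength(chinese, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent both characters, using three bytes for each.
        System.out.println(encodedLength(chinese, StandardCharsets.UTF_8));
    }
}
```

Run as written, this should print `[63, 63]`, then `2`, then `6`, regardless of the platform's default charset.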
 
Andreas Leitgeb

bugbear said:
http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()
"The behavior of this method when this string cannot be encoded in the default charset is unspecified."

While it's not specified, and could theoretically change over time,
the current implementation seems to encode your string as two
question marks, which accounts for length==2.

The other one, who answered that it gave "6" for him, likely
has a UTF-8-based system encoding (or UTF-8 itself).

On Unix-systems, the system-encoding generally depends on the
environment variable LANG (and possibly overridden by certain
LC_... variables whose names I never remember).
For Windows, I don't know.
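
To see which default a particular JVM actually picked up from its environment, something like the following may help. It is only a sketch: the class name `ShowDefaultCharset` and the helper `defaultCharsetName` are invented for this example. Note that from JDK 18 onward the default charset is UTF-8 unless configured otherwise, so on a recent runtime the result may no longer follow LANG.

```java
import java.nio.charset.Charset;

// Hypothetical helper class; not part of the original thread.
class ShowDefaultCharset {

    // Canonical name of the charset that String.getBytes() with no
    // argument will use in this JVM.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The system property and the Charset API usually agree,
        // but the Charset API is the authoritative answer.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```

Running it once with `LANG=C` and once with a UTF-8 locale on a Unix system is a quick way to check how the environment influences the result on a given JDK.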
 
Roedy Green

String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?

I modified your code a little, so it will make the problem clear:

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args )
    {
        System.out.println( System.getProperty( "file.encoding" ));
        String chinese = "\u4e2d\u5c0f";
        // no charset given, so the platform default is used
        byte[] b = chinese.getBytes();
        for ( int i=0; i<b.length; i++ )
        {
            System.out.println( b[i] );
        }
        // prints
        // Cp1252
        // 63
        // 63
        // in other words ??. Those two chars are not available in your
        // default encoding.
    }
}


I further modified your code to choose the encoding explicitly:

import java.io.UnsupportedEncodingException;
public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws
            UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ));
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, designed to support Chinese.
        byte[] b = chinese.getBytes( "Big5-HKSCS" );
        for ( int i=0; i<b.length; i++ )
        {
            // mask with 0xff so the byte prints as 0..255
            System.out.println( 0xff & b[i] );
        }
        // prints
        // Cp1252
        // 164
        // 164
        // 164
        // 112 more like you would expect.
    }
}
 
Crouchez

cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?
 
Crouchez

Roedy Green said:
System.out.println( 0xff & b[i] );

Why have you done an AND on this?
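
A minimal sketch of what the mask does, assuming the intent was simply to print each byte as a value from 0 to 255. Java's `byte` is signed, so any byte with the high bit set prints as a negative number unless it is widened with a mask. The class name `ByteMaskDemo` and the helper `toUnsigned` are invented for this illustration.

```java
// Hypothetical demo class; the names are illustrative, not from the thread.
class ByteMaskDemo {

    // Widen a byte to an int without sign extension, giving 0..255.
    static int toUnsigned(byte b) {
        return 0xff & b;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xA4;               // first Big5 byte from the example above

        System.out.println(b);              // signed view: prints -92
        System.out.println(toUnsigned(b));  // masked view: prints 164
    }
}
```

Since Java 8, `Byte.toUnsignedInt(b)` performs the same conversion and may read more clearly than the explicit mask.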
 
Roedy Green

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

I suspect your parents punished you for curiosity as a toddler.
EXPERIMENT!

import java.io.UnsupportedEncodingException;
public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws
            UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ));
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, UTF-8 supports everything
        // including Chinese.
        byte[] b = chinese.getBytes( "UTF-8" );
        for ( int i=0; i<b.length; i++ )
        {
            System.out.println( Integer.toHexString( 0xff & b[i] ));
        }
        // prints
        // Cp1252
        // e4
        // b8
        // ad
        // e5
        // b0
        // 8f

        // why those chars?
        // BOM is ef bb bf, so that is not it.
        // see http://mindprod.com/jgloss/utf.html#UTF8ENCODER
        // codes from 0x800 to 0xffff take 3 bytes to encode.
    }
}
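
To make the three-byte rule concrete, here is a hand-rolled sketch of the UTF-8 bit layout for a single char in the range 0x0800 to 0xFFFF. It is for illustration only (in real code `String.getBytes` does this for you), and the class name `Utf8ThreeByteDemo` and both helper methods are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are illustrative, not from the thread.
class Utf8ThreeByteDemo {

    // Encode one char in 0x0800..0xFFFF (surrogates excluded) by hand.
    static byte[] encodeThreeBytes(char c) {
        if (c < 0x0800 || Character.isSurrogate(c)) {
            throw new IllegalArgumentException("not a 3-byte UTF-8 character");
        }
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),          // 1110xxxx: top 4 bits
            (byte) (0x80 | ((c >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (c & 0x3F))          // 10xxxxxx: low 6 bits
        };
    }

    // Format bytes as space-separated two-digit lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", 0xff & b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeThreeBytes('\u4e2d'))); // e4 b8 ad
        System.out.println(hex(encodeThreeBytes('\u5c0f'))); // e5 b0 8f

        // Cross-check the hand encoding against the library encoder.
        byte[] library = "\u4e2d".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(library, encodeThreeBytes('\u4e2d'))); // true
    }
}
```

The 16 bits of the char are split 4 + 6 + 6 across the three bytes, which is why the output matches the e4 b8 ad and e5 b0 8f sequences listed above.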
 
steve

cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

not always.

Steve
 
Crouchez

Crouchez said:
cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

So Chinese characters take up 3 bytes with UTF-8 and 2 with 'native
encodings'?? Imagine the extra bandwidth for a Chinese server if it uses
UTF-8! +50%!
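
Whether the overhead really reaches 50% depends on how much of the payload is ASCII, since markup, digits and whitespace take one byte each in UTF-8. The sketch below compares encoded sizes for a pure-Chinese string and for the same string wrapped in a small piece of markup. The class name `EncodingSizeDemo`, the helper `size` and the sample markup are assumptions made for this example. UTF-16BE is used as the guaranteed two-bytes-per-char encoding, and Big5 is only tried if the running JVM happens to support it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names and sample text are illustrative.
class EncodingSizeDemo {

    // Encoded size of text in bytes for the given charset.
    static int size(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String hanOnly = "\u4e2d\u5c0f";
        String markup = "<p>\u4e2d\u5c0f</p>";

        for (String s : new String[] { hanOnly, markup }) {
            System.out.println(s.length() + " chars:");
            System.out.println("  UTF-8:    " + size(s, StandardCharsets.UTF_8));
            System.out.println("  UTF-16BE: " + size(s, StandardCharsets.UTF_16BE));
            // Big5 is an optional charset, so check before using it.
            if (Charset.isSupported("Big5")) {
                System.out.println("  Big5:     " + size(s, Charset.forName("Big5")));
            }
        }
    }
}
```

For the two-character string UTF-8 needs 6 bytes against 4 for UTF-16BE, but for the marked-up string UTF-8 needs 13 bytes against 18, so the balance can tip either way depending on the content.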
 
Crouchez

Roedy Green said:
see http://mindprod.com/jgloss/utf.html#UTF8ENCODER

Thanks Roedy, nice site there. It often comes in useful for all types of
Java stuff.
 
Thomas Fritsch

Crouchez said:
steve said:
cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2
bytes per character?

not always.

Steve

When is it not?
You can find out yourself, either by experimenting
System.out.println("\u0000".getBytes("UTF-8");
System.out.println("\u007F".getBytes("UTF-8");
System.out.println("\u0080".getBytes("UTF-8");
System.out.println("\u07FF".getBytes("UTF-8");
System.out.println("\u0800".getBytes("UTF-8");
System.out.println("\uFFFF".getBytes("UTF-8");
or more easily by reading the UTF-8 documentation
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
 
Thomas Fritsch

Thomas said:
You can find out yourself, either by experimenting
System.out.println("\u0000".getBytes("UTF-8");
System.out.println("\u007F".getBytes("UTF-8");
System.out.println("\u0080".getBytes("UTF-8");
System.out.println("\u07FF".getBytes("UTF-8");
System.out.println("\u0800".getBytes("UTF-8");
System.out.println("\uFFFF".getBytes("UTF-8");
Oops, I meant
System.out.println("\u0000".getBytes("UTF-8").length);
....
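
Written out in full, the corrected experiment might look like the sketch below. The class name `Utf8LengthBoundaries` and the helper `utf8Length` are invented here, and `StandardCharsets.UTF_8` is used instead of the string name "UTF-8" so that no checked exception has to be declared.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative, not from the thread.
class Utf8LengthBoundaries {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The first and last code unit of each UTF-8 length class
        // that fits in a single Java char.
        String[] samples = { "\u0000", "\u007F", "\u0080", "\u07FF", "\u0800", "\uFFFF" };

        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", (int) s.charAt(0), utf8Length(s));
        }
    }
}
```

The lengths come out as 1, 1, 2, 2, 3 and 3, so the boundaries sit at U+0080 and U+0800. Characters above U+FFFF, which Java stores as a surrogate pair, take 4 bytes in UTF-8.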
 
