Unicode chinese

Crouchez · Aug 29, 2007

String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?

Knute Johnson · Aug 29, 2007

Crouchez said:
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?

The font on the console may not be able to draw it. Try it with an
appropriate font in a JComponent of some variety.

sadiruddin · Aug 29, 2007

It returns 6 for me.

bugbear · Aug 29, 2007

Crouchez said:
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?

http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()

"The behavior of this method when this string cannot be encoded in the default charset is unspecified."

BugBear

Thomas Fritsch · Aug 29, 2007

Crouchez said:
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?

String.getBytes() uses the platform's default charset. See
<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()>

If the platform's default charset is "Cp1252" (like on my system and maybe
on Crouchez's), then chinese.getBytes() returns 2 bytes. By the way:
the 2 bytes are {63,63} which is just {'?','?'} because Cp1252 can't
encode these Chinese characters.

If the platform's default charset is "UTF-8" (like probably on
sadiruddin's system), then chinese.getBytes() returns 6 bytes.
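
A minimal sketch of the point above, passing the charset explicitly instead of relying on the platform default. ISO-8859-1 stands in here for a single-byte Western charset like Cp1252 (an assumption, chosen because ISO-8859-1 is guaranteed to exist on every JVM); the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // Encode with an explicit charset so the result does not depend on the platform.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";
        // Characters the charset cannot represent are replaced by '?' (byte 63).
        System.out.println(Arrays.toString(encode(chinese, StandardCharsets.ISO_8859_1)));
        // UTF-8 can encode both characters, three bytes each.
        System.out.println(encode(chinese, StandardCharsets.UTF_8).length);
    }
}
```

With the charset spelled out, the same program should give the same byte counts on every machine, which is why the two posters above saw different results only when the default was used.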

Andreas Leitgeb · Aug 29, 2007

bugbear said:
http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()
"The behavior of this method when this string cannot be encoded in the default charset is unspecified."

While it's not specified, and could theoretically change over time,
the current implementation seems to encode your string as two
question marks, which accounts for length==2.

The other poster, who answered that it gave "6" for him, likely
has a UTF-8 based system encoding (or UTF-8 itself).

On Unix systems, the system encoding generally depends on the
environment variable LANG (possibly overridden by certain
LC_... variables whose names I never remember).
For Windows, I don't know.
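
To see which encoding a given JVM actually picked up from its environment, something like the following sketch can be run on each machine. The class and method names are hypothetical; Charset.defaultCharset() and the file.encoding property are standard, but what they report depends on the platform and the JDK version.

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    // Name of the charset that String.getBytes() with no argument will use.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        // May print null, for example on Windows, where LANG is usually unset.
        System.out.println("LANG:            " + System.getenv("LANG"));
    }
}
```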

Roedy Green · Aug 29, 2007

String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?

I modified your code a little, so it will make the problem clear:

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args )
    {
        // show the platform default encoding that getBytes() will use
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        byte[] b = chinese.getBytes();
        for ( int i=0; i<b.length; i++ )
        {
            System.out.println( b[i] );
        }
        // prints
        // Cp1252
        // 63
        // 63
        // in other words ??. Those two chars are not available in your
        // default encoding.
    }
}

I further modified your code to choose the encoding explicitly:

import java.io.UnsupportedEncodingException;

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, designed to support Chinese.
        byte[] b = chinese.getBytes( "Big5-HKSCS" );
        for ( int i=0; i<b.length; i++ )
        {
            // mask with 0xff to print each byte as an unsigned value
            System.out.println( 0xff & b[i] );
        }
        // prints
        // Cp1252
        // 164
        // 164
        // 164
        // 112
        // more like you would expect.
    }
}

Crouchez · Aug 29, 2007

cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

Crouchez · Aug 29, 2007

Why have you done an AND on this?
System.out.println( 0xff & b[i] );

Roedy Green · Aug 30, 2007

Why have you done an AND on this?
System.out.println( 0xff & b[i] );

see http://mindprod.com/jgloss/unsigned.html
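
The short version, as a sketch: Java's byte type is signed (-128 to 127), so a byte such as 0xA4 prints as a negative number unless it is masked. ANDing with 0xff promotes the byte to an int and keeps only the low 8 bits. The class and method names below are only for illustration.

```java
public class UnsignedByteDemo {
    // Widen a signed byte to an int in the range 0..255.
    static int toUnsigned(byte b) {
        return 0xff & b;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xA4;                                   // bit pattern 1010 0100
        System.out.println(b);                                  // prints -92
        System.out.println(toUnsigned(b));                      // prints 164
        System.out.println(Integer.toHexString(toUnsigned(b))); // prints a4
    }
}
```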

Roedy Green · Aug 30, 2007

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

I suspect your parents punished you for curiosity as a toddler.
EXPERIMENT!

import java.io.UnsupportedEncodingException;

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, UTF-8 supports everything
        // including Chinese.
        byte[] b = chinese.getBytes( "UTF-8" );
        for ( int i=0; i<b.length; i++ )
        {
            // mask with 0xff so each byte prints as two hex digits
            System.out.println( Integer.toHexString( 0xff & b[i] ) );
        }
        // prints
        // Cp1252
        // e4
        // b8
        // ad
        // e5
        // b0
        // 8f

        // why those chars?
        // BOM is ef bb bf, so that is not it.
        // see http://mindprod.com/jgloss/utf.html#UTF8ENCODER
        // codes from 0x800 to 0xffff take 3 bytes to encode.
    }
}
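
For anyone who wants to see where e4 b8 ad comes from, here is a hand-rolled sketch of the 3-byte case only. It is not how the JDK implements its encoder, and the class and method names are invented for this example. The 16 bits of the char are split 4/6/6 and dropped into the pattern 1110xxxx 10xxxxxx 10xxxxxx.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ThreeByteSketch {
    // Encode one char in the range 0x0800..0xFFFF (surrogates excluded)
    // as 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encodeThreeBytes(char c) {
        if (c < 0x0800 || Character.isSurrogate(c)) {
            throw new IllegalArgumentException("not a 3-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),          // top 4 bits
            (byte) (0x80 | ((c >> 6) & 0x3F)),  // middle 6 bits
            (byte) (0x80 | (c & 0x3F))          // low 6 bits
        };
    }

    public static void main(String[] args) {
        for (char c : "\u4e2d\u5c0f".toCharArray()) {
            byte[] manual = encodeThreeBytes(c);
            byte[] library = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
            // Compare the hand-rolled bytes with what the library produces.
            System.out.println(Arrays.equals(manual, library));
        }
    }
}
```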

bugbear · Aug 30, 2007

Roedy said:
I suspect your parents punished you for curiosity as a toddler.
EXPERIMENT!

Or read the manual;

http://unicode.org/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

I'd always prefer a clear definitive spec
to the results of experiment.

Reverse engineering complex systems
can be time consuming and error prone.

BugBear

steve · Aug 30, 2007

cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

not always.

Steve

Crouchez · Aug 30, 2007

Crouchez said:
cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

So Chinese characters take up 3 bytes with UTF-8 and 2 with 'native
encodings'? Imagine the extra bandwidth for a Chinese server if it uses
UTF-8! 50% more!
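
How much extra depends on the mix of characters, since ASCII takes only one byte in UTF-8. A rough sketch of the arithmetic, using UTF-16BE as a stand-in for a 2-bytes-per-character encoding (an assumption; legacy encodings such as Big5 are not guaranteed to be installed on every JVM). The sample strings and names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {
    // Number of bytes the string occupies in the given charset.
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String pureChinese = "\u4e2d\u5c0f";
        String withMarkup = "<p>\u4e2d\u5c0f</p>";
        for (String s : new String[] { pureChinese, withMarkup }) {
            System.out.println(size(s, StandardCharsets.UTF_8) + " bytes in UTF-8, "
                    + size(s, StandardCharsets.UTF_16BE) + " bytes in UTF-16BE");
        }
        // prints
        // 6 bytes in UTF-8, 4 bytes in UTF-16BE
        // 13 bytes in UTF-8, 18 bytes in UTF-16BE
    }
}
```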

Crouchez · Aug 30, 2007

bugbear said:
Or read the manual;

http://unicode.org/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

I'd always prefer a clear definitive spec
to the results of experiment.

Reverse engineering complex systems
can be time consuming and error prone.

BugBear

I prefer the experiments personally - those technical manuals are usually
way too wordy

Crouchez · Aug 30, 2007

Thanks Roedy, nice site there - often comes in useful for all types of java
stuff

Crouchez · Aug 30, 2007

steve said:
cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?

not always.

Steve

When is it not?

Crouchez · Aug 30, 2007

Roedy Green said:
Why have you done an AND on this?
System.out.println( 0xff & b[i] );

see http://mindprod.com/jgloss/unsigned.html

A lot of that baffles me. I remember doing floating point and binary
stuff on paper years ago and never used it for real. What's the main use for
bitwise operations and bit shifting?

Thomas Fritsch · Aug 30, 2007

Crouchez said:
steve said:

cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2
bytes per character?

not always.

Steve

When is it not?

You can find out yourself, either by experimenting
System.out.println("\u0000".getBytes("UTF-8");
System.out.println("\u007F".getBytes("UTF-8");
System.out.println("\u0080".getBytes("UTF-8");
System.out.println("\u07FF".getBytes("UTF-8");
System.out.println("\u0800".getBytes("UTF-8");
System.out.println("\uFFFF".getBytes("UTF-8");
or more easily by reading the UTF-8 documentation
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Thomas Fritsch · Aug 30, 2007

Thomas said:
You can find out yourself, either by experimenting
System.out.println("\u0000".getBytes("UTF-8");
System.out.println("\u007F".getBytes("UTF-8");
System.out.println("\u0080".getBytes("UTF-8");
System.out.println("\u07FF".getBytes("UTF-8");
System.out.println("\u0800".getBytes("UTF-8");
System.out.println("\uFFFF".getBytes("UTF-8");

Oops, I meant
System.out.println("\u0000".getBytes("UTF-8").length);
....
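
Spelled out as a complete program, the experiment might look like the sketch below. The class and method names are invented here, and StandardCharsets.UTF_8 is used instead of the string "UTF-8" only to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthBoundaries {
    // Number of bytes UTF-8 needs for a single char.
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Code points on either side of each UTF-8 length boundary.
        char[] samples = { 0x0000, 0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF };
        for (char c : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", (int) c, utf8Length(c));
        }
    }
}
```

The lengths step from 1 to 2 at U+0080 and from 2 to 3 at U+0800, which is why both Chinese characters in this thread need three bytes each.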
