Unicode chinese

Discussion in 'Java' started by Crouchez, Aug 29, 2007.

  1. Crouchez

    Crouchez Guest

    String chinese = "\u4e2d\u5c0f";
    System.out.println(chinese.getBytes().length);

    Why does this return 2?
     
    Crouchez, Aug 29, 2007
    #1

  2. Crouchez wrote:
    > String chinese = "\u4e2d\u5c0f";
    > System.out.println(chinese.getBytes().length);
    >
    > Why does this return 2?
    >
    >


    The font on the console may not be able to draw it. Try it with an
    appropriate font in a JComponent of some variety.

    --

    Knute Johnson
    email s/nospam/knute/
     
    Knute Johnson, Aug 29, 2007
    #2

  3. Crouchez

    Guest

    It returns 6 for me.
     
    , Aug 29, 2007
    #3
  4. Crouchez

    bugbear Guest

    bugbear, Aug 29, 2007
    #4
  5. Crouchez wrote:
    > String chinese = "\u4e2d\u5c0f";
    > System.out.println(chinese.getBytes().length);
    >
    > Why does this return 2?
    >
    >

    String.getBytes() uses the platform's default charset. See
    <http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()>

    If the platform's default charset is "Cp1252" (as on my system and maybe
    on Crouchez's), then chinese.getBytes() returns 2 bytes. By the way:
    the 2 bytes are {63,63}, which is just {'?','?'}, because Cp1252 has no
    mapping for these characters and the encoder substitutes '?' for each.

    If the platform's default charset is "UTF-8" (as probably on
    sadiruddin's system), then chinese.getBytes() returns 6 bytes.
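
    A minimal sketch of that difference, passing the charset explicitly so the
    result no longer depends on the platform. The class and method names are
    made up for illustration, and ISO-8859-1 stands in for Cp1252 here only
    because it is guaranteed to exist on every JVM and also has no mapping for
    these two characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class ExplicitCharsetDemo {
    // Number of bytes the string occupies when encoded in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";

        // Platform dependent: the no-argument form uses the default charset.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + chinese.getBytes().length);

        // Unmappable characters are replaced by '?', one byte each: 2 bytes.
        System.out.println("ISO-8859-1: "
                + encodedLength(chinese, StandardCharsets.ISO_8859_1));

        // Both characters take three bytes in UTF-8: 6 bytes.
        System.out.println("UTF-8: "
                + encodedLength(chinese, StandardCharsets.UTF_8));
    }
}
```

    Only the first line of output should vary from machine to machine.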


    --
    Thomas
     
    Thomas Fritsch, Aug 29, 2007
    #5
  6. bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > Crouchez wrote:
    >> String chinese = "\u4e2d\u5c0f";
    >> System.out.println(chinese.getBytes().length);
    >> Why does this return 2?

    > http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()
    > "The behavior of this method when this string cannot be encoded in the default charset is unspecified."


    While it's not specified, and could theoretically change over time,
    the current implementation seems to encode your string as two
    question marks, which accounts for length==2.

    The other poster, who answered that it gave "6" for him, likely
    has a UTF-8-based system encoding (or UTF-8 itself).

    On Unix systems, the system encoding generally depends on the
    environment variable LANG (possibly overridden by certain
    LC_... variables whose names I never remember).
    For Windows, I don't know.
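
    To see what a particular JVM actually picked up, something like the
    following sketch can be run on each machine. The class and method names
    are illustrative; Charset.defaultCharset(), the file.encoding property and
    System.getenv are standard, but which of them reflects LANG depends on the
    platform and the JDK version.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original thread.
class ShowDefaultCharset {
    // Summarise the settings that influence the no-argument String.getBytes().
    static String describe() {
        return "default charset = " + Charset.defaultCharset()
                + ", file.encoding = " + System.getProperty("file.encoding")
                + ", LANG = " + System.getenv("LANG"); // null when LANG is unset
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

    Note that from JDK 18 onward the default charset is UTF-8 regardless of
    the locale, so on a recent JVM the LANG setting may make no difference.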
     
    Andreas Leitgeb, Aug 29, 2007
    #6
  7. Crouchez

    Roedy Green Guest

    On Wed, 29 Aug 2007 03:47:16 GMT, "Crouchez"
    <> wrote, quoted or indirectly
    quoted someone who said :

    >String chinese = "\u4e2d\u5c0f";
    >System.out.println(chinese.getBytes().length);
    >
    >Why does this return 2?


    I modified your code a little, so it will make the problem clear:

    public class Chinese
    {
        /**
         * test harness
         *
         * @param args not used
         */
        public static void main ( String[] args )
        {
            // show the platform default encoding
            System.out.println( System.getProperty( "file.encoding" ));
            String chinese = "\u4e2d\u5c0f";
            byte[] b = chinese.getBytes();
            for ( int i=0; i<b.length; i++ )
            {
                System.out.println( b[i] );
            }
            // prints
            // Cp1252
            // 63
            // 63
            // in other words ??. Those two chars are not available in your
            // default encoding.
        }
    }


    I further modified your code to choose the encoding explicitly:

    import java.io.UnsupportedEncodingException;
    public class Chinese
    {
        /**
         * test harness
         *
         * @param args not used
         */
        public static void main ( String[] args ) throws
                UnsupportedEncodingException
        {
            System.out.println( System.getProperty( "file.encoding" ));
            String chinese = "\u4e2d\u5c0f";
            // explicit choice of encoding, designed to support Chinese.
            byte[] b = chinese.getBytes( "Big5-HKSCS" );
            for ( int i=0; i<b.length; i++ )
            {
                // mask with 0xff to print the byte as unsigned 0..255
                System.out.println( 0xff & b[i] );
            }
            // prints
            // Cp1252
            // 164
            // 164
            // 164
            // 112 more like you would expect.
        }
    }
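
    The 0xff mask in the second listing is there because Java bytes are
    signed. A small sketch of the effect, with illustrative class and method
    names:

```java
// Hypothetical demo class, not part of the original thread.
class UnsignedByteDemo {
    // Widen a signed Java byte to its unsigned value in the range 0..255.
    static int toUnsigned(byte b) {
        return 0xff & b;
    }

    public static void main(String[] args) {
        byte raw = (byte) 0xA4;                      // bit pattern 1010 0100
        System.out.println(raw);                     // prints -92
        System.out.println(toUnsigned(raw));         // prints 164
        System.out.println(Byte.toUnsignedInt(raw)); // prints 164 (Java 8+)
    }
}
```

    Without the mask, the first three bytes of the Big5-HKSCS output would
    print as -92 rather than 164.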




    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Aug 29, 2007
    #7
  8. Crouchez

    Crouchez Guest

    cheers.

    If I do

    byte[] b = chinese.getBytes( "UTF-8" );

    b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
    per character?
     
    Crouchez, Aug 29, 2007
    #8
  9. Crouchez

    Crouchez Guest

    "Roedy Green" <> wrote in message
    news:...
    > byte[] b = chinese.getBytes( "Big5-HKSCS" );
    > for ( int i=0; i<b.length; i++ )
    > {
    > System.out.println( 0xff & b[i] );
    > }


    Why have you done an AND on this?
    System.out.println( 0xff & b[i] );
     
    Crouchez, Aug 29, 2007
    #9
  10. Crouchez

    Roedy Green Guest

    Roedy Green, Aug 30, 2007
    #10
  11. Crouchez

    Roedy Green Guest

    On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
    <> wrote, quoted or indirectly
    quoted someone who said :

    >b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
    >per character?


    I suspect your parents punished you for curiosity as a toddler.
    EXPERIMENT!

    import java.io.UnsupportedEncodingException;
    public class Chinese
    {
        /**
         * test harness
         *
         * @param args not used
         */
        public static void main ( String[] args ) throws
                UnsupportedEncodingException
        {
            System.out.println( System.getProperty( "file.encoding" ));
            String chinese = "\u4e2d\u5c0f";
            // explicit choice of encoding, UTF-8 supports everything
            // including Chinese.
            byte[] b = chinese.getBytes( "UTF-8" );
            for ( int i=0; i<b.length; i++ )
            {
                System.out.println( Integer.toHexString( 0xff & b[i] ));
            }
            // prints
            // Cp1252
            // e4
            // b8
            // ad
            // e5
            // b0
            // 8f

            // why those chars?
            // BOM is ef bb bf, so that is not it.
            // see http://mindprod.com/jgloss/utf.html#UTF8ENCODER
            // codes from 0x800 to 0xffff take 3 bytes to encode.
        }
    }
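
    The three-byte rule can also be checked by hand. The sketch below packs a
    code point into the UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx; the class
    and method names are invented for this illustration, and it covers only
    the three-byte range.

```java
// Hypothetical demo class, not part of the original thread.
class Utf8ByHand {
    // Encode one code point in the range 0x800..0xFFFF as three UTF-8 bytes.
    // Surrogates (0xD800..0xDFFF) are not valid input; this sketch does not
    // check for them.
    static int[] encode3(int cp) {
        if (cp < 0x800 || cp > 0xFFFF) {
            throw new IllegalArgumentException("not a 3-byte code point");
        }
        return new int[] {
            0xE0 | (cp >> 12),          // 1110xxxx: top 4 bits
            0x80 | ((cp >> 6) & 0x3F),  // 10xxxxxx: middle 6 bits
            0x80 | (cp & 0x3F)          // 10xxxxxx: low 6 bits
        };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x4E2D, 0x5C0F }) {
            StringBuilder sb = new StringBuilder();
            for (int b : encode3(cp)) {
                sb.append(Integer.toHexString(b)).append(' ');
            }
            // prints "e4 b8 ad" and then "e5 b0 8f"
            System.out.println(sb.toString().trim());
        }
    }
}
```

    Each Chinese character here carries 16 bits of payload, but the prefix
    bits on every byte push the total to three bytes.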
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Aug 30, 2007
    #11
  12. Crouchez

    bugbear Guest

    Roedy Green wrote:
    > On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
    > <> wrote, quoted or indirectly
    > quoted someone who said :
    >
    >> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
    >> per character?

    >
    > I suspect your parents punished you for curiosity as a toddler.
    > EXPERIMENT!


    Or read the manual;

    http://unicode.org/
    http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

    I'd always prefer a clear definitive spec
    to the results of experiment.

    Reverse engineering complex systems
    can be time consuming and error prone.

    BugBear
     
    bugbear, Aug 30, 2007
    #12
  13. Crouchez

    steve Guest

    On Thu, 30 Aug 2007 00:22:45 +0800, Crouchez wrote
    (in article <p3hBi.21431$>):

    > cheers.
    >
    > If I do
    >
    > byte[] b = chinese.getBytes( "UTF-8" );
    >
    > b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
    > per character?
    >
    >


    not always.

    Steve
     
    steve, Aug 30, 2007
    #13
  14. Crouchez

    Crouchez Guest

    "Crouchez" <> wrote in message
    news:p3hBi.21431$...
    > cheers.
    >
    > If I do
    >
    > byte[] b = chinese.getBytes( "UTF-8" );
    >
    > b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
    > per character?
    >


    So chinese characters take up 3 bytes with utf-8 and 2 with 'native
    encodings'?? Imagine the extra bandwidth for a chinese server if it uses
    UTF-8! +0.5!
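
    The 50% figure holds only for text that is purely Chinese. ASCII
    characters such as markup take one byte in UTF-8 and in Big5 alike, so a
    real page grows by less. A rough sketch, with made-up class and method
    names and a made-up sample string; the Big5 lines run only if that charset
    is installed in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class EncodingSizeDemo {
    // Number of bytes the string occupies in the given charset.
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u5c0f";         // two Chinese characters only
        String page = "<p>\u4e2d\u5c0f</p>";  // same text inside ASCII markup

        System.out.println("UTF-8 text: " + size(text, StandardCharsets.UTF_8)); // 6
        System.out.println("UTF-8 page: " + size(page, StandardCharsets.UTF_8)); // 13

        // Big5 is optional in a JVM, so check before using it.
        if (Charset.isSupported("Big5")) {
            Charset big5 = Charset.forName("Big5");
            System.out.println("Big5 text:  " + size(text, big5));
            System.out.println("Big5 page:  " + size(page, big5));
        }
    }
}
```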
     
    Crouchez, Aug 30, 2007
    #14
  15. Crouchez

    Crouchez Guest

    "bugbear" <bugbear@trim_papermule.co.uk_trim> wrote in message
    news:46D69030.50403@trim_papermule.co.uk_trim...
    > Roedy Green wrote:
    >> On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
    >> <> wrote, quoted or indirectly
    >> quoted someone who said :
    >>
    >>> b.length = 6. But why 6 when I thought chinese characters take up 2
    >>> bytes per character?

    >>
    >> I suspect your parents punished you for curiosity as a toddler.
    >> EXPERIMENT!

    >
    > Or read the manual;
    >
    > http://unicode.org/
    > http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
    >
    > I'd always prefer a clear definitive spec
    > to the results of experiment.
    >
    > Reverse engineering complex systems
    > can be time consuming and error prone.
    >
    > BugBear


    I prefer the experiments personally - those technical manuals are usually
    way too wordy
     
    Crouchez, Aug 30, 2007
    #15
  16. Crouchez

    Crouchez Guest

    "Roedy Green" <> wrote in message
    news:...

    Thanks Roedy, nice site there - often comes in useful for all types of java
    stuff
     
    Crouchez, Aug 30, 2007
    #16
  17. Crouchez

    Crouchez Guest

    "steve" <> wrote in message
    news:...
    > On Thu, 30 Aug 2007 00:22:45 +0800, Crouchez wrote
    > (in article <p3hBi.21431$>):
    >
    >> cheers.
    >>
    >> If I do
    >>
    >> byte[] b = chinese.getBytes( "UTF-8" );
    >>
    >> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
    >> per character?
    >>
    >>

    >
    > not always.
    >
    > Steve
    >
    >


    When is it not?
     
    Crouchez, Aug 30, 2007
    #17
  18. Crouchez

    Crouchez Guest

    "Roedy Green" <> wrote in message
    news:...
    > On Wed, 29 Aug 2007 16:50:41 GMT, "Crouchez"
    > <> wrote, quoted or indirectly
    > quoted someone who said :
    >
    >>Why have you done an AND on this?
    >>System.out.println( 0xff & b[i] );

    >
    > see http://mindprod.com/jgloss/unsigned.html
    > --
    > Roedy Green Canadian Mind Products
    > The Java Glossary
    > http://mindprod.com


    A lot of that baffles me. I remember doing floating point and binary
    stuff on paper years ago and never used it for real. What's the main use for
    bitwise operations and bit shifting?
     
    Crouchez, Aug 30, 2007
    #18
  19. Crouchez wrote:
    > "steve" <> wrote in message
    > news:...
    >> On Thu, 30 Aug 2007 00:22:45 +0800, Crouchez wrote
    >> (in article <p3hBi.21431$>):
    >>
    >>> cheers.
    >>>
    >>> If I do
    >>>
    >>> byte[] b = chinese.getBytes( "UTF-8" );
    >>>
    >>> b.length = 6. But why 6 when I thought chinese characters take up 2
    >>> bytes per character?
    >>>
    >>>

    >>
    >> not always.
    >>
    >> Steve
    >>
    >>

    >
    > When is it not?

    You can find out yourself, either by experimenting
    System.out.println("\u0000".getBytes("UTF-8"));
    System.out.println("\u007F".getBytes("UTF-8"));
    System.out.println("\u0080".getBytes("UTF-8"));
    System.out.println("\u07FF".getBytes("UTF-8"));
    System.out.println("\u0800".getBytes("UTF-8"));
    System.out.println("\uFFFF".getBytes("UTF-8"));
    or more easily by reading the UTF-8 documentation
    http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

    --
    Thomas

     
    Thomas Fritsch, Aug 30, 2007
    #19
  20. Thomas Fritsch wrote:
    > You can find out yourself, either by experimenting
    > System.out.println("\u0000".getBytes("UTF-8"));
    > System.out.println("\u007F".getBytes("UTF-8"));
    > System.out.println("\u0080".getBytes("UTF-8"));
    > System.out.println("\u07FF".getBytes("UTF-8"));
    > System.out.println("\u0800".getBytes("UTF-8"));
    > System.out.println("\uFFFF".getBytes("UTF-8"));

    Oops, I meant
    System.out.println("\u0000".getBytes("UTF-8").length);
    ....
    > or more easily by reading the UTF-8 documentation
    > http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
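
    Written out in full, the corrected experiment might look like the sketch
    below. The class and method names are illustrative, and the last sample (a
    surrogate pair for a code point above 0xFFFF) is an addition that was not
    in the original list.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class Utf8Lengths {
    // Number of bytes needed to encode s in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u0000", "\u007F",   // 1 byte each
            "\u0080", "\u07FF",   // 2 bytes each
            "\u0800", "\uFFFF",   // 3 bytes each
            "\uD83D\uDE00"        // surrogate pair for U+1F600: 4 bytes
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n",
                    s.codePointAt(0), utf8Length(s));
        }
    }
}
```

    The boundaries at 0x80, 0x800 and 0x10000 are where the byte count steps
    up, which is why those values were picked.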


    --
    Thomas
     
    Thomas Fritsch, Aug 30, 2007
    #20