Platform's default charset?

Discussion in 'Java' started by gk, Jan 30, 2006.

  1. gk

    gk Guest

    what is platform's default charset ?



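    // needs: import java.io.UnsupportedEncodingException;
    String original = new String("A" + "\u00ea" + "\u00f1" +
    "\u00fc" + "C");
    try {
        // encode and decode with an explicitly named charset
        byte[] utf8Bytes = original.getBytes("UTF8");
        String roundTrip = new String(utf8Bytes, "UTF8");
        // encode and decode with the platform's default charset
        byte[] defaultBytes = original.getBytes();
        String defaultTrip = new String(defaultBytes);

        System.out.println("roundTrip = " + roundTrip); // output-1
        System.out.println("defaultTrip = " + defaultTrip); // output-2
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }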




    QUESTION:

    Why are output-1 and output-2 the same?


    REASON FOR THIS QUESTION:

    String original = new String("A" + "\u00ea" + "\u00f1" +
    "\u00fc" + "C");

    This is a Unicode string, and it looks like "AêñüC".


    How can output-2 be the same as output-1?

    Output-2 has been encoded and decoded using the "platform's default
    charset", because I used

    byte[] defaultBytes = original.getBytes();

    and

    String defaultTrip = new String(defaultBytes);

    My system is Windows XP, so how can that produce the same output as
    output-1, which uses UTF-8?

    Do you mean that Windows XP supports UTF-8, so it picks UTF-8 by
    default?

    On which platforms will output-1 and output-2 differ? Linux? Solaris?
    Somewhere else?

    thank you
    gk, Jan 30, 2006
    #1

  2. gk wrote:
    > what is platform's default charset ?


    Charset.defaultCharset()
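
    A minimal sketch of that call, assuming Java 1.5 or later. The class and method names (DefaultCharsetDemo, defaultCharsetName) are made up for illustration; the "file.encoding" property is printed alongside for comparison and usually, though not necessarily, names the same charset.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Name of the charset that getBytes() and new String(byte[]) fall back to
    // when no charset is passed explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset = " + defaultCharsetName());
        // The system property is the older way to look this up.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
    }
}
```

    The printed name depends on the machine and JVM settings, so no particular output should be expected.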


    > How could the second output output-2 produces the same output as
    > output-1 ?


    Why do you think they should be different at all? You start with the
    same Unicode string. Then you convert it into two (possibly different)
    byte representations. Then you convert the byte representations with the
    correct *matching reverse operation* back to two Unicode strings.

    The version where you use the UTF-8 byte encoding can't fail. It is made
    to represent Unicode characters, and you provide Unicode characters for
    a start. From Java's point of view it is even a very trivial operation,
    since the VM uses a modified UTF-8 encoding internally, so there isn't
    much to do when converting to a UTF-8 byte sequence.

    The only way the version which uses the platform's default encoding
    could fail would be if the platform's encoding could not represent a
    particular character in a platform-specific byte sequence. In that case
    you wouldn't get a full round trip conversion for such characters. This
    is, however, very unlikely, since you chose Unicode characters which
    are all well in the Latin 1 range. This is the second most common
    character encoding after seven bit ASCII, and many character encodings
    encompass Latin 1 in one way or the other (the first 256 Unicode
    characters are actually the Latin 1 characters).
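
    A small sketch of that point, assuming Java 7 or later for StandardCharsets. The helper name roundTrip is hypothetical. It encodes and decodes the string from the question with three explicitly named charsets; only US-ASCII, which cannot represent the accented letters, loses data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode with the given charset, then decode the bytes with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        // UTF-8 and Latin 1 can both represent every character in the string.
        System.out.println(original.equals(roundTrip(original, StandardCharsets.UTF_8)));      // true
        System.out.println(original.equals(roundTrip(original, StandardCharsets.ISO_8859_1))); // true
        // US-ASCII cannot, so each accented letter is replaced with '?'.
        System.out.println(original.equals(roundTrip(original, StandardCharsets.US_ASCII)));   // false
        System.out.println(roundTrip(original, StandardCharsets.US_ASCII));                    // A???C
    }
}
```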


    /Thomas
    --
    The comp.lang.java.gui FAQ:
    ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
    http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
    Thomas Weidenfeller, Jan 30, 2006
    #2

  3. gk

    Roedy Green Guest

    On 30 Jan 2006 02:14:49 -0800, "gk" <> wrote, quoted
    or indirectly quoted someone who said :

    >what is platform's default charset ?


    see http://mindprod.com/jgloss/encoding.html

    for how to find out. Oddly it is a secret for unsigned Applets.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Jan 30, 2006
    #3
  4. gk

    Roedy Green Guest

    On 30 Jan 2006 02:14:49 -0800, "gk" <> wrote, quoted
    or indirectly quoted someone who said :

    > byte[] utf8Bytes = original.getBytes("UTF8");
    > byte[] defaultBytes = original.getBytes();
    > String roundTrip = new String(utf8Bytes, "UTF8");
    > String defaultTrip = new String(defaultBytes);


    try dumping out the byte encodings. That will solve your mystery.
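
    One way to dump them, as a sketch assuming Java 7 or later. The class and helper names (ByteDump, hex) are invented here. ISO-8859-1 stands in for the Windows default because it is guaranteed to be available; for these particular characters Cp1252 produces the same bytes.

```java
import java.nio.charset.StandardCharsets;

class ByteDump {
    // Format each byte as two upper-case hex digits, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        // UTF-8 needs two bytes for each accented letter.
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));      // 41 C3 AA C3 B1 C3 BC 43
        // Latin 1 needs one byte per character.
        System.out.println(hex(original.getBytes(StandardCharsets.ISO_8859_1))); // 41 EA F1 FC 43
    }
}
```

    The two byte arrays differ, yet each decodes back to the same String when the matching charset is used.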
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Jan 30, 2006
    #4
  5. "The Java 2 platform uses the UTF-16 representation in char arrays and
    in the String and StringBuffer classes"
    (http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html)


    > From Java's point of view it is even a very trivial operation,
    > since the VM uses a modified UTF-8 encoding internally


    When one talks about Java using a modified UTF-8, it normally refers to
    Java representing UTF-8 a little differently than most implementations.
    http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8_in_Java

    Java uses UTF-16 interanally.

    Opalinski

    http://www.geocities.com/opalpaweb/
    opalinski from opalpaweb, Jan 30, 2006
    #5
  6. gk

    Alex Buell Guest

    On 30 Jan 2006 05:23:00 -0800 " opalinski from
    opalpaweb" <> waved a wand and this message magically
    appeared:

    > Java uses UTF-16 interanally.


    "inter-anally"? Teehee.

    --
    http://www.munted.org.uk

    "Honestly, what can I possibly say to get you into my bed?" - Anon.
    Alex Buell, Jan 30, 2006
    #6
  7. gk

    Roedy Green Guest

    On 30 Jan 2006 05:23:00 -0800, " opalinski from
    opalpaweb" <> wrote, quoted or indirectly quoted
    someone who said :

    >Java uses UTF-16 interanally.


    was that a typo or a Freudian slip or a slur?
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Jan 30, 2006
    #7
  8. gk

    Chris Uppal Guest

    Thomas Weidenfeller wrote:

    > the VM uses a modified UTF-8 encoding internally, so there isn't
    > much to do when converting to a UTF-8 byte sequence.


    This is almost certainly untrue for any given JVM. It's true that some of the
    /external interfaces/ to the JVM, notably JNI and the classfile format, do use
    the modified version of UTF-8, but that in no way constrains, or (probably)
    reflects, the internal representation of Java Strings.

    If we are talking about the Sun implementations, then Strings are represented
    (quite explicitly at Java level) as char[] arrays which hold Unicode data
    represented as UTF-16 sequences of 16-bit integers. Of course, there might be
    other versions of the platform which have different implementations of String.
    I suppose it's not impossible that one of them could use byte[] arrays in
    not-actually-UTF-8 format, but I find it hard to imagine a convincing
    motivation.

    BTW, converting Sun's bastardised imitation of UTF-8 into real UTF-8 is /not/
    trivial. Converting not-actually-UTF-8 into UTF-8 involves (logically) the
    same steps as converting not-actually-UTF-8 to UTF-16, decoding that to
    Unicode, and finally encoding that as UTF-8.
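
    The difference can be seen from Java itself, as a rough sketch assuming Java 7 or later. DataOutputStream.writeUTF is documented to write modified UTF-8 after a two-byte length prefix; the class and helper names here (ModifiedUtf8Demo, modifiedUtf8) are made up. The two formats disagree on U+0000 and on characters outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Bytes written by DataOutputStream.writeUTF, minus its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        // U+0000: two bytes (C0 80) in modified UTF-8, one byte (00) in real UTF-8.
        System.out.println(modifiedUtf8(nul).length);                              // 2
        System.out.println(nul.getBytes(StandardCharsets.UTF_8).length);           // 1

        // U+1F600: each surrogate is encoded separately (3 + 3 bytes) in modified
        // UTF-8, but real UTF-8 uses a single 4-byte sequence.
        System.out.println(modifiedUtf8(supplementary).length);                    // 6
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```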

    -- chris
    Chris Uppal, Jan 30, 2006
    #8
  9. me> Java uses UTF-16 interanally.
    Alex> "inter-anally"? Teehee.
    Roedy> was that a typo or a Freudian slip or a slur?

    Too many message windows to too many sexpartners. All this
    simultanallity; poor linear mind gets vexed.

    Lol.

    Cheers.
    opalinski from opalpaweb, Jan 30, 2006
    #9
  10. gk

    Roedy Green Guest

    On Mon, 30 Jan 2006 14:45:13 -0000, "Chris Uppal"
    <-THIS.org> wrote, quoted or indirectly
    quoted someone who said :

    > Of course, there might be
    >other versions of the platform which have different implementations of String.
    >I suppose it's not impossible that one of them could use byte[] arrays in
    >not-actually-UTF-8 format, but I find it hard to imagine a convincing
    >motivation.


    To index and process strings you need them in 16 bit form. However,
    for storage of strings not actively being processed I could imagine
    some sort of caching scheme that converts them to UTF-8 for more
    compact storage. All string handling functions would have to be aware
    of the two formats and automatically unpack Strings when accessed for
    anything other than referencing the string as a whole.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Jan 30, 2006
    #10
  11. gk

    gk Guest

    gk, Jan 31, 2006
    #11
  12. gk

    gk Guest


    > The only way the version which uses the platform's default encoding
    > could fail would be if the platform's encoding could not represent a
    > particular character in a platform-specific byte sequence. In that case
    > you wouldn't get a full round trip conversion for such characters. This
    > is, however, very unlikely, since you chose Unicode characters which
    > are all well in the Latin 1 range. This is the second most common
    > character encoding after seven bit ASCII, and many character encodings
    > encompass Latin 1 in one way or the other (the first 256 Unicode
    > characters are actually the Latin 1 characters).
    >



    A bit confused.

    Do you mean that the default character set on every platform is
    "Unicode"?

    The docs say:

    String(byte[] bytes)
    Constructs a new String by decoding the specified array of
    bytes using the platform's default charset.


    So when I do the reverse conversion without naming an encoding, the
    default charset is used, and that may produce different strings on
    different platforms.



    Do you mean that all platforms use UTF-8 by default?

    Do you mean that when I called String defaultTrip = new
    String(defaultBytes); UTF-8 was used? How could that be possible?
    Linux may use one encoding as its default and Solaris another, so
    they would produce other strings. Even if those platforms support
    UTF-8, why would UTF-8 be chosen by default when I have not named it
    in the constructor? Surely they are bound to produce different
    results?


    I don't have any other platforms, so I can't test this elsewhere.

    I only tried it on Windows XP.


    It is still confusing.

    Please explain.


    And who knows what the default charset of other platforms is? This
    might produce other strings there.
    gk, Jan 31, 2006
    #12
  13. gk

    gk Guest

    I discovered this:


    class StringTest
    {
        public static void main(String[] args)
        {
            // read the system property that names the default encoding
            String defaultEncodingName = System.getProperty( "file.encoding" );
            System.out.println(defaultEncodingName);
        }
    }




    output:
    =====
    Cp1252



    SO, my platform supports only Cp1252 encoding.


    According to DOC >>

    byte[] getBytes()
    Encodes this String into a sequence of bytes using the
    platform's default charset, storing the result into a new byte array.


    AND

    String(byte[] bytes)
    Constructs a new String by decoding the specified array of
    bytes using the platform's default charset.



    and according to my code here,

    byte[] defaultBytes = original.getBytes();
    String defaultTrip = new String(defaultBytes);

    they should work with the platform's default charset, which is
    "Cp1252" (my discovery).

    Note that this is not Unicode!

    But when I printed

    System.out.println("defaultTrip = " + defaultTrip);

    it printed Unicode! Shouldn't this have printed some other
    odd-looking string?
    gk, Jan 31, 2006
    #13
  14. gk

    Chris Uppal Guest

    gk wrote:
    > Thomas Weidenfeller wrote:
    > > Charset.defaultCharset()

    > this does not exists .


    It's new in 1.5.

    -- chris
    Chris Uppal, Jan 31, 2006
    #14
  15. gk

    Chris Uppal Guest

    gk wrote:

    > bit confused.


    I'm not certain, but I /think/ that you might be misunderstanding the
    relationship between Strings and Charsets.

    A String has /no/ Charset, and is not associated with any particular byte
    encoding. (Technically this is only true if you are using the right APIs, but
    it is close enough to being true to be a good approximation to start from[*]).
    That's to say a String contains pure Unicode data, not in any encoding, just
    pure characters. (Compare the way that an int contains pure integer data,
    separate from any encoding as big-endian or little-endian, or anything else).
    A Charset is only involved when you need to convert a String to bytes (or the
    other way around) in order to communicate with external systems or save the
    data to file.

    So, in your original example, after
    String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
    you have a String, original, which contains pure Unicode.

    If you now do:
    byte[] utf8Bytes = original.getBytes("UTF8");
    then you have the original data encoded as UTF-8. And later:
    String roundTrip = new String(utf8Bytes, "UTF8");
    which gives you a new String containing pure Unicode data, assembled by
    decoding the UTF-8 bytes. Since UTF-8 is (by design) capable of encoding any
    Unicode data, no information will have been lost, and roundTrip will be the
    same as original.

    When you do the same using the platform-default Charset:
    byte[] defaultBytes = original.getBytes();
    String defaultTrip = new String(defaultBytes);
    The only thing that is different is that you are using a different Charset.
    So, if that Charset happens to be capable of encoding every character in the
    original String, no data will have been lost and defaultTrip will be the same as
    original. If you had used any Unicode characters in original which could /not/
    be encoded in the platform default Charset then the round trip would have
    failed. Since the platform default Charset is machine-specific, that means
    that you don't really know what's going to happen when you convert Strings into
    byte[] arrays using it -- which is why using the platform default Charset is
    usually a bad idea.
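
    To find out in advance whether a given Charset can represent a String, a strict encoder can be used. This is only a sketch, assuming Java 7 or later; StrictEncode and encodeOrNull are hypothetical names. A freshly created CharsetEncoder reports unmappable characters by throwing, whereas String.getBytes(Charset) is documented to substitute a replacement byte instead.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StrictEncode {
    // Returns the encoded bytes, or null if some character of s
    // cannot be represented in the given charset.
    static byte[] encodeOrNull(String s, Charset cs) {
        try {
            // A new encoder throws on unmappable characters by default.
            ByteBuffer bb = cs.newEncoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        System.out.println(encodeOrNull(original, StandardCharsets.US_ASCII) == null); // true
        System.out.println(encodeOrNull(original, StandardCharsets.ISO_8859_1).length); // 5
        System.out.println(encodeOrNull(original, StandardCharsets.UTF_8).length);      // 8
    }
}
```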

    But the important thing to realise is that Strings don't have Charsets.
    Charsets are only used when converting Strings to byte sequences.

    -- chris

    ([*] We can talk more about that approximation, if you want, but it is best to get
    the current confusion cleared up first)
    Chris Uppal, Jan 31, 2006
    #15
  16. gk

    Roedy Green Guest

    On Tue, 31 Jan 2006 10:48:17 -0000, "Chris Uppal"
    <-THIS.org> wrote, quoted or indirectly
    quoted someone who said :

    >> > Charset.defaultCharset()

    >> this does not exists .

    >
    >It's new in 1.5.

    Prior to that you had to look at a System property. It might even have
    been restricted to signed applets. See
    http://mindprod.com/jgloss/encoding.html I should have it all
    documented there.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Jan 31, 2006
    #16
  17. gk

    Roedy Green Guest

    On 30 Jan 2006 21:56:00 -0800, "gk" <> wrote, quoted
    or indirectly quoted someone who said :

    >SO, my platform supports only Cp1252 encoding.


    unless you specifically ask for something else. That is just the
    default for Readers/Writer and String <=> byte[] conversion.

    See http://mindprod.com/jgloss/encoding.html
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Jan 31, 2006
    #17
  18. gk

    Roedy Green Guest

    On Tue, 31 Jan 2006 13:46:47 GMT, Roedy Green
    <> wrote, quoted or
    indirectly quoted someone who said :

    >>SO, my platform supports only Cp1252 encoding.

    >
    >unless you specifically ask for something else. That is just the
    >default for Readers/Writer and String <=> byte[] conversion.
    >
    >See http://mindprod.com/jgloss/encoding.html


    see http://mindprod.com/jgloss/fileio.html
    for how to specify a different encoding for Reader/Writer

    see http://mindprod.com/jgloss/conversion for how to specify a
    different one for String <=> byte[] conversion.
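
    As a rough illustration of naming the encoding for a Reader/Writer pair rather than relying on the default, assuming Java 7 or later. The class and method names (ExplicitCharsetIo, write, read) are invented, and in-memory streams stand in for files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetIo {
    // Write text through a Writer that encodes with the given charset.
    static byte[] write(String text, Charset cs) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for in-memory streams
        }
        return buf.toByteArray();
    }

    // Read bytes back through a Reader that decodes with the given charset.
    static String read(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        byte[] stored = write(original, StandardCharsets.UTF_8);
        System.out.println(stored.length);                                        // 8
        System.out.println(original.equals(read(stored, StandardCharsets.UTF_8))); // true
    }
}
```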
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
    Roedy Green, Jan 31, 2006
    #18
  19. gk

    gk Guest

    Here are some points I have noted from your comments:

    1) Java strings are simply chars; we can think of them as Unicode
    chars.

    So String str = "one big string" is a bunch of Unicode chars.

    2) No encoding is involved when we talk about Strings. Encoding comes
    into the picture when we do the String <=> byte[] conversion.

    3) We can use any encoding to encode this bunch of Unicode chars into
    a byte[] array. If that encoding can represent these Unicode chars,
    we are safe, because when we convert back there will be no loss of
    data.

    4) It is always suggested to use UTF-8 when we convert to byte[] and
    back.



    BUT I am not comfortable when I run this code of Roedy Green's
    (http://mindprod.com/jgloss/conversion)


    String s = "abc";
    // string -> byte[]
    byte [] b = s.getBytes( "8859_1" /* encoding */ );
    // byte[] -> String
    String t = new String( b , "Cp1252" /* encoding */ );


    This code prints t="abc"!

    See, we encoded the string via "8859_1" and decoded it via "Cp1252",
    and we got the original string back.




    i also tried...

    String s = "abc";
    // string -> byte[]
    byte [] b = s.getBytes( "windows-1250" /* encoding */ );
    // byte[] -> String
    String t = new String( b , "Cp1252" /* encoding */ );
    System.out.println(t);


    Again I got t="abc".

    There is no loss of data.

    So this means each encoding recognises the other encodings, and
    that's why they are able to convert back.


    But this is not good. One encoding is not expected to be recognised
    by another! If that happens, anybody can hack any binary document
    written in an unknown encoding like this. The thief need not know
    whether the owner encoded the file in UTF-8, "8859_1", "Cp1252",
    "windows-1250" and so on, because the thief knows the encodings are
    brothers and recognise each other, so he could decode with any of
    them.


    P.S.: MIND IT..... I am talking about cryptography ....but here in
    this example we are losing the meaning of the word "encoding".
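
    The "abc" result comes from the fact that these charsets all encode the plain ASCII letters as the same single bytes; they disagree only on characters outside ASCII. A sketch of the difference, assuming Java 7 or later, with the hypothetical helper transcode encoding in one charset and decoding in another:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiOverlapDemo {
    // Encode with one charset and decode the bytes with a different one.
    static String transcode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // ASCII letters have the same byte values in both charsets.
        System.out.println("abc".equals(
                transcode("abc", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));   // true

        // U+00F1 is the two bytes C3 B1 in UTF-8; read back as Latin 1,
        // they turn into two unrelated characters, U+00C3 and U+00B1.
        String garbled = transcode("\u00f1", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals("\u00c3\u00b1"));                                // true

        // The other direction does not survive either.
        System.out.println("\u00f1".equals(
                transcode("\u00f1", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8))); // false
    }
}
```

    So mismatched charsets only appear to work while the text stays within ASCII.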
    gk, Feb 1, 2006
    #19
  20. gk

    gk Guest

    Sorry, I meant that I am NOT talking about cryptography and the
    different versions of encoding.

    I am talking about these simple charset encodings.
    gk, Feb 1, 2006
    #20