platform's default charset?


gk

What is the platform's default charset?



String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();
    String roundTrip = new String(utf8Bytes, "UTF8");
    String defaultTrip = new String(defaultBytes);

    System.out.println("roundTrip = " + roundTrip);     // output-1
    System.out.println("defaultTrip = " + defaultTrip); // output-2
} catch (UnsupportedEncodingException e) {   // import java.io.UnsupportedEncodingException
    e.printStackTrace();
}




QUESTION:

Why are output-1 and output-2 the same?


REASON FOR THIS QUESTION:

String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");

This is a Unicode string, and it looks like "AêñüC".


How could the second output, output-2, produce the same result as output-1?

Output-2 has been encoded/decoded with the "platform's default charset", as I have used

byte[] defaultBytes = original.getBytes();

and

String defaultTrip = new String(defaultBytes);


for the output-2




(My system is Windows XP.) So how could that produce the same output as
output-1, which uses the UTF-8 encoding?



Do you mean that Windows XP supports UTF-8, so by default it picks up the
UTF-8 encoding?



In which place would these two outputs, i.e. output-1 and output-2, not be the same?

Is it on Linux? Solaris?
Or where are these two outputs not the same?

Thank you.
 

Thomas Weidenfeller

gk said:
What is the platform's default charset?
Charset.defaultCharset()
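
For example, a minimal check (assuming Java 5 or later, where
Charset.defaultCharset() exists; the name printed depends entirely on the
machine's locale settings):

import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // Typically prints something like "windows-1252" on a western
        // Windows XP box, "UTF-8" or "ISO-8859-1" elsewhere.
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
    }
}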


How could the second output, output-2, produce the same result as output-1?

Why do you think they should be different at all? You start with the
same Unicode string. Then you convert it into two (possibly different)
byte representations. Then you convert the byte representations with the
correct *matching reverse operation* back to two Unicode strings.

The version where you use the UTF-8 byte encoding can't fail. It is made
to represent Unicode characters, and you provide Unicode characters for
a start. From Java's point of view it is even a very trivial operation,
since the VM uses a modified UTF-8 encoding internally, so there isn't
much to do when converting to a UTF-8 byte sequence.

The only way the version which uses the platform's default encoding
could fail would be if the platform's encoding could not represent a
particular character in a platform-specific byte sequence. In that case
you wouldn't get a full round trip conversion for such characters. This
is, however, very unlikely, since you chose Unicode characters which
are all well within the Latin 1 range. This is the second most common
character encoding after seven bit ASCII, and many character encodings
encompass Latin 1 in one way or the other (the first 256 Unicode
characters are actually the Latin 1 characters).
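
For example (a sketch, assuming a default charset in the Cp1252/Latin-1
family; the CJK character is just an illustration of something outside that
range):

import java.io.UnsupportedEncodingException;

public class RoundTripCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The last character, U+4E2D, is outside the Latin 1 range.
        String original = "A\u00ea\u00f1\u00fcC\u4e2d";

        // UTF-8 can encode any Unicode character, so this trip is lossless.
        String utf8Trip = new String(original.getBytes("UTF-8"), "UTF-8");

        // A Cp1252-style default charset cannot encode U+4E2D;
        // getBytes() silently replaces it with '?', so the trip is lossy.
        String defaultTrip = new String(original.getBytes());

        System.out.println(original.equals(utf8Trip));    // true
        System.out.println(original.equals(defaultTrip)); // false on such a machine
    }
}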


/Thomas
 

Roedy Green

byte[] utf8Bytes = original.getBytes("UTF8");
byte[] defaultBytes = original.getBytes();
String roundTrip = new String(utf8Bytes, "UTF8");
String defaultTrip = new String(defaultBytes);

Try dumping out the byte encodings. That will solve your mystery.
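
Something like this (a sketch; the hex values in the comments assume a Cp1252
default charset like yours):

import java.io.UnsupportedEncodingException;

public class DumpBytes {
    static void dump(String label, byte[] bytes) {
        System.out.print(label + ":");
        for (byte b : bytes) {
            System.out.printf(" %02x", b);
        }
        System.out.println();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "A\u00ea\u00f1\u00fcC";
        // UTF-8 uses two bytes for each non-ASCII character:
        //   41 c3 aa c3 b1 c3 bc 43
        dump("UTF-8  ", original.getBytes("UTF8"));
        // Cp1252 uses one byte per character:
        //   41 ea f1 fc 43
        dump("default", original.getBytes());
    }
}

The byte sequences differ, but each one decodes back to the same five
characters when the matching charset is used.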
 

opalpa

"The Java 2 platform uses the UTF-16 representation in char arrays and
in the String and StringBuffer classes"
(http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html)

From Java's point of view it is even a very trivial operation,
since the VM uses a modified UTF-8 encoding internally

When one talks about Java using a modified UTF-8, it normally refers to
Java representing UTF-8 a little differently from most implementations.
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8_in_Java

Java uses UTF-16 internally.
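
A quick way to see the UTF-16 representation at work (assuming Java 5, where
the code point methods exist): a character outside the Basic Multilingual
Plane occupies two chars (a surrogate pair) even though it is one code point:

public class Utf16Check {
    public static void main(String[] args) {
        // U+1D11E (musical G clef) is stored as the surrogate pair D834 DD1E.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                            // 2 chars
        System.out.println(clef.codePointCount(0, clef.length()));    // 1 code point
        System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e
    }
}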

Opalinski
(e-mail address removed)
http://www.geocities.com/opalpaweb/
 

Chris Uppal

Thomas said:
the VM uses a modified UTF-8 encoding internally, so there isn't
much to do when converting to a UTF-8 byte sequence.

This is almost certainly untrue for any given JVM. It's true that some of the
/external interfaces/ to the JVM, notably JNI and the classfile format, do use
the modified version of UTF-8, but that in no way constrains, or (probably)
reflects, the internal representation of Java Strings.

If we are talking about the Sun implementations, then Strings are represented
(quite explicitly at Java level) as char[] arrays which hold Unicode data
represented as UTF-16 sequences of 16-bit integers. Of course, there might be
other versions of the platform which have different implementations of String.
I suppose it's not impossible that one of them could use byte[] arrays in
not-actually-UTF-8 format, but I find it hard to imagine a convincing
motivation.

BTW, converting Sun's bastardised imitation of UTF-8 into real UTF-8 is /not/
trivial. Converting not-actually-UTF-8 into UTF-8 involves (logically) the
same steps as converting not-actually-UTF-8 to UTF-16, decoding that to
Unicode, and finally encoding that as UTF-8.
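
One place the difference is easy to observe (a small sketch using only the
standard library): DataOutputStream.writeUTF() emits the modified encoding,
in which U+0000 becomes the byte pair C0 80, whereas String.getBytes("UTF-8")
emits real UTF-8, in which it is a single 00 byte:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Check {
    public static void main(String[] args) throws IOException {
        String s = "\u0000";

        // Real UTF-8: the null character is one byte, 0x00.
        System.out.println(s.getBytes("UTF-8").length);   // 1

        // Modified UTF-8, as written by writeUTF (and used in class files):
        // a 2-byte length prefix followed by C0 80 for the null character.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();
        System.out.printf("%02x %02x %02x %02x%n",
                modified[0], modified[1], modified[2], modified[3]); // 00 02 c0 80
    }
}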

-- chris
 

opalpa

me> Java uses UTF-16 interanally.
Alex> "inter-anally"? Teehee.
Roedy> was that a typo or a Freudian slip or a slur?

Too many message windows to too many sexpartners. All this
simultanallity; poor linear mind gets vexed.

Lol.

Cheers.
 

Roedy Green

Of course, there might be
other versions of the platform which have different implementations of String.
I suppose it's not impossible that one of them could use byte[] arrays in
not-actually-UTF-8 format, but I find it hard to imagine a convincing
motivation.

To index and process strings you need them in 16 bit form. However,
for storage of strings not actively being processed I could imagine
some sort of caching scheme that converts them to UTF-8 for more
compact storage. All string handling functions would have to be aware
of the two formats and automatically unpack Strings when accessed for
anything other than referencing the string as a whole.
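
A rough sketch of what such a scheme might look like (purely hypothetical --
this is not how any shipping VM stores Strings): hold the UTF-8 bytes and only
unpack to a normal String when the characters are actually needed:

import java.io.UnsupportedEncodingException;

// Hypothetical compact holder: stores text as UTF-8 bytes, decodes on demand.
public class CompactText {
    private final byte[] utf8;

    public CompactText(String s) throws UnsupportedEncodingException {
        this.utf8 = s.getBytes("UTF-8");   // roughly one byte per ASCII character
    }

    // Unpack to an ordinary UTF-16 String only when the characters are needed.
    public String unpack() throws UnsupportedEncodingException {
        return new String(utf8, "UTF-8");
    }

    public int sizeInBytes() {
        return utf8.length;
    }
}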
 

gk

The only way the version which uses the platform's default encoding
could fail would be if the platform's encoding could not represent a
particular character in a platform-specific byte sequence. In that case
you wouldn't get a full round trip conversion for such characters. This
is, however, very unlikely, since you chose Unicode characters which
are all well within the Latin 1 range. This is the second most common
character encoding after seven bit ASCII, and many character encodings
encompass Latin 1 in one way or the other (the first 256 Unicode
characters are actually the Latin 1 characters).


A bit confused.

Do you mean the default character set for all platforms is
"Unicode"?

because the DOC says,

String(byte[] bytes)
Constructs a new String by decoding the specified array of
bytes using the platform's default charset.


So, when I do the reverse operation, if I don't mention the encoding, the
default charset will be used, and that may produce different strings on
different platforms.



Do you mean all the platforms have the UTF-8 character set by default?

Do you mean that when I called String defaultTrip = new String(defaultBytes);
UTF-8 was used? But how could that be possible? Maybe Linux uses some other
encoding as its default, and Solaris some other encoding as its default, so
this would produce different strings. Even if those platforms have UTF-8,
how would UTF-8 be picked by default (because I have not mentioned it in the
constructor), and so aren't they bound to produce different results?


I don't have other platforms, so I am not able to test this on them.

I tried it only on Windows XP.


It is still confusing.

Please explain.


And who knows what the default charset of other platforms is, so this might
produce some other strings.
 

gk

I discovered this:


import java.nio.charset.Charset;

class StringTest {
    public static void main(String[] args) {
        String defaultEncodingName = System.getProperty("file.encoding");
        System.out.println(defaultEncodingName);
    }
}




output:
=====
Cp1252



So, my platform's default charset is Cp1252.


According to the doc:

byte[] getBytes()
Encodes this String into a sequence of bytes using the
platform's default charset, storing the result into a new byte array.


AND

String(byte[] bytes)
Constructs a new String by decoding the specified array of
bytes using the platform's default charset.



And according to my code here,

byte[] defaultBytes = original.getBytes();
String defaultTrip = new String(defaultBytes);

they should work with the platform's default charset, and that is "Cp1252"
(my discovery).

Note, this is not Unicode!

But when I printed

System.out.println("defaultTrip = " + defaultTrip);

it prints the Unicode string correctly! This should have printed some other
complex, odd-looking string, shouldn't it?
 

Chris Uppal

gk said:
A bit confused.

I'm not certain, but I /think/ that you might be misunderstanding the
relationship between Strings and Charsets.

A String has /no/ Charset, and is not associated with any particular byte
encoding. (Technically this is only true if you are using the right APIs, but
it is close enough to being true to be a good approximation to start from[*]).
That's to say a String contains pure Unicode data, not in any encoding, just
pure characters. (Compare the way that an int contains pure integer data,
separate from any encoding as big-endian or little-endian, or anything else).
A Charset is only involved when you need to convert a String to bytes (or the
other way around) in order to communicate with external systems or save the
data to file.

So, in your original example, after
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
you have a String, original, which contains pure Unicode.

If you now do:
byte[] utf8Bytes = original.getBytes("UTF8");
then you have the original data encoded as UTF-8. And later:
String roundTrip = new String(utf8Bytes, "UTF8");
which gives you a new String containing pure Unicode data, assembled by
decoding the UTF-8 bytes. Since UTF-8 is (by design) capable of encoding any
Unicode data, no information will have been lost, and roundTrip will be the
same as original.

When you do the same using the platform-default Charset:
byte[] defaultBytes = original.getBytes();
String defaultTrip = new String(defaultBytes);
The only thing that is different is that you are using a different Charset.
So, if that Charset happens to be capable of encoding every character in the
original String, no data will have been lost and defaultTrip will be the same as
original. If you had used any Unicode characters in original which could /not/
be encoded in the platform default Charset then the operation would have
failed. Since the platform default Charset is machine-specific, that means
that you don't really know what's going to happen when you convert Strings into
byte[] arrays using it -- which is why using the platform default Charset is
usually a bad idea.

But the important thing to realise is that Strings don't have Charsets.
Charsets are only used when converting Strings to byte sequences.
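
A small demonstration of that point (nothing platform-specific assumed): the
same String yields different byte sequences under different Charsets, and
each decodes back to an equal String as long as the matching Charset is used:

import java.io.UnsupportedEncodingException;

public class NoCharsetInString {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "A\u00ea\u00f1\u00fcC";

        byte[] asUtf8 = original.getBytes("UTF-8");    // 8 bytes
        byte[] asUtf16 = original.getBytes("UTF-16");  // 12 bytes (BOM + 2 per char)

        // Different byte sequences ...
        System.out.println(asUtf8.length + " vs " + asUtf16.length);

        // ... but decoding each with its matching Charset recovers an equal
        // String; the String itself never carried either Charset around.
        System.out.println(original.equals(new String(asUtf8, "UTF-8")));   // true
        System.out.println(original.equals(new String(asUtf16, "UTF-16"))); // true
    }
}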

-- chris

([*] We can talk more about that approximation, if you want, but it's best to
get the current confusion cleared up first.)
 

gk

Here are some points I have noted from your comments:

1) Java strings are simply chars; we may think of them as Unicode chars.
So, String str = "one big string" is a bunch of Unicode chars.

2) There is no encoding involved while we talk about Strings; encoding comes
into the picture when we do the String <=> byte[] conversion.

3) We can use any encoding to encode this bunch of Unicode chars into a
byte[] array. If that encoding can represent these Unicode chars, then we are
safe, because when we convert back there will be no loss of data.

4) It is always suggested to use UTF-8 encoding when we convert to byte[]
and vice versa.



BUT, I am not comfortable when I run this code from Roedy Green's site
(http://mindprod.com/jgloss/conversion):


String s = "abc";
// string -> byte[]
byte [] b = s.getBytes( "8859_1" /* encoding */ );
// byte[] -> String
String t = new String( b , "Cp1252" /* encoding */ );


This code prints t = "abc"!

See, we encoded the string via "8859_1" and retrieved it via "Cp1252", and we
get the original string back.




I also tried:

String s = "abc";
// string -> byte[]
byte [] b = s.getBytes( "windows-1250" /* encoding */ );
// byte[] -> String
String t = new String( b , "Cp1252" /* encoding */ );
System.out.println(t);


Again I got t = "abc".

There is no loss of data.

So, this means each encoding recognises the others, and that's why they are
able to convert back.

But this is not good. It is not expected that one encoding would be recognised
by another encoding! Because, if that happens, anybody can hack any binary
document written in an unknown encoding like this. The thief does not need to
know whether the owner encoded the file in UTF-8, "8859_1", "Cp1252",
"windows-1250", etc., because the thief knows the encodings are brothers and
recognise each other, so he could decode it with any encoding.
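
For comparison, a quick check with a character outside the ASCII range (here
the euro sign, chosen just for illustration) shows where "Cp1252" and
"8859_1" stop agreeing; they only overlap on characters they both inherit
from ASCII, such as "abc":

import java.io.UnsupportedEncodingException;

public class CharsetOverlapCheck {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "\u20ac";                 // the euro sign

        // Cp1252 encodes the euro sign as the single byte 0x80 ...
        byte[] b = s.getBytes("Cp1252");

        // ... but 8859_1 decodes 0x80 as U+0080, a control character,
        // so this time the round trip fails.
        String t = new String(b, "8859_1");

        System.out.println(s.equals(t));     // false
    }
}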


P.S.: MIND IT, I am talking about cryptography, but here in this example we
are losing the meaning of the word "encoding".
 

gk

Sorry, I meant I am NOT talking about cryptography and the different versions
of encoding.

I am talking about these simple charset encodings.
 
