String default encoding: UTF-16 or Platform's default charset?

cs_professional

I understand that Java Strings are Unicode (charset), but how are Java
Strings stored in memory? As UTF-16 encoding or using the platform's
default charset?

There seems to be conflicting information on this. The official String
javadoc says platform's default charset:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
"Constructs a new String by decoding the specified array of bytes
using the platform's default charset."

I assume the platform's default charset is what you can get by
calling:
System.getProperty("file.encoding") OR
http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html#defaultCharset()

On my Windows machine the above calls return Windows-1252 or CP-1252
(they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
So does this mean all Java Strings are encoded and stored in memory in
this Windows-1252 or CP-1252 format?

However, the "Java Internationalization FAQ" says UTF-16:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#recommended-charset
"... internal representation in Java, which is UTF-16".

So, what is the correct answer? Are Java Strings stored in memory as
UTF-16 or the platform's default charset?

Btw, I'm trying to understand this so I know what to expect in a more
complex i18n Browser-Servlet scenario.
 
Arne Vajhøj

Strings are stored as UTF-16.

The default char set applies to external representations.

Arne
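
A minimal sketch of that distinction, assuming Java 7 or later for StandardCharsets (the class and method names here are made up for illustration). The number of chars in the String stays the same, while the number of bytes depends on which charset is used for the external representation:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class InMemoryVersusEncoded {
    // Number of bytes the string occupies once encoded with the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // "café", written with an escape so the source file encoding does not matter.
        String s = "caf\u00E9";
        System.out.println("chars (UTF-16 code units): " + s.length());
        System.out.println("bytes in UTF-8:            " + encodedLength(s, StandardCharsets.UTF_8));
        System.out.println("bytes in ISO-8859-1:       " + encodedLength(s, StandardCharsets.ISO_8859_1));
        System.out.println("bytes in UTF-16BE:         " + encodedLength(s, StandardCharsets.UTF_16BE));
        // Only this line varies from machine to machine.
        System.out.println("default charset here:      " + Charset.defaultCharset());
    }
}
```

Passing the charset explicitly, as encodedLength does, is what keeps the result independent of the machine the code runs on.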
 
Joshua Cranmer

I understand that Java Strings are Unicode (charset), but how are Java
Strings stored in memory? As UTF-16 encoding or using the platform's
default charset?

Strings internally are stored as chars, which are unsigned 16-bit integers
representing UTF-16 code units.

There seems to be conflicting information on this. The official String
javadoc says platform's default charset:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
"Constructs a new String by decoding the specified array of bytes
using the platform's default charset."

For serialization as a byte stream, Strings by default use the platform
default charset.

On my Windows machine the above calls return Windows-1252 or CP-1252
(they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
So does this mean all Java Strings are encoded and stored in memory in
this Windows-1252 or CP-1252 format?

It can't be, since you can store, say, π in a Java string, which is not
a character in CP-1252. On the other hand, if your default charset is
CP-1252, you can't serialize that character (you'll get ? instead).
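
A small sketch of the π case. It uses ISO-8859-1 as a stand-in for CP-1252, because ISO-8859-1 is guaranteed to exist on every Java platform and also has no π; that substitution, the class name and the method name are assumptions for illustration, and StandardCharsets needs Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

final class UnmappableCharacter {
    // Encode with an 8-bit charset that has no mapping for most non-Latin characters.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String pi = "\u03C0"; // GREEK SMALL LETTER PI, held in the String without any problem

        byte[] lossy = encodeLatin1(pi);
        // getBytes substitutes the charset's replacement byte for unmappable characters.
        System.out.println("ISO-8859-1 bytes: " + lossy.length + ", first byte as char: " + (char) lossy[0]);

        byte[] utf8 = pi.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip preserved the character: " + back.equals(pi));
    }
}
```
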
Btw, I'm trying to understand this so I know what to expect in a more
complex i18n Browser-Servlet scenario.

What you have to be concerned about is the translation between byte
arrays (or any input/output that reads/writes bytes, possibly
autoconverting (!) characters) and character arrays (or Strings or other
containers implementing CharSequence).
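
To make the byte-to-char translation concrete, here is a hedged sketch that decodes the same bytes twice through an InputStreamReader, once with the charset they were written in and once with a different one. The class and method names are invented for the example, and it assumes Java 8 or later (for UncheckedIOException):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeWithExplicitCharset {
    // Decode all bytes into a String using the charset named by the caller,
    // never the platform default.
    static String readAll(byte[] data, Charset cs) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = "\u00E9".getBytes(StandardCharsets.UTF_8); // e-acute, two bytes in UTF-8

        String right = readAll(wire, StandardCharsets.UTF_8);
        String wrong = readAll(wire, StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8:      " + right.length() + " char(s)");
        System.out.println("decoded as ISO-8859-1: " + wrong.length() + " char(s)");
    }
}
```

The second decode does not fail; it silently yields two unrelated characters, which is why a mismatch tends to show up only as garbled text later on.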
 
Roedy Green

I understand that Java Strings are Unicode (charset), but how are Java
Strings stored in memory? As UTF-16 encoding or using the platform's
default charset?

The spec allows the implementor to do anything he pleases internally,
including 8-bit encodings. However, Strings behave as if they were
encoded as 16-bit Unicode chars.

They are converted to the default local encoding when you use a
PrintWriter for example without specifying an explicit encoding.
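
A sketch of the alternative, where the PrintWriter is wrapped around an OutputStreamWriter that names its charset. The helper name is hypothetical, and the in-memory stream stands in for a file so the example is self-contained:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PrintWithExplicitCharset {
    // Print the text through a PrintWriter whose underlying writer uses the given charset.
    static byte[] print(String text, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, cs));
        writer.print(text);
        writer.flush(); // push buffered chars through the encoder into the byte stream
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String pi = "\u03C0";
        System.out.println("UTF-8 output:      " + print(pi, StandardCharsets.UTF_8).length + " byte(s)");
        System.out.println("ISO-8859-1 output: " + print(pi, StandardCharsets.ISO_8859_1).length + " byte(s)");
    }
}
```
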

You can experiment writing files, then feeding them to the encoding
recognizer to figure out what encoding was actually used. Local
encodings are often 8-bit.
http://mindprod.com/applet/encodingrecogniser.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

Doubling the size of a team will probably make it produce even more slowly.
The problem is the more team members, the more secrets, the less each team
member understands about how it all fits together and how his changes may
adversely affect others.
 
Roedy Green

For serialization as a byte stream, Strings by default use the platform
default charset

I don't think so. They use UTF-8 with a leading count field, like
DataOutputStream. Otherwise such files would not be portable. I use
serialised streams all the time as resources. They would not work if
they were read back differently by different clients.

 
Mike Schilling

Roedy Green said:
I don't think so. They use UTF-8 with a leading count field, like
DataOutputStream. Otherwise such files would not be portable. I use
serialised streams all the time as resources. They would not work if
they were read back differently by different clients.

It's a complicated area, so we need to speak precisely.

DataOutputStream's writeChar() and writeChars() methods write characters as
UTF-16 code units, two bytes each, high byte first. Its writeUTF() method
writes a string in Java's modified UTF-8, preceded by a two-byte length.
None of these are affected by the platform's default encoding.

Java object serialization uses these methods. Again, its output is
unaffected by the platform's default encoding.

The platform's default charset does affect other places where chars are
converted to bytes and no encoding is specified. These include
String.getBytes() and the various Writer methods that output strings (e.g.
write(String)) if no encoding was specified when the Writer was created.
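
As a rough illustration of the DataOutputStream behaviour described above, the following sketch captures what writeUTF() and writeChars() emit for a one-character string. The class and helper names are made up, and it assumes Java 8 or later for UncheckedIOException:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DataOutputBytes {
    // Bytes produced by writeUTF: a two-byte length followed by modified UTF-8.
    static byte[] viaWriteUTF(String s) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeUTF(s);
            data.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes produced by writeChars: each char as two bytes, high byte first, no length prefix.
    static byte[] viaWriteChars(String s) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeChars(s);
            data.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String pi = "\u03C0";
        System.out.println("writeUTF   -> " + java.util.Arrays.toString(viaWriteUTF(pi)));
        System.out.println("writeChars -> " + java.util.Arrays.toString(viaWriteChars(pi)));
    }
}
```

Neither helper consults Charset.defaultCharset(), so the bytes are the same on every platform.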
 
Robert Klemme

On 12/10/2010 11:12 AM, cs_professional wrote:
There seems to be conflicting information on this. The official String
javadoc says platform's default charset:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])

"Constructs a new String by decoding the specified array of bytes
using the platform's default charset."

For serialization as a byte stream, Strings by default use the platform
default charset.

Please don't call String's getBytes() "serialization". Serialization is
a completely different mechanism (see [1]) and we don't really have to
bother about what that format looks like, because this is a Java-only
story and instances are guaranteed to come back as they were written.

Kind regards

robert


[1] http://download.oracle.com/javase/6/docs/api/java/io/Serializable.html
 
David

Strings internally are stored as chars, which are unsigned 16-bit integers
representing UTF-16 code units.

Strictly speaking, strings could be stored in some other format, like
UTF-32, or arrays of double where the integer part represents a
Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
either ISO-8859-1 or UTF-8 internally). However, the Sun reference
implementation uses UTF-16 on all platforms, and some of the methods
in String are easier to implement efficiently when that's the case.
 
Mike Schilling

David said:
Strictly speaking, strings could be stored in some other format, like
UTF-32, or arrays of double where the integer part represents a
Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
either ISO-8859-1 or UTF-8 internally). However, the Sun reference
implementation uses UTF-16 on all platforms, and some of the methods
in String are easier to implement efficiently when that's the case.

I'm wondering whether there's any guarantee that String.charAt() is O(0),
which would be next to impossible if the String were an array of UTF-32.
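
For what it is worth, charAt() is specified in terms of char values, that is UTF-16 code units, rather than whole Unicode characters. A short sketch of what that means for a character outside the Basic Multilingual Plane (the class name, method names and sample string are chosen only for illustration):

```java
final class CodeUnitsVersusCodePoints {
    // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
    static final String SAMPLE = "a\uD83D\uDE00";

    // Number of UTF-16 code units, which is what length() and charAt() index.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair as one.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits(SAMPLE));
        System.out.println("code points: " + codePoints(SAMPLE));
        // charAt(1) returns only the first half of the pair.
        System.out.println("charAt(1) is a high surrogate: " + Character.isHighSurrogate(SAMPLE.charAt(1)));
        System.out.println("codePointAt(1): U+" + Integer.toHexString(SAMPLE.codePointAt(1)).toUpperCase());
    }
}
```
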
 
Tom Anderson

I'm wondering whether there's any guarantee that String.charAt() is O(0),
which would be next to impossible if the String were an array of UTF-32.

O(0)?

tom
 
BGB


OoO, it's not just fast, it's miracle fast...

infinite fast...


it will, ever so gently, stretch open space-time, such that one can gaze
into its bowels...

say:
----
== ==
== ==
----
||

so, the magic O(0) operator, who needs O(1) now?...


ok, not really being serious here...

or such...
 
cs_professional

Thanks all! The conclusion is that Strings are typically stored in the
JVM as UTF-16. Any time the JVM needs to interact with the OS/platform
(e.g. file I/O, println, etc.) it converts the Strings to the
host/platform encoding (e.g. Windows-1252, also called CP-1252) unless
an encoding is specified. The developer can choose to convert the
Strings to some other encoding (e.g. UTF-8, recommended by the Java i18n
FAQ) by calling the appropriate APIs.

For Browser-Servlet interactions, this gets more complex with J2EE
container (e.g. Weblogic, Tomcat, etc.) specific behavior and the fact
that not all browsers transmit the encoding information consistently.
The most recommended way to handle multi-byte characters is to use UTF-8
everywhere... browser, container, file, database.
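
A rough stand-in for the browser/servlet mismatch, using only java.net.URLEncoder and URLDecoder rather than a real container, so nothing here shows how Weblogic or Tomcat actually behave. The class and method names are hypothetical. One side percent-encodes form data as UTF-8, and the other side decodes it either with the same charset or with a different one:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FormEncodingMismatch {
    // Percent-encode text the way form data is encoded, using the named charset.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decode percent-encoded text, interpreting the bytes in the named charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String sent = encode("\u03C0", "UTF-8");
        System.out.println("on the wire: " + sent);

        String agreed = decode(sent, "UTF-8");
        String mismatched = decode(sent, "ISO-8859-1");
        System.out.println("same charset on both sides keeps the text: " + agreed.equals("\u03C0"));
        System.out.println("different charset yields " + mismatched.length() + " chars instead of 1");
    }
}
```
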
 
