String default encoding: UTF-16 or Platform's default charset?

cs_professional · Dec 10, 2010

I understand that Java Strings are Unicode (charset), but how are Java
Strings stored in memory? As UTF-16 encoding or using the platform's
default charset?

There seems to be conflicting information on this. The official String
javadoc says platform's default charset:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
"Constructs a new String by decoding the specified array of bytes
using the platform's default charset."

I assume the platform's default charset is what you can get by
calling:
System.getProperty("file.encoding") OR
http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html#defaultCharset()

On my windows machine the above calls return Windows-1252 or CP-1252
(they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
So does this mean all Java Strings are encoded and stored in memory in
this Windows-1252 or CP-1252 format?

However, the "Java Internationalization FAQ" says UTF-16:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#recommended-charset
"... internal representation in Java, which is UTF-16".

So, what is the correct answer? Are Java Strings stored in memory as
UTF-16 or the platform's default charset?

Btw, I'm trying to understand this so I know what to expect in a more
complex i18n Browser-Servlet scenario.

Arne Vajhøj · Dec 10, 2010

Strings are stored as UTF-16.

The default char set applies to external representations.

Arne

Joshua Cranmer · Dec 10, 2010

I understand that Java Strings are Unicode (charset), but how are Java
Strings stored in memory? As UTF-16 encoding or using the platform's
default charset?

Strings internally are stored as chars, which are unsigned 16-bit integers
representing UTF-16 code units.

There seems to be conflicting information on this. The official String
javadoc says platform's default charset:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
"Constructs a new String by decoding the specified array of bytes
using the platform's default charset."

For serialization as a byte stream, Strings by default use the platform
default charset.

On my windows machine the above calls return Windows-1252 or CP-1252
(they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
So does this mean all Java Strings are encoded and stored in memory in
this Windows-1252 or CP-1252 format?

It can't be, since you can store, say, π in a Java string, which is not
a character in CP-1252. On the other hand, if your default charset is
CP-1252, you can't serialize that character (you'll get ? instead).
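A minimal sketch of that point, assuming the JVM ships the windows-1252 charset (most do). The class and method names are made up for illustration. The string holds the character in memory without trouble, and the loss happens only when it is encoded to bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    // Encode with an explicitly chosen charset instead of the platform default.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String pi = "\u03C0"; // GREEK SMALL LETTER PI, one char in memory
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println(pi.length());                             // 1
        System.out.println(hex(encode(pi, cp1252)));                 // 3F, the byte for '?'
        System.out.println(hex(encode(pi, StandardCharsets.UTF_8))); // CF 80
    }
}
```

String.getBytes(Charset) silently substitutes the charset's replacement byte for anything it cannot map, which is why no exception is thrown here.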

Btw, I'm trying to understand this so I know what to expect in a more
complex i18n Browser-Servlet scenario.

What you have to be concerned about is the translation between byte
arrays (or any input/output that reads/writes bytes, possibly
autoconverting (!) characters) and character arrays (or Strings or other
containers implementing CharSequence).
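As a rough illustration of why that byte/char boundary matters, the sketch below (hypothetical class name, windows-1252 assumed to be available) decodes the same two bytes with the charset that produced them and with a different one. Only the first recovers the original character:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatchDemo {
    // Decode bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Bytes as they might arrive from a file or socket: UTF-8 for U+03C0.
        byte[] wire = "\u03C0".getBytes(StandardCharsets.UTF_8); // CF 80

        String right = decode(wire, StandardCharsets.UTF_8);
        String wrong = decode(wire, Charset.forName("windows-1252"));

        System.out.println(right.equals("\u03C0"));       // true
        System.out.println(wrong.length());               // 2
        System.out.println(wrong.equals("\u00CF\u20AC")); // true: two unrelated chars
    }
}
```

The bytes themselves carry no label, so the reader has to be told (or has to agree in advance) which charset the writer used.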

Roedy Green · Dec 10, 2010

I understand that Java Strings are Unicode (charset), but how are Java
Strings stored in memory? As UTF-16 encoding or using the platform's
default charset?

The spec allows the implementor to do anything he pleases internally,
including 8-bit encodings. However, they behave as if they were
encoded as 16-bit Unicode chars.

They are converted to the default local encoding when you use a
PrintWriter for example without specifying an explicit encoding.

You can experiment writing files, then feeding them to the encoding
recognizer to figure out what encoding was actually used. Local
encodings are often 8-bit.
http://mindprod.com/applet/encodingrecogniser.html
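A small sketch of the same experiment done in memory rather than with files. The class and method names are invented for illustration. It writes one character through an OutputStreamWriter with an explicit charset and counts the bytes that come out:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Push text through a Writer that encodes with the given charset
    // and return the bytes that reached the underlying stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // What a Writer created without an explicit encoding would use:
        System.out.println(Charset.defaultCharset());

        System.out.println(write(text, StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(write(text, StandardCharsets.UTF_8).length);      // 2
        System.out.println(write(text, StandardCharsets.UTF_16BE).length);   // 2
    }
}
```

Passing the charset explicitly makes the output independent of the machine the code runs on.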
--
Roedy Green Canadian Mind Products
http://mindprod.com

Doubling the size of a team will probably make it produce even more slowly.
The problem is the more team members, the more secrets, the less each team
member understands about how it all fits together and how his changes may
adversely affect others.

Roedy Green · Dec 10, 2010

For serialization as a byte stream, Strings by default use the platform
default charset

I don't think so. They use UTF-8 with lead count field, like
DataOutputStream. Otherwise such files would not be portable. I use
serialised streams all the time as resources. They would not work if
they read back differently by different clients.

Mike Schilling · Dec 10, 2010

Roedy Green said:
I don't think so. They use UTF-8 with lead count field, like
DataOutputStream. Otherwise such files would not be portable. I use
serialised streams all the time as resources. They would not work if
they read back differently by different clients.

It's a complicated area, so we need to speak precisely.

DataOutputStream's writeChar() and writeChars() methods write characters as
UTF-16 code units. Its writeUTF() method writes a string in (Java's
version of) UTF-8. None of these are affected by the platform's default
encoding.

Java object serialization uses these methods. Again, its output is
unaffected by the platform's default encoding.
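A short sketch of what those methods put on the wire (the helper class is hypothetical). It shows the two-byte length prefix of writeUTF(), the big-endian 16-bit output of writeChars(), and one place where Java's modified UTF-8 differs from standard UTF-8: U+0000 is written as two bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataOutputDemo {
    // Bytes produced by DataOutputStream.writeUTF: length prefix + modified UTF-8.
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Bytes produced by DataOutputStream.writeChars: two bytes per char, high byte first.
    static byte[] viaWriteChars(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeChars(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex(viaWriteUTF("\u03C0")));   // 00 02 CF 80
        System.out.println(hex(viaWriteChars("\u03C0"))); // 03 C0
        System.out.println(hex(viaWriteUTF("\u0000")));   // 00 02 C0 80
    }
}
```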

The platform's default charset does affect other places where chars are
converted to bytes and no encoding is specified. These include
String.getBytes() and the various Writer methods that output strings (e.g.
write(String)) if no encoding was specified when the Writer was created.

Robert Klemme · Dec 10, 2010

On 12/10/2010 11:12 AM, cs_professional wrote:

There seems to be conflicting information on this. The official String
javadoc says platform's default charset:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])

"Constructs a new String by decoding the specified array of bytes
using the platform's default charset."

For serialization as a byte stream, Strings by default use the platform
default charset.

Please don't call String's getBytes() "serialization". Serialization is
a completely different mechanism (see [1]), and we don't really have to
care what that format looks like because this is a Java-only story and
instances are guaranteed to come back as they were written.

Kind regards

robert

[1] http://download.oracle.com/javase/6/docs/api/java/io/Serializable.html

David · Dec 10, 2010

Strings internally are stored as chars, which are unsigned 16-bit integers
representing UTF-16 code units.

Strictly speaking, strings could be stored in some other format, like
UTF-32, or arrays of double where the integer part represents a
Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
either ISO-8859-1 or UTF-8 internally). However, the Sun reference
implementation uses UTF-16 on all platforms, and some of the methods
in String are easier to implement efficiently when that's the case.
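Whatever the internal storage, the String API itself is specified in terms of UTF-16 code units. A brief sketch (class and helper names are invented here) using a character outside the Basic Multilingual Plane shows this:

```java
class CodeUnitDemo {
    // Build a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E); // MUSICAL SYMBOL G CLEF

        System.out.println(clef.length());                              // 2 code units
        System.out.println(clef.codePointCount(0, clef.length()));      // 1 code point
        System.out.println(Integer.toHexString(clef.charAt(0)));        // d834 (high surrogate)
        System.out.println(Integer.toHexString(clef.charAt(1)));        // dd1e (low surrogate)
        System.out.println(Integer.toHexString(clef.codePointAt(0)));   // 1d11e
    }
}
```

So length() and charAt() count and index 16-bit units, while codePointCount() and codePointAt() are the calls to reach for when whole characters matter.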

Mike Schilling · Dec 10, 2010

David said:
Strictly speaking, strings could be stored in some other format, like
UTF-32, or arrays of double where the integer part represents a
Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
either ISO-8859-1 or UTF-8 internally). However, the Sun reference
implementation uses UTF-16 on all platforms, and some of the methods
in String are easier to implement efficiently when that's the case.

I'm wondering whether there's any guarantee that String.charAt() is O(0),
which would be next to impossible if the String were an array of UTF-32.

Tom Anderson · Dec 11, 2010

I'm wondering whether there's any guarantee that String.charAt() is O(0),
which would be next to impossible if the String were an array of UTF-32.

O(0)?

tom

BGB · Dec 11, 2010

O(0)?

OoO, it's not just fast, it's miracle fast...

infinite fast...

it will, ever so gently, stretch open space-time, such that one can gaze
into its bowels...

say:
----
== ==
== ==
----
||

so, the magic O(0) operator, who needs O(1) now?...

ok, not really being serious here...

or such...

Mike Schilling · Dec 11, 2010

Tom Anderson said:
O(0)?

OK, I'll settle for O(1)

Tom Anderson · Dec 11, 2010

OK, I'll settle for O(1)

Sadly, i think the spec doesn't guarantee O(1) any more than O(0)!

tom

Arne Vajhøj · Dec 11, 2010

Sadly, i think the spec doesn't guarantee O(1) any more than O(0)!

We will have to settle for the fact that it seems to be the common
implementation.

Arne

cs_professional · Dec 12, 2010

Thanks all! The conclusion is that Strings are typically stored in the
JVM as UTF-16. Any time the JVM needs to interact with the OS/platform
(e.g. file I/O, println, etc.) it by default converts the Strings to
the host/platform encoding (e.g. Windows-1252, also known as CP-1252).
The developer can choose to convert the Strings to some other encoding
(e.g. UTF-8, recommended by the Java i18n FAQ) by calling the
appropriate APIs.
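One way this might look for file I/O, as a sketch only: the class name and temp-file prefix are made up, and java.nio.file requires Java 7 or later. The charset is passed on both the write and the read, so the platform default never comes into play:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileDemo {
    // Write text to a temporary file as UTF-8, read it back as UTF-8,
    // and return what was read. The file is removed afterwards.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                try (Writer w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                    w.write(text);
                }
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00E9 \u03C0";
        System.out.println(roundTrip(text).equals(text)); // true
    }
}
```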

For Browser-Servlet interactions, this gets more complex with J2EE
container (e.g. Weblogic, Tomcat, etc.) specific behavior and the fact
that not all Browsers transmit the encoding information consistently.
The most recommended way to handle multi-byte is to use UTF-8
everywhere... browser, container, file, database.
