String default encoding: UTF-16 or Platform's default charset?

Discussion in 'Java' started by cs_professional, Dec 10, 2010.

  1. I understand that Java Strings are Unicode (charset), but how are Java
    Strings stored in memory? As UTF-16 encoding or using the platform's
    default charset?

    There seems to be conflicting information on this; the official String
    javadoc says platform's default charset:
    http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
    "Constructs a new String by decoding the specified array of bytes
    using the platform's default charset."

    I assume the platform's default charset is what you can get by
    calling:
    System.getProperty("file.encoding") OR
    http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html#defaultCharset()

    On my windows machine the above calls return Windows-1252 or CP-1252
    (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
    So does this mean all Java Strings are encoded and stored in memory in
    this Windows-1252 or CP-1252 format?

    However, the "Java Internationalization FAQ" says UTF-16:
    http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#recommended-charset
    "... internal representation in Java, which is UTF-16".

    So, what is the correct answer? Are Java Strings stored in memory as
    UTF-16 or the platform's default charset?

    Btw, I'm trying to understand this so I know what to expect in a more
    complex i18n Browser-Servlet scenario.
    cs_professional, Dec 10, 2010
    #1

  2. cs_professional

    Arne Vajhøj Guest

    On 10-12-2010 11:12, cs_professional wrote:
    > I understand that Java Strings are Unicode (charset), but how are Java
    > String's stored in memory? As UTF-16 encoding or using the platform's
    > default charset?
    >
    > There seems to be conflicting information this, the official String
    > javadoc says platform's default charset:
    > http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
    > "Constructs a new String by decoding the specified array of bytes
    > using the platform's default charset."
    >
    > I assume the platform's default charset is what you can get by
    > calling:
    > System.getProperty("file.encoding") OR
    > http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html#defaultCharset()
    >
    > On my windows machine the above calls return Windows-1252 or CP-1252
    > (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
    > So does this mean all Java Strings are encoded and stored in memory in
    > this Windows-1252 or CP-1252 format?
    >
    > However, the "Java Internationalization FAQ" says UTF-16:
    > http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#recommended-charset
    > "... internal representation in Java, which is UTF-16".
    >
    > So, what is it correct answer? Are Java Strings stored in memory as
    > UTF-16 or the platform's default charset?
    >
    > Btw, I'm trying to understand this so I know what to expect in a more
    > complex i18n Browser-Servlet scenario.


    Strings are stored as UTF-16.

    The default char set applies to external representations.
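
    A minimal sketch of that distinction, assuming a standard JDK (the class
    and method names are made up for illustration): the no-charset
    String(byte[]) constructor decodes with whatever Charset.defaultCharset()
    reports, while a char in memory is a 16-bit value either way.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is not from the thread.
class DefaultCharsetDemo {
    // Decode bytes explicitly with the platform default charset,
    // which is what new String(byte[]) is documented to do.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // The default only matters when bytes are converted to or from chars.
        System.out.println("default charset: " + Charset.defaultCharset());

        byte[] ascii = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii).equals(decodeWithDefault(ascii)));

        // In memory, pi is the 16-bit value 0x03C0 regardless of the default.
        char pi = '\u03C0';
        System.out.println(Integer.toHexString(pi));
    }
}
```

    The printed charset name will differ between machines; the in-memory
    value of the char will not.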

    Arne
    Arne Vajhøj, Dec 10, 2010
    #2

  3. On 12/10/2010 11:12 AM, cs_professional wrote:
    > I understand that Java Strings are Unicode (charset), but how are Java
    > String's stored in memory? As UTF-16 encoding or using the platform's
    > default charset?


    Strings are stored internally as chars, which are unsigned 16-bit integers
    representing UTF-16 code units.

    > There seems to be conflicting information this, the official String
    > javadoc says platform's default charset:
    > http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
    > "Constructs a new String by decoding the specified array of bytes
    > using the platform's default charset."


    For serialization as a byte stream, Strings by default use the platform
    default charset.

    > On my windows machine the above calls return Windows-1252 or CP-1252
    > (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
    > So does this mean all Java Strings are encoded and stored in memory in
    > this Windows-1252 or CP-1252 format?


    It can't be, since you can store, say, π in a Java string, which is not
    a character in CP-1252. On the other hand, if your default charset is
    CP-1252, you can't serialize that character (you'll get ? instead).
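
    A quick way to see this, sketched under the assumption that the runtime
    ships the windows-1252 charset (common JDK builds do, but it is not one
    of the charsets every Java platform must support). The class name is
    hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the name is not from the thread.
class UnmappableCharDemo {
    // Encode text as windows-1252; assumes that charset is available.
    static byte[] encodeCp1252(String text) {
        return text.getBytes(Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String pi = "\u03C0"; // Greek small letter pi, written as an escape
        byte[] encoded = encodeCp1252(pi);

        // getBytes substitutes the charset's replacement byte for
        // characters it cannot map; here that is '?' (0x3F).
        System.out.println(encoded.length + " byte(s), first = 0x"
                + Integer.toHexString(encoded[0] & 0xFF));

        // The String itself still holds the original char.
        System.out.println(pi.charAt(0) == '\u03C0');
    }
}
```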

    > Btw, I'm trying to understand this so I know what to expect in a more
    > complex i18n Browser-Servlet scenario.


    What you have to be concerned about is the translation between byte
    arrays (or any input/output that reads/writes bytes, possibly
    autoconverting (!) characters) and character arrays (or Strings or other
    containers implementing CharSequence).
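
    To make the byte/char boundary concrete, here is a small sketch (class
    and method names are invented for the example) that encodes a string as
    UTF-8 and then decodes it once with the matching charset and once with
    ISO-8859-1, which stands in for "whatever the other side assumed".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is not from the thread.
class CharsetMismatchDemo {
    // Encode and decode with the same explicit charset.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8, then (wrongly) decode as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "cafe" with an acute accent on the e

        // Matching charsets give back the original text.
        System.out.println(roundTrip(original).equals(original));

        // Mismatched charsets turn the two UTF-8 bytes of the accented
        // letter into two separate chars, so the text grows from 4 to 5.
        System.out.println(misdecode(original).length());
    }
}
```

    Naming the charset on both sides of every conversion is what avoids the
    second outcome.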

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Dec 10, 2010
    #3
  4. cs_professional

    Roedy Green Guest

    On Fri, 10 Dec 2010 08:12:13 -0800 (PST), cs_professional
    <> wrote, quoted or indirectly quoted someone who
    said :

    >I understand that Java Strings are Unicode (charset), but how are Java
    >String's stored in memory? As UTF-16 encoding or using the platform's
    >default charset?


    The spec allows the implementor to do anything he pleases internally,
    including 8-bit encodings. However, they behave as if they were
    encoded as 16-bit Unicode chars.

    They are converted to the default local encoding when you use a
    PrintWriter for example without specifying an explicit encoding.

    You can experiment writing files, then feeding them to the encoding
    recognizer to figure out what encoding was actually used. Local
    encodings are often 8-bit.
    http://mindprod.com/applet/encodingrecogniser.html
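
    If you would rather not depend on the default at all, name the encoding
    when you create the writer. A rough sketch (hypothetical class name,
    writing into memory instead of a file so the byte counts are easy to
    inspect):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is not from the thread.
class ExplicitWriterDemo {
    // Write text through a Writer that uses the given charset
    // and return the bytes that were produced.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // e with acute accent
        // An 8-bit encoding uses one byte for this character...
        System.out.println(write(text, StandardCharsets.ISO_8859_1).length);
        // ...while UTF-8 and UTF-16BE both use two.
        System.out.println(write(text, StandardCharsets.UTF_8).length);
        System.out.println(write(text, StandardCharsets.UTF_16BE).length);
    }
}
```
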
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Doubling the size of a team will probably make it produce even more slowly.
    The problem is the more team members, the more secrets, the less each team
    member understands about how it all fits together and how his changes may
    adversely affect others.
    Roedy Green, Dec 10, 2010
    #4
  5. cs_professional

    Roedy Green Guest

    On Fri, 10 Dec 2010 12:52:32 -0500, Joshua Cranmer
    <> wrote, quoted or indirectly quoted someone
    who said :

    >For serialization as a byte stream, Strings by default use the platform
    >default charset


    I don't think so. They use UTF-8 with a leading count field, like
    DataOutputStream. Otherwise such files would not be portable. I use
    serialised streams all the time as resources. They would not work if
    they were read back differently by different clients.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Roedy Green, Dec 10, 2010
    #5
  6. "Roedy Green" <> wrote in message
    news:...
    > On Fri, 10 Dec 2010 12:52:32 -0500, Joshua Cranmer
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >>For serialization as a byte stream, Strings by default use the platform
    >>default charset

    >
    > I don't think so. They use UTF-8 with lead count field, like
    > DataOutputStream. Otherwise such files would not be portable. I use
    > serialised streams all the time as resources. They would not work if
    > they read back differently by different clients.


    It's a complicated area, so we need to speak precisely.

    DataOutputStream's writeChar() and writeChars() methods write characters as
    UTF-16 code units. Its writeUTF() method writes a string in (Java's
    version of) UTF-8. None of these are affected by the platform's default
    encoding.

    Java object serialization uses these methods. Again, its output is
    unaffected by the platform's default encoding.

    The platform's default charset does affect other places where chars are
    converted to bytes and no encoding is specified. These include
    String.getBytes() and the various Writer methods that output strings (e.g.
    write(String)) if no encoding was specified when the Writer was created.
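
    The DataOutputStream behaviour can be checked with a short sketch like
    the following (class and helper names are made up here). It writes the
    same one-character string both ways into a byte array and dumps the
    result as hex.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; the name is not from the thread.
class DataOutputDemo {
    // writeUTF: two-byte length prefix followed by modified UTF-8.
    static byte[] viaWriteUTF(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // writeChars: each char as two bytes, high byte first.
    static byte[] viaWriteChars(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeChars(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Format bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String pi = "\u03C0";
        System.out.println(hex(viaWriteUTF(pi)));   // 00 02 CF 80
        System.out.println(hex(viaWriteChars(pi))); // 03 C0
    }
}
```

    Neither call consults the default charset, so the output is the same on
    every platform.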
    Mike Schilling, Dec 10, 2010
    #6
  7. On 12/10/2010 06:52 PM, Joshua Cranmer wrote:
    > On 12/10/2010 11:12 AM, cs_professional wrote:


    >> There seems to be conflicting information this, the official String
    >> javadoc says platform's default charset:
    >> http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
    >>
    >> "Constructs a new String by decoding the specified array of bytes
    >> using the platform's default charset."

    >
    > For serialization as a byte stream, Strings by default use the platform
    > default charset.


    Please don't call String's getBytes() "serialization". Serialization is
    a completely different mechanism (see [1]), and we don't really have to
    care what that format looks like, because it is a Java-only story and
    instances are guaranteed to come back as they were written.

    Kind regards

    robert


    [1] http://download.oracle.com/javase/6/docs/api/java/io/Serializable.html
    Robert Klemme, Dec 10, 2010
    #7
  8. cs_professional

    David Guest

    On 10 dic, 12:52, Joshua Cranmer <> wrote:
    > On 12/10/2010 11:12 AM, cs_professional wrote:
    >
    > > I understand that Java Strings are Unicode (charset), but how are Java
    > > String's stored in memory? As UTF-16 encoding or using the platform's
    > > default charset?

    >
    > Strings are stored internally as chars, which are unsigned 16-bit integers
    > representing UTF-16 code units.


    Strictly speaking, strings could be stored in some other format, like
    UTF-32, or arrays of double where the integer part represents a
    Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
    either ISO-8859-1 or UTF-8 internally). However, the Sun reference
    implementation uses UTF-16 on all platforms, and some of the methods
    in String are easier to implement efficiently when that's the case.
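
    Whatever the storage, the API exposes UTF-16 code units. A small sketch
    (hypothetical class name) using a character outside the Basic
    Multilingual Plane shows where that is visible:

```java
// Hypothetical demo class; the name is not from the thread.
class CodeUnitDemo {
    // Number of UTF-16 code units, which is what length() counts.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, written as its surrogate pair.
        String clef = "\uD834\uDD1E";

        System.out.println(codeUnits(clef));  // 2
        System.out.println(codePoints(clef)); // 1

        // charAt indexes code units, so index 0 is only the high surrogate.
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));

        // codePointAt combines the pair back into one code point.
        System.out.println(Integer.toHexString(clef.codePointAt(0)));
    }
}
```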

    --
    DLL
    David, Dec 10, 2010
    #8
  9. "David" <> wrote in message
    news:...
    > On 10 dic, 12:52, Joshua Cranmer <> wrote:
    >> On 12/10/2010 11:12 AM, cs_professional wrote:
    >>
    >> > I understand that Java Strings are Unicode (charset), but how are Java
    >> > String's stored in memory? As UTF-16 encoding or using the platform's
    >> > default charset?

    >>
    >> Strings are stored internally as chars, which are unsigned 16-bit integers
    >> representing UTF-16 code units.

    >
    > Strictly speaking, strings could be stored in some other format, like
    > UTF-32, or arrays of double where the integer part represents a
    > Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
    > either ISO-8859-1 or UTF-8 internally). However, the Sun reference
    > implementation uses UTF-16 on all platforms, and some of the methods
    > in String are easier to implement efficiently when that's the case.


    I'm wondering whether there's any guarantee that String.charAt() is O(0),
    which would be next to impossible if the String were an array of UTF-32.
    Mike Schilling, Dec 11, 2010
    #9
  10. cs_professional

    Tom Anderson Guest

    On Fri, 10 Dec 2010, Mike Schilling wrote:

    > "David" <> wrote in message
    > news:...
    >
    >> Strictly speaking, strings could be stored in some other format, like
    >> UTF-32, or arrays of double where the integer part represents a Unicode
    >> codepoint, or Perl's SvPV type (that carries a flag and can be either
    >> ISO-8859-1 or UTF-8 internally).

    >
    > I'm wondering whether there's any guarantee that String.charAt() is O(0),
    > which would be next to impossible if the String were an array of UTF-32.


    O(0)?

    tom

    --
    william gibson said that the future has already happened, it just isn't
    evenly distributed. he was talking specifically about finsbury park. --
    andy
    Tom Anderson, Dec 11, 2010
    #10
  11. cs_professional

    BGB Guest

    On 12/11/2010 7:27 AM, Tom Anderson wrote:
    > On Fri, 10 Dec 2010, Mike Schilling wrote:
    >
    >> "David" <> wrote in message
    >> news:...
    >>
    >>> Strictly speaking, strings could be stored in some other format, like
    >>> UTF-32, or arrays of double where the integer part represents a
    >>> Unicode codepoint, or Perl's SvPV type (that carries a flag and can
    >>> be either ISO-8859-1 or UTF-8 internally).

    >>
    >> I'm wondering whether there's any guarantee that String.charAt() is
    >> O(0), which would be next to impossible if the String were an array of
    >> UTF-32.

    >
    > O(0)?
    >


    OoO, it's not just fast, it's miracle fast...

    infinite fast...


    it will, ever so gently, stretch open space-time, such that one can gaze
    into its bowels...

    say:
    ----
    == ==
    == ==
    ----
    ||

    so, the magic O(0) operator, who needs O(1) now?...


    ok, not really being serious here...

    or such...
    BGB, Dec 11, 2010
    #11
  12. "Tom Anderson" <> wrote in message
    news:...
    > On Fri, 10 Dec 2010, Mike Schilling wrote:
    >
    >> "David" <> wrote in message
    >> news:...
    >>
    >>> Strictly speaking, strings could be stored in some other format, like
    >>> UTF-32, or arrays of double where the integer part represents a Unicode
    >>> codepoint, or Perl's SvPV type (that carries a flag and can be either
    >>> ISO-8859-1 or UTF-8 internally).

    >>
    >> I'm wondering whether there's any guarantee that String.charAt() is O(0),
    >> which would be next to impossible if the String were an array of UTF-32.

    >
    > O(0)?


    OK, I'll settle for O(1)
    Mike Schilling, Dec 11, 2010
    #12
  13. cs_professional

    Tom Anderson Guest

    On Sat, 11 Dec 2010, Mike Schilling wrote:

    > "Tom Anderson" <> wrote in message
    > news:...
    >> On Fri, 10 Dec 2010, Mike Schilling wrote:
    >>
    >>> "David" <> wrote in message
    >>> news:...
    >>>
    >>>> Strictly speaking, strings could be stored in some other format, like
    >>>> UTF-32, or arrays of double where the integer part represents a Unicode
    >>>> codepoint, or Perl's SvPV type (that carries a flag and can be either
    >>>> ISO-8859-1 or UTF-8 internally).
    >>>
    >>> I'm wondering whether there's any guarantee that String.charAt() is O(0),
    >>> which would be next to impossible if the String were an array of UTF-32.

    >>
    >> O(0)?

    >
    > OK, I'll settle for O(1)


    Sadly, i think the spec doesn't guarantee O(1) any more than O(0)!

    tom

    --
    Know who said that? Fucking Terrorvision, that's who. -- D
    Tom Anderson, Dec 11, 2010
    #13
  14. cs_professional

    Arne Vajhøj Guest

    On 11-12-2010 16:37, Tom Anderson wrote:
    > On Sat, 11 Dec 2010, Mike Schilling wrote:
    >> "Tom Anderson" <> wrote in message
    >> news:...
    >>> On Fri, 10 Dec 2010, Mike Schilling wrote:
    >>>> "David" <> wrote in message
    >>>> news:...
    >>>>> Strictly speaking, strings could be stored in some other format,
    >>>>> like UTF-32, or arrays of double where the integer part represents
    >>>>> a Unicode codepoint, or Perl's SvPV type (that carries a flag and
    >>>>> can be either ISO-8859-1 or UTF-8 internally).
    >>>>
    >>>> I'm wondering whether there's any guarantee that String.charAt() is
    >>>> O(0), which would be next to impossible if the String were an array
    >>>> of UTF-32.
    >>>
    >>> O(0)?

    >>
    >> OK, I'll settle for O(1)

    >
    > Sadly, i think the spec doesn't guarantee O(1) any more than O(0)!


    We will have to settle for the fact that it seems to be the common
    implementation.

    Arne
    Arne Vajhøj, Dec 12, 2010
    #14
  15. Thanks all! The conclusion is that Strings are typically stored in the
    JVM as UTF-16. Anytime the JVM needs to interact with the os/platform
    (e.g. file i/o, println, etc.) it by default converts the Strings to
    the host/platform encoding (e.g. Windows-1252 or CP-1252). The
    developer can choose to convert the Strings to some other encoding
    (e.g. UTF-8 recommended by Java i18n FAQ) by calling the appropriate
    APIs.

    For Browser-Servlet interactions, this gets more complex with J2EE
    container (e.g. Weblogic, Tomcat, etc.) specific behavior and the fact
    that not all Browsers transmit the encoding information consistently.
    The most commonly recommended way to handle multi-byte text is to use
    UTF-8 everywhere: browser, container, file, database.
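
    As a rough illustration of why both ends have to agree, here is a sketch
    that uses java.net.URLEncoder and URLDecoder as a stand-in for what a
    browser and a servlet container do with form parameters. It does not use
    the servlet API, and the class and method names are made up for the
    example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; the name is not from the thread.
class FormEncodingDemo {
    // Percent-encode a value the way a form submission would,
    // using the named charset for the underlying bytes.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decode a percent-encoded value using the named charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String sent = encode("\u03C0", "UTF-8");
        System.out.println(sent); // %CF%80

        // Receiver assumes UTF-8 as well: the original text comes back.
        System.out.println(decode(sent, "UTF-8").equals("\u03C0"));

        // Receiver assumes ISO-8859-1: two unrelated chars come back.
        System.out.println(decode(sent, "ISO-8859-1").length());
    }
}
```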
    cs_professional, Dec 12, 2010
    #15
