String default encoding: UTF-16 or Platform's default charset?

Discussion in 'Java' started by cs_professional, Dec 10, 2010.

  1. I understand that Java Strings are Unicode (charset), but how are Java
    Strings stored in memory? As UTF-16 encoding or using the platform's
    default charset?

    There seems to be conflicting information on this. The official String
    javadoc says platform's default charset:
    http://download.oracle.com/javase/6/docs/api/java/lang/String.html#String(byte[])
    "Constructs a new String by decoding the specified array of bytes
    using the platform's default charset."

    I assume the platform's default charset is what you can get by
    calling:
    System.getProperty("file.encoding") OR
    http://java.sun.com/javase/6/docs/api/java/nio/charset/Charset.html#defaultCharset()

    On my windows machine the above calls return Windows-1252 or CP-1252
    (they are the same thing: http://en.wikipedia.org/wiki/Windows-1252).
    So does this mean all Java Strings are encoded and stored in memory in
    this Windows-1252 or CP-1252 format?

    However, the "Java Internationalization FAQ" says UTF-16:
    http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#recommended-charset
    "... internal representation in Java, which is UTF-16".

    So, what is the correct answer? Are Java Strings stored in memory as
    UTF-16 or the platform's default charset?

    Btw, I'm trying to understand this so I know what to expect in a more
    complex i18n Browser-Servlet scenario.
     
    cs_professional, Dec 10, 2010
    #1

  2. cs_professional

    Arne Vajhøj Guest

    Strings are stored as UTF-16.

    The default char set applies to external representations.

    Arne
     
    Arne Vajhøj, Dec 10, 2010
    #2

  3. Strings internally are stored as chars, which are unsigned 16-bit integers
    representing UTF-16 code units.
    For serialization as a byte stream, Strings by default use the platform
    default charset.
    The in-memory format can't be CP-1252, since you can store, say, π in a
    Java string, which is not a character in CP-1252. On the other hand, if
    your default charset is CP-1252, you can't encode that character to
    bytes (you'll get ? instead).
    What you have to be concerned about is the translation between byte
    arrays (or any input/output that reads/writes bytes, possibly
    autoconverting (!) characters) and character arrays (or Strings or other
    containers implementing CharSequence).
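
    A minimal sketch of the π example, for illustration only. ISO-8859-1
    stands in for CP-1252 here (an assumption made because ISO-8859-1 is
    guaranteed to exist on every JVM; neither charset has a mapping for π).
    The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PiEncodingDemo {
    // Encodes with ISO-8859-1, used here as a stand-in for CP-1252.
    // Unmappable characters are replaced with the charset's replacement byte.
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Encodes with UTF-8, which can represent every Unicode character.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Written as an escape so the source file's own encoding does not matter.
        String pi = "\u03C0";

        // In memory, the String exposes a single 16-bit char with value 0x03C0.
        System.out.printf("char value: U+%04X%n", (int) pi.charAt(0));

        // Only the conversion to bytes depends on a charset.
        System.out.println("ISO-8859-1: " + Arrays.toString(toLatin1(pi))); // [63], i.e. '?'
        System.out.println("UTF-8:      " + Arrays.toString(toUtf8(pi)));   // [-49, -128], i.e. CF 80
    }
}
```

    The String itself holds π without trouble; the character is lost only
    when it is encoded into a charset that cannot represent it.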
     
    Joshua Cranmer, Dec 10, 2010
    #3
  4. cs_professional

    Roedy Green Guest

    The spec allows the implementor to do anything he pleases internally,
    including 8-bit encodings. However, they behave as if they were
    encoded as 16-bit Unicode chars.

    They are converted to the default local encoding when you use a
    PrintWriter for example without specifying an explicit encoding.
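
    As a rough sketch of the difference (the class and helper names are made
    up for illustration), the same text can be pushed through a PrintWriter
    with different charsets and the resulting bytes compared. Passing
    Charset.defaultCharset() explicitly is meant to mimic what happens when
    no encoding is specified.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Writes text through a PrintWriter that encodes with the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(sink, charset))) {
            out.print(text);
        } // closing the writer flushes the encoder into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"

        // Platform dependent: the result varies from machine to machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + write(text, Charset.defaultCharset()).length);

        // Explicit charsets give the same bytes everywhere.
        System.out.println("UTF-8 bytes:     " + write(text, StandardCharsets.UTF_8).length);      // 5
        System.out.println("Latin-1 bytes:   " + write(text, StandardCharsets.ISO_8859_1).length); // 4
    }
}
```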

    You can experiment writing files, then feeding them to the encoding
    recognizer to figure out what encoding was actually used. Local
    encodings are often 8-bit.
    http://mindprod.com/applet/encodingrecogniser.html
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Doubling the size of a team will probably make it produce even more slowly.
    The problem is the more team members, the more secrets, the less each team
    member understands about how it all fits together and how his changes may
    adversely affect others.
     
    Roedy Green, Dec 10, 2010
    #4
  5. cs_professional

    Roedy Green Guest

    I don't think serialization uses the default charset. Serialized
    Strings use UTF-8 with a lead count field, like DataOutputStream.
    Otherwise such files would not be portable. I use serialised streams
    all the time as resources. They would not work if they were read back
    differently by different clients.

     
    Roedy Green, Dec 10, 2010
    #5
  6. It's a complicated area, so we need to speak precisely.

    DataOutputStream's writeChar() and writeChars() methods write characters as
    UTF-16 code units. Its writeUTF() method writes a string in (Java's
    version of) UTF-8. None of these are affected by the platform's default
    encoding.

    Java object serialization uses these methods. Again, its output is
    unaffected by the platform's default encoding.
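
    A small sketch of those two DataOutputStream methods, with hypothetical
    class and helper names. It writes the same one-character string both
    ways and prints the bytes in hex; no charset is passed anywhere.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class DataOutputDemo {
    // Bytes produced by writeChars(): each char as two bytes, high byte first.
    static byte[] viaWriteChars(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeChars(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Bytes produced by writeUTF(): a two-byte length, then modified UTF-8.
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String pi = "\u03C0"; // π
        System.out.println("writeChars: " + hex(viaWriteChars(pi))); // 03 C0
        System.out.println("writeUTF:   " + hex(viaWriteUTF(pi)));   // 00 02 CF 80
    }
}
```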

    The platform's default charset does affect other places where chars are
    converted to bytes and no encoding is specified. These include
    String.getBytes() and the various Writer methods that output strings (e.g.
    write(String)) if no encoding was specified when the Writer was created.
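
    To make the chars-to-bytes boundary concrete, here is a hedged sketch
    (hypothetical names throughout) that encodes a String with one charset
    and decodes it with another. When the two match, the text survives; when
    they don't, you get the familiar garbage characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes s to bytes with one charset, then decodes those bytes with another.
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "na\u00EFve"; // "naïve"

        // Same charset on both sides: the text comes back unchanged.
        String ok = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(text)); // true

        // UTF-8 bytes decoded as ISO-8859-1: ï (C3 AF) turns into two chars.
        String garbled = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length()); // 6 instead of 5

        // The no-argument getBytes() and new String(byte[]) use the platform
        // default, so their result depends on where the code runs.
        System.out.println(new String(text.getBytes()).equals(text));
    }
}
```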
     
    Mike Schilling, Dec 10, 2010
    #6
  7. Please don't call String's getBytes() "serialization". Serialization is
    a completely different mechanism (see [1]) and we don't really have to
    bother with what that format looks like, because this is a Java-only
    story and instances are guaranteed to come back as they were written.

    Kind regards

    robert


    [1] http://download.oracle.com/javase/6/docs/api/java/io/Serializable.html
     
    Robert Klemme, Dec 10, 2010
    #7
  8. cs_professional

    David Guest

    Strictly speaking, strings could be stored in some other format, like
    UTF-32, or arrays of double where the integer part represents a
    Unicode codepoint, or Perl's SvPV type (that carries a flag and can be
    either ISO-8859-1 or UTF-8 internally). However, the Sun reference
    implementation uses UTF-16 on all platforms, and some of the methods
    in String are easier to implement efficiently when that's the case.
     
    David, Dec 10, 2010
    #8
  9. I'm wondering whether there's any guarantee that String.charAt() is O(0),
    which would be next to impossible if the String were an array of UTF-32.
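
    For what it's worth, a quick sketch of why the storage format matters
    here: charAt() indexes UTF-16 code units, not whole characters. The
    class and method names below are invented for the example, and it says
    nothing about what complexity the spec actually promises.

```java
class CodeUnitDemo {
    // Builds a one-character String outside the Basic Multilingual Plane:
    // U+1D11E MUSICAL SYMBOL G CLEF, which needs a surrogate pair in UTF-16.
    static String gClef() {
        return new String(Character.toChars(0x1D11E));
    }

    public static void main(String[] args) {
        String clef = gClef();

        // length() counts 16-bit code units, so one character reports 2.
        System.out.println(clef.length());                           // 2
        System.out.println(clef.codePointCount(0, clef.length()));   // 1

        // charAt() returns a single code unit: here, half of the pair.
        System.out.printf("%04X%n", (int) clef.charAt(0));           // D834
        System.out.printf("%04X%n", (int) clef.charAt(1));           // DD1E
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true

        // codePointAt() reassembles the full code point from the pair.
        System.out.printf("%X%n", clef.codePointAt(0));              // 1D11E
    }
}
```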
     
    Mike Schilling, Dec 11, 2010
    #9
  10. cs_professional

    Tom Anderson Guest

    O(0)?

    tom
     
    Tom Anderson, Dec 11, 2010
    #10
  11. cs_professional

    BGB Guest

    OoO, it's not just fast, it's miracle fast...

    infinite fast...


    it will, ever so gently, stretch open space-time, such that one can gaze
    into its bowels...

    say:
    ----
    == ==
    == ==
    ----
    ||

    so, the magic O(0) operator, who needs O(1) now?...


    ok, not really being serious here...

    or such...
     
    BGB, Dec 11, 2010
    #11
  12. OK, I'll settle for O(1)
     
    Mike Schilling, Dec 11, 2010
    #12
  13. cs_professional

    Tom Anderson Guest

    Sadly, i think the spec doesn't guarantee O(1) any more than O(0)!

    tom
     
    Tom Anderson, Dec 11, 2010
    #13
  14. cs_professional

    Arne Vajhøj Guest

    We will have to settle for the fact that it seems to be the common
    implementation.

    Arne
     
    Arne Vajhøj, Dec 12, 2010
    #14
  15. Thanks all! The conclusion is that Strings are typically stored in the
    JVM as UTF-16. Anytime the JVM needs to interact with the OS/platform
    (e.g. file i/o, println, etc.) it by default converts the Strings to
    the host/platform encoding (e.g. Windows-1252 or CP-1252). The
    developer can choose to convert the Strings to some other encoding
    (e.g. UTF-8 recommended by Java i18n FAQ) by calling the appropriate
    APIs.

    For Browser-Servlet interactions, this gets more complex because of
    behavior specific to each J2EE container (e.g. WebLogic, Tomcat, etc.)
    and the fact that not all Browsers transmit the encoding information
    consistently. The most commonly recommended way to handle multi-byte
    characters is to use UTF-8 everywhere: browser, container, file,
    database.
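
    A rough illustration of the browser side of this, using only
    java.net.URLEncoder and URLDecoder rather than any real servlet API. It
    assumes, purely for the sake of the example, a sender that form-encodes
    in UTF-8 and a receiver that falls back to ISO-8859-1 when no encoding
    is declared. The class and helper names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class FormEncodingDemo {
    // Percent-encodes s the way an HTML form would, using the named charset.
    static String encode(String s, String charsetName) {
        try {
            return URLEncoder.encode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decodes a percent-encoded value, interpreting the bytes in the named charset.
    static String decode(String s, String charsetName) {
        try {
            return URLDecoder.decode(s, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u03C0"; // π

        // Sender side (assumed): the value is form-encoded as UTF-8.
        String onTheWire = encode(original, "UTF-8");
        System.out.println(onTheWire); // %CF%80

        // Receiver agrees on UTF-8: the character survives.
        System.out.println(decode(onTheWire, "UTF-8").equals(original)); // true

        // Receiver assumes ISO-8859-1 instead: two wrong chars come back.
        String wrong = decode(onTheWire, "ISO-8859-1");
        System.out.println(wrong.length()); // 2
    }
}
```

    Both sides agreeing on UTF-8 is what makes the round trip work, which is
    the point of using it everywhere.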
     
    cs_professional, Dec 12, 2010
    #15