Unicode and UTF-8

Discussion in 'Java' started by maxwelton, Oct 9, 2005.

  1. maxwelton

    maxwelton Guest

    I have a situation where I read a string of settings from
    a cookie. One section of this string has UTF-8 encoded
    characters.
    From an earlier thread (see "UTF-8 and Unicode, Oct 17, 2001") I
    discovered that to get the bytes I have to use a call like this,
    specifying "8859_1":

    String cookieData; // populated by cookie token.
    ..
    ..
    byte[] utf8Contents = cookieData.getBytes("8859_1");

    // then to get it where the applet will display it
    // I have to put it in UTF-16 by doing this.
    String userData = new String(utf8Contents, "UTF-8");
    This works for the Unicode values I have tried, up to 0441. I
    don't know of any limits so far, but what bothers me,
    and what I don't understand, is why "8859_1" has
    to be specified. Nothing I am doing should be specific
    to that encoding. I have tried getBytes() without
    specifying the encoding, but it didn't seem to work
    unless I missed something. Should it have worked?
     
    maxwelton, Oct 9, 2005
    #1

  2. John C. Bollinger

    That's a shameful thing for the server (client?) to do to you.

    "Put it in UTF-16" may be technically accurate for some VMs (notably
    Sun's), but it is a poor way to conceptualize what is happening. You
    NEED to get a firm grasp on the difference between bytes (byte
    sequences) and characters, and you then need to appreciate that Strings
    are fundamentally sequences of characters, not bytes.

    It's lucky that it works for the values you tried. Chances are good,
    then, that it will work for the entire Unicode BMP. If you need to worry
    about characters outside the BMP then I'd test some of those.

    Evidently you are mistaken that nothing you are doing is specific to
    ISO-8859-1, because the procedure you describe works. Or at least, even
    if you are not explicitly doing anything with ISO-8859-1, something must
    be using it on your behalf. Since that's the default charset for HTTP
    character data, chances are good that whatever tool is processing the
    HTTP traffic is using it to parse the cookies out of the HTTP header.

    String.getBytes() has to use _some_ charset to convert the characters to
    bytes. The nullary version of the method uses the platform default
    charset; that could in fact be ISO-8859-1, but typically isn't, so
    myString.getBytes("ISO-8859-1") is usually not the same thing as
    myString.getBytes(). Your procedure is converting the character
    sequence obtained in String form from your Cookie back into the original
    byte sequence by reversing an INCORRECT ISO-8859-1 decoding. With the
    bytes in hand, you then decode them (correctly) via UTF-8 into the
    desired character sequence. You can't use just any random charset to
    get the bytes -- it has to be the one that was used to produce the
    incorrect String in the first place.
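
    As a rough illustration of that round trip, here is a minimal sketch.
    The class and method names are made up for this example, and it
    assumes a Java version that has java.nio.charset.StandardCharsets
    (Java 7 or later), which postdates this thread. It simulates the
    incorrect ISO-8859-1 decoding and then reverses it:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library: undoes a wrong
// ISO-8859-1 decoding of bytes that were really UTF-8.
final class MojibakeRepair {

    static String repair(String misdecoded) {
        // Step 1: recover the original bytes by re-encoding with the SAME
        // charset that was (wrongly) used to decode them.
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the charset they were written in.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String intended = "\u0441"; // CYRILLIC SMALL LETTER ES, U+0441
        byte[] wire = intended.getBytes(StandardCharsets.UTF_8); // 0xD1 0x81

        // Simulate the HTTP layer decoding the UTF-8 bytes as ISO-8859-1.
        String misdecoded = new String(wire, StandardCharsets.ISO_8859_1);

        System.out.println(misdecoded.length());                 // 2
        System.out.println(repair(misdecoded).equals(intended)); // true
    }
}
```

    Passing any other charset in step 1 would generally produce different
    bytes, which is why the plain getBytes() call only works when the
    platform default happens to be ISO-8859-1.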

    Note also that it is not in general guaranteed that decoding a byte
    sequence and then re-encoding the resulting characters will get back the
    original input, even when the same charset is used for both steps. That
    it seems to work in this case probably means that the charset
    implementation in use is being a bit lazy, but that's beside the point.
    Because the decoding is not in general 100% reversible, and because some
    software may have trouble with non-ASCII characters in HTTP headers, it
    is *much better* not to rely on it. A better approach is to encode the
    UTF-8 (byte) representation into ASCII characters, such as via a Base-64
    encoding or even a URL encoding. Since ISO-8859-* (and UTF-8) coincide
    with ASCII over the range covered by ASCII, you can be more certain that
    what you get out on the receiving end is the same as what you put in on
    the sending end, and you can work with the result (i.e. reverse the
    double encoding) without depending so heavily on assumptions about the
    details of the underlying HTTP protocol handling.
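
    A sketch of the Base-64 variant might look like the following. It
    assumes java.util.Base64 (Java 8 or later, so not available when this
    was written), and the class and method names are invented for the
    example. The URL-safe alphabet without padding is one possible choice
    that keeps '+', '/' and '=' out of the cookie value:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: carries arbitrary Unicode text in a cookie value
// using only ASCII letters, digits, '-' and '_'.
final class CookieBase64 {

    static String encode(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(utf8);
    }

    static String decode(String cookieValue) {
        byte[] utf8 = Base64.getUrlDecoder().decode(cookieValue);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0441\u0432\u0435\u0442"; // four Cyrillic letters
        String cookieValue = encode(original);
        System.out.println(cookieValue);                          // ASCII only
        System.out.println(decode(cookieValue).equals(original)); // true
    }
}
```

    Because the cookie value is pure ASCII, it reads the same whether the
    HTTP layer treats the header as ISO-8859-1, UTF-8, or ASCII.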
     
    John C. Bollinger, Oct 10, 2005
    #2

  3. Roedy Green

    Roedy Green Guest

    My understanding is that cookies are limited to 7-bit ASCII, and
    further that you can't use any of the special HTTP characters, e.g.
    ? = &. You are expected to URL-encode your cookie. Cookies have to be
    transported in HTTP headers, so you can't get too fancy with characters.

    see http://mindprod.com/jgloss/urlencoded.html
     
    Roedy Green, Oct 10, 2005
    #3
  4. Roedy Green

    Roedy Green Guest

    UTF-16 is one way of encoding Unicode externally as bytes. Internally,
    Java represents Unicode text with the char type, which is 16 bits wide.
    One is a sequence of bytes, the other a sequence of chars.
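
    To make the byte/char distinction concrete, here is a small sketch
    (names invented for the example, StandardCharsets assumed available)
    that counts chars and encoded bytes for a BMP character and for one
    outside the BMP:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares a String's char count with the number
// of bytes it occupies in two external encodings.
final class CharsVersusBytes {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE writes no byte-order mark, so this counts only the text.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String es = "\u0441";         // U+0441, inside the BMP
        String clef = "\uD834\uDD1E"; // U+1D11E, outside the BMP

        System.out.println(es.length());    // 1 char
        System.out.println(utf8Length(es)); // 2 bytes
        System.out.println(utf16Length(es)); // 2 bytes

        System.out.println(clef.length());                          // 2 chars (a surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.println(utf8Length(clef));                       // 4 bytes
        System.out.println(utf16Length(clef));                      // 4 bytes
    }
}
```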

    See http://mindprod.com/jgloss/utf.html
    http://mindprod.com/jgloss/encoding.html
     
    Roedy Green, Oct 10, 2005
    #4
  5. maxwelton

    maxwelton Guest

    Thanks for the info. I realized I was going about this the
    wrong way. When the data is sent out to the server, two parts go
    out URL-encoded as UTF-8. This particular server needs those in
    UTF-8 because of multiple language support. When reading the
    cookie back, all I needed to do was specify UTF-8 when decoding
    these two token strings. So URLDecoder.decode(token, "UTF-8")
    is what solved this problem. Thanks.
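
    For anyone finding this later, a minimal sketch of both directions.
    The class and method names are made up, and the sample token is
    illustrative rather than the real cookie contents:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper wrapping the two-argument URLEncoder/URLDecoder calls.
final class CookieTokenCodec {

    static String encodeToken(String token) {
        try {
            return URLEncoder.encode(token, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    static String decodeToken(String token) {
        try {
            return URLDecoder.decode(token, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u0441";            // U+0441
        String encoded = encodeToken(original);
        System.out.println(encoded);           // %D1%81
        System.out.println(decodeToken(encoded).equals(original)); // true
    }
}
```

    The encoded form is plain ASCII, so it no longer matters which charset
    the HTTP layer uses to read the cookie header.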
     
    maxwelton, Oct 10, 2005
    #5