Unicode and UTF-8

maxwelton · Oct 9, 2005

I have a situation where I read a string of settings from
a cookie. One section of this string has UTF-8 encoded
characters.

From reading an earlier topic in this newsgroup (see
"UTF-8 and Unicode, Oct 17, 2001"), I discovered that to
get the bytes I have to use a call like this, specifying
"8859_1":

String cookieData; // populated from the cookie token
// ...
byte[] utf8Contents = cookieData.getBytes("8859_1");

// Then, to get it into a form the applet will display,
// I have to put it in UTF-16 by doing this.
String userData = new String(utf8Contents, "UTF-8");

From the testing I have done so far, this works on all
the Unicode values I have tried, up to U+0441. I
don't know of any limits so far, but what bothers me,
and what I don't understand, is why "8859_1" has
to be specified. Nothing I am doing should be specific
to that encoding. I have tried getBytes() without
specifying the encoding, but it didn't seem to work,
unless I missed something. Should it have worked?

John C. Bollinger · Oct 9, 2005

I have a situation where I read a string of settings from
a cookie. One section of this string has UTF-8 encoded
characters.

That's a shameful thing for the server (client?) to do to you.

From reading an earlier topic in this news group;

see "UTF-8 and Unicode, Oct 17, 2001" I discover to
get the bytes I have to use a call like this specifying
"8859_1":

String cookieData; // populated by cookie token.
.
.
byte[] utf8Contents = cookieData.getBytes("8859_1");

// then to get it where the applet will display it
// I have to put it in UTF-16 by doing this.
String userData = new String(utf8Contents, "UTF-8");

"Put it in UTF-16" may be technically accurate for some VMs (notably
Sun's), but it is a poor way to conceptualize what is happening. You
NEED to get a firm grasp on the difference between bytes (byte
sequences) and characters, and you need then to appreciate that Strings
are fundamentally sequences of characters, not bytes.

the unicode values I have tried up to 0441.

That's lucky. Chances are good, then, that it will work for the entire
Unicode BMP. If you need to worry about characters outside the BMP then
I'd test some of those.
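
A quick way to test a character outside the BMP is sketched below. It is only an illustration: the class name SupplementaryCheck and the method survivesRoundTrip are made up for this example, and it assumes the garbling is a plain ISO-8859-1 decoding of well-formed UTF-8 bytes, which may not match what your server actually sends.

```java
import java.nio.charset.StandardCharsets;

class SupplementaryCheck {
    // Hypothetical helper: simulates UTF-8 bytes being mis-read as ISO-8859-1,
    // applies the getBytes("8859_1") / new String(..., "UTF-8") repair,
    // and reports whether the original text comes back unchanged.
    static boolean survivesRoundTrip(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        byte[] recovered = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(recovered, StandardCharsets.UTF_8).equals(original);
    }

    public static void main(String[] args) {
        String bmp = "\u0441";                 // U+0441, inside the BMP
        String supplementary = "\uD834\uDD1E"; // U+1D11E, outside the BMP (a surrogate pair)
        System.out.println(survivesRoundTrip(bmp));
        System.out.println(survivesRoundTrip(supplementary));
    }
}
```

If the real cookie data fails where this simulation succeeds, the bytes are probably being altered somewhere other than in the ISO-8859-1 decoding step.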

I don't know of any limits so far, but what bothers me
and what I don't understand is why does "8859_1" have
to be specified? Nothing I am doing should be specific
to that encoding.

Evidently you are mistaken, because the procedure you describe works.
Or at least, even if you are not explicitly doing anything with
ISO-8859-1, something must be using it on your behalf. Since that's the
default charset for HTTP character data, chances are good that whatever
tool is processing the HTTP traffic is using it to parse the cookies out
of the HTTP header.

I have tried the getBytes() without
specifying the encoding but it didn't seem to work
unless I missed something. Should it have worked?

String.getBytes() has to use _some_ charset to convert the characters to
bytes. The nullary version of the method uses the platform default
charset; that could in fact be ISO-8859-1, but typically isn't, so
myString.getBytes("ISO-8859-1") is usually not the same thing as
myString.getBytes(). Your procedure is converting the character
sequence obtained in String form from your Cookie back into the original
byte sequence by reversing an INCORRECT ISO-8859-1 decoding. With the
bytes in hand, you then decode them (correctly) via UTF-8 into the
desired character sequence. You can't use just any random charset to
get the bytes -- it has to be the one that was used to produce the
incorrect String in the first place.
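
To make the two steps concrete, here is a minimal sketch of what appears to be happening. The class and method names (MisdecodeDemo, misdecode, repair) are invented for illustration, and the misdecode step is only my assumption about what the HTTP layer does with the cookie bytes.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Assumed behaviour of the HTTP layer: the UTF-8 bytes of the cookie
    // value are decoded as if they were ISO-8859-1, one char per byte.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The repair from the original post: undo the ISO-8859-1 decoding to
    // get the bytes back, then decode those bytes as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0441";           // CYRILLIC SMALL LETTER ES
        String garbled = misdecode(original);
        System.out.println(garbled.length()); // 2: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original)); // true

        // Using a different charset to get the bytes does not recover the text.
        byte[] wrong = garbled.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(wrong, StandardCharsets.UTF_8).equals(original)); // false
    }
}
```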

Note also that it is not in general guaranteed that decoding a byte
sequence and then re-encoding the resulting characters will get back the
original input, even when the same charset is used for both steps. That
it works in this case is most likely because Java's ISO-8859-1 decoder
maps each of the 256 byte values to the character with the same code, so
nothing is lost, but that's beside the point. Because decoding is not in
general 100% reversible, and because some software may have trouble with
non-ASCII characters in HTTP headers, it is *much better* not to rely on
it. A better approach is to encode the UTF-8 (byte) representation into
ASCII characters, such as via a Base-64 encoding or even a URL encoding.
Since ISO-8859-* (and UTF-8) coincide with ASCII over the range
covered by ASCII, you can be more certain that what you get out on the
receiving end is the same as what you put in on the sending end, and you
can work with the result (i.e. reverse the double encoding) with far
less dependence on, and fewer assumptions about, the details of the
underlying HTTP protocol handling.
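
As a rough sketch of the Base-64 variant: the code below uses java.util.Base64, which only exists in Java 8 and later, so on an older VM you would need some other Base-64 implementation. The class and method names (CookieBase64, toCookieValue, fromCookieValue) are hypothetical, and the choice of the URL-safe alphabet without padding is my own assumption, made to keep characters such as = out of the cookie value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class CookieBase64 {
    // Text -> UTF-8 bytes -> ASCII-only Base-64 string for the cookie.
    static String toCookieValue(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(utf8);
    }

    // Reverse: Base-64 string -> UTF-8 bytes -> text.
    static String fromCookieValue(String value) {
        byte[] utf8 = Base64.getUrlDecoder().decode(value);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u0441\u0442";  // two Cyrillic letters
        String cookieValue = toCookieValue(text);
        System.out.println(cookieValue);  // ASCII letters, digits, '-' and '_' only
        System.out.println(fromCookieValue(cookieValue).equals(text));
    }
}
```

Because the cookie value is pure ASCII, it no longer matters which ISO-8859-* or UTF-8 decoding the HTTP layer applies to the header.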

Roedy Green · Oct 10, 2005

the unicode values I have tried up to 0441.

My understanding is that cookies are limited to 7-bit ASCII and, further,
that you can't use any of the HTTP special characters, e.g. ? = &. You are
expected to URL-encode your cookie. Cookies have to be transported in HTTP
headers, so you can't get too fancy with characters.

see http://mindprod.com/jgloss/urlencoded.html

Roedy Green · Oct 10, 2005

"Put it in UTF-16" may be technically accurate for some VMs (notably
Sun's), but it is a poor way to conceptualize what is happening. You
NEED to get a firm grasp on the difference between bytes (byte
sequences) and characters, and you need then to appreciate that Strings
are fundamentally sequences of characters, not bytes.

UTF-16 is one way of encoding Unicode externally as bytes. Internally,
Java represents Unicode text with the char type, which is 16 bits wide.
One deals in bytes, the other in chars.
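
A small sketch of the distinction, with made-up names (CharsVersusBytes, byteCount): the same one-char String turns into a different number of bytes depending on which charset does the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsVersusBytes {
    // Number of bytes the given charset needs to encode the string.
    static int byteCount(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u0441";  // one char: CYRILLIC SMALL LETTER ES
        System.out.println(s.length());                               // 1 char
        System.out.println(byteCount(s, StandardCharsets.UTF_8));     // 2 bytes
        System.out.println(byteCount(s, StandardCharsets.UTF_16BE));  // 2 bytes
    }
}
```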

See http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/encoding.html

maxwelton · Oct 10, 2005

Roedy said:
My understanding is cookies are limited to ASCII-7 and further you
can't use any of the HTTP characters e.g. ? = & You are expected to
URL-encode your cookie. They have to be transported in HTTP headers,
so can't get too fancy on characters.

see http://mindprod.com/jgloss/urlencoded.html

Thanks for the info. I realized I was going about this the
wrong way. When the data is sent out to the server, two parts
go out URL-encoded as UTF-8. This particular server needs those
in UTF-8 because of multiple-language support. When reading the
cookie back, all I needed to do was URL-decode those two token
strings as UTF-8. So URLDecoder.decode(token, "UTF-8")
is what solved this problem. Thanks.
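
For anyone finding this later, a minimal sketch of that round trip follows. The class and method names (CookieUrlCodec, encode, decode) are just for illustration, and the sample token is not the real cookie data.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class CookieUrlCodec {
    // URL-encode the text, using UTF-8 for any non-ASCII characters.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is required on every JVM
        }
    }

    // URL-decode the token, interpreting the %XX bytes as UTF-8.
    static String decode(String token) {
        try {
            return URLDecoder.decode(token, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String token = encode("\u0441");
        System.out.println(token);                          // %D1%81
        System.out.println(decode(token).equals("\u0441")); // true
    }
}
```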
