UTF-8 to Unicode conversion in ajax response

Tim Streater

I have what was originally a gb2312 string, mime-encoded. Here is an
example:

=?gb2312?B?otUtxvMg0rUg1dAgxrggViBTIMPmIMrUILy8IA==?=

This I decode, convert to UTF-8, and store in an SQLite database. That
string I then retrieve and use as a response to an ajax request from a
browser. So far, this is all PHP, but bear with me.

In the browser, the ajax responseText is taken and put in a table cell
as the cell's innerText, and the cell proceeds to display Chinese
characters.

Now, if I look at the responseText with e.g. string.charCodeAt(),
instead of seeing a series of UTF-8 bytes, I see a series of (large)
unicode values which correspond to the Chinese characters displayed.

So, where is this apparently automatic UTF-8 -> unicode conversion
taking place? Can someone point me at a description of the process?
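
A rough sketch of the step being asked about, assuming the response body arrives as UTF-8 bytes and is treated as UTF-8 by the browser: the decoding happens before script code ever sees responseText, which is a JavaScript string of UTF-16 code units rather than a byte array. `TextDecoder` stands in here for the browser's internal decoder, and the bytes are the UTF-8 form of the character 語 used as an example later in this thread.

```javascript
// UTF-8 bytes for U+8A9E (語), as they would travel over the wire.
const wireBytes = new Uint8Array([0xe8, 0xaa, 0x9e]);

// Stand-in for the decoding the browser performs before exposing responseText.
const responseText = new TextDecoder("utf-8").decode(wireBytes);

// Collect what charCodeAt() reports for each position in the string.
const codeUnits = [];
for (let i = 0; i < responseText.length; i++) {
  codeUnits.push(responseText.charCodeAt(i).toString(16));
}

// Three bytes went in; the script sees a single code unit, 0x8a9e.
console.log(responseText.length, codeUnits);
```

So charCodeAt() reports Unicode (UTF-16) values because the string was already decoded, not because a conversion runs when the text is displayed.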

In fact, there are two places on the browser page where this
responseText might be displayed (the data is retrieved at different
times using different ajax calls and different PHP scripts). In one
place what's displayed appears to be OK, but in another, a *second*
conversion seems to be happening and I'm trying to pin this down.
 
Bart Van der Donck

Tim said:
I have what was originally a gb2312 string, mime-encoded. Here is an
example:

=?gb2312?B?otUtxvMg0rUg1dAgxrggViBTIMPmIMrUILy8IA==?=

This I decode, convert to UTF-8, and store in an SQLite database. That
string I then retrieve and use as a response to an ajax request from a
browser. So far, this is all PHP, but bear with me.

Sorry for the nitpick, but MIME is a much too general term here. You
are referring to the Encoded-Word Syntax as described in RFC 2047.

I think your scenario sounds okay if SQLite stores the correct
multibyte ISO-8859-1 characters (min. 1, max. 4).
In the browser, the ajax responseText is taken and put in a table cell
as the cell's innerText, and the cell proceeds to display Chinese
characters.

I would have a preference for innerHTML, but okay.
Now, if I look at the responseText with e.g. string.charCodeAt(),
instead of seeing a series of UTF-8 bytes, I see a series of (large)
unicode values which correspond to the Chinese characters displayed.
So, where is this apparently automatic UTF-8 -> unicode conversion
taking place? Can someone point me at a description of the process?

The response is sent as a percent-encoded multi-byte sequence.

- Say the following character is sent:
http://ja.wikipedia.org/wiki/語
- Corresponds in UTF-8 to the following three ISO-8859-1 characters:
èªž (e with a grave accent + ordinal indicator + z with caron)
- Becomes after percent-encoding:
%E8%AA%9E
It is this value that is received after the AJAX-request.
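
A small sketch of the byte/percent relationship described above, using the standard `encodeURIComponent` and `decodeURIComponent` functions. Note the assumption in scope: percent-encoding is the notation used for bytes in URLs, whereas an HTTP response body normally carries the raw bytes themselves, so this illustrates the notation rather than what is literally on the wire.

```javascript
const han = "\u8a9e"; // 語

// Percent-encoding as used in URLs: one %XX per UTF-8 byte.
const percentForm = encodeURIComponent(han);

// The same three bytes, obtained directly from a UTF-8 encoder.
const rawBytes = Array.from(new TextEncoder().encode(han));
const fromBytes = rawBytes
  .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
  .join("");

console.log(percentForm);                              // %E8%AA%9E
console.log(fromBytes === percentForm);                // true
console.log(decodeURIComponent(percentForm) === han);  // true
```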
In fact, there are two places on the browser page where this
responseText might be displayed (the data is retrieved at different
times using different ajax calls and different PHP scripts). In one
place what's displayed appears to be OK, but in another, a *second*
conversion seems to be happening and I'm trying to pin this down.

Could you give a URL ?

I would think that it's an automatic conversion which you can't
influence; XMLHttpRequest assumes that a URL is percent-encoded in
UTF-8 ('Café' > 'Caf%C3%A9' and not the older 'Caf%E9').

AJAX will/should show 'Caf%C3%A9' as 'Café' and not literally as
'Caf%C3%A9'. If you want the latter, you would do 'Caf%25C3%25A9'.
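
The three forms of 'Café' mentioned here can be reproduced with a short sketch. `encodeURIComponent` is the standard function; `percentEncodeLatin1` is a hypothetical helper written only for this illustration, to show the older one-byte-per-character form.

```javascript
const word = "Caf\u00e9"; // Café

// Hypothetical helper: percent-encode using one byte per character
// (ISO-8859-1). Only works for code points up to 0xFF.
function percentEncodeLatin1(s) {
  return Array.from(s, (c) => {
    const cp = c.codePointAt(0);
    if (cp > 0xff) throw new RangeError("not representable in ISO-8859-1");
    if (/[A-Za-z0-9]/.test(c)) return c;
    return "%" + cp.toString(16).toUpperCase().padStart(2, "0");
  }).join("");
}

const utf8Form = encodeURIComponent(word);        // Caf%C3%A9
const latin1Form = percentEncodeLatin1(word);     // Caf%E9
const literalForm = encodeURIComponent(utf8Form); // Caf%25C3%25A9

// decodeURIComponent expects UTF-8, so the Latin-1 form is rejected.
let latin1Rejected = false;
try {
  decodeURIComponent(latin1Form);
} catch (e) {
  latin1Rejected = e instanceof URIError;
}

console.log(utf8Form, latin1Form, literalForm, latin1Rejected);
```

Decoding `literalForm` once gives back the literal text 'Caf%C3%A9', which is the escaping trick described above.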

The appropriate headers are also necessary (<meta> and HTTP-header).

Hope this helps,
 
Tim Streater

Bart Van der Donck said:
Sorry for the nitpick, but MIME is a much too general term here. You
are referring to the Encoded-Word Syntax as described in RFC 2047.

I think your scenario sounds okay if SQLite stores the correct
multibyte ISO-8859-1 characters (min. 1, max. 4).

Yes.

I would have a preference for innerHTML, but okay.

:) I try to use innerHTML only when I have html to put in.
The response is sent as a percent-encoded multi-byte sequence.

- Say the following character is sent:
http://ja.wikipedia.org/wiki/語
- Corresponds in UTF-8 to the following three ISO-8859-1 characters:
èªž (e with a grave accent + ordinal indicator + z with caron)
- Becomes after percent-encoding:
%E8%AA%9E
It is this value that is received after the AJAX-request.

Thanks - that was the info I sought. Since then, it's been suggested
that I put out a header to start the response, with charset utf-8. I did
this and it solved my problem.
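
A sketch of why declaring the charset can matter, assuming the PHP script emits UTF-8 bytes. If the receiver treats each byte as one ISO-8859-1 character instead of decoding UTF-8, the text is garbled, and converting that garbled text to UTF-8 again (a "second conversion") doubles the damage. In PHP the declaration would typically be something like header('Content-Type: text/plain; charset=utf-8') before any output; the text/plain type is an assumption, as the thread does not say which type was used.

```javascript
// UTF-8 bytes for "è" (U+00E8): C3 A8.
const sentBytes = new TextEncoder().encode("\u00e8");

// Correct handling: decode the bytes as UTF-8.
const decodedRight = new TextDecoder("utf-8").decode(sentBytes);

// Wrong handling: treat every byte as one ISO-8859-1 character.
const decodedWrong = String.fromCharCode(...sentBytes); // "Ã¨"

// A second UTF-8 conversion applied to the wrong string: 2 bytes become 4.
const twiceEncoded = Array.from(new TextEncoder().encode(decodedWrong))
  .map((b) => b.toString(16).toUpperCase())
  .join(" ");

console.log(decodedRight, decodedWrong, twiceEncoded);
```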
Could you give a URL ?

I would, but there isn't one really. It's wherever you happen to be
running it. It's not a classic browser <-> Internet <-> apache <->
website scenario. In my case, browser, instance of apache, scripts, and
SQLite are all on the same machine.

Thanks for the feedback.
 
Bart Van der Donck

Stanimir said:
Fri, 20 May 2011 04:38:53 -0700 (PDT), /Bart Van der Donck/:

What is this beast supposed to be? ISO-8859-1 defines 256
characters, each one encoded using a single byte (eight bits).

Yes, it is those 256 characters that are combined to form a UTF-8
character, for example:

e -> e (1 byte)
e with a grave accent [*] -> Ã + ¨ (2 bytes)
e with a circumflex and acute accent [**] -> á + º + ¿ (3 bytes)
Han 024B62 [***] -> ð + ¤ + [SHY] + ¢ (4 bytes)

This article is in ISO-8859-1, hopefully the characters display
correctly on Usenet.

[*] http://en.wikipedia.org/wiki/è
[**] http://en.wiktionary.org/wiki/ế
[***] http://www.fileformat.info/info/unicode/char/24B62/
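
The byte counts in the list above can be checked with a short sketch using the standard `TextEncoder`, which always produces UTF-8. The `toHex` helper and the variable names are illustrative only.

```javascript
// The four characters from the list: e, è, ế and U+24B62.
const samples = ["e", "\u00e8", "\u1ebf", String.fromCodePoint(0x24b62)];

// Format a byte array as space-separated upper-case hex.
const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

const byteTable = samples.map((s) => ({
  codePoint: "U+" + s.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  utf8: toHex(new TextEncoder().encode(s)),
}));

for (const row of byteTable) {
  console.log(row.codePoint, "->", row.utf8);
}
// U+0065 -> 65
// U+00E8 -> C3 A8
// U+1EBF -> E1 BA BF
// U+24B62 -> F0 A4 AD A2
```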
 
Jukka K. Korpela

23.5.2011 10:56, Bart Van der Donck said:
Yes, it is those 256 characters that are combined to form a UTF-8
character

No, UTF-8 does not combine "ISO-8859-1 characters". It combines octets
(bytes). It is absurd to interpret the octets as characters according to
some 8-bit encoding, except for the trivial case that an octet in the
range 0...7F (hexadecimal) can be interpreted according to ASCII (or any
encoding that coincides with ASCII in that range), because UTF-8 was
designed that way.
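
To make the "octets, not characters" point concrete, here is a minimal sketch of how a code point's bits are packed into UTF-8 octets. `encodeUtf8` is a hypothetical helper written for this illustration, not a library function; real code would normally use `TextEncoder`.

```javascript
// Pack one Unicode code point into its UTF-8 octets.
function encodeUtf8(cp) {
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff || isSurrogate) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (cp <= 0x7f) return [cp]; // same octet as ASCII
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  }
  if (cp <= 0xffff) {
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

console.log(encodeUtf8(0x65));    // one octet, identical to ASCII
console.log(encodeUtf8(0xe8));    // two octets: 0xC3 0xA8
console.log(encodeUtf8(0x24b62)); // four octets: 0xF0 0xA4 0xAD 0xA2
```

The lead octet carries a length marker plus the high bits, and each following octet carries six more bits behind a 10xxxxxx prefix; none of the multi-octet values stands for a character on its own.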
for example:

e -> e (1 byte)

That's an odd and misleading way of saying that the letter "e" has the
same representation in UTF-8 as in ISO-8859-1 (or any ISO-8859-something
or...).
e with a grave accent [*] -> Ã + ¨ (2 bytes)

That's a completely wrong statement. The bytes that constitute the UTF-8
encoded form of "è" are just bytes, octets, numbers. The only meaningful
interpretation in the UTF-8 context is that only their combination has a
meaning as a character.

In error situations, when a program interprets UTF-8 data as e.g.
ISO-8859-1, the data that was meant to represent "è" is shown as "Ã¨".
But this is an error and surely implies no such idea that UTF-8 would
combine two or more characters to represent a character.
 
Bart Van der Donck

Jukka said:
That's an odd and misleading way of saying that the letter "e" has the
same representation in UTF-8 as in ISO-8859-1 (or any ISO-8859-something
or...).

Well, both UTF-8 and ISO-8859-X tie code point 0x65 to 'e'. Thus it
makes sense to me to state that the representation of code point 0x65
is identical here (frankly, I'd be surprised to see a workable charset
that doesn't tie it to 'e').
e with a grave accent [*] -> Ã + ¨ (2 bytes)

That's a completely wrong statement. The bytes that constitute the UTF-8
encoded form of "è" are just bytes, octets, numbers. The only meaningful
interpretation in the UTF-8 context is that only their combination has a
meaning as a character.

I would say it's an educatively valuable way of looking at UTF-8, and
I believe this view is more common than you think. See e.g. the
erroneous UTF-8 displays which you mention, the percent-encoding in
URL's, HEX, etc. IMHO this view helps a lot to get a clear idea of how
UTF-8 works.
In error situations, when a program interprets UTF-8 data as e.g.
ISO-8859-1, the data that was meant to represent "è" is shown as "Ã¨".
But this is an error and surely implies no such idea that UTF-8 would
combine two or more characters to represent a character.

Yes.
 
Stanimir Stamenkov

Mon, 23 May 2011 02:06:34 -0700 (PDT), /Bart Van der Donck/:
Well, both UTF-8 and ISO-8859-X tie code point 0x65 to 'e'. Thus it
makes sense to me to state that the representation of code point 0x65
is identical here (frankly, I'd be surprised to see a workable charset
that doesn't tie it to 'e').

You've said quite a different thing: "ISO-8859-1 characters are
combined to form a UTF-8 character" - this is wrong. The fact that some
UTF-8 byte sequence could be decoded successfully using some single
byte encoding, resulting in some totally unrelated characters, is a
coincidence.
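
One way to see the asymmetry, as a sketch: any sequence of bytes can be read one byte at a time by a single-byte encoding, but only some byte sequences are well-formed UTF-8. The `isValidUtf8` helper is hypothetical; it relies on the standard `TextDecoder` option `fatal: true`, which makes decoding throw on malformed input.

```javascript
// Return true if the byte sequence is well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array(bytes));
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValidUtf8([0xc3, 0xa8])); // true: the UTF-8 form of "è"
console.log(isValidUtf8([0xe8]));       // false: "è" as a single ISO-8859-1 byte
console.log(isValidUtf8([0xa8, 0xc3])); // false: same two octets, wrong order
```

The octets only mean something as UTF-8 in the right combination and order, which is hard to square with the idea that each one is a character in its own right.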
e with a grave accent [*] -> Ã + ¨ (2 bytes)

That's a completely wrong statement. The bytes that constitute the UTF-8
encoded form of "è" are just bytes, octets, numbers. The only meaningful
interpretation in the UTF-8 context is that only their combination has a
meaning as a character.

I would say it's an educatively valuable way of looking at UTF-8, and
I believe this view is more common than you think. See e.g. the
erroneous UTF-8 displays which you mention, the percent-encoding in
URL's, HEX, etc. IMHO this view helps a lot to get a clear idea of how
UTF-8 works.

Well, I think you're just misleading people who don't understand
text encoding well, and so you're making the problem of people not
understanding text encoding worse.
 
