UTF-8 to Unicode conversion in ajax response

Tim Streater · May 17, 2011

I have what was originally a gb2312 string, mime-encoded. Here is an
example:

=?gb2312?B?otUtxvMg0rUg1dAgxrggViBTIMPmIMrUILy8IA==?=

This I decode, convert to UTF-8, and store in an SQLite database. That
string I then retrieve and use as a response to an ajax request from a
browser. So far, this is all PHP, but bear with me.

In the browser, the ajax responseText is taken and put in a table cell
as the cell's innerText, and the cell proceeds to display Chinese
characters.

Now, if I look at the responseText with e.g. string.charCodeAt(),
instead of seeing a series of UTF-8 bytes, I see a series of (large)
unicode values which correspond to the Chinese characters displayed.

So, where is this apparently automatic UTF-8 -> unicode conversion
taking place? Can someone point me at a description of the process?

In fact, there are two places on the browser page where this
responseText might be displayed (the data is retrieved at different
times using different ajax calls and different PHP scripts). In one
place what's displayed appears to be OK, but in another, a *second*
conversion seems to be happening and I'm trying to pin this down.

Bart Van der Donck · May 20, 2011

Tim said:
I have what was originally a gb2312 string, mime-encoded. Here is an
example:

=?gb2312?B?otUtxvMg0rUg1dAgxrggViBTIMPmIMrUILy8IA==?=

This I decode, convert to UTF-8, and store in an SQLite database. That
string I then retrieve and use as a response to an ajax request from a
browser. So far, this is all PHP, but bear with me.

Sorry for the nitpick, but MIME is a much too general term here. You
are referring to the Encoded-Word Syntax as described in RFC 2047.
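For illustration, an encoded-word of the form =?charset?encoding?encoded-text?= can be taken apart roughly as follows. This is only a minimal sketch: parseEncodedWord is a hypothetical helper name, only the B (base64) encoding is handled, and the GB2312 bytes are left undecoded because a suitable decoder is not assumed to be available.

```javascript
// Sketch: split an RFC 2047 encoded-word into its charset and raw bytes.
// parseEncodedWord is a hypothetical helper, not a library function.
function parseEncodedWord(word) {
  // Expected form: =?charset?encoding?encoded-text?=
  const match = /^=\?([^?]+)\?([bBqQ])\?([^?]*)\?=$/.exec(word);
  if (match === null) {
    throw new Error("not an RFC 2047 encoded-word");
  }
  const [, charset, encoding, text] = match;
  if (encoding.toUpperCase() !== "B") {
    throw new Error("only the B (base64) encoding is handled in this sketch");
  }
  // atob returns a "binary string": one char code per decoded byte.
  const binary = atob(text);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return { charset: charset.toLowerCase(), bytes };
}

const parsed = parseEncodedWord(
  "=?gb2312?B?otUtxvMg0rUg1dAgxrggViBTIMPmIMrUILy8IA==?="
);
console.log(parsed.charset, parsed.bytes.length);
```

The bytes returned here are still GB2312; turning them into UTF-8 is the separate conversion step described in the original post.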

I think your scenario sounds okay if SQLite stores the correct
multibyte ISO-8859-1 characters (min. 1, max. 4).

In the browser, the ajax responseText is taken and put in a table cell
as the cell's innerText, and the cell proceeds to display Chinese
characters.

I would have a preference for innerHTML, but okay.

Now, if I look at the responseText with e.g. string.charCodeAt(),
instead of seeing a series of UTF-8 bytes, I see a series of (large)
unicode values which correspond to the Chinese characters displayed.
So, where is this apparently automatic UTF-8 -> unicode conversion
taking place? Can someone point me at a description of the process?

The response is sent as a percent-encoded multi-byte sequence.

- Say the following character is sent:
http://ja.wikipedia.org/wiki/語
- Corresponds in UTF-8 to the following three ISO-8859-1 characters:
èªž (e with a grave accent + ordinal indicator + z with caron)
- Becomes after percent-encoding:
%E8%AA%9E
It is this value that is received after the AJAX-request.
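The byte values quoted above can be checked with a short sketch using the standard TextEncoder, TextDecoder and encodeURIComponent. It only shows how the character, its UTF-8 bytes and the percent-encoded form relate; it makes no claim about what XMLHttpRequest does internally. Once the bytes have been decoded as UTF-8, charCodeAt() reports the Unicode value rather than the individual bytes, which matches what was observed in responseText.

```javascript
// The character from the example, U+8A9E.
const ch = "\u8A9E";

// Its UTF-8 encoding is three bytes.
const utf8Bytes = new TextEncoder().encode(ch);
const hexBytes = Array.from(utf8Bytes, (b) =>
  b.toString(16).toUpperCase().padStart(2, "0")
);
console.log(hexBytes.join(" ")); // E8 AA 9E

// Percent-encoding writes each of those bytes as %XX.
const percentEncoded = encodeURIComponent(ch);
console.log(percentEncoded); // %E8%AA%9E

// Decoding the bytes as UTF-8 gives back one character, and
// charCodeAt() then reports the code unit 0x8A9E, not the bytes.
const decoded = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(decoded.length, decoded.charCodeAt(0).toString(16)); // 1 8a9e
```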

In fact, there are two places on the browser page where this
responseText might be displayed (the data is retrieved at different
times using different ajax calls and different PHP scripts). In one
place what's displayed appears to be OK, but in another, a *second*
conversion seems to be happening and I'm trying to pin this down.

Could you give a URL ?

I would think that it's an automatic conversion which you can't
influence; XMLHttpRequest assumes that a URL is percent-encoded in
UTF-8 ('Café' > 'Caf%C3%A9' and not the older 'Caf%E9').

AJAX will/should show 'Caf%C3%A9' as 'Café' and not literally as
'Caf%C3%A9'. If you want the latter, you would do 'Caf%25C3%25A9'.
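The percent-decoding side of this can be tried with the standard decodeURIComponent. This sketch only covers that function; whether and where a given AJAX stack applies it is not shown here.

```javascript
// UTF-8 percent-encoding decodes to the accented character.
const decodedUtf8 = decodeURIComponent("Caf%C3%A9");
console.log(decodedUtf8); // Café

// Escaping the percent signs themselves (%25) keeps the sequence literal.
const decodedEscaped = decodeURIComponent("Caf%25C3%25A9");
console.log(decodedEscaped); // Caf%C3%A9

// The older single-byte form is not valid UTF-8, so decoding it throws.
let legacyError = null;
try {
  decodeURIComponent("Caf%E9");
} catch (err) {
  legacyError = err;
}
console.log(legacyError instanceof URIError); // true
```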

The appropriate headers are also necessary (<meta> and HTTP-header).

Hope this helps,

Tim Streater · May 20, 2011

Bart Van der Donck said:
Sorry for the nitpick, but MIME is a much too general term here. You
are referring to the Encoded-Word Syntax as described in RFC 2047.

I think your scenario sounds okay if SQLite stores the correct
multibyte ISO-8859-1 characters (min. 1, max. 4).
Yes.

I would have a preference for innerHTML, but okay.

I try to use innerHTML only when I have html to put in.

The response is sent as a percent-encoded multi-byte sequence.

- Say the following character is sent:
http://ja.wikipedia.org/wiki/語
- Corresponds in UTF-8 to the following three ISO-8859-1 characters:
èªž (e with a grave accent + ordinal indicator + z with caron)
- Becomes after percent-encoding:
%E8%AA%9E
It is this value that is received after the AJAX-request.

Thanks - that was the info I sought. Since then, it's been suggested
that I put out a header to start the response, with charset utf-8. I did
this and it solved my problem.

Could you give a URL ?

I would, but there isn't one really. It's wherever you happen to be
running it. It's not a classic browser <-> Internet <-> apache <->
website scenario. In my case, browser, instance of apache, scripts, and
SQLite are all on the same machine.

Thanks for the feedback.

Bart Van der Donck · May 20, 2011

Tim said:
I try to use innerHTML only when I have html to put in.

Okay, but innerHTML is much more compatible. E.g. innerText doesn't
work here on my SeaMonkey and Firefox.

http://www.quirksmode.org/dom/w3c_html.html

Tim Streater · May 20, 2011

Bart Van der Donck said:
Okay, but innerHTML is much more compatible. E.g. innerText doesn't
work here on my SeaMonkey and Firefox.

http://www.quirksmode.org/dom/w3c_html.html

Errr, <furtle, furtle> did I say innerText? Sorry - I lied. I meant
textContent.

Stanimir Stamenkov · May 22, 2011

Fri, 20 May 2011 04:38:53 -0700 (PDT), /Bart Van der Donck/:

multibyte ISO-8859-1 characters (min. 1, max. 4).

What is this beast supposed to be? ISO-8859-1 defines 256
characters each one encoded using single byte (eight-bit).

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

Bart Van der Donck · May 23, 2011

Stanimir said:
Fri, 20 May 2011 04:38:53 -0700 (PDT), /Bart Van der Donck/:

What is this beast supposed to be? ISO-8859-1 defines 256
characters each one encoded using single byte (eight-bit).

Yes, it is those 256 characters that are combined to form a UTF-8
character, for example:

e -> e (1 byte)
e with a grave accent [*] -> Ã + ¨ (2 bytes)
e with a circumflex and acute accent [**] -> á + º + ¿ (3 bytes)
Han 024B62 [***] -> ð + ¤ + [SHY] + ¢ (4 bytes)

This article is in ISO-8859-1, hopefully the characters display
correctly on Usenet.

[*] http://en.wikipedia.org/wiki/è
[**] http://en.wiktionary.org/wiki/ế
[***] http://www.fileformat.info/info/unicode/char/24B62/
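The byte counts in the list above can be reproduced with TextEncoder. The sketch below prints the UTF-8 bytes as hexadecimal numbers rather than as characters of any 8-bit encoding; utf8Hex is a hypothetical helper name.

```javascript
// The four sample characters, written as escapes to avoid charset issues:
// e, e with grave, e with circumflex and acute, and Han U+24B62.
const samples = ["e", "\u00E8", "\u1EBF", "\u{24B62}"];

// utf8Hex is a hypothetical helper: UTF-8 bytes of a string as hex pairs.
function utf8Hex(str) {
  return Array.from(new TextEncoder().encode(str), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}

const table = samples.map((s) => ({
  codePoint: "U+" + s.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  utf8: utf8Hex(s),
  utf16Units: s.length, // what .length and charCodeAt() operate on
}));

console.log(table);
// utf8 column: "65", "C3 A8", "E1 BA BF", "F0 A4 AD A2"
```

Note that the last character occupies two UTF-16 code units in a JavaScript string, so charCodeAt() would report a surrogate pair for it while codePointAt(0) reports 0x24B62.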

Jukka K. Korpela · May 23, 2011

23.5.2011 10:56, Bart Van der Donck said:
Yes, it is those 256 characters that are combined to form a UTF-8
character

No, UTF-8 does not combine "ISO-8859-1 characters". It combines octets
(bytes). It is absurd to interpret the octets as characters according to
some 8-bit encoding, except for the trivial case that an octet in the
range 0...7F (hexadecimal) can be interpreted according to ASCII (or any
encoding that coincides with ASCII in that range), because UTF-8 was
designed that way.

for example:

e -> e (1 byte)

That's an odd and misleading way of saying that the letter "e" has the
same representation in UTF-8 as in ISO-8859-1 (or any ISO-8859-something
or...).

e with a grave accent [*] -> Ã + ¨ (2 bytes)

That's a completely wrong statement. The bytes that constitute the UTF-8
encoded form of "è" are just bytes, octets, numbers. The only meaningful
interpretation in the UTF-8 context is that only their combination has a
meaning as a character.

In error situations, when a program interprets UTF-8 data as e.g.
ISO-8859-1, the data that was meant to represent "è" is shown as "Ã¨".
But this is an error and surely implies no such idea that UTF-8 would
combine two or more characters to represent a character.
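The error situation described here, and the kind of "second conversion" mentioned at the start of the thread, can be imitated in a few lines. This is a sketch under one assumption: ISO-8859-1 decoding is simulated with String.fromCharCode, which works because ISO-8859-1 maps each byte value directly to the code point with the same number.

```javascript
// UTF-8 bytes for "è" (U+00E8): C3 A8.
const original = "\u00E8";
const bytes = new TextEncoder().encode(original);

// Misreading those bytes as ISO-8859-1 yields two unrelated characters.
const misread = String.fromCharCode(...bytes);
console.log(misread); // Ã¨

// Encoding the misread text as UTF-8 again doubles the damage:
// two bytes have become four.
const twice = new TextEncoder().encode(misread);
console.log(Array.from(twice, (b) => b.toString(16)).join(" ")); // c3 83 c2 a8

// Reading the original bytes as UTF-8 gives the intended character.
const correct = new TextDecoder("utf-8").decode(bytes);
console.log(correct === original); // true
```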

Bart Van der Donck · May 23, 2011

Jukka said:
That's an odd and misleading way of saying that the letter "e" has the
same representation in UTF-8 as in ISO-8859-1 (or any ISO-8859-something
or...).

Well, both UTF-8 and ISO-8859-X tie code point 0x65 to 'e'. Thus it
makes sense to me to state that the representation of 0x65 is
identical here (frankly, I'd be surprised to see a workable charset
that doesn't tie it to 'e').

e with a grave accent [*] -> Ã + ¨ (2 bytes)

That's a completely wrong statement. The bytes that constitute the UTF-8
encoded form of "è" are just bytes, octets, numbers. The only meaningful
interpretation in the UTF-8 context is that only their combination has a
meaning as a character.

I would say it's an educatively valuable way of looking at UTF-8, and
I believe this view is more common than you think. See e.g. the
erroneous UTF-8 displays which you mention, the percent-encoding in
URLs, HEX, etc. IMHO this view helps a lot to get a clear idea of how
UTF-8 works.

In error situations, when a program interprets UTF-8 data as e.g.
ISO-8859-1, the data that was meant to represent "è" is shown as "Ã¨".
But this is an error and surely implies no such idea that UTF-8 would
combine two or more characters to represent a character.

Yes.

Stanimir Stamenkov · May 23, 2011

Mon, 23 May 2011 02:06:34 -0700 (PDT), /Bart Van der Donck/:

Well, both UTF-8 and ISO-8859-X tie code point 0x65 to 'e'. Thus it
makes sense to me to state that the representation of 0x65 is
identical here (frankly, I'd be surprised to see a workable charset
that doesn't tie it to 'e').

You've said quite a different thing: "ISO-8859-1 characters are
combined to form a UTF-8 character" - this is wrong. The fact some
UTF-8 byte sequence could be decoded successfully using some single
byte encoding, resulting in some totally unrelated characters, is a
coincidence.

e with a grave accent [*] -> Ã + ¨ (2 bytes)

That's a completely wrong statement. The bytes that constitute the UTF-8
encoded form of "è" are just bytes, octets, numbers. The only meaningful
interpretation in the UTF-8 context is that only their combination has a
meaning as a character.

I would say it's an educatively valuable way of looking at UTF-8, and
I believe this view is more common than you think. See e.g. the
erroneous UTF-8 displays which you mention, the percent-encoding in
URLs, HEX, etc. IMHO this view helps a lot to get a clear idea of how
UTF-8 works.

Well, I think you're just misleading people who don't understand
text encoding well, and so you're making the problem of people not
understanding text encoding worse.
