Server.HTMLEncode with UTF-8

Marco Miltenburg · Sep 15, 2006

While working on some multilingual code I found a rather strange thing
happening with Server.HTMLEncode.

While loading different languages I change the Codepage and Charset in
ASP to reflect the language. This all works fine. However, when I tried
to use Charset UTF-8 with Codepage 65001 everywhere, I found that
HTMLEncode always translates all non-ASCII characters to &#xxxx sequences.

Example:

Response.Charset = "shift_jis"
Response.Codepage = 932
Response.Write "Some Japanese Text"
Response.Write Server.HTMLEncode("Some Japanese Text")

Both Write actions output a character string in Shift_JIS, no UTF-8,
no &#xxxx sequences. Just fine and as it should be.

But when I do this:

Response.Charset = "utf-8"
Response.Codepage = 65001
Response.Write "Some Japanese Text"
Response.Write Server.HTMLEncode("Some Japanese Text")

The first Write outputs a UTF-8 character string but the second Write
outputs a string encoded into &#xxxx sequences.

Why is that?

Grtz,
Marco

Anthony Jones · Sep 15, 2006

Whilst all string handling in script is done in Unicode, the script itself
can't be encoded in Unicode. It is possible to run a script encoded as UTF-8
simply because all keywords, operators, etc. are within the ASCII character
set and are therefore identical when encoded as UTF-8. However, string
literals in the code will be treated as single-byte ANSI characters despite
having been encoded as UTF-8.
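
To make the mis-decoding described above concrete, here is a minimal sketch in Python rather than ASP. It does not reproduce Server.HTMLEncode itself; the helper names (misread_as_ansi, to_numeric_entities) are made up for illustration, and Windows-1252 is only assumed as a stand-in for whatever ANSI code page the server uses. It shows how the bytes of a UTF-8 literal turn into different characters when read one byte at a time, and what &#xxxx style output looks like for non-ASCII text.

```python
import html


def misread_as_ansi(text: str, ansi_codepage: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes as a single-byte code page.

    This imitates a reader that treats a UTF-8 string literal as ANSI text.
    Bytes with no mapping in the code page become U+FFFD.
    """
    return text.encode("utf-8").decode(ansi_codepage, errors="replace")


def to_numeric_entities(text: str) -> str:
    """Escape HTML markup characters, then write every non-ASCII character
    as a decimal numeric character reference (&#NNNN;)."""
    escaped = html.escape(text)
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


original = "日本"                      # two characters, three UTF-8 bytes each
garbled = misread_as_ansi(original)    # six unrelated single-byte characters

print(len(original), len(garbled))
print(to_numeric_entities(original))   # &#26085;&#26412;
print(to_numeric_entities(garbled))
```

The last line prints six entities instead of two, because each UTF-8 byte has been treated as a character of its own.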

In the real world, where the string being encoded by HTMLEncode has been
retrieved from, say, a database, this problem wouldn't occur. If you need
string literals in multi-language output, you will need to store them
somewhere else.

Anthony.
