Server.HTMLEncode with UTF-8


Marco Miltenburg

While working on some multilingual code I found a rather strange thing
happening with Server.HTMLEncode.

While loading different languages I change the Codepage and Charset in
ASP to reflect the language. This all works fine. However, when I tried
to use Charset UTF-8 with Codepage 65001 everywhere, I found that
HTMLEncode always translates all non-ASCII characters to &#xxxx
sequences.

Example:

Response.Charset = "shift_jis"
Response.Codepage = 932
Response.Write "Some Japanese Text"
Response.Write Server.HTMLEncode("Some Japanese Text")

Both Write actions output a character string in Shift_JIS, no UTF-8,
no &#xxxx sequences. Just fine and as it should be.

But when I do this:

Response.Charset = "utf-8"
Response.Codepage = 65001
Response.Write "Some Japanese Text"
Response.Write Server.HTMLEncode("Some Japanese Text")

The first Write outputs a UTF-8 character string, but the second Write
outputs a string encoded into &#xxxx sequences.

Why is that?

Grtz,
Marco
 

Anthony Jones

Whilst all string handling in script is done in Unicode, the script itself
can't be encoded in Unicode. It is possible to run a script encoded as UTF-8
simply because all keywords, operators, etc. are within the ASCII character
set and are therefore identical when encoded as UTF-8. However, string
literals in the code will be treated as single-byte ANSI characters despite
having been encoded as UTF-8.
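
To make the byte-versus-character mismatch concrete, here is a minimal sketch in Python rather than ASP. It only illustrates the general effect and does not reproduce what Server.HTMLEncode does internally. The assumptions are mine: Windows-1252 stands in for the server's ANSI code page (the real one depends on the server locale), "日本語" stands in for "Some Japanese Text", and to_numeric_refs is a hypothetical helper.

```python
# Illustration only (Python, not ASP). Assumption: the server's ANSI code
# page is Windows-1252; the real one depends on the server locale.

text = "日本語"                       # the literal as typed in the editor
utf8_bytes = text.encode("utf-8")     # the bytes saved in a UTF-8 file

# Reading those bytes as single-byte ANSI turns every byte into its own
# character, so 3 Japanese characters become 9 unrelated ones.
misread = utf8_bytes.decode("cp1252")


def to_numeric_refs(s):
    """Replace each non-ASCII character with a decimal &#N; reference."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in s)


print(len(text), len(misread))        # 3 9
print(to_numeric_refs(text))          # &#26085;&#26412;&#35486;
print(to_numeric_refs(misread))       # nine references, one per misread byte
```

If the literal is misread this way before it is encoded, the &#xxxx sequences will describe the misread characters rather than the Japanese text, so it is worth comparing the numbers in the page source against the code points you expect.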

In the real world, where the string being encoded by HTMLEncode has been
retrieved from, say, a database, this problem wouldn't occur. If you need
string literals in multi-language output, you will need to store them
somewhere else.

Anthony.
 
