escape() behaves differently with different charsets?

Scott Matthews

I've recently come upon an odd Javascript (and/or browser) behavior,
and after hunting around the Web I still can't seem to find an answer.

Specifically, I have noticed that the Javascript escape() function
behaves differently if a codepage has been set.

For example:
<script>
// note: the literal below should be five accented characters
document.write(escape('Ôèëìè'));
</script>

Produces: %D4%E8%EB%EC%E8

But setting the codepage to Windows-1251:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Windows-1251">
<script>
document.write(escape('Ôèëìè'));
</script>

Produces: %u0424%u0438%u043B%u043C%u0438

Personally, I wouldn't expect the Javascript escape() function to
change its behavior if the codepage has been changed.

Might you know of any resources that can help me better understand
what's happening there?

Many thanks!
Scott
 
Lasse Reichstein Nielsen

I've recently come upon an odd Javascript (and/or browser) behavior,
and after hunting around the Web I still can't seem to find an answer.
Specifically, I have noticed that the Javascript escape() function
behaves differently if a codepage has been set.
For example:
<script>
document.write(escape('Ôèëìè'));
(note: that should be five accented characters)

It is five accented characters, because your message is encoded as
ISO-8859-1, and, e.g., the first character (byte value 212) is
O-circumflex in ISO-8859-1. It also has Unicode codepoint 212,
since Unicode agrees with ISO-8859-1 on values below 256.
</script>

Produces: %D4%E8%EB%EC%E8

Where D4 is 212 in hex, so as expected.
But setting the codepage to Windows-1251:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Windows-1251">
<script>
document.write(escape('Ôèëìè'));

Now, this *script* is interpreted as Windows-1251 characters, including
the literal string. The first character of that string is the byte 212,
which in Windows-1251 is the Cyrillic capital letter EF. Since Javascript
uses Unicode for strings, the first character of the string value becomes
Cyrillic EF, which has Unicode code-point 1060.
</script>

Produces: %u0424%u0438%u043B%u043C%u0438

Here 0424 is hex for 1060, as expected.
(can be checked using 'parseInt("0424",16)')
Personally, I wouldn't expect the Javascript escape() function to
change its behavior if the codepage has been changed.

It doesn't. What changes is the interpretation of the string literal.
Try changing the write to
document.write('Ô'.charCodeAt(0));
or even better
document.write('Ôèëìè');
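
To see that escape() itself is not affected by the page's charset, one can take the string literal out of the picture. The sketch below writes both characters as \u escapes, which mean the same thing however the file's bytes are decoded. The variable names are only illustrative, and it assumes an environment where the legacy escape() function is available (browsers and Node.js both provide it).

```javascript
// Write the characters as \u escapes so the result does not depend on
// how the bytes of the source file are decoded.
const latin = '\u00D4';     // O-circumflex, code point 212
const cyrillic = '\u0424';  // Cyrillic capital EF, code point 1060

const latinCode = latin.charCodeAt(0);
const cyrillicCode = cyrillic.charCodeAt(0);

// Same function, different input values, hence different output forms.
const latinEscaped = escape(latin);
const cyrillicEscaped = escape(cyrillic);

console.log(latinCode, latinEscaped);       // 212 %D4
console.log(cyrillicCode, cyrillicEscaped); // 1060 %u0424
console.log(parseInt('0424', 16));          // 1060
```

With the literal typed directly, which of the two strings you get is decided by the charset the browser uses to read the page, before escape() ever runs.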
Might you know of any resources that can help me better understand
what's happening there?

No resources, sorry. But remember that when you assign an encoding
that is different from the one used by your editor, you can't trust
the characters you see. WYSI-not-WYG!

You should learn what a codepage really does. A codepage represents a
set of (up to) 256 different characters (or code points), like capital
Roman letter A, Arabic numeral 4, Roman letter O with circumflex accent,
Cyrillic capital EF, or Chinese glyph whatnot. Those are the only
characters that can be written using that codepage. It also defines a
map from 8-bit bytes to those characters. Different code pages can
assign different code points to the same byte, as ISO-8859-1 and
Windows-1251 do to the byte 212.

Javascript converts all strings to 16-bit Unicode internally, so it
doesn't need to know about code pages after the page has loaded.
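
As a rough illustration of "same bytes, different code points", the sketch below decodes the five bytes of the example literal under both code pages. The function names are made up for this example, and the Windows-1251 decoder is deliberately partial: it only handles bytes 0xC0-0xFF, which that code page maps to the Cyrillic letters U+0410-U+044F. In ISO-8859-1 every byte value is simply its own Unicode code point.

```javascript
// The five bytes of the literal as an editor saving single-byte text
// would store them.
const bytes = [0xD4, 0xE8, 0xEB, 0xEC, 0xE8];

// ISO-8859-1: the byte value is the Unicode code point.
function decodeLatin1(bs) {
  return bs.map(b => String.fromCharCode(b)).join('');
}

// Windows-1251, letters only (hypothetical partial decoder):
// bytes 0xC0-0xFF map to U+0410-U+044F.
function decodeCp1251Letters(bs) {
  return bs.map(b => {
    if (b < 0xC0 || b > 0xFF) {
      throw new RangeError('byte outside the range this sketch handles');
    }
    return String.fromCharCode(b - 0xC0 + 0x0410);
  }).join('');
}

const asLatin1 = decodeLatin1(bytes);
const asCp1251 = decodeCp1251Letters(bytes);

console.log(escape(asLatin1)); // %D4%E8%EB%EC%E8
console.log(escape(asCp1251)); // %u0424%u0438%u043B%u043C%u0438
```

The two results match the two outputs reported in the original question, which is what one would expect if the only thing that changed was the byte-to-character map.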


Unicode:
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-lat1.htm>
<URL:http://www.voltaire.ox.ac.uk/x_voltfnd/etc/e-texts/www_xtechs/iso_unicodes/iso-cyr1.htm>

Codepage 1251 is "Cyrillic (Windows)"
<URL:http://longhorn.msdn.microsoft.com/lhsdk/ref/ns/system.text/c/encoding/p/codepage.aspx>

/L
 
Scott Matthews

Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
encoded string?

Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.

The action sets a window.location to the value of that form field --
when I'm in Windows-1251, I get a 404 but in ISO-8859-1 everything
works.

I appreciate your thoughts on how best to remedy this!

Thanks again,
Scott
 
Lasse Reichstein Nielsen

Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

That is because it is encoding different values. In the Latin-1 code
page, your string contains the Unicode character with code point 212.
It is escaped as %D4, because that is how 212 is written in hex.

In the Windows-1251 (Cyrillic) codepage, the string contains the Unicode
character with code point 1060. Since that can't be represented as a
two-digit hex number, escape uses the longer four-digit encoding:
%u0424
In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
encoded string?

It could, but it doesn't have to, since two hex digits are sufficient.
It optimizes and uses the shorter representation. It could have
generated %u00D4 instead, but that would be three bytes wasted.
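
The rule described here can be sketched in a few lines. This is only a hand-written approximation of what escape() does, not the browser's actual code; the name sketchEscape and the constant UNESCAPED are invented for the example. It assumes the usual set of characters that escape() leaves alone (letters, digits, and @*_+-./).

```javascript
// Characters that escape() passes through unchanged.
const UNESCAPED =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./';

// Hypothetical re-implementation of escape(), for illustration only.
function sketchEscape(str) {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const ch = str.charAt(i);
    const unit = str.charCodeAt(i); // 16-bit code unit, 0..65535
    if (UNESCAPED.includes(ch)) {
      out += ch;
    } else if (unit < 256) {
      // Fits in two hex digits: short form %XX.
      out += '%' + unit.toString(16).toUpperCase().padStart(2, '0');
    } else {
      // Needs more than two hex digits: long form %uXXXX.
      out += '%u' + unit.toString(16).toUpperCase().padStart(4, '0');
    }
  }
  return out;
}

console.log(sketchEscape('\u00D4')); // %D4
console.log(sketchEscape('\u0424')); // %u0424
```

The choice between the two forms is made per character, from its numeric value alone, so a single string can mix %XX and %uXXXX sequences.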
Here's my situation: I have a FORM that asks for a URL as input. The
page that the FORM sits on is available in a few languages, and so it
can include a few different codepages.

Whee! Inputs and codepages. I believe there is something tricky about
that, but I don't know it. If the way the input is interpreted by the
browser is not the way it is intended by the operating system (I press
the Cyrillic EF key, browser writes an O-circumflex), then something
is bound to go wrong (or you might say that it already is).

I am afraid it is probably browser *and* operating system dependent.

/L
 
Paul Gorodyansky

Scott said:
Thanks for your reply, please permit me to follow-up...

I don't seem to understand why Javascript's escape() gives a %XX
two-char hex encoded string when the codepage is at the default
ISO-8859-1, but instead gives a %uXXXX four-char hex Unicode encoded
string when the codepage is set to Windows-1251.

In other words, as I read your explanation, shouldn't I expect the
ISO-8859-1 escape() to also produce a %uXXXX four-char hex Unicode
encoded string?

We had - 2 years ago - the same situation but with Japanese and
Chinese :) (my company does not support Cyrillic yet, but supports
Western European languages and Far East ones) -
and had exactly the same question!

Thanks, Lasse, your guess finally makes some sense (we were lost):
It could, but it doesn't have to, since two hex digits are sufficient.
It optimizes and uses the shorter representation. It could have
generated %u00D4 instead, but that would be three bytes wasted.

So Scott, when our server-side software receives data from a form,
we have an IF-ELSE there!

That is, if it's Western (windows-1252 or iso-8859-1) we use
URLDecoding1() that assumes %XX format
Otherwise, we use URLDecoding2() that assumes %uXXXX format.
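
For comparison, here is a sketch of a single decoder that accepts both forms in one pass, written in JavaScript only to keep the thread in one language. The name decodeEscaped is hypothetical and is not one of the URLDecoding functions mentioned above. It assumes the input was produced by escape(), so %XX stands for a code point below 256 and not for a byte of some multi-byte encoding.

```javascript
// Hypothetical decoder for escape()-style strings: handles %uXXXX and %XX.
function decodeEscaped(input) {
  return input.replace(
    /%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/g,
    // Exactly one of the two capture groups is set for each match.
    (match, four, two) => String.fromCharCode(parseInt(four || two, 16))
  );
}

console.log(decodeEscaped('%D4%E8%EB%EC%E8') === '\u00D4\u00E8\u00EB\u00EC\u00E8'); // true
console.log(decodeEscaped('%u0424%u0438%u043B%u043C%u0438') === '\u0424\u0438\u043B\u043C\u0438'); // true
```

Because each sequence carries its own marker (%u or plain %), a decoder of this kind does not need to know the page encoding in order to pick a format. Whether the decoded characters are the ones the user meant is a separate question, which still depends on how the browser read the page.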

We _always_ know, at the server side, what the encoding is:
when we send a page to a browser in the first place, creating the
HTTP header with "...charset=..." in it, we store that value on the server
side. Or, in some cases, we create the page in such a way that
the form has a hidden field that contains the encoding name, so when
data is sent from the form to the server, one of the fields will
tell the server-side software what the encoding is.

As for languages/encodings and form input, that's not really the
subject of this topic (here we assume, as most apps do, that the data
coming from a form is in the same encoding as the page itself).
You can read about it here:

http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
 
Paul Gorodyansky

Lasse said:
...
I am afraid it is probably browser *and* operating system dependent.

Right. When we first ran into this issue (2+ years ago), we found out
that only Internet Explorer creates either %XX or %uXXXX based on the
encoding, while Netscape 4.0 does not: JavaScript in it always converts
to the %XX form.

I don't know how JavaScript in Netscape 7/Mozilla works in this case.
We do use them now, but I did not ask the guys...
 
