VK wrote:
Or (which is really the best) convert all string literals into \uFFFF
form.
A UTF-8 file, even when it is not served with the UTF-8 charset (via
the Content-Type HTTP header, or declared in the charset attribute),
will be 'understood' by the JS engine if it is written using only
US-ASCII characters.
Most of the charsets registered with IANA are backward compatible with
US-ASCII as far as letters are concerned (so excluding - _ # and other
such characters), and UTF-8 was specifically designed to be backward
compatible with it.
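
A minimal sketch of that compatibility, assuming an environment with a
global TextEncoder (modern browsers, recent Node.js); the helper name
utf8Bytes is only illustrative. ASCII-only source takes exactly one
byte per character in UTF-8, while a national character takes more:

```javascript
// Illustrative helper (not from any library): the UTF-8 bytes of a string.
// Assumes a global TextEncoder is available.
function utf8Bytes(text) {
  return Array.from(new TextEncoder().encode(text));
}

// ASCII-only code: one byte per character, same values as in US-ASCII.
var asciiBytes = utf8Bytes("var x = 1;");

// U+0105 (a with ogonek) needs two bytes in UTF-8: 0xC4 0x85.
var polishBytes = utf8Bytes("\u0105");
```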
The problem arises when you want to include string literals (e.g.
generated by a CGI script designed for i18n) that must be properly
understood by the JS engine. I remember that since JavaScript 1.3
strings are represented in Unicode, and each character of a string is
2 bytes long.
/* from ECMA */
<quote>
4.3.16 String value
A string value is a member of the type String and is the set of all
finite ordered sequences of zero or more
Unicode characters.
</quote>
Later we can read:
<quote>
However, it is possible to represent every ECMAScript program using
only ASCII characters (which are equivalent to
the first 128 Unicode characters). Non-ASCII Unicode characters may
appear only within comments and string literals.
In string literals, any Unicode character may also be expressed as a
Unicode escape sequence consisting of six ASCII
characters, namely \u plus four hexadecimal digits. Within a comment,
such an escape sequence is effectively ignored
as part of the comment. Within a string literal, the Unicode escape
sequence contributes one character to the string
value of the literal.
</quote>
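
The escape form described in the quote can be produced mechanically.
Here is a rough sketch; the function name toUnicodeEscapes is my own
invention, and it only handles non-ASCII characters (quotes and
backslashes would still need their usual escaping before the result is
placed inside a literal):

```javascript
// Hypothetical helper: rewrite every non-ASCII 16-bit code unit as \uXXXX
// so the resulting text contains ASCII characters only.
function toUnicodeEscapes(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code > 0x7F) {
      // Pad to the four hexadecimal digits that the \u escape requires.
      var hex = code.toString(16).toUpperCase();
      out += "\\u" + "0000".slice(hex.length) + hex;
    } else {
      out += text.charAt(i);
    }
  }
  return out;
}
```

A server-side generator could apply the same transformation in its own
language before writing the literal into the page.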
Reading more carefully, we can be sure that what ECMA means by a
Unicode character is a 16-bit unsigned integer, so the encoding is
probably a UTF-16 derivative (see String.prototype.charCodeAt(pos)).
Another very interesting aspect is that String.prototype.toLowerCase
and the related methods are based on the canonical Unicode 2.0 case
mapping, so they are fully i18n.
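
A small illustration of both points; the variable names are only for
this example:

```javascript
// U+0104 (capital A with ogonek) followed by a plain "B".
var s = "\u0104B";

// charCodeAt returns one 16-bit value per character, not a byte.
var firstUnit = s.charCodeAt(0);

// Case mapping works on non-ASCII letters too: U+0104 maps to U+0105.
var lowered = s.toLowerCase();
```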
Personally I think that the browser treats JavaScript code as a
one-byte-per-character stream, so the charset isn't a big issue. UTF-8
can be used when writing JavaScript code: as long as we use only
US-ASCII characters we can be sure of the one-byte-per-character
relationship, and in comments we can use national characters for
documentation purposes (for programs that generate documentation, such
as JSDoc http://jsdoc.sourceforge.net/).
Secondly, I think that even if we somehow copy and paste UTF-16 text
into string contents (that is, between the " " characters), it will be
interpreted incorrectly: the JS engine will assume that the fragment
uses one byte per character (and not 2, as in UTF-16).
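
To show what such a misreading would look like, here is a sketch that
simulates it. It assumes little-endian UTF-16, and the function name
readUtf16AsSingleBytes is made up for this example. It only illustrates
the scenario I describe; it says nothing about what a particular
browser really does:

```javascript
// Hypothetical simulation: split each 16-bit code unit into its two bytes
// (little-endian) and read every byte back as a separate character.
function readUtf16AsSingleBytes(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    out += String.fromCharCode(unit & 0xFF);        // low byte first
    out += String.fromCharCode((unit >> 8) & 0xFF); // then the high byte
  }
  return out;
}

// "A" followed by U+0105 turns into four one-byte "characters".
var garbled = readUtf16AsSingleBytes("A\u0105");
```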
Anyway, I think older browsers weren't Unicode aware, so using Unicode
in code is not a good idea when dealing with older browsers (the one
exception is UTF-8), even when the charset in the <script> element is
UTF-16 and the Content-Type HTTP header has a charset parameter.
The simple answer is:
- use the UTF-8 charset, and write code using only ASCII characters
(the 128 characters of US-ASCII); comments may be written in your
native language for documentation generation purposes (e.g. JSDoc);
When following this suggestion, I think the charset attribute may be
omitted, but it is wise to include one set to UTF-8.
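
A short sketch of a file written this way, assuming it is saved as
UTF-8; the function greet is only an example. The doc comment is in
Polish, while the code itself stays ASCII-only thanks to \u escapes:

```javascript
/**
 * Zwraca powitanie dla użytkownika.
 * (Polish for: "Returns a greeting for the user.")
 * @param {string} name
 * @return {string}
 */
function greet(name) {
  // \u015B and \u0107 are the two Polish letters at the end of the greeting
  // word, written as escapes so that no non-ASCII byte appears in the code.
  return "Cze\u015B\u0107, " + name + "!";
}
```

Even if the file were decoded with the wrong charset, only the comment
would be garbled, because every byte of the code itself is plain ASCII.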
What do you think about this?
Best regards.
Luke.