Some interesting aspects of injecting scripts on a page...

Luke Matuszewski

Hi!
A simple question (though it may not have a simple answer):
When I put a script on a page, I can either inline it in a
<script> element or load it via the src attribute, like this (the second way):
<script type="text/javascript" src="js/crossBrowserDegradationLib.js">
</script>
My question is simple:
What encoding should I use when writing my
crossBrowserDegradationLib.js?
a) Should it be the same encoding as that of my web page (the one
declared in the meta tags)?
b) Should it be something else (like US-ASCII)?

Please reply :)
 
Martin Honnen

Luke Matuszewski wrote:

<script type="text/javascript" src="js/crossBrowserDegradationLib.js">
</script>
My question is simple:
What encoding should I use when writing my
crossBrowserDegradationLib.js?
a) Should it be the same encoding as that of my web page (the one
declared in the meta tags)?
b) Should it be something else (like US-ASCII)?

Note that the HTML 4 specification defines an attribute named charset
for the script element
<http://www.w3.org/TR/html4/interact/scripts.html#h-18.2.1>
so you can always do
<script type="text/javascript"
src="file.js"
charset="putCharacterEncodingNameOfFile.jsHere"></script>
to let the browser know the encoding.
And of course you can set up your server to send files with an HTTP
Content-Type header that includes a charset parameter.

So the encoding of the script file is your choice, or the choice of the
script file's author. It can differ from the encoding of the HTML
document; the only important thing is to communicate the encoding to the
browser as explained above.
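
As a rough illustration of the two mechanisms mentioned here (the charset attribute and the Content-Type header), the following minimal sketch only builds the strings involved. The helper names scriptTag and scriptContentType are hypothetical and exist only for this example; how a header is actually configured depends on the server in use.

```javascript
// Hypothetical helper: the <script> tag that declares the file's encoding
// through the HTML 4 charset attribute.
function scriptTag(src, charset) {
  return '<script type="text/javascript" src="' + src +
         '" charset="' + charset + '"><\/script>';
}

// Hypothetical helper: the Content-Type header value a server could send
// for the same file, with a matching charset parameter.
function scriptContentType(charset) {
  return "text/javascript; charset=" + charset;
}

var tag = scriptTag("file.js", "UTF-8");
var header = scriptContentType("UTF-8");
```

The point of keeping both values in one place is that the declared charset and the encoding the file was actually saved in have to agree.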
 
VK

Luke said:
Hi!
A simple question (though it may not have a simple answer):
When I put a script on a page, I can either inline it in a
<script> element or load it via the src attribute, like this (the second way):
<script type="text/javascript" src="js/crossBrowserDegradationLib.js">
</script>
My question is simple:
What encoding should I use when writing my
crossBrowserDegradationLib.js?
a) Should it be the same encoding as that of my web page (the one
declared in the meta tags)?
b) Should it be something else (like US-ASCII)?

A rather strange question... I guess you should start your education with
the Books of ECMA then ;-)
Although they may sometimes be wrong or foggy when answering
questions of the modern world, for basic things like this one they
are still very helpful. In particular, I suggest you read the
Book of Source, Chapter 1 and further:
<http://developer.mozilla.org/js/specs/ecma-262#a-6>

Concerning your previous questions about constructors and objects: it
seems (I may be wrong) that you are trying to explain JavaScript
exclusively from within JavaScript itself.
I invite you then to take a look at the qualifiers "public", "protected",
"private", "static" and "final" in other OOP languages (not necessarily
Java!) and try to find their analogs or emulations in JavaScript. It
helps a lot. :)

About the "new" question (namely, the ability to create new objects): it
was first tested in the last update of Netscape 2.0 and it was
officially introduced in Netscape 3.0.

It is interesting to notice that in the original release of Netscape
3.0 you couldn't assign an object reference to a variable which was
initialized with a primitive:
var foo = "bar";
foo = new Image(60,60);
would lead to "Argument error".
var foo = null;
foo = new Image(60,60);
would go fine.

AFAIK up to now it was the first and the last attempt at any typing
in JavaScript. While looking through ancient code you may notice that
future reference holders are initialized with null.
 
Luke Matuszewski

Martin said:
Note that the HTML 4 specification defines an attribute named charset
for the script element
<http://www.w3.org/TR/html4/interact/scripts.html#h-18.2.1>
so you can always do
<script type="text/javascript"
src="file.js"
charset="putCharacterEncodingNameOfFile.jsHere"></script>
So the encoding for the script file is your choice or the choice of the
script file author, it can differ from the encoding of the HTML
document, the only important thing is to communicate the encoding to the
browser as explained above.

In the specs world I agree :) but 'charset' is an optional attribute. If
it is omitted, which encoding will the browser use to decode the script?
(Let us assume that Content-Type isn't sent; and even if it is sent,
I know (from some RFC) that the default charset for web servers is
ISO-8859-1)...
Finally I realized that a 'normal' server sends the Content-Type and thus
the charset set to ISO-8859-1 (see
http://www.cert.org/tech_tips/malicious_code_mitigation.html#3)...

So typically, if we omit the charset attribute, the standard ISO-8859-1
is used for decoding :)

Best regards for open discussion.
Luke.
 
Luke Matuszewski

<http://developer.mozilla.org/js/specs/ecma-262#a-6>

Yep, I knew this one (I have read that chapter)... so should all JavaScript
developers use Unicode when writing a script? I think the answer is no...

About "new" question (namely - about ability to create new objects). It
was first tested on the last update on Netscape 2.0 and it was
officially introduced in Netscape 3.0.

It is interesting to notice that in the original release of Netscape
3.0 you couldn't assign an object reference to a variable which was
initialized with a primitive:
var foo = "bar";
foo = new Image(60,60);
would lead to "Argument error".
var foo = null;
foo = new Image(60,60);
would go fine.

AFAIK up to now it was the first and the last attempt at any typing
in JavaScript. While looking through ancient code you may notice that
future reference holders are initialized with null.

These are valuable 'historical' notes :) Thanks for that.
Best regards.
 
Thomas 'PointedEars' Lahn

Luke said:
Martin said:
[encoding is specified by the `charset' attribute of the `script'
element]

In the specs world I agree :) but 'charset' is an optional attribute. If
it is omitted, which encoding will the browser use to decode the script?
(Let us assume that Content-Type isn't sent; and even if it is sent,
I know (from some RFC) that the default charset for web servers is
ISO-8859-1)...
Finally I realized that a 'normal' server sends the Content-Type and thus
the charset set to ISO-8859-1 (see
http://www.cert.org/tech_tips/malicious_code_mitigation.html#3)...

So typically, if we omit the charset attribute, the standard ISO-8859-1
is used for decoding :)

There is no "typically" on the Web.

Best regards for open discussion.

I see no argument or question that requires one. Surely you do not want the
participants of that discussion to indulge in making educated guesses
as to which implementations of perverse twisted versions of real standards
server and browser vendors are capable of.


PointedEars
 
Luke Matuszewski

Thomas said:
I see no argument or question that requires one. Surely you do not want the
participants of that discussion to indulge in making educated guesses
as to which implementations of perverse twisted versions of real standards
server and browser vendors are capable of.

Yes, quite right. My questions and posts here are quite academic... and I
think that in some matters they should be...

PS: I would like to read more about JavaScript, so if someone has a good
source I would appreciate a pointer (I have read most of the FAQs and
faq_notes on jibbering).

Best regards.
Luke.
 
VK

Luke said:
so should all JavaScript developers
use Unicode when writing a script? I think the answer is no...

Of course not. First we have to define what "code" means.

Any code (not JavaScript only) consists of

[1] Tokens (reserved words, math signs etc.)
These are low ASCII only (the first 128 chars). You cannot use any other
chars in your code, otherwise it will lead to a syntax error. If you
want to use an extended encoding, for example, to give Japanese names
to your identifiers, you have to use Unicode escape sequences for that:

var \uFA67\uFA68 = "foobar";

So the encoding problem is not applicable to the program code itself -
I mean there is no such problem.

[2] String literals
alert("Imagine This Text Is In French");
What happens with a script containing this alert box with text in
French?
Here we have to remember that the script doesn't go "to the page". It
goes *to the browser*. And the browser is a rather smart guy and it
knows a lot of things. In particular, it has also read The Book Of
Source, Chapter 6. So it knows that this <script> stuff wants
everything in Unicode, so it converts it to Unicode first if needed. A
problem may arise if the current encoding is not UTF. In this case the
browser first needs to map the chars to fit them into some Unicode
table. While deciding in what table to place this newly arrived stuff,
it looks for a hint first at the current page encoding or (if there is
none) at the system default encoding.
Naturally, if a French string literal came with a German charset
declaration on a Chinese computer, there could be complications. The
script will still work, but the visitor will see abracadabra in the
alert box. But as you can see, you really have to *specially* apply
yourself for that.
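
To make the garbled alert box concrete, here is a minimal sketch of what a charset mismatch does to one character. It assumes a Node.js environment (the Buffer API is Node-specific and is not available in browsers), and it only imitates the decoding step; it is not a description of how any particular browser works internally.

```javascript
// "é" (U+00E9) saved in a UTF-8 file becomes the two bytes 0xC3 0xA9.
var utf8Bytes = Buffer.from("\u00e9", "utf8");

// Decoding with the matching charset recovers the original character.
var correct = utf8Bytes.toString("utf8");

// Decoding the same bytes as ISO-8859-1 treats each byte as its own
// character, so one letter turns into two unrelated ones.
var garbled = utf8Bytes.toString("latin1");
```

The script itself still runs in both cases; only the text seen by the visitor differs.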

So the answer is:

You have to use only base ASCII in tokens, so interpreter simply
doesn't allow you to have any encoding problems.

If you are using only base ASCII characters in your string literals
also, there are no problems at all.

If you want to have string literals in some national language, make
sure that the declared page charset will match (not too much to ask, is
it?)

If you want to have string literals in some national language and
you're expecting your script to roam around the world, it is suggested
to save it in Unicode and include a readme file in the distribution
package explaining that this script should be served as UTF-8.
 
Thomas 'PointedEars' Lahn

VK said:
It is interesting to notice that in the original release of Netscape
3.0 you couldn't assign an object reference to a variable which was
initialized with a primitive:
var foo = "bar";
foo = new Image(60,60);
would lead to "Argument error".
var foo = null;
foo = new Image(60,60);
would go fine.

AFAIK up to now it was the first and the last attempt at any typing
in JavaScript. [...]

JavaScript 2.0 exists. The problem is that it exists merely on paper
(or in bytes, for that matter) and as a test implementation (Epimetheus).

<http://www.mozilla.org/js/language/>
<http://www.mozilla.org/js/language/js20/>
<http://www.mozilla.org/js/language/Epimetheus.html>

However, JScript 7.0 (.NET) implements some features of ECMAScript 4
(which is also still a Netscape proposal) supported by the Microsoft
.NET Framework 1.0, e.g. by ASP.NET, thus it supports strict typing.

<http://www.mozilla.org/js/language/es4/>
<http://msdn.microsoft.com/library/d...cript7/html/jslrfjscriptlanguagereference.asp>


PointedEars
 
VK

If you want to have string literals in some national language and
you're expecting your script to roam around the world, it is suggested
to save it in Unicode and include a readme file in the distribution
package explaining that this script should be served as UTF-8.

Or (which is really the best) convert all string literals into \uFFFF
form.
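
A minimal sketch of such a conversion is shown below. The function name escapeNonAscii is hypothetical, and the sketch handles only the text of a string: it does not add the surrounding quotes or escape backslashes and quote characters, which a real tool would also need to do.

```javascript
// Hypothetical helper: replace every non-ASCII UTF-16 code unit in `text`
// with its \uXXXX escape, leaving ASCII characters untouched.
function escapeNonAscii(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code < 0x80) {
      out += text.charAt(i);
    } else {
      var hex = code.toString(16).toUpperCase();
      while (hex.length < 4) {
        hex = "0" + hex;
      }
      out += "\\u" + hex;
    }
  }
  return out;
}

var escaped = escapeNonAscii("caf\u00e9");
```

The result contains only ASCII characters, so the file reads the same under any ASCII-compatible charset declaration.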
 
Thomas 'PointedEars' Lahn

VK said:
Luke said:
so should all JavaScript developers
use Unicode when writing a script? I think the answer is no...

Of course not. First we have to define what "code" means.

Any code (not JavaScript only) consists of

[1] Tokens (reserved words, math signs etc.)
These are low ASCII only (the first 128 chars). You cannot use any other
chars in your code, otherwise it will lead to a syntax error. If you
want to use an extended encoding, for example, to give Japanese names
to your identifiers, you have to use Unicode escape sequences for that:

var \uFA67\uFA68 = "foobar";

Utter nonsense.

So the encoding problem is not applicable to the program code itself -
I mean there is no such problem.

Of course Unicode _literals_ cannot be used outside of literals.
However, Unicode _characters_ can:

var _ü = "foobar";
alert(_ü); // yields `foobar'

But, since the specification explicitly states that all program characters
are UTF-16 encoded, I really do not know if there is a correct answer to
Luke's question. UTF-16, as each code unit is 16 bits, requires a minimum
of 16 bits per character, hence it is not downwards compatible to ASCII or
ISO-8859-* and I tend to say yes. What am I missing?


PointedEars
 
Luke Matuszewski

VK wrote:
Or (which is really the best) convert all string literals into \uFFFF
form.


A UTF-8 script, even when it is not served with a UTF-8 charset (via the
Content-Type HTTP header or the charset attribute), will be 'understood'
by the JS engine if it is written using only US-ASCII characters.
Most charsets registered with IANA are backward compatible with
US-ASCII, at least for the letters (leaving aside - _ # and other
such characters), and UTF-8 was specifically designed to be backward
compatible with it.
The problem arises when you want to include string literals (e.g. generated
by a CGI script designed for i18n) that must be properly understood by the JS
engine. I remember that from JavaScript 1.3 on, strings are represented
in Unicode, and each character of a string is 2 bytes long.
/* from ECMA */
<quote>
4.3.16 String value
A string value is a member of the type String and is the set of all
finite ordered sequences of zero or more
Unicode characters.
</quote>

later we can read:

<quote>
However, it is possible to represent every ECMAScript program using
only ASCII characters (which are equivalent to
the first 128 Unicode characters). Non-ASCII Unicode characters may
appear only within comments and string literals.
In string literals, any Unicode character may also be expressed as a
Unicode escape sequence consisting of six ASCII
characters, namely \u plus four hexadecimal digits. Within a comment,
such an escape sequence is effectively ignored
as part of the comment. Within a string literal, the Unicode escape
sequence contributes one character to the string
value of the literal.
</quote>

Reading more carefully, we can be sure that what ECMA means by a Unicode
character is a 16-bit unsigned integer, so it is probably a UTF-16
code unit (see String.prototype.charCodeAt(pos)). Another very interesting
aspect is that String.prototype.toLowerCase and related methods are based on
the canonical Unicode 2.0 case mapping, so they are fully i18n-aware.
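
A small sketch of both observations follows. The helper name codeUnits is hypothetical; charCodeAt and toLowerCase are the standard String methods mentioned above.

```javascript
// Hypothetical helper: list the 16-bit code units that make up a string.
function codeUnits(text) {
  var units = [];
  for (var i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i));
  }
  return units;
}

// "A" followed by the euro sign (U+20AC): two code units, each one an
// unsigned 16-bit integer.
var units = codeUnits("A\u20AC");

// Case mapping is not limited to ASCII: U+0104 (capital A with ogonek)
// lowercases to U+0105.
var lower = "\u0104".toLowerCase();
```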

Personally I think that the browser treats JavaScript code as a
one-byte-per-character stream, so the charset isn't a great issue. UTF-8
can be used when writing JavaScript code: when using only US-ASCII characters
we are sure that there is a one-byte-per-character relationship, and
in comments we can use national characters for documentation purposes
(for programs that generate documentation, like JSDoc
http://jsdoc.sourceforge.net/).

Secondly, I think that even if we somehow copy and paste
UTF-16 into string contents (that is, between the " " characters) it will be
improperly interpreted: the JS engine will assume that it is a fragment in
which one byte per character is used (and not 2 as in UTF-16).
Anyway, I think older browsers weren't Unicode aware, so using Unicode
in code is not a good idea when dealing with older browsers (the one
exception is UTF-8), even when the charset in the <script> element was UTF-16
and the Content-Type HTTP header had a charset parameter.

The simple answer is:
- use the UTF-8 charset and write code using ASCII chars (the 128
characters of the US-ASCII charset); comments may be written in your native
language for generation purposes (e.g. for JSDoc).
When following this suggestion, I think the charset attribute may be omitted,
but it is wise to include it with UTF-8 as its value.
What do you think about this?

Best regards.
Luke.
 
Luke Matuszewski

Another thing about Unicode is that it is compatible with US-ASCII,
but only UTF-8 has the same length as US-ASCII when talking only about
chars from US-ASCII.
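
The length claim can be checked with a short sketch. The function names utf8ByteLength and utf16ByteLength are hypothetical; the thresholds are the standard UTF-8 and UTF-16 ranges for a single Unicode code point.

```javascript
// Hypothetical helper: number of bytes UTF-8 needs for one code point.
function utf8ByteLength(codePoint) {
  if (codePoint < 0x80) return 1;     // US-ASCII range: one byte
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Hypothetical helper: number of bytes UTF-16 needs for one code point.
function utf16ByteLength(codePoint) {
  return codePoint < 0x10000 ? 2 : 4; // never fewer than two bytes
}

var asciiInUtf8 = utf8ByteLength(0x41);   // "A"
var asciiInUtf16 = utf16ByteLength(0x41); // "A"
```

So a file that contains only US-ASCII characters has the same bytes in UTF-8 as in US-ASCII, while the same text in UTF-16 is twice as long.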

BR
Luke.
 
Thomas 'PointedEars' Lahn

Luke said:
Another thing about Unicode is that it is compatible with US-ASCII,

Would you care to define "compatible"? In my book, it does not qualify as
being compatible since more than 90% of its characters and code points are
not part of US-ASCII.

but only UTF-8 has the same length as US-ASCII

If you mean by that a UTF-8 code unit had the same length as the US-ASCII
code unit: no, it has not. US-ASCII is a 7-bit code, code points 0x00 to
0x7F. UTF-7 (which exists) code units have a length of 7 bits, hence
UTF-7 is _downwards_ compatible to US-ASCII (meaning that every US-ASCII
character code sequence matches the UTF-7 encoding for code points of the
same characters in Unicode, U+0000 to U+007F) and is the only UTF (Unicode
Transformation Format) that has this particular property.


PointedEars
 
VK

Luke said:
Another thing about Unicode is that it is compatible with US-ASCII,
but only UTF-8 has the same length as US-ASCII when talking only about
chars from US-ASCII.

Luke,
Based on your comments I'm guessing that you take UTF-8 as a Unicode
format, and this is where the confusion arises.

UTF-8 is *not* a Unicode format. This is a transport encoding used to
deliver Unicode-16 characters from server to the user agent (like
base64 is used to deliver binary data over HTTP).
Namely these are 2 bytes representing a Unicode character prefixed by
special character (%A0 if I remember properly) to mark the beginning of
the Unicode sequence.

As a common agreement characters matching the base ASCII characters are
not prefixed by %A0 and sent as one byte (w/o %00 byte marking the
table). This saves a hell of a lot of traffic on the Internet as
till now base ASCII chars constitute the main part of the Internet.
On the other hand, this simplification may lead to some bizarre
problems in some rather particular circumstances. Again - it is not
connected with the Unicode per se, but with decoding UTF-8 transport
stream back to Unicode.

You may find the following threads interesting:

<http://groups.google.com/group/comp..._frm/thread/d444d7077da80dd7/a1787fc74a45fdb5>

<http://groups.google.com/group/comp..._frm/thread/200ab1e205b57af6/2cbcdf2981216dfd>
 
Thomas 'PointedEars' Lahn

VK said:
Luke,
Based on your comments I'm guessing that you take UTF-8 as a Unicode
format, and this is where the confusion arises.

Based on your comments, I do not have to guess that you have no idea
as to what Unicode and UTF are. You are absolutely clueless about it.

UTF-8 is *not* a Unicode format.

UTF-8 is the 8-bit Unicode Transformation Format, meaning Unicode
characters are encoded in one or more code units of 8 bits.

This is a transport encoding used to deliver Unicode-16 characters

There is no such thing as a Unicode-16 character.

from server to the user agent (like base64 is used to deliver binary data
over HTTP).

No. Any UTF can be used anywhere provided both sender and receiver support
it and either the sender declares it properly or the receiver has means to
recognize it (where the former is recommended).

Namely these are 2 bytes representing a Unicode character prefixed by
special character (%A0 if I remember properly) to mark the beginning of
the Unicode sequence.

Nonsense. UTF-16 encodes Unicode characters with _at least_ 16 bits because
the length of the code unit used is 16 bits. If a Unicode character has a
code point beyond what can be represented as an unsigned 16-bit value, two
code units (a so-called surrogate pair) are used.

The code prefix you are talking about, which is the BOM, is not attached to
every character code sequence but placed at the beginning of the UTF-16 or
UTF-32 encoded string to indicate the endianness of that string; it is not
0xA0/U+00A0 (no-break space) but the character at code point U+FEFF
(zero-width no-break space).
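
For illustration, the surrogate-pair rule and the BOM bytes can be sketched as follows. The function names toUtf16CodeUnits and bomBytes are hypothetical; the arithmetic is the standard UTF-16 encoding of a code point.

```javascript
// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint];                      // fits in one 16-bit unit
  }
  var offset = codePoint - 0x10000;          // 20 bits remain
  return [
    0xD800 + (offset >> 10),                 // high surrogate: top 10 bits
    0xDC00 + (offset & 0x3FF)                // low surrogate: bottom 10 bits
  ];
}

// Hypothetical helper: the bytes of the BOM (U+FEFF) at the start of a
// UTF-16 stream, which reveal the byte order of everything that follows.
function bomBytes(littleEndian) {
  return littleEndian ? [0xFF, 0xFE] : [0xFE, 0xFF];
}

var single = toUtf16CodeUnits(0x20AC);  // one code unit
var pair = toUtf16CodeUnits(0x1F600);   // two code units
```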

<URL:http://en.wikipedia.org/wiki/Unicode> pp.

As a common agreement characters matching the base ASCII characters are
not prefixed by %A0 and sent as one byte (w/o %00 byte marking the
table). This saves a hell of a lot of traffic on the Internet as
till now base ASCII chars constitute the main part of the Internet.

Surely you have proof to back up this wild assumption.

problems in some rather particular circumstances. Again - it is not
connected with the Unicode per se, but with decoding UTF-8 transport
stream back to Unicode.

You may find the following threads interesting:
<http://groups.google.com/group/comp..._frm/thread/200ab1e205b57af6/2cbcdf2981216dfd>

If you just had read _and_ understood.


PointedEars
 
John W. Kennedy

Thomas said:
Would you care to define "compatible"? In my book, it does not qualify as
being compatible since more than 90% of its characters and code points are
not part of US-ASCII.


If you mean by that a UTF-8 code unit had the same length as the US-ASCII
code unit: no, it has not. US-ASCII is a 7-bit code, code points 0x00 to
0x7F. UTF-7 (which exists) code units have a length of 7 bits, hence
UTF-7 is _downwards_ compatible to US-ASCII (meaning that every US-ASCII
character code sequence matches the UTF-7 encoding for code points of the
same characters in Unicode, U+0000 to U+007F) and is the only UTF (Unicode
Transformation Format) that has this particular property.

Not quite. If that were literally so, UTF-7 would have no way to encode
any character outside of US-ASCII. In fact, '+' is used as an escape
character, and a literal '+' must be encoded (as '+-'). Furthermore, in
practice, UTF-7 may encode other US-ASCII characters, for transmission
across channels where those characters are deemed unsafe.
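
A simplified sketch of that scheme is given below. The names toUtf7 and encodeRun are hypothetical, and the sketch makes assumptions a real encoder would not: it passes every US-ASCII character other than '+' through directly (so it ignores the characters deemed unsafe on some channels), and it always closes an encoded run with '-'.

```javascript
var BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode a run of UTF-16 code units as base64 without "=" padding.
function encodeRun(units) {
  var bits = "";
  for (var i = 0; i < units.length; i++) {
    var b = units[i].toString(2);
    while (b.length < 16) b = "0" + b;       // 16 bits per code unit
    bits += b;
  }
  while (bits.length % 6 !== 0) bits += "0"; // pad to whole base64 digits
  var out = "";
  for (var j = 0; j < bits.length; j += 6) {
    out += BASE64.charAt(parseInt(bits.slice(j, j + 6), 2));
  }
  return out;
}

// Simplified UTF-7 encoder: ASCII passes through, "+" becomes "+-",
// everything else is wrapped as "+<base64 of UTF-16>-".
function toUtf7(text) {
  var out = "";
  var run = [];
  function flush() {
    if (run.length > 0) {
      out += "+" + encodeRun(run) + "-";
      run = [];
    }
  }
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code === 0x2B) {          // literal "+"
      flush();
      out += "+-";
    } else if (code < 0x80) {     // other US-ASCII characters
      flush();
      out += text.charAt(i);
    } else {                      // collect non-ASCII code units
      run.push(code);
    }
  }
  flush();
  return out;
}

var pound = toUtf7("\u00a31");    // pound sign followed by "1"
var plus = toUtf7("1+1");
```

The output uses only US-ASCII characters, but a plain '+' no longer encodes as itself, which is the point being made about UTF-7 not being a strict superset of US-ASCII.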

--
John W. Kennedy
"But now is a new thing which is very old--
that the rich make themselves richer and not poorer,
which is the true Gospel, for the poor's sake."
-- Charles Williams. "Judgement at Chelmsford"
 
Thomas 'PointedEars' Lahn

John said:
Not quite. If that were literally so, UTF-7 would have no way to encode
any character outside of US-ASCII.

I don't think so. If that were so, characters outside of US-ASCII would
only require more "UTF-7" code units. However, it turns out that in real
UTF-7, characters outside of US-ASCII have to "be encoded in UTF-16 and
then in modified base64" (Wikipedia).

In fact, '+' is used as an escape character, and a literal '+' must be
encoded (as '+-').

True, my bad. From the term of UTF-7 I falsely inferred encoding would take
place in the way of the other UTFs, using a fixed size code for all code
points and more code units only if required by the code point.

<URL:http://en.wikipedia.org/wiki/UTF-7>


PointedEars
 
