Bizarre JS brackets bug - mystery solved!

Al Reynolds

Afternoon,

In an earlier thread (http://tinyurl.com/5v4aa), I described a
problem I was having which was rather bizarrely solved by
changing the line:
"inputbox.value = numq+ag-cw-cc;"
to:
"inputbox.value = numq+(ag)-(cw)-(cc);"

This was needed in IE6 but not in any other browser I tried.
I have now solved the mystery of why inserting the brackets
removed the problem.

I used the age-old technique of removing everything else until
only the error remains. If you're interested in the two files
which eventually helped me to see the error, look at:
http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-working.htm
http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-faulty.htm

I will, however, explain the solution here.

IE6 is, I believe, the first version of the IE browser to have
"Auto-Select" for text encoding (character set) turned on by
default. When it loads the first of the above pages, it decides
that the encoding is "Western European (Windows)". When it
loads the second of the above pages, it decides that the
encoding is "Unicode (UTF-7)".

This process (and its arbitrary nature) is rather nicely illustrated
by the three examples below, which are all short. For full effect,
make sure you have Auto-Select turned on for text encoding if
you look at any of the web pages.

(1) http://www.ex.ac.uk/cimt/dev/oddity/plusminus-oddity-1.htm

<HTML>
<HEAD><TITLE>plus minus oddity 1</TITLE></HEAD>
<BODY>
foo+stuff-bar
</BODY>
</HTML>

This displays:
foo<oriental symbol>bar.
IE has decided that the document is Unicode (UTF-7).

(2) http://www.ex.ac.uk/cimt/dev/oddity/plusminus-oddity-2.htm

<HTML>
<HEAD><TITLE>plus minus oddity 2</TITLE></HEAD>
<BODY>
foo+stuff-bar<BR>
foo+ stuff -bar
</BODY>
</HTML>

This displays:
foo+stuff-bar
foo+ stuff -bar
IE has decided that this document is Western European (Windows).
How it has decided this is unclear to me. It contains the same first
line as example (1), but something in the second line makes it change
its mind. Perhaps it is the appearance of "stuff" without the "+"
directly in front?

(3) http://www.ex.ac.uk/cimt/dev/oddity/plusminus-oddity-3.htm

<HTML>
<HEAD><TITLE>plus minus oddity</TITLE></HEAD>
<META HTTP-EQUIV="Content-Type"
CONTENT="text/html; CHARSET=iso-8859-1">
<BODY>
foo+stuff-bar
</BODY>
</HTML>

This displays:
foo+stuff-bar
IE has correctly responded to my suggestion that this document is in
Western European (ISO) as specified in the META tag.

I'm sure that some of you will tell me that I should have always set
the character set for every HTML page I have ever written. If I had
done then I might never have discovered this IE6 "feature".

Anyway, I have learnt my lesson.

I can see two potential ongoing problems. Firstly, it seems odd (to
me) that the text-encoding has also been used to process the script
within the page. There will be plenty of occasions where a variable
is enclosed between a "+" and a "-", and each of these could
potentially lead to an error. Do people script in non-latin charsets?

What makes the problem worse is that the way in which IE decides
the encoding depends fairly arbitrarily on things which appear *later*
in the code and/or page. Removing a working section of code might
remove the problem, but not because there was a fault in that section
of code.

Anyway, there is an easy solution.
Make sure the text-encoding is specified on every page.

Al
 
Michael Winter

[snip]
Do people script in non-latin charsets?

I don't know if they do, but I presume that the potential is there.
Identifiers can legally contain Unicode characters from certain code
groups, and string literals can contain any Unicode character (and I'm not
referring to escape sequences). For them to be properly processed, I
assume that the character set must be set correctly.

[snip]

Mike
 
Grant Wagner

Al said:
I can see two potential ongoing problems. Firstly, it seems odd (to
me) that the text-encoding has also been used to process the script
within the page.

The script within the page is just part of the page. If the page is
Anyway, there is an easy solution.
Make sure the text-encoding is specified on every page.

Indeed.


Anyway, this may be of passing interest to you: <url:
http://zsigri.tripod.com/fontboard/cjk/utf7.html />

Using some guess work and the URL above, I've arrived at a partial
solution to your question about why IE sometimes decides to Auto-Select
UTF-7 and sometimes it does not. Here it is:

If all "+" characters on a page are only followed by characters from the
Base64 alphabet up to the next "-" character, the page is assumed to be
UTF-7. If even a single "+" character on the page is followed by a
character not from the Base64 alphabet, the page is assumed to not be
UTF-7. As a result:

abc ++++- def would be UTF-7; but
abc +<b>+++</b>- would not

However, this does not explain everything. Otherwise the line
for (var i = 0; i < length; ++i-b) { ... }
would cause problems (assuming no other occurrences of "+" on the
page), but it does not.
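
The guess above can be written down as a small predicate. This is only a sketch of the rule as stated in this post, not IE's actual detection logic (which, as noted, the rule does not fully explain). The name looksLikeUtf7 is made up for illustration, and the sketch assumes a page with no "+" at all is not treated as UTF-7.

```javascript
// Characters of the Base64 alphabet (note that "+" itself is one of them).
const BASE64_CHAR = /[A-Za-z0-9+\/]/;

// Returns true if every "+" in the text is followed only by Base64
// characters up to the next "-". Hypothetical helper, not an IE API.
function looksLikeUtf7(text) {
  let sawPlus = false;
  let i = 0;
  while (i < text.length) {
    if (text[i] !== "+") {
      i++;
      continue;
    }
    sawPlus = true;
    i++; // step past the "+"
    // Consume the run of Base64 characters that follows.
    while (i < text.length && BASE64_CHAR.test(text[i])) {
      i++;
    }
    // The run must be closed by "-", otherwise the guess says "not UTF-7".
    if (text[i] !== "-") {
      return false;
    }
    i++; // step past the "-"
  }
  return sawPlus;
}

console.log(looksLikeUtf7("foo+stuff-bar"));                  // example (1)
console.log(looksLikeUtf7("foo+stuff-bar\nfoo+ stuff -bar")); // example (2)
console.log(looksLikeUtf7("numq+(ag)-(cw)-(cc)"));            // bracketed line
```

Under this rule the bracketed line from the original post is rejected because "+" is followed by "(", which is consistent with the brackets making the problem go away. The for-loop counterexample is still classified as UTF-7 by the sketch, so the rule is incomplete.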
 
VK

Anyway, there is an easy solution.
Make sure the text-encoding is specified on every page.

I don't think it always helps. How about situations when you really need a
script-powered page in Unicode? - Online dictionaries and language lessons
just to name the first.

Also I'm out of any ideas how the "+stuff-" literal might be interpreted as
a Korean syllabic symbol (Unicode value B2DB).

I think this is a bug ("+stuff-" = \u45787) and this is so called "unwanted
behavior" for the whole situation.

IMHO this should be definitely reported to Washington (I mean to the state
of, not DC :)
 
Jim Ley

I don't think it always helps. How about situations when you really need a
script-powered page in Unicode? - Online dictionaries and language lessons
just to name the first.

There is no problem with scripting in IE in UTF-8 or Mozilla, even
script using utf-8 chars as variables work fine - Older Opera and
others have problems, but none in literals.

If the encoding is specified there's no problem at all, just ensure you
specify an encoding, don't let it be guessed, as IE will guess wrong.
I think this is a bug ("+stuff-" = \u45787) and this is so called "unwanted
behavior" for the whole situation.

No. Whenever the browser has to fix up an invalid document, whether
the result works is down to luck - don't leave it to luck and you
won't have a problem. For your bug above, a legitimate UTF-7
document would have a complementary bug - you can't deal with both.

Just include a proper charset!

Jim.
 
Michael Winter

I don't think it always helps. How about situations when you really need
a script-powered page in Unicode? - Online dictionaries and language
lessons just to name the first.

[Theory]
Declare the document with its correct character set and place the script
in a separate file. If necessary, specify the charset attribute on the
SCRIPT element.
[/Theory]

Not having written documents in other character sets, I don't know how
effective that will be. However, it seems to be the technically correct
approach.
Also I'm out of any ideas how the "+stuff-" literal might be interpreted
as a Korean syllabic symbol (Unicode value B2DB).

"+stuff-" literal? What are you referring to?
[...] \u45787 [...]

Unicode escape sequences use hexadecimal, not decimal.

[snip]

Mike
 
VK

[Theory]
Declare the document with its correct character set and place the script
in a separate file. If necessary, specify the charset attribute on the
SCRIPT element.
[/Theory]

The theory is good and it's the first what came in my head too. But how to
deal with all this inline little onEvent stuff? (like
"...onChange=update(this.form, this.form)"
It looks like in Unicode it may be transformed in a unpredictable way.
"+stuff-" literal? What are you referring to?

I'm referring to http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-working.htm
from the original posting.
The character sequence (let's stick to this term) "foo+stuff-bar" has been
transformed into "foo[Korean symbol]bar".
Why? And what else may happen with your script on a unicode page? Maybe
"x+y=z" can become a Japanese text in some circumstances?

[...] \u45787 [...]

Unicode escape sequences use hexadecimal, not decimal.

It depends. Unicode consortium publish all its tables in hex values.
Nevertheless if you need to use Unicode chars in non-unicode document (for
scripting for example), you have to use \u-sequences (\u+digital code
value).


Again - I'm not saying it's a crucial default, but it is definitely an issue
to be addressed in new IE releases.
 
Michael Winter

[Theory]
Declare the document with its correct character set and place the
script in a separate file. If necessary, specify the charset attribute
on the SCRIPT element.
[/Theory]

The theory is good and it's the first what came in my head too. But how
to deal with all this inline little onEvent stuff? (like
"...onChange=update(this.form, this.form)"
It looks like in Unicode it may be transformed in a unpredictable way.

That is a possibility. However, you could add the listeners through the
script itself. The only problem here is that old browsers won't be able to
use such pages as getting a reference to anything other than form controls
depends on getElementById (or similar).
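
A minimal sketch of attaching the listener from the script rather than through an inline onChange attribute. The helper name attachChangeHandler, the control id "quantity" and the update function are all hypothetical; the document object is passed in as a parameter only so the sketch can be exercised outside a browser.

```javascript
// Hypothetical helper: looks up a control by id and assigns its change
// handler from script, so no handler code sits in the HTML markup.
function attachChangeHandler(doc, id, handler) {
  const control = doc.getElementById(id);
  if (!control) {
    return false; // element missing, or getElementById found nothing
  }
  control.onchange = function () {
    // Inside the handler, "this" is the control, as with an inline handler.
    handler(this.form);
  };
  return true;
}

// Possible use in a page, assuming a control with id "quantity" and a
// function update(form) defined elsewhere:
//   attachChangeHandler(document, "quantity", update);
```

With the handler living in an external script file, the inline attribute is no longer subject to whatever encoding the browser picks for the page.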
"+stuff-" literal? What are you referring to?

I'm referring to
http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-working.htm
from the original posting.
The character sequence (let's stick to this term) "foo+stuff-bar" has
been transformed into "foo[Korean symbol]bar".

Oh, I see. I thought you were referring to some strange non-standard
character entity.

From UTF-7 Definition, RFC 2152 - UTF-7 A Mail-Safe Transformation Format
of Unicode:

The "+" signals that subsequent octets are to be interpreted as
elements of the Modified Base64 alphabet until a character not in
that alphabet is encountered. Such characters include control
characters such as carriage returns and line feeds; thus, a Unicode
shifted sequence always terminates at the of a line [sic]. As a
special case, if the sequence terminates with the character "-"
(US-ASCII decimal 45) then that character is absorbed; other
terminating characters are not absorbed and are processed normally.

So in the sequence, +...-, that entire string is replaced by the value of
.... in the Base64 alphabet. The question is why IE decides the page is
UTF-7.
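
To make the RFC passage concrete, here is a minimal sketch of a lenient UTF-7 decoder. It illustrates the rule quoted above and is not IE's implementation. The name decodeUtf7 is made up, and two simplifications are assumptions of the sketch: leftover bits that do not fill a 16-bit unit are silently dropped, and a "+" followed directly by a character that is neither Base64 nor "-" is simply discarded.

```javascript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical lenient decoder for UTF-7 shifted sequences (+...-).
function decodeUtf7(input) {
  let out = "";
  let i = 0;
  while (i < input.length) {
    if (input[i] !== "+") {
      out += input[i++]; // directly encoded character, copied as-is
      continue;
    }
    i++; // skip the "+" shift character
    if (input[i] === "-") {
      out += "+"; // "+-" is the escape for a literal plus sign
      i++;
      continue;
    }
    let bits = 0;
    let bitCount = 0;
    // Each Base64 character contributes 6 bits; every 16 bits form one
    // UTF-16 code unit.
    while (i < input.length && BASE64_ALPHABET.includes(input[i])) {
      bits = (bits << 6) | BASE64_ALPHABET.indexOf(input[i]);
      bitCount += 6;
      if (bitCount >= 16) {
        bitCount -= 16;
        out += String.fromCharCode((bits >> bitCount) & 0xffff);
        bits &= (1 << bitCount) - 1; // keep only the unused low bits
      }
      i++;
    }
    if (input[i] === "-") {
      i++; // a terminating "-" is absorbed
    }
  }
  return out;
}

console.log(decodeUtf7("foo+stuff-bar") === "foo\uB2DBbar");  // true
console.log(decodeUtf7("inputbox.value = numq+ag-cw-cc;"));
// inputbox.value = numqcw-cc;
```

The five Base64 characters of "stuff" carry 30 bits; the first 16 are hex B2DB, the Korean syllable mentioned earlier in the thread, and the remaining 14 are discarded. In the original script line, "+ag-" carries only 12 bits, so under this sketch the whole sequence vanishes and the expression loses a term.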

[snip]
[...] \u45787 [...]

Unicode escape sequences use hexadecimal, not decimal.

It depends. Unicode consortium publish all its tables in hex values.
Nevertheless if you need to use Unicode chars in non-unicode document
(for scripting for example), you have to use
\u-sequences (\u+digital code value).

A script can be a Unicode document. Though identifiers must come from a
limited alphabet, string literals can contain any Unicode character.

Unicode escape sequences in string literals within scripts *do* require
hexadecimal characters. HTML entity references can use either decimal or
hexadecimal (decimal is probably safer).
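
A short illustration of the difference, using the values from this thread (decimal 45787 is hex B2DB). The variable names are for illustration only.

```javascript
// \u takes exactly four hexadecimal digits.
const fromHexEscape = "\uB2DB";

// A decimal code value has to go through String.fromCharCode instead.
const fromDecimal = String.fromCharCode(45787);

// "\u45787" is read as \u4578 followed by the ordinary character "7",
// so it yields two characters, neither of them U+B2DB.
const misreadEscape = "\u45787";

console.log(fromHexEscape === fromDecimal); // true
console.log(misreadEscape.length);          // 2
console.log((45787).toString(16));          // b2db
```

In HTML, the same character can be written as the decimal reference &#45787; or the hexadecimal reference &#xB2DB;.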
Again - I'm not saying it's a crucial default, but it is definitely an
issue to be addressed in new IE releases.

However, Microsoft only seem to be issuing security updates. The next full
release will only be available in Longhorn, or so I've read.

Mike
 
Jim Ley

[Theory]
Declare the document with its correct character set and place the script
in a separate file. If necessary, specify the charset attribute on the
SCRIPT element.
[/Theory]

The theory is good and it's the first what came in my head too. But how to
deal with all this inline little onEvent stuff? (like
"...onChange=update(this.form, this.form)"
It looks like in Unicode it may be transformed in a unpredictable way.

It's not, current browsers have excellent unicode support, you've just
got to declare the character set so it knows!
Why? And what else may happen with your script on a unicode page? Maybe
"x+y=z" can become a Japanese text in some circumstances?

no, not if you correctly declare the encoding, it simply cannot
happen.
It depends. Unicode consortium publish all its tables in hex values.
Nevertheless if you need to use Unicode chars in non-unicode document (for
scripting for example), you have to use \u-sequences (\u+digital code
value).

Please read the specifications, Michael was entirely correct:

\uhhhh - Unicode character represented by the four-digit hexadecimal
number hhhh.
Again - I'm not saying it's a crucial default, but it is definitely an issue
to be addressed in new IE releases.

There's no bug, the bug is in your code.

Jim.
 
