Bizarre JS brackets bug - mystery solved!

Discussion in 'Javascript' started by Al Reynolds, Sep 30, 2004.

  1. Al Reynolds

    Al Reynolds Guest

    Afternoon,

    In an earlier thread (http://tinyurl.com/5v4aa), I described a
    problem I was having which was rather bizarrely solved by
    changing the line:
    "inputbox.value = numq+ag-cw-cc;"
    to:
    "inputbox.value = numq+(ag)-(cw)-(cc);"

    This was needed in IE6 but not in any other browser I tried.
    I have now solved the mystery of why inserting the brackets
    removed the problem.

    I used the age-old technique of removing everything else until
    only the error remains. If you're interested in the two files
    which eventually helped me to see the error, look at:
    http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-working.htm
    http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-faulty.htm

    I will, however, explain the solution here.

    IE6 is, I believe, the first version of the IE browser to have
    "Auto-Select" for text encoding (character set) turned on by
    default. When it loads the first of the above pages, it decides
    that the encoding is "Western European (Windows)". When it
    loads the second of the above pages, it decides that the
    encoding is "Unicode (UTF-7)".

    This process (and its arbitrary nature) is rather nicely illustrated
    by the three examples below, which are all short. For full effect,
    make sure you have Auto-Select turned on for text encoding if
    you look at any of the web pages.

    (1) http://www.ex.ac.uk/cimt/dev/oddity/plusminus-oddity-1.htm

    <HTML>
    <HEAD><TITLE>plus minus oddity 1</TITLE></HEAD>
    <BODY>
    foo+stuff-bar
    </BODY>
    </HTML>

    This displays:
    foo<oriental symbol>bar.
    IE has decided that the document is Unicode (UTF-7).

    (2) http://www.ex.ac.uk/cimt/dev/oddity/plusminus-oddity-2.htm

    <HTML>
    <HEAD><TITLE>plus minus oddity 2</TITLE></HEAD>
    <BODY>
    foo+stuff-bar<BR>
    foo+ stuff -bar
    </BODY>
    </HTML>

    This displays:
    foo+stuff-bar
    foo+ stuff -bar
    IE has decided that this document is Western European (Windows).
    How it has decided this is unclear to me. It contains the same first
    line as example (1), but something in the second line makes it change
    its mind. Perhaps it is the appearance of "stuff" without the "+"
    directly in front?

    (3) http://www.ex.ac.uk/cimt/dev/oddity/plusminus-oddity-3.htm

    <HTML>
    <HEAD><TITLE>plus minus oddity</TITLE></HEAD>
    <META HTTP-EQUIV="Content-Type"
    CONTENT="text/html; CHARSET=iso-8859-1">
    <BODY>
    foo+stuff-bar
    </BODY>
    </HTML>

    This displays:
    foo+stuff-bar
    IE has correctly responded to my suggestion that this document is in
    Western European (ISO) as specified in the META tag.

    I'm sure that some of you will tell me that I should have always set
    the character set for every HTML page I have ever written. If I had
    done so, I might never have discovered this IE6 "feature".

    Anyway, I have learnt my lesson.

    I can see two potential ongoing problems. Firstly, it seems odd (to
    me) that the text-encoding has also been used to process the script
    within the page. There will be plenty of occasions where a variable
    is enclosed between a "+" and a "-", and each of these could
    potentially lead to an error. Do people script in non-latin charsets?

    What makes the problem worse is that the way in which IE decides
    the encoding depends fairly arbitrarily on things which appear *later*
    in the code and/or page. Removing a working section of code might
    remove the problem, but not because there was a fault in that section
    of code.

    Anyway, there is an easy solution.
    Make sure the text-encoding is specified on every page.
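
A minimal sketch of how one might check a page's source for a declared
charset, so the browser is never left to guess. The function name
declaredCharset is hypothetical, and it only looks at META tags; a charset
sent in the HTTP Content-Type header would not be seen by it.

```javascript
// Hypothetical helper: return the charset named in a META tag, or null.
// Only inspects the markup; it does not know about HTTP headers.
function declaredCharset(html) {
  const match = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(html);
  return match ? match[1].toLowerCase() : null;
}

// Example (3) from this post declares a charset; example (1) does not.
const example3 =
  '<HTML><HEAD><TITLE>plus minus oddity</TITLE></HEAD>\n' +
  '<META HTTP-EQUIV="Content-Type"\n' +
  'CONTENT="text/html; CHARSET=iso-8859-1">\n' +
  '<BODY>foo+stuff-bar</BODY></HTML>';
const example1 =
  '<HTML><HEAD><TITLE>plus minus oddity 1</TITLE></HEAD>\n' +
  '<BODY>foo+stuff-bar</BODY></HTML>';

console.log(declaredCharset(example3));
console.log(declaredCharset(example1));
```

A page for which this returns null is one where IE6's Auto-Select gets to
choose the encoding.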

    Al
    Al Reynolds, Sep 30, 2004
    #1

  2. On Thu, 30 Sep 2004 16:00:28 +0100, Al Reynolds wrote:

    [snip]

    > Do people script in non-latin charsets?


    I don't know if they do, but I presume that the potential is there.
    Identifiers can legally contain Unicode characters from certain code
    groups, and string literals can contain any Unicode character (and I'm not
    referring to escape sequences). For them to be properly processed, I
    assume that the character set must be set correctly.

    [snip]

    Mike

    --
    Michael Winter
    Replace ".invalid" with ".uk" to reply by e-mail.
    Michael Winter, Sep 30, 2004
    #2

  3. Al Reynolds

    Grant Wagner Guest

    Al Reynolds wrote:

    > I can see two potential ongoing problems. Firstly, it seems odd (to
    > me) that the text-encoding has also been used to process the script
    > within the page.


    The script within the page is just part of the page. If the page is
    encoded a specific way, then the text between the <script> and </script>
    tags will be encoded the same way.

    > Anyway, there is an easy solution.
    > Make sure the text-encoding is specified on every page.


    Indeed.


    Anyway, this may be of passing interest to you: <url:
    http://zsigri.tripod.com/fontboard/cjk/utf7.html />

    Using some guess work and the URL above, I've arrived at a partial
    solution to your question about why IE sometimes decides to Auto-Select
    UTF-7 and sometimes it does not. Here it is:

    If all "+" characters on a page are only followed by characters from the
    Base64 alphabet up to the next "-" character, the page is assumed to be
    UTF-7. If even a single "+" character on the page is followed by a
    character not from the Base64 alphabet, the page is assumed to not be
    UTF-7. As a result:

    abc ++++- def would be UTF-7; but
    abc +<b>+++</b>- would not

    However, this does not explain everything; otherwise
    for (var i = 0; i < length; ++i-b) { ... }
    would cause problems (assuming no other occurrences of "+" on the
    page), but it does not.
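
The guess above can be written down as a short sketch. This encodes only the
rule as stated in this post, not IE's actual detection algorithm (which is
unknown here). The name looksLikeUtf7 is hypothetical, and treating a "+"
with no later "-" as "not UTF-7" is an added assumption.

```javascript
// Characters of the Base64 alphabet (without the "=" padding character).
const BASE64_CHAR = /^[A-Za-z0-9+/]$/;

// Hypothetical helper implementing the guessed rule: every "+" must be
// followed only by Base64 characters up to the next "-".
function looksLikeUtf7(text) {
  let sawSequence = false;
  let i = 0;
  while (i < text.length) {
    if (text[i] !== "+") { i++; continue; }
    let j = i + 1;
    while (j < text.length && text[j] !== "-") {
      if (!BASE64_CHAR.test(text[j])) return false; // non-Base64 after "+"
      j++;
    }
    if (j === text.length) return false; // assumption: unterminated "+" run
    sawSequence = true;
    i = j + 1; // continue after the closing "-"
  }
  return sawSequence;
}

console.log(looksLikeUtf7("abc ++++- def"));
console.log(looksLikeUtf7("abc +<b>+++</b>-"));
console.log(looksLikeUtf7("foo+stuff-bar<BR>\nfoo+ stuff -bar"));
```

Under this rule the loop header "++i-b" also counts as UTF-7, which is
exactly the case that does not match IE's observed behaviour, so the rule is
at best partial.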

    --
    Grant Wagner
    comp.lang.javascript FAQ - http://jibbering.com/faq
    Grant Wagner, Sep 30, 2004
    #3
  4. Al Reynolds

    VK Guest

    > Anyway, there is an easy solution.
    > Make sure the text-encoding is specified on every page.


    I don't think it always helps. How about situations when you really need a
    script-powered page in Unicode? - Online dictionaries and language lessons
    just to name the first.

    Also I'm out of any ideas how the "+stuff-" literal might be interpreted as
    a Korean syllabic symbol (Unicode value B2DB).

    I think this is a bug ("+stuff-" = \u45787) and this is so called "unwanted
    behavior" for the whole situation.

    IMHO this should be definitely reported to Washington (I mean to the state
    of, not DC :)
    VK, Oct 1, 2004
    #4
  5. Al Reynolds

    Jim Ley Guest

    On Fri, 1 Oct 2004 15:12:34 +0200, "VK" wrote:

    >> Anyway, there is an easy solution.
    >> Make sure the text-encoding is specified on every page.

    >
    >I don't think it always helps. How about situations when you really need a
    >script-powered page in Unicode? - Online dictionaries and language lessons
    >just to name the first.


    There is no problem with scripting in UTF-8 in IE or Mozilla; even
    scripts using UTF-8 characters in variable names work fine. Older Opera
    and others have problems with those, but none with literals.

    If the encoding is specified there's no problem at all. Just ensure you
    specify an encoding and don't let it be guessed, as IE will guess wrong.

    >I think this is a bug ("+stuff-" = \u45787) and this is so called "unwanted
    >behavior" for the whole situation.


    No, whether anything the browser does to fix up an invalid document
    works or not is down to luck. Don't leave it to luck and you won't have
    a problem. As for your bug above, a legitimate UTF-7 document would have
    a complementary bug; you can't deal with both.

    Just include a proper charset!

    Jim.
    Jim Ley, Oct 1, 2004
    #5
  6. On Fri, 1 Oct 2004 15:12:34 +0200, VK wrote:

    >> Anyway, there is an easy solution.
    >> Make sure the text-encoding is specified on every page.

    >
    > I don't think it always helps. How about situations when you really need
    > a script-powered page in Unicode? - Online dictionaries and language
    > lessons just to name the first.


    [Theory]
    Declare the document with its correct character set and place the script
    in a separate file. If necessary, specify the charset attribute on the
    SCRIPT element.
    [/Theory]

    Not having written documents in other character sets, I don't know how
    effective that will be. However, it seems to be the technically correct
    approach.

    > Also I'm out of any ideas how the "+stuff-" literal might be interpreted
    > as a Korean syllabic symbol (Unicode value B2DB).


    "+stuff-" literal? What are you referring to?

    > [...] \u45787 [...]


    Unicode escape sequences use hexadecimal, not decimal.

    [snip]

    Mike

    --
    Michael Winter
    Replace ".invalid" with ".uk" to reply by e-mail.
    Michael Winter, Oct 1, 2004
    #6
  7. Al Reynolds

    VK Guest

    > [Theory]
    > Declare the document with its correct character set and place the script
    > in a separate file. If necessary, specify the charset attribute on the
    > SCRIPT element.
    > [/Theory]


    The theory is good, and it's the first thing that came into my head too.
    But how do you deal with all the little inline onEvent stuff, like
    "...onChange=update(this.form, this.form)"?
    It looks like in Unicode it may be transformed in an unpredictable way.

    > "+stuff-" literal? What are you referring to?


    I'm referring to http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-working.htm
    from the original posting.
    The character sequence (let's stick to this term) "foo+stuff-bar" has been
    transformed into "foo[Korean symbol]bar".
    Why? And what else may happen with your script on a unicode page? Maybe
    "x+y=z" can become a Japanese text in some circumstances?


    > > [...] \u45787 [...]

    >
    > Unicode escape sequences use hexadecimal, not decimal.


    It depends. Unicode consortium publish all its tables in hex values.
    Nevertheless if you need to use Unicode chars in non-unicode document (for
    scripting for example), you have to use \u-sequences (\u+digital code
    value).


    Again - I'm not saying it's a crucial default, but it is definitely an issue
    to be addressed in new IE releases.
    VK, Oct 1, 2004
    #7
  8. On Fri, 1 Oct 2004 16:29:12 +0200, VK wrote:

    >> [Theory]
    >> Declare the document with its correct character set and place the
    >> script in a separate file. If necessary, specify the charset attribute
    >> on the SCRIPT element.
    >> [/Theory]

    >
    > The theory is good, and it's the first thing that came into my head
    > too. But how do you deal with all the little inline onEvent stuff, like
    > "...onChange=update(this.form, this.form)"?
    > It looks like in Unicode it may be transformed in an unpredictable way.


    That is a possibility. However, you could add the listeners through the
    script itself. The only problem here is that old browsers won't be able to
    use such pages as getting a reference to anything other than form controls
    depends on getElementById (or similar).
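
A minimal sketch of adding a listener from script instead of an inline
onChange attribute. The helper name attachChangeHandler and the element id
used in the note below are hypothetical; the feature test covers the old
browsers mentioned above that lack getElementById.

```javascript
// Hypothetical helper: attach a change handler from script.
// `doc` is the page's document object (passed in so it can be stubbed).
function attachChangeHandler(doc, id, handler) {
  if (!doc.getElementById) return false; // old browser: cannot look up the element
  const el = doc.getElementById(id);
  if (!el) return false;                 // no element with that id
  el.onchange = handler;                 // same event the inline attribute would use
  return true;
}
```

In a page this might be called as
attachChangeHandler(document, "answer", function () { update(this.form); })
once the form has loaded, where "answer" and update are placeholders. With
the handler in an external script file, no script text is left in the markup
for the page's encoding to affect.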

    >> "+stuff-" literal? What are you referring to?

    >
    > I'm referring to
    > http://www.ex.ac.uk/cimt/dev/oddity/ie6-oddity-working.htm
    > from the original posting.
    > The character sequence (let's stick to this term) "foo+stuff-bar" has
    > been transformed into "foo[Korean symbol]bar".


    Oh, I see. I thought you were referring to some strange non-standard
    character entity.

    > Why?


    From UTF-7 Definition, RFC 2152 - UTF-7 A Mail-Safe Transformation Format
    of Unicode:

    The "+" signals that subsequent octets are to be interpreted as
    elements of the Modified Base64 alphabet until a character not in
    that alphabet is encountered. Such characters include control
    characters such as carriage returns and line feeds; thus, a Unicode
    shifted sequence always terminates at the of a line [sic]. As a
    special case, if the sequence terminates with the character "-"
    (US-ASCII decimal 45) then that character is absorbed; other
    terminating characters are not absorbed and are processed normally.

    So in the sequence, +...-, that entire string is replaced by the value of
    .... in the Base64 alphabet. The question is why IE decides the page is
    UTF-7.
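
To make the RFC wording concrete, here is a minimal sketch of decoding
shifted sequences as that passage describes. It is not IE's decoder; the name
decodeUtf7 is hypothetical, and silently discarding leftover bits at the end
of a sequence is an assumption.

```javascript
// Modified Base64 alphabet used inside UTF-7 shifted sequences (RFC 2152).
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: decode "+...-" sequences in an otherwise ASCII string.
function decodeUtf7(text) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    if (text[i] !== "+") { out += text[i]; i++; continue; }
    i++; // skip the "+"
    if (text[i] === "-") { out += "+"; i++; continue; } // "+-" is a literal "+"
    let bits = 0;
    let bitCount = 0;
    while (i < text.length && B64.indexOf(text[i]) !== -1) {
      bits = (bits << 6) | B64.indexOf(text[i]); // each character carries 6 bits
      bitCount += 6;
      if (bitCount >= 16) {                      // a full 16-bit code unit is ready
        bitCount -= 16;
        out += String.fromCharCode((bits >> bitCount) & 0xffff);
        bits &= (1 << bitCount) - 1;             // keep only the unused low bits
      }
      i++;
    }
    if (text[i] === "-") i++; // a terminating "-" is absorbed
    // Assumption: leftover bits (fewer than 16) are dropped.
  }
  return out;
}

const decoded = decodeUtf7("foo+stuff-bar");
console.log(decoded.charCodeAt(3).toString(16)); // → b2db
```

The first three characters of "stuff" supply 18 bits, of which the leading 16
are 0xB2DB, the Korean syllable seen in example (1); the remaining 14 bits
never complete a second code unit.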

    [snip]

    >> > [...] \u45787 [...]

    >>
    >> Unicode escape sequences use hexadecimal, not decimal.

    >
    > It depends. Unicode consortium publish all its tables in hex values.
    > Nevertheless if you need to use Unicode chars in non-unicode document
    > (for scripting for example), you have to use
    > \u-sequences (\u+digital code value).


    A script can be a Unicode document. Though identifiers must come from a
    limited alphabet, string literals can contain any Unicode character.

    Unicode escape sequences in string literals within scripts *do* require
    hexadecimal characters. HTML entity references can use either decimal or
    hexadecimal (decimal is probably safer).
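
A short sketch of the hexadecimal/decimal distinction, using the character
from this thread (U+B2DB, which is 45787 in decimal). The helper name
htmlEntities is hypothetical.

```javascript
// In a script string literal, \u takes exactly four hexadecimal digits.
const viaHexEscape = "\uB2DB";

// The decimal value can be used through String.fromCharCode instead.
const viaDecimal = String.fromCharCode(45787);

// "\u45787" is NOT character 45787: it is U+4578 followed by the digit "7".
const misread = "\u45787";

// Hypothetical helper: both HTML numeric entity forms for a code point.
function htmlEntities(codePoint) {
  return {
    decimal: "&#" + codePoint + ";",
    hex: "&#x" + codePoint.toString(16).toUpperCase() + ";"
  };
}

console.log(viaHexEscape === viaDecimal); // → true
console.log(misread.length);              // → 2
console.log(htmlEntities(0xB2DB));
```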

    > Again - I'm not saying it's a crucial default, but it is definitely an
    > issue to be addressed in new IE releases.


    However, Microsoft only seem to be issuing security updates. The next full
    release will only be available in Longhorn, or so I've read.

    Mike

    --
    Michael Winter
    Replace ".invalid" with ".uk" to reply by e-mail.
    Michael Winter, Oct 1, 2004
    #8
  9. Al Reynolds

    Jim Ley Guest

    On Fri, 1 Oct 2004 16:29:12 +0200, "VK" wrote:

    >> [Theory]
    >> Declare the document with its correct character set and place the script
    >> in a separate file. If necessary, specify the charset attribute on the
    >> SCRIPT element.
    >> [/Theory]

    >
    >The theory is good, and it's the first thing that came into my head too.
    >But how do you deal with all the little inline onEvent stuff, like
    >"...onChange=update(this.form, this.form)"?
    >It looks like in Unicode it may be transformed in an unpredictable way.


    It's not; current browsers have excellent Unicode support. You've just
    got to declare the character set so the browser knows!

    >Why? And what else may happen with your script on a unicode page? Maybe
    >"x+y=z" can become a Japanese text in some circumstances?


    No, not if you correctly declare the encoding; it simply cannot
    happen.

    >It depends. Unicode consortium publish all its tables in hex values.
    >Nevertheless if you need to use Unicode chars in non-unicode document (for
    >scripting for example), you have to use \u-sequences (\u+digital code
    >value).


    Please read the specifications, Michael was entirely correct:

    \uhhhh - Unicode character represented by the four-digit hexadecimal
    number hhhh.

    >Again - I'm not saying it's a crucial default, but it is definitely an issue
    >to be addressed in new IE releases.


    There's no bug, the bug is in your code.

    Jim.
    Jim Ley, Oct 1, 2004
    #9