Identifiers - UnicodeEscapeSequence

Discussion in 'Javascript' started by Asen Bozhilov, Feb 15, 2010.

  1. The documentation permits `\UnicodeEscapeSequence` to be used in an
    IdentifierName. But it also says:

    | Unicode escape sequences are also permitted in identifiers,
    | where they contribute a single character to the identifier,
    | as computed by the CV of the UnicodeEscapeSequence. The \
    | preceding the UnicodeEscapeSequence does not contribute a
    | character to the identifier. A UnicodeEscapeSequence cannot be
    | used to put a character into an identifier that would otherwise
    | be illegal. In other words, if a \UnicodeEscapeSequence sequence
    | were replaced by its UnicodeEscapeSequence's CV, the result must
    | still be a valid Identifier that has the exact same sequence of
    | characters as the original Identifier.

    As I understand it, if I type:

    var \\u0069\\u0066; //var if;

    `if` is a ReservedWord, so the example above should throw a SyntaxError.

    try {
        eval('var \\u0069\\u0066;'); // var if;
    } catch (e) {
        window.alert(e instanceof SyntaxError);
    }

    Firefox 3.5.7 - No error
    IE6 - true
    Chrome 4.0 - No error
    Opera 9.64 - No error
    Safari 4.0 - No error
    Rhino 1.7R2 - No error
    DMDScript 1.02 - true

    try {
        eval('var \\u0030;'); // var 0;
    } catch (e) {
        window.alert(e instanceof SyntaxError);
    }

    Firefox 3.5.7 - true
    IE6 - true
    Chrome 4.0 - true
    Opera 9.64 - No error
    Safari 4.0 - true
    Rhino 1.7R2 - No error
    DMDScript 1.02 - No error
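
Checks like the two above can be made repeatable with a small harness. The following is a minimal sketch and not part of the original post: `parsesOk` is a hypothetical helper name, and it uses `new Function` instead of `eval` so that the source is only parsed, never run. The result for the escaped cases depends on the engine, as the lists above show, so no expected output is given for them.

```javascript
// Hypothetical helper: reports whether a source string parses at all.
// `new Function` compiles the text as a function body and throws a
// SyntaxError for invalid syntax, without executing the code.
function parsesOk(source) {
    try {
        new Function(source);
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) {
            return false;
        }
        throw e; // anything else is unexpected
    }
}

// Backslashes are doubled so that the parser sees \u0069\u0066 and \u0030.
var cases = ['var x;', 'var 0;', 'var \\u0069\\u0066;', 'var \\u0030;'];
var report = cases.map(function (src) {
    return src + ' -> ' + (parsesOk(src) ? 'No error' : 'SyntaxError');
});
console.log(report.join('\n'));
```

The first two cases are controls: `var x;` must parse and `var 0;` must not, whatever the engine does with the escaped forms.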

    My question is: what is the proper behavior according to the
    specification? I think `var \\u0069\\u0066;` should throw a
    SyntaxError.

    Thanks.
     
    Asen Bozhilov, Feb 15, 2010
    #1

  2. On Feb 15, 2:38 pm, Asen Bozhilov wrote:
    > My question is, what is the proper behavior related with
    > specification? I think if i have `var \\u0069\\u0066;` should throw
    > SyntaxError.


    I don't know the spec well enough to answer. But I'm wondering if you
    would expect an error from this as well:

    window["if"] = 10;

    I can't see why either should throw an error. The only reason to
    disallow a reserved word as an identifier name is to make it
    unambiguous to the ES engine what is meant by the term. There is no
    such provision for keywords to be written via Unicode escapes, is
    there? If not, then there is no ambiguity about what
    "\\u0069\\u0066" should represent.

    -- Scott
     
    Scott Sauyet, Feb 15, 2010
    #2

  3. Asen Bozhilov writes:

    > Documentation permit to be used `\UnicodeEscapeSequence` in
    > IdentifierName. But there:
    >
    > | Unicode escape sequences are also permitted in identifiers,
    > | where they contribute a single character to the
    > | identifier, as computed by the CV of the
    > | UnicodeEscapeSequence.


    This is the important part. It allows unicode escapes in identifiers.
    There is no similar statement for any of the reserved words, so
    unicode escapes cannot be used in a keyword.

    > | The \ preceding the UnicodeEscapeSequence does not contribute a
    > | character to the identifier. A UnicodeEscapeSequence cannot be
    > | used to put a character into an identifier that would otherwise
    > | be illegal. In other words, if a \UnicodeEscapeSequence sequence
    > | were replaced by its UnicodeEscapeSequence's CV, the result must
    > | still be a valid Identifier that has the exact same sequence of
    > | characters as the original Identifier.
    >
    > As i understand it. If i type:
    >
    > var \\u0069\\u0066; //var if;


    (I assume it should be single backslashes when not in a string :)

    > `if` is ReservedWord and example above, should throw SyntaxError.


    No. While 'if' is a keyword, it is only the sequence U+0069 U+0066
    that is recognized as the 'if' keyword. Unicode escapes are not allowed
    as parts of keywords. The above, correctly, declares a variable called
    'if' - because "\u0069\u0066" matches the production of an identifier
    and it doesn't match the production of any reserved word.

    The inputs "if" and "i\u0066" are different sequences of characters.
    They are parsed differently. The latter is parsed as an identifier.
    An identifier is represented as a sequence of code points. It just
    happens that "i\u0066", "\u0069f" and "\u0069\u0066" all parse to
    identifiers represented by U+0069 U+0066, and "if" does not.
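
The claim that the three escaped spellings denote the same code points can be checked by hand. This is a minimal sketch, assuming only four-digit `\uXXXX` escapes; `decodeUnicodeEscapes` is a hypothetical name, not something an engine exposes.

```javascript
// Replace each \uXXXX escape with the character it denotes (its CV).
function decodeUnicodeEscapes(raw) {
    return raw.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}

// Backslashes are doubled so each string holds the raw source spelling.
var spellings = ['i\\u0066', '\\u0069f', '\\u0069\\u0066'];
var decoded = spellings.map(decodeUnicodeEscapes);

// Every spelling decodes to the two code points U+0069 U+0066.
console.log(decoded.every(function (name) { return name === 'if'; })); // true
```

This only shows that the spellings share a value. Whether that value may then be used as an Identifier is the question the thread is arguing about.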

    ....
    > My question is, what is the proper behavior related with
    > specification?


    Yes.

    > I think if i have `var \\u0069\\u0066;` should throw
    > SyntaxError.


    The operative part of the ECMA-262 standard is in section 7.6, which
    you quote. It allows escape sequences in identifiers. No such
    allowance is given for keywords or other reserved words, so anything
    containing a Unicode escape is not a keyword.

    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Feb 15, 2010
    #3
  4. Lasse Reichstein Nielsen wrote:

    > Asen Bozhilov <> writes:
    >> Documentation permit to be used `\UnicodeEscapeSequence` in
    >> IdentifierName. But there:
    >>
    >> | Unicode escape sequences are also permitted in identifiers,
    >> | where they contribute a single character to the
    >> | identifier, as computed by the CV of the
    >> | UnicodeEscapeSequence.

    >
    > This is the important part. It allows unicode escapes in identifiers.


    But none that would not be allowed if the character was included verbatim.

    > There is no similar statement for any of the reserved words, so
    > unicode escapes cannot be used in a keyword.


    You have got it backwards.

    >> | The \ preceding the UnicodeEscapeSequence does not contribute a
    >> | character to the identifier. A UnicodeEscapeSequence cannot be
    >> | used to put a character into an identifier that would otherwise be
    >> | illegal. In other words, if a \UnicodeEscapeSequence sequence were
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >> | replaced by its UnicodeEscapeSequence's CV, the result must still
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >> | be a valid Identifier that has the exact same sequence of characters
         ^^^^^^^^^^^^^^^^^^^^^
    >> | as the original Identifier.


    I do not think it can be worded more clearly.

    >> As i understand it. If i type:
    >>
    >> var \\u0069\\u0066; //var if;

    >
    > (I assume it should be single backslashes when not in a string :)


    Why, the double backslashes are legal, too. However the resulting value
    would still not be an /Identifier/, barring language extensions.

    >> `if` is ReservedWord and example above, should throw SyntaxError.

    >
    > No.


    True, but the program ought to be syntactically in error nonetheless.

    > While 'if' is a keyword, it is only the sequence U+0069 U+0066
    > that is recognized as the 'if' keyword. Unicode escapes are not allowed
    > as parts of keywords. The above, correctly, declares a variable called
    > 'if' - because "\u0069\u0066" matches the production of an identifier
    > and it doesn't match the production of any reserved word.
    > The inputs, "if" and "i\u0066" are different sequences of characters.
    > They are parsed differently. The latter is parsed as an identifier.


    Your logic is flawed, because escape sequences are converted into the
    corresponding Unicode characters (the character is the Computed Value)
    *before* the tokenization process takes place that follows from applying
    the syntactical grammar:

    | 5.1.4
    |
    | [...]
    | When a stream of characters is to be parsed as an ECMAScript program, it
    | is first converted to a stream of input elements by repeated application
    | of the lexical grammar; this stream of input elements is then parsed by
    | a single application of the syntactic grammar. The program is
    | syntactically in error if the tokens in the stream of input elements
    | cannot be parsed as a single instance of the goal nonterminal /Program/,
    | with no tokens left over.

    /UnicodeEscapeSequence/ is a goal symbol of the lexical grammar as is
    /Keyword/; /IfStatement/ is a goal symbol of the syntactic grammar.

    As a result, first application of the lexical grammar ought to cause

    var \u0069\u0066

    to become

    var if

    and second application of the lexical grammar ought to cause `if' to be
    parsed as a /Keyword/:

    | Keyword :: one of
    | [...] if [...]

    Then, application of the syntactic grammar ought to cause

    var if

    to be recognized as theoretically producible by

    VariableStatement :
        VariableDeclarationList

    VariableDeclarationList :
        VariableDeclaration

    VariableDeclaration :
        Identifier Initialiser_opt

    which ought to fail because the token `if' has been determined a /Keyword/
    before, not an /Identifier/, and no other productions of the syntactic
    grammar would be applicable.
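
The steps above can be modelled with a toy three-stage pipeline. This is only a sketch of the reading argued in this post, which other participants in the thread dispute, and it is not a real ECMAScript lexer. The names `KEYWORDS`, `decodeEscapes`, `tokenize`, `isVariableStatement` and `acceptsUnderThisReading` are hypothetical, the keyword list is abbreviated, and only the `var Identifier` form is handled.

```javascript
// Abbreviated keyword list, for illustration only.
var KEYWORDS = ['var', 'if', 'else', 'for', 'while', 'function', 'return'];

// Stage 1: replace every \uXXXX escape with its CV.
function decodeEscapes(text) {
    return text.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
        return String.fromCharCode(parseInt(hex, 16));
    });
}

// Stage 2: split on whitespace and classify each word.
// Anything that is not a listed keyword is treated as an Identifier here;
// a real lexer would also validate the identifier's characters.
function tokenize(text) {
    return text.trim().split(/\s+/).map(function (word) {
        return {
            type: KEYWORDS.indexOf(word) !== -1 ? 'Keyword' : 'Identifier',
            value: word
        };
    });
}

// Stage 3: accept only the token sequence `var` followed by an Identifier.
function isVariableStatement(tokens) {
    return tokens.length === 2 &&
        tokens[0].type === 'Keyword' && tokens[0].value === 'var' &&
        tokens[1].type === 'Identifier';
}

function acceptsUnderThisReading(source) {
    return isVariableStatement(tokenize(decodeEscapes(source)));
}

console.log(acceptsUnderThisReading('var \\u0069\\u0066')); // false
console.log(acceptsUnderThisReading('var foo'));            // true
```

Under the opposite reading, stage 1 would not run before classification, so the escaped spelling would never match an entry in `KEYWORDS` and would be classified as an Identifier.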

    Therefore, the program ought to be considered syntactically in error. That
    it might not, could only be attributed to a proprietary extension. Hence
    the clarification as quoted above:

    | A UnicodeEscapeSequence cannot be used to put a character into an
    | identifier that would otherwise be illegal. [...]


    PointedEars
    --
    Danny Goodman's books are out of date and teach practices that are
    positively harmful for cross-browser scripting.
    -- Richard Cornford, cljs, <cife6q$253$1$> (2004)
     
    Thomas 'PointedEars' Lahn, Feb 16, 2010
    #4
  5. Thomas 'PointedEars' Lahn wrote:

    > Lasse Reichstein Nielsen wrote:
    >> Asen Bozhilov <> writes:
    >>> As i understand it. If i type:
    >>>
    >>> var \\u0069\\u0066; //var if;

    >>
    >> (I assume it should be single backslashes when not in a string :)

    >
    > Why, the double backslashes are legal, too.


    Ignore that, I went too far here.

    | IdentifierStart ::
    |     UnicodeLetter
    |     $
    |     _
    |     \ UnicodeEscapeSequence
    |
    | [...]
    | UnicodeEscapeSequence ::
    |     u HexDigit HexDigit HexDigit HexDigit
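
The productions quoted above leave room for exactly one backslash before the `u`. As a rough illustration, assuming a regular expression is an acceptable stand-in for the grammar, the hypothetical pattern `ESCAPE` below accepts the single-backslash spelling and rejects the doubled one.

```javascript
// \ UnicodeEscapeSequence: one backslash, `u`, then exactly four hex digits.
var ESCAPE = /^\\u[0-9a-fA-F]{4}$/;

var single = '\\u0069';     // raw source text: \u0069
var doubled = '\\\\u0069';  // raw source text: \\u0069

console.log(ESCAPE.test(single));  // true
console.log(ESCAPE.test(doubled)); // false
```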


    PointedEars
     
    Thomas 'PointedEars' Lahn, Feb 16, 2010
    #5
  6. Lasse Reichstein Nielsen wrote:
    > Asen Bozhilov writes:


    > > var \\u0069\\u0066; //var if;

    >
    > (I assume it should be single backslashes when not in a string :)
    >
    > > `if` is ReservedWord and example above, should throw SyntaxError.


    Yes. Should be:

    var \u0069\u0066;

    The double backslashes are there because I copied it from the string
    passed to `eval'. That was my mistake.

    > No. While 'if' is a keyword, it is only the sequence U+0069 U+0066
    > that is recognized as the 'if' keyword. Unicode escapes are not allowed
    > as parts of keywords. The above, correctly, declares a variable called
    > 'if' - because "\u0069\u0066" matches the production of an identifier
    > and it doesn't match the production of any reserved word.


    I agree with this reading of the specification.

    | 6 Source Text
    | [...]
    | In string literals, regular expression literals and identifiers,
    | any character (code point) may also be expressed as a
    | Unicode escape sequence consisting of six characters,
    | namely \u plus four hexadecimal digits.

    You are correct, and the next example proves your point.

    try {
        \u0069\u0066 (true);
    } catch (e) {
        window.alert(e instanceof ReferenceError); //true
    }

    `\u0069\u0066 (true);` is evaluated as an `ExpressionStatement`
    terminated by an explicit semicolon, not as an `IfStatement` whose
    body is the `EmptyStatement` `;`.

    > The operative part of the ECMA262 standard is in section 7.6, which
    > you quote. It allows escape sequences in identifiers.  No such
    > allowance are given for keywords or other reserved words - so anything
    > containing a unicode escape is not a keyword.


    I am confused by:

    | A UnicodeEscapeSequence cannot be
    | used to put a character into an identifier that
    | would otherwise be illegal. In other words, if a
    | \UnicodeEscapeSequence sequence were replaced by
    | its UnicodeEscapeSequence's CV,
    | the result must still be a
    | valid Identifier

    As I understand it, if I replace the escapes in:

    var \u0069\u0066;

    with their character values (CV), I get:

    var if;

    And the grammar for `Identifier` doesn't allow an identifier
    named `if`:

    Identifier ::
        IdentifierName but not ReservedWord

    because `if` is a keyword and is listed in `7.5.1 Reserved Words`.

    Thanks for this comment, but why doesn't the specification say
    anything explicit about this case?
     
    Asen Bozhilov, Feb 16, 2010
    #6
  7. Thomas 'PointedEars' Lahn writes:

    ....
    >>> | illegal. In other words, if a \UnicodeEscapeSequence sequence were
    >                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >>> | replaced by its UnicodeEscapeSequence's CV, the result must still
    >     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    >>> | be a valid Identifier that has the exact same sequence of characters
    >     ^^^^^^^^^^^^^^^^^^^^^
    >>> | as the original Identifier.

    >
    > I do not think it can be worded more clearly.


    I must admit that, on second thought, I tend to agree with that
    interpretation.
    However, it seems that IE is the only browser that agrees. All of
    Opera, Firefox, Chrome and Safari accept \u0069\u0066 as an identifier.

    /L
    --
    Lasse Reichstein Holst Nielsen
    'Javascript frameworks is a disruptive technology'
     
    Lasse Reichstein Nielsen, Feb 16, 2010
    #7