Are “extended characters” safe in identifiers?

Ry Nohryb

Or are there some pitfalls? Various coding conventions as well as
practical editing issues (you can’t be sure of always being able to edit
your code in a Unicode-enabled editor) aside, is there still some real
technical reason to stick to the A–Z, a–z, 0–9, “$”, “_” repertoire?

If the <script src=''> body is sent with the proper charset in its
http headers, or (when the headers don't say) the <script src=''> has
the proper charset attribute, or idem for the containing page for
inlined <script>s... then perhaps when/if it fails it won't be your
fault :)
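
For instance (a minimal sketch; the file name and identifier are made
up), an external script saved and served as UTF-8 can use a non-ASCII
identifier, as long as the declared charset matches what was saved:

  // contents of a hypothetical extended.js, saved as UTF-8 and loaded
  // with <script src="extended.js" charset="UTF-8"></script> (or with a
  // matching charset in the HTTP Content-Type header); if the declared
  // charset is wrong, the identifier may be mangled before the parser
  // ever sees it
  var añoActual = 2011;
  alert(añoActual + 1); // 2012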
 
Jukka K. Korpela

23.5.2011 15:06, Ry Nohryb wrote:
Ha!, so if


why function ƒ (){} is not a good use for it?

Because the name of a function should reflect its specific meaning and
use, not its being a function. The symbol “ƒ” might conceivably be used
to denote a formal parameter that corresponds to the notion “any function”.
And ∑ and ∆ for example, would have been good names too, for summation
and increment, don't you think so ?

Well, they are operators rather than functions – we write ∆x and not
∆(x), but the difference is somewhat subtle.

I guess people who defined the identifier syntax just thought that
identifiers are names, rather than operators. Maybe in ten years, if the
current syntax is taken into use, further extensions will be considered.

It would be rather cool to be able to declare a variable like ∆x. We
could cheat by writing Δx, and few people would notice the difference (Δ
= U+0394 GREEK CAPITAL LETTER DELTA).
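
A quick sketch of the difference, as ES5 classifies the characters:

  var Δx = 0.001;     // fine: Δ (U+0394) is a Unicode letter (category Lu),
                      // so it may appear in an identifier
  // var ∆x = 0.001;  // SyntaxError: ∆ (U+2206 INCREMENT) is a math symbol
  var Σ = function (a, b) { return a + b; };  // Σ (U+03A3) is a letter, too
  // var ∑ = ...      // but ∑ (U+2211 N-ARY SUMMATION) is not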
 
Ry Nohryb

It would be rather cool to be able to declare a variable like ∆x. We
could cheat by writing Δx, and few people would notice the difference (Δ
= U+0394 GREEK CAPITAL LETTER DELTA).

Wow, you're a fountain of wisdom :)

Thanks (again) !

These delta and sigma (as capital Greek letters) are permitted, unlike
the "other" ones (INCREMENT and N-ARY SUMMATION) under Math symbols.
Silly me, I thought they were the same thing!
 
Thomas 'PointedEars' Lahn

Jukka said:
In practice, Firefox, IE, Opera, and Chrome all seem to limit identifier
characters to those for which support is required in the standard. So
the Phoenician letter \uD802\uDD0E won’t do in an identifier (even though
it’s OK in a string literal), making Phoenician programmers very sad.

:)

However, that this works in a string literal [in "Chromium 11.0.696.68
(84545) Built on Debian unstable, running on Debian 6.0.1"] really should be
considered a quirk of the runtime environment (in particular, its display
functionality), not even a language extension, and certainly not a reliable
feature.

Because the productions that apply to Unicode escape sequences in
identifiers apply to Unicode escape sequences in string literals and Unicode
escape sequences in literals, too. Currently, standard ECMAScript syntax
does not allow or provide for the possibility to refer to characters beyond
the BMP with (combinations of) Unicode escape sequences nor does it require
conforming implementations to support the corresponding raw-written
characters, as the required number of hexadecimal characters in a Unicode
escape sequence is 4, and conforming implementations are only required to
consider the BMP.

What you have written should be parsed into *two* Unicode characters [ES5,
7.8.4]. Accordingly, the value of the `length' property of such a String
instance in the aforementioned runtime environment is 2, not 1.
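
A sketch of both observations; the results are what the behaviour
described above implies, not something the Specification guarantees:

  var phoenician = "\uD802\uDD0E"; // accepted: any \uXXXX may appear in a
                                   // string literal
  phoenician.length;               // 2 -- two 16-bit values, not one character
  phoenician.charCodeAt(0);        // 0xD802 (high surrogate)
  phoenician.charCodeAt(1);        // 0xDD0E (low surrogate)

  // var \uD802\uDD0E = 1;         // SyntaxError: each escape contributes one
                                   // (surrogate) value, and surrogates are not
                                   // Unicode letters, so this is no valid
                                   // IdentifierName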


PointedEars
 
Stanimir Stamenkov

Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:
What you have written should be parsed into *two* Unicode characters [ES5,
7.8.4]. Accordingly, the value of the `length' property of such a String
instance in the aforementioned runtime environment is 2, not 1.

Don't you mean "parsed into two _JavaScript_ character values", instead?
 
Stanimir Stamenkov

Sat, 04 Jun 2011 17:56:56 +0300, /Stanimir Stamenkov/:
Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:
What you have written should be parsed into *two* Unicode characters [ES5,
7.8.4]. Accordingly, the value of the `length' property of such a String
instance in the aforementioned runtime environment is 2, not 1.

Don't you mean "parsed into two _JavaScript_ character values", instead?
[ES5]:

*4.3.16*
*String value*
primitive value that is a finite ordered sequence of zero or more
16-bit unsigned integer.

NOTE A String value is a member of the String type. Each integer
value in the sequence usually represents a single 16-bit unit of
UTF-16 text. However, ECMAScript does not place any restrictions or
requirements on the values except that they must be 16-bit unsigned
integers.

Having pointed out all of the above, I don't think you're correct in
your previous reply:

Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:
However, that this works in a string literal really should be
considered a quirk of the runtime environment (in particular, its display
functionality), not even a language extension, and certainly not a reliable
feature.

I don't really see what the "display functionality" has to do with
whether one is permitted to have \uD802\uDD0E in a string literal.
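
The value exists independently of whether anything can display it; a
minimal sketch:

  var lone = "\uD802";                  // a high surrogate with no pair
  lone.length;                          // 1
  lone === String.fromCharCode(0xD802); // true: just another 16-bit unit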
 
Thomas 'PointedEars' Lahn

Stanimir said:
Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:
What you have written should be parsed into *two* Unicode characters
[ES5, 7.8.4]. Accordingly, the value of the `length' property of such a
String instance in the aforementioned runtime environment is 2, not 1.

Don't you mean "parsed into two _JavaScript_ character values", instead?

Yes, I don't.


PointedEars
 
Stanimir Stamenkov

Sat, 04 Jun 2011 23:17 +0200, /Thomas 'PointedEars' Lahn/:
Stanimir said:
Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:
What you have written should be parsed into *two* Unicode characters
[ES5, 7.8.4]. Accordingly, the value of the `length' property of such a
String instance in the aforementioned runtime environment is 2, not 1.

Don't you mean "parsed into two _JavaScript_ character values", instead?

Yes, I don't.

Is it just your bad English, or is it just your stupid attitude I
honestly don't understand, no matter how I try? You don't seem to
accept when you've been corrected, even when you're obviously wrong.
 
Stanimir Stamenkov

Sun, 05 Jun 2011 00:28:39 +0300, /Stanimir Stamenkov/:
Sat, 04 Jun 2011 23:17 +0200, /Thomas 'PointedEars' Lahn/:

Is it just your bad English, or is it just your stupid attitude I
honestly don't understand, no matter how I try?

I admit I could have misunderstood you because of my bad English,
but your terse answer is at least ambiguous to me.
 
Thomas 'PointedEars' Lahn

Stanimir said:
Sat, 04 Jun 2011 23:17 +0200, /Thomas 'PointedEars' Lahn/:
Stanimir said:
Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:
What you have written should be parsed into *two* Unicode characters
[ES5, 7.8.4]. Accordingly, the value of the `length' property of such a
String instance in the aforementioned runtime environment is 2, not 1.
Don't you mean "parsed into two _JavaScript_ character values", instead?
Yes, I don't.

Is it just your bad English, or is it just your stupid attitude I
honestly don't understand, no matter how I try? You don't seem to
accept when you've been corrected, even when you're obviously wrong.

*You* are obviously wrong, and apparently you are the one with the stupid
attitude. There is no such thing as a "JavaScript character value", and
trying to insult me does not change that.


PointedEars
 
Jukka K. Korpela

Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:

I don't really see what the "display functionality" has to do with
whether one is permitted to have \uD802\uDD0E in a string literal.

It is impossible (and unnecessary) to know whether the troll is
thoroughly confused or just trying to confuse us. He wrote as if he did
not know the difference between a Unicode character and a JavaScript
character (or, to say it the way the troll wants everyone to say it, no
matter how clumsy it is: the character concept as defined and used in the
ECMA-262 standard), which corresponds to the Unicode concept of “code point”.

In a string literal, any \uXXXX (where X is a hexadecimal digit) is
allowed, even if it corresponds to an unassigned code point, a code
point designated as noncharacter, or a surrogate code point. Display
functionality indeed has nothing to do with this.
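
For example (the particular code points are ones I believe fit each
case):

  var unassigned   = "\u0378"; // U+0378: an unassigned code point
  var noncharacter = "\uFFFF"; // U+FFFF: a designated noncharacter
  var surrogate    = "\uDD0E"; // an unpaired low surrogate
  // all three are syntactically valid string literals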

The interpretation of code points is a different issue, and it is a
somewhat grey area what should happen when a string is to be displayed
via alert() or inserted into an HTML document via document.write() for
example. It is absurd to label the behavior of implementations as a
“quirk” if it does the most sensible thing that can be imagined for a
surrogate pair: to interpret it as a single Unicode character.

I tested on Firefox, IE, Opera, Safari, and Chrome, and on all of them,
"\uD802\uDD0E" was treated as U+1090E (i.e., the character denoted by
the surrogate pair D802 DD0E when using UTF-16) when written with
document.write().
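
Roughly the test I used (a sketch):

  // in an HTML document, given a suitable font, both lines should put
  // the same Phoenician letter into the page:
  document.write("\uD802\uDD0E"); // surrogate pair for U+1090E
  document.write("&#x1090E;");    // the equivalent numeric character reference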

Browsers may have difficulties in _rendering_ non-BMP characters, but
then they have the same difficulties when the character is included in
static content, as with the reference . (You need a font like
Code2001 or some very special font to render Phoenician letters. And
something has happened to www.code2000.net that used to host Code2001.
And the glyph for U+1090E in Code2001 looks odd – rather different from
the representative glyph in the Unicode standard.)
 
Stanimir Stamenkov

Sat, 04 Jun 2011 18:25:43 +0300, /Thomas 'PointedEars' Lahn/:
*You* are obviously wrong, and apparently you are the one with the stupid
attitude. There is no such thing as a "JavaScript character value", and
trying to insult me does not change that.

The character value is documented right in the ES5 chapter you've
referred to previously "7.8.4 String Literals":
*Semantics*

A string literal stands for a value of the String type. The String
value (SV) of the literal is described in terms of character values
(CV) contributed by the various parts of the string literal.

In a subsequent reply I've also referred to ES5 "4.3.16 String
value" which says a string value is a sequence of 16-bit units, not
a sequence of Unicode characters. So you're obviously wrong, again.

A single character value can represent a Unicode character with a
code point only up to U+FFFF. You can still represent Unicode
characters above that code point using multiple character values,
but it is up to you to interpret them properly where needed.
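
For instance, a minimal sketch of doing that interpretation by hand (the
helper name is made up):

  // combine a surrogate pair into the supplementary code point it encodes
  function toCodePoint(str, i) {
    var hi = str.charCodeAt(i), lo = str.charCodeAt(i + 1);
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
    }
    return hi; // not a surrogate pair: the unit is the code point itself
  }
  toCodePoint("\uD802\uDD0E", 0).toString(16); // "1090e"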
 
Thomas 'PointedEars' Lahn

Stanimir said:
Sat, 04 Jun 2011 18:25:43 +0300, /Thomas 'PointedEars' Lahn/:

The character value is documented right in the ES5 chapter you've
referred to previously "7.8.4 String Literals":

ECMAScript is not JavaScript.
In a subsequent reply I've also referred to ES5 "4.3.16 String
value" which says a string value is a sequence of 16-bit units,

I am aware of that.
not a sequence of Unicode characters.

That is not stated anywhere. It is stated in ES5, section 6:

| In string literals, […] any character (code point) may also be expressed
| as a Unicode escape sequence consisting of six characters, namely \u plus
| four hexadecimal digits. […] Within a string literal […] the Unicode
| escape sequence contributes one character to the value of the literal.

That part of the Specification is normative.
So you're obviously wrong, again.
No.

A single character value can represent a Unicode character with a
code point only up to U+FFFF.

Yes, although Unicode does not assign a character to U+FFFE and U+FFFF.
You can still represent Unicode characters above that code point using
multiple character values,

You can try to do that.
but it is up to you to interpret them properly where needed.

Yes, AISB it is "a quirk of the runtime environment", not something that
must work. But I am willing to concede that there is a problem with the
length of a string per the Specification as only 16-bit (character) values
are counted, not characters [ES5, 8.4].


PointedEars
 
Thomas 'PointedEars' Lahn

Jukka said:
Fri, 03 Jun 2011 02:12:18 +0200, /Thomas 'PointedEars' Lahn/:
I don't really see what the "display functionality" has to do with
whether one is permitted to have \uD802\uDD0E in a string literal.

[yet another uncalled-for ad-hominem attack]
In a string literal, any \uXXXX (where X is a hexadecimal digit) is
allowed,

Nobody debated that.
even if it corresponds to an unassigned code point,

Nobody debated that either.
a code point designated as noncharacter,
Ditto.

or a surrogate code point.
Ditto.

Display functionality indeed has nothing to do with this.

Yes, it does.
The interpretation of code points is a different issue, and it is a
somewhat grey are what should happen when a string is to be displayed
via alert() or inserted into an HTML document via document.write() for
example.

What you call "interpretation of code points" (which is wrong, you mean code
_units_) I referred to as "display functionality".
It is absurd to label the behavior of implementations as a
“quirk” if it does the most sensible thing that can be imagined for a
surrogate pair: to interpret it as a single Unicode character.

You have not read the Specification, have you?
I tested on Firefox, IE, Opera, Safari, and Chrome, and on all of them,
"\uD802\uDD0E" was treated as U+1090E (i.e., the character denoted by
the surrogate pair D802 DD0E when using UTF-16) when written with
document.write().

In the same operating system, on the same platform, I suppose.
Browsers may have difficulties in _rendering_ non-BMP characters,

They will. This is also a font issue.
but then they have the same difficulties when the character is included in
static content, as with the reference .

Ignoratio elenchi.


PointedEars
 
Stanimir Stamenkov

Sun, 05 Jun 2011 10:21:41 +0200, /Thomas 'PointedEars' Lahn/:
I am aware of that.


That is not stated anywhere.

For your convenience, citing it once again:

| 4.3.16
| *String value*
| primitive value that is a finite ordered sequence of zero or more
| 16-bit unsigned integer.
|
| NOTE A String value is a member of the String type. Each integer
| value in the sequence usually represents a single 16-bit unit of
| UTF-16 text. However, ECMAScript does not place any restrictions
| or requirements on the values except that they must be 16-bit
| unsigned integers.

Which part of "16-bit unsigned integer" don't you understand?
It is stated in ES5, section 6:

| In string literals, […] any character (code point) may also be expressed
| as a Unicode escape sequence consisting of six characters, namely \u plus
| four hexadecimal digits. […] Within a string literal […] the Unicode
| escape sequence contributes one character to the value of the literal.

That part of the Specification is normative.

You still don't seem to understand what the above says: "the
Unicode escape sequence contributes one character to the value of
the literal" means it contributes a character value - not a Unicode
character. The character value itself is a 16-bit unit which may or
may not be part of a valid UTF-16 sequence.
 
Thomas 'PointedEars' Lahn

Stanimir said:
/Thomas 'PointedEars' Lahn/:
I am aware of that.


That is not stated anywhere.

For your convenience, citing it once again: […]

What you cite does not confirm or contain the quoted statement.
It is stated in ES5, section 6:

| In string literals, […] any character (code point) may also be
| expressed as a Unicode escape sequence consisting of six characters,
| namely \u plus four hexadecimal digits. […] Within a string literal
| […] the Unicode escape sequence contributes one character to the value
| of the literal.

That part of the Specification is normative.

You still don't seem to understand what the above says: "the
Unicode escape sequence contributes one character to the value of
the literal" means it contributes a character value […]

That is only your interpretation, not what is written there. There is at
least doubt as to whether the Specification implies or intended the
representation of characters beyond the BMP with two consecutive Unicode
escape sequences, and the fact that the `length' property would have the
value 2 (because it only considers the number of 16-bit integer values)
casts serious doubt upon it. Also, all protests do not change the fact that

| 2 Conformance
| […]
| A conforming implementation of this International standard shall interpret
| characters in conformance with the Unicode Standard, Version 3.0 or later
| and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted
| encoding form, implementation level 3. If the adopted ISO/IEC 10646-1
| subset is not otherwise specified, it is presumed to be the BMP subset,
| collection 300. […]


PointedEars
 
Thomas 'PointedEars' Lahn

Thomas said:
Stanimir said:
/Thomas 'PointedEars' Lahn/:
It is stated in ES5, section 6:

| In string literals, […] any character (code point) may also be
| expressed as a Unicode escape sequence consisting of six characters,
| namely \u plus four hexadecimal digits. […] Within a string literal
| […] the Unicode escape sequence contributes one character to the value
| of the literal.

That part of the Specification is normative.

You still don't seem to understand what the above says: "the
Unicode escape sequence contributes one character to the value of
the literal" means it contributes a character value […]

That is only your interpretation, not what is written there. There is at
least doubt as to whether the Specification implies or intended the
representation of characters beyond the BMP with two consecutive Unicode
escape sequences, and the fact that the `length' property would have the
value 2 (because it only considers the number of 16-bit integer values)
casts serious doubt upon it. Also, all protests do not change the fact
that

| 2 Conformance
| […]
| A conforming implementation of this International standard shall
| interpret characters in conformance with the Unicode Standard, Version
| 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the
| adopted encoding form, implementation level 3. If the adopted ISO/IEC
| 10646-1 subset is not otherwise specified, it is presumed to be the BMP
| subset, collection 300. […]

In addition, in a recent es-discuss message, Allen Wirfs-Brock (who is more
or less responsible for the wording) makes it clear that it was not intended
that implementations of ES5 handle surrogate pairs:

,-<https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html>
|
| […]
| The ES5 specification language clearly still has issues WRT Unicode
| encoding of programs and strings. These need to be fixed in the next
| edition. However, interpreting the current language as allow supplemental
| characters to occur in program text and particularly string literals
| doesn't match either reality or the intent of the ES5 spec. […]

In one of the follow-ups, Douglas Crockford suggests \u{HHHHHH} to support
that instead. See also
<http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings>
of "2011/05/19 22:18 by allen".


PointedEars
 
Stanimir Stamenkov

Sun, 05 Jun 2011 18:11:14 +0200, /Thomas 'PointedEars' Lahn/:
Thomas said:
That is only your interpretation, not what is written there...

| 2 Conformance
| […]
| A conforming implementation of this International standard shall
| interpret characters in conformance with the Unicode Standard, Version
| 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the
| adopted encoding form...

In addition, in a recent es-discuss message, Allen Wirfs-Brock (who is more
or less responsible for the wording) makes it clear that it was not intended
that implementations of ES5 handle surrogate pairs:

,-<https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html>
|
| […]
| The ES5 specification language clearly still has issues WRT Unicode
| encoding of programs and strings. These need to be fixed in the next
| edition. However, interpreting the current language as allow supplemental
| characters to occur in program text and particularly string literals
| doesn't match either reality or the intent of the ES5 spec. […]

Reading through all of this, it really suggests surrogates are not
handled just for source encoding (and I suspect you're arguing just
this from the beginning, which however is not what we're talking about):

| In drafting the ES5 spec, TC39 had two goals WRT character
| encoding. We wanted to allow the occurrences of (BMP) characters
| defined in Unicode versions beyond 2.1 and we wanted to update
| the specification to reflect actual implementation reality that
| source was processed as if it was UCS-2.

Which is fine and dandy. It doesn't mean one can't have surrogate
code points inserted as \uXXXX in the source, nor that any such
values are prohibited in strings during run-time.
In one of the follow-ups, Douglas Crockford suggests \u{HHHHHH} to support
that instead. See also
<http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings>
of "2011/05/19 22:18 by allen".

Seems like more syntactic sugar. It will still result in inserting a
pair of (or more) UTF-16 units, and String.length will still give the
number of 16-bit units contained.
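
That is, assuming an engine that accepted the proposed escape (it was
only a strawman at the time), something like:

  var s = "\u{1090E}";  // hypothetical code-point escape for U+1090E
  s === "\uD802\uDD0E"; // true: it would expand to the surrogate pair
  s.length;             // still 2: length counts 16-bit units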
 
