Are “extended characters” safe in identifiers?

Jukka K. Korpela

The syntax of ECMAScript has allowed “extended characters” in
identifiers since 3rd edition (1999). This means, among other things,
allowing any Unicode letters, like Greek, Arabic, and Cyrillic letters
as well as e.g. Chinese ideographs. As far as I can see, this has been
supported in web browsers for a long time (e.g., ever since IE 5.5).

So is it really safe to use them, writing, say

var π = Math.PI;
var ผลบวก = 0;
function Götterdämmerung()

or are there some pitfalls? Various coding conventions as well as
practical editing issues (you can’t be sure of always being able to edit
your code on a Unicode-enabled editor) aside, is there still some real
technical reason to stick to the A–Z, a–z, 0–9, “$”, “_” repertoire?
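
As a concrete illustration of the question, here is a minimal sketch of the three declarations made runnable. It assumes the source file is saved and served as UTF-8; the function body and the arithmetic are only there for illustration and are not from the original post.

```javascript
// Assumption: this file is stored and delivered as UTF-8.
var π = Math.PI;
var ผลบวก = 0;

// Illustrative body only; the original post shows just the function name.
function Götterdämmerung() {
  return 2 * π;
}

ผลบวก += Götterdämmerung();
console.log(ผลบวก === 2 * Math.PI); // true

// The same identifier can also be spelled with a Unicode escape,
// which keeps the source itself pure ASCII.
console.log(\u03c0 === π); // true
```

The escape form `\u03c0` names the same variable as `π`, so it is one way to sidestep editor and encoding problems at the cost of readability.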
 
Martin Honnen

Jukka said:
The syntax of ECMAScript has allowed “extended characters” in identifiers since 3rd edition (1999). This means, among other things, allowing any Unicode letters, like Greek, Arabic, and Cyrillic letters as well as e.g. Chinese ideographs. As far as I can see, this has been supported in web browsers for a long time (e.g., ever since IE 5.5).

So is it really safe to use them, writing, say

var π = Math.PI;
var ผลบวก = 0;
function Götterdämmerung()

or are there some pitfalls? Various coding conventions as well as practical editing issues (you can’t be sure of always being able to edit your code on a Unicode-enabled editor) aside, is there still some real technical reason to stick to the A–Z, a–z, 0–9, “$”, “_” repertoire?

I think it is technically safe but I don't see people doing that,
neither in Javascript nor in other languages like C# which also allow
more than ASCII letters in identifiers. But the reason is probably
coding conventions, editing issues and keeping code readable and
understandable internationally. And partly maybe ignorance that more
than ASCII can be used.
 
Thomas 'PointedEars' Lahn

Jukka said:
The syntax of ECMAScript has allowed “extended characters” in
identifiers since 3rd edition (1999). This means, among other things,
allowing any Unicode letters, like Greek, Arabic, and Cyrillic letters
as well as e.g. Chinese ideographs. As far as I can see, this has been
supported in web browsers for a long time (e.g., ever since IE 5.5).

So is it really safe to use them, writing, say

var π = Math.PI;
var ผลบวก = 0;
function Götterdämmerung()

or are there some pitfalls? Various coding conventions as well as
practical editing issues (you can’t be sure of always being able to edit
your code on a Unicode-enabled editor) aside, is there still some real
technical reason to stick to the A–Z, a–z, 0–9, “$”, “_” repertoire?

Perhaps misconfigured Web servers still declaring ISO-8859-1 by default is a
reason why few people use characters beyond U+007F or U+00FF.
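
To make the server-encoding pitfall concrete, here is a minimal sketch of what happens when UTF-8 bytes are read as if each byte were one ISO-8859-1 character. The helper name `misreadAsLatin1` is made up for this illustration, and the sketch assumes an environment that provides the standard `TextEncoder` (current browsers and Node.js do).

```javascript
// Hypothetical helper: encode text as UTF-8, then misread every byte
// as a separate ISO-8859-1 character.
function misreadAsLatin1(text) {
  var bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return Array.from(bytes, function (b) {
    return String.fromCharCode(b);
  }).join("");
}

var garbled = misreadAsLatin1("\u03c0"); // GREEK SMALL LETTER PI
console.log(garbled.length);                     // 2
console.log(garbled.charCodeAt(0).toString(16)); // "cf"
console.log(garbled.charCodeAt(1).toString(16)); // "80"

// Source that spells the identifier with an escape is pure ASCII,
// so the same misreading leaves it untouched.
var asciiSource = "var \\u03c0 = Math.PI;";
console.log(misreadAsLatin1(asciiSource) === asciiSource); // true
```

The single letter turns into two characters, the second of which is not a letter, so a script mislabelled this way would no longer parse. Browsers typically treat an ISO-8859-1 label as windows-1252, in which case the second byte shows up as a euro sign instead.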

As for practical editing issues, it is not only the editor, but also the
input method that needs to be available and to allow for easy typing. At
least on my current X.org keyboard setup it is considerably harder to type
`π' than `pi' (except in GNOME applications where I could type C-S-u
$HEXCODEPOINT; but that would still be four keypresses more). (I really
don't seem to need the THORN letters, so I could define GREEK LETTER … PI
for M-P instead. But not all people can do this, and even if they could
they may not want to.)


PointedEars
 
Tim Streater

Thomas 'PointedEars' Lahn said:
As for practical editing issues, it is not only the editor, but also the
input method that needs to be available and to allow for easy typing. At
least on my current X.org keyboard setup it is considerably harder to type
`น' than `pi' (except in GNOME applications where I could type C-S-u
$HEXCODEPOINT; but that would still be four keypresses more). (I really
don't seem to need the THORN letters, so I could define GREEK LETTER ษ PI
for M-P instead. But not all people can do this, and even if they could
they may not want to.)

On my Mac น is option-p (alt-p if you prefer) - in all applications.
 
Thomas 'PointedEars' Lahn

Tim said:
On my Mac น is option-p (alt-p if you prefer) - in all applications.

But what good is a handy input method if you have the wrong application?
For example, you did not post GREEK SMALL LETTER PI, but something else
(UniView says, U+0E19 THAI CHARACTER NO NU; you even managed to mangle my
proper Unicode pi and ellipsis when quoting them.)

So I think we can add lack of proper Unicode support in some newsreaders to
the list of technical reasons for not using non-ASCII characters in source
code ;-)


PointedEars
 
Tim Streater

Thomas 'PointedEars' Lahn said:
But what good is a handy input method if you have the wrong application?
For example, you did not post GREEK SMALL LETTER PI, but something else
(UniView says, U+0E19 THAI CHARACTER NO NU; you even managed to mangle my
proper Unicode pi and ellipsis when quoting them.)

So I think we can add lack of proper Unicode support in some newsreaders to
the list of technical reasons for not using non-ASCII characters in source
code ;-)

Yes, MT-NewsWatcher does seem to have some issues in this regard (it
claims to send UTF-8 but is obviously lying). Shame really as it's quite
good in most other respects for my purposes.
 
Erwin Moller

On my Mac น is option-p (alt-p if you prefer) - in all applications.

Did anybody else notice the change in the topic when Tim replied?

The original quotes around "extended characters" have been replaced by
something else.
Funny, considering the discussion at hand. ;-)

Regards,
Erwin Moller
 
Tim Streater

Erwin Moller
Did anybody else notice the change in the topic when Tim replied?

The original quotes around "extended characters" have been replaced by
something else.
Funny, considering the discussion at hand. ;-)

Quite so :)
 
Dr J R Stockton

Jukka K. Korpela said:
The syntax of ECMAScript has allowed “extended characters” in
identifiers since 3rd edition (1999). This means, among other things,
allowing any Unicode letters, like Greek, Arabic, and Cyrillic letters
as well as e.g. Chinese ideographs. As far as I can see, this has been
supported in web browsers for a long time (e.g., ever since IE 5.5).

So is it really safe to use them, writing, say

var ? = Math.PI;
var ????? = 0;
function Götterdämmerung()

or are there some pitfalls? Various coding conventions as well as
practical editing issues (you can’t be sure of always being able to
edit your code on a Unicode-enabled editor) aside, is there still some
real technical reason to stick to the A–Z, a–z, 0–9, “$”, “_”
repertoire?


It creates interesting possibilities of writing code which looks
incorrect but will execute, or /vice versa/ - for example, "while" can
be used as an ordinary identifier, but not as a reserved word, if the
third character is \u2170. In at least common fonts, that numeric
character is likely to look very much like \x69.

One can likewise attack 'var' and 'extends'. And \u03bf or \u0531 can
be used in 'for'.

One can presumably defeat Google Translate by exchanging visually
equivalent Greek, Cyrillic, and Latin characters.

Code might be visually obfuscated by renaming one's variables to
incorporate, or comprise, various non-inking characters - especially
\u008d.

ENTIRELY UNTESTED.
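
The look-alike idea can be tried out with a minimal sketch like the one below. It covers only two of the code points mentioned, U+2170 and U+03BF, and writes them as Unicode escapes so that the source stays ASCII; it assumes an engine that follows the standard's identifier rules. The variable `lookalike` is introduced purely for the demonstration.

```javascript
// U+2170 SMALL ROMAN NUMERAL ONE stands in for the "i" of "while".
// It is a letter number, so the result is an ordinary identifier.
var wh\u2170le = 1;

// U+03BF GREEK SMALL LETTER OMICRON stands in for the "o" of "for".
var f\u03bfr = 2;

console.log(wh\u2170le + f\u03bfr); // 3

// The names only look like the reserved words; the strings differ.
var lookalike = "wh\u2170le";
console.log(lookalike === "while");                // false
console.log(lookalike.charCodeAt(2).toString(16)); // "2170"
```

Whether the two spellings are visually indistinguishable depends on the font in use.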

But the French are rather proud of their language, and IIRC a
well-placed accent can completely change the meaning of a word - I can
understand a French programmer wanting to use accented identifiers.
 
Jukka K. Korpela

18.5.2011 20:53, Dr J R Stockton wrote:
It creates interesting possibilities of writing code which looks
incorrect but will execute, or /vice versa/ - for example, "while" can
be used as an ordinary identifier, but not as a reserved word, if the
third character is \u2170.

Non-Ascii characters in identifiers could be used for a variety of
purposes, yes. There has been a lot of discussion about similar issues
with non-Ascii characters in domain names, where the risk (both
probability and possible damage) of intentionally caused confusion is
much greater.

With identifiers in JavaScript, the risks are already with us, without
any precautions like the complex rules for domain names (e.g. rules
against mixing letters from different writing systems in a word). So I
don't think the risks could be used as an argument against appropriate use.

I was somewhat surprised at seeing that both http://www.jslint.com/ and
http://jshint.com/ apparently report any non-Ascii characters in
identifiers as errors, without even offering any option to allow them.

But the French are rather proud of their language, and IIRC a
well-placed accent can completely change the meaning of a word - I can
understand a French programmer wanting to use accented identifiers.

It's not that common to find French word pairs that differ only in the
use of accents. In Swedish or Finnish, it's much easier, and letters
like å and ä aren't treated as letters with accents but as separate
letters of the alphabet. But Greek, Bulgarian, Thai, and Japanese are
better examples of languages that need non-Ascii letters.
 
Dr J R Stockton

Jukka K. Korpela said:
18.5.2011 20:53, Dr J R Stockton wrote:
I was somewhat surprised at seeing that both http://www.jslint.com/ and
http://jshint.com/ apparently report any non-Ascii characters in
identifiers as errors, without even offering any option to allow them.

Systems of US origin; such is to be expected. Granted, my own site is
entirely 7-bit, but that is because I still have some old but valued
editing tools.
It's not that common to find French word pairs that differ only in the
use of accents. In Swedish or Finnish, it's much easier, and letters
like å and ä aren't treated as letters with accents but as separate
letters of the alphabet. But Greek, Bulgarian, Thai, and Japanese are
better examples of languages that need non-Ascii letters.

Much harder for me. The only Finnish I know is "Eskimo, kiitos", and I
strongly suspect the first of being only a product name. I know two
words fewer of Swedish, Bulgarian, Thai, and Japanese, and little more
of Greek.
 
Jukka K. Korpela

20.5.2011 23:28, Dr J R Stockton wrote:
Systems of US origin; such is to be expected.

Well, I would have expected that people who write software for checking
program source would apply the standard of the programming language.
Those linters report non-Ascii letters in identifiers as _errors_. I
would accept a warning, though an informative diagnostic might be optimal.
The only Finnish I know is "Eskimo, kiitos", and I
strongly suspect the first of being only a product name.

Yes, it is a trademark.

If people decide to use words from a language other than English in
identifiers, then there are two very different issues:

1) For languages written in Latin letters, you _could_ stick to Ascii
and use replacements like "a" or "ae" for "ä". This is what people
commonly do in programming, but it distorts the words and may force the
programmer to think about potential confusion and to select the words in
an unnatural way. The level of distortion depends on the language.

2) For languages written using other alphabets, there really isn't much
of an option - except perhaps transliteration or transcription.
 
Ry Nohryb

Well, I would have expected that people who write software for checking
program source would apply the standard of the programming language.
Those linters report non-Ascii letters in identifiers as _errors_. I
would accept a warning, though an informative diagnostic might be optimal.


Yes, it is a trademark.

If people decide to use words from a language other than English in
identifiers, then there are two very different issues:

1) For languages written in Latin letters, you _could_ stick to Ascii
and use replacements like "a" or "ae" for "ä". This is what people
commonly do in programming, but it distorts the words and may force the
programmer to think about potential confusion and to select the words in
an unnatural way. The level of distortion depends on the language.

2) For languages written using other alphabets, there really isn't much
of an option - except perhaps transliteration or transcription.

I think you can use for that any utf8 char that's not longer than 2
bytes/16 bits, is that right?

For example you can use π or μ but you can't use ∆ or ∑

I've been using function ƒ () { ... } for a while in some <script>s,
but I discovered it was not being parsed properly in some (older)
browsers (I can't recall which, exactly), so I've had to return to
function f () {}.
 
Stanimir Stamenkov

Mon, 23 May 2011 02:31:33 -0700 (PDT), /Ry Nohryb/:
I think you can use for that any utf8 char that's not longer than 2
bytes/16 bits, is that right ?

I believe you're thinking not of "utf8 char" but of a Unicode
character which could be represented as a single UTF-16 unit (2 bytes).
For example you can use π or μ but you can't use ∆ or ∑

I think all of these are fine as they are represented using a single
UTF-16 unit. I don't know whether using surrogate code points is
permitted or restricted in identifiers, however.
 
Jukka K. Korpela

23.5.2011 12:42, Stanimir Stamenkov wrote:
Mon, 23 May 2011 02:31:33 -0700 (PDT), /Ry Nohryb/:


I believe you're thinking not of "utf8 char" but of an Unicode character
which could be represented as a single UTF-16 unit (2 bytes).

I can’t read minds, but in any case, the UTF-8 encoding has nothing to
do with the issue.

The following characters are allowed in identifiers according to ECMA
262: letters, “$”, “_”, digits, combining marks, connector punctuation,
ZWNJ, and ZWJ. The concepts here are to be understood in Unicode sense,
e.g. “letter” means any Unicode character defined to be a letter by its
General Category property.
I think all of these are fine

No, ∆ (U+2206 INCREMENT) and ∑ (U+2211 N-ARY SUMMATION) are not allowed
in identifiers. Their General Category is Symbol, Math.
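
The distinction can be checked against a running engine with a minimal sketch along these lines. The helper `parsesAsIdentifier` is a made-up name for this illustration; it relies on `new Function`, which is unavailable where dynamic code evaluation is blocked, and on Unicode property escapes in regular expressions, which need a reasonably recent engine.

```javascript
// Hypothetical helper: asks the running engine whether `name` parses
// as a variable name. Reserved words such as "while" also give false.
function parsesAsIdentifier(name) {
  try {
    new Function("var " + name + ";");
    return true;
  } catch (e) {
    return false;
  }
}

console.log(parsesAsIdentifier("\u03c0")); // true  (U+03C0 GREEK SMALL LETTER PI)
console.log(parsesAsIdentifier("\u0192")); // true  (U+0192 LATIN SMALL LETTER F WITH HOOK)
console.log(parsesAsIdentifier("\u2206")); // false (U+2206 INCREMENT)
console.log(parsesAsIdentifier("\u2211")); // false (U+2211 N-ARY SUMMATION)

// Both rejected characters fit in one UTF-16 code unit, so size is not
// the criterion; their General Category (Symbol, Math) is.
console.log("\u2206".length);               // 1
console.log("\u2206".charCodeAt(0));        // 8710
console.log(/^\p{Sm}$/u.test("\u2206"));    // true
console.log(/^\p{Sm}$/u.test("\u2211"));    // true
```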


as they are represented using single
UTF-16 unit. I don't know whether using surrogate code points is
permitted or restricted in identifiers, however.

I guess it must have been long ago. But I don’t think the use of “ƒ” as
a function name is a particularly good example of the needs for, or even
benefits of, using “extended” characters in identifiers. In mathematics,
“ƒ” is used as a generic symbol of a function.
 
Ry Nohryb

Mon, 23 May 2011 02:31:33 -0700 (PDT), /Ry Nohryb/:


I believe you're thinking not of "utf8 char" but of an Unicode
character which could be represented as a single UTF-16 unit (2 bytes).

Right, yes, I would think so too, but then, why does this
<http://jorgechamorro.com/test.html> throw a parse error @ line 12 (and not @
line 11), even when '∆'.charCodeAt(0) is 8710 ?
 
Ry Nohryb

23.5.2011 12:42, Stanimir Stamenkov wrote


I can’t read minds, but in any case, the UTF-8 encoding has nothing to
do with the issue.

The following characters are allowed in identifiers according to ECMA
262: letters, “$”, “_”, digits, combining marks, connector punctuation,
ZWNJ, and ZWJ. The concepts here are to be understood in Unicode sense,
e.g. “letter” means any Unicode character defined to be a letter by its
General Category property.



No, ∆ (U+2206 INCREMENT) and ∑ (U+2211 N-ARY SUMMATION) are not allowed
in identifiers. Their General Category is Symbol, Math.

Oh, yeah, I see, so *that* was it (*not a letter*)! Thanks!
 
Stanimir Stamenkov

Mon, 23 May 2011 04:41:59 -0700 (PDT), /Ry Nohryb/:
Right, yes, I would think so too, but then, why does this
<http://jorgechamorro.com/test.html> throw a parse error @ line 12 (and not @
line 11), even when '∆'.charCodeAt(0) is 8710 ?

Seems like Jukka Korpela has already pointed out that ∆ is not a legal
identifier character (I failed to check that).
 
Ry Nohryb

I guess it must have been long ago. But I don’t think the use of “ƒ” as
a function name is a particularly good example of the needs for, or even
benefits of, using “extended” characters in identifiers. In mathematics,
(...)

Ha!, so if
“ƒ” is used as a generic symbol of a function.

why is function ƒ () {} not a good use for it?

And ∑ and ∆, for example, would have been good names too, for summation
and increment, don't you think so?
 
Jukka K. Korpela

I don't know whether using surrogate code points is
permitted or restricted in identifiers, however.

This is defined somewhat implicitly, since clause 7.6 of ECMA 262 says:
“The characters in the specified categories in version 3.0 of the
Unicode standard must be treated as in those categories by all
conforming ECMAScript implementations.” An implementation does not need
to support characters added after Unicode 3.0 in identifiers. And in
Unicode 3.0, all characters were in BMP, i.e. directly representable as
16-bit code units.

In practice, Firefox, IE, Opera, and Chrome all seem to limit identifier
characters to those for which support is required in the standard. So
the Phoenician letter \uD802\uDD0E won’t do in an identifier (even though
it’s OK in a string literal), making Phoenician programmers very sad.
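
A minimal sketch for probing this in whatever engine is at hand follows. The variable names are made up for the illustration. The result of the identifier check is deliberately left without an expected value: the behaviour described above is what was observed in 2011, and later editions of the standard (ECMAScript 2015 onwards) define identifier characters in terms of code points, so newer engines may well accept the letter.

```javascript
// The surrogate pair from the post, i.e. the single code point U+1090E.
var nun = "\uD802\uDD0E";
console.log(nun.length);                      // 2 (two UTF-16 code units)
console.log(nun.codePointAt(0).toString(16)); // "1090e"

// Ask the running engine whether the character is accepted as an
// identifier; the answer depends on the engine and its age.
var accepted;
try {
  new Function("var " + nun + " = 1;");
  accepted = true;
} catch (e) {
  accepted = false;
}
console.log(accepted);
```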
 
