PEP 3131: Supporting Non-ASCII Identifiers

M

"Martin v. Löwis"

Neil said:
Martin v. Löwis:


Which version can we expect this to be implemented in?

The PEP says 3.0, and the planned implementation also targets
that release.

Regards,
Martin
 
S

sjdevnull

Are you worried that some 3rd-party package you have
included in your software will have some non-ascii identifiers
buried in it somewhere? Surely that is easy to check for?
Far easier than checking that it doesn't have some trojan
code in it, it seems to me.

What do you mean, "check for"? If, say, numeric starts using math
characters (as has been suggested), I'm not exactly going to stop
using numeric. It'll still be a lot better than nothing, just
slightly less better than it used to be.
I think we all are in this position. I always send plain
text mail to mailing lists, people I don't know etc. But
that doesn't mean that email software should be constrained
to only 7-bit plain text, no attachments! I frequently use
such capabilities when they are appropriate.

Sure. But when you're talking about maintaining code, there's a very
high value to having all the existing tools work with it whether
they're wide-character aware or not.
If your response is, "yes, but look at the problems that HTML
email, virus-infected attachments, etc. cause", the situation
is not the same. You have little control over what kind of
email people send you but you do have control over what
code, libraries, patches, you choose to use in your
software.

If you want to use ascii-only, do it! Nobody is making
you deal with non-ascii code if you don't want to.

Yes. But it's not like this makes things so horribly awful that it's
worth my time to reimplement large external libraries. I remain at -0
on the proposal; it'll cause some headaches for the majority of
current Python programmers, but it may have some benefits to a
sizeable minority and may help bring in new coders. And it's not
going to cause flaming catastrophic death or anything.
 
S

Steve Holden

Martin said:
The PEP says 3.0, and the planned implementation also targets
that release.
Can we take it this change *won't* be backported to the 2.X series?

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
 
R

Ross Ridge

"Martin v. Löwis" said:
One possible reason is that the tools processing the program would not
know correctly what encoding the source file is in, and would fail
when they guessed the encoding incorrectly. For comments, that is not
a problem, as an incorrect encoding guess has no impact on the meaning
of the program (if the compiler is able to read over the comment
in the first place).

Possibly. One Java program I remember had Japanese comments encoded
in Shift-JIS. Will Python be better here? Will it support the source
code encodings that programmers around the world expect?
Another possible reason is that the programmers were unsure whether
non-ASCII identifiers are allowed.

If that's the case, I'm not sure how you can improve on that in Python.

There are lots of possible reasons why all these programmers around
the world who want to use non-ASCII identifiers end up not using them.
One is simply that very few people ever really want to do so. However,
if you're to assume that they do, then you should look at the existing
practice in other languages to find out what they did right and what
they did wrong. You don't have to speculate.

Ross Ridge
 
G

Gregor Horvath

With the second one, all my standard tools would work fine. My user's
setups will work with it. And there's a much higher chance that all
the intervening systems will work with it.

Please fix your setup.
This is the 21st Century. Unicode is the default in Python 3000.
Wake up before it is too late for you.

Gregor
 
G

Guest

Currently, in Python 2.5, identifiers are specified as starting with
an upper- or lowercase letter or underscore ('_'), with the following
characters of the identifier optionally also being numerical digits
('0'...'9').

This current state seems easy to remember even if felt restrictive by
many.

Contrariwise, the referenced document "UAX-31" is a bit obscure to me.

It's actually very easy. The basic principle will stay: the first
character must be a letter or an underscore, followed by letters,
underscores, and digits.

The question really is: what is a letter? What is an underscore?
What is a digit?
1) Will this allow me to use, say, a "right-arrow" glyph (if I can
find one) to start my identifier?

No. A right-arrow (such as U+2192, RIGHTWARDS ARROW) is a symbol
(general category Sm: Symbol, Math). See

http://unicode.org/Public/UNIDATA/UCD.html

for a list of general category values, and

http://unicode.org/Public/UNIDATA/UnicodeData.txt

for a textual description of all characters.

Now, there is a special case in that Unicode supports "combining
modifier characters", i.e. characters that are not characters
themselves, but modify previous characters, to add diacritical
marks to letters. Unicode has great flexibility in applying these,
to form characters that are not supported themselves. Among those,
there is U+20D7, COMBINING RIGHT ARROW ABOVE, which is of general
category Mn, Mark, Nonspacing.

In PEP 3131, such marks may not appear as the first character
(since they need to modify a base character), but they may appear
as subsequent characters. This allows you to form identifiers such as
v⃗ (which should render as a small letter v, with a vector
arrow on top).
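(An illustrative check of the category distinction above, using the stdlib unicodedata module — my addition, not part of the original mail:)

```python
import unicodedata

# RIGHTWARDS ARROW is a math symbol (Sm), so it can never appear in an identifier.
print(unicodedata.category("\u2192"))   # Sm
# COMBINING RIGHT ARROW ABOVE is a nonspacing mark (Mn), valid only as a
# continue character, never as the first character of an identifier.
print(unicodedata.category("\u20d7"))   # Mn
print(unicodedata.name("\u20d7"))       # COMBINING RIGHT ARROW ABOVE
```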
2) Could an ``ID_Continue`` be used as an ``ID_Start`` if using a RTL
(reversed or "mirrored") identifier? (Probably not, but I don't know.)

Unicode, and this PEP, always uses logical order, not rendering order.
What matters is in what order the characters appear in the source code
string.

RTL languages do pose a challenge, in particular since bidirectional
algorithms apparently aren't implemented correctly in many editors.
3) Is or will there be a definitive and exhaustive listing (with
bitmap representations of the glyphs to avoid the font issues) of the
glyphs that the PEP 3131 would allow in identifiers? (Does this
question even make sense?)

It makes sense, but it is difficult to implement. The PEP already
links to a non-normative list that is exhaustive for Unicode 4.1.
Future Unicode versions may add additional characters, so a
list that is exhaustive now might not be in the future. The
Unicode consortium promises stability, meaning that what is an
identifier now won't be reclassified as a non-identifier in the
future, but the reverse is not true, as new code points get
assigned.

As for the list I generated in HTML: It might be possible to
make it include bitmaps instead of HTML character references,
but doing so is a licensing problem, as you need a license
for a font that has all these characters. If you want to
look up a specific character, I recommend going to the Unicode
code charts, at

http://www.unicode.org/charts/

Notice that an HTML page that includes individual bitmaps
for all characters would take *ages* to load.

Regards,
Martin

P.S. Anybody who wants to play with generating visualisations
of the PEP, here are the functions I used:

import unicodedata

def isnorm(c):
    # True if c is unchanged by NFC normalization.
    return unicodedata.normalize("NFC", c) == c

def start(c):
    if not isnorm(c):
        return False
    # 'Lu' (uppercase letters) belongs in ID_Start per PEP 3131.
    if unicodedata.category(c) in ('Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'):
        return True
    if c == u'_':
        return True
    if c in u"\u2118\u212E\u309B\u309C":  # Other_ID_Start exceptions
        return True
    return False

def cont_only(c):
    if not isnorm(c):
        return False
    if unicodedata.category(c) in ('Mn', 'Mc', 'Nd', 'Pc'):
        return True
    if 0x1369 <= ord(c) <= 0x1371:  # Ethiopic digits (Other_ID_Continue)
        return True
    return False

def cont(c):
    return start(c) or cont_only(c)

The isnorm() aspect excludes characters from the list which
change under NFC. This excludes a few compatibility characters
which are allowed in source code, but become indistinguishable
from their canonical form semantically.
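(A present-day aside, not part of the original post: Python 3 went on to implement PEP 3131, so str.isidentifier() can be used to sanity-check predicates like the ones above:)

```python
# Python 3 implements PEP 3131; isidentifier() reflects the same rules.
assert "frobnitz".isidentifier()
assert "_x1".isidentifier()
assert "v\u20d7".isidentifier()       # 'v' followed by COMBINING RIGHT ARROW ABOVE
assert not "\u2192x".isidentifier()   # a math symbol cannot start an identifier
assert not "1abc".isidentifier()      # a digit cannot come first
```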
 
M

"Martin v. Löwis"

Possibly. One Java program I remember had Japanese comments encoded
in Shift-JIS. Will Python be better here? Will it support the source
code encodings that programmers around the world expect?

It's not a question of "will it". It does today, starting from Python 2.3.
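(A sketch of what that support looks like — my example, not from the original mail. PEP 263's coding declaration lets the compiler decode, e.g., Shift-JIS source:)

```python
# A source file may declare its encoding on the first or second line (PEP 263).
# Here we build a Shift-JIS encoded source in memory and compile it.
source = ("# -*- coding: shift_jis -*-\n"
          "# \u30b3\u30e1\u30f3\u30c8 (a Japanese comment)\n"
          "x = 1\n").encode("shift_jis")
namespace = {}
exec(compile(source, "<sjis>", "exec"), namespace)
assert namespace["x"] == 1
```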
If that's the case, I'm not sure how you can improve on that in Python.

It will change on its own over time. "Not allowed" could mean "not
permitted by policy". Indeed, the PEP explicitly mandates a policy
that bans non-ASCII characters from source (whether in identifiers or
comments) for Python itself, and encourages other projects to define
similar policies. What projects pick up such a policy, or pick a
different policy (e.g. all comments must be in Korean) remains to
be seen.

Then, programmers will not be sure whether the language and the tools
allow it. For Python, it will be supported from 3.0, so people will
be worried initially whether their code needs to run on older Python
versions. When Python 3.5 comes along, people will hopefully have lost
interest in supporting 2.x, so they will start using 3.x features,
including this one.

Now, it may be tempting to say "ok, so let's wait until 3.5, if people
won't use it before then anyway". That is tricky logic: if we add it only
to 3.5, people won't be using it before 4.0. *Any* new feature
takes several years to get into wide acceptance, but years pass
surprisingly fast.
There are lots of possible reasons why all these programmers around
the world who want to use non-ASCII identifiers end up not using them.
One is simply that very few people ever really want to do so. However,
if you're to assume that they do, then you should look at the existing
practice in other languages to find out what they did right and what
they did wrong. You don't have to speculate.

That's indeed how this PEP came about. There were early adopters, like
Java, then experience gained from it (resulting in PEP 263, implemented
in Python 2.3 on the Python side, and resulting in UAX#39 on the Unicode
consortium side), and that experience now flows into PEP 3131.

If you think I speculated in reasoning why people did not use the
feature in Java: sorry for expressing myself unclearly. I know for
a fact that the reasons I suggested were actual reasons given by
actual people. I'm just not sure whether this was an exhaustive
list (because I did not interview every programmer in the world),
and what statistical relevance each of these reasons had (because
I did not conduct scientific research to gain statistically
relevant data on usage of non-ASCII identifiers in different
regions of the world).

Regards,
Martin
 
H

Hendrik van Rooyen

Hendrik van Rooyen said:
HvR:
LOL - true - but a broken down assembler programmer like me
does not use getattr - and def is short for define, and for and while
and in are not German.

After an intense session of omphaloscopy, I would like another bite
at this cherry.

I think my problem is something like this - when I see a line of code
like:

def frobnitz():

I do not actually see the word "def" - I see something like:

define a function with no arguments called frobnitz

This "expansion" process is involuntary, and immediate in my mind.

And this is immediately followed by an irritated reaction, like:

WTF is frobnitz? What is it supposed to do? What Idiot wrote this?

Similarly, when I encounter the word "getattr" - it is immediately
expanded to "get attribute" and this "expansion" is kind of
dependent on another thing, namely that my mind is in "English
mode" - I refer here to something that only happens rarely, but
with devastating effect, experienced only by people who can read
more than one language - I am referring to the phenomenon that you
look at an unfamiliar piece of writing on say a signboard, with the
wrong language "switch" set in your mind - and you cannot read it,
it makes no sense for a second or two - until you kind of step back
mentally and have a more deliberate look at it, when it becomes
obvious that it's not, say, English, but Afrikaans, or German, or vice
versa.

So in a sense, I can look you in the eye and assert that "def" and
"getattr" are in fact English words... (for me, that is)

I suppose that this "one language track" - mindedness of mine
is why I find the mix of keywords and German or Afrikaans so
abhorrent - I cannot really help it, it feels as if I am eating a
sandwich, and that I bite on a stone in the bread. - It just jars.

Good luck with your PEP - I don't support it, but it is unlikely
that the Python-dev crowd and GvR would be swayed much
by the opinions of the egregious HvR.

Aesthetics aside, I think that the practical maintenance problems
(especially remote maintenance) are the rock on which this
ship could founder.

- Hendrik

--
Philip Larkin (English Poet) :
They **** you up, your mom and dad -
They do not mean to, but they do.
They fill you with the faults they had,
and add some extra, just for you.
 
H

Hendrik van Rooyen

Hvr:
What I meant was, would the use of "foreign" identifiers look so
horrible to you if the core language had fewer English keywords?
(Perhaps Perl, with its line-noise, was a poor choice of example.
Maybe Lisp would be better, but I'm not so sure of my Lisp as to
make such an assertion for it.)

I suppose it would jar less - but I avoid such languages, as the whole
thing kind of jars - I am not on the python group for nothing..

: - )

- Hendrik
 
P

Paul Rubin

Martin v. Löwis said:
Now I understand it is meaning 12 in Merriam-Webster's dictionary,
a) "to decline to bid, double, or redouble in a card game", or b)
"to let something go by without accepting or taking
advantage of it".

I never thought of it as having that meaning. I thought of it in the
sense of going by something without stopping, like "I passed a post
office on my way to work today".
 
P

Paul Rubin

Martin v. Löwis said:
If you doubt the claim, please indicate which of these three aspects
you doubt:
1. there are programmers who desire to define classes and functions
with names in their native language.
2. those developers find the code clearer and more maintainable than
if they had to use English names.
3. code clarity and maintainability are important.

I think it can damage clarity and maintainability, and if there's so
much demand for it then I'd propose this compromise: non-ASCII
identifiers are allowed but produce a compiler warning message
(including from eval and exec). You can suppress the warning message
with a command-line option.
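(Much of that checking can already be done outside the compiler; a rough sketch of such a checker using only the stdlib tokenize module — my illustration:)

```python
import io
import tokenize

def nonascii_names(source):
    """Yield (line_number, name) for identifiers containing non-ASCII characters."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and any(ord(ch) > 127 for ch in tok.string):
            yield tok.start[0], tok.string

src = "x = 1\n\u03c6 = 2.0\n"
# Only the Greek phi on line 2 is flagged.
assert list(nonascii_names(src)) == [(2, "\u03c6")]
```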
 
T

Thomas Bellman

As for the list I generated in HTML: It might be possible to
make it include bitmaps instead of HTML character references,
but doing so is a licensing problem, as you need a license
for a font that has all these characters. If you want to
look up a specific character, I recommend going to the Unicode
code charts, at

My understanding is also that there are several East Asian
characters that display quite differently depending on whether
you are in Japan, Taiwan or mainland China. So differently
that, for example, a Japanese person will not be able to recognize
a character rendered in the Taiwanese or mainland Chinese way.
 
L

Laurent Pointal

Long and interesting discussion with different points of view.

Personally, even if the PEP goes in (and it's accepted), I'll continue to use
identifiers as I currently do. But I understand those who want to be able to
use chars in their own language.

* for people who are not expert developers (non-pros, or in a learning
context), to be able to use names that have meaning, and for pro developers
wanting to give a clear domain-specific meaning - mainly for languages not
based on Latin characters, where the problem must be exacerbated.
They can already use Unicode in strings (including documentation ones).

* for exchanging with other programming languages having such identifiers...
when they are really used (I include binding of table/column names in
relational databases).

* (not read, but I think present) this will allow developers to lock their
code so that it cannot easily be taken/relocated anywhere by anybody.


In the discussion I've seen that the problem of mixing chars that have
different Unicode numbers but the same representation (e.g. omega) is
resolved (by use of a Unicode attribute linked to representation, AFAIU).

I've seen (on fclp) posts about speed; it should be verified, but I'm not
sure we will lose speed with unicode identifiers.

On unicode editing, we have in 2007 enough correct editors supporting
unicode (I configure my Windows/Linux editors to use utf-8 by default).


I share the concern about reading code from a project which may use such
identifiers (I don't read Cyrillic, nor Kanji or Hindi), but this will
just give freedom to users.

This can be a pain for me in some cases, but is that a strong enough
argument to forbid this for other people who feel the need?


IMHO what we should have if the PEP goes on:

* a reworking of tracebacks to add a general option (like -T) to ensure
tracebacks print only pure ASCII, to avoid encoding problems when
displaying errors on terminals.

* a possibility to specify for modules that they must *define* only
ASCII-based names, like a from __future__ import asciionly. To be able to
enforce this policy in projects which request it.

* and, as many wrote, enforcing that the standard Python libraries use
only ASCII identifiers.
 
T

Torsten Bronger

Hi there!
Under the PEP, identifiers are converted to normal form NFC, and
we have

py> unicodedata.normalize("NFC", u"\u2126")
u'\u03a9'

So, OHM SIGN compares equal to GREEK CAPITAL LETTER OMEGA. It can't
be confused with it - it is equal to it by the proposed language
semantics.
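A concrete check of that equality (my illustration, not in the original mail):

```python
import unicodedata

ohm = "\u2126"    # OHM SIGN
omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA

assert ohm != omega                                # distinct code points...
assert unicodedata.normalize("NFC", ohm) == omega  # ...identical after NFC,
# so as identifiers they are the same name under the PEP.
```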

So different Unicode sequences in the source code can denote the
same identifier?

Bye,
Torsten.
 
T

Torsten Bronger

Hi there!

Laurent said:
[...]

Personally, even if the PEP goes in (and it's accepted), I'll continue
to use identifiers as I currently do. [...]

Me too (mostly), although I do like the PEP. While many people have
pointed out possible issues with the PEP, only a few have tried to
estimate its actual impact. I don't think that it will do harm to
Python code, because programmers will know when it's appropriate
to use it. The potential trouble is too obvious to be ignored
accidentally. And in the case of a bad programmer, you have more
serious problems than flawed identifier names, really.

But for private utilities, for example, such identifiers are really a
nice thing to have. The same is true for teaching in some cases.
And the small simulation program in my thesis would have been better
with some α and φ. At least, the program would then be closer to the
equations in the text.
[...]

* a possibility to specify for modules that they must *define*
only ASCII-based names, like a from __future__ import asciionly. To
be able to enforce this policy in projects which request it.

Please don't. We're all adults. If a maintainer is really
concerned about such a thing, he should write a trivial program that
ensures it. After all, there are some other coding guidelines too
that could be enforced this way but aren't, for good reason.

Bye,
Torsten.
 
G

Gregor Horvath

Hendrik said:
I suppose that this "one language track" - mindedness of mine
is why I find the mix of keywords and German or Afrikaans so
abhorrent - I cannot really help it, it feels as if I am eating a
sandwich, and that I bite on a stone in the bread. - It just jars.

Please come to Vienna and learn the local slang.
You would be surprised how beautiful and expressive a language mixed
together from a lot of very different languages can be. The same goes
for music: it's the secret of the success of the music from Vienna,
which is just a mix of all the different cultures that once lived in
a big multicultural kingdom.

A mix of Python keywords and German identifiers feels very natural
to me. I live in cultural diversity and richness and love it.

Gregor
 
I

Istvan Albert

Is there any difference for you in debugging these code snippets?
class Türstock(object):

Of course there is - how do I type the ü? (I can copy/paste, for
example, but that gets old quickly.)

But you're making a strawman argument by using extended-ASCII
characters that would work anyhow. How about debugging this (I wonder
whether it will even make it through?):

class ６자회담관련론조
６자회 = 0
６자회담관련 고귀 명=10


(I don't know what it means - I just copied over some words from a
Japanese news site - but the first thing it did was mess up my editor,
which would not type the colon anymore)

i.
 
