PEP 3131: Supporting Non-ASCII Identifiers

rurpy · May 15, 2007

This is a very weak argument, IMHO. How do you want to use Python
without learning at least enough English to grasp a somewhat decent
understanding of the standard library? Let's face it: To do any "real"
programming, you need to know at least some English today, and I don't
see that changing anytime soon. And it is definitely not going to be
changed by allowing non-ASCII identifiers.

snip

Another way of framing this discussion could be, "should
Python continue to maintain a barrier to its use by non-English
speakers if it is not necessary?"

Virtually every guide to programming style I have ever read stresses
the importance of variable naming. For example, the Wikipedia article
"programming style" mentions variable naming right after layout
(indentation, etc) in importance:

"Appropriate choices for variable names are seen as the keystone
for good style. Poorly-named variables make code harder to read
and understand"

Even when English-as-non-native-language speakers can understand
English words, the level and speed of comprehension is often far below
that of their native language. Denying the ability to use native-language
identifiers puts these people at a significant disadvantage compared
to English speakers with regard to reading (their own!) code.
And the justification for this is the hypothetical case that someone
who doesn't understand that language *might* *someday* have to
read it. Besides the large number of programs that will never be
public (far larger than most of the worriers think is my guess), even
in public programs this is not necessarily a disaster. A public
application written in "Chinese Python" might work perfectly and be
completely usable by me, even if it is difficult for me to understand.
And why should my difficulty count for more than a Chinese person's
difficulty in understanding my "English Python" application?

That Python keywords are English is unimportant -- they are a small
finite set that can be memorized. Identifiers are a large unbounded
set that can't be.

That the standard library code and documentation is in English
is irrelevant. One shouldn't need to read the standard library code
to use it. (That one sometimes has to is a Python flaw that should
be fixed -- not bandaided by requiring Python programmers to
know English).

There is no need to understand English to use the standard library.
Documentation has been and will (as Python becomes more popular) be
translated into native languages. Here is Python standard library
documentation in Japanese:

http://www.python.jp/doc/release/lib/

While encouraging English in shared/public code is fine,
trying to enforce it by continuing to enforce ASCII-only identifiers
smacks to me of a "whites only country club" mentality.

Making Python more accessible to the world (the vast majority
of whom do not speak English) can only advance Python.

Stefan Behnel · May 15, 2007

René Fleschenberg said:
We all know what the PEP is about (we can read). The point is: If we do
not *need* non-English/ASCII identifiers, we do not need the PEP. If the
PEP does not solve an actual *problem* and still introduces some
potential for *new* problems, it should be rejected. So far, the
"problem" seems to just not exist. The burden of proof is on those who
support the PEP.

The main problem here seems to be proving the need for something to people who
do not need it themselves. So, if a simple "but I need it because a, b, c" is
not enough, what good is any further proof?

Stefan

Stefan Behnel · May 15, 2007

René Fleschenberg said:
We all know what the PEP is about (we can read).

BTW: who is this "we" if it doesn't include you?

Stefan

MRAB · May 15, 2007

There are really two issues here, and they're being
confused.

One is allowing non-English identifiers, which is a political
issue. The other is homoglyphs, two characters which look the same.
The latter is a real problem in a language like Python with implicit
declarations. If a maintenance programmer sees a variable name
and retypes it, they may silently create a new variable.

If Unicode characters are allowed, they must be done under some
profile restrictive enough to prohibit homoglyphs. I'm not sure
if UTS-39, profile 2, "Highly Restrictive", solves this problem,
but it's a step in the right direction. This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

We have to have visually unique identifiers.

There's also an issue with implementations that interface
with other languages. Some Python implementations generate
C, Java, or LISP code. Even CPython will call C code.
The representation of external symbols needs to be standardized
across those interfaces.

Surely it should be possible programmatically to compare the visual
appearance of the characters and highlight ones which are similar, or
colour-code various subsets when required.
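
A programmatic check along these lines might look roughly like the Python 3 sketch below. It is only a heuristic, not the UTS-39 algorithm: it assumes that the first word of a character's Unicode name (LATIN, CYRILLIC, GREEK, ...) is a good enough stand-in for its script, and the helper names `rough_scripts` and `is_mixed_script` are made up for illustration.

```python
import unicodedata


def rough_scripts(identifier):
    """Return rough script labels for the letters in an identifier.

    Assumption: the first word of the Unicode character name stands in
    for the script. This is a heuristic, not real script data.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(identifier):
    """Flag identifiers whose letters come from more than one rough script."""
    return len(rough_scripts(identifier)) > 1


latin = "paypal"
spoofed = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(is_mixed_script(latin))          # False
print(is_mixed_script(spoofed))        # True
print(sorted(rough_scripts(spoofed)))  # ['CYRILLIC', 'LATIN']
```

A real tool would need proper script data: this heuristic would, for instance, also flag ordinary Japanese words that combine kanji and kana.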

Steven D'Aprano · May 15, 2007

That is the wrong question. The right question is: Why do you want to
introduce *more* possibilities to make such mistakes? Does this PEP solve
an actual problem, and if so, is that problem big enough to be worth the
introduction of these new risks and problems?

But they aren't new risks and problems, that's the point. So far, every
single objection raised ALREADY EXISTS in some form or another. There's
all this hysteria about the problems the proposed change will cause, but
those problems already exist. When was the last time a Black Hat tried to
smuggle in bad code by changing an identifier from xyz0 to xyzO?

I think it is not. I think that the problem only really applies to very
isolated use-cases.

Like the 5.5 billion people who speak no English.

So isolated that they do not justify a change to
mainline Python. If someone thinks that non-ASCII identifiers are really
needed, he could maintain a special Python branch that supports them. I
doubt that there would be a lot of demand for it.

Maybe so. But I guarantee without a shadow of a doubt that if the change
were introduced, people would use it -- even if right now they say they
don't want it.

Matthew Woodcraft · May 15, 2007

Thorsten Kampe said:
* René Fleschenberg (Tue, 15 May 2007 14:35:33 +0200)

Why would I want to do that? It's not my code. Identifier names are
mine. If I use modules from standard library I use some "foreign
words". There's no problem in that.

It could even be an advantage. I sometimes find that I have to use a
'second best' name myself because I want to avoid the possible
confusion caused if I choose a name which has a well-known existing
use.

-M-

Steven D'Aprano · May 15, 2007

Joke aside, this just means that I won't ever be able to program math in
ADA, because I have absolutely no idea on how to do a 'pi' character on
my keyboard.

Maybe you should find out then? Personal ignorance is never an excuse for
rejecting technology.

Steven D'Aprano · May 15, 2007

Typing them is not the only problem. They might not even *display*
correctly if you don't happen to use a font that supports them.

Then maybe you should catch up to the 21st century and install some fonts
and a modern editor.

Steven D'Aprano · May 15, 2007

Thus spake Steven D'Aprano:

Let's set aside the fact that you're guilty of sloppy quoting here,
since the phrase "visual inspection" is yours, not mine.

Yes, my bad, I apologize, that was sloppy of me. What you actually said
was "I can't reliably verify it by eye".

Regardless,
your interpretation of my words is just plain dumb. My phrasing was
intended to draw attention to the fact that one needs to READ code in
order to understand it. You know - with one's eyes. VISUALLY. And VISUAL
INSPECTION of code becomes unreliable if this PEP passes.

Notwithstanding my misquote, I find it ... amusing ... that after
hauling me over the coals for using the term "visual inspection", you're
not only using it, but shouting it.

Perhaps you aren't aware that doing something "by eye" is idiomatic
English for doing it quickly, roughly, imprecisely. It is the opposite of
taking the time and effort to do the job carefully and accurately. If you
measure something "by eye", you just look at it and take a guess.

So, as I said, if you're relying on VISUAL INSPECTION (your words _now_)
you're already vulnerable. Fortunately for you, you're not relying on
visual inspection, you are actually _reading_ and _comprehending_ the
code. That might even mean, in extreme cases, you sit down with pencil
and paper and sketch out the program flow to understand what it is doing.

Now that (I hope!) you understand why I said what I said, can we agree
that _understanding_ is critical to the process? If you don't understand
the code, you don't accept it. If somebody submits a patch with
identifiers like a9472302 and a9473202 you're going to reject it as too
difficult to understand.

How do non-ASCII identifiers change that situation? What will be
different?

I'm sorry to have to tell you, but you understood Martin's post no
better than you did mine. There is no general way to detect homoglyphs
and "convert them to a normal form". Observe:

import unicodedata
print repr(unicodedata.normalize("NFC", u"\u2160"))
print u"\u2160"
print "I"

Yes, I observe two very different glyphs, as different as the ASCII
characters I and |. What do you see?
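
For readers following along in Python 3, the difference between the normalization forms can be sketched as follows. NFC leaves U+2160 untouched, while NFKC (the form PEP 3131 specifies for identifiers) folds it into the ASCII letter. Neither form merges look-alike letters from different scripts; the Cyrillic example is an illustration added here, not one taken from the thread.

```python
import unicodedata

roman_one = "\u2160"   # ROMAN NUMERAL ONE
latin_i = "I"          # LATIN CAPITAL LETTER I
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A
latin_a = "a"          # LATIN SMALL LETTER A

# NFC only composes canonical equivalents, so the numeral stays distinct.
print(unicodedata.normalize("NFC", roman_one) == latin_i)   # False

# NFKC also applies compatibility mappings, which turn U+2160 into "I".
print(unicodedata.normalize("NFKC", roman_one) == latin_i)  # True

# No normalization form maps a Cyrillic letter onto its Latin look-alike.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, cyrillic_a) == latin_a)  # always False
```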

So, a round 0 for reading comprehension this lesson, I'm afraid. Better
luck next time.

Ha ha, very funny.

So, let's summarize...

Non-ASCII identifiers are bad, because they are vulnerable to the exact
same problems as ASCII identifiers, only we're happy to live with those
problems if they are ASCII, and just install a font that makes I and l
look different, but we won't install a font that makes I and Ⅰ look
different, because that's too hard.

Well, you've convinced me. Obviously expecting Python programmers to cope
with something as complicated as installing a decent set of fonts is such
a major hurdle that people will abandon the language in droves, probably
taking up Haskell and Visual Basic and Lisp and all those other languages
that allow non-ASCII identifiers.

Steven D'Aprano · May 15, 2007

Any program that uses non-English identifiers in Python is bound to
become gibberish, since it *will* be cluttered with English identifiers
all over the place anyway, whether you like it or not.

It won't be gibberish to the people who speak the language.

Steven D'Aprano · May 15, 2007

That is probably because they are just entering the developmental phase
of being able to use formal operational reasoning. I can understand that
they are looking for something to put the blame on but it is an error to
give in to the idea that it is hard for 12 year olds to learn a foreign
language. You realize that children learn new languages a lot faster
than adults?

Children soak up new languages between the ages of about one and four. By
12, they're virtually adults as far as learning new languages is concerned.

Again, it's probably not the language but the formal logic they have
problems with.

You have zero evidence for that, you're just applying your own
preconceptions and ignoring what HYRY has told you.

Please do *not* conclude that some child is not very good
at math or logic or programming when they are slow at first.

You're the one saying they're having problems with logic, not HYRY. He's
saying they are having problems with English.

Steven D'Aprano · May 15, 2007

Unless you are 150% sure that there will *never* be the need for a
person who does not know your language of choice to be able to read or
modify your code, the language that "fits the environment best" is
English.

Just a touch of hyperbole perhaps?

You know, it may come as a surprise to some people that English is not
the only common language. In fact, it only ranks third, behind Mandarin
and Spanish, and just above Arabic. Although the exact number of speakers
varies according to the source you consult, the rankings are quite stable:
Mandarin, Spanish, then English. Any of those languages could equally
have claim to be the world's lingua franca.

And interestingly, with only one billion English speakers (as a first or
second language) in the world, and 5.5 billion people who don't speak
English, I think it's probably fair to say that it is a small minority
that speak English.

Steven D'Aprano · May 15, 2007

We have to have visually unique identifiers.

Well, Python has existed for years without such a requirement, so I think
"have to" is too strong a term.

Compare:

thisisareallylongbutcompletelylegalidentiferandnotvisuallyuniqueataglance

with

thisisareallylongbutcompletelylegalidentiferadnnotvisuallyuniqueataglance

I imagine, decades ago, people arguing against the introduction of long
identifiers because of the risk that their projects will be flooded with
Black Hats trying to slip one over them by using the vulnerability caused
by really long identifiers. I can just see people banging away on their
keyboard, swearing black and blue that identifiers of more than four
characters are completely unnecessary (who needs more than 450,000
variables in a program?) and will just cause the End Of Programming As We
Know It.

rn = m = None
IIl0 = IlIO = None

I'm sure that the Python community has zero sympathy for anyone
suggesting that Python should _enforce_ rules like "don't use a single l
as an identifier", even if they have complete sympathy with anybody who
has such a rule in their own projects.
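
A project that does want such a rule could enforce it itself with a small check. The sketch below is one hypothetical way to do it in Python 3: the `LOOKALIKES` table is hand-made for this example and far from complete, and `skeleton` and `confusable_pairs` are illustrative names, not part of any existing tool.

```python
from itertools import combinations

# Hypothetical, hand-made table of ASCII look-alikes. The two-character
# entry comes first so "rn" is folded before any single letters are.
LOOKALIKES = [("rn", "m"), ("I", "l"), ("1", "l"), ("0", "O")]


def skeleton(identifier):
    """Fold look-alike characters so confusable names compare equal."""
    for source, target in LOOKALIKES:
        identifier = identifier.replace(source, target)
    return identifier


def confusable_pairs(names):
    """Return pairs of distinct names that share the same skeleton."""
    return [
        (a, b)
        for a, b in combinations(sorted(set(names)), 2)
        if skeleton(a) == skeleton(b)
    ]


print(confusable_pairs(["rn", "m", "IIl0", "IlIO", "total"]))
# [('IIl0', 'IlIO'), ('m', 'rn')]
```

This catches the short examples above but not the two long identifiers with transposed letters; those would need an edit-distance comparison instead.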

Aldo Cortesi · May 15, 2007

Thus spake Steven D'Aprano:

Perhaps you aren't aware that doing something "by eye" is idiomatic
English for doing it quickly, roughly, imprecisely. It is the opposite of
taking the time and effort to do the job carefully and accurately. If you
measure something "by eye", you just look at it and take a guess.

Well, Steve, speaking as someone not entirely unfamiliar with idiomatic
English, I can say with some confidence that that's complete and utter bollocks
(idiomatic usage for "nonsense", by the way). To do something "by eye" means
nothing more nor less than doing it visually. Unless you can provide a citation
to the contrary, please move on from this petty little point of yours, and try
to make a substantial technical argument instead.

So, as I said, if you're relying on VISUAL INSPECTION (your words _now_)
you're already vulnerable. Fortunately for you, you're not relying on
visual inspection, you are actually _reading_ and _comprehending_ the
code. That might even mean, in extreme cases, you sit down with pencil
and paper and sketch out the program flow to understand what it is doing.

Please, pick up a dictionary, and look up "visual" and "inspection", then
re-read my message. Ponder the fact that visual inspection is in fact a
necessary precursor to "reading" or "comprehending" code. Now, imagine reading
a piece of code where you can never be sure that a character is what it appears
to be...

Yes, I observe two very different glyphs, as different as the ASCII
characters I and |. What do you see?

I recommend that you gain a basic understanding of the relationship between
Unicode code points and the glyphs on your screen before attempting to argue
this point again. The particular glyph your current font-set translates the
character into is irrelevant. Indeed, the fact that there is font variation
from client to client is one of the more obvious problems with your technically
illiterate hope that one could homogenize characters so that everything that
looks the same has the same meaning. Fiddle around with your fontsets a bit -
you only have to find one combination where the two glyphs look the same to
prove my case...

Regards,

Aldo

Steven D'Aprano · May 15, 2007

I've made various comments to other people's responses, so I guess it is
time to actually respond to the PEP itself.

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments to
the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as identifiers
in Python. If the PEP is accepted, the following identifiers would also
become valid as class, function, or variable names: Löffelstiel, changé,
ошибка, or 売り場 (hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural background to
evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

It seems to me that none of the objections to non-ASCII identifiers are
particularly strong. I've heard many accusations that they will introduce
"vulnerabilities", by analogy to unicode attacks in URLs, but I haven't
seen any credible explanations of how these vulnerabilities would work,
or how they are any different to existing threats. That's not to say that
there isn't a credible threat, but if there is, nobody has come close to
explaining it.

I would find it useful to be able to use non-ASCII characters for heavily
mathematical programs. There would be a closer correspondence between the
code and the mathematical equations if one could write Δ(µ*π) instead of
delta(mu*pi).

(Aside: I wonder what the Numeric crowd would say about this?)
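
As a rough illustration of the mathematical style described above, here is how it reads in Python 3, where PEP 3131 was eventually implemented. The function `Δ` and its `baseline` parameter are hypothetical placeholders, not anything from the PEP. One detail worth knowing: identifiers are NFKC-normalized, so the micro sign (U+00B5) and the Greek small letter mu (U+03BC) end up naming the same variable.

```python
import math
import unicodedata

π = math.pi


def Δ(x, baseline=0.0):
    """Hypothetical helper: change of x relative to a baseline."""
    return x - baseline


µ = 2.0   # typed with MICRO SIGN (U+00B5)
print(μ)  # read back with GREEK SMALL LETTER MU (U+03BC): 2.0

# The two spellings are the same identifier after NFKC normalization.
print(unicodedata.normalize("NFKC", "\u00b5") == "\u03bc")  # True

print(Δ(µ * π) == 2.0 * math.pi)  # True
```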

Gregor Horvath · May 15, 2007

René Fleschenberg said:
We all know what the PEP is about (we can read). The point is: If we do
not *need* non-English/ASCII identifiers, we do not need the PEP. If the
PEP does not solve an actual *problem* and still introduces some
potential for *new* problems, it should be rejected. So far, the
"problem" seems to just not exist. The burden of proof is on those who
support the PEP.

A good product does not only react to problems but acts.

Solving current problems is only one thing. Great products are exploring
new ways, ideas and possibilities according to their underlying vision.

Python has a vision of being easy even for newbies to programming.
Making it easier for non-native English speakers is a step forward in
this regard.

Gregor

Gregor Horvath · May 16, 2007

Ross said:
non-ASCII identifiers. While it's easy to find code where comments use
non-ASCII characters, I was never able to find a non-made up example
that used them in identifiers.

If comments are allowed to be non-English, then why are identifiers not?
This is inconsistent because there is a correlation between identifier
and comment.

The best identifier is one that needs no comment, because it
self-describes its content. Non-English identifiers enhance the
meaning of identifiers for some projects. So why forbid them? We are all
adults.

Gregor

Alex Martelli · May 16, 2007

Aldo Cortesi said:
Thus spake Steven D'Aprano:

Well, Steve, speaking as someone not entirely unfamiliar with idiomatic
English, I can say with some confidence that that's complete and utter
bollocks (idiomatic usage for "nonsense", by the way). To do something "by
eye" means nothing more nor less than doing it visually. Unless you can
provide a citation to the contrary, please move on from this petty little
point of yours, and try to make a substantial technical argument instead.

I can't find any reference for Steven's alleged idiomatic use of "by
eye", either -- _however_, my wife Anna (an American from Minnesota)
came up with exactly the same meaning when I asked her if "by eye" had
any idiomatic connotations, so I suspect it is indeed there, at least in
the Midwest. Funniest, of course, is that the literal translation into
Italian, "a occhio", has a similar idiomatic meaning to _any_ native
speaker of Italian -- and THAT one is even in the Italian wikipedia!-)

I'll be the first to admit that this issue has nothing to do with the
substance of the argument (on which my wife, also my co-author of the
2nd ed of the Python Cookbook and a fellow PSF member, deeply agrees
with you, Aldo, and me), but natural language nuances and curios are my
third-from-the-top most consuming interest (after programming and...
Anna herself!-).

[[_Visual inspection_ plays a crucial role in many areas of engineering,
of course; for example, visual inspection of welds is a very reliable,
although costly, quality assurance process, particularly if you ensure
that the inspectors hold the top professional degrees from the American
Welding Society (if you're operating in the USA)]].

Alex

sjdevnull · May 16, 2007

Steven said:
Then maybe you should catch up to the 21st century and install some fonts
and a modern editor.

It's not just about fonts installed on my desktop. I still do a _lot_
of debugging/code browsing remotely over terminal connections. I
still often have to sit down at someone else's machine and help them
troubleshoot, often going through the stack trace for whatever package
they're using--and I don't have control over which fonts they decide
to install. Even simple high-bit latin1 characters differ on vanilla
Windows machines vs. vanilla Linux/Mac machines. I even sometimes
read code snippets on email lists and websites from my handheld, which
is sadly still memory-limited enough that I'm really unlikely to
install anything approaching a full set of Unicode fonts.

sjdevnull · May 16, 2007

Steven said:
I've made various comments to other people's responses, so I guess it is
time to actually respond to the PEP itself.

It seems to me that none of the objections to non-ASCII identifiers are
particularly strong. I've heard many accusations that they will introduce
"vulnerabilities", by analogy to unicode attacks in URLs, but I haven't
seen any credible explanations of how these vulnerabilities would work,
or how they are any different to existing threats. That's not to say that
there isn't a credible threat, but if there is, nobody has come close to
explaining it.

I would find it useful to be able to use non-ASCII characters for heavily
mathematical programs. There would be a closer correspondence between the
code and the mathematical equations if one could write D(u*p) instead of
delta(mu*pi).

Just as one risk here:
When reading the above on Google groups, it showed up as "if one could
write ?(u*p)..."
When quoting it for response, it showed up as "could write D(u*p)".

I'm sure that the symbol you used was neither a capital letter d nor a
question mark.

Using identifiers that are so prone to corruption when posting in a
rather popular forum seems dangerous to me--and I'd guess that a lot
of source code highlighters, email lists, etc have similar problems.
I'd even be surprised if some programming tools didn't have similar
problems.
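
The kind of corruption described here is easy to reproduce. The Python 3 sketch below does not model what Google Groups actually did, which is not known from the thread; it only shows two common failure modes, assuming a channel that either falls back to ASCII or decodes UTF-8 bytes with the wrong codec.

```python
expression = "\u0394(\u00b5*\u03c0)"  # the expression from the post: Δ(µ*π)

# A channel limited to ASCII that substitutes unknown characters
# turns every non-ASCII symbol into a question mark.
ascii_view = expression.encode("ascii", errors="replace").decode("ascii")
print(ascii_view)  # ?(?*?)

# Decoding UTF-8 bytes with the wrong codec raises no error here;
# it silently produces mojibake.
name = "Ren\u00e9"
garbled = name.encode("utf-8").decode("cp1252")
print(garbled)  # RenÃ©

# The damage is reversible only while every byte survives the round trip.
print(garbled.encode("cp1252").decode("utf-8") == name)  # True
```

In the first case the original identifier is unrecoverable, which is the practical risk being raised: two different names can collapse into the same string of question marks.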
