PEP 3131: Supporting Non-ASCII Identifiers

Steven D'Aprano

I don't
want to be in a situation where I need to mechanically "clean"
code (say, from a submitted patch) with a tool because I can't
reliably verify it by eye.

But you can't reliably verify by eye. That's orders of magnitude more
difficult than debugging by eye, and we all know that you can't reliably
debug anything but the most trivial programs by eye.

If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.


We should learn from the plethora of
Unicode-related security problems that have cropped up in the last
few years.

Of course we should. And one of the things we should learn is when and
how Unicode is a risk, and not imagine that Unicode is some sort of
mystical contamination that creates security problems just by being used.


- Non-ASCII identifiers would be a barrier to code exchange. If I know Python I should be able to easily read any piece of code written in it, regardless of the linguistic origin of the author. If PEP 3131 is accepted, this will no longer be the case.

But it isn't the case now, so that's no different. Code exchange
regardless of human language is a nice principle, but it doesn't work in
practice. How do you use "any piece of code ... regardless of the
linguistic origin of the author" when you don't know what the functions
and classes and arguments _mean_?

Here's a tiny doc string from one of the functions in the standard
library, translated (more or less) to Portuguese. If you can't read
Portuguese at least well enough to get by, how could you possibly use
this function? What would you use it for? What does it do? What arguments
does it take?

def dirsorteinsercao(a, x, baixo=0, elevado=None):
    """da o artigo x insercao na lista a, e mantem-na a
    supondo classificado e classificado. Se x estiver ja em a,
    introduza-o a direita do x direita mais. Os args opcionais
    baixos (defeito 0) e elevados (len(a) do defeito) limitam
    a fatia de a a ser procurarado.
    """
    # not a non-ASCII character in sight (unless I missed one...)

[Apologies to Portuguese speakers for the dog's breakfast I'm sure Babelfish and I made of the translation.]

The particular function I chose is probably small enough and obvious
enough that you could work out what it does just by following the
algorithm. You might even be able to guess what it is, because Portuguese
is similar enough to other Latin languages that most people can guess
what some of the words might mean (elevados could be height, maybe?). Now
multiply this difficulty by a thousand for a non-trivial module with
multiple classes and dozens of methods and functions. And you might not
even know what language it is in.

No, code exchange regardless of natural language is a nice principle, but
it doesn't exist except in very special circumstances.


A Python
project that uses Urdu identifiers throughout is just as useless
to me, from a code-exchange point of view, as one written in Perl.

That's because you can't read it, not because it uses Unicode. It could
be written entirely in ASCII, and still be unreadable and impossible to
understand.


- Unicode is harder to work with than ASCII in ways that are more important in code than in human-language text. Human eyes don't care if two visually indistinguishable characters are used interchangeably. Interpreters do. There is no doubt that people will accidentally introduce mistakes into their code because of this.

That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.
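For what it's worth, such a check is easy to automate. Here is a minimal sketch (the function name is mine, not something from pylint or pychecker) that uses the standard tokenize module in modern Python to list every identifier containing non-ASCII characters, so a reviewer can inspect each one explicitly instead of trusting the eye:

```python
# Minimal sketch: flag identifiers containing non-ASCII characters.
# Requires Python 3.7+ for str.isascii().
import io
import tokenize

def non_ascii_identifiers(source):
    """Return the set of NAME tokens that contain non-ASCII characters."""
    found = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.add(tok.string)
    return found

code = "weiße_blüte = 1\nplain_name = weiße_blüte + 2\n"
print(sorted(non_ascii_identifiers(code)))  # ['weiße_blüte']
```

A project that wanted an ASCII-only policy could simply make this check fail the build when the set is non-empty.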
 
Steven D'Aprano

The compiler wouldn't execute the wrong code; it would execute the code
that the phisher intended it to execute. That might be different from
what it looked like to the reviewer.

How? Just repeating in more words your original claim doesn't explain a
thing.

It seems to me that your argument is, only slightly exaggerated, akin to
the following:

"Unicode identifiers are bad because phishers will no longer need to
write call_evil_func() but can write call_ƎvĬľ_func() instead."

Maybe I'm naive, but I don't see how giving phishers the ability to insert a call to ƒunction() in some module is any more dangerous than them inserting a call to function() instead.

If I'm mistaken, please explain why I'm mistaken, not just repeat your
claim in different words.
 
Paul Rubin

Neil Hodgson said:
C#, Java, Ecmascript, Visual Basic.

Java (and C#?) have mandatory declarations so homoglyphic identifiers aren't
nearly as bad a problem. Ecmascript is a horrible bug-prone language and
we want Python to move away from resembling it, not towards it. VB: well,
same as Ecmascript, I guess.
 
Paul Rubin

Steven D'Aprano said:
If I'm mistaken, please explain why I'm mistaken, not just repeat your
claim in different words.

if user_entered_password != stored_password_from_database:
    password_is_correct = False
...
if password_is_correct:
    log_user_in()

Does "password_is_correct" refer to the same variable in both places?
 
Steven D'Aprano

if user_entered_password != stored_password_from_database:
    password_is_correct = False
...
if password_is_correct:
    log_user_in()

Does "password_is_correct" refer to the same variable in both places?

No way of telling without a detailed code inspection. Who knows what
happens in the ... ? If a black hat has access to the code, he could
insert anything he liked in there, ASCII or non-ASCII.

How is this a problem with non-ASCII identifiers? password_is_correct is
all ASCII. How can you justify saying that non-ASCII identifiers
introduce a security hole that already exists in all-ASCII Python?
 
Paul Rubin

Steven D'Aprano said:
password_is_correct is all ASCII.

How do you know that? What steps did you take to ascertain it? Those
are steps you currently don't have to bother with.
 
John Nagle

Paul said:
Java (and C#?) have mandatory declarations so homoglyphic identifiers aren't
nearly as bad a problem. Ecmascript is a horrible bug-prone language and
we want Python to move away from resembling it, not towards it. VB: well,
same as Ecmascript, I guess.

That's the first substantive objection I've seen. In a language
without declarations, trouble is more likely. Consider the maintenance
programmer who sees a variable name and retypes it elsewhere, not realizing
the glyphs are different even though they look the same. In a language
with declarations, that generates a compile-time error. In Python, it
doesn't.
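To make the trap concrete, here is a small demonstration (the variable names are invented): a Latin 'a' and a Cyrillic 'а' render identically in most fonts, yet under PEP 3131 they name two unrelated variables, and no error of any kind results:

```python
import unicodedata

latin = "data"          # plain ASCII
cyrillic = "d\u0430ta"  # second letter is U+0430, CYRILLIC SMALL LETTER A

# The two spellings look the same but are different identifiers.
print(latin == cyrillic)              # False
print(unicodedata.name(latin[1]))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic[1]))  # CYRILLIC SMALL LETTER A

# Assigning through both names silently creates two separate variables.
namespace = {}
exec(latin + " = 1", namespace)
exec(cyrillic + " = 2", namespace)
print(namespace[latin], namespace[cyrillic])  # 1 2
```

In a declaration-based language the retyped name would be an undeclared identifier and the compiler would complain; here both assignments are perfectly legal.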

John Nagle
 
Aldo Cortesi

Thus spake Steven D'Aprano ([email protected]):
If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.

What a daft thing to say. How do YOU recognize harmful code in a patch
submission? Perhaps you blindly apply patches, and then run your test suite on
a quarantined system, with an instrumented operating system to allow you to
trace process execution, and then perform a few weeks worth of analysis on the
data?

Me, I try to understand a patch by reading it. Call me old-fashioned.

Code exchange regardless of human language is a nice principle, but it
doesn't work in practice.

And this is clearly bunk. I have come across code with transliterated identifiers and comments in a different language, and while understanding was hampered it wasn't impossible.

That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.

A typo that can't be detected visually is a fundamentally different problem from an ASCII typo, as many people in this thread have pointed out.





Regards,

Aldo
 
Martin v. Löwis

It should be noted that the Python community may use other forums, in
other languages. They would likely be a lot more enthusiastic about
this PEP than the usual crowd here (comp.lang.python).

Please spread the news.

Martin
 
Alex Martelli

Steven D'Aprano said:
automated -- if the patch uses an unexpected "#-*- coding: blah" line, or

No need -- a separate PEP (also by Martin) makes UTF-8 the default
encoding, and UTF-8 can encode any Unicode character you like.


Alex
 
Alex Martelli

Aldo Cortesi said:
Thus spake Steven D'Aprano ([email protected]):


What a daft thing to say. How do YOU recognize harmful code in a patch
submission? Perhaps you blindly apply patches, and then run your test suite on
a quarantined system, with an instrumented operating system to allow you to
trace process execution, and then perform a few weeks worth of analysis on the
data?

Me, I try to understand a patch by reading it. Call me old-fashioned.

I concur, Aldo. Indeed, if I _can't_ be sure I understand a patch, I
don't accept it -- I ask the submitter to make it clearer.

Homoglyphs would ensure I could _never_ be sure I understand a patch,
without at least running it through some transliteration tool. I don't
think the world of open source needs this extra hurdle in its path.


Alex
 
Hendrik van Rooyen

Bruno Desthuilliers said:
Martin v. Löwis wrote:

No.

Agreed - I also do not think it is a good idea
Because it will definitively make code-sharing impossible. Live with it or else, but CS is english-speaking, period. I just can't understand code with spanish or german (two languages I have notions of) identifiers, so let's not talk about other alphabets...

The understanding aside, it seems to me that the maintenance nightmare is more irritating, as you are faced with stuff you can't type on your keyboard without resorting to look-up tables and <alt> ... sequences. And then you could still be wrong, as has been pointed out for capital A and Greek alpha.

Then one should consider the effects of this on the whole issue of shared
open source python programs, as Bruno points out, before we argue that
I should not be "allowed" access to Greek, or French and German code
with umlauts and other diacritic marks, as someone else has done.

I think it is best to say nothing of Saint Cyril's script.

I think that to allow identifiers to be "native", while the rest of the
reserved words in the language remains ASCII English kind of
defeats the object of making the python language "language friendly".
It would need something like macros to enable the definition of
native language terms for things like "while", "for", "in", etc...

And we have been through the Macro thingy here, and the consensus
seemed to be that we don't want people to write their own dialects.

I think that the same arguments apply here.
NB : I'm *not* a native english speaker, I do *not* live in an english
speaking country, and my mother's language requires non-ascii encoding.
And I don't have special sympathy for the USA. And yes, I do write my
code - including comments - in english.

My case is similar, except that we are supposed to have eleven official
languages. - When my ancestors fought the English at Spion Kop*,
we could not even spell our names - and here I am defending the use of
this disease that masquerades as a language, in the interests of standardisation
of communication and ease of sharing and maintenance.

BTW - Afrikaans also has stuff like umlauts - my keyboard cannot type them
and I rarely miss it, because most of my communication is done in English.

- Hendrik

* Spion Kop is one of the few battles in history that went contrary to the
common usage whereby both sides claim victory. In this case, both sides
claimed defeat. "We have suffered a small reverse..." - Sir Redvers Buller,
who was known afterwards as Sir Reverse Buller, or the Ferryman of the
Tugela. To be fair, it was the first war with trenches in it, and nobody
knew how to handle them.
 
Jarek Zgoda

Alexander Schmolck wrote:
Who or what would force you to? Do you currently have to deal with hebrew,
russian or greek names transliterated into ASCII? I don't and I suspect this
whole panic about everyone suddenly having to deal with code written in kanji,
klingon and hieroglyphs etc. is unfounded -- such code would drastically
reduce its own "fitness" (much more so than the ASCII-transliterated chinese,
hebrew and greek code I never seem to come across), so I think the chances
that it will be thrust upon you (or anyone else in this thread) are minuscule.

I often must read code written by people using some kind of cyrillic
(Russians, Serbs, Bulgarians). "Native" names transliterated to ascii
are usual artifacts and I don't mind it.
BTW, I'm not sure if you don't underestimate your own intellectual faculties if you think you couldn't cope with greek or russian characters. On the other hand I wonder if you don't overestimate your ability to reasonably deal with code written in a completely foreign language, as long as it's ASCII -- for anything of nontrivial length, surely doing anything with such code would already be orders of magnitude harder?

While I don't have problems with some non-latin character sets, such as greek and cyrillic (I was attending school at a time when learning Russian was obligatory in Poland, and later I learned Greek), there are plenty I wouldn't be able to read, such as Hebrew, Arabic or Persian.
 
Marc 'BlackJack' Rintsch

Could you name a few? Thanks.

Haskell. AFAIK the Haskell Report says so, but the compilers didn't support it last time I tried. :)

Ciao,
Marc 'BlackJack' Rintsch
 
Neil Hodgson

Martin v. Löwis:
This PEP suggests to support non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

I support this to ease integration with other languages and
platforms that allow non-ASCII letters to be used in identifiers. Python
has a strong heritage as a glue language and this has been enabled by
adapting to the features of various environments rather than trying to
assert a Pythonic view of how things should work.

Neil
 
Eric Brunel

On Sun, 13 May 2007 21:10:46 +0200, Stefan Behnel wrote:
Now, I am not a strong supporter (most public code will use English
identifiers anyway)

How will you guarantee that? I'm quite convinced that most of the public
code today started its life as private code earlier...
So, introducing non-ASCII identifiers is just a small step further. Disallowing this does *not* guarantee in any way that identifiers are understandable for English native speakers. It only guarantees that identifiers are always *typable* by people who have access to latin characters on their keyboard. A rather small advantage, I'd say.

I would certainly not qualify that as "rather small". There have been
quite a few times where I had to change some public code. If this code had
been written in a character set that did not exist on my keyboard, the
only possibility would have been to copy/paste every identifier I had to
type. Have you ever tried to do that? It's actually quite simple to test
it: just remove on your keyboard a quite frequent letter ('E' is a good
candidate), and try to update some code you have at hand. You'll see that
it takes 4 to 5 times longer than writing the code directly, because you
always have to switch between keyboard and mouse far too often. In
addition to the unnecessary movements, it also completely breaks your
concentration. Typing foreign words transliterated to english actually
does take longer than typing "proper" english words, but at least, it can
be done, and it's still faster than having to copy/paste everything.

So I'd say that it would be a major drawback for code sharing, which - if
I'm not mistaken - is the basis for the whole open-source philosophy.
 
Eric Brunel

Martin v. Löwis wrote:

Because it will definitively make code-sharing impossible. Live with it or else, but CS is english-speaking, period. I just can't understand code with spanish or german (two languages I have notions of) identifiers, so let's not talk about other alphabets...

+1 on everything.
NB : I'm *not* a native english speaker, I do *not* live in an english
speaking country,

.... and so am I (and this happens to be the same country as Bruno's...)
and my mother's language requires non-ascii encoding.

.... and so does my wife's (she's Japanese).
And I don't have special sympathy for the USA. And yes, I do write my
code - including comments - in english.

Again, +1. Even when writing code that appears to be "private" at some
time, one *never* knows what will become of it in the future. If it ever
goes public, its chances to evolve - or just to be maintained - are far
bigger if it's written all in english.
 
Stefan Behnel

Eric said:
Even when writing code that appears to be "private" at some
time, one *never* knows what will become of it in the future. If it ever
goes public, its chances to evolve - or just to be maintained - are far
bigger if it's written all in english.

--python -c "print ''.join([chr(154 - ord(c)) for c in
'U(17zX(%,5.zmz5(17l8(%,5.Z*(93-965$l7+-'])"

Oh well, why did *that* code ever go public?

Stefan
 
Stefan Behnel

Eric said:
On Sun, 13 May 2007 21:10:46 +0200, Stefan Behnel wrote:


How will you guarantee that? I'm quite convinced that most of the public
code today started its life as private code earlier...

Ok, so we're back to my original example: the problem here is not the
non-ASCII encoding but the non-english identifiers.

If we move the problem to a pure unicode naming problem:

How likely is it that it's *you* (lacking a native, say, kanji keyboard) who
ends up with code that uses identifiers written in kanji? And that you are the
only person who is now left to do the switch to an ASCII transliteration?

Any chance there are still kanji-enabled programmers around who were not hit by the bomb in this scenario? They might still be able to help you get the code "public".

Stefan
 
Stefan Behnel

Alex said:
I concur, Aldo. Indeed, if I _can't_ be sure I understand a patch, I
don't accept it -- I ask the submitter to make it clearer.

Homoglyphs would ensure I could _never_ be sure I understand a patch,
without at least running it through some transliteration tool. I don't
think the world of open source needs this extra hurdle in its path.

But then, where's the problem? Just stick to accepting only patches that are
plain ASCII *for your particular project*. And if you want to be sure, put an
ASCII encoding header in all source files (which you want to do anyway, to
prevent the same problem with string constants).

The PEP is only arguing to support this decision at a per-project level rather
than forbidding it at the language level. This makes sense as it moves the
power into the hands of those people who actually use it, not those who
designed the language.
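Such a per-project policy could even be enforced mechanically. A hypothetical sketch (the function and its policy are mine, not part of the PEP): reject a patch if any variable name contains non-ASCII characters, using the stdlib ast module. A real gate would also need to cover function, class, and argument names:

```python
import ast

def assert_ascii_names(source, filename="<patch>"):
    """Raise ValueError if any variable name in `source` is not pure ASCII.

    Sketch only: checks ast.Name nodes, i.e. plain variable references;
    a complete gate would also inspect function/class/argument names.
    """
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.Name) and not node.id.isascii():
            raise ValueError(
                f"{filename}:{node.lineno}: non-ASCII identifier {node.id!r}")

assert_ascii_names("x = 1\ny = x + 2\n")  # passes silently
```

Hooked into a project's patch-review or commit process, this keeps the restriction where the PEP puts it: a local decision, not a language-level prohibition.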

Stefan
 
