PEP 3131: Supporting Non-ASCII Identifiers


Guest

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural backgrounds
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

Regards,
Martin


PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision: 55059 $
Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
Author: Martin v. Löwis <[email protected]>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:


Abstract
========

This PEP suggests supporting non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system. Such developers often desire to define classes and
functions with names in their native languages, rather than having to
come up with an (often incorrect) English translation of the concept
they want to name.

For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems). For other languages, users have
greater difficulty using Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if, in order to
do so, they have to use characters they cannot type on their keyboards.
However, it is the choice of the designer of the library to decide on
various constraints for using the library: people may not be able to use
the library because they cannot get physical access to the source code
(because it is not published), or because licensing prohibits usage, or
because the documentation is in a language they cannot understand. A
developer wishing to make a library widely available needs to make a
number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). It should always be the
choice of the author to make these decisions - not the choice of the
language designers.

In particular, projects wishing to have wide usage might want to
establish a policy that all identifiers, comments, and documentation
are written in English (see the GNU coding style guide for an example
of such a policy). Restricting the language to ASCII-only identifiers
does not force comments and documentation to be in English, or the
identifiers actually to be English words, so an additional policy is
necessary anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), and connector punctuations (Pc).

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
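The classification above can be sketched in a few lines of Python using the ``unicodedata`` module. This is a hypothetical illustration of the proposed rules, not the actual parser code; the function and set names are invented for this sketch:

```python
import unicodedata

# General categories from the PEP; the underscore is handled explicitly.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}

def is_identifier(s):
    # Normalize to NFC first, as the PEP specifies for parsing.
    s = unicodedata.normalize("NFC", s)
    if not s:
        return False
    first = s[0]
    if first != "_" and unicodedata.category(first) not in ID_START_CATEGORIES:
        return False
    return all(
        c == "_" or unicodedata.category(c) in ID_CONTINUE_CATEGORIES
        for c in s[1:]
    )

print(is_identifier("Löffelstiel"))  # True
print(is_identifier("ошибка"))       # True
print(is_identifier("2fast"))        # False: a digit cannot start an identifier
```

Note that this sketch deliberately ignores the open "stability extensions" question mentioned above.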

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

As an option, this specification can be applied to Python 2.x. In that
case, ASCII-only identifiers would continue to be represented as byte
string objects in namespace dictionaries; identifiers with non-ASCII
characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the
source code, a forward scan is made to find the first ASCII
non-identifier character (e.g. a space or punctuation character).

2. The entire UTF-8 string is passed to a function to normalize the
string to NFC, and then verify that it follows the identifier syntax.
No such callout is made for pure-ASCII identifiers, which continue to
be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
(such as pydoc) must be verified to continue to work when Unicode
strings appear in ``__dict__`` slots as keys.
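Steps 1 and 2 can be outlined as follows. This is only an illustrative sketch of the described forward scan and NFC callout; the function name and structure are invented here, and the syntax verification step is omitted:

```python
import unicodedata

# ASCII characters that may appear in an identifier (as in Python 2.5).
ASCII_ID_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
)

def scan_identifier(source, pos):
    """Forward-scan from pos to the first ASCII non-identifier character,
    then normalize the candidate identifier to NFC."""
    end = pos
    while end < len(source):
        c = source[end]
        # Stop at the first ASCII character that cannot be part of an
        # identifier (e.g. a space or punctuation character).
        if ord(c) < 128 and c not in ASCII_ID_CHARS:
            break
        end += 1
    candidate = unicodedata.normalize("NFC", source[pos:end])
    return candidate, end

name, end = scan_identifier("ошибка = 1", 0)
print(name)  # ошибка
print(end)   # 6
```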

References
==========

.. [1] http://www.unicode.org/reports/tr31/


Copyright
=========

This document has been placed in the public domain.
 
dustin

- should non-ASCII identifiers be supported? why?

The only objection that comes to mind is that adding such support may
make some distinct identifiers visually indistinguishable. IIRC the DNS
system has had this problem, leading to much phishing abuse.

I don't necessarily think that the objection is strong enough to reject
the idea -- programmers using non-ASCII symbols would be responsible for
the consequences of their character choice.

Dustin
 
Martin v. Löwis

The only objection that comes to mind is that adding such support may
make some distinct identifiers visually indistinguishable. IIRC the DNS
system has had this problem, leading to much phishing abuse.

This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).

People have mentioned that this could be used to obscure your code - but
there are so many ways to write obscure code that I don't see a problem
in adding yet another way.

People also mentioned that they might mistake identifiers in a regular,
non-phishing, non-joking scenario, because they can't tell whether the
second letter of MAXLINESIZE is a Latin A or Greek Alpha. I find that
hard to believe - if the rest of the identifier is Latin, the A surely
also is Latin, and if the rest is Greek, it's likely an Alpha. The issue
is only with single-letter identifiers, and those are most common
as local variables. Then, it's an Alpha if there is also a Beta and
a Gamma as a local variable - if you have B and C also, it's likely A.
I don't necessarily think that the objection is strong enough to reject
the idea -- programmers using non-ASCII symbols would be responsible for
the consequences of their character choice.

Indeed.

Martin
 
Guest

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural backgrounds
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

I used to think differently. However, I would say a strong YES. They
would be extremely useful when teaching programming.
- would you use them if it was possible to do so? in what cases?

Only if I was teaching native French speakers.
Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

I would add something like:

Any module released for general use SHOULD use ASCII-only identifiers
in the public API.

Thanks for this initiative.

André
 
John Nagle

Martin said:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.

That may not be restrictive enough, because it permits multiple
different lexical representations of the same identifier in the same
text. Search and replace operations on source text might not find
all instances of the same identifier. Identifiers should be required
to be written in source text with a unique source text representation,
probably NFC, or be considered a syntax error.
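As a minimal sketch of the ambiguity described here (using Python's ``unicodedata`` module): the same identifier ``changé`` can be written with either a precomposed code point or a decomposed base letter plus combining accent, and only normalization makes the two spellings compare equal:

```python
import unicodedata

nfc = "chang\u00e9"   # "changé" with one precomposed code point, U+00E9
nfd = "change\u0301"  # "changé" as "e" + U+0301 COMBINING ACUTE ACCENT

print(nfc == nfd)                                # False: distinct source text
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: same identifier after NFC
```

A plain text search for one spelling would miss occurrences written in the other, which is exactly the search-and-replace hazard raised above.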

I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

John Nagle
 
Paul Rubin

Martin v. Löwis said:
So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

No, and especially no without mandatory declarations of all variables.
Look at the problems of non-ascii characters in domain names and the
subsequent invention of Punycode. Maintaining code that uses those
identifiers in good faith will already be a big enough hassle, since
it will require installing and getting familiar with keyboard setups
and editing tools needed to enter those characters. Then there's the
issue of what happens when someone tries to slip a malicious patch
through a code review on purpose, by using homoglyphic characters
similar to the way domain name phishing works. Those tricks have also
been used to re-insert bogus articles into Wikipedia, circumventing
administrative blocks on the article names.
- would you use them if it was possible to do so? in what cases?

I would never insert them into a program. In existing programs where
they were used, I would remove them everywhere I could.
 
Guest

That may not be restrictive enough, because it permits multiple
different lexical representations of the same identifier in the same
text. Search and replace operations on source text might not find
all instances of the same identifier. Identifiers should be required
to be written in source text with a unique source text representation,
probably NFC, or be considered a syntax error.

I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

John Nagle

Python keywords MUST be in ASCII ... so the above restriction can't
work. Unless the restriction is removed (which would be a separate
PEP).

André
 
Paul Rubin

Martin v. Löwis said:
This is a commonly-raised objection, but I don't understand why people
see it as a problem. The phishing issue surely won't apply, as you
normally don't "click" on identifiers, but rather type them. In a
phishing case, it is normally difficult to type the fake character
(because the phishing relies on you mistaking the character for another
one, so you would type the wrong identifier).

It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.
 
André

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

It should be noted that the Python community may use other forums, in
other languages. They would likely be a lot more enthusiastic about
this PEP than the usual crowd here (comp.lang.python).

André
 
Anton Vredegoor

Martin said:
In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.

Some time ago there was a discussion about introducing macros into the
language. Among the reasons why macros were excluded was precisely
because anyone could start writing their own kind of dialect of Python
code, resulting in less people being able to read what other programmers
wrote. And that last thing: 'Being able to easily read what other people
wrote' (sometimes that 'other people' is yourself half a year later, but
that isn't relevant in this specific case) is one of the main virtues in
the Python programming community. Correct me if I'm wrong please.

At that time I was considering giving up some user conformity because
the very powerful syntax extensions would make Python rival Lisp. It's
worth sacrificing something if one gets some other thing in return.

However since then we have gained metaclasses, iterators and generators
and even a C-like 'if' construct. Personally I'd also like to have a
'repeat-until'. These things are enough to keep us busy for a long time
and in some respects this new syntax is even more powerful/dangerous
than macros. But most importantly these extra burdens on the ease with
which one is to read code are offset by gaining more expressiveness in
the *coding* of scripts.

While I have little doubt that in the end some stubborn mathematician or
Frenchman will succeed in writing a preprocessor that would enable him
to indoctrinate his students into his specific version of reality, I see
little reason to actively endorse such foolishness.

The last argument I'd like to make is about the very possible reality
that in a few years the Internet will be dominated by the Chinese
language instead of by the English language. As a Dutchman I have no
special interest in English being the language of the Internet but
-given the status quo- I can see the advantages of everyone speaking the
*same* language. If it be Chinese, Chinese I will start to learn,
however inept I might be at it at first.

That doesn't mean however that one should actively open up to a kind of
contest as to which language will become the main language! On the
contrary one should hold out as long as possible to the united group one
has instead of dispersing into all kinds of experimental directions.

Do we harm the Chinese in this way, one might ask, by making it harder
for them to gain access to the net? Do we harm ourselves by not opening up
in time to the new status quo? Yes, in a way these are valid points, but
one should not forget that more advanced countries also have a
responsibility to lead the way by providing an example, one should not
think too lightly about that.

Anyway, I feel that it will not be possible to hold off these
developments in the long run, but great beneficial effects can still be
attained by keeping the language as simple and expressive as possible
and to adjust to new realities as soon as one of them becomes undeniably
apparent (which is something entirely different than enthusiastically
inviting them in and let them fight it out against each other in your
own house) all the time taking responsibility to lead the way as long as
one has any consensus left.

A.
 
Stefan Behnel

Anton said:
I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.


We must distinguish between "identifiers named in a non-English language" and
"identifiers written with non-ASCII characters".

While the first is already allowed as long as the transcription uses only
ASCII characters, the second is currently forbidden and is what this PEP is about.

So, nothing currently keeps you from giving names to identifiers that are
impossible to understand by, say, Americans (ok, that's easy anyway).

For example, I could write

def zieheDreiAbVon(wert):
    return zieheAb(wert, 3)

and most people on earth would not have a clue what this is good for. However,
someone who is fluent enough in German could guess from the names what this does.

I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.

Stefan
 
Jarek Zgoda

Martin v. Löwis wrote:
So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

No, because "programs must be written for people to read, and only
incidentally for machines to execute". Using anything other than the "lowest
common denominator" (ASCII) will restrict the accessibility of code. This is
not literature, which requires qualified translators to get the text
from Hindi (or Persian, or Chinese, or Georgian, or...) to Polish.

While I can read the code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in native.
For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems). For other languages, users have
greater difficulty using Latin to write their native words.

This is one of least disturbing difficulties when it comes to programming.
 
Stefan Behnel

Martin said:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural backgrounds
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?


To make it clear: this PEP considers "identifiers written with non-ASCII
characters", not "identifiers named in a non-English language".

While the latter is already allowed as long as the transcription uses only
ASCII characters, the former is currently forbidden and is what this PEP is about.

Now, I am not a strong supporter (most public code will use English
identifiers anyway) but we should not forget that Python supports encoding
declarations in source files and thus has much cleaner support for non-ASCII
source code than, say, Java. So, introducing non-ASCII identifiers is just a
small step further. Disallowing this does *not* guarantee in any way that
identifiers are understandable for English native speakers. It only guarantees
that identifiers are always *typable* by people who have access to latin
characters on their keyboard. A rather small advantage, I'd say.

The capability of a Unicode-aware language to express non-English identifiers
in a non-ASCII encoding totally makes sense to me.

Stefan
 
Stefan Behnel

Paul said:
I would never insert them into a program. In existing programs where
they were used, I would remove them everywhere I could.

Luckily, you will never be able to touch every program in the world.

Stefan
 
Stefan Behnel

Jarek said:
Martin v. Löwis wrote:

Uuups, is that a non-ASCII character in there? Why don't you keep them out of
an English speaking newsgroup?

No, because "programs must be written for people to read, and only
incidentally for machines to execute". Using anything other than "lowest
common denominator" (ASCII) will restrict accessibility of code.

No, but it would make it a lot easier for a lot of people to use descriptive
names. Remember: we're all adults here, right?

While I can read the code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in native.

Then maybe it was code that was not meant to be read by you?

In the (not so small) place where I work, we tend to use descriptive names *in
German* for the code we write, mainly for reasons of domain clarity. The
*only* reason why we still use the (simple but ugly) ASCII-transcription
(ü->ue etc.) for identifiers is that we program in Java and Java lacks a
/reliable/ way to support non-ASCII characters in source code. Thanks to PEP
263 and 3120, Python does not suffer from this problem, but it suffers from
the bigger problem of not *allowing* non-ASCII characters in identifiers. And
I believe that's a rather arbitrary decision.

The more I think about it, the more I believe that this restriction should be
lifted. 'Any' non-ASCII identifier should be allowed where developers decide
that it makes sense.

Stefan
 
Michael Torrie

For example, I could write

def zieheDreiAbVon(wert):
    return zieheAb(wert, 3)

and most people on earth would not have a clue what this is good for. However,
someone who is fluent enough in German could guess from the names what this does.

I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.

I think non-ASCII characters make the problem far, far worse. While I
may not understand what the function does from its name in your example,
allowing non-ASCII characters makes it worse by forcing all would-be
code readers to have all kinds of fonts installed just to view the
source code. Things like reporting exceptions, too. At least in your
example I know the exception occurred in zieheDreiAbVon. But if that
identifier is some UTF-8 string, how do I go about finding it in my text
editor, or even reporting the message to the developers? I don't happen
to have that particular keymap installed on my Linux system, so I can't
even type the letters!

So given that people can already transliterate their language for use as
identifiers, I think avoiding non-ASCII character sets is a good idea.
ASCII is simply the lowest common denominator and is supported by *all*
configurations and locales on all developers' systems.
 
Josiah Carlson

Stefan said:
Anton said:
I am against this PEP for the following reasons:

It will split up the Python user community into different language or
interest groups without having any benefit as to making the language
more expressive in an algorithmic way.

We must distinguish between "identifiers named in a non-english language" and
"identifiers written with non-ASCII characters". [snip]
I do not think non-ASCII characters make this 'problem' any worse. So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.

Really? Because when I am reading source code, even if a particular
variable *name* is a sequence of characters that I cannot identify as a
word that I know, I can at least spell it out using Latin characters, or
perhaps even attempt to pronounce it (verbalization of a word, even if
it is an incorrect verbalization, I find helps me to remember a variable
and use it later).

On the other hand, the introduction of some 60k+ valid unicode glyphs
into the set of characters that can be seen as a name in Python would
make any such attempts by anyone who is not a native speaker (and even
native speakers in the case of the more obscure Kanji glyphs) an
exercise in futility.

As it stands, people who use Python (and the vast majority of other
programming languages) learn the 52 upper/lowercase variants of the
latin alphabet (and sometimes the 0-9 number characters for some parts
of the world). That's it. 62 glyphs at the worst. But a huge portion
of these people have already been exposed to these characters through
school, the internet, etc., and this isn't likely to change (regardless
of the 'impending' Chinese population dominance on the internet).

Indeed, the lack of the 60k+ glyphs as valid name characters can make
the teaching of Python to groups of people that haven't been exposed to
the Latin alphabet more difficult, but those people who are exposed to
programming are also typically exposed to the internet, on which Latin
alphabets dominate (never mind that html tags are Latin characters, as
are just about every daemon configuration file, etc.). Exposure to the
Latin alphabet isn't going to go away, and Python is very unlikely to be
the first exposure programmers have to the Latin alphabet (except for
OLPC, but this PEP is about a year late to the game to change that).
And even if Python *is* the first time children or adults are exposed to
the Latin alphabet, one would hope that 62 characters to learn to 'speak
the language of Python' is a small price to pay to use it.

Regarding different characters sharing the same glyphs, it is a problem.
Say that you are importing a module written by a mathematician that
uses an actual capital Greek alpha for a name. When a user sits down to
use it, they could certainly get NameErrors, AttributeErrors, etc., and
never understand why it is the case. Their fancy-schmancy unicode
enabled terminal will show them what looks like the Latin A, but it will
in fact be the Greek Α. Until they copy/paste, check its ord(), etc.,
they will be baffled. It isn't a problem now because A = Α is a syntax
error, but it can and will become a problem if it is allowed to.
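The confusion described here is easy to demonstrate in a short, hypothetical snippet: the two characters render identically in most fonts, but are entirely distinct to the interpreter:

```python
import unicodedata

latin_a = "A"           # U+0041 LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # U+0391 GREEK CAPITAL LETTER ALPHA

print(latin_a == greek_alpha)           # False
print(ord(latin_a), ord(greek_alpha))   # 65 913
print(unicodedata.name(greek_alpha))    # GREEK CAPITAL LETTER ALPHA
```

Until the baffled user thinks to check ``ord()`` or ``unicodedata.name()``, the two names are indistinguishable on screen.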

But this issue isn't limited to different characters sharing glyphs!
It's also about being able to type names to use them in your own code
(generally very difficult if not impossible for many non-Latin
characters), or even be able to display them. And no number of
guidelines, suggestions, etc., against distributing libraries with
non-Latin identifiers will stop it from happening, and *will* fragment
the community as Anton (and others) have stated.

- Josiah
 
Stefan Behnel

Josiah said:
It's also about being able to type names to use them in your own code
(generally very difficult if not impossible for many non-Latin
characters), or even be able to display them. And no number of
guidelines, suggestions, etc., against distributing libraries with
non-Latin identifiers will stop it from happening, and *will* fragment
the community as Anton (and others) have stated.

Ever noticed how the community is already fragmented into people working on
project A and people not working on project A? Why shouldn't the people
working on project A agree what language they write and spell their
identifiers in? And don't forget about project B, C, and all the others.

I agree that code posted to comp.lang.python should use English identifiers
and that it is worth considering using English identifiers in open source
code that is posted to a public OS project site. Note that I didn't say "ASCII
identifiers" but plain English identifiers. All other code should use the
language and encoding that fits its environment best.

Stefan
 
Terry Reedy

| For example, I could write
|
| def zieheDreiAbVon(wert):
|     return zieheAb(wert, 3)
|
| and most people on earth would not have a clue what this is good for.
| However, someone who is fluent enough in German could guess from the
| names what this does.
|
| I do not think non-ASCII characters make this 'problem' any worse.

It is ridiculous claims like this and the consequent refusal to admit,
address, and ameliorate the 50x worse problems that would be introduced
that lead me to oppose the PEP in its current form.

Terry Jan Reedy
 
