PEP 3131: Supporting Non-ASCII Identifiers


MRAB

I think non-ASCII characters make the problem far, far worse. While I
may not understand what the function is by its name in your example,
allowing non-ASCII characters makes it worse by forcing all would-be
code readers to have all kinds of fonts installed just to view the
source code. The same goes for reporting exceptions. At least in your
example I know the exception occurred in zieheDreiAbVon. But if that
identifier is some UTF-8 string, how do I go about finding it in my text
editor, or even reporting the message to the developers? I don't happen
to have that particular keymap installed on my Linux system, so I can't
even type the letters!

So given that people can already transliterate their language for use as
identifiers, I think avoiding non-ASCII character sets is a good idea.
ASCII is simply the lowest common denominator and is supported by *all*
configurations and locales on all developers' systems.
Perhaps there could be the option of typing and showing characters as
\uxxxx, e.g. \u00FC instead of ü (u-umlaut), or showing them in a
different colour if they're not in a specified set.
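
A tool could do that escaping with something like this minimal sketch
(the function name is just illustrative):

def escape_non_ascii(name):
    # Render each non-ASCII character as \uXXXX (or \UXXXXXXXX beyond
    # the Basic Multilingual Plane); escape_non_ascii("über") gives
    # "\u00FCber".
    def escape(ch):
        if ord(ch) < 128:
            return ch
        if ord(ch) <= 0xFFFF:
            return "\\u%04X" % ord(ch)
        return "\\U%08X" % ord(ch)
    return "".join(escape(ch) for ch in name)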
 

Bruno Desthuilliers

Martin v. Löwis wrote:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural backgrounds
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported?
No.

why?

Because it will definitively make code-sharing impossible. Live with it
or else, but CS is English-speaking, period. I just can't understand
code with Spanish or German (two languages I have some notions of)
identifiers, so let's not talk about other alphabets...

NB: I'm *not* a native English speaker, I do *not* live in an
English-speaking country, and my mother tongue requires a non-ASCII
encoding. And I don't have special sympathy for the USA. And yes, I do
write my code - including comments - in English.
 

Virgil Dupras

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural backgrounds
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?
- would you use them if it was possible to do so? in what cases?

Regards,
Martin

PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision: 55059 $
Last-Modified: $Date: 2007-05-01 22:34:25 +0200 (Di, 01 Mai 2007) $
Author: Martin v. Löwis <[email protected]>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:

Abstract
========

This PEP proposes support for non-ASCII letters (such as accented
characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system.  Such developers often desire to define classes and
functions with names in their native languages, rather than having to
come up with an (often incorrect) English translation of the concept
they want to name.

For some languages, common transliteration systems exist (in particular,
for the Latin-based writing systems).  For other languages, users have
great difficulty using Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if, to do so,
they have to use characters they cannot type on their keyboards.
However, it is the choice of the designer of the library to decide on
various constraints for using the library: people may not be able to use
the library because they cannot get physical access to the source code
(because it is not published), or because licensing prohibits usage, or
because the documentation is in a language they cannot understand.  A
developer wishing to make a library widely available needs to make a
number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers).  It should always be the
choice of the author to make these decisions - not the choice of the
language designers.

In particular, projects wishing to have wide usage may want to
establish a policy that all identifiers, comments, and documentation
are written in English (see the GNU coding style guide for an example
of such a policy). Restricting the language to ASCII-only identifiers
does not force comments and documentation to be English, or the
identifiers actually to be English words, so an additional policy is
necessary, anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5.  This specification only
introduces additional characters from outside the ASCII range.  For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal numbers
(Nd), and connector punctuation (Pc).

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.
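
A rough illustration (not part of this specification) of the intended
rule, using the ``unicodedata`` module::

    import unicodedata

    ID_START_CATS = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
    ID_CONTINUE_CATS = ID_START_CATS | {"Mn", "Mc", "Nd", "Pc"}

    def is_valid_identifier(name):
        # Compare in NFC, so composed and decomposed spellings of,
        # say, "Löffelstiel" denote the same identifier.
        name = unicodedata.normalize("NFC", name)
        if not name:
            return False
        # ID_Start also includes the underscore.
        if name[0] != "_" and unicodedata.category(name[0]) not in ID_START_CATS:
            return False
        return all(unicodedata.category(ch) in ID_CONTINUE_CATS
                   for ch in name[1:])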

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible.

As an option, this specification can be applied to Python 2.x.  In that
case, ASCII-only identifiers would continue to be represented as byte
string objects in namespace dictionaries; identifiers with non-ASCII
characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the
   source code, a forward scan is made to find the first ASCII
   non-identifier character (e.g. a space or punctuation character).

2. The entire UTF-8 string is passed to a function to normalize the
   string to NFC, and then verify that it follows the identifier syntax.
   No such callout is made for pure-ASCII identifiers, which continue to
   be parsed the way they are today.

3. If this specification is implemented for 2.x, reflective libraries
   (such as pydoc) must be verified to continue to work when Unicode
   strings appear in ``__dict__`` slots as keys.

References
==========

.. [1] http://www.unicode.org/reports/tr31/

Copyright
=========

This document has been placed in the public domain.

I don't think that supporting non-ASCII characters for identifiers
would cause any problem. Most people won't use it anyway. People who
use non-English identifiers for their project and hope for it to be
popular worldwide will probably just fail because of their foolish
coding style policy choice. I put that kind of choice in the same
ballpark as deciding to use Hungarian notation for Python code.

As for malicious patch submission, I think this is a non-issue. A tool
to detect any non-ASCII identifier in a file would be a trivial script
to write.
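
For instance, a minimal sketch in modern Python (the function name is
just illustrative):

import sys
import tokenize

def report_non_ascii_identifiers(path):
    # Flag every NAME token containing a character outside ASCII.
    with open(path, "rb") as f:
        for tok in tokenize.tokenize(f.readline):
            if tok.type == tokenize.NAME and any(ord(ch) > 127
                                                 for ch in tok.string):
                print("%s:%d:%d: non-ASCII identifier %r"
                      % (path, tok.start[0], tok.start[1], tok.string))

if __name__ == "__main__":
    for filename in sys.argv[1:]:
        report_non_ascii_identifiers(filename)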

I say that if there is a demand for it, let's do it.
 

Bruno Desthuilliers

Stefan Behnel wrote:
We must distinguish between "identifiers named in a non-english language" and
"identifiers written with non-ASCII characters".

While the first is already allowed as long as the transcription uses only
ASCII characters, the second is currently forbidden and is what this PEP is about.

So, nothing currently keeps you from giving names to identifiers that are
impossible to understand by, say, Americans (ok, that's easy anyway).

For example, I could write

def zieheDreiAbVon(wert):
    return zieheAb(wert, 3)
and most people on earth would not have a clue what this is good for.

Which is exactly why I don't agree with adding support for non-ASCII
identifiers. Using non-English identifiers should be strongly
discouraged, not openly supported.
However,
someone who is fluent enough in German could guess from the names what this does.

I do not think non-ASCII characters make this 'problem' any worse.

It does, by openly stating that it's ok to write unreadable code and
offering support for it.
So I must
ask people to restrict their comments to the actual problem that this PEP is
trying to solve.

Sorry, but we can't dismiss the side-effects. Learning enough
CS-oriented technical English to actually read and write code and
documentation is not such a big deal - even I managed to do so, and I'm
a bit impaired when it comes to foreign languages.
 

Alexander Schmolck

Jarek Zgoda said:
Martin v. Löwis wrote:


No, because "programs must be written for people to read, and only
incidentally for machines to execute". Using anything other than the
lowest common denominator (ASCII) will restrict the accessibility of
code. This is not literature, which requires qualified translators to
get the text from Hindi (or Persian, or Chinese, or Georgian, or...)
into Polish.

While I can read code with Hebrew, Russian or Greek names
transliterated to ASCII, I would not be able to read such code in its
native script.

Who or what would force you to? Do you currently have to deal with
Hebrew, Russian or Greek names transliterated into ASCII? I don't, and
I suspect this whole panic about everyone suddenly having to deal with
code written in kanji, Klingon, hieroglyphs etc. is unfounded -- such
code would drastically reduce its own "fitness" (much more so than the
ASCII-transliterated Chinese, Hebrew and Greek code I never seem to
come across), so I think the chances that it will be thrust upon you
(or anyone else in this thread) are minuscule.


Plenty of programming languages already support unicode identifiers, so if
there is any rational basis for this fear it shouldn't be hard to come up with
-- where is it?

'as

BTW, I'm not sure you aren't underestimating your own intellectual
faculties if you think you couldn't cope with Greek or Russian
characters. On the other hand, I wonder if you aren't overestimating
your ability to reasonably deal with code written in a completely
foreign language as long as it's ASCII -- for anything of nontrivial
length, surely doing anything with such code would already be orders of
magnitude harder?
 

Bruno Desthuilliers

Stefan Behnel wrote:
To make it clear: this PEP considers "identifiers written with non-ASCII
characters", not "identifiers named in a non-english language".

You cannot just claim that these are two totally distinct issues and
get away with it. The fact is that non-English identifiers are already
a bad thing when it comes to sharing and cooperation, and non-ASCII
glyphs can only make things worse - since it's obvious that people
willing to use such a "feature" *won't* use it to spell English
identifiers anyway.
While the first is already allowed as long as the transcription uses only
ASCII characters, the second is currently forbidden and is what this PEP is about.

Now, I am not a strong supporter (most public code will use English
identifiers anyway) but we should not forget that Python supports encoding
declarations in source files and thus has much cleaner support for non-ASCII
source code than, say, Java. So, introducing non-ASCII identifiers is just a
small step further.

I would certainly not qualify this as a "small" step.
Disallowing this does *not* guarantee in any way that
identifiers are understandable for English native speakers.

I'm not an English native speaker. And there's more than a subtle
distinction between "not guaranteeing" and "encouraging".
It only guarantees
that identifiers are always *typable* by people who have access to
Latin characters on their keyboard. A rather small advantage, I'd say.

The capability of a Unicode-aware language to express non-English identifiers
in a non-ASCII encoding totally makes sense to me.

It does of course make sense (at least if you add support for
non-English, non-ASCII translation of the *whole* language - keywords,
builtins and the whole standard lib included). But it's still a very
bad idea IMHO.
 

Anders J. Munch

Josiah said:
On the other hand, the introduction of some 60k+ valid Unicode glyphs
into the set of characters that can be seen as a name in Python would
make any such attempts by anyone who is not a native speaker (and even
native speakers in the case of the more obscure Kanji glyphs) an
exercise in futility.

So you gather up a list of identifiers and send them out for
translation. Having actual Kanji glyphs instead of a mix of
transliterations and bad English will only make that easier.

That won't even cost you anything, since you were already having docstrings
translated, along with comments and documentation, right?
But this issue isn't limited to different characters sharing glyphs!
It's also about being able to type names to use them in your own code
(generally very difficult if not impossible for many non-Latin
characters), or even being able to display them.

For display, tell your editor the UTF-8 source file is really Latin-1.
For entry, copy and paste.

- Anders
 

Alex Martelli

Bruno Desthuilliers said:
I'm not an English native speaker. And there's more than a subtle
distinction between "not garantying" and "encouraging".

I agree with Bruno and the many others who have expressed disapproval
for this idea -- and I am not an English native speaker, either (and
neither, it seems to me, are many others who dislike this PEP). The
mild pleasure of using accented letters in code "addressed strictly to
Italian-speaking audiences and never intended to be of any use to
anybody not speaking Italian" (should I ever desire to write such code)
pales in comparison with the disadvantages, many of which have already
been analyzed or at least mentioned.

Homoglyphic characters _introduced by accident_ should not be
discounted as a risk, as, it seems to me, was done early in this thread
after the issue had been mentioned. In the past, I have erroneously
introduced such homoglyphs into a document I was preparing with a word
processor, through a slight error in the use of the system-provided way
of inserting characters not present on the keyboard; I found out later
when I went looking for the name I _thought_ I had input (but I was
looking for it spelled with the "right" glyph, not the one I had
actually used, which looked just the same) and just could not find it.

On that occasion, suspecting I had mistyped in some way or other, I
patiently tried looking for "pieces" of the word in question,
eventually locating it with just a mild amount of aggravation when I
finally tried a piece without the offending character. But when
something similar happens to somebody using a sufficiently fancy text
editor to input source in a programming language allowing arbitrary
Unicode letters in identifiers, the damage (the sheer waste of
developer time) can be much more substantial -- there will be two
separate identifiers around, each looking exactly like the other but
actually distinct, and an unbounded amount of programmer time can be
spent chasing after this extremely elusive and tricky bug -- why
doesn't a rebinding appear to "take", etc. With some copy-and-paste
during development and attempts at debugging, several copies of each
distinct version of the identifier can be spread around the code,
further hampering attempts at understanding.


Alex
 

Alan Franzoni

On Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" wrote:

[cut]

I'm from Italy, and I can say that some thoughts by Martin v. Löwis
are quite right. It's pretty easy to see code that uses "English"
identifiers and comments, but they're not really English - many times,
they're just "Englishized" versions of the Italian words. They might
lure a real English reader into an error rather than help him
understand what the name really stands for. It would be better to let
programmers pick the language they prefer, without restrictions.

The patch problem doesn't seem a real issue to me, because it's the
project admin who picks the encoding, and he can easily refuse any
patch that doesn't conform to the standards he wants.

BTW, there are a couple of issues that should be solved. Even though I
could get by with ISO-8859-1, I usually pick UTF-8 as the preferred
encoding for my files, because I find it more portable and more
compatible with different editors and IDEs on various platforms (I
don't know if I just hit bugs in some specific software, but I had
problems with accented characters when switching environments from
Windows to Linux, especially when reading/writing to and from
non-native filesystems, e.g. reading files from an NTFS disk under
Linux, or reading an ext2 volume under Windows).

By the way, I would highly dislike anybody submitting a patch that
contains identifiers other than ASCII or ISO-8859-1. Hence, I think
there should be a way, a kind of directive or something like that, to
constrain the identifier charset to a 'subset' of the 'global' one.

Also, there should be a way to convert source files in any 'exotic'
encoding to a pseudo-intelligible encoding for any reader, a kind of
transliteration (is that a proper English word?) system available
out-of-the-box, not requiring any tool that's not included in the
Python distro. This would let people retain their usual working
environments even when they're dealing with source code whose
identifiers are in a really different charset.

--
Alan Franzoni <[email protected]>
-
Remove .xyz from my address in order to contact me.
-
GPG Key Fingerprint (Key ID = FE068F3E):
5C77 9DC3 BD5B 3A28 E7BC 921A 0255 42AA FE06 8F3E
 

Alexander Schmolck

Martin v. Löwis said:
PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

I believe this PEP differs from other Py3k PEPs in that it really
requires feedback from people with different cultural backgrounds
to evaluate it fully - most other PEPs are culture-neutral.

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported?
Yes.

why?

Because not everyone speaks English, not all languages can be
losslessly transliterated to ASCII, and because it's unreasonable to
drastically restrict the domain of things that can be conveniently
expressed in a language that's also targeted at a non-professional
programmer audience.

I'm also not aware of any horror stories from languages which already
allow Unicode identifiers.
- would you use them if it was possible to do so?
Possibly.

in what cases?

Maybe mathematical code (greek letters) or code that is very culture and
domain specific (say code doing Japanese tax forms).

'as
 

Anders J. Munch

Michael said:
So given that people can already transliterate their language for use as
identifiers, I think avoiding non-ASCII character sets is a good idea.

Transliteration makes people choose bad variable names; I see it all
the time with Danish programmers. Say, e.g., the most descriptive name
for a process is "kør forlæns" (run forward). But "koer_forlaens" is
ugly, so instead he'll write "run_fremad", combining an English word
with a slightly less appropriate Danish word. Sprinkle in some English
spelling errors and badly-chosen English words, and you have the sorry
state of the art that is today.

- Anders
 

Anders J. Munch

Alex said:
Homoglyphic characters _introduced by accident_ should not be
discounted as a risk, as, it seems to me, was done early in this thread
after the issue had been mentioned. In the past, I have erroneously
introduced such homoglyphs into a document I was preparing with a word
processor, through a slight error in the use of the system-provided way
of inserting characters not present on the keyboard; I found out later
when I went looking for the name I _thought_ I had input (but I was
looking for it spelled with the "right" glyph, not the one I had
actually used, which looked just the same) and just could not find it.

There are any number of things to be done about that.
1. # -*- encoding: ascii -*-
(I'd like to see you sneak those homoglyphic characters past *that*.)
2. pychecker and pylint - I'm sure you realise what they could do for you.
3. Use a font that doesn't have those characters, or one that
deliberately makes them distinct (that could help web-browsing safety
too).

I'm not discounting the problem, I just don't believe it's a big one.
Can we choose a codepoint subset that doesn't have these dupes?
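
A rough sketch of such a check - flag any identifier that mixes
alphabets, which is a decent heuristic for accidental homoglyphs (the
script list here is deliberately minimal):

import tokenize
import unicodedata

SCRIPTS = ("LATIN", "CYRILLIC", "GREEK")

def alphabets_in(name):
    # Classify each character by the script prefix of its Unicode name;
    # digits and underscores have no script prefix and are ignored.
    found = set()
    for ch in name:
        uniname = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if uniname.startswith(script):
                found.add(script)
    return found

def report_mixed_script_names(path):
    with open(path, "rb") as f:
        for tok in tokenize.tokenize(f.readline):
            if tok.type == tokenize.NAME and len(alphabets_in(tok.string)) > 1:
                print("%s:%d: suspicious identifier %r"
                      % (path, tok.start[0], tok.string))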

- Anders
 

Steven D'Aprano

Homoglyphic characters _introduced by accident_ should not be
discounted as a risk ....
But when something similar happens to somebody using a sufficiently
fancy text editor to input source in a programming language allowing
arbitrary Unicode letters in identifiers, the damage (the sheer waste
of developer time) can be much more substantial -- there will be two
separate identifiers around, each looking exactly like the other but
actually distinct, and an unbounded amount of programmer time can be
spent chasing after this extremely elusive and tricky bug -- why
doesn't a rebinding appear to "take", etc. With some copy-and-paste
during development and attempts at debugging, several copies of each
distinct version of the identifier can be spread around the code,
further hampering attempts at understanding.


How is that different from misreading "disk_burnt = True" as "disk_bumt =
True"? In the right (or perhaps wrong) font, like the ever-popular Arial,
the two can be visually indistinguishable. Or "call" versus "cal1"?

Surely the correct solution is something like pylint or pychecker? Or
banning the use of lower-case L and digit 1 in identifiers. I'm good with
both.
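
The pylint-ish check could be a short script that folds
easily-confused ASCII characters together and flags identifiers that
collide (a sketch; the fold table is illustrative):

import collections
import tokenize

# Fold characters that render alike in bad fonts onto one representative.
FOLD = str.maketrans({"1": "l", "0": "o"})

def find_lookalike_names(path):
    groups = collections.defaultdict(set)
    with open(path, "rb") as f:
        for tok in tokenize.tokenize(f.readline):
            if tok.type == tokenize.NAME:
                groups[tok.string.lower().translate(FOLD)].add(tok.string)
    # Any skeleton shared by more than one spelling is a suspect pair,
    # e.g. {"call", "cal1"}.
    return [names for names in groups.values() if len(names) > 1]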
 

Aldo Cortesi

Thus spake "Martin v. Löwis" ([email protected]):
- should non-ASCII identifiers be supported? why?

No! I believe that:

- The security implications have not been sufficiently explored. I don't
want to be in a situation where I need to mechanically "clean" code (say,
from a submitted patch) with a tool because I can't reliably verify it by
eye. We should learn from the plethora of Unicode-related security
problems that have cropped up in the last few years.
- Non-ASCII identifiers would be a barrier to code exchange. If I know
Python I should be able to easily read any piece of code written in it,
regardless of the linguistic origin of the author. If PEP 3131 is
accepted, this will no longer be the case. A Python project that uses
Urdu identifiers throughout is just as useless to me, from a
code-exchange point of view, as one written in Perl.
- Unicode is harder to work with than ASCII in ways that are more important
in code than in human-language text. Human eyes don't care if two
visually indistinguishable characters are used interchangeably.
Interpreters do. There is no doubt that people will accidentally
introduce mistakes into their code because of this.

- would you use them if it was possible to do so? in what cases?

No.

Regards,

Aldo
 

Steven D'Aprano

It certainly does apply, if you're maintaining a program and someone
submits a patch. In that case you neither click nor type the
character. You'd normally just make sure the patched program passes
the existing test suite, and examine the patch on the screen to make
sure it looks reasonable. The phishing possibilities are obvious.

Not to me, I'm afraid. Can you explain how it works? A phisher might be
able to fool a casual reader, but how does he fool the compiler into
executing the wrong code?

As for project maintainers, surely a patch using some unexpected Unicode
locale would fail the "looks reasonable" test? That could even be
automated -- if the patch uses an unexpected "#-*- coding: blah" line, or
includes characters outside of a pre-defined range, ring alarm bells.
("Why is somebody patching my Turkish module in Korean?")
 

Marc 'BlackJack' Rintsch

Michael Torrie said:
I think non-ASCII characters make the problem far, far worse. While I
may not understand what the function is by its name in your example,
allowing non-ASCII characters makes it worse by forcing all would-be
code readers to have all kinds of fonts installed just to view the
source code. The same goes for reporting exceptions. At least in your
example I know the exception occurred in zieheDreiAbVon. But if that
identifier is some UTF-8 string, how do I go about finding it in my text
editor, or even reporting the message to the developers? I don't happen
to have that particular keymap installed on my Linux system, so I can't
even type the letters!

You find it in the source by the line number from the traceback, and
the letters can be copied and pasted if you don't know how to input
them with your keymap or keyboard layout.

Ciao,
Marc 'BlackJack' Rintsch
 

Paul Rubin

Steven D'Aprano said:
Not to me, I'm afraid. Can you explain how it works? A phisher might be
able to fool a casual reader, but how does he fool the compiler into
executing the wrong code?

The compiler wouldn't execute the wrong code; it would execute the code
that the phisher intended it to execute. That might be different from
what it looked like to the reviewer.
 

Terry Reedy

On Sun, 13 May 2007 17:44:39 +0200, "Martin v. Löwis" wrote:
| Also, there should be a way to convert source files in any 'exotic'
| encoding to a pseudo-intelligible encoding for any reader, a kind of
| transliteration (is that a proper English word?) system available
| out-of-the-box, not requiring any tool that's not included in the
| Python distro. This would let people retain their usual working
| environments even when they're dealing with source code whose
| identifiers are in a really different charset.
=============================

When I proposed that PEP3131 include transliteration support, Martin
rejected the idea.

tjr
 

Top