PEP 3131: Supporting Non-ASCII Identifiers

Sion Arrowsmith · May 15, 2007

Nick Craig-Wood said:
b) Unicode characters would creep into the public interface of public
libraries. I think this would be a step back for the homogeneous
nature of the python community.

One could decree that having a non-ASCII character in an identifier
would have the same meaning as a leading underscore 9-)

Marc 'BlackJack' Rintsch · May 15, 2007

No. I've actually looked hard to find examples of source code that use
non-ASCII identifiers. While it's easy to find code where comments use
non-ASCII characters, I was never able to find a non-made up example
that used them in identifiers.

I think you have to search for examples of ASCII sources with transliterated
identifiers too, because the authors may have skipped the transliteration
if they could have written the non-ASCII characters in the first place.

And then I dare to guess that much of that code is not open source. One
example is macros in office programs like spreadsheets. Often those are
written by semi-professional programmers or even end users with
transliterated identifiers. If the OpenOffice API weren't so
"javaesque", this would be a good use case for code with non-ASCII
identifiers.

Ciao,
Marc 'BlackJack' Rintsch

Sion Arrowsmith · May 15, 2007

Stefan Behnel said:
I don't think all identifiers in the stdlib are
a) well chosen
b) correct English words

Never mind the standard library, by my count about 20% of keywords
and builtins (excluding exception types) are either not correct
English words ('elif', 'chr') or have some kind of mismatch between
their meaning and the usual English usage ('hex', 'intern').

The discussion on readability and natural language identifiers reminds
me of my first job in programming: looking after a pile of Fortran77
from the mid-80s. Case-insensitive, with different coders having
different preferences (sometimes within the same module), and using
more than four characters on an identifier was considered shocking. Of
course you got identifiers which were unintelligible, and it wasn't
a great situation, but we coped and the whole thing didn't fall over
in a complete heap.

Paul Boddie · May 15, 2007

[javac -encoding Latin-1 Hallo.java]

From a Python perspective, I would rather call this behaviour broken. Do I
really have to pass the encoding as a command line option to the compiler?

They presumably weighed up the alternatives and decided that the most
convenient approach (albeit for the developers of Java) was to provide
such a compiler option. Meanwhile, developers get to write their
identifiers in the magic platform encoding, which isn't generally a
great idea but probably works well enough for some people - their
editor lets them write their programs in some writing system and the
Java compiler happens to choose the same writing system when reading
the file - although I wouldn't want to rely on such things myself.
Alternatively, they can do what Python programmers do now and specify
the encoding, albeit on the command line.
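
For comparison, the Python approach of declaring the encoding inside the file can be sketched as below. This is only an illustration: the source text, the `<latin1-demo>` file name, and the variable `name` are invented here, and the bytes are built in memory rather than read from a real file.

```python
# A tiny Python source "file" held in memory, encoded as Latin-1.
# The first line is a PEP 263 style encoding declaration.
source_bytes = (
    "# -*- coding: latin-1 -*-\n"
    "name = 'L\u00f6wis'\n"
).encode("latin-1")

# compile() accepts bytes and honours the declaration when decoding them.
namespace = {}
exec(compile(source_bytes, "<latin1-demo>", "exec"), namespace)

print(namespace["name"])
```

The encoding travels with the file, so nobody has to remember a compiler flag when the code is shared.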

However, what I want to see is how people deal with such issues when
sharing their code: what are their experiences and what measures do
they mandate to make it all work properly? You can see some
discussions about various IDEs mandating UTF-8 as the default
encoding, along with UTF-8 being the required encoding for various
kinds of special Java configuration files. Is this because
heterogeneous technical environments even within the same cultural
environment cause too many problems?

I find Python's source encoding much cleaner here, and even more so when the
default encoding becomes UTF-8.

Yes, it should reduce confusion at a technical level. But what about
the tools, the editors, and so on? If every computing environment had
decent UTF-8 support, wouldn't it be easier to say that everything has
to be in UTF-8? Perhaps the developers of Java decided that the rules
should be deliberately vague to accommodate people who don't want to
think about encodings but still want to be able to use Windows Notepad
(or whatever) to write software in their own writing system.

And then, what about patterns of collaboration between groups who have
been able to exchange software with "localised" identifiers for a
number of years? Does it really happen, or do IBM's engineers in China
or India (for example) have to write everything strictly in ASCII? Do
people struggle with characters they don't understand or does copy/
paste work well enough when dealing with such code?

Paul

Javier Bezos · May 15, 2007

This is a very weak argument, IMHO. How do you want to use Python
without learning at least enough English to grasp a somewhat decent
understanding of the standard library?

By heart. I know a few _very good_ programmers
who are unable to understand an English text.
Knowing English helps, of course, but is not
required at all. Of course, they don't know how
to name identifiers in English, but it happens
they _cannot_ give them proper Spanish names,
either (I'm from Spain).

+1 for the PEP, definitely.

But having, for example, things like open() from the stdlib in your code
and then öffnen() as a name for functions/methods written by yourself is
just plain silly. It makes the code inconsistent and ugly without
significantly improving the readability for someone who speaks German
but not English.

Agreed. I always use English names (more or
less), but this is not what the PEP is about.

Javier

Ross Ridge · May 15, 2007

Marc 'BlackJack' Rintsch said:
I think you have to search for examples of ASCII sources with transliterated
identifiers too, because the authors may have skipped the transliteration
if they could have written the non-ASCII characters in the first place.

The point of my search was to look for code that actually used non-ASCII
characters in languages that actually supported it (mainly Java at the
time). The point wasn't to create more speculation about what programmers
might or might not do, but to find out what they were actually doing.

And then I dare to guess that much of that code is not open source.

Lots of non-open source code makes it on to the Internet in the form of
code snippets. You don't have to guess what closed-source programmers are
actually doing either.

Ross Ridge

Stefan Behnel · May 15, 2007

Paul said:
Does it really happen, or do IBM's engineers in China
or India (for example) have to write everything strictly in ASCII?

I assume they simply use American keyboards. They're working for an American
company after all. That's like the call-center people in India who learn the
Super Bowl results by heart before they go to their cubicle, just to keep
North American callers from realising they are not connected to the shop
around the corner.

Stefan

Carsten Haese · May 15, 2007

This is a very weak argument, IMHO. How do you want to use Python
without learning at least enough English to grasp a somewhat decent
understanding of the standard library? Let's face it: To do any "real"
programming, you need to know at least some English today, and I don't
see that changing anytime soon. And it is definitely not going to be
changed by allowing non-ASCII identifiers.

Even if it were impossible to do "real programming" in Python without
knowing English (which I will neither accept nor reject because I don't
have enough data either way), I don't think Python should be restricted
to "real" programming only. Python (the programming language) is an
inherently easy-to-learn language. I find it quite plausible that
somebody in China might want to teach their students programming before
teaching them English. The posts on this thread by a teacher from China
confirm this suspicion.

Once the students learn Python and realize that there are lots of Python
resources "out there" that are only in English, that will be a
motivation for them to learn English. Requiring all potential Python
programmers to learn English first (or assuming that they know English
already) is an unacceptable barrier of entry.

Duncan Booth · May 15, 2007

Donn Cave said:
[Spanish in Brazil? Not as much as you might think.]

Sorry, temporary[*] brain failure, I really do know it is Portuguese.

[*] I hope.

John Nagle · May 15, 2007

There are really two issues here, and they're being
confused.

One is allowing non-English identifiers, which is a political
issue. The other is homoglyphs, two characters which look the same.
The latter is a real problem in a language like Python with implicit
declarations. If a maintenance programmer sees a variable name
and retypes it, they may silently create a new variable.

If Unicode characters are allowed, they must be done under some
profile restrictive enough to prohibit homoglyphs. I'm not sure
if UTS-39, profile 2, "Highly Restrictive", solves this problem,
but it's a step in the right direction. This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

We have to have visually unique identifiers.
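
The retyping hazard described above can be sketched as follows. This assumes an interpreter that accepts non-ASCII identifiers (as Python 3 does); the names and values are made up for illustration, and the Cyrillic letter is written as an escape so the difference is visible.

```python
# Latin "a" is U+0061; Cyrillic "а" is U+0430. They usually render identically.
latin_name = "pay"
cyrillic_name = "p\u0430y"

namespace = {}
exec(f"{latin_name} = 100", namespace)
# A maintainer "retypes" the name with the look-alike letter:
exec(f"{cyrillic_name} = 200", namespace)  # no error, a second variable appears

# Collect the names that were actually bound (skipping __builtins__).
user_names = sorted(k for k in namespace if not k.startswith("__"))
print(len(user_names), [ascii(k) for k in user_names])
```

Both assignments succeed, and the original variable keeps its old value.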

There's also an issue with implementations that interface
with other languages. Some Python implementations generate
C, Java, or LISP code. Even CPython will call C code.
The representation of external symbols needs to be standardized
across those interfaces.

John Nagle

Stefan Behnel · May 15, 2007

John said:
There are really two issues here, and they're being
confused.

One is allowing non-English identifiers, which is a political
issue. The other is homoglyphs, two characters which look the same.
The latter is a real problem in a language like Python with implicit
declarations. If a maintenance programmer sees a variable name
and retypes it, they may silently create a new variable.

If Unicode characters are allowed, they must be done under some
profile restrictive enough to prohibit homoglyphs. I'm not sure
if UTS-39, profile 2, "Highly Restrictive", solves this problem,
but it's a step in the right direction. This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

We have to have visually unique identifiers.

As others stated before, this is unlikely to become a problem in practice.
Project-internal standards will usually define a specific language for a
project, in which case these issues will not arise. In general, programmers
from a specific language/script background will stick to that script and not
magically start typing foreign characters. And projects where multiple
languages are involved will have to define a target language anyway, most
likely (although not necessarily) English.

Note that adherence to a specific script can easily be checked programmatically
through Unicode ranges - if the need ever arises.
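
A minimal sketch of such a range check, under stated assumptions: the `SCRIPT_RANGES` table is a hypothetical and deliberately incomplete approximation of three scripts, and the function names are invented here. A real project would need a fuller table.

```python
# Hypothetical, incomplete table: script name -> list of (first, last) code points.
SCRIPT_RANGES = {
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
    "Cyrillic": [(0x0400, 0x04FF)],
    "Hebrew": [(0x0590, 0x05FF)],
}


def scripts_in(identifier):
    """Return the set of script names whose ranges cover the identifier's letters."""
    found = set()
    for ch in identifier:
        if ch.isdigit() or ch == "_":
            continue  # digits and underscore are treated as script-neutral
        code_point = ord(ch)
        for script, ranges in SCRIPT_RANGES.items():
            if any(first <= code_point <= last for first, last in ranges):
                found.add(script)
                break
        else:
            found.add("Unknown")  # not covered by the table above
    return found


def is_single_script(identifier):
    """True if every letter of the identifier falls into one script."""
    return len(scripts_in(identifier)) <= 1


print(scripts_in("t\u00e9l\u00e9phone"))   # accented Latin letters only
print(is_single_script("p\u0430y"))         # Latin mixed with a Cyrillic letter
```

A check like this could run as a commit hook or lint step for projects that settle on one script.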

Stefan

Thorsten Kampe · May 15, 2007

* Duncan Booth (15 May 2007 17:30:58 GMT)

Donn Cave said:
[Spanish in Brazil? Not as much as you might think.]

Sorry, temporary[*] brain failure, I really do know it is Portuguese.

Yes, you do. Spanish is what's been used in the United States, right?

Thorsten

Pierre Hanser · May 15, 2007

René Fleschenberg wrote:

IMO, the burden of proof is on you. If this PEP has the potential to
introduce another hindrance for code-sharing, the supporters of this PEP
should be required to provide a "damn good reason" for doing so. So far,
you have failed to do that, in my opinion. All you have presented are
vague notions of rare and isolated use-cases.

you want to limit my liberty of using appealing names in my language.

this alone should be enough to accept the pep!

Pierre Hanser · May 15, 2007

René Fleschenberg wrote:

Your example does not prove much. The fact that some people use
non-ASCII identifiers when they can does not at all prove that it would
be a serious problem for them if they could not.

i have to make spelling mistakes in my code to please you?

Pierre Hanser · May 15, 2007

hello

i work for a large phone maker, and for a long time
we thought, very arrogantly, our phones would be ok
for the whole world.

After all, using a phone takes so few words, and
some of them were even replaced with pictograms!
everybody should be able to understand appel, bis,
renvoi, mévo, ...

nowadays we make Chinese, Korean, Japanese talking
phones.

because we can do it, because graphics are cheaper
than they were, because it expands our market.
(also because some markets require it)

see the analogy?

of course, +1 for the pep

rurpy · May 15, 2007

<[email protected]> wrote:

[I fixed the broken attribution in your quote]

I dispute the irrelevance strongly - I am one of the group referred
to, and I am here on this group because it works for me - I am not
aware of an Afrikaans python group - but even if one were to
exist - who, aside from myself, would frequent it? - would I have
access to the likes of the effbot, Steve Holden, Alex Martelli,
Irmen de Jongh, Eric Brunel, Tim Golden, John Machin, Martin
v Loewis, the timbot and the Nicks, the Pauls and other Stevens?

I didn't say that your (as a fluent but non-native English
speaker) views are irrelevant, only that when you say,
"I am a native speaker of Afrikaans and I don't want non-
ascii identifiers" it shouldn't carry any more weight
than if I (as a native English speaker) say the same
thing. (But I wouldn't, of course.)

My point was that this entire discussion is by English
speakers and that a consensus by such a group, that
non-English identifiers are bad, is neither surprising nor
legitimate.

- I somehow doubt it.

Fragmenting this resource into little national groups based
on language would be silly, if not downright stupid, and it seems
to me just as silly to allow native identifiers without also
allowing native reserved words, because you are just creating
a mess that is neither fish nor flesh if you do.

It already is fragmented. There is a Japanese Python users
group, complete with discussion forums, all in Japanese, not
English. Another poster said he was going to bring up this
issue on a French language discussion group.

How can you possibly propose that some authority should
decide what language a group of people should use to
discuss a common interest?!

And the downside to going the whole hog would be as follows:

Nobody would even want to look at my code if I write
"terwyl" instead of 'while', and "werknemer" instead of
"employee" - so where am I going to get help, and how,
once I am fully Python fit, can I contribute if I insist on
writing in a splinter language?

First "while" is a keyword and will remain "while" so
that has nothing to do with anything.

If nobody wants to look at your code, it is not
the use of "werknemer" that is the cause. If you used
that as an identifier then I assume you decided
your code was exclusively of interest to Afrikaans
speakers. Otherwise you would have used English
for that identifier. The point is that *you*
are in the best position to decide that, not the
designers of the language.

And while the Mandarin language group could be big enough
to be self sustaining, is that true of for example Finnish?

So I don't think my opinion on this is irrelevant just because
I misspent my youth reading books by Pelham Grenville
Wodehouse, amongst others.

And I also don't regard my own position as particularly unique
amongst python programmers that don't speak English as
their native language

Like I said, that English is not your native language is
irrelevant -- what matters is that you now speak English
fluently. Thus you are an English speaker arguing that
excluding non-English identifiers is not a problem.

Michel Claveau · May 15, 2007

Hi!

Yes, for legibility.

If letters with accents are possible:

d=dict(numéro=1234, name='Löwis', prénom='Martin', téléphone='+33123')

or

p1 = personn() #class
p1.numéro = 1234
p1.name='Löwis'
p1.prénom='Martin'
p1.téléphone='+33123'

Imagine the same code, if accents are not possible...

Don't forget: we must often be connected to databases that already
exist.

Guest · May 15, 2007

Javier said:
Agreed. I always use English names (more or
less), but this is not what the PEP is about.

We all know what the PEP is about (we can read). The point is: If we do
not *need* non-English/ASCII identifiers, we do not need the PEP. If the
PEP does not solve an actual *problem* and still introduces some
potential for *new* problems, it should be rejected. So far, the
"problem" seems to just not exist. The burden of proof is on those who
support the PEP.

rurpy · May 15, 2007

After 175 replies (and counting), the only thing that is clear is the
controversy around this PEP. Most people are very strong for or
against it, with little middle ground in between. I'm not saying that
every change must meet 100% acceptance, but here there is definitely a
strong opposition to it. Accepting this PEP would upset lots of people
as it seems, and it's interesting that quite a few are not even native
English speakers.

As I pointed out in a previous post,
http://groups.google.com/group/comp.lang.python/msg/c911d6d249d327a6?hl=en&
http://groups.google.com/group/comp.lang.python/msg/363fe742925623c4?hl=en&
whether a person is or is not a native English speaker is
irrelevant -- what is relevant is their current ability with
English.
And my impression is that nearly all of the posts from people not
fluent in English (judging from grammar mistakes and such)
are in favor of the PEP.

rurpy · May 15, 2007

Carsten Haese said:
Once the students learn Python and realize that there are lots of Python
resources "out there" that are only in English, that will be a
motivation for them to learn English. Requiring all potential Python
programmers to learn English first (or assuming that they know English
already) is an unacceptable barrier of entry.

One of the big concerns seems to be a hypothesized
negative impact on code sharing. Nobody has considered
the positive impact resulting from making Python
more accessible to non-English speakers, some of
whom will go on to become willing and able to contribute
open "English python" code to the community. This
positive impact may well outweigh the negative.
