PEP 3131: Supporting Non-ASCII Identifiers

George Sakkis · May 15, 2007

After 175 replies (and counting), the only thing that is clear is the
controversy around this PEP. Most people are very strong for or
against it, with little middle ground in between. I'm not saying that
every change must meet 100% acceptance, but here there is definitely a
strong opposition to it. Accepting this PEP would upset lots of people
as it seems, and it's interesting that quite a few are not even native
English speakers.

George

HYRY · May 15, 2007

The other thing is trying to teach them formal operational logic when
they are not yet ready for it. In that case it would be better to wait
until they are ready, but unfortunately there are large variations in
the age at which children become ready. Please do not confuse the two
very different matters of language acquisition and formal operational
logic. Language is learned at an early age while formal logic starts at
about the age of eleven (but with very large variation among children).

I think programming languages such as Logo, or projects such as RUR-PLE,
are used to teach children programming. So I think there is no
problem teaching programming to a 12-year-old child. The real
problem is that they must remember many English words before they get to
programming logic, and that is no fun at all.

Why not use IronPython? But anyway you are actually making things worse
by *not* teaching them the language now that they will need later on and
by *teaching* them formal operational logic at an age when they just get
disappointed and frustrated by not yet being able to understand it.
Better go easy on them and teach them lots of English computing terms
and only introduce logic when they show they are ready.

IronPython is wonderful. I will search for an easy and powerful IDE
for it and switch to IronPython. Learning English is another
subject; my objective is to teach them some basic programming logic, not
English, not even Python. I don't think English is a necessary
condition for formal operational logic.

Stefan Behnel · May 15, 2007

George said:
After 175 replies (and counting), the only thing that is clear is the
controversy around this PEP. Most people are very strong for or
against it, with little middle ground in between. I'm not saying that
every change must meet 100% acceptance, but here there is definitely a
strong opposition to it. Accepting this PEP would upset lots of people
as it seems, and it's interesting that quite a few are not even native
English speakers.

But the positions are clear, I think.

Open-Source people are against it, as they expect hassle with people sending
in code or code being lost as it can't go public as-is.

Teachers are for it as they see the advantage of having children express
concepts in their native language.

In-house developers are rather for this PEP as they see the advantage of
expressing concepts in the way the "non-techies" talk about it.

That's about all I could extract as arguments.

To me, this sounds pretty much like something people and projects could handle
on their own once the PEP is accepted.

Stefan

Thorsten Kampe · May 15, 2007

* René Fleschenberg (Tue, 15 May 2007 15:34:26 +0200)

That would be well outside the scope of this newsgroup, and if you
cannot see the reasons for this yourself, I am afraid that I won't be
able to convince you anyway.

You could actually try by giving some arguments for your opinion. Your
rationale was "English only, please" because of "code sharing".

That completely depends on how you look at code-sharing. My impression
always was that the Python community in general does regard code-sharing
as A Good Thing.

I don't think the "Python community" does that because for something
to be considered good it should actually be clear what it means. So
what actually is "Code sharing"?! Wikipedia seems to know this term
but in a slightly different meaning:

http://en.wikipedia.org/wiki/Code_sharing

It is not as if we were talking about forcing people to
share code. Just about creating/keeping an environment that makes this
easily possible and encourages it.

If the "Python community" would think that "code sharing" (whatever
that means) is per se a good thing it would switch to spaces only
allowed (instead of tabs and spaces allowed). Actually it would
refrain from giving indentation and white space a syntactical meaning
because this undoubtedly makes "code sharing" (on web pages or through
news readers for instance) /a lot/ more difficult.

Thorsten

Paul Boddie · May 15, 2007

But the positions are clear, I think.

Amongst the small group of people responsible for pumping out almost
200 messages on the subject.

Open-Source people are against it, as they expect hassle with people sending
in code or code being lost as it can't go public as-is.

Amongst the small sample here, perhaps that's true. I'm more a Free
Software person than an open source person and I can perfectly well
see the benefits in having identifiers in a broader range of
characters than just the subset of ASCII currently permitted. That's
because I can separate the issues of being able to express concepts in
one's own writing system and being able to share work with other
people, familiar or otherwise with that writing system.

Teachers are for it as they see the advantage of having children express
concepts in their native language.

Yes, because it allows them to concentrate on fewer "new things"
simultaneously.

In-house developers are rather for this PEP as they see the advantage of
expressing concepts in the way the "non-techies" talk about it.

Yes, but this point can be stretched too far. I've worked in
environments with English plus another language in use, as well as
just a non-English language in use, and in all of them there's been a
tendency to introduce English or English-like terms into systems,
often to the detriment of dedicated, officially recommended non-
English terms. But I can see the potential benefit of just letting
people get on with it - again, it's possible to separate the social
issues from the technical ones.

That's about all I could extract as arguments.

From a relatively small group of people where an even smaller group of
participants seem intent on amplifying their arguments on the subject.

To me, this sounds pretty much like something people and projects could handle
on their own once the PEP is accepted.

Yes, of course. But what I'd like to see, for a change, is some kind
of analysis of the prior art in connection with this matter. Java has
had extensive UTF-8 support all over the place for ages, but either no-
one here has any direct experience with the consequences of this
support, or they are more interested in arguing about it as if it were
a hypothetical situation when it is, in fact, a real-life situation
that can presumably be observed and measured.

Paul

Eric Brunel · May 15, 2007

But the positions are clear, I think.

Open-Source people are against it, as they expect hassle with people sending
in code or code being lost as it can't go public as-is.

Teachers are for it as they see the advantage of having children express
concepts in their native language.

In-house developers are rather for this PEP as they see the advantage of
expressing concepts in the way the "non-techies" talk about it.

No: I *am* an "in-house" developer. The argument is not public/open-source
against private/industrial. As I said in some of my earlier posts, any
code can pass through many people in its life, people not having the same
language. I dare to say that starting a project today in any other
language than English is almost irresponsible: the chances that it will
get at least read by people not talking the same language as the original
coders are very close to 100%, even if it always stays "private".

Guest · May 15, 2007

Thorsten said:
You could actually try by giving some arguments for your opinion. Your
rationale was "English only, please" because of "code sharing".

I thought this was pretty clear. The more people can easily read code,
the higher the probability that it will be useful for them and that they
can make it more useful for others.

I don't think the "Python community" does that because for something
to be considered good it should actually be clear what it means. So
what actually is "Code sharing"?! Wikipedia seems to know this term
but in a slightly different meaning:

http://en.wikipedia.org/wiki/Code_sharing

I think we all know what "code" in the context of Python means. As for
"to share", I think that is pretty clear, also. It means that people
other than the original authors use, study and modify the code. FWIW, a
Google search for 'Python code-sharing' yields almost 30k results.

If the "Python community" would think that "code sharing" (whatever
that means) is per se a good thing it would switch to spaces only
allowed (instead of tabs and spaces allowed).

There is PEP 8 that encourages the use of spaces only. Personally, I
would not have much of a problem with tabs being banned for indentation.
Maybe the main reason why that is not actually done is compatibility
with legacy code -- I don't know.

Actually it would
refrain from giving indentation and white space a syntactical meaning
because this undoubtedly makes "code sharing" (on web pages or through
news readers for instance) /a lot/ more difficult.

Where did anyone say that code-sharing is the only concern? I just think
that it is undoubtedly one among others. So for something that harms
this concern to be done, there should be substantial benefits in it. I
fail to see them with non-ASCII identifiers so far.

Stefan Behnel · May 15, 2007

Paul said:
what I'd like to see, for a change, is some kind
of analysis of the prior art in connection with this matter. Java has
had extensive UTF-8 support all over the place for ages, but either no-
one here has any direct experience with the consequences of this
support, or they are more interested in arguing about it as if it were
a hypothetical situation when it is, in fact, a real-life situation
that can presumably be observed and measured.

It's difficult to extract this analysis from Java. Most people I know from the
Java world do not use this feature as it is error prone. Java does not have
support for *explicit* source encodings, i.e. the local environment settings
win. This is bound to fail e.g. on a latin-1 system where I would like to work
with UTF-8 files (which tend to work better on the Unix build server, etc.)

In the Python world, these problems are solved now and will disappear when
UTF-8 becomes the default encoding (note that this does not inverse the
problem as people using non-utf8 encodings will then just set the respective
encoding tag in their files). So there is not much Python can learn from Java
here except for what it already does better.
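
A minimal sketch of the explicit encoding tag mentioned above, assuming
Python 3 and the standard tokenize module. The sample source bytes and the
helper name declared_encoding are made up for illustration.

```python
import io
import tokenize

# Hypothetical source file saved as Latin-1 with an explicit encoding tag.
latin1_source = "# -*- coding: latin-1 -*-\ngruss = 'Grüße'\n".encode("latin-1")

# The same kind of file without a tag, saved as UTF-8.
untagged_source = "gruss = 'Grüße'\n".encode("utf-8")


def declared_encoding(source_bytes):
    """Return the encoding Python would use to decode this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


tagged = declared_encoding(latin1_source)
untagged = declared_encoding(untagged_source)

# Decoding with the detected encoding recovers the intended text,
# independent of the local environment settings.
assert "Grüße" in latin1_source.decode(tagged)
assert "Grüße" in untagged_source.decode(untagged)
print(tagged, untagged)
```

The tag travels with the file, so nothing has to be passed on the command
line when the file moves between machines.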

I am actually working on a couple of Java projects that use German
identifiers, transliterated to prevent the encoding problems inherent to Java.
The transliteration makes things harder to read than necessary - and this is
only German-vs-English, i.e. simple things like 'ae' instead of 'ä' and 'ss'
instead of 'ß'. But sometimes things become hard to read that way or look like
different words. And it leads to all sorts of weirdly mixed names as sometimes
it is easier to write the similar looking (although maybe not completely
synonymous) English word instead of the transliterated German one.

So, yes, in a way, the code quality in these projects suffers from developers
not being able to freely write Unicode identifiers.
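
To illustrate how transliteration can make different words look alike,
here is a small sketch in Python. The mapping table is a simplified
assumption, and the word pair is chosen only as an example.

```python
# Common ASCII transliteration of German letters (an assumed, simplified table).
TRANSLITERATION = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def transliterate(name):
    """Replace German non-ASCII letters with their usual ASCII spelling."""
    return "".join(TRANSLITERATION.get(ch, ch) for ch in name)


# "Maße" (dimensions) and "Masse" (mass) are different words,
# but they collapse to the same identifier once transliterated.
dimensions = transliterate("Maße")
mass = transliterate("Masse")
print(dimensions, mass, dimensions == mass)
```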

Stefan

Paul Boddie · May 15, 2007

It's difficult to extract this analysis from Java. Most people I know from the
Java world do not use this feature as it is error prone. Java does not have
support for *explicit* source encodings, i.e. the local environment settings
win. This is bound to fail e.g. on a latin-1 system where I would like to work
with UTF-8 files (which tend to work better on the Unix build server, etc.)

Here's a useful link on this topic:

http://www.jorendorff.com/articles/unicode/java.html

A search for "Java source file encoding" on Google provides other
material.

Paul

Marco Colombo · May 15, 2007

After 175 replies (and counting), the only thing that is clear is the
controversy around this PEP. Most people are very strong for or
against it, with little middle ground in between. I'm not saying that
every change must meet 100% acceptance, but here there is definitely a
strong opposition to it. Accepting this PEP would upset lots of people
as it seems, and it's interesting that quite a few are not even native
English speakers.

George

I see very few people against this PEP.

Most objections are against:

1) the use of non-English words for identifiers;
2) embedding non-ASCII characters in source files (PEP263);
3) writing unreadable code (for English-speaking readers).

None of the above is covered by this PEP.

Let's face it: identifiers are just a small part of the information
conveyed by a program source. All the rest can *already* be totally
unreadable to an English speaker, or even undisplayable on his
monitor. There's no real reason to force ASCII-only identifiers UNLESS
we also force ASCII-only programs.

I doubt any program containing Chinese comments, with Chinese
characters, (which we allow), Chinese strings (which we allow),
identifiers that are Chinese words, written with ASCII characters,
(which we allow) is made any LESS readable by writing those
identifiers with Chinese characters too. But it *is* more readable to
someone speaking Chinese! For sure it's easier for them to read their
words with their own glyphs instead of being forced to spell them with
a foreign alphabet.
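
A short sketch of the comparison, assuming an interpreter that implements
PEP 3131 (Python 3). The variable names are invented for illustration.

```python
# Chinese words spelled with ASCII letters (pinyin), which is already allowed.
jiage = 12.5      # price
shuliang = 4      # quantity
zongjia = jiage * shuliang

# The same names written with Chinese characters, as PEP 3131 would permit.
价格 = 12.5
数量 = 4
总价 = 价格 * 数量

print(zongjia, 总价)
```

Both halves compute the same value; only the spelling of the names differs.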

..TM.

Hendrik van Rooyen · May 15, 2007

(2) Several posters have claimed non-native English speaker
status to bolster their position, but since they are clearly at
or near native-speaker levels of fluency, that English is not
their native language is really irrelevant.

I dispute the irrelevance strongly - I am one of the group referred
to, and I am here on this group because it works for me - I am not
aware of an Afrikaans python group - but even if one were to
exist - who, aside from myself, would frequent it? - would I have
access to the likes of the effbot, Steve Holden, Alex Martelli,
Irmen de Jongh, Eric Brunel, Tim Golden, John Machin, Martin
v Loewis, the timbot and the Nicks, the Pauls and other Stevens?

- I somehow doubt it.

Fragmenting this resource into little national groups based
on language would be silly, if not downright stupid, and it seems
to me just as silly to allow native identifiers without also
allowing native reserved words, because you are just creating
a mess that is neither fish nor flesh if you do.

And the downside to going the whole hog would be as follows:

Nobody would even want to look at my code if I write
"terwyl" instead of 'while', and "werknemer" instead of
"employee" - so where am I going to get help, and how,
once I am fully Python fit, can I contribute if I insist on
writing in a splinter language?

And while the Mandarin language group could be big enough
to be self sustaining, is that true of for example Finnish?

So I don't think my opinion on this is irrelevant just because
I misspent my youth reading books by Pelham Grenfell
Wodehouse, amongst others.

And I also don't regard my own position as particularly unique
amongst python programmers that don't speak English as
their native language.

- Hendrik

Christophe · May 15, 2007

Stefan Behnel wrote:

But the positions are clear, I think.

Open-Source people are against it, as they expect hassle with people sending
in code or code being lost as it can't go public as-is.

I'm an Open-Source guy too but I'm for that proposal. Anyway, this is a
bad argument, as was shown already. When accepting code in a project, you
already HAVE to make some rules as to how the code is written, and 99% of
the time those rules include "All identifiers in correct English (or
rather, American) and all comments in English". A patch containing
identifiers you cannot read is simple enough to reject.

Carsten Haese · May 15, 2007

After 175 replies (and counting), the only thing that is clear is the
controversy around this PEP. Most people are very strong for or
against it, with little middle ground in between. I'm not saying that
every change must meet 100% acceptance, but here there is definitely a
strong opposition to it. Accepting this PEP would upset lots of people
as it seems, and it's interesting that quite a few are not even native
English speakers.

While it is true that many of the PEP's opponents here on c.l.p are not
native English speakers, they are largely European or "Europeanized"
people. Many European people don't seem to realize that there are
entirely different cultures where people speak entirely different
languages written in entirely different writing systems, and English is
not widely spoken in many of those cultures. China comes to mind as but
one example of many.

Allowing people to use identifiers in their native language would
definitely be an advantage for people from such cultures. That's the use
case for this PEP. It's easy for Euro-centric people to say "just suck
it up and use ASCII", but the same people would probably starve to death
if they were suddenly teleported from Somewhere In Europe to rural China
which is so unimaginably different from what they know that it might
just as well be a different planet. "Learn English and use ASCII" is not
generally feasible advice in such cultures.

In my opinion, the principles of freedom and fostering global adoption
of Python require that this PEP be accepted.

The objections for reasons of reduced code readability are valid, but I
think the open source community is good enough at regulating itself and
the community will predominantly *make the choice* to stick to
ASCII-only identifiers.

For programmers that are afraid of accidentally allowing non-ASCII
identifiers from patches, we might want to explore the possibility of
having Python's behavior switchable between ASCII and non-ASCII
identifiers, either by a compiler setting, environment variable, a "from
__future__" import, or similar mechanism.
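
One way such a switch could look, sketched as an external checker rather
than an interpreter feature. The environment variable name
ALLOW_NON_ASCII_IDENTIFIERS and both function names are assumptions made
for this example; no such setting exists in Python. It relies on
str.isascii(), so it assumes Python 3.7 or later.

```python
import ast
import os

# Hypothetical switch; the name is an assumption, not an existing Python setting.
SWITCH = "ALLOW_NON_ASCII_IDENTIFIERS"


def non_ascii_identifiers(source):
    """Collect identifiers in the source that contain non-ASCII characters."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        # Different node types store their identifier under different attributes.
        for attr in ("id", "name", "arg", "attr"):
            value = getattr(node, attr, None)
            if isinstance(value, str) and not value.isascii():
                found.add(value)
    return sorted(found)


def check(source, environ=os.environ):
    """Return offending identifiers unless the switch allows them."""
    if environ.get(SWITCH) == "1":
        return []
    return non_ascii_identifiers(source)


sample = "def öffnen(pfad):\n    größe = len(pfad)\n    return größe\n"
print(check(sample, environ={}))
print(check(sample, environ={SWITCH: "1"}))
```

Comments and string literals are not inspected, only names, which matches
the concern about identifiers arriving through patches.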

Regards,

Stefan Behnel · May 15, 2007

Eric said:
No: I *am* an "in-house" developer. The argument is not
public/open-source against private/industrial. As I said in some of my
earlier posts, any code can pass through many people in its life, people
not having the same language. I dare to say that starting a project
today in any other language than English is almost irresponsible: the
chances that it will get at least read by people not talking the same
language as the original coders are very close to 100%, even if it
always stays "private".

Ok, so I'm an Open-Source guy who happens to work in-house. And I'm a
supporter of PEP 3131. I admit that I was simplifying in my round-up.

But I would say that "irresponsible" is a pretty self-centered word in this
context. Can't you imagine that those who take the "irresponsible" decisions
of working on (and starting) projects in "another language than English" are
maybe as responsible as you are when you take the decision of starting a
project in English, but in a different context? It all depends on the specific
constraints of the project, i.e. environment, developer skills, domain, ...

The more complex an application domain, the more important is clear and
correct domain terminology. And software developers just don't have that. They
know their own domain (software development with all those concepts, languages
and keywords), but there is a reason why they develop software for those who
know the complex professional domain in detail but do not know how to develop
software. And it's a good idea to name things in a way that is consistent with
those who know the professional domain.

That's why keywords are taken from the domain of software development and
identifiers are taken (mostly) from the application domain. And that's why I
support PEP 3131.

Stefan

Chris Cioffi · May 15, 2007

+1 for the pep

There are plenty of ways projects can enforce ASCII only if they are
worried about "contamination" and since Python supports file encoding
anyway, this seems like a fairly minor change.

Pre-commit scripts can keep weird encoding out of existing projects
and everything else can be based on standards agreed on per project.
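
A minimal sketch of what such a pre-commit check might do, assuming the
project's rule is "ASCII bytes only". The function names are made up, and
wiring it into a particular version control system is left out.

```python
def non_ascii_lines(data):
    """Return 1-based numbers of lines in the byte string with non-ASCII bytes."""
    return [
        number
        for number, line in enumerate(data.splitlines(), start=1)
        if any(byte > 0x7F for byte in line)
    ]


def check_files(paths):
    """Return 1 if any listed file contains non-ASCII bytes, else 0."""
    status = 0
    for path in paths:
        with open(path, "rb") as handle:
            bad = non_ascii_lines(handle.read())
        if bad:
            print(f"{path}: non-ASCII bytes on lines {bad}")
            status = 1
    return status


# In-memory samples standing in for files about to be committed.
clean = b"total = price * quantity\n"
mixed = "total = 1\ngröße = 2\n".encode("utf-8")
print(non_ascii_lines(clean), non_ascii_lines(mixed))
```

A hook would pass the changed file names to check_files and use the
returned value as its exit status.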

For those who complain that they can't read the weird characters, for
any reason, maybe you aren't meant to read that stuff?

There may be some fragmentation and duplication of effort (A Hindi
module X, a Mandarin module X and the English module X) but that seems
a small price to pay for letting Python fulfill its purpose: letting
people be expressive and get the job done. People are usually more
expressive in their native languages, and thinking in different
languages may even expose alternative ways of doing things to the
greater Python community.

Chris

Stefan Behnel · May 15, 2007

Paul said:
Here's a useful link on this topic:

http://www.jorendorff.com/articles/unicode/java.html

This is what I meant (quote from your link):

"""
When you compile this program with the command javac Hallo.java, the compiler
does not know the encoding of the source file. Therefore it uses your
platform's default encoding. You might wish to tell javac which encoding to
use explicitly, instead. Use the -encoding option to do this: javac -encoding
Latin-1 Hallo.java . If you do not specify the right encoding, javac will be
confused and may or may not generate a lot of syntax errors as a result.
"""

From a Python perspective, I would rather call this behaviour broken. Do I
really have to pass the encoding as a command line option to the compiler?

I find Python's source encoding much cleaner here, and even more so when the
default encoding becomes UTF-8.

Stefan

Ross Ridge · May 15, 2007

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

Ross said:
I think the biggest argument against this PEP is how little similar
features are used in other languages

Carsten Haese said:
That observation is biased by your limited sample.

No. I've actually looked hard to find examples of source code that use
non-ASCII identifiers. While it's easy to find code where comments use
non-ASCII characters, I was never able to find a non-made up example
that used them in identifiers.

You only see open source code that chooses to restrict itself to ASCII
and mostly English identifiers to allow for easier code sharing. There
could be millions of kids in China learning C# in native Mandarin and
you'd never know about it.

No, there's tons of code written by kids learning programming languages
from all over the world scattered all over the Internet.

Regardless of what my observations are, I think that you need a better
argument than a bunch of children in China that may very well not exist.
This PEP, like similar features in other languages, is being made and
advocated by people who don't actually want to use it. It's made on
the presumption that somehow developers in China and other places its
proponents aren't familiar with desperately want this feature but are
somehow incapable of advocating for it themselves, let alone implementing
it.

The burden of proof should be on this PEP's proponents to show that it
will actually be used. Is this PEP even justified by anyone going
to the trouble of asking for it to be implemented in the first place?

How would a choice of identifiers interact in any way with Python's
standard or third-party libraries? The only things that get passed
between an application and the libraries are objects that neither know
nor care what identifiers, if any, are attached to them.

A number of libraries and utilities, including those included with the
standard Python distribution, work with Python identifiers. The PEP gives
one example, but it doesn't really give much thought as to how much of
the standard library might be affected.
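
As a sketch of the kind of code that would be affected: a tool that tests
for identifiers with a hard-coded ASCII pattern, compared with asking the
interpreter. This assumes Python 3, where str.isidentifier() exists; the
two function names are invented for the example.

```python
import keyword
import re

# A pattern a tool might hard-code for "a Python identifier" (ASCII only).
ASCII_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def old_style_check(name):
    """ASCII-only test, as a tool written before PEP 3131 might do it."""
    return ASCII_IDENTIFIER.match(name) is not None


def new_style_check(name):
    """Ask the interpreter itself, and exclude reserved words."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ("employee", "werknemer", "größe", "价格", "while", "2fast"):
    print(candidate, old_style_check(candidate), new_style_check(candidate))
```

Any utility built on the first kind of check would need to be found and
updated.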

If the proponents of this PEP think it will actually be used, then the
implementation section of this PEP should be updated to include making all
aspects of the standard Python distribution, the interpreter, libraries
and utilities fully support non-ASCII identifiers. These hypothetical
Chinese students aren't going to be happy if IDLE doesn't highlight identifiers
correctly, or the caret in a syntax error doesn't point to the right
place.

Hmm... normalizing identifiers could cause problems with module names.
If Python searches the filesystem for the module using the normalized
version of the name different from the one that appears in the source
code it could end up surprising users.
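
A small sketch of the mismatch, assuming NFKC normalization as the PEP
specifies for identifiers. The module name is hypothetical, and the sketch
only compares strings; it makes no claim about what the import machinery
actually searches for.

```python
import unicodedata

# A module name as typed in source, using the single-character ligature U+FB01.
typed_name = "\ufb01le_utils"

# PEP 3131 specifies NFKC normalization for identifiers.
normalized_name = unicodedata.normalize("NFKC", typed_name)

# Hypothetical file names a module search could look for on disk.
typed_filename = typed_name + ".py"
normalized_filename = normalized_name + ".py"

print(ascii(typed_filename), ascii(normalized_filename))
print(typed_filename == normalized_filename)
```

The two file names look nearly identical on screen but are different
strings, so a file saved under one spelling would not match the other.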

Ross Ridge

Guest · May 15, 2007

Carsten said:
Allowing people to use identifiers in their native language would
definitely be an advantage for people from such cultures. That's the use
case for this PEP. It's easy for Euro-centric people to say "just suck
it up and use ASCII", but the same people would probably starve to death
if they were suddenly teleported from Somewhere In Europe to rural China
which is so unimaginably different from what they know that it might
just as well be a different planet. "Learn English and use ASCII" is not
generally feasible advice in such cultures.

This is a very weak argument, IMHO. How do you want to use Python
without learning at least enough English to grasp a somewhat decent
understanding of the standard library? Let's face it: To do any "real"
programming, you need to know at least some English today, and I don't
see that changing anytime soon. And it is definitely not going to be
changed by allowing non-ASCII identifiers.

I must say that the argument about domain-specific terms that
programmers don't know how to translate into English does hold some
merit (although it does not really convince me, either -- are these
cases really so common that you cannot feasibly use a transliteration?).
But having, for example, things like open() from the stdlib in your code
and then öffnen() as a name for functions/methods written by yourself is
just plain silly. It makes the code inconsistent and ugly without
significantly improving the readability for someone who speaks German
but not English.

Donn Cave · May 15, 2007

Duncan Booth said:
Yes, non-English speakers have to learn a set of technical words which are
superficially in English, but even English native speakers have to learn
non-obvious meanings, or non-English words 'str', 'isalnum', 'ljust'.
That is an unavoidable barrier, but it is a limited vocabulary and a
limited set of syntax rules. What I'm trying to say is that we shouldn't
raise the entry bar any higher than it has to be.

The languages BTW in the countries I mentioned are: in Nigeria all school
children must study both their indigenous language and English, Brazil and
Uruguay use Spanish and Nepali is the official language of Nepal.

[Spanish in Brazil? Not as much as you might think.]

This issue reminds me a lot of CP4E, which some years back seemed
to be an ideological driver for Python development. Computer Programming
4 Everyone, for those who missed it. I can't say it actually had
a huge effect on Python, which has in most respects gone altogether
the opposite direction, but it was always on the table and certainly
must have had some influence.

One of the reasons these initiatives make soggy footing for a new
direction is that everyone's an expert, when it comes to one or
another feature that may or may not work for children, but no one
has a clue when it comes to the total package, sometimes to the
point of what seems like willful blindness to the deficiencies of
a favorite programming language.

If we have a sound language proposal backed by a compelling need,
fine, but don't add a great burden to the language for the sake of
great plans for Nepalese grade school programmers.

Donn Cave

Sion Arrowsmith · May 15, 2007

Aldo Cortesi said:
[ ... ] There is no general way to detect homoglyphs and "convert them to
a normal form". Observe:

import unicodedata
print repr(unicodedata.normalize("NFC", u"\u2160"))
print u"\u2160"
print "I"

FYI, those come out as two very clearly distinct glyphs in the
default terminal font I have here. (The ROMAN NUMERAL ONE has no
cross-bars, and is more likely to be confused with "|".)
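
The quoted example can be extended in Python 3 syntax to compare
normalization forms. This is only a sketch; the Cyrillic letter is an
added example that does not appear in the quoted post.

```python
import unicodedata

roman_one = "\u2160"      # ROMAN NUMERAL ONE
latin_i = "I"             # LATIN CAPITAL LETTER I
cyrillic_a = "\u0430"     # CYRILLIC SMALL LETTER A
latin_a = "a"             # LATIN SMALL LETTER A

# NFC leaves the Roman numeral alone, as in the quoted example.
nfc = unicodedata.normalize("NFC", roman_one)

# NFKC, the form PEP 3131 specifies for identifiers, maps it to plain "I".
nfkc = unicodedata.normalize("NFKC", roman_one)

# Normalization does not merge look-alikes from different scripts.
cyrillic_nfkc = unicodedata.normalize("NFKC", cyrillic_a)

print(ascii(nfc), ascii(nfkc), ascii(cyrillic_nfkc))
print(nfkc == latin_i, cyrillic_nfkc == latin_a)
```

So the choice of normalization form handles this particular pair, while
cross-script homoglyphs remain distinct whatever form is used.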
