PEP 3131: Supporting Non-ASCII Identifiers

Hendrik van Rooyen

I can't admit that, but I find that using German
class and method names is beautiful. The rest around
it (keywords and names from the standard library)
are not English - they are Python.

(look me in the eye and tell me that "def" is
an English word, or that "getattr" is one)

Regards,
Martin

LOL - true - but a broken down assembler programmer like me
does not use getattr - and def is short for define, and for and while
and in are not German.

Looks like you have stirred up a hornets' nest...

- Hendrik
 
Sion Arrowsmith

Hendrik van Rooyen said:
Would not like it at all, for the same reason I don't like re's -
It looks like random samples out of alphabet soup to me.

What I meant was, would the use of "foreign" identifiers look so
horrible to you if the core language had fewer English keywords?
(Perhaps Perl, with its line-noise, was a poor choice of example.
Maybe Lisp would be better, but I'm not so sure of my Lisp as to
make such an assertion for it.)
 
Martin v. Löwis

So, please provide feedback, e.g. perhaps by answering these

I think the biggest argument against this PEP is how little similar
features are used in other languages and how poorly they are supported
by third party utilities. Your PEP gives very little thought to how
the change would affect the standard Python library. Are non-ASCII
identifiers going to be poorly supported in Python's own library and
utilities?

For other languages (in particular Java), one challenge is that
you don't know the source encoding - it's neither fixed, nor is
it given in the source code file itself.

Instead, the environment has to provide the source encoding, and that
makes it difficult to use. The JDK javac uses the encoding from the
locale, which is nonsensical if you check out source from a
repository. Eclipse has solved the problem: you can specify source
encoding on a per-project basis, and it uses that encoding
consistently in the editor and when running the compiler.

For Python, this problem was solved long ago: PEP 263 allows you to
specify the source encoding within the file, and there was
always a default encoding. The default encoding will change to
UTF-8 in Python 3.
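
To make the PEP 263 mechanism concrete, here is a minimal sketch using the standard library's tokenize.detect_encoding, which looks for the declaration in the first two lines of a file and otherwise falls back to UTF-8 (the Python 3 default). The sample file contents and variable names are made up for illustration.

```python
import io
import tokenize

# Hypothetical file contents with a PEP 263 declaration on the first line.
declared = "# -*- coding: iso-8859-15 -*-\nname = 'L\u00f6wis'\n".encode("iso-8859-15")

# A hypothetical file without any declaration.
undeclared = "name = 'plain'\n".encode("utf-8")

# detect_encoding() takes a readline callable that yields bytes.
declared_encoding, _ = tokenize.detect_encoding(io.BytesIO(declared).readline)
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(undeclared).readline)

print(declared_encoding)  # iso-8859-15
print(default_encoding)   # utf-8
```

An editor or other tool that applies the same lookup reads the file the way the compiler does, with no help from the environment.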

IDLE has been supporting PEP 263 from the beginning, and several
other editors support it as well. Not sure what other tools
you have in mind, and what problems you expect.

Regards,
Martin
 
Guest

René Fleschenberg said:
Integration with existing tools *is* something that a PEP should
consider. This one does not do that sufficiently, IMO.

What specific tools should be discussed, and what specific problems
do you expect?

Regards,
Martin
 
Martin v. Löwis

In the code I was looking at identifiers were allowed to use non-ASCII
characters. For whatever reason, the programmers chose not to use non-ASCII
identifiers even though they had no problem using non-ASCII characters
in comments.

One possible reason is that the tools processing the program would not
know correctly what encoding the source file is in, and would fail
when they guessed the encoding incorrectly. For comments, that is not
a problem, as an incorrect encoding guess has no impact on the meaning
of the program (if the compiler is able to read over the comment
in the first place).
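
A small sketch of that failure mode, assuming a file saved as UTF-8 that some tool wrongly reads as Latin-1. The one-line source and the variable names are invented for the example.

```python
# A hypothetical one-line source file, saved as UTF-8 by its author.
raw = "h\u00f6he = 1  # H\u00f6he in Metern\n".encode("utf-8")

# A tool that wrongly guesses Latin-1 sees different characters.
misread = raw.decode("latin-1")
identifier = misread.split("=")[0].strip()

# The comment is merely garbled, but the identifier is now a different
# string, and one that is not a valid name at all.
print(identifier == "h\u00f6he")     # False
print(identifier.isidentifier())     # False
```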

Another possible reason is that the programmers were unsure whether
non-ASCII identifiers are allowed.

Regards,
Martin
 
Martin v. Löwis

After 175 replies (and counting), the only thing that is clear is the
controversy around this PEP. Most people are very strong for or
against it, with little middle ground in between. I'm not saying that
every change must meet 100% acceptance, but here there is definitely a
strong opposition to it. Accepting this PEP would upset lots of people
as it seems, and it's interesting that quite a few are not even native
English speakers.

I believe there is a lot of middle ground, but those people don't speak
up. I interviewed about 20 programmers (none of them Python users), and
most took the position "I might not use it myself, but it surely
can't hurt having it, and there surely are people who would use it".
2 people were strongly in favor, and 3 were strongly opposed.

Of course, those people wouldn't take a lot of effort to defend their
position in a usenet group. So that the majority of the responses
comes from people with strong feelings either way is no surprise.

Regards,
Martin
 
Martin v. Löwis

However, what I want to see is how people deal with such issues when
sharing their code: what are their experiences and what measures do
they mandate to make it all work properly? You can see some
discussions about various IDEs mandating UTF-8 as the default
encoding, along with UTF-8 being the required encoding for various
kinds of special Java configuration files.

I believe the problem is solved when everybody uses Eclipse.
You can set a default encoding for all Java source files in a project,
and you check the project file into your source repository.
Eclipse both provides the editor and drives the compiler, and
does so in a consistent way.

Yes, it should reduce confusion at a technical level. But what about
the tools, the editors, and so on? If every computing environment had
decent UTF-8 support, wouldn't it be easier to say that everything has
to be in UTF-8?

For both Python and Java, it's too much historical baggage already.
When source encodings were introduced to Python, allowing UTF-8
only was already proposed. People rejected it at the time, because
a) they had source files that weren't encoded in UTF-8, and
were afraid of breaking them, and
b) their editors would not support UTF-8.

So even with Python 3, UTF-8 is *just* the default default encoding.
I would hope that all Python IDEs, over time, learn about this
default; until then, users may have to manually configure their
IDEs and editors. With a default of UTF-8, it's still simpler than
with PEP 263: you can say that .py files are UTF-8, and your
editor will guess incorrectly only if there is an encoding
declaration other than UTF-8.

Regards,
Martin
 
Guest

I claim that this is *completely unrealistic*. When learning Python, you
*do* learn the actual meanings of English terms like "open",
"exception", "if" and so on if you did not know them before. It would be
extremely foolish not to do so.

Having taught students for many years now, I can report that this is
most certainly *not* the case. Many people only ever learn the technical
meaning of some term, and never grasp the English meaning. They could
look into a dictionary, but they rather read the documentation.

I've reported this before, but happily do it again: I have lived many
years without knowing what a "hub" is, and what "to pass" means if
it's not the opposite of "to fail". Yet, I have used their technical
meanings correctly all these years.

Regards,
Martin
 
Gregor Horvath

Martin said:
I've reported this before, but happily do it again: I have lived many
years without knowing what a "hub" is, and what "to pass" means if
it's not the opposite of "to fail". Yet, I have used their technical
meanings correctly all these years.

That's not only true for computer terms.
In the German Viennese slang there are a lot of Italian, French,
Hungarian, Czech, Hebrew and Serbo-Croatian words. Nobody knows the exact
meaning in their original language (nor does the vast majority actually
speak those languages), but all are used in the correct original context.

Gregor
 
rurpy

Microsoft once translated their VBA to foreign languages.
I didn't use it because I was used to "English" code.
If I program in mixed cultural contexts I have to use the lowest common
denominator. Mixing the symbols of the programming language is confusing.

Yup, I agree wholeheartedly. So do almost all
the other people who have responded in this thread.
In public code, open source code, code being worked
on by people from different countries, English is almost
always the best choice.

Nothing in the PEP interferes with or prevents this.
The PEP only allows non-ASCII identifiers, when they
are appropriate: in code that is unlikely ever to
be touched by people who don't know that language.
(Obviously any language feature can be misused
but peer-pressure, documentation, and education
have been very effective in preventing such misuse.
There is no reason they shouldn't be effective
here too.)

And yes, some code will be developed in a single
language environment and then be found to be useful
to a wider audience. It's not the end of the world.
It is no worse than when code written with a single
language UI becomes public -- it will get
fixed so that it meets the standards for an internationally
collaborative project. Seems to me that replacing
identifiers with English ones is fairly trivial,
isn't it? One can identify identifiers by parsing
the program and replace them from a prepared table
of replacements. This seems much easier than fixing
comments and docstrings, which need to be done by
hand. But the comment/docstring problem exists now
and has nothing to do with the PEP.
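
A rough sketch of that table-driven replacement, using the standard tokenize module so that only NAME tokens are touched and strings and comments are left alone. The function name rename_identifiers and the German-to-English table are hypothetical, and names that appear only inside strings (for example as getattr() arguments) would still need fixing by hand.

```python
import io
import tokenize


def rename_identifiers(source, table):
    """Replace identifiers found in `table`, leaving strings and comments alone."""
    lines = source.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in table:
            edits.append((tok.start, tok.end, table[tok.string]))
    # Apply edits from last to first so earlier column offsets stay valid.
    for (row, col), (_, end_col), new_name in reversed(edits):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new_name + line[end_col:]
    return "".join(lines)


# Hypothetical input and replacement table.
german = (
    "class T\u00fcrstock:\n"
    "    h\u00f6he = 0\n"
    "    breite = 0\n"
    "\n"
    "    def fl\u00e4che(self):\n"
    "        return self.h\u00f6he * self.breite\n"
)
table = {
    "T\u00fcrstock": "DoorFrame",
    "h\u00f6he": "height",
    "breite": "width",
    "fl\u00e4che": "area",
}

english = rename_identifiers(german, table)
print(english)
```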

A long time ago, at the age of 12, I learned programming using English
computer books. Then there were no German books at all. It was not easy.
It would have been completely impossible if our school system had not
been wise enough to teach us English early.

I think millions of people are handicapped because of this.
Any step to improve this is a good step for all of us. No doubt
there are a lot of talents wasted because of this wall.

I agree that anyone who wants to be a programmer is
well advised to learn English. I would also advise
anyone who wants to be a programmer to go to college.
But I have met very good programmers who were not
college graduates and although I don't know any
non-English speakers I am sure there are very good programmers
who don't know English.

There is a big difference between encouraging someone
to do something, and taking steps to make them do
something.

A lot of the English-only rhetoric in this thread seems
very reminiscent of arguments a decade+ ago regarding
wide characters and unicode, and other i18n support.
"Computing is ascii-based, we don't need all this
crap, and besides, it doubles the memory used by strings!
English is good enough". Except of course that it wasn't.

When technology demands that people adapt to it, it loses.
When technology adapts to the needs of people, it wins.

The fundamental question is whether language designers,
or the people writing the code, should be the ones to
decide what language identifiers are most appropriate
for their program. Do language designers, all of whom
are English speakers, have the wisdom to decide for
programmers all over the world, and for years to come,
that they must learn English to use Python effectively?
And if they do, will the people affected agree, or
will they choose a different language?
 
Istvan Albert

Who said anything like that? It's just an example of surprising and
unexpected difficulties that may arise even when doing trivial things,
and that proponents do not seem to want to admit to.

Should computer programming only be easily accessible to a small fraction
of privileged individuals who had the luck to be born in the correct
countries?
Should the unfounded and maybe xenophilous fear of losing power and
control of a small number of those already privileged be a guide for
development?

Now that right there is your problem. You are reading a lot more into
this than you should. Losing power, xenophilus(?) fear, privileged
individuals,

just step back and think about it for a second, it's a PEP and people
have different opinions, it is very unlikely that there is some
generic sinister agenda that one must be subscribed to

i.
 
Guest

I'd suggest restricting identifiers under the rules of UTS-39,
profile 2, "Highly Restrictive". This limits mixing of scripts
in a single identifier; you can't mix Hebrew and ASCII, for example,
which prevents problems with mixing right to left and left to right
scripts. Domain names have similar restrictions.

That sounds interesting, however, I cannot find the document
you refer to. In TR 39 (also called Unicode Technical Standard #39),
at http://unicode.org/reports/tr39/ there is no mention
of numbered profiles, or "Highly Restrictive".

Looking at the document, it seems 3.1., "General Security Profile
for Identifiers" might apply. IIUC, xidmodifications.txt would
have to be taken into account.

I'm not quite sure what that means; apparently, a number of
characters (listed as restricted) should not be used in
identifiers. OTOH, it also adds HYPHEN-MINUS and KATAKANA
MIDDLE DOT - which surely shouldn't apply to Python
identifiers, no? (at least HYPHEN-MINUS already has a meaning
in Python, and cannot possibly be part of an identifier).

Also, mixed-script detection might be considered, but it is
not clear to me how to interpret the algorithm in section
5, plus it says that this is just one of the possible
algorithms.
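
For a feel of what mixed-script detection involves, here is a deliberately crude sketch. It is not the TR 39 algorithm: the unicodedata module does not expose the Script property, so the hypothetical helper rough_scripts guesses the script from the first word of each character's Unicode name, which is only a heuristic.

```python
import unicodedata


def rough_scripts(identifier):
    """Approximate the set of scripts in an identifier from character names."""
    scripts = set()
    for ch in identifier:
        if ch == "_" or ch.isdigit():
            continue  # treat underscores and digits as script-neutral
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(name.split()[0])
    return scripts


def looks_mixed(identifier):
    return len(rough_scripts(identifier)) > 1


# "paypal" with a Cyrillic U+0430 in place of the first Latin "a".
spoof = "p\u0430ypal"
print(sorted(rough_scripts(spoof)))   # ['CYRILLIC', 'LATIN']
print(looks_mixed(spoof))             # True
print(looks_mixed("h\u00f6he"))       # False
```

A check like this needs only one identifier as input, whereas confusable detection compares pairs.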

Finally, Confusable Detection is difficult to perform on
a single identifier - it seems you need two of them to
find out whether they are confusable.

In any case, I added this as an open issue to the PEP.

Regards,
Martin
 
Richard Hanson

The syntax of identifiers in Python will be based on the Unicode
standard annex UAX-31 [1]_, with elaboration and changes as defined
below.

Within the ASCII range (U+0001..U+007F), the valid characters for
identifiers are the same as in Python 2.5. This specification only
introduces additional characters from outside the ASCII range. For
other characters, the classification uses the version of the Unicode
Character Database as included in the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers
(Nl), plus the underscore (XXX what are "stability extensions" listed in
UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), and connector punctuations (Pc).


[...]

.. [1] http://www.unicode.org/reports/tr31/

First, to Martin: Thanks for writing this PEP.

While I have been reading both sides of this debate and finding both
sides reasonable and understandable in the main, I have several
questions which seem to not have been raised in this thread so far.

Currently, in Python 2.5, identifiers are specified as starting with
an upper- or lowercase letter or underscore ('_') with the following
"characters" of the identifier also optionally being a numerical digit
("0"..."9").

This current state seems easy to remember even if felt restrictive by
many.

Contrariwise, the referenced document "UAX-31" is a bit obscure to me
(which is not eased by the fact that various browsers render non-ASCII
characters differently or not at all depending on the setup and font
sets available). Further, a cursory perusing of the unicodedata module
seems to refer me back to the Unicode docs.

I note that UAX-31 seems to allow "ideographs" as ``ID_Start``, for
example. From my relative state of ignorance, several questions come
to mind:

1) Will this allow me to use, say, a "right-arrow" glyph (if I can
find one) to start my identifier?

2) Could an ``ID_Continue`` be used as an ``ID_Start`` if using a RTL
(reversed or "mirrored") identifier? (Probably not, but I don't know.)

3) Is or will there be a definitive and exhaustive listing (with
bitmap representations of the glyphs to avoid the font issues) of the
glyphs that the PEP 3131 would allow in identifiers? (Does this
question even make sense?)
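
Question 1 can at least be probed against the draft rule quoted above. The sketch below encodes only the general categories listed in that excerpt; the helper names are hypothetical, the "stability extensions" are left out, and a real Python 3 interpreter may differ in corner cases, so str.isidentifier() is the authority for what a given interpreter actually accepts.

```python
import unicodedata

# General categories from the quoted draft of the PEP.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def matches_draft_rule(name):
    """Check <ID_Start> <ID_Continue>* using the categories above."""
    return (
        bool(name)
        and is_id_start(name[0])
        and all(is_id_continue(ch) for ch in name[1:])
    )


arrow_name = "\u2192R"        # RIGHTWARDS ARROW followed by "R"
ideographs = "\u4e2d\u6587"   # two CJK ideographs

print(unicodedata.category("\u2192"))   # Sm (math symbol), not a letter
print(matches_draft_rule(arrow_name))   # False
print(matches_draft_rule(ideographs))   # True
print(arrow_name.isidentifier())        # False
```

So under these rules ideographs can start an identifier, but an arrow symbol cannot.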

I have long programmed in RPL and have appreciated being able to use,
say, a "right arrow" symbol to start a name of a function (e.g., "->R"
or "->HMS" where the '->' is a single, right-arrow glyph).[1]

While it is not clear that identifiers I may wish to use would still
be prohibited under PEP 3131, I vote:

+0

__________________________________________
[1] RPL (HP's Dr. William Wickes' language and environment circa the
1980s) allows for a few specific "non-ASCII" glyphs as the start of a
name. I have solved my problem with my Python "appliance computer"
project by having up to three representations for my names: Python 2.x
acceptable names as the actual Python identifier, a Unicode text
display exposed to the end user, and also if needed, a bitmap display
exposed to the end user. So -- IAGNI. :)
 
Istvan Albert

up. I interviewed about 20 programmers (none of them Python users), and
most took the position "I might not use it myself, but it surely
can't hurt having it, and there surely are people who would use it".

Typically when you ask people about esoteric features that seemingly
don't affect them but might be useful to someone, the majority will
say yes. It's simply common courtesy; it's not like they have to do
anything.

At the same time it takes some mental effort to analyze and understand
all the implications of a feature, and without taking that effort
"something" will always beat "nothing".

After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.

i.
 
rurpy

(look me in the eye and tell me that "def" is
an English word, or that "getattr" is one)

That's not quite fair. They are not English
words but they are derived from English and
have a mnemonic value to English speakers that
they don't (or only accidentally) have for
non-English speakers.
 
Gregor Horvath

Istvan said:
After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.

Is there any difference for you in debugging these code snippets?

class Türstock(object):
    höhe = 0
    breite = 0
    tiefe = 0

    def _get_fläche(self):
        return self.höhe * self.breite

    fläche = property(_get_fläche)

#-----------------------------------

class Tuerstock(object):
    hoehe = 0
    breite = 0
    tiefe = 0

    def _get_flaeche(self):
        return self.hoehe * self.breite

    flaeche = property(_get_flaeche)


I can tell you that for me and for my costumers this makes a big difference.

Whether this PEP gets accepted or not I am going to use German
identifiers and you have to be frightened to death by that fact ;-)

Gregor
 
Steve Holden

Istvan said:
Typically when you ask people about esoteric features that seemingly
don't affect them but might be useful to someone, the majority will
say yes. It's simply common courtesy; it's not like they have to do
anything.

At the same time it takes some mental effort to analyze and understand
all the implications of a feature, and without taking that effort
"something" will always beat "nothing".

Indeed. For example, getattr() and friends now have to accept Unicode
arguments, and presumably to canonicalize correctly to avoid errors, and
treat equivalent Unicode and ASCII names as the same (question: if two
strings compare equal, do they refer to the same name in a namespace?).

After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.

And pretty quickly, too. If anyone but Martin were the author of the
PEP I'd have serious doubts, but if he thinks it's worth proposing
there's at least a chance that it will eventually be implemented.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------
 
Steve Holden

Gregor said:
Is there any difference for you in debugging these code snippets?

class Türstock(object):
    höhe = 0
    breite = 0
    tiefe = 0

    def _get_fläche(self):
        return self.höhe * self.breite

    fläche = property(_get_fläche)

#-----------------------------------

class Tuerstock(object):
    hoehe = 0
    breite = 0
    tiefe = 0

    def _get_flaeche(self):
        return self.hoehe * self.breite

    flaeche = property(_get_flaeche)


I can tell you that for me and for my costumers this makes a big difference.

So you are selling to the clothing market? [I think you meant
"customers". God knows I have no room to be snitty about other people's
typos. Just thought it might raise a smile].

Whether this PEP gets accepted or not I am going to use German
identifiers and you have to be frightened to death by that fact ;-)

That's fine - they will be at least as meaningful to you as my English
ones would be to your countrymen who don't speak English.

I think we should remember that while programs are about communication
there's no requirement for (most of) them to be universally comprehensible.

regards
Steve
 
Martin v. Löwis

At the same time it takes some mental effort to analyze and understand
Indeed. For example, getattr() and friends now have to accept Unicode
arguments, and presumably to canonicalize correctly to avoid errors, and
treat equivalent Unicode and ASCII names as the same (question: if two
strings compare equal, do they refer to the same name in a namespace?).

Actually, that is not an issue: In Python 3, there is no data type for
"ASCII string" anymore, so all __name__ attributes and __dict__ keys
are Unicode strings - regardless of whether this PEP gets accepted
or not (which it just did).
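
On the canonicalization question, a short sketch of the behaviour in Python 3 as I understand it from the accepted PEP: identifiers in source code are converted to normal form NFKC while parsing, whereas strings handed to getattr() are used exactly as given. The names below are made up for the example.

```python
import unicodedata

composed = "h\u00f6he"       # "o with diaeresis" as a single code point
decomposed = "ho\u0308he"    # "o" followed by COMBINING DIAERESIS

# The parser normalizes the identifier, so the namespace key is composed.
namespace = {}
exec(decomposed + " = 1", namespace)
print(composed in namespace)     # True
print(decomposed in namespace)   # False


class Door:
    pass


door = Door()
setattr(door, composed, 2)

# getattr() does no normalization; the caller has to do it explicitly.
print(getattr(door, decomposed, "missing"))                       # missing
print(getattr(door, unicodedata.normalize("NFKC", decomposed)))   # 2
```

So two strings name the same attribute only if they are equal as strings; the normalization step applies to source code, not to run-time lookups.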

Regards,
Martin
 
sjdevnull

Istvan Albert wrote:


After the first time that your programmer friends need to fix a trivial
bug in a piece of code that does not display correctly in the terminal
I can assure you that their mellow acceptance will turn to something
entirely different.

Is there any difference for you in debugging these code snippets?

class Türstock(object): [snip]
class Tuerstock(object):

After finding a platform where those are different, I have to say
yes. Absolutely. In my normal setup they both display as "class
Tuerstock" (three letters 'T' 'u' 'e' starting the class name). If,
say, an exception was raised, it'd be fruitless for me to grep or
search for "Tuerstock" in the first one, and I might wind up wasting a
fair amount of time if a user emailed that to me before realizing that
the stack trace was just wrong. Even if I had extended character
support, there's no guarantee that all the users I'm supporting do.
If they do, there's no guarantee that some intervening email system
(or whatever) won't munge things.

With the second one, all my standard tools would work fine. My users'
setups will work with it. And there's a much higher chance that all
the intervening systems will work with it.
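
For projects that share this concern and want to stay ASCII-only, a check along the following lines could run before code is accepted. This is only a sketch: the function name non_ascii_identifiers is hypothetical, and it relies on the standard tokenize module plus str.isascii() (Python 3.7 or later).

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line number, name) pairs for identifiers outside ASCII."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


# Hypothetical input; the non-ASCII comment is deliberately not reported.
sample = (
    "class T\u00fcrstock(object):  # T\u00fcr\n"
    "    h\u00f6he = 0\n"
    "    breite = 0\n"
)

hits = non_ascii_identifiers(sample)
print(len(hits))             # 2
print([line for line, _ in hits])   # [1, 2]
```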
 
