PEP 3131: Supporting Non-ASCII Identifiers

Jochen Schulz

* René Fleschenberg:
Stefan said:
[...] They are just tools. Even if you do not
understand English, they will not get in your way. You just learn them.

I claim that this is *completely unrealistic*. When learning Python, you
*do* learn the actual meanings of English terms like "open",
"exception", "if" and so on if you did not know them before.

This is certainly true for easy words like "open" and "in". But there
are a lot of counterexamples.

When learning something new, you always learn a lot of new concepts and
you get to know what they are called in this specific context. For
example, when you learn to program, you might stumble upon the idea of
"exceptions", which you can raise/throw and except/catch. But even if
you know how to use that concept and understand what it does, you do not
necessarily know the "usual" meaning of the word outside of your domain.

As far as I can tell, quite often these are the terms that even enter
the native language without any translation (even though there are
perfect translations for the words in their original meaning). German
examples are "exceptions", "client" and "server", "mail", "hub" and
"switch", "web" and many, many more. Nobody who uses these terms has to
know their exact meaning in his native language as long as he speaks to
Germans or stays in the domain where he learned them.

I read a lot of English text every day but I am sometimes still
surprised to learn that a word I already knew has a meaning outside
of computing. "Hub" is a nice example of that. I was very surprised to
learn that even my bike has this. ;-)

J.
 
George Sakkis

(snipped)

So, please provide feedback, e.g. perhaps by answering these
questions:
- should non-ASCII identifiers be supported? why?

Initially I was on -1 but from this thread it seems that many closed
(or semi-closed) environments would benefit from such a change. I'm
still concerned though about the segregation this feature encourages.
In my (admittedly limited) experience in non-English-speaking
environments, everyone used to have some basic command of English and
was encouraged to use proper English identifiers; OTOH, the hodgepodge
of English keywords/stdlib/3rd party symbols with application
identifiers transliterated to ASCII was looked down on as clumsy and in
fact less readable.

Bottom line, -0.

- would you use them if it was possible to do so? in what cases?

No, and I would refuse to maintain code that did use them*.

George


* Unless I start teaching programming to preschoolers or something.
 
Sion Arrowsmith

Steven D'Aprano said:
Maybe you should find out then? Personal ignorance is never an excuse for
rejecting technology.

The funny thing is, I could have told you exactly how to type a 'pi'
character 18 years ago, when my main use of computers was typesetting
on a Mac. These days ... I've just spent 20 minutes trying to find out
how to insert one into this text (composed in emacs on a remote
machine, connected via ssh from konsole).
 
Javier Bezos

Eric Brunel said:
Funny you talk about Japanese, a language I'm a bit familiar with and for
which I actually know some input methods. The thing is, these only work if
you know the transcription to the latin alphabet of the word you want to
type, which closely match its pronunciation. So if you don't know that 売り場
is pronounced "uriba" for example, you have absolutely no way of
entering the word.

Actually, you can draw the character (in XP, at
least) entirely or in part and the system shows a
list of them with similar shapes. IIRC, there is
a similar tool on Macs. Of course, I'm not saying
this allows you to enter kanji in an easy and fast way,
but certainly it's not impossible at all, even if
you don't know the pronunciation.

Javier
 
Sion Arrowsmith

Hendrik van Rooyen said:
I still don't like the thought of the horrible mix of "foreign"
identifiers and English keywords, coupled with the English
sentence construction.

How do you think you'd feel if Python had less in the way of
(conventionally used) English keywords/builtins. Like, say, Perl?
 
Marc 'BlackJack' Rintsch

No. Make "ASCII-only" an interpreter option that can be turned on for the
cases where it is really required.

Make no interpreter options and use `pylint` and `pychecker` for checking
if the sources follow your style guide with respect to identifiers.

Ciao,
Marc 'BlackJack' Rintsch
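
The checker-based approach suggested above can be sketched in a few lines. This is only an illustration using the standard `tokenize` module, not how `pylint` or `pychecker` actually work; the function name `non_ascii_identifiers` and the sample source are made up for the example, and `str.isascii()` assumes Python 3.7 or later.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return (line, name) pairs for names that contain non-ASCII characters."""
    found = []
    # generate_tokens() takes a readline callable that yields str lines.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; string literals and
        # comments are separate token types and are ignored here.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.start[0], tok.string))
    return found


# Hypothetical sample: a German identifier with an umlaut.
sample = "zähler = 1\ntotal = zähler + 1\n"
for line, name in non_ascii_identifiers(sample):
    print(f"line {line}: {name}")
```

A project that wants ASCII-only identifiers could run something like this as part of its own style check, which leaves the interpreter itself free of any switch.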
 
Neil Hodgson

Eric Brunel:
Funny you talk about Japanese, a language I'm a bit familiar with and
for which I actually know some input methods. The thing is, these only
work if you know the transcription to the latin alphabet of the word you
want to type, which closely match its pronunciation. So if you don't
know that 売り場 is pronounced "uriba" for example, you have absolutely
no way of entering the word. Even if you could choose among a list of
characters, are you aware that there are almost 2000 "basic" Chinese
characters used in the Japanese language? And if I'm not mistaken, there
are several tens of thousands characters in the Chinese language itself.
This makes typing them virtually impossible if you don't know the
language and/or have the correct keyboard.

It is nowhere near that difficult. There are several ways to
approach this, including breaking up each character into pieces and
looking through the subset of characters that use that piece (the
Radical part of the IME). For 売, you can start with the cross with a
short bottom stroke (at the top of the character) 士, for 場 look for
the crossy thing on the left 土. The middle character is simple looking,
so it is probably not Chinese, and I found it in Hiragana. Another approach is to
count strokes (Strokes section of the IME) and look through the
characters with that number of strokes. Within lists, the characters are
ordered from simplest to more complex so you can get a feel for where to
look.

Neil
 
Carsten Haese

You have misread my statements.



I think it is a prerequisite for "real" programming. Yes, I can imagine
that if you use Python as a teaching tool for Chinese 12 year-olds, then
it might be nice to be able to spell identifiers with Chinese
characters. However, IMO this is such a special use-case that it is
justified to require the people who need this to explicitly enable it,
by using a patched interpreter or by enabling an interpreter option for
example.

There you go again with "real" programming. Nobody that I'm aware of
dictates that Python must only be used for real programming.

It sounds like you are acknowledging that there are use cases for
allowing non-ASCII identifiers after all. Making some switch for
enabling this feature is a compromise that has been suggested on this
thread before, including by yours truly. I wouldn't even be opposed to
making this switch be off by default, as long as the feature is there
for people who need it.

I did not assert that at all, where did you get the impression that I
do? If I were convinced that no one would use it, I would not have such a
big problem with it. I fear that it *will* be used "in the wild" if the
PEP in its current form is accepted and that I personally *will* have to
deal with such code.

Yes, I apologize, I completely mangled your assertion. I don't know what
I was thinking when I wrote that. In reality you asserted, and I'll
quote verbatim this time: "It is naive to believe that you can program
in Python without understanding any English once you can use your native
characters in identifiers." It is precisely this assertion that is being
disproved by HYRY's students who *do* program in Python without
understanding any English[*], using native characters in identifiers.
But they have to launder their programs before they can run them.

[*] And if you respond that they must know "some" English in the form of
keywords and such, the answer is no, they need not. It is not hard for
Europeans to learn to visually recognize a handful of simple Chinese
characters without having to learn their pronunciation or even their
actual meaning. By the same token, a Chinese person can easily learn to
recognize "if", "while", "print" and so on visually as symbols, without
having to learn anything beyond what those symbols do in a Python
program.

Regards,
 
Eric Brunel

Eric Brunel:


It is nowhere near that difficult. There are several ways to
approach this, including breaking up each character into pieces and
looking through the subset of characters that use that piece (the
Radical part of the IME). For 売, you can start with the cross with a
short bottom stroke (at the top of the character) 士, for 場 look for
the crossy thing on the left 土. The middle character is simple looking,
so it is probably not Chinese, and I found it in Hiragana. Another approach is to
count strokes (Strokes section of the IME) and look through the
characters with that number of strokes. Within lists, the characters are
ordered from simplest to more complex so you can get a feel for where to
look.

Have you ever tried to enter anything more than 2 or 3 characters like
that? I did. It just takes ages. Come on: are you really serious about
entering *identifiers* in a *program* this way?
 
Neil Hodgson

Eric Brunel:
Have you ever tried to enter anything more than 2 or 3 characters like
that?

No, only for examples. Lengthy texts are either already available
digitally or are entered by someone skilled in the language.

I did. It just takes ages. Come on: are you really serious about
entering *identifiers* in a *program* this way?

Are you really serious about entry of identifiers in another
language being a problem?

Most of the time your identifiers will be available by selection
from an autocompletion list or through cut and paste. Less commonly,
you'll know what they sound like. Even more rarely you'll only have a
printed document. Each of these can be handled reasonably considering
their frequency of occurrence. I have never learned Japanese but have
had to deal with Japanese text at a couple of jobs and it isn't that big
of a problem. It's certainly not "virtually impossible" nor is there
"absolutely no way of entering the word" (売り場). I think you should
moderate your exaggerations.

Is there a realistic scenario in which foreign character set
identifier entry would be difficult for you?

Neil
 
Eric Brunel

Eric Brunel:


No, only for examples. Lengthy texts are either already available
digitally or are entered by someone skilled in the language.


Are you really serious about entry of identifiers in another
language being a problem?

Yes.

Most of the time your identifiers will be available by selection
from an autocompletion list or through cut and paste.

Auto-completion lists have always caused me more disturbance than help.
Since - AFAIK - you have to type some characters before they can be of any
help, I don't think they can help much here. I also did have to copy/paste
identifiers to program (because of a broken keyboard, IIRC), and found it
extremely difficult to handle. Constant movements to get every identifier
- either by keyboard or with the mouse - are not only unnecessary, but
also completely break my concentration. Programming this way takes me 4
or 5 times longer than being able to type characters directly.

Less commonly, you'll know what they sound like.

Highly improbable in the general context. If I stumble on a source code in
Chinese, Russian or Hebrew, I wouldn't be able to figure out a single
sound.

Even more rarely you'll only have a printed document.

I wonder how that could be of any help.

Each of these can be handled reasonably considering their frequency of
occurrence. I have never learned Japanese but have had to deal with
Japanese text at a couple of jobs and it isn't that big of a problem.
It's certainly not "virtually impossible" nor is there "absolutely no way
of entering the word" (売り場). I think you should moderate your
exaggerations.

I do admit it was a bit exaggerated: there actually are ways. You know it,
and I know it. But what about the average guy, not knowing anything about
Japanese, kanji, radicals and stroke counts? How will he manage to enter
these funny-looking characters, perhaps not even knowing it's Japanese?
And does he have to learn a new input method each time he stumbles across
identifiers written in a character set he doesn't know? And even if he
finds a way, the chances are that it will be terribly inefficient. Having
to pay attention to how you can type the things you want is a really big
problem when it comes to programming: you have a lot of other things to
think about.
 
Gregor Horvath

Eric said:
Highly improbable in the general context. If I stumble on a source code
in Chinese, Russian or Hebrew, I wouldn't be able to figure out a single
sound.

If you get source code in a programming language that you don't know you
can't figure out a single sound either.
How is that different?

If someone decides to make *his* identifiers in Russian he's taking into
account that non-Russian speakers are not going to be able to read the
code.
If someone decides to program in Fortran he takes into account that the
average Python programmer can not read the code.

How is that different?

It's the choice of the author.
Taking away the choice is not a good thing.
Following this logic we should forbid all other programming languages
except Python so everyone can read every code in the world.

Gregor
 
Paul Boddie

[*] And if you respond that they must know "some" English in the form of
keywords and such, the answer is no, they need not. It is not hard for
Europeans to learn to visually recognize a handful of simple Chinese
characters without having to learn their pronunciation or even their
actual meaning. By the same token, a Chinese person can easily learn to
recognize "if", "while", "print" and so on visually as symbols, without
having to learn anything beyond what those symbols do in a Python
program.

I think this is a crucial point being made here. Taking a page from
the python.jp site, from which an example was posted elsewhere in the
discussion, we see a sprinkling of Latin-based identifiers much like a
number of other Japanese sites:

http://www.python.jp/Zope/pythondoc_jp/

I know hardly anything about the Japanese language and have heard only
anecdotal tales of English proficiency amongst Japanese speakers, but
is it really likely that readers of that page (particularly newcomers)
know the special pronunciation of "LaTeX" (or even most English
readers unfamiliar with that technology) and the derivation of that
name, that "Q" specifically means "question", that "HTML" specifically
means "Hypertext Markup Language", and so on? It seems to me that
modern Japanese culture and society is familiar with such "symbols"
without there being any convincing argument to suggest that this is
only the case because "they all must know English".

Consequently, Python's keywords and even the standard library can
exist with names being "just symbols" for many people. It would be
interesting to explore the notion of localised versions of the
library; the means of providing interoperability between programs and
library versions in different languages would be one of the many
challenges involved.

Paul
 
Eric Brunel

If you get source code in a programming language that you don't know you
can't figure out a single sound either.
How is that different?

What kind of argument is that? If it was carved in stone, I would not be
able to enter it in my computer without rewriting it. So what?

The point is that today, I have a reasonable chance of being able to read,
understand and edit any Python code. With PEP 3131, it will no longer be
true. That's what bugs me.

If someone decides to make *his* identifiers in Russian he's taking into
account that non-Russian speakers are not going to be able to read the
code.

Same question again and again: how does he know that non-Russian speakers
will *ever* get in touch with his code and/or need to update it?
 
Gregor Horvath

Eric said:
The point is that today, I have a reasonable chance of being able to
read, understand and edit any Python code. With PEP 3131, it will no
longer be true. That's what bugs me.

That's just not true. I and others in this thread have stated that they
use German or other languages as identifiers today but are forced to
make a stupid and unreadable translation to ASCII.

Same question again and again: how does he know that non-Russian
speakers will *ever* get in touch with his code and/or need to update it?

If you didn't get non-English comments and identifiers until now, you
will not get any with this PEP either. And if you do get them today or
with the PEP it doesn't make a difference for you to get some glyphs not
properly displayed, does it?

Gregor
 
Istvan Albert

As a non-native English speaker,

- should non-ASCII identifiers be supported? why?

No. I don't think it adds much, I think it will be a little-used
feature (as it should be), every Python instructor will start their
class by saying here is a feature that you should stay away from
because you never know where your code ends up.

- would you use them if it was possible to do so? in what cases?

No. The only possible uses I can think of are intentionally
obfuscating code.

Here is something that just happened and relates to this subject: I
had to help a student run some Python code on her laptop, she had
Windows XP that hid the extensions. I wanted to set it up such that
the extension is shown. I don't have XP in front of me but when I do
it takes me 15 seconds to do it. Now her Windows was set up with some
Asian fonts (Chinese or Korean, not sure), looked extremely unfamiliar
and I had no idea what the menu systems were. We spent quite a
bit of time figuring out how to accomplish the task. I had her read me
back the options, but something like "hide extensions" comes out quite
a bit different. Surprisingly tedious and frustrating experience.

Anyway, something to keep in mind. In the end features like this may
end up hurting those it was meant to help.

i.
 
Gregor Horvath

Istvan said:
Here is something that just happened and relates to this subject: I
had to help a student run some Python code on her laptop, she had
Windows XP that hid the extensions. I wanted to set it up such that
the extension is shown. I don't have XP in front of me but when I do
it takes me 15 seconds to do it. Now her Windows was set up with some
Asian fonts (Chinese or Korean, not sure), looked extremely unfamiliar
and I had no idea what the menu systems were. We spent quite a
bit of time figuring out how to accomplish the task. I had her read me
back the options, but something like "hide extensions" comes out quite
a bit different. Surprisingly tedious and frustrating experience.

So the solution is to forbid Chinese XP ?

Gregor
 
sjdevnull

Christophe said:
(e-mail address removed) wrote:

Who displays stack frames? Your code. Whose code includes unicode
identifiers? Your code. Whose fault is it to create a stack trace
display procedure that cannot handle unicode? You.

Thanks but no--I work with a _lot_ of code I didn't write, and looking
through stack traces from 3rd party packages is not uncommon.

And I'm often not creating a stack trace procedure, I'm using the
built-in Python procedure.

And I'm often dealing with mailing lists, Usenet, etc where I don't
know ahead of time what the other end's display capabilities are, how
to fix them if they don't display what I'm trying to send, whether
intervening systems will mangle things, etc.

Even if you don't
make use of them, you still have to fix the stack trace display
procedure because the exception error message can include unicode text
*today*

It can, but having identifiers in portable characters at least allows
some ability to navigate the code. Display of strings is safe by
default anyway, as they can contain all sorts of data.

You should know that displaying and editing UTF-8 text as if it was
latin-1 works very very well.

Given that we've already seen one (fairly simple) character posted in
this thread that displayed differently in the HTML view than in the
edit--and neither place as the symbol originally intended--I'm going
to reserve judgement on this statement. I don't know whether the
problem was with Google, my browser, or something else, but I do know
that it made interchange of information difficult and that I'm using a
fairly recent (within the last 3 years) out-of-the-box setup.

Also, Terminals have support for UTF-8 encodings already. Or you could
always use kate+fish to edit your script on the distant server without
such problems (fish is a KDE protocol used to access a computer with ssh
as if it was a hard disk and kate is the standard text/code editor) It's
a matter of tools.

You don't always get to pick your tools. It's very nice to have
things work with standard setups, be they brand new Windows boxes or
the c. 1993 mail server in the office or your wife's handheld that
you've grabbed to help do emergency troubleshooting from your vacation
or whatever.
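
The disagreement above about reading UTF-8 text as if it were Latin-1 can be tried out directly. The following is a minimal sketch, using one of the identifiers from the PEP summary as sample data; it only shows what a Latin-1 view of UTF-8 bytes looks like and that the bytes survive a round trip, and it says nothing about how any particular terminal, editor, or mail gateway behaves.

```python
# One of the example identifiers from the PEP summary.
original = "ошибка"

# What is actually stored in a UTF-8 source file.
raw = original.encode("utf-8")

# What a tool sees when it assumes the bytes are Latin-1: every byte
# becomes one character, so each Cyrillic letter turns into two.
misread = raw.decode("latin-1")

print(len(original), "characters become", len(misread))
print(repr(misread))

# The bytes are preserved as long as the tool writes back exactly what it
# read, so the original text is recoverable.
restored = misread.encode("latin-1").decode("utf-8")
print(restored == original)
```

So the text stays editable in the sense that nothing is lost, but the identifier on screen no longer resembles what the author typed, which is the navigation problem described above.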
 
rurpy

On Wed, 16 May 2007 16:29:27 +0200, Neil Hodgson  



I do admit it was a bit exaggerated: there actually are ways. You know it,  
and I know it. But what about the average guy, not knowing anything about  
Japanese, kanji, radicals and stroke counts? How will he manage to enter  
these funny-looking characters, perhaps not even knowing it's Japanese?  
And does he have to learn a new input method each time he stumbles across  
identifiers written in a character set he doesn't know? And even if he  
finds a way, the chances are that it will be terribly inefficient. Having  
to pay attention to how you can type the things you want is a really big
problem when it comes to programming: you have a lot of other things to  
think about.


What does this have to do with the adoption of PEP-3131? Are you
saying that if non-English speakers are allowed to use non-English
identifiers in their code, that you will have to *write* code in a
language you don't know using a script you don't know?

If you, for some extremely improbable reason *have* to modify the
code, then you will be cutting and pasting the existing variables.
If you are creating new variables, then given that you don't know
the language and have no idea what to name the variable, the
mechanics of entering it are the least of your problems. Name
the new variable in ASCII and leave it to a native speaker to fix
later.

If the answer to that is, "see, non-English Python is bad", then
arguments against *enforcing* English-only Python are elsewhere in the
thread so I won't repeat them here.
 
laurentszyster

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").

+1

If only for one simple reason: JSON objects have Unicode names and it
may be convenient from a Python web agent to be able to instantiate/
serialize any such object as-is ... to/from a Python class instead of
a dictionary.

Regards,


Laurent Szyster
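
The JSON use case above might look roughly like this under Python 3, where non-ASCII identifiers are accepted. It is a sketch under assumptions, not the poster's code: the helper `namespace_from_json` and the sample record are invented for illustration, and `types.SimpleNamespace` stands in for whatever class a real web agent would use. Keys are NFKC-normalized because the Python parser normalizes identifiers the same way.

```python
import json
import unicodedata
from types import SimpleNamespace


def namespace_from_json(text):
    """Turn a JSON object into an object with one attribute per key."""
    data = json.loads(text)
    attrs = {}
    for key, value in data.items():
        # Match the normalization the parser applies to identifiers.
        name = unicodedata.normalize("NFKC", key)
        if not name.isidentifier():
            raise ValueError(f"not usable as an attribute name: {key!r}")
        attrs[name] = value
    return SimpleNamespace(**attrs)


# Hypothetical record whose keys are two of the PEP's example identifiers.
record = namespace_from_json('{"ошибка": "disk full", "changé": true}')
print(record.ошибка, record.changé)

# Serializing back keeps the names as they are.
print(json.dumps(vars(record), ensure_ascii=False))
```

With ASCII-only identifiers the same data is still reachable through `getattr` or a plain dictionary, so the gain here is convenience of attribute syntax rather than new capability.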
 
