PEP 3131: Supporting Non-ASCII Identifiers

Javier Bezos · May 18, 2007

Istvan Albert said:
How about debugging this (I wonder will it even make it through?) :

class 6???????
6?? = 0
6????? ?? ?=10

This question is more or less what a Korean who doesn't
speak English would ask if he had to debug a program
written in English.

(I don't know what it means, just copied over some words
from a Japanese news site,

A Japanese-speaking Korean, it seems.

Javier

Gregor Horvath · May 18, 2007

Istvan said:
Of course there is, how do I type the ü? (I can copy/paste for
example, but that gets old quick).

I doubt that you can debug the code without Unicode chars. It seems that
you do not understand German and therefore do not know what the
purpose of this program is.
Can you tell me if there is an error in the snippet without Unicode?

I would refuse to try to debug a program that I do not understand.
Avoiding Unicode does not help a bit in this regard.

Gregor

Paul Boddie · May 18, 2007

This question is more or less what a Korean who doesn't
speak English would ask if he had to debug a program
written in English.

Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.
It's already more difficult than it ought to be to explain to people
why they have trouble printing text to the console, for example, and
if one considers issues with badly configured text editors putting the
wrong character values into programs, even if Python complains about
it, there's still going to be some explaining to do.

One thing that some people already dislike about Python is the
"editing discipline" required. Although I don't have much time for
people whose coding "skills" involve random edits using badly
configured editors, trashing the indentation and the appearance of the
code (regardless of the language involved), we do need to consider the
need to bring people "up to speed" gracefully by encouraging the
proper use of tools, and so on, all without making it seem really
difficult and discouraging people from learning the language.

Paul

Gregor Horvath · May 18, 2007

Paul said:
Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.

I do not see the point.
Whether my editor or newsreader displays the text correctly or not makes
no difference to me, since I do not understand a word of it anyway. It's a
meaningless stream of bits for me.
It's safe to assume that for people who find this meaningful,
their setup will display it correctly. Otherwise they could not work
with their computer anyway.

Until now I have not found a single computer in my German domain that
cannot display: ß.

Gregor

Javier Bezos · May 18, 2007

Perhaps, but the treatment by your mail/news software plus the
delightful Google Groups of the original text (which seemed intact in
the original, although I don't have the fonts for the content) would
suggest that not just social or cultural issues would be involved.

The fact my Outlook changed the text is irrelevant
for something related to Python. And just remember
how Google mangled the indentation of Python code
some time ago. This was a technical issue which has
been solved, and no doubt my laziness (I didn't
switch to Unicode) won't prevent non-ASCII identifiers
from being properly displayed in general.

Javier

sjdevnull · May 18, 2007

The fact my Outlook changed the text is irrelevant
for something related to Python.

On the contrary, it cuts to the heart of the problem. There are
hundreds of tools out there that programmers use, and mailing lists
are certainly an incredibly valuable tool--introducing a change that
makes code more likely to be silently mangled seems like a negative.

Of course, there are other benefits to the PEP, so I'm only barely
opposed. But dismissing the fact that Outlook and other quite common
tools may have severe problems with code seems naive (or disingenuous,
but I don't think that's the case here).

Paul Boddie · May 18, 2007

Gregor said:
I do not see the point.
Whether my editor or newsreader displays the text correctly or not makes
no difference to me, since I do not understand a word of it anyway. It's a
meaningless stream of bits for me.

But if your editor doesn't even bother to preserve those bits
correctly, it makes a big difference. When ６자회담관련론조 becomes 6???????
because someone's tool did the equivalent of
unicode_obj.encode("iso-8859-1", "replace"), then the stream of bits
really does become meaningless. (We'll see if the former identifier
even resembles what I've just pasted later on, or whether it resembles
the latter.)
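A minimal Python 3 sketch of the lossy step described above. The sample string is an assumption (a fullwidth digit followed by Hangul syllables, in the style of the identifier under discussion), and the NFKC step is only a guess at how a mail client might turn the fullwidth digit into a plain 6; Python's own "replace" handler turns that digit into a question mark too.

```python
import unicodedata

# Hypothetical sample: FULLWIDTH DIGIT SIX followed by seven Hangul syllables.
original = "６자회담관련론조"

# Python alone: every character outside ISO-8859-1 becomes "?".
replaced = original.encode("iso-8859-1", "replace")

# Assumption: a tool that applies compatibility normalization (NFKC) first
# maps the fullwidth digit to ASCII "6" before the lossy encode.
folded = unicodedata.normalize("NFKC", original).encode("iso-8859-1", "replace")

print(replaced)
print(folded)
```

Either way, the original characters cannot be recovered from the result.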

It's safe to assume that for people who find this meaningful,
their setup will display it correctly. Otherwise they could not work
with their computer anyway.

Sure, it's all about "editor discipline" or "tool discipline" just as
I wrote. I'm in favour of the PEP, generally, but I worry about the
long explanations required when people find that their programs are
now ill-formed because someone made a quick edit in a bad editor.

Paul

Neil Hodgson · May 19, 2007

Istvan Albert:

But you're making a strawman argument by using extended ASCII
characters that would work anyhow. How about debugging this (I wonder
will it even make it through?) :

class ６자회담관련론조
６자회 = 0
６자회담관련 고귀 명=10

That would be invalid syntax since the third line is an assignment
with target identifiers separated only by spaces.
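The point about space-separated targets can be checked with the standard `ast` module. This is a small Python 3 sketch; the helper name `parses` and the sample identifiers are illustrative only, with the leading digit left out so that the spacing is the only problem.

```python
import ast

def parses(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True

# Three names separated only by spaces are not a valid assignment target.
spaced = "자회담관련 고귀 명 = 10"
# A single identifier made of the same kind of characters is fine.
single = "자회담관련 = 10"

print(parses(spaced))
print(parses(single))
```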

Neil

John Roth · May 19, 2007

PEP 1 specifies that PEP authors need to collect feedback from the
community. As the author of PEP 3131, I'd like to encourage comments
to the PEP included below, either here (comp.lang.python), or to
(e-mail address removed)

In summary, this PEP proposes to allow non-ASCII letters as
identifiers in Python. If the PEP is accepted, the following
identifiers would also become valid as class, function, or
variable names: Löffelstiel, changé, ошибка, or 売り場
(hoping that the latter one means "counter").
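As a quick illustration of what the summary describes, the following sketch runs under Python 3, where the PEP is implemented. The class and attribute names are made up for the example.

```python
# The four sample names from the PEP summary.
names = ["Löffelstiel", "changé", "ошибка", "売り場"]

for name in names:
    # str.isidentifier() applies the identifier rules of the running interpreter.
    print(name, name.isidentifier())

# A hypothetical class using non-ASCII names for itself and an attribute.
class Löffelstiel:
    länge = 12

print(Löffelstiel.länge)
```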

I notice that Guido has approved it, so I'm looking at what it would
take to support it for Python FIT. The actual issue (for me) is
translating labels for cell columns (and similar) into Python
identifiers. After looking at the firestorm, I've come to the
conclusion that the old methods need to be retained not only for
backwards compatibility but also for people who want to translate
existing fixtures.

The guidelines in PEP 3131 for standard library code appear to be
adequate for code that's going to be contributed to the community. I
will most likely emphasize those in my documentation.

Providing a method that would translate an arbitrary string into a
valid Python identifier would be helpful. It would be even more
helpful if it could provide a way of converting untranslatable
characters. However, I suspect that the translate (normalize?) routine
in the unicode module will do.

John Roth
Python FIT

Gregor Horvath · May 19, 2007

opposed. But dismissing the fact that Outlook and other quite common
tools may have severe problems with code seems naive (or disingenuous,
but I don't think that's the case here).

Of course there is broken software out there. There are even editors
that mix tabs and spaces ;-) Python did not introduce braces to solve
this problem but encouraged the use of appropriate tools. It seems to work
for 99% of us. Same here.
It is the 21st century. Tools that destroy Unicode byte streams are
seriously broken. Face it. You cannot halt progress because of some
broken software. Fix or drop it instead.

I do not think that this will be a big problem because only a very small
fraction of specialized local code will use Unicode identifiers anyway.

Unicode strings and comments are allowed today and I haven't heard of a
single case of strings destroyed by bad editors, although I
guess that Unicode strings in code are way more common than Unicode
identifiers would ever be.

Gregor

Javier Bezos · May 19, 2007

On the contrary, it cuts to the heart of the problem. There are
hundreds of tools out there that programmers use, and mailing lists
are certainly an incredibly valuable tool--introducing a change that
makes code more likely to be silently mangled seems like a negative.

In such a case, the Python indentation should be
rejected (it is quite interesting that you removed
the part of my post mentioning it). I can promise there
are Korean groups and there are no problems at
all in using Hangul (the Korean writing system).

Javier

Guest · May 19, 2007

Providing a method that would translate an arbitrary string into a
valid Python identifier would be helpful. It would be even more
helpful if it could provide a way of converting untranslatable
characters. However, I suspect that the translate (normalize?) routine
in the unicode module will do.

Not at all. Unicode normalization only unifies different "spellings"
of the same character.
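To illustrate the distinction, here is a small Python 3 sketch using the standard `unicodedata` module. The variable names are illustrative; the point is that normalization unifies two spellings of ü but does not produce ASCII.

```python
import unicodedata

composed = "Schl\u00fcssel"      # ü as a single code point
decomposed = "Schlu\u0308ssel"   # u followed by COMBINING DIAERESIS

# NFC maps both spellings to the same string.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Even compatibility normalization leaves the ü in place.
still_non_ascii = any(
    ord(ch) > 127 for ch in unicodedata.normalize("NFKC", composed)
)

print(same_after_nfc, still_non_ascii)
```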

For transliteration, no simple algorithm exists, as it generally depends
on the language. However, if you just want any kind of ASCII string,
you can use the Unicode error handlers (PEP 293). For example, the
program

    import unicodedata, codecs

    def namereplace(exc):
        # Replace each unencodable character with N_<UNICODE_NAME>_
        if isinstance(exc,
                      (UnicodeEncodeError, UnicodeTranslateError)):
            s = u""
            for c in exc.object[exc.start:exc.end]:
                s += "N_"+unicode(unicodedata.name(c).replace(" ","_"))+"_"
            # Resume encoding right after the characters that failed
            return (s, exc.end)
        else:
            raise TypeError("can't handle %s" % exc.__class__.__name__)

    codecs.register_error("namereplace", namereplace)

    print u"Schl\xfcssel".encode("ascii", "namereplace")

prints SchlN_LATIN_SMALL_LETTER_U_WITH_DIAERESIS_ssel.
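For readers on Python 3, the same idea might look like the sketch below. It is a port, not the code from the post: the handler name `name_to_ident`, the fallback for unnamed characters, and the replacement of hyphens are all assumptions added here. A different name is used because Python 3.5 and later ship a built-in handler called "namereplace" that emits `\N{...}` escapes instead.

```python
import codecs
import unicodedata

def name_to_ident(exc):
    """Replace each unencodable character with N_<UNICODE_NAME>_."""
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        pieces = []
        for ch in exc.object[exc.start:exc.end]:
            # Assumption: fall back to a hex code for characters without a name.
            name = unicodedata.name(ch, "U%04X" % ord(ch))
            # Assumption: hyphens are replaced too, since some names contain them.
            name = name.replace(" ", "_").replace("-", "_")
            pieces.append("N_" + name + "_")
        # Resume encoding right after the characters that failed.
        return ("".join(pieces), exc.end)
    raise TypeError("can't handle %s" % type(exc).__name__)

# "name_to_ident" is a hypothetical handler name chosen for this sketch.
codecs.register_error("name_to_ident", name_to_ident)

result = "Schl\xfcssel".encode("ascii", "name_to_ident").decode("ascii")
print(result)
```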

HTH,
Martin

Guest · May 19, 2007

But you're making a strawman argument by using extended ASCII

That would be invalid syntax since the third line is an assignment
with target identifiers separated only by spaces.

Plus, the identifier starts with a number (even though ６ is not DIGIT
SIX, but FULLWIDTH DIGIT SIX, it's still of category Nd, and can't
start an identifier).
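The category claim can be checked from Python 3 with the standard `unicodedata` module. This is only a sketch; the sample strings are illustrative.

```python
import unicodedata

fullwidth_six = "\uff16"

print(unicodedata.name(fullwidth_six))      # FULLWIDTH DIGIT SIX
print(unicodedata.category(fullwidth_six))  # Nd

# A decimal digit may continue an identifier but may not start one.
leading_digit = "６자회"
trailing_digit = "자회６"
print(leading_digit.isidentifier())
print(trailing_digit.isidentifier())
```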

Regards,
Martin

Richard Hanson · May 19, 2007

On Fri, 18 May 2007 06:28:03 +0200, Martin v. Löwis wrote:

[excellent as always exposition by Martin]

Thanks, Martin.

P.S. Anybody who wants to play with generating visualisations
of the PEP, here are the functions I used:

[code snippets]

Thanks for those functions, too -- I've been exploring with them and
am slowly coming to some understanding.

-- Richard Hanson

"To many native-English-speaking developers well versed in other
programming environments, Python is *already* a foreign language --
judging by the posts here in c.l.py over the years." ;-)
__________________________________________________

Guest · May 19, 2007

Martin said:
I've reported this before, but happily do it again: I have lived many
years without knowing what a "hub" is, and what "to pass" means if
it's not the opposite of "to fail". Yet, I have used their technical
meanings correctly all these years.

I was not speaking of the more general (non-technical) meanings, but of
the technical ones. The claim which I challenged was that people learn
just the "use" (syntax) but not the "meaning" (semantics) of these
terms. I think you are actually supporting my argument.

Guest · May 19, 2007

Martin said:
What specific tools should be discussed, and what specific problems
do you expect?

Systems that cannot display code parts correctly. I expect problems with
unreadable tracebacks, for example.
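One possible mitigation, sketched here in Python 3 as an assumption rather than anything the PEP prescribes: a tool that must write to an ASCII-only display can escape non-ASCII characters instead of failing or printing garbage. The helper name `ascii_safe` and the sample message are hypothetical.

```python
def ascii_safe(text: str) -> str:
    """Return text with every non-ASCII character shown as a backslash escape."""
    return text.encode("ascii", "backslashreplace").decode("ascii")

# Hypothetical error message containing a non-ASCII identifier.
message = "NameError: name 'Löffelstiel' is not defined"
print(ascii_safe(message))
```

The escaped form is not pretty, but it is unambiguous and survives any channel that can carry ASCII.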

Also: Are existing tools that somehow process Python source code, e.g. to
test whether it meets certain criteria (pylint & co) or to aid in
creating documentation (epydoc & co), fully Unicode-ready?

Peter Maas · May 19, 2007

Martin said:
Python code is written by many people in the world who are not familiar
with the English language, or even well-acquainted with the Latin
writing system.

I believe that there is not a single programmer in the world who doesn't
know ASCII. It isn't hard to learn the Latin alphabet, and you have to know
it anyway to use the keywords and the other ASCII characters to write numbers,
punctuation etc. Most non-Western alphabets have ASCII transcription rules
and contain ASCII as a subset. On the other hand, non-ASCII identifiers
lead to fragmentation and less understanding in the programming world, so I
don't like them. I also don't like non-ASCII domain names, where the same
arguments apply.

Let the data be expressed with Unicode but the logic with ASCII.

Istvan Albert · May 20, 2007

Plus, the identifier starts with a number (even though ６ is not DIGIT
SIX, but FULLWIDTH DIGIT SIX, it's still of category Nd, and can't
start an identifier).

Actually both of these issues point to the real problem with this PEP.

I knew about them (note that the colon is also missing), but alas I
couldn't fix them.
My editor could not remove a space or add a colon anymore; it
would immediately change the rest of the characters to something
crazy.

(Of course now someone might feel compelled to state that this is an
editor problem, but I digress. The reality is that features need to
adapt to reality; moreover, had I used a different editor I'd still be
unable to write these characters.)

i.

Christophe Cavalaria · May 20, 2007

Istvan said:
Actually both of these issues point to the real problem with this PEP.

I knew about them (note that the colon is also missing), but alas I
couldn't fix them.
My editor could not remove a space or add a colon anymore; it
would immediately change the rest of the characters to something
crazy.

(Of course now someone might feel compelled to state that this is an
editor problem, but I digress. The reality is that features need to
adapt to reality; moreover, had I used a different editor I'd still be
unable to write these characters.)

The reality is that the few users who care about having Chinese in their
code *will* be using an editor that supports it.

rurpy · May 21, 2007

What do you mean, "check for"? If, say, numeric starts using math
characters (as has been suggested), I'm not exactly going to stop
using numeric. It'll still be a lot better than nothing, just
slightly less better than it used to be.

The PEP explicitly states that no non-ASCII identifiers
will be permitted in the standard library. The opinions
expressed here seem almost unanimous that non-ASCII
identifiers are a bad idea in any sort of shared public
code. Why do you think the occurrence of non-ASCII
identifiers in Numpy is likely?
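A project that wants to enforce an ASCII-only policy for shared code could check for it mechanically. The following Python 3 sketch uses the standard `tokenize` module; the function name `non_ascii_names` and the sample source are hypothetical.

```python
import io
import tokenize

def non_ascii_names(source: str):
    """Return (line, column, name) for each identifier with non-ASCII characters."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Only NAME tokens count; strings and comments may hold any text.
        if tok.type == tokenize.NAME and any(ord(ch) > 127 for ch in tok.string):
            found.append((tok.start[0], tok.start[1], tok.string))
    return found

# Hypothetical sample: two uses of a non-ASCII name and one non-ASCII string.
sample = "löffel = 1\nspoon = löffel + 1\ntext = 'Schlüssel'\n"
hits = non_ascii_names(sample)
for line, column, name in hits:
    print(line, column, name)
```

The string literal on the third line is not reported, which matches the distinction drawn in this thread between Unicode data and Unicode identifiers.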

Sure. But when you're talking about maintaining code, there's a very
high value to having all the existing tools work with it whether
they're wide-character aware or not.

I agree. On Windows I often use Notepad to edit
Python files. (There goes my credibility!)

So I don't like tab-only indent proposals that assume
I can set tabs to be an arbitrary number of spaces.
But tab-only indentation would affect every python
program and every python programmer.

In the case of non-ASCII identifiers, the potential
gains are so big for non-English speakers, and (IMO)
the difficulty of working with non-ASCII identifiers
times the probability of having to work with them
so low, that the former clearly outweighs the latter.

Yes. But it's not like this makes things so horribly awful that it's
worth my time to reimplement large external libraries. I remain at -0
on the proposal; it'll cause some headaches for the majority of
current Python programmers, but it may have some benefits to a
sizeable minority

This is the crux of the matter, I think: the fear that
non-ASCII identifiers will spread like a virus, infecting
program after program until every piece of Python code
is nothing but a mass of writhing, unintelligible
non-ASCII characters. (OK, maybe I am overstating a little.)

I (and I think other proponents) don't think this is
likely to happen, and the benefits to non-English
speakers of being able to write maintainable code far
outweigh the very rare case when it does occur.
