PEP 3131: Supporting Non-ASCII Identifiers

gatti

That is a deliberate part of the specification. It is intentional that
it does *not* specify a precise list, but instead defers that list
to the version of the Unicode standard used (in the unicodedata
module).

Ok, maybe you considered listing characters but you earnestly decided
to follow an authority; but this reliance on the Unicode standard is
not a merit: it defers to an external entity (UAX 31 and the Unicode
database) a foundation of Python syntax.
The obvious purpose of Unicode Annex 31 is to define a framework for
parsing the identifiers of arbitrary programming languages; it is only,
in its own words, "specifications for recommended defaults for the use
of Unicode in the definitions of identifiers and in pattern-based
syntax". It suggests an orderly way to add tens of thousands of exotic
characters to programming language grammars, but it doesn't prove it
would be wise to do so.
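For reference, the deferral being discussed is directly observable in Python 3, where PEP 3131 was eventually implemented: identifier validity comes from the Unicode property data shipped with the interpreter (exposed through the unicodedata module), not from an explicit character list in the grammar. A minimal sketch:

```python
import unicodedata

# In Python 3, identifier validity is decided by the Unicode properties
# bundled with the interpreter, not by an explicit list in the grammar.
print(unicodedata.unidata_version)   # the deferred-to Unicode version

# A Greek or accented letter is a valid identifier...
assert "π".isidentifier()
assert "Größe".isidentifier()

# ...while punctuation and symbols are not.
assert not "x-y".isidentifier()
assert not "€".isidentifier()
```

Which characters pass therefore changes silently when the interpreter's Unicode database is updated, which is exactly the dependence being criticised here.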

You seem to like Unicode Annex 31, but keep in mind that:
- it has very limited resources (only the Unicode standard, i.e. lists
and properties of characters, and not sensible programming language
design, software design, etc.)
- it is culturally biased in favour of supporting as much of the
Unicode character set as possible, disregarding the practical
consequences and assuming without discussion that programming language
designers want to do so
- it is also culturally biased towards the typical Unicode patterns of
providing well-explained general algorithms, ensuring forward
compatibility, and relying on existing Unicode standards (in this
case, character types) rather than introducing new data (though the
character list of Table 3 is unavoidable); the net result is even
less care for actual usage.
And, indeed, this is now recognized as one of the bigger mistakes
of the XML recommendation: they provided an explicit list and failed
to consider characters that are unassigned. XML 1.1 tries to
address this issue by allowing unassigned characters in
XML names even though it is not yet certain what those characters
mean (until they are assigned).

XML 1.1 is, for practical purposes, not used except by mistake. I
challenge you to show me XML languages or documents of some importance
that need XML 1.1 because they use non-ASCII names.
XML 1.1 is supported by many tools and standards because of buzzword
compliance, enthusiastic obedience to the W3C, and the low cost of
implementation, but this doesn't mean that its features are an
improvement over XML 1.0.
Probably. Nobody in the Unicode Consortium noticed, but what
do they know about the suitability of Unicode characters...

Don't be silly. These characters are suitable for writing text, not
for use in identifiers; the fact that UAX 31 allows them merely proves
how disconnected from actual programming language needs that document
is.

In typical word processing, what characters are used is the editor's
problem and the only thing that matters is the correctness of the
printed result; program code is much more demanding, as it needs to do
more (exact comparisons, easy reading...) with less (straightforward
keyboard inputs and monospaced fonts instead of complex input systems
and WYSIWYG graphical text). The only way to work with program text
successfully is limiting its complexity.
Characters that are hard to input, characters that are hard to see,
ambiguity or uncertainty in the sequence of characters, sets of
hard-to-distinguish glyphs, and similar problems are unacceptable.
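The hard-to-distinguish-glyphs problem is easy to demonstrate in Python 3 (where PEP 3131 was eventually adopted): several scripts contain capital letters that render identically in most fonts, yet every one of them is a distinct, valid identifier.

```python
# Three visually near-identical capital letters, all valid identifier
# starts, yet three distinct code points -- and thus three distinct names.
latin    = "A"        # U+0041 LATIN CAPITAL LETTER A
greek    = "\u0391"   # U+0391 GREEK CAPITAL LETTER ALPHA
cyrillic = "\u0410"   # U+0410 CYRILLIC CAPITAL LETTER A

assert latin != greek != cyrillic
assert all(c.isidentifier() for c in (latin, greek, cyrillic))

# As dictionary keys (or variable names) they silently coexist:
namespace = {latin: 1, greek: 2, cyrillic: 3}
assert len(namespace) == 3
```

A typo that swaps one of these for another produces a NameError at best and a silently wrong program at worst, with nothing visible on screen to explain it.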

It seems I'm not the first to notice that a lot of Unicode characters
are unsuitable for identifiers. Appendix I of the XML 1.1 standard
recommends avoiding variation selectors, interlinear annotations (I
missed them...), various decomposable characters, and "names which are
nonsensical, unpronounceable, hard to read, or easily confusable with
other names".
The whole of Appendix I is a clear admission of self-defeat, probably
the result of committee compromises. Do you think you could do better?
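On the "various decomposable characters" point: PEP 3131 does address this particular slice of the problem by specifying that identifiers are converted to normal form NFKC while parsing. A small sketch of what that buys:

```python
import unicodedata

# Two spellings of "é": precomposed vs. base letter + combining accent.
composed   = "\u00e9"        # é as a single code point
decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT

# The code point sequences differ, but NFKC unifies them:
assert composed != decomposed
assert unicodedata.normalize("NFKC", decomposed) == composed
```

Since PEP 3131 specifies NFKC normalization of identifiers, both spellings name the same variable; this does nothing, however, for cross-script confusables, which survive normalization unchanged.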

Regards,
Lorenzo Gatti
 
Christophe

(e-mail address removed) wrote:
Just as one risk here:
When reading the above on Google groups, it showed up as "if one could
write ?(u*p)..."
When quoting it for response, it showed up as "could write D(u*p)".

I'm sure that the symbol you used was neither a capital letter d nor a
question mark.

Using identifiers that are so prone to corruption when posting in a
rather popular forum seems dangerous to me--and I'd guess that a lot
of source code highlighters, email lists, etc have similar problems.
I'd even be surprised if some programming tools didn't have similar
problems.

So, it was Google Groups that continuously corrupted the good UTF-8
posts by force-converting them to ISO-8859-1?
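The kind of corruption described above is easy to reproduce. The original symbol in the quoted post is unknown, so "π" is used here as a stand-in; it exhibits both failure modes (replacement with "?" and multi-character mojibake):

```python
# A stand-in for the corrupted symbol (the original character in the
# quoted post is unknown); "π" shows both failure modes described above.
symbol = "π"                      # U+03C0 GREEK SMALL LETTER PI

# Forcing the text into Latin-1 loses the character entirely:
assert symbol.encode("latin-1", errors="replace") == b"?"

# Decoding the UTF-8 bytes as if they were Windows-1252 instead
# produces two junk characters (classic mojibake):
assert symbol.encode("utf-8").decode("cp1252") == "Ï€"
```

Any hop in the delivery chain (news gateway, mail client, web frontend) that guesses the wrong encoding is enough to trigger either outcome.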

Of course, there's also the possibility that it is a problem on *your*
side, so, to be fair, I launched Google Groups and looked for this
thread. And of course the result was that Steven's post displayed
perfectly. I didn't try to reply to it, of course; no need to clutter
that thread any more than it already is.
 
Eric Brunel

hello

I work for a large phone maker, and for a long time
we thought, very arrogantly, that our phones would be
fine for the whole world.

After all, using a phone involves so few words, and
some of them were even replaced with pictograms!
Everybody should be able to understand appel, bis,
renvoi, mévo, ...

Nowadays we make Chinese, Korean and Japanese talking
phones,

because we can do it, because graphics are cheaper
than they were, because it expands our market
(also because some markets require it).

see the analogy?

Absolutely not: you're talking about internationalization of the
user-interface here, not about the code. There are quite simple ways to
ensure users will see the displays in their own language, even if the
source code is the same for everyone. But your source code will not
automagically translate itself to the language of the guy who'll have to
maintain it or make it evolve. So the analogy actually seems to work
backwards: if you want any coder to be able to read/understand/edit your
code, just don't write it in your own language...
 
Guest

Stefan said:
Then get tools that match your working environment.

Integration with existing tools *is* something that a PEP should
consider. This one does not do that sufficiently, IMO.
 
Stefan Behnel

I even sometimes
read code snippets on email lists and websites from my handheld, which
is sadly still memory-limited enough that I'm really unlikely to
install anything approaching a full set of Unicode fonts.

One of the arguments against this PEP was that it seemed to be impossible to
find either transliterated identifiers in code or native identifiers in Java
code using a web search. So it is very unlikely that you will need to upgrade
your handheld, as it is very unlikely that you will stumble into such code.

Stefan
 
Stefan Behnel

René Fleschenberg said:
No, that does not follow from my logic. What I say is: when thinking
about whether to add a new feature, the potential benefits should be
weighed against the potential problems. I see some potential problems
with this PEP and very few potential benefits.


*That* logic can be used to justify the introduction of *any* feature.

*Your* logic can be used to justify dropping *any* feature.

Stefan
 
Guest

Stefan said:
Well, as I said before, there are three major differences between the stdlib
and keywords on one hand and identifiers on the other hand. Ignoring arguments
does not make them any less true.

BTW: Please stop replying to my postings by e-mail (in Thunderbird, use
"Reply" instead of "Reply to all").

I agree that keywords are a different matter in many respects, but the
only difference between stdlib interfaces and other interfaces is that
the stdlib interfaces are part of the stdlib. That's it. You are still
ignoring the fact that, contrary to what has been suggested in this
thread, it is _not_ possible to write "German" or "Chinese" Python
without cluttering it up with many, many English terms. It's not only the
stdlib, but also many, many third-party libraries. Show me one real
Python program that can feasibly be written without throwing in tons of
English terms.

Now, very special environments (what I called "rare and isolated"
earlier) like special learning environments for children are a different
matter. It should be ok if you have to use a specially patched Python
branch there, or have to use an interpreter option that enables the
suggested behaviour. For general programming, it IMO is a bad idea.
 
Guest

Marc said:
There are potential users of Python who don't know much english or no
english at all. This includes kids, old people, people from countries
that have "letters" that are not that easy to transliterate like european
languages, people who just want to learn Python for fun or to customize
their applications like office suites or GIS software with a Python
scripting option.

Make it an interpreter option that can be turned on for those cases.
 
Stefan Behnel

Eric said:
reason why non-ASCII identifiers should be supported. I just wish I'll
get a '--ascii-only' switch on my Python interpreter (or any other means
to forbid non-ASCII identifiers and/or strings and/or comments).

I could certainly live with that as it would be the right way around. Support
Unicode by default, but allow those who require the lowest common denominator
to enforce it.

Stefan
 
Guest

Stefan said:
*Your* logic can be used to justify dropping *any* feature.

No. I am considering both the benefits and the problems. You just happen
to not like the outcome of my considerations [again, please don't reply
by E-Mail, I read the NG].
 
Eric Brunel

Maybe you should find out then? Personal ignorance is never an excuse for
rejecting technology.

My "personal ignorance" is fine, thank you; how is yours? There is no
keyboard *on Earth* that allows typing *all* characters in the whole
Unicode set. So my keyboard may just happen to provide no means at all
to type a Greek 'pi', as it provides none to type Chinese, Japanese,
Korean, Russian, Hebrew, or whatever character set is not in use in my
country. And the same goes for every keyboard in the world.

Have I made my point clear or do you require some more explanations?
 
Stefan Behnel

René Fleschenberg said:
I agree that keywords are a different matter in many respects, but the
only difference between stdlib interfaces and other interfaces is that
the stdlib interfaces are part of the stdlib. That's it. You are still
ignoring the fact that, contrary to what has been suggested in this
thread, it is _not_ possible to write "German" or "Chinese" Python
without cluttering it up with many, many English terms. It's not only the
stdlib, but also many, many third-party libraries. Show me one real
Python program that can feasibly be written without throwing in tons of
English terms.

Now, very special environments (what I called "rare and isolated"
earlier) like special learning environments for children are a different
matter. It should be ok if you have to use a specially patched Python
branch there, or have to use an interpreter option that enables the
suggested behaviour. For general programming, it IMO is a bad idea.

Ok, let me put it differently.

You *do not* design Python's keywords. You *do not* design the stdlib. You *do
not* design the concepts behind all that. You *use* them as they are. So you
can simply take the identifiers they define and use them the way the docs say.
You do not have to understand these names, they don't have to be words, they
don't have to mean anything to you. They are just tools. Even if you do not
understand English, they will not get in your way. You just learn them.

But you *do* design your own software. You *do* design its concepts. You *do*
design its APIs. You *do* choose its identifiers. And you want them to be
clear and telling. You want them to match your (or your clients') view of the
application. You do not care about the naming of the tools you use inside. But
you do care about clarity and readability in *your own software*.

See the little difference here?

Stefan
 
Stefan Behnel

René Fleschenberg said:
Make it an interpreter option that can be turned on for those cases.

No. Make "ASCII-only" an interpreter option that can be turned on for the
cases where it is really required.

Stefan
 
Ben

The main problem here seems to be proving the need for something to people who
do not need it themselves. So, if a simple "but I need it because a, b, c" is
not enough, what good is any further proof?

Stefan

For what it's worth, I can only speak English (bad English schooling!)
and I'm definitely +1 on the PEP. Anyone using tools from the last 5
years can handle UTF-8.

Cheers,
Ben
 
Gregor Horvath

René Fleschenberg said:
*That* logic can be used to justify the introduction of *any* feature.

No. That logic can only be used to justify the introduction of a feature
that brings freedom.

Who are we to dictate to the whole Python world how to spell an identifier?

Gregor
 
Hendrik van Rooyen


[I fixed the broken attribution in your quote]
Sorry about that - I deliberately fudge email addys...

First, "while" is a keyword and will remain "while", so
that has nothing to do with anything.

I think this cuts right down to why I oppose the PEP.
It is not so much for technical reasons as for aesthetic
ones - I find reading a mix of languages horrible, and I am
kind of surprised by the strength of my own reaction.

If I try to analyse my feelings, I think that really the PEP
does not go far enough, in a sense, and from memory
it seems to me that only E Brunel, R Fleschenberg and
to a lesser extent the Martellibot seem to somehow think
in a similar way as I do, but I seem to have an extreme
case of the disease...

And the summaries of reasons for and against have left
out objections based on this feeling of ugliness of mixed
language.

Interestingly, the people who seem to think a bit like that all
seem to be non native English speakers who are fluent in
English.

The support, meanwhile, seems to come from people whose English
is perfectly adequate but who are unsure enough to
apologise for their "bad" English.

Is this a pattern that you have identified? - I don't know.

I still don't like the thought of the horrible mix of "foreign"
identifiers and English keywords, coupled with the English
sentence construction. And that, in a nutshell, is the main
reason for my rather vehement opposition to this PEP.

The other stuff about sharing and my inability to even type
the OP's name correctly with the umlaut is kind of secondary
to this feeling of revulsion.

"Beautiful is better than ugly"

- Hendrik
 
sjdevnull

Ben said:
For what it's worth, I can only speak English (bad English schooling!)
and I'm definitely +1 on the PEP. Anyone using tools from the last 5
years can handle UTF-8

The falsehood of the last sentence is why I'm moderately against this
PEP. Even examples within this thread don't display correctly on
several of the machines I have access to (all of which are OS/browser
environments less than five years old). It strikes me as similar to the
arguments for quoted-printable in the early 1990s, claiming that
everyone could view it or would be able to soon--and even a decade
_after_ "everyone can deal with latin1 just fine" it was still causing
massive headaches.
 
sjdevnull

Christophe said:
(e-mail address removed) wrote:

So, it was google groups that continuously corrupted the good UTF-8
posts by force converting them to ISO-8859-1?

Of course, there's also the possibility that it is a problem on *your*
side

Well, that's part of the point, isn't it? It seems incredibly naive to
me to think that you could use whatever symbol was intended and have
it show up, and the "well, fix your machine!" argument doesn't fly. A
lot of the time, programmers have to look at stack traces on end-users'
machines (whatever they may be) to help debug. They have to look at
code on (GUI-less) production servers over a terminal link. They
have to use all kinds of environments where they can't install the
latest and greatest fonts. Promoting code that becomes very hard to
read and debug in real situations seems like a solid negative to me.
 
sjdevnull

Stefan said:
One of the arguments against this PEP was that it seemed to be impossible to
find either transliterated identifiers in code or native identifiers in Java
code using a web search. So it is very unlikely that you will need to upgrade
your handheld as it is very unlikely for you to stumble into such code.

Sure, if the feature isn't going to be used then it won't present
problems. I can't really see much of an argument for a PEP that isn't
going to be used, though, and if it is used then it's worthwhile to
think about the implications of having code that many common systems
simply can't deal with (either displaying it incorrectly or actually
corrupting files that pass through them).
 
Neil Hodgson

Lorenzo Gatti:
Ok, maybe you considered listing characters but you earnestly decided
to follow an authority; but this reliance on the Unicode standard is
not a merit: it defers to an external entity (UAX 31 and the Unicode
database) a foundation of Python syntax.

PEP 3131 uses a definition similar to C#'s, except that PEP 3131
disallows formatting characters (category Cf). See section 9.4.2 of
http://www.ecma-international.org/publications/standards/Ecma-334.htm
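The Cf difference is easy to observe in Python 3, where PEP 3131 was ultimately implemented. SOFT HYPHEN is a typical formatting character:

```python
import unicodedata

# SOFT HYPHEN has General_Category "Cf" (format), which PEP 3131
# excludes from identifiers; per the C# definition cited above,
# such characters are permitted there.
soft_hyphen = "\u00ad"

assert unicodedata.category(soft_hyphen) == "Cf"

# Python rejects it even in identifier-continue position...
assert not ("a" + soft_hyphen).isidentifier()

# ...while an ordinary accented letter is fine:
assert "Ä".isidentifier()
```

Excluding Cf avoids identifiers that differ only by invisible formatting characters, at the cost of diverging from the C# profile.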

Neil
 
