S
Steven D'Aprano
I don't
want to be in a situation where I need to mechanically "clean"
code (say, from a submitted patch) with a tool because I can't
reliably verify it by eye.
But you can't reliably verify by eye. That's orders of magnitude more
difficult than debugging by eye, and we all know that you can't reliably
debug anything but the most trivial programs by eye.
If you're relying on cursory visual inspection to recognize harmful code,
you're already vulnerable to trojans.
We should learn from the plethora of
Unicode-related security problems that have cropped up in the last
few years.
Of course we should. And one of the things we should learn is when and
how Unicode is a risk, and not imagine that Unicode is some sort of
mystical contamination that creates security problems just by being used.
- Non-ASCII identifiers would be a barrier to code exchange. If I
know
Python I should be able to easily read any piece of code written
in it, regardless of the linguistic origin of the author. If PEP
3131 is accepted, this will no longer be the case.
But it isn't the case now, so that's no different. Code exchange
regardless of human language is a nice principle, but it doesn't work in
practice. How do you use "any piece of code ... regardless of the
linguistic origin of the author" when you don't know what the functions
and classes and arguments _mean_?
Here's a tiny doc string from one of the functions in the standard
library, translated (more or less) to Portuguese. If you can't read
Portuguese at least well enough to get by, how could you possibly use
this function? What would you use it for? What does it do? What arguments
does it take?
def dirsorteinsercao(a, x, baixo=0, elevado=None):
"""da o artigo x insercao na lista a, e mantem-na a
supondo classificado e classificado. Se x estiver ja em a,
introduza-o a direita do x direita mais. Os args opcionais
baixos (defeito 0) e elevados (len(a) do defeito) limitam
a fatia de a a ser procurarado.
"""
# not a non-ASCII character in sight (unless I missed one...)
[Apologies to Portuguese speakers for the dogs-breakfast I'm sure Babel-
fish and I made of the translation.]
The particular function I chose is probably small enough and obvious
enough that you could work out what it does just by following the
algorithm. You might even be able to guess what it is, because Portuguese
is similar enough to other Latin languages that most people can guess
what some of the words might mean (elevados could be height, maybe?). Now
multiply this difficulty by a thousand for a non-trivial module with
multiple classes and dozens of methods and functions. And you might not
even know what language it is in.
No, code exchange regardless of natural language is a nice principle, but
it doesn't exist except in very special circumstances.
A Python
project that uses Urdu identifiers throughout is just as useless
to me, from a code-exchange point of view, as one written in Perl.
That's because you can't read it, not because it uses Unicode. It could
be written entirely in ASCII, and still be unreadable and impossible to
understand.
- Unicode is harder to work with than ASCII in ways that are more
important
in code than in human-language text. Humans eyes don't care if two
visually indistinguishable characters are used interchangeably.
Interpreters do. There is no doubt that people will accidentally
introduce mistakes into their code because of this.
That's no different from typos in ASCII. There's no doubt that we'll give
the same answer we've always given for this problem: unit tests, pylint
and pychecker.