Using non-ascii symbols

  • Thread starter Christoph Zwerschke

Christoph Zwerschke

UTF-8 has also been the standard encoding of SuSE Linux since version 9.1.
Both VIM and EMACS provide ways to enter Unicode. VIM even supports
digraph input, which would be particularly useful in this case.

-- Christoph
 

Christoph Zwerschke

Claudio said:
No symbol comes to my mind, but I would be glad if it expressed that 'a'
becomes a reference to the Python object currently referred to by the
identifier 'b' (maybe some kind of <-> ?).

With Unicode, you have a lot of possibilities for expressing this:

a ← b # a = b
a ⇐ b # a = copy(b)
a ⇚ b # a = deepcopy(b)
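
For reference, the three arrow forms correspond to spellings that already
exist; a minimal sketch using only the standard copy module (the arrow
comments merely restate the proposal):

import copy

b = {"nested": [1, 2, 3]}

a = b                  # proposed a ← b : rebind a to the same object
a = copy.copy(b)       # proposed a ⇐ b : shallow copy (nested objects shared)
a = copy.deepcopy(b)   # proposed a ⇚ b : deep copy (nothing shared with b)

print(a == b, a is b)  # True False after the deepcopy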

-- Christoph
 

Christoph Zwerschke

Dave said:
C uses ! as a unary logical "not" operator, so != for "not equal" just
seems to follow, um, logically.

Consequently, C should have used !> for <= and !< for >= ...

-- Christoph
 

Christoph Zwerschke

Dave said:
The latter, IMHO. Especially variable names. Consider i vs. ì vs. í
vs. î vs. ï vs. ...

There could be conventions discouraging the use of ambiguous symbols.
Even today, you wouldn't use a lowercase "l" or an uppercase "O" because
they can be confused with the digits 1 and 0. But you're right that this
problem would become much greater with Unicode characters. This kind of
pitfall was already overlooked in the introduction of internationalized
domain names, which are exploitable for phishing attacks...

-- Christoph
 

Claudio Grondi

Christoph said:
With Unicode, you have a lot of possibilities for expressing this:

a ← b # a = b
a ⇐ b # a = copy(b)
a ⇚ b # a = deepcopy(b)

Given the above, the notation

a ← b # a = b

also starts to look obvious to me, as it covers some of the specifics of
Python.

Nice idea.

Claudio
 

Christoph Zwerschke

Fredrik said:
umm. if you have an editor that can convert things back and forth, you
don't really need language support for "digraphs"...

It would just be very impractical to convert back and forth every time
you want to run a program. Python also supports tabs AND spaces, even
though you can easily convert one to the other.

But indeed, in 100 years or so ;-) if people get accustomed to using
these symbols and input becomes easy, digraph support could become
optional and then be phased out... just as is now happening with C
trigraphs.

-- Christoph
 

Dave Hansen

Christoph said:
Consequently, C should have used !> for <= and !< for >= ...

Well, actually, no.

"Less (than) or equal" is <=. "Greater (than) or equal" is >=. "Not
equal" is !=.

If you want to write code for the IOCCC, you could use !(a>b) instead
of a<=b...

Regards,
-=Dave
 

Steven D'Aprano

Dave said:
The latter, IMHO. Especially variable names. Consider i vs. ì vs. í
vs. î vs. ï vs. ...

Agreed, but that's the programmer's fault for choosing stupid variable
names. (One character names are almost always a bad idea. Names which can
be easily misread are always a bad idea.) Consider how easy it is to
shoot yourself in the foot with plain ASCII:


l1 = 0
l2 = 4
....
pages of code
....
assert 11 + l2 = 4
 

Dave Hansen

Steven said:
Agreed, but that's the programmer's fault for choosing stupid variable
names. (One character names are almost always a bad idea. Names which can
be easily misread are always a bad idea.) Consider how easy it is to

I wasn't necessarily expecting single-character names. Indeed, the
difference between i and ì is easier to see than the difference
between, say, long_variable_name and long_varìable_name. For me,
anyway.
shoot yourself in the foot with plain ASCII:


l1 = 0
l2 = 4
...
pages of code
...
assert 11 + l2 = 4

You've shot yourself twice, there. Python would tell you about the
second error, though.

Regards,
-=Dave
 

Steven D'Aprano

Dave said:
I wasn't necessarily expecting single-character names. Indeed, the
difference between i and ì is easier to see than the difference
between, say, long_variable_name and long_varìable_name. For me,
anyway.

Sure. But that's no worse than pxfoobrtnamer and pxfoobtrnamer.

I'm not saying that adding more characters to the mix won't increase the
opportunity to pick bad names. But this isn't a new problem, it is an old
problem.


Dave said:
You've shot yourself twice, there.

Deliberately so. The question is, in real code without the assert, should
the result of the addition be 4, 12, 15 or 23?
 

James Stroud

Robert said:
Get a better keyboard? or OS?

Please talk to my boss. Tell him I want a Quad G5 with about 2 GB of RAM.
I'll buy the keyboard myself, no problemo.
On OS X,

≤ is Alt-,
≥ is Alt-.
≠ is Alt-=

Fewer keystrokes than <= or >= or !=.

James
 

Bengt Richter

On the page http://wiki.python.org/moin/Python3.0Suggestions
I noticed an interesting suggestion:

"These operators ≤ ≥ ≠ should be added to the language having the
following meaning:

<= >= !=

this should improve readability (and make the language more accessible to
beginners).

This should be an evolution similar to the digraphs and trigraphs from the
C and C++ languages."

How do people on this group feel about this suggestion?

The symbols above are not even Latin-1; you need UTF-8.
Maybe we need a Python unisource type which is abstract like unicode,
and through encoding can be rendered in various ways. Of course it would
have an internal representation in some encoding, probably utf-16le, but
glyphs for operators and such would be normalized, and could then be
rendered as multi-character sequences or as special characters, however
desired. This means that unisource would not just be the result of
decoding a character encoding like Latin-1; it would be the result of
decoding source in a Python-syntax-sensitive way, differentiating between
<= as a relational operator and '<=' in a string literal or comment, etc.
(There are not many useful symbols in Latin-1. Maybe one could use ×
for Cartesian products...)
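
A minimal sketch of the translation idea (an illustration only, not a
worked-out design from this thread): map the Unicode operators back to
their ASCII digraphs before compiling. A real unisource decoder would
have to be syntax-aware and leave string literals and comments alone;
this naive version translates everywhere.

# Naive sketch: translate the proposed Unicode operators to ASCII
# digraphs and hand the result to the ordinary compiler.
UNICODE_TO_ASCII = {
    "\u2264": "<=",   # ≤
    "\u2265": ">=",   # ≥
    "\u2260": "!=",   # ≠
}

def to_ascii_source(text):
    for uni, ascii_op in UNICODE_TO_ASCII.items():
        text = text.replace(uni, ascii_op)
    return text

source = "if a \u2260 b and a \u2264 10:\n    print(a)\n"
exec(compile(to_ascii_source(source), "<unisource>", "exec"), {"a": 3, "b": 5})
# prints 3, because 3 != 5 and 3 <= 10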

And while they are more readable, they are not easier to type (at
least with most current editors).

Is this idea absurd or will one day our children think that restricting
to 7-bit ascii was absurd?
I think it's important to have readable ASCII representations available
for programming elements, at least.
Are there similar attempts in other languages? I can only think of APL,
but that was a long time ago.

Once you open your mind to using non-ascii symbols, I'm sure one can
find a bunch of useful applications. Variable names could be allowed to
be non-ascii, as in XML. Think class names in Arabic... Or you could
use Greek letters if you run out of one-letter variable names, just as
mathematicians do. Would this be desirable or rather a horror scenario?
Opinions?
I think there are pros and cons. What if the "href" in HTML could be spelled in
any characters? I.e., some things are part of a standard encoding and representation
system. Some of Python is like that. "True" should not be spelled "Vrai" or "Sant",
except in localized messages, IMO, unless perhaps there is a unisource type that
normalizes these things too, and can render in localized formats. ... I guess China
is a pretty big market, so I wonder what they will do.

Someone has to get really excited about it, and have the expertise or willingness
to slog their way to expertise, and the persistence to get something done. And all
that in the face of the fact that much of the problem will be engineering consensus,
not engineering technical solutions. So are you excited? Good luck ;-)

Probably the best anyone with any excitement to spare could do is ask Martin
what he could use help with, if anything. He'd probably not like muddying any
existing clear visions and plans with impractical ramblings though ;-)

Regards,
Bengt Richter
 

Ido Yehieli

I still remember it not being supported on most or all of the big-iron
servers at my previous uni (mostly SunOS and Digital UNIX, among others).
 

Peter Hansen

Dave said:
C uses ! as a unary logical "not" operator, so != for "not equal" just
seems to follow, um, logically.

Pascal used <>, which intuitively (to me, anyway ;-) read "less than
or greater than," i.e., "not equal."

For quantitative data, anyway, or things which can be ordered consistently.

It's unclear to me how well this concept maps to other sorts of data.
Complex numbers, for example.

I think "not equal", at least the way our brains handle it in general,
is not equivalent to "less than or greater than".

That is, I think the concept "not equal" is less than or greater than
the concept "less than or greater than". <wink>

-Peter
 

Steven D'Aprano

I think "not equal", at least the way our brains handle it in general,
is not equivalent to "less than or greater than".

That is, I think the concept "not equal" is less than or greater than
the concept "less than or greater than". <wink>

For objects that don't have total ordering, "not equal" != is not the
same as "less than or greater than" <>.

The two obvious examples are complex numbers, where C1 != C2 can be
evaluated, but C1 <> C2 is not defined, and NaNs, where NaN != NaN is
always true but NaN <> NaN is undefined.
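
To make the distinction concrete, a quick check (an illustration in
current Python, where ordering comparisons are spelled with < and >; it
is not code from this thread):

c1, c2 = complex(1, 2), complex(1, 3)
print(c1 != c2)      # True: inequality is defined for complex numbers
try:
    c1 < c2          # but ordering is not; Python raises TypeError
except TypeError as exc:
    print("no ordering:", exc)

nan = float("nan")
print(nan != nan)              # True: a NaN compares unequal even to itself
print(nan < nan or nan > nan)  # False: "less than or greater than" fails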
 

Terry Hancock

On the page
http://wiki.python.org/moin/Python3.0Suggestions
I noticed an interesting suggestion:

"These operators ≤ ≥ ≠ should be added to the
language having the following meaning:

<= >= !=

this should improve readability (and make the language more
accessible to beginners).

This should be an evolution similar to the digraphs and
trigraphs from the C and C++ languages."

How do people on this group feel about this suggestion?

In principle, and in the long run, I am definitely for it.

Pragmatically, though, there are still a lot of places
where it would cause me pain. For example, it exposes
problems even in reading this thread in my mail client
(which is ironic, considering that it manages to correctly
render Russian and Japanese spam messages. Grrr.).

OTOH, there will *always* be backwards systems, so you
can't wait forever to move to using newer features.
The symbols above are not even Latin-1; you need UTF-8.
And while they are more readable, they are not easier to type
(at least with most current editors).

They're not that bad. I manage to get kana and kanji working
correctly when I really need them.
Are there similar attempts in other languages? I can only
think of APL, but that was a long time ago.

I'm pretty sure that there are. The idea of adding UTF-8 for
use in identifiers and such has been around for a while for
Python. I'm pretty sure you can do this already in Java,
can't you? (I think I read this somewhere, but I don't
think it gets used much).
Once you open your mind to using non-ascii symbols, I'm
sure one can find a bunch of useful applications.
Variable names could be allowed to be non-ascii, as in
XML. Think class names in Arabic... Or you could use
Greek letters if you run out of one-letter variable names,
just as mathematicians do. Would this be desirable or
rather a horror scenario? Opinions?

Greek letters would be a real relief in writing scientific
software. There's something deeply annoying about variables
named THETA, theta, and Theta. Or "w" meaning "omega".

People coming from other programming backgrounds may object
that these uses are less informative. But in the sciences,
some of these symbols have as much recognizability as "+" or
"$" do to other people. Reading math notation from a
scientist, I can be pretty darned certain that "c" is "the
speed of light" or that "epsilon" is a small, allowable
variation in a variable. And so on. It's true that there are
occasional problems when problem domains merge, but that's
true of words, too.

It would also reduce the difficulty of going back and forth
between the paper describing the math, and the program
using it.
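
For reference: non-ASCII identifiers did eventually land in Python 3
via PEP 3131, so Greek-letter names like the following are legal there
(a minimal sketch, not part of the original discussion; the physical
quantities are only placeholders):

import math

θ = math.pi / 6          # angle in radians
ω = 2 * math.pi * 50     # angular frequency, rad/s
ε = 1e-9                 # a small allowable variation

print(math.sin(θ), ω, ε)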

One thing that I also think would be good is to open up the
operator set for Python. Right now you can overload the
existing operators, but you can't easily define new ones.
And even if you do, you are very limited in what you can
use, and understandability suffers.

But Unicode provides code blocks for the operators that
mathematicians use for special purposes ("circle-times",
etc.). That would both reduce confusion for people bothered
by weird choices of overloading "*" and "+", and give people
who need these features the ability to use them.

It's also relevant that scientists in China and Saudi Arabia
probably use a Roman "c" for the speed of light, or a "mu"
to represent a mass, so it's likely more understandable
internationally than using, say, "lightspeed" and "mass".

OTOH, using identifiers in many different languages would
have the opposite effect. Right now, English is accepted as
a lingua franca for programming (and I admit that as a
native speaker of English, I benefit from that), but if it
became common practice to use lots of different languages,
cooperation might suffer.

But then, that's probably why English still dominates with
Java. I suspect that just means people wouldn't use it as
much. And I've certainly dealt with source code commented
in Spanish or German. It didn't kill me.

So, I'd say that in the long run:

1) Yes it will be adopted

2) The math and Greek-letter type symbols will be the big
win

3) Localized variable names will be useful to some people,
but not widely popular, especially for cooperative free
software projects (of course, in the Far East, for example,
han character names might become very popular as they span
several languages). But I bet it will remain underused so
long as English remains the most popular international trade
language.

In the meantime, though, I predict many luddites will
scream "But it doesn't work on my vintage VT-220 terminal!"
(And I may even be one of them).

Cheers,
Terry
 

Christoph Zwerschke

These were some interesting remarks, Terry.

I just asked myself how Chinese programmers feel about this. I don't
know Chinese, but they could probably write a whole program using only
one-character names for variables, and it would still be readable (at
least to Chinese readers)... Would this be used, or would they rather
prefer to write in English on account of compatibility issues (technical
issues, and human readability in international projects), or because
typing these characters is more cumbersome than ASCII? Any Chinese here?

-- Christoph
 

Rocco Moretti

Terry said:
One thing that I also think would be good is to open up the
operator set for Python. Right now you can overload the
existing operators, but you can't easily define new ones.
And even if you do, you are very limited in what you can
use, and understandability suffers.

One of the issues that would need to be dealt with in allowing new
operators to be defined is how to work out precedence rules for the new
operators. Right now you can redefine the meaning of addition and
multiplication, but you can't change the order of operations. (Witness
%, which must have the same precedence whether it means modulo or
string interpolation.)

If you allow (semi)arbitrary characters to be used as operators, some
scheme must be chosen for assigning a place in the precedence hierarchy.
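
The closest you can get today (an illustration only, not something
proposed in the thread) is to borrow an existing operator via a small
wrapper class; it also shows the precedence point, since the "new"
operator simply inherits the precedence of the operator it borrows:

class Infix:
    """Wrap a two-argument function so it can be spelled  a |op| b."""
    def __init__(self, func):
        self.func = func
    def __ror__(self, left):    # handles the  left |op  half
        return Infix(lambda right: self.func(left, right))
    def __or__(self, right):    # handles the  op| right  half
        return self.func(right)

# A toy "circle-times" (outer product) spelled with the borrowed | operator.
otimes = Infix(lambda a, b: [[x * y for y in b] for x in a])

print([1, 2] |otimes| [10, 20])   # [[10, 20], [20, 40]]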
 

Terry Hancock

For the tests that I tried earlier, using han characters
as variable names doesn't seem to be possible (Syntax
Error) in Python. I'd love to see whether I could use han
characters for all those keywords like import, but it
doesn't work.

Yeah, I'm pretty sure we're talking about the future here.
:)
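
(It did arrive: PEP 3131, accepted for Python 3.0, allows Unicode
letters, including han characters, in identifiers, though the keywords
themselves stay ASCII. A minimal example, with made-up variable names:)

速度 = 3.0e8        # "speed" (of light, m/s)
质量 = 2.0          # "mass" (kg)
能量 = 质量 * 速度 ** 2
print(能量)          # 1.8e+17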
That depends. Middle-aged and older people probably have
very little experience typing han characters. But with the
popularity of computers, the development of excellent
input packages and, most importantly, the online chats
that many teenagers are hooked on, the next several
generations will be able to type han characters easily
and comfortably.

That's interesting. I think many people in the West tend to
imagine han/kanji characters as archaisms that will
disappear (because to most Westerners they seem impossibly
complex to learn and use, "not suited for the modern
world"). I used to think this was likely, although I always
thought the characters were beautiful, so it would be a
shame.

After taking a couple of semesters of Japanese, though, I've
come to appreciate why they are preferred. Getting rid of
them would be like convincing English people to kunvurt to
pur fonetik spelin'.

Which isn't happening either, I can assure you. ;-)
One thing that is lacking in other languages is "phrase
input" -- almost every han input package provides this
customizable feature. With all these combined, many
youngsters can type as fast as they talk. I believe many
of them input han characters much faster than they input
English.

I guess this is like Canna/SKK server for typing Japanese.
I've never tried to localize my desktop to Japanese (and I
don't think I want to -- I can't read it all that well!),
but I've used kanji input in Yudit and a kanji-enabled
terminal.

I'm not sure I understand how this works, but surely if
Python can provide readline support in the interactive
shell, it ought to be able to handle "phrase input"/"kanji
input." Come to think of it, you probably can do this by
running the interpreter in a kanji terminal -- but Python
just doesn't know what to do with the characters yet.
The "side effect" of this technology advance might be that
in the future the
simplified chinese characters might deprecate, 'cos
there's no need to simplify
any more.

Heh. I must say the traditional characters are easier for
*me* to read. But that's probably because the Japanese kanji
are based on them, and that's what I learned. I never could
get the hang of "grass hand" or the "cursive" Chinese han
character style.

I would also like to point out that, as long as Chinese
programmers don't go "hog wild" and use obscure characters,
I suspect I would have much better luck reading their
programs with han characters than with, say, the Chinese
phonetic names! Possibly even better than with what they
thought were the correct English words, if their English
isn't that good.

Cheers,
Terry
 
