Unicode in Python

Rustom Mody · Apr 23, 2014

Chris said:
it's impossible for most people to type (and programming with a palette
of arbitrary syntactic tokens isn't my idea of fun)...

Where's the suggestion to use a "palette of arbitrary tokens" ?

I just tried a greek keyboard; ie do
$ setxkbmap -option "grp:switch,grp:alt_shift_toggle,grp_led:scroll" -layout "us,gr"

Thereafter typing
abcdefghijklmnopqrstuvwxyz
after a Shift-Alt
gives
αβψδεφγηιξκλμνοπ;ρστθωςχυζ

One more Shift-Alt and back to roman

IOW the extra typing cost for greek letters is negligible
over the corresponding roman ones

Of course
- One would need to define such a keyboard (setxkb)
- One would have to find similar technologies for other OSes (I'm on
Debian; even Ubuntu/Unity grabs too many keys)
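
To make the correspondence concrete, here is a minimal Python 3 sketch of the us-to-gr mapping. The table is copied from the 26-letter output quoted above rather than read from xkb, and the names `US_KEYS`, `GR_KEYS`, `US_TO_GR` and `to_greek` are invented for this illustration.

```python
# Key-for-key correspondence taken from the output shown above
# (us layout keys, then what the gr group produces for each).
# Illustrative lookup table only; it does not query the real xkb setup.
US_KEYS = "abcdefghijklmnopqrstuvwxyz"
GR_KEYS = "αβψδεφγηιξκλμνοπ;ρστθωςχυζ"

US_TO_GR = str.maketrans(US_KEYS, GR_KEYS)

def to_greek(text):
    """Return what `text` would produce if typed with the gr group active."""
    return text.translate(US_TO_GR)

print(to_greek("lambda"))   # λαμβδα
print(to_greek("q"))        # ;
```

Read the other way round, the table also tells you which Latin key to press for a given Greek letter, for instance 'v' for ω.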

Chris Angelico · Apr 23, 2014

Where's the suggestion to use a "palette of arbitrary tokens" ?

I just tried a greek keyboard; ie do
$ setxkbmap -option "grp:switch,grp:alt_shift_toggle,grp_led:scroll" -layout "us,gr"

Thereafter typing
abcdefghijklmnopqrstuvwxyz
after a Shift-Alt
gives
αβψδεφγηιξκλμνοπ;ρστθωςχυζ

One more Shift-Alt and back to roman

Okay. Now what about your other symbols? Your alternative assignment
operator, for instance. How do you type that?

ChrisA

Steven D'Aprano · Apr 23, 2014

Where's the suggestion to use a "palette of arbitrary tokens" ?

I just tried a greek keyboard; ie do
$ setxkbmap -option "grp:switch,grp:alt_shift_toggle,grp_led:scroll" -layout "us,gr"

Thereafter typing
abcdefghijklmnopqrstuvwxyz
after a Shift-Alt
gives
αβψδεφγηιξκλμνοπ;ρστθωςχυζ

One more Shift-Alt and back to roman

IOW the extra typing cost for greek letters is negligible over the
corresponding roman ones

25 Unicode characters down, 1114000+ to go

There's not just the keyboard mapping. There's the mental cost of knowing
which keyboard mapping you need ("is it Greek, Hebrew, or maths
symbols?"), the cost of remembering the mapping from the keys you see on
the keyboard to the keys they are mapped to ("is Ω mapped to O or W?")
and so forth. If you know lambda-calculus, you might associate λ with
functions, but if you don't, it's as obfuscated as associating Ч with
raising exceptions.

if not isinstance(obj, int):
Ð§TypeError("expected an int, got %r" % type(obj))
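
For scale, a small Python 3 sketch (assuming CPython 3.3 or later, where `sys.maxunicode` is 0x10FFFF) prints the size of the Unicode code space and the official names of the three letters mentioned above. The total counts code points, including unassigned ones and surrogates, not assigned characters.

```python
import sys
import unicodedata

# Unicode code points run from U+0000 to U+10FFFF inclusive.
total_code_points = sys.maxunicode + 1
print(total_code_points)          # 1114112

# Official character names for the letters used in the examples above.
for ch in "λΩЧ":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```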

Devin Jeanpierre · Apr 23, 2014

There's not just the keyboard mapping. There's the mental cost of knowing
which keyboard mapping you need ("is it Greek, Hebrew, or maths
symbols?"), the cost of remembering the mapping from the keys you see on
the keyboard to the keys they are mapped to ("is Ω mapped to O or W?")
and so forth. If you know lambda-calculus, you might associate λ with
functions, [...]

Or if you know Python and the name of the letter ("lambda").

But yes, typing out the special characters is annoying. I just use
words. The only downside to using words is, how do you specify capital
versus lowercase letters? "Gamma = ..." violates the style guide!

-- Devin

Rustom Mody · Apr 23, 2014

25 Unicode characters down, 1114000+ to go

The question would arise if there was some suggestion to add
1114000(+) characters to the syntactic/lexical definition of python.

IOW while it's true that unicode is a character-set, it's better to think
of it as a repertory -- here is the universal set from which a choice is available.

Okay. Now what about your other symbols? Your alternative assignment
operator, for instance. How do you type that?

In case you missed it, I said:

Of course
- One would need to define such a keyboard (setxkb)
- One would have to find similar technologies for other OSes

In more detail:
In our normal use of a US-104 keyboard, every letter 'costs' something.
eg 'a' costs 1 keystroke
'A' costs 2 (Shift+a)
Most people do not count that as a significant cost.
And when kids come on this list and talk smsese -- "i wanna do so-n-so" --
we chide them for saving keystrokes at the cost of readability.

In such a (default) setup typing a ∧ or ∨ is not possible at all without
something like a char-picker and at best has an ergonomic cost that is an
order of magnitude higher than the 'naturally available' characters.

On the other hand when/if a keyboard mapping is defined in which
the characters that are commonly needed are available, it is
reasonable to expect the ∨,∧ to cost no more than 2 strokes each
(ie about as much as an 'A'; slightly more than an 'a'). Which means
that '∨' is expected to cost about the same as 'or' and '∧' to cost less than an 'and'

Readability is another question altogether.
Random example from my machine
calendar.py line 99
If one finds this:

return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

more readable than
return year%4=0 ∧ (year%100≠0 ∨ year%400 = 0)
then perhaps the following is the most preferred?

COMPUTE YEAR MODULO 4 EQUALS 0 AND YEAR MODULO 100 NOT
EQUAL TO ZERO OR YEAR MODULO 400 EQUAL TO 0

IOW COBOL is desirable?
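
For reference, the rule in the calendar.py line can be checked directly against the standard library. A minimal sketch: `is_leap` is a name chosen here, while `calendar.isleap` is the stdlib function.

```python
import calendar

def is_leap(year):
    # Same rule as the calendar.py line quoted above.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Cross-check against the standard library over a range of years.
agrees = all(is_leap(y) == calendar.isleap(y) for y in range(1, 3000))
print(agrees)                                                # True
print([y for y in (1900, 2000, 2014, 2016) if is_leap(y)])   # [2000, 2016]
```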

Chris Angelico · Apr 23, 2014

In such a (default) setup typing a ∧ or ∨ is not possible at all without
something like a char-picker and at best has an ergonomic cost that is an
order of magnitude higher than the 'naturally available' characters.

On the other hand when/if a keyboard mapping is defined in which
the characters that are commonly needed are available, it is
reasonable to expect the ∨,∧ to cost no more than 2 strokes each
(ie about as much as an 'A'; slightly more than an 'a'). Which means
that '∨' is expected to cost about the same as 'or' and '∧' to cost less than an 'and'

So how much effort are you going to go to for, effectively, the same
end result? You can type "or" with the same keystrokes, and it takes
zero setup work and zero memorization (you may forget which keystroke
you set up for ∨, but I doubt you'll forget how to spell "or", even if
you think it means gold/yellow). Where's the benefit? I'm seriously
not seeing it.

ChrisA

Steven D'Aprano · Apr 23, 2014

perhaps the following is the most preferred?

COMPUTE YEAR MODULO 4 EQUALS 0 AND YEAR MODULO 100 NOT EQUAL TO ZERO OR
YEAR MODULO 400 EQUAL TO 0

IOW COBOL is desirable?

If the only choices are COBOL on one hand and the mutant offspring of
Perl and APL on the other, I'd vote for COBOL.

But surely they aren't the only options, and it is possible to find a
happy medium which is neither excessively verbose nor painfully,
cryptically terse.

Remember that we're talking about general purpose programming here. There
are domains which favour terseness and a vast number of symbols, e.g.
mathematics, but most programming is not in that domain, even when it
uses tools from that domain.

Steven D'Aprano · Apr 23, 2014

On the other hand when/if a keyboard mapping is defined in which the
characters that are commonly needed are available, it is reasonable to
expect the ∨,∧ to cost no more than 2 strokes each (ie about as much as
an 'A'; slightly more than an 'a'). Which means that '∨' is expected to
cost about the same as 'or' and ∧ to cost less than an 'and'

Oh, a further thought...

Consider your example:

return year%4=0 ∧ (year%100≠0 ∨ year%400 = 0)

vs

return year%4=0 and (year%100!=0 or year%400 = 0)

[aside: personally I like ≠ and if there was a platform independent way
to type it in any editor, I'd much prefer it over != or <> ]

Apart from the memorization problem, which I've already touched on, there
is the mode problem. Keyboard layouts are modes, and you're swapping
modes. Every time you swap modes, there is a small mental cost. Think of
it as an interrupt which has to be caught, pausing the current thought
and starting a new one. So rather than:

char char char char char char char ...

you have:

char char char INTERRUPT
char INTERRUPT
char char char ...

which is a heavier cost than it appears from just counting keystrokes. Of
course, the more experienced you become, the smaller that cost will be,
but it will never be quite as low as just a "regular" keystroke.

Normally, when people use multiple keyboards, it's because that interrupt
cost is amortized over a significant amount of typing:

INTERRUPT (English layout)
paragraph paragraph paragraph paragraph
INTERRUPT (Greek layout)
paragraph paragraph paragraph
INTERRUPT (English again)
paragraph ...

and possibly even lost in the noise of a far greater interrupt, namely
task-switching from one application to another. So it's manageable. But
switching layouts for a single character is likely to be far more
painful, especially for casual users of that layout.

Based on an extremely generous estimate that I use "lambda" four times in
100 lines of code, I might use λ perhaps once in a thousand non-Greek
characters. Similarly, I might use ∧ or ∨ maybe once per hundred
characters. That means I'm unlikely to ever get familiar enough with
those that the cost of two interrupts per use will be negligible.
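
One way to put rough numbers on estimates like these is to count tokens in real source. The sketch below uses the standard `tokenize` module; `WORDS`, `keyword_frequency` and the sample text are made up for illustration, and the result says nothing about any particular code base.

```python
import io
import tokenize
from collections import Counter

# Words that the symbolic proposals would replace (an assumed list).
WORDS = {"and", "or", "not", "lambda"}

def keyword_frequency(source):
    """Count WORDS that occur as tokens, ignoring strings and comments."""
    counts = Counter()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Keywords are reported as NAME tokens by the tokenize module.
        if tok.type == tokenize.NAME and tok.string in WORDS:
            counts[tok.string] += 1
    return counts

sample = (
    "def is_leap(year):\n"
    "    # divisible by 4 and not by 100, or by 400\n"
    "    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)\n"
)
counts = keyword_frequency(sample)
print(dict(counts), "in", len(sample), "characters")
```

Pointing `keyword_frequency` at the text of a real module gives a per-character rate that can be set against the "once per hundred characters" guess.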

Rustom Mody · Apr 23, 2014

Oh, a further thought...

Consider your example:

return year%4=0 ∧ (year%100≠0 ∨ year%400 = 0)

return year%4=0 and (year%100!=0 or year%400 = 0)

[aside: personally I like ≠ and if there was a platform independent way
to type it in any editor, I'd much prefer it over != or <> ]

Apart from the memorization problem, which I've already touched on, there
is the mode problem. Keyboard layouts are modes, and you're swapping
modes. Every time you swap modes, there is a small mental cost. Think of
it as an interrupt which has to be caught, pausing the current thought
and starting a new one. So rather than:

char char char char char char char ...

you have:

char char char INTERRUPT
char INTERRUPT
char char char ...

which is a heavier cost than it appears from just counting keystrokes. Of
course, the more experienced you become, the smaller that cost will be,
but it will never be quite as low as just a "regular" keystroke.

Normally, when people use multiple keyboards, it's because that interrupt
cost is amortized over a significant amount of typing:

INTERRUPT (English layout)
paragraph paragraph paragraph paragraph
INTERRUPT (Greek layout)
paragraph paragraph paragraph
INTERRUPT (English again)
paragraph ...

and possibly even lost in the noise of a far greater interrupt, namely
task-switching from one application to another. So it's manageable. But
switching layouts for a single character is likely to be far more
painful, especially for casual users of that layout.

Based on an extremely generous estimate that I use "lambda" four times in
100 lines of code, I might use λ perhaps once in a thousand non-Greek
characters. Similarly, I might use ∧ or ∨ maybe once per hundred
characters. That means I'm unlikely to ever get familiar enough with
those that the cost of two interrupts per use will be negligible.

It's gratifying to see an argument whose framing is cognitive-based!

More on that later.

For now: mode/modeless

Yes most of us prefer the Shift key to the Caps Lock even for stretches of capitals. So analogously here is a modeless solution

Earlier I found this mode-switching version
$ setxkbmap -option "grp:switch,grp:alt_shift_toggle,grp_led:scroll" -layout "us,gr"
this makes Shift-Alt the mode-switcher

This one on the other hand
$ setxkbmap -layout "us,gr" -option "grp:switch"
will make right-alt behave like 'Greek-Shift'

ie typing
abcdefghijklmnopqrstuvwxyz
with RAlt depressed throughout, produces
αβψδεφγηιξκλμνοπ;ρστθωςχυζ

This makes a Greek letter's ergonomic cost identical to a capital English
letter's: for Greek, use RAlt the way one uses Shift for English.

Notes:
1. Tried on Debian and Ubuntu -- recent Ubuntus are rather more ill-mannered in
the way they appropriate keys. Still it works as far as I can see.

2. ';' ?? ie semicolon is produced from 'q'? What's that semicolon doing there?? But then Greek is -- well -- Greek to me! (As is xkb!)
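
On the semicolon in note 2: a likely explanation, offered as an assumption rather than something checked against the xkb sources, is that Greek writes its question mark as ';', so a Greek layout needs a key for it. Unicode also has a dedicated GREEK QUESTION MARK (U+037E) that is canonically equivalent to the ASCII semicolon, which Python's `unicodedata` can confirm:

```python
import unicodedata

greek_question_mark = "\u037e"
name = unicodedata.name(greek_question_mark)
print(name)                                   # GREEK QUESTION MARK

# Canonical normalization folds it into the plain ASCII semicolon.
folded = unicodedata.normalize("NFC", greek_question_mark)
print(folded == ";")                          # True
```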

wxjmfauth · Apr 26, 2014

==========

I wrote once 90 % of Python 2 apps (a generic term) supposed to
process text, strings are not working.

In Python 3, that's 100 %. It is somehow only by chance, apps may
give the illusion they are properly working.

jmf

Frank Millman · Apr 26, 2014

==========

I wrote once 90 % of Python 2 apps (a generic term) supposed to
process text, strings are not working.

In Python 3, that's 100 %. It is somehow only by chance, apps may
give the illusion they are properly working.

It is quite frustrating when you make these statements without explaining
what you mean by 'not working'.

It would be really useful if you could spell out -

1. what you did
2. what you expected to happen
3. what actually happened

Frank Millman

Ian Kelly · Apr 26, 2014

It is quite frustrating when you make these statements without explaining
what you mean by 'not working'.

As far as anybody has been able to determine, what jmf means by "not
working" is that strings containing the € character are handled less
efficiently than strings that do not contain it in certain contrived test
cases.
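
The behaviour in question comes from CPython's flexible string representation (PEP 393, CPython 3.3 and later), which stores a string with 1, 2 or 4 bytes per character depending on the widest character it contains. A minimal sketch of the space side of that trade-off; it is CPython-specific, and the exact sizes vary by version and platform because `sys.getsizeof` includes the object header:

```python
import sys

n = 1000
samples = {
    "ascii": "a" * n,              # fits in 1 byte per character
    "bmp": "€" * n,                # U+20AC needs 2 bytes per character
    "astral": "\U0001F600" * n,    # outside the BMP, 4 bytes per character
}

# Same length in characters, different storage cost.
sizes = {label: sys.getsizeof(s) for label, s in samples.items()}
for label, size in sizes.items():
    print(f"{label:8} len={len(samples[label])} size={size} bytes")
```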

wxjmfauth · Apr 27, 2014

On Saturday, 26 April 2014 15:38:29 UTC+2, Ian wrote:

Rustom Mody · Apr 27, 2014

On the other hand when/if a keyboard mapping is defined in which the
characters that are commonly needed are available, it is reasonable to
expect the ∨,∧ to cost no more than 2 strokes each (ie about as much as
an 'A'; slightly more than an 'a'). Which means that '∨' is expected to
cost about the same as 'or' and ∧ to cost less than an 'and'

Oh, a further thought...
Consider your example:
return year%4=0 ∧ (year%100≠0 ∨ year%400 = 0)
vs
return year%4=0 and (year%100!=0 or year%400 = 0)
[aside: personally I like ≠ and if there was a platform independent way
to type it in any editor, I'd much prefer it over != or <> ]

I checked haskell and find the unicode support is better.

For variables (ie identifiers) python and haskell are much the same:

Python3:
>>> α = 1
>>> α
1

Haskell:

Prelude> let α = 1
Prelude> α
1

However in haskell one can also do this unlike python:
*Main> 2 ≠ 3
True

All that's needed to make this work is this set of new-in-terms-of-old definitions:

[The -- lines are comments for those things that don't work as one may wish]
--------------
import qualified Data.Set as Set
-- Experimenting with Unicode in Haskell source

-- Numbers
x ≠ y = x /= y
x ≤ y = x <= y
x ≥ y = x >= y
x ÷ y = divMod x y
x ⇑ y = x ^ y

x × y = x * y -- readability hmmm !!!
π = pi

-- ⌊ x = floor x
-- ⌈ x = ceiling x

-- Lists
xs ⤚ ys = xs ++ ys
n ↑ xs = take n xs
n ↓ xs = drop n xs

-- Bools
x ∧ y = x && y
x ∨ y = x || y
-- ¬x = not x

-- Sets

x ∈ s = x `Set.member` s
s ∪ t = s `Set.union` t
s ∩ t = s `Set.intersection` t
s ⊆ t = s `Set.isSubsetOf` t
s ⊂ t = s `Set.isProperSubsetOf` t
s ⊈ t = not (s `Set.isSubsetOf` t)
-- ∅ = Set.null
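
On the Python side, a sketch of where the line falls (assuming Python 3): non-ASCII letters are accepted in identifiers (PEP 3131), but symbols such as ≠ cannot be defined as operators, so the closest equivalent is an ordinary named function. `not_equal` is a made-up name for this illustration.

```python
import math
import operator

# Letters from other scripts are valid identifiers.
α = 1
π = math.pi
not_equal = operator.ne   # hypothetical stand-in for the ≠ operator

letter_ok = "λ".isidentifier()
symbol_ok = "≠".isidentifier()
print(letter_ok, symbol_ok)     # True False

# A mathematical symbol in operator position is rejected by the compiler.
try:
    compile("2 ≠ 3", "<example>", "eval")
    compiles = True
except SyntaxError:
    compiles = False
print(compiles)                 # False
print(not_equal(2, 3))          # True
```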

wxjmfauth · Apr 28, 2014

On Saturday, 26 April 2014 15:38:29 UTC+2, Ian wrote:

As far as anybody has been able to determine, what jmf means by "not working" is that strings containing the EURO character are handled less efficiently than strings that do not contain it in certain contrived test cases.

----

Python 2.7 + cp1252:
- Solid and coherent system (nothing to do with the Euro).

Python 3:
- It missed the unicode shift.
- Covering the whole unicode range will not make
Python a unicode compliant product.
- Flexible String Representation (a problem per se),
a mathematical absurdity which does the opposite of
the coding schemes endorsed by unicode.org (sheet of
paper and pencil!)
- Very deeply buggy (quadrature of the circle problem).

Positive side:
- A very nice tool to teach the coding of characters
and unicode.

jmf

random832 · May 1, 2014

Python 3:
- It missed the unicode shift.
- Covering the whole unicode range will not make
Python a unicode compliant product.

Please cite exactly what portion of the unicode standard requires
operations with all characters to be handled in the same amount of time
and space, and forbids optimizations that make some characters handled
faster or in less space than others.

Michael Torrie · May 2, 2014

Can't help but feed the troll... forgive me.

Python 2.7 + cp1252:
- Solid and coherent system (nothing to do with the Euro).

Except that cp1252 is not unicode. Perhaps some subset of unicode can
be encoded into bytes using cp1252. But if it works for you keep using
it, and stop spreading nonsense about FSR.
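
To illustrate the point that cp1252 covers only a subset of Unicode, a short sketch using nothing but the built-in codecs (variable names are chosen here for illustration):

```python
# The euro sign has a cp1252 byte ...
euro_cp1252 = "€".encode("cp1252")
print(euro_cp1252)                 # b'\x80'

# ... but a Greek letter does not.
try:
    "λ".encode("cp1252")
    lambda_fits = True
except UnicodeEncodeError:
    lambda_fits = False
print(lambda_fits)                 # False

# UTF-8 can encode both.
both_utf8 = "€λ".encode("utf-8")
print(both_utf8)                   # b'\xe2\x82\xac\xce\xbb'
```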

Python 3:
- Flexible String Representation (a problem per se),
a mathematical absurdity which does the opposite of
the coding schemes endorsed by unicode.org (sheet of
paper and pencil!)
- Very deeply buggy (quadrature of the circle problem).

Maybe it's the language barrier, but whatever it is you are talking
about, I certainly can't make out.

You've been ranting about FSR for years without being able to clearly
say what's wrong with it. Please quote unicode specifications that you
feel Python does not implement. What unicode characters cannot be
represented? Does Python choke on certain unicode strings or expose
entities it should not (like Javascript does)?

Why would you think that the unicode consortium's list of byte encodings
are the only possible valid ways of encoding unicode to a byte stream?

If you're going to continue to write this sort of stuff, please have the
decency to answer these questions at least.
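
As a concrete instance of the "expose entities it should not" question, here is a small sketch assuming Python 3.3 or later. A character outside the Basic Multilingual Plane is one element of a Python string, whereas a language whose strings are sequences of UTF-16 code units reports it as two.

```python
s = "\U0001F600"                      # one code point outside the BMP

n_code_points = len(s)
n_utf16_units = len(s.encode("utf-16-le")) // 2

print(n_code_points)                  # 1
print(n_utf16_units)                  # 2, a surrogate pair
print(s[0] == s)                      # True: indexing never yields half a character
```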

Positive side:
- A very nice tool to teach the coding of characters
and unicode.

Indeed.

wxjmfauth · May 3, 2014

On Friday, 2 May 2014 05:50:40 UTC+2, Michael Torrie wrote:

Can't help but feed the troll... forgive me.

Except that cp1252 is not unicode. Perhaps some subset of unicode can
be encoded into bytes using cp1252. But if it works for you keep using
it, and stop spreading nonsense about FSR.

Maybe it's the language barrier, but whatever it is you are talking
about, I certainly can't make out.

You've been ranting about FSR for years without being able to clearly
say what's wrong with it. Please quote unicode specifications that you
feel Python does not implement. What unicode characters cannot be
represented? Does Python choke on certain unicode strings or expose
entities it should not (like Javascript does)?

Why would you think that the unicode consortium's list of byte encodings
are the only possible valid ways of encoding unicode to a byte stream?

If you're going to continue to write this sort of stuff, please have the
decency to answer these questions at least.

Indeed.

========

-

wxjmfauth · May 8, 2014

On Thursday, 1 May 2014 19:21:14 UTC+2, (e-mail address removed) wrote:

Please cite exactly what portion of the unicode standard requires
operations with all characters to be handled in the same amount of time
and space, and forbids optimizations that make some characters handled
faster or in less space than others.

==========

I missed your comment. Regression is only a side effect.

I can make Python fail with
any piece of text or valid sequence of characters I wish [*].

I'm no longer writing code (apps), only maintaining
my interactive interpreters.

[*] I do not count as failures, issues like cp65001,
only "basic" text/string manipulations.

jmf
