Mark Lawrence
"I believe that Pythonistas should commit themselves to achieving the
goal, before this decade is out, of making Python 3 the default version
and having everybody be cool with unicode."
I like that, thank you.
"I believe that Pythonistas should commit themselves to achieving the
goal, before this decade is out, of making Python 3 the default version
and having everybody be cool with unicode."
I'm cool with Unicode as long as it "just works" without me ever
having to understand it and I can interact effortlessly with plain old
having to understand it and I can interact effortlessly with plain old
ASCII files. Every time I start to read anything about Unicode with any
technical detail at all, I start to get dizzy and bleed from the ears.
I think Python is doing it correctly. If I want to operate on
"clusters" I'll normalize the string first.
Thanks for this excellent post.
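A minimal sketch of the normalize-first approach described above, using only the standard unicodedata module:

```python
import unicodedata

# A decomposed string: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
s = 'e\u0301'
assert len(s) == 2

# NFC normalization composes the pair into the single precomposed
# code point U+00E9, so ordinary string methods see one "character".
t = unicodedata.normalize('NFC', s)
assert t == '\u00e9'
assert len(t) == 1
```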
On 12/2/13 3:38 PM, Ethan Furman wrote:
This is where my knowledge about Unicode gets fuzzy. Isn't it the case that some grapheme clusters (or whatever the right word is) can't be normalized down to a single code point? Characters can accept many accents, for example. In that case, you can't always normalize and use the existing string methods, but would need more specialized code.
That is correct.
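A minimal illustration of the point, assuming only the standard unicodedata module: some base+accent pairs have no precomposed code point, so NFC leaves them as multi-code-point clusters.

```python
import unicodedata

# 'q' + U+0301 COMBINING ACUTE ACCENT has no precomposed form in Unicode,
# so NFC cannot collapse this cluster to a single code point.
cluster = 'q\u0301'
assert len(unicodedata.normalize('NFC', cluster)) == 2
```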
If Unicode had a distinct code point for every possible combination of
base-character plus an arbitrary number of diacritics or accents, the
0x10FFFF code points wouldn't be anywhere near enough.
I see over 300 diacritics used just in the first 5000 code points. Let's
pretend that's only 100, and that you can use up to a maximum of 5 at a
time. That gives 79375496 combinations per base character, much larger
than the total number of Unicode code points in total.
If anyone wishes to check my logic:
# count distinct combining chars
import unicodedata
s = ''.join(chr(i) for i in range(33, 5000))
s = unicodedata.normalize('NFD', s)
t = [c for c in s if unicodedata.combining(c)]
len(set(t))
# calculate the number of combinations
def comb(r, n):
    """Combinations nCr"""
    p = 1
    for i in range(r+1, n+1):
        p *= i
    for i in range(1, n-r+1):
        # Exact integer division: at each step p is n!/(r!*i!),
        # which is always an integer for i <= n-r.
        p //= i
    return p
sum(comb(i, 100) for i in range(6))
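For a cross-check, the same figure can be computed with the standard library's math.comb (available in Python 3.8+):

```python
import math

# Choose up to 5 diacritics from a pool of 100 (order-insensitive, no repeats).
total = sum(math.comb(100, i) for i in range(6))
assert total == 79375496  # matches the figure quoted above
```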
I'm not suggesting that all of those accents are necessarily in use in
the real world, but there are languages which construct arbitrary
combinations of accents. (Or so I have been led to believe.)
Hrmm, well, after being educated I think I may have to reverse my position. Given that not every cluster can be
normalized to a single code point, perhaps Python is doing it the best possible way. On the other hand, we have a
uni*code* type, not a uni*char* type. Maybe 3.5 can have that.
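If such a grapheme-aware type ever materialized, one very simplistic sketch of the idea (a rough approximation that only glues combining marks onto the preceding base character, nowhere near full UAX #29 segmentation):

```python
import unicodedata

def rough_clusters(s):
    """Very rough grapheme-cluster split: attach each combining mark to the
    preceding base character. Not a real UAX #29 implementation."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'que' + combining acute: the accent stays attached to the final 'e'.
assert rough_clusters('que\u0301') == ['q', 'u', 'e\u0301']
```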
You intuitively pointed out a very important feature
of "unicode". However, it is not necessary; this is
exactly what Unicode does (when used properly).
jmf