Is Unicode support so hard...

jmfauth

In a previous post,
http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226# ,
Chris “Kwpolska” Warrick wrote:

“Is Unicode support so hard, especially in the 21st century?”

--

Unicode is not really complicated, and it works very well (more
than two decades of development, if you take iso-14**** into
account).

But, as usual I would say, people prefer to spend their
time trying to make a "better Unicode than Unicode", and it usually
fails. Python does not escape this rule.

-----

I'm "busy" with TeX (a Unicode engine variant), fonts and typography.
This gives me plenty of ideas for testing the "flexible string
representation" (FSR). I have to admit that this FSR is failing
particularly well...

I can almost say, a delight.

jmf
Unicode lover
 
Ned Batchelder

I'm totally confused about what you are saying. What does "make a
better Unicode than Unicode" mean? Are you saying that Python is guilty
of this? In what way? Can you provide specifics? Or are you saying
that you like how Python has implemented it? "FSR is failing ... a
delight"? I don't know what you mean.

--Ned.
 
Benjamin Kaplan

Don't bother trying to figure this out. jmfauth has been hijacking
every thread that mentions Unicode to complain about the flexible
string representation introduced in Python 3.3. Apparently, having
proper Unicode semantics (indexing is based on code points, not
UTF-16 code units) at the expense of performance when calling .replace
on the only non-ASCII or non-BMP character in the string is a horrible bug.
 
Chris Angelico

You're not familiar with jmf? He's one of our resident trolls. Allow
me to summarize Python 3's Unicode support...

From 3.0 up to and including 3.2.x, Python could be built as either
"narrow" or "wide". A wide build consumes four bytes per character in
every string, which is rather wasteful (given that very few strings
actually NEED that); a narrow build gets some things wrong. (I'm using
a 2.7 here as I don't have a narrow-build 3.x handy; the same
considerations apply, though.)

Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit
(Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> len(u"asdf\U00012345qwer")
10
>>> u"asdf\U00012345qwer"[8]
u'e'

In a narrow build, strings are stored in UTF-16, so astral characters
count as two code units each. This means that a program can behave
differently on different platforms (other languages, such as
ECMAScript, actually *mandate* UTF-16; at least this means you can
depend on this otherwise-bizarre behaviour regardless of what platform
you're on), and I have to say this is counter-intuitive.
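
A rough way to see the same effect on a current Python 3, where narrow builds no longer exist, is to count UTF-16 code units by hand. The helper name `utf16_units` is made up for this sketch; it only mimics what a narrow build stored and says nothing about how CPython works internally today.

```python
def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


s = "asdf\U00012345qwer"
units = utf16_units(s)

print(len(s))          # 9 code points
print(len(units))      # 10 code units, which is what a narrow build's len() counted
print(s[8])            # 'r' when indexing by code point
print(chr(units[8]))   # 'e' when indexing by code unit, as in the session above
print([hex(u) for u in units[4:6]])  # the surrogate pair: ['0xd808', '0xdf45']
```

The one astral character occupies positions 4 and 5 in the code-unit view, which is why everything after it is shifted by one.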

Enter Python 3.3 and PEP 393 strings. Now *EVERY* Python build is,
conceptually, wide. (I'm not sure how PEP 393 applies to other Pythons
- Jython, PyPy, etc - so assume that whenever I refer to Python, I'm
restricting this to CPython.) The underlying representation might be
more efficient, but to the script, it's exactly the same as a wide
build. If a string has no characters that demand more width, it'll be
stored nice and narrow. (It's the same technique that Pike has been
using for a while, so it's a proven system; in any case, we know that
this is going to work, it's just a question of performance - it adds a
fixed overhead.) Great! We save memory in Python programs. Wonderful!
Right?
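
One hedged way to observe this on CPython 3.3 or later is to grow a string by one character and watch `sys.getsizeof`. The helper `bytes_per_char` is hypothetical, and the sketch assumes CPython; other implementations may report different numbers.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate storage per character by growing a run of ch by one."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


for label, ch in [("ASCII", "a"), ("BMP", "\u20ac"), ("astral", "\U00012345")]:
    # len() counts code points in every case; only the storage width differs.
    print(label, len(ch), bytes_per_char(ch))
```

On a PEP 393 build this should report 1, 2 and 4 bytes per character respectively, while `len()` stays at 1 for each of the three characters.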

Enter jmf. No, it's not wonderful, because OBVIOUSLY Python is now
America-centric, because now the full Unicode range is divided into
"these ones get stored in 1 byte per char, these in 2, these in 4".
Clearly that's making life way worse for everyone else. Also, compared
to the narrow build that jmf was previously using, this uses heaps
MORE space in the stupid micro-benchmarks that he keeps on trotting
out, because he has just one astral character in a sea of ASCII. And
that's totally what programs are doing all the time, too. Never mind
that basic operations like length, slicing, etc are no longer buggy,
no, Python has taken a terrible step backwards here.
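
For what it's worth, the arithmetic behind that micro-benchmark can be sketched without measuring anything. Both helpers below are hypothetical and count only the bytes needed for the characters themselves, ignoring object headers; the 1/2/4 width rule follows PEP 393 (Latin-1 range, BMP, everything else).

```python
def utf16_payload(s: str) -> int:
    """Bytes a narrow (UTF-16) build would need for the characters alone."""
    return len(s.encode("utf-16-le"))


def pep393_payload(s: str) -> int:
    """Bytes PEP 393 needs for the characters alone: the widest one sets the width."""
    widest = max(map(ord, s), default=0)
    width = 1 if widest < 0x100 else 2 if widest < 0x10000 else 4
    return len(s) * width


ascii_only = "a" * 1000
one_astral = "a" * 999 + "\U00012345"

print(utf16_payload(ascii_only), pep393_payload(ascii_only))  # 2000 1000
print(utf16_payload(one_astral), pep393_payload(one_astral))  # 2002 4000
```

So the single-astral-character case really does cost more than on a narrow build, while the far more common all-ASCII case costs half as much.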

Oh, and check this out:
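>>> def munge(s):
...     """Move characters around in a string."""
...     l=len(s)//4
...     return s[:l]+s[l*2:l*3]+s[l:l*2]+s[l*3:]
...
>>> munge("asdfqwerzxcv1234")
'asdfzxcvqwer1234'

Looks fine.

>>> munge(u"asd\U00012345we\U00034567xc\U00023456bla")
u'asd\U00012167xc\U00023745we\U00034456bla'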

Where'd those characters come from? I was just moving stuff around,
right? I can't get new characters out of it... can I?
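
The narrow-build result can be reproduced on a current Python 3 by doing the same slicing on UTF-16 code units. This is only a sketch: the function names `munge` and `munge_utf16` are assumptions, and the input string is inferred by working backwards from the output shown above.

```python
def munge(seq):
    """Move the middle two quarters of a sequence around."""
    l = len(seq) // 4
    return seq[:l] + seq[l * 2:l * 3] + seq[l:l * 2] + seq[l * 3:]


def munge_utf16(s: str) -> str:
    """Apply munge() to UTF-16 code units, the way a narrow build sliced strings."""
    data = s.encode("utf-16-le")
    units = [data[i:i + 2] for i in range(0, len(data), 2)]
    # surrogatepass keeps the sketch from raising if a slice leaves a lone surrogate
    return b"".join(munge(units)).decode("utf-16-le", "surrogatepass")


s = "asd\U00012345we\U00034567xc\U00023456bla"

print(ascii(munge_utf16(s)))  # 'asd\U00012167xc\U00023745we\U00034456bla'
print(ascii(munge(s)))        # 'asd\U00034567xc\U00012345we\U00023456bla'
```

Slicing by code unit cuts every surrogate pair in half and glues the halves to the wrong partners, which is where the "new" characters come from. Slicing by code point only reorders the characters that were already there.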

Flash forward to current date, and jmf has hijacked so many threads to
moan about PEP 393 that I'm actually happy about this one, simply
because he gave it a new subject line and one appropriate to a
discussion about Unicode.

ChrisA
 
Chris “Kwpolska” Warrick

Don’t forget the original context: this was a short remark to a guy I
was responding to. His newsgroup software (slrn, according to the
headers) mangled the encoding of U+201C and U+201D in my From field,
turning them into three question marks each. And jmf started a rant,
as usual…

PS. There are two fancy Unicode characters around. Can you find both
of them, jmf?
 
Mark Lawrence

He can't complain about performance for the .replace issue any more, as
it's been fixed: http://bugs.python.org/issue16061

Sadly he'll almost certainly have more edge cases up his sleeve while
continuing to ignore minor issues like memory saving and correctness.
 
Ethan Furman

Flash forward to current date, and jmf has hijacked so many threads to
moan about PEP 393 that I'm actually happy about this one, simply
because he gave it a new subject line and one appropriate to a
discussion about Unicode.

+1000
 
rusi

    Hi jmf,


    This is too vague for me.

    Which string representation should Python use?
1) UTF-32
2) UTF-8
3) Python 3.3 -- 1, 2, or 4 bytes per character decided at runtime
4) Python 3.2 -- 2 or 4 bytes per character decided at Python build time
5) Something else

jmf recommends UTF-8.

Apart from the fact that UTF-8 would be slower in all
cases, and far more so in cases like indexing, the fact that jmf
says so makes it more ridiculous.
According to jmf, Python sucks up to ASCII (those big bad Americans… of
whom Steven is the first…) whereas Unicode is the true international/
universal standard.

I guess the irony is clear to all (except jmf) given that:
- it's Unicode that sucks up to ASCII by carefully conforming in the
first 128 positions, including the completely useless control chars;
Python just implements the standard
- UTF-8 is an ASCII-biased Unicode compression method, viz. UTF-8 is
most space-efficient on ASCII at the cost of being generally
time-inefficient
- All jmf's beefs (as far as I remember) are variations on the theme:
"time-inefficiency is equivalent to non-unicode-compliant"
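
The indexing cost mentioned above can be illustrated with a small sketch. `utf8_index` is a hypothetical helper, not anything Python provides: it finds the n-th code point in UTF-8 bytes by scanning from the start, because a variable-width encoding gives no way to jump straight to it.

```python
def utf8_index(data: bytes, n: int) -> str:
    """Return the n-th code point (0-based) of UTF-8 data by linear scan."""
    if n < 0:
        raise IndexError("negative index not supported in this sketch")
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a new code point starts
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    return data[start:].decode("utf-8")


s = "na\u00efve \U00012345 text"
encoded = s.encode("utf-8")
print(all(utf8_index(encoded, i) == s[i] for i in range(len(s))))  # True
```

Each lookup walks the bytes from the beginning, so it is O(n), whereas a fixed-width representation (1, 2 or 4 bytes per character) can compute the offset directly.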

In short he manifests a dog-in-the-manger mindset:
"Since the whole world will never speak French (grief, mope, grumble,
thrash…) everyone should pay for the Chinese character set's size even
if they are monolingually English"

All that said…

I believe that the recent correction in unicode performance followed
jmf's grumbles
(Mark please correct me if I am wrong)
So the Python community can be thankful to jmf, even if he insists on
laboring under bizarre political hallucinations.

[Written from India where a monolingual person is as rare as a
palmtree on a polecap]
 
Steven D'Aprano

According to jmf python sucks up to ASCII (those big bad Americans… of
whom Steven is the first…)

Watch who you're calling an American, mate.
 
Chris Angelico

Watch who you're calling an American, mate.

I think he knows, and that's why he said it. You and I are foremost
among Americans who are destroying Python.

ChrisA
 
88888 Dihedral

Supporting Unicode is easy on the language side.
But supporting Unicode on a platform involves
the OS and the display and input hardware,
which are not free most of the time.
 
Terry Jan Reedy

I believe that the recent correction in unicode performance followed
jmf's grumbles

No, the correction followed upon his accurate report of a regression
last August, which was unfortunately mixed in with grumbles and
inaccurate claims. Others separated out and verified the accurate
report. I reported it to pydev and enquired as to its necessity; I
believe Mark opened the tracker issue, and the two people who worked on
optimizing 3.3 a year ago fairly quickly came up with two different
patches. The delay of several months afterwards was a matter of testing
and picking the best approach.
 
Mark Lawrence

I'd again like to point out that all I did was raise the issue. It was
based on data provided by Steven D'Aprano and confirmed by Serhiy Storchaka.
 
