Unicode 7


wxjmfauth

Let's see how ready Python is for the next Unicode version
(Unicode 7.0.0.Beta).

timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[1.4027834829454946, 1.38714224331963, 1.3822586635296261]
timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[5.462776291480395, 5.4479432055423445, 5.447874284053398]


# more interesting
timeit.repeat("(x*1000 + y)[:-1]",
    setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]
Note 1: "lookup" is not the problem.

Note 2: From Unicode.org : "[...] We strongly encourage [...] and test
them with their programs [...]"

-> Done.

jmf
 

Tim Chase

timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[1.4027834829454946, 1.38714224331963, 1.3822586635296261]
timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[5.462776291480395, 5.4479432055423445, 5.447874284053398]

# more interesting
timeit.repeat("(x*1000 + y)[:-1]",
    setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]

While I dislike feeding the troll, what I see here is: on your
machine, all unicode manipulations in the test should take ~5.4
seconds. But Python notices that some of your strings *don't*
require a full 32-bits and thus optimizes those operations, cutting
about 75% of the processing time (wow...4-bytes-per-char to
1-byte-per-char, I wonder where that 75% savings comes from).
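
The width adaptation Tim describes can be seen directly with
sys.getsizeof -- a minimal sketch, assuming CPython 3.3+ with the FSR
(exact byte counts vary by build):

>>> import sys
>>> a = 'abc' * 1000 + 'z'        # every code point fits in 1 byte
>>> b = 'abc' * 1000 + '\u0fce'   # one BMP code point widens the whole string to 2 bytes/char
>>> sys.getsizeof(b) > sys.getsizeof(a)   # roughly double the storage
True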

So rather than highlight any *problem* with Python, your [mostly
worthless microbenchmark non-realworld] tests show that Python's
unicode implementation is awesome.

Still waiting to see an actual bug-report as mentioned on the other
thread.

-tkc
 

MRAB

Let's see how ready Python is for the next Unicode version
(Unicode 7.0.0.Beta).

timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[1.4027834829454946, 1.38714224331963, 1.3822586635296261]
timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[5.462776291480395, 5.4479432055423445, 5.447874284053398]


# more interesting
timeit.repeat("(x*1000 + y)[:-1]",
    setup="x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')")
[1.3496489533188765, 1.328654286266783, 1.3300913977710707]
Although the third example is the fastest, it's also the wrong way to
handle Unicode:
x = 'abc'.encode('utf-8'); y = '\u0fce'.encode('utf-8')
t = (x*1000 + y)[:-1].decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position
3000-3001: unexpected end of data
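
Restated as code, the safe pattern is to manipulate the str and encode
once at the output boundary -- a minimal sketch:

>>> x = 'abc'; y = '\u0fce'
>>> t = (x*1000 + y)[:-1]        # slicing the str drops one *character*
>>> data = t.encode('utf-8')     # encode once, at the boundary
>>> data.decode('utf-8') == t    # round-trips cleanly
True
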
Note 1: "lookup" is not the problem.

Note 2: From Unicode.org : "[...] We strongly encourage [...] and test
them with their programs [...]"

-> Done.

jmf
 

wxjmfauth

@ Time Chase

I'm perfectly aware of what I'm doing.


@ MRAB

"...Although the third example is the fastest, it's also the wrong
way to handle Unicode: ..."

Maybe it's exactly the opposite. It illustrates very well
the quality of the coding schemes endorsed by Unicode.org.
I deliberately chose utf-8.


Q. How do you save memory without wasting time on encoding?
By using products that natively use the Unicode coding schemes?

Do you understand Unicode? Or do you understand
Unicode only via Python?

---

A Tibetan monk [*] using Py32:
timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = 'z'")
[2.3394840182882186, 2.3145832750782653, 2.3207231951529685]
timeit.repeat("(x*1000 + y)[:-1]", setup="x = 'abc'; y = '\u0fce'")
[2.328517624800078, 2.3169403900011076, 2.317586282812048]

[*] Your curiosity has certainly shown you what this code point means.
For the others:
U+0FCE TIBETAN SIGN RDEL NAG RDEL DKAR
signifies good luck earlier, bad luck later
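
The name is easy to verify with the stdlib (a quick check, any Python 3.x):

>>> import unicodedata
>>> unicodedata.name('\u0fce')
'TIBETAN SIGN RDEL NAG RDEL DKAR'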


(My comment: Good luck with Python or bad luck with Python)

jmf
 

Tim Chase

@ Time Chase

I'm perfectly aware of what I'm doing.

Apparently, you're quite adept at appending superfluous characters to
sensible strings...did you benchmark your email composition, too? ;-)

-tkc (aka "Tim", not "Time")
 

Steven D'Aprano

<snipped>

Since it's Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html


I disagree with much of your characterisation of the Unix assumption, and
I point out that out of the two most widespread flavours of OS today,
Linux/Unix and Windows, it is *Windows* and not Unix which still
regularly uses legacy encodings.

Also your link to Joel On Software mistakenly links to me instead of Joel.

There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.

I didn't notice any other typos.
 

wxjmfauth

On Wednesday, 30 April 2014 20:48:48 UTC+2, Tim Chase wrote:
Apparently, you're quite adept at appending superfluous characters to
sensible strings...did you benchmark your email composition, too? ;-)

-tkc (aka "Tim", not "Time")

Mea culpa, ...
 

Rustom Mody

Also your link to Joel On Software mistakenly links to me instead of Joel.
There's a missing apostrophe in "Ive" [sic] in Acknowledgment #2.

Done, Done.
I didn't notice any other typos.

Thank you sir!
I point out that out of the two most widespread flavours of OS today,
Linux/Unix and Windows, it is *Windows* and not Unix which still
regularly uses legacy encodings.

Not sure what you are suggesting...
That (I am suggesting that) 8859 is legacy and 1252 is not?
I disagree with much of your characterisation of the Unix assumption,

I'd be interested to know the details -- Contents? Details? Tone? Tenor? Blaspheming the sacred scripture?
(if you are so inclined of course)
 

Terry Reedy

I will not comment on the Unix-assumption part, but I think you go wrong
with this: "Unicode is a Headache". The major headache is that unicode
and its very few encodings are not universally used. The headache is all
the non-unicode legacy encodings still being used. So you'd better title
this section 'Non-Unicode is a Headache'.

The first sentence is this misleading tautology: "With ASCII, data is
ASCII whether its file, core, terminal, or network; ie "ABC" is
65,66,67." Let me translate: "If all text is ASCII encoded, then text
data is ASCII, whether ..." But it was never the case that all text was
ASCII encoded. IBM used 6-bit BCDIC and then 8-bit EBCDIC and I believe
still uses the latter. Other mainframe makers used other encodings of
A-Z + 0-9 + symbols + control codes. The all-ASCII paradise was never
universal. You could have just as well said "With EBCDIC, data is
EBCDIC, whether ..."

https://en.wikipedia.org/wiki/Ascii
https://en.wikipedia.org/wiki/EBCDIC

A crucial step in the spread of Ascii was its use for microcomputers,
including the IBM PC. The latter was considered a toy by the mainframe
guys. If they had known that PCs would partly take over the computing
world, they might have suggested or insisted that it use EBCDIC.

"With unicode there are:
encodings"
where 'encodings' is linked to
https://en.wikipedia.org/wiki/Character_encodings_in_HTML

If html 'always' used utf-8 (like xml), as has become common but not
universal, all of the problems with *non-unicode* character sets and
encodings would disappear. The pre-unicode declarations could then
disappear. More truthful: "without unicode there are 100s of encodings
and with unicode only 3 that we should worry about."
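
Those three -- presumably the UTF-8, UTF-16 and UTF-32 encoding forms --
in code (a sketch; outputs from CPython 3.x):

>>> s = 'ABC\u0fce'
>>> s.encode('utf-8')
b'ABC\xe0\xbf\x8e'
>>> s.encode('utf-16-le')
b'A\x00B\x00C\x00\xce\x0f'
>>> s.encode('utf-32-le')
b'A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00\xce\x0f\x00\x00'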

"in-memory formats"

These are not the concern of the using programmer as long as they do not
introduce bugs or limitations (as do all the languages stuck on UCS-2
and many using UTF-16, including old Python narrow builds). Using what
should generally be the universal transmission format, UTF-8, as the
internal format means either losing indexing and slicing, slowing those
operations from O(1) to O(len(string)), or adding an index table that is
not part of the unicode standard. Using UTF-32 avoids the above but
usually wastes space -- up to 75%.
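
To make the indexing cost concrete: with a UTF-8 internal format, finding
the i-th character means scanning past every earlier byte. A minimal
sketch (the helper is illustrative, not any real library API):

def utf8_index(data, i):
    # O(len(data)) scan: skip continuation bytes (0b10xxxxxx) to find
    # where the i-th code point starts, then decode just that character.
    count = -1
    for pos in range(len(data)):
        if data[pos] & 0xC0 != 0x80:      # a new code point starts here
            count += 1
            if count == i:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode('utf-8')
    raise IndexError(i)

>>> utf8_index('abc\u0fce'.encode('utf-8'), 3) == '\u0fce'
True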

"strange beasties like python's FSR"

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
is an *internal optimization* that benefits most unicode operations that
people actually perform. It uses UTF-32 by default but adapts to the
strings users create by compressing the internal format. The compression
is trivial -- simply dropping leading null bytes common to all
characters -- so each character is still readable as is. The string
header records how many bytes are left. Is the idea of algorithms that
adapt to inputs really strange to you?

Like good adaptive algorithms, the FSR is invisible to the user except
for reducing space or time or maybe both. Unicode operations are
otherwise the same as with previous wide builds. People who used to use
narrow-builds also benefit from bug elimination. The only 'headaches'
involved might have been those of the developers who optimized previous
wide builds.

CPython has many other functions with special-case optimizations and
'fast paths' for common, simple cases. For instance, (some? all?) number
operations are optimized for pairs of integers. Do you call these
'strange beasties'?

PyPy is faster than CPython, when it is, because it is even more
adaptable to particular computations by creating new fast paths. The
mechanism to create these 'strange beasties' might have been a headache
for the writers, but when it works, which it now seems to, it is not for
the users.
 

MRAB

I will not comment on the Unix-assumption part, but I think you go wrong
with this: "Unicode is a Headache". The major headache is that unicode
and its very few encodings are not universally used. The headache is all
the non-unicode legacy encodings still being used. So you'd better title
this section 'Non-Unicode is a Headache'.
[snip]
I think he's right when he says "Unicode is a headache", but only
because it's being used to handle languages which are, themselves, a
"headache": left-to-right versus right-to-left, sometimes on the same
line; diacritics, possibly several on a glyph; etc.
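
The "several diacritics on a glyph" point in code (a sketch using the
stdlib unicodedata module):

>>> import unicodedata
>>> s = 'a\u0301\u0308'    # 'a' + COMBINING ACUTE ACCENT + COMBINING DIAERESIS
>>> len(s)                 # one user-perceived glyph, three code points
3
>>> [unicodedata.combining(c) for c in s]
[0, 230, 230]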
 

Rustom Mody

[snip]
I think he's right when he says "Unicode is a headache", but only
because it's being used to handle languages which are, themselves, a
"headache": left-to-right versus right-to-left, sometimes on the same
line; diacritics, possibly several on a glyph; etc.

Yes, the headaches go a little further back than Unicode.
There is a certain large old book...
In which is described the building of a 'tower that reached up to heaven'....

At which point 'it was decided'¶ to do something to prevent that.

And our headaches started.

I don't know how one causally connects the 'headaches' but I've seen
- mojibake
- unicode 'number-boxes' (what are these called?)
- Worst of all, what we *don't* see -- how many others don't see what we see?

I never knew of any of this in the good ol' days of ASCII

¶ Passive voice is often the best choice in the interests of political correctness

It would be a pleasant surprise if everyone sees a pilcrow at the start of the line above
 

Rustom Mody

On 5/1/2014 2:04 PM, Rustom Mody wrote:
I will not comment on the Unix-assumption part, but I think you go wrong
with this: "Unicode is a Headache". The major headache is that unicode
and its very few encodings are not universally used. The headache is all
the non-unicode legacy encodings still being used. So you'd better title
this section 'Non-Unicode is a Headache'.
[snip]
CPython has many other functions with special-case optimizations and
'fast paths' for common, simple cases. For instance, (some? all?) number
operations are optimized for pairs of integers. Do you call these
'strange beasties'?

Here is an instance of someone who would like a certain optimization to be
dis-able-able

https://mail.python.org/pipermail/python-list/2014-February/667169.html

To the best of my knowledge it's nothing to do with unicode or with jmf.

If optimizations are always desirable, why do C compilers have
-O0, -O1, -O2, -O3 and zillions of more specific flags?

JFTR I have no issue with FSR. What we have to hand to jmf - willingly
or otherwise - is that many more people have heard of FSR thanks to him. [I am one of them]

I don't even know whether jmf has a real
technical (as he calls it, 'mathematical') issue or whether it's entirely political:

"Why should I pay more for a EURO sign than a $ sign?"

Well perhaps that is more related to the exchange rate than to python!
 

Rustom Mody

"Why should I pay more for a EURO sign than a $ sign?"

A unicode 'headache' there:
I typed the Euro sign (trying again € ) not EURO

Somebody -- I guess it's GG in overhelpful mode -- converted it
And made my post:
Content-Type: text/plain; charset=ISO-8859-1

Will some devanagari vowels help it stop being helpful?
अ आ इ ई उ ऊ ठà¤
 

Rustom Mody

Rustom Mody writes:
Okay, so can you change your article to reflect the fact that the
headaches both pre-date Unicode, and are made much easier by Unicode?

Predate: Yes
Made easier: No
Ah yes, the neo-Sumerian story "Enmerkar_and_the_Lord_of_Aratta"
<URL:https://en.wikipedia.org/wiki/Enmerkar_and_the_Lord_of_Aratta>.
Probably inspired by stories older than that, of course.

Thanks for that link
And other myths with fantastic reasons for the diversity of language
<URL:https://en.wikipedia.org/wiki/Mythical_origins_of_language>.

This one takes the cake - see 1st para
http://hilgart.org/enformy/BronsonRekindling.pdf

Yes, by ignoring all other writing systems except one's own - and
thereby excluding most of the world's people - the system can be made
simpler.
Hopefully the proportion of programmers who still feel they can make
such a parochial choice is rapidly shrinking.

See link above: Ethnic differences and chauvinism are invariably linked
 

Chris Angelico

Here is an instance of someone who would like a certain optimization to be
dis-able-able

https://mail.python.org/pipermail/python-list/2014-February/667169.html

To the best of my knowledge it's nothing to do with unicode or with jmf.

It doesn't, and it has only to do with testing. I've had similar
issues at times; for instance, trying to benchmark one language or
language construct against another often means fighting against an
optimizer. (How, for instance, do you figure out what loop overhead
is, when an empty loop is completely optimized out?) This is nothing
whatsoever to do with Unicode, nor to do with the optimization that
Python and Pike (and maybe other languages) do with the storage of
Unicode strings.

ChrisA
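
A sketch of the workaround Chris alludes to: in CPython, timeit can
measure the empty loop directly, precisely because the interpreter does
not optimize it away (a compiler with dead-code elimination would report
roughly zero here):

import timeit

# Time an empty loop to estimate pure loop overhead...
overhead = timeit.timeit('for _ in range(1000): pass', number=10000)
# ...then subtract it from a loop that does real work.
work = timeit.timeit('for _ in range(1000): x = _ + 1', number=10000)
print('overhead: %.4fs, work-only estimate: %.4fs' % (overhead, work - overhead))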
 

Steven D'Aprano

"strange beasties like python's FSR"

Have you really let yourself be poisoned by JMF's bizarre rants? The FSR
is an *internal optimization* that benefits most unicode operations that
people actually perform. It uses UTF-32 by default but adapts to the
strings users create by compressing the internal format. The compression
is trivial -- simply dropping leading null bytes common to all
characters -- so each character is still readable as is.

For anyone who, like me, wasn't convinced that Unicode worked that way,
you can see for yourself that it does. You don't need Python 3.3, any
version of 3.x will work. In Python 2.7, it should work if you just
change the calls from "chr()" to "unichr()":

py> for i in range(256):
....     c = chr(i)
....     u = c.encode('utf-32-be')
....     assert u[:3] == b'\0\0\0'
....     assert u[3:] == c.encode('latin-1')
....
py> for i in range(256, 0xFFFF+1):
....     if 0xD800 <= i <= 0xDFFF:
....         continue  # lone surrogates can't be encoded in Python 3
....     c = chr(i)
....     u = c.encode('utf-32-be')
....     assert u[:2] == b'\0\0'
....     assert u[2:] == c.encode('utf-16-be')
....
py>


So Terry is correct: dropping leading zeroes, and treating the remainder
as either Latin-1 or UTF-16, works fine, and potentially saves a lot of
memory.
 

Rustom Mody

It doesn't, and it has only to do with testing. I've had similar
issues at times; for instance, trying to benchmark one language or
language construct against another often means fighting against an
optimizer. (How, for instance, do you figure out what loop overhead
is, when an empty loop is completely optimized out?) This is nothing
whatsoever to do with Unicode, nor to do with the optimization that
Python and Pike (and maybe other languages) do with the storage of
Unicode strings.

This was said in response to Terry's
CPython has many other functions with special-case optimizations and
'fast paths' for common, simple cases. For instance, (some? all?) number
operations are optimized for pairs of integers. Do you call these
'strange beasties'?

which evidently vanished -- optimized out :D -- in multiple levels of quoting
 

Terry Reedy

[snip]
I think he's right when he says "Unicode is a headache", but only
because it's being used to handle languages which are, themselves, a
"headache": left-to-right versus right-to-left, sometimes on the same
line;

Handling that without unicode is even worse.
diacritics, possibly several on a glyph; etc.

Ditto.
 

Rustom Mody

On 5/1/2014 2:04 PM, Rustom Mody wrote:
Since it's Unicode-troll time, here's my contribution
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
[snip]
I think he's right when he says "Unicode is a headache", but only
because it's being used to handle languages which are, themselves, a
"headache": left-to-right versus right-to-left, sometimes on the same
line;
Handling that without unicode is even worse.

What's the best cure for a headache?

Cut off the head.

What's the best cure for Unicode?

Ascii

Saying, however, that there is no headache in unicode does not make the
headache go away:

http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

No, I am not saying that the contents/style/tone are right.
However, people are evidently suffering through the transition.
Denying it does not help.

And the Unicode consortium's ways are not exactly helpful to its own cause:
imagine the C standards committee deciding that adding mandatory garbage
collection to C is a neat idea.

The Unicode consortium's going from the old BMP to the current (6.0) SMPs
to who-knows-what in the future is similar.
 
