Python 3.2 has some deadly infection

Rustom Mody

Combine that with Chris's:
Yes and no. "ASCII" means two things: Firstly, it's a mapping from the
letter A to the number 65, from the exclamation mark to 33, from the
backslash to 92, and so on. And secondly, it's an encoding of those
numbers into the lowest seven bits of a byte, with the high byte left
clear. Between those two, you get a means of representing the letter
'A' as the byte 0x41, and one of them is an encoding.
and the situation appears quite the opposite of Ethan's description:
In the 'old world', ASCII was both mapping and encoding, so there was never
a need to distinguish the encoding from the codepoint.
It is Unicode that demands these distinctions.
If we could magically go to a world where the number of bits in a byte was 32,
all this headache would go away. [Actually, just 21 is enough!]
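
A minimal Python 3 sketch of the mapping/encoding split being discussed
(the euro sign is just an arbitrary codepoint above the ASCII range):

>>> ord('A')               # mapping: symbol -> codepoint
65
>>> 'A'.encode('ascii')    # encoding: codepoint -> byte 0x41
b'A'
>>> ord('€')               # codepoint 8364 fits in 21 bits...
8364
>>> '€'.encode('utf-8')    # ...but needs three bytes in UTF-8
b'\xe2\x82\xac'
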
Chris Angelico said:
An ASCII mentality lets you be sloppy. That doesn't mean the
distinction doesn't exist. When I first started programming in C, int
was *always* 16 bits long and *always* little-endian (because I used
only one compiler). I could pretend that those bits in memory actually
were that integer, that there were no other ways that integer could be
encoded. That doesn't mean that encodings weren't important. And as
soon as I started working on a 32-bit OS/2 system, and my ints became
bigger, I had to concern myself with that. Even more so when I got
into networking, and byte order became important to me. And of course,
these days I work with integers that are encoded in all sorts of
different ways (a Python integer isn't just a puddle of bytes in
memory), and I generally let someone else take care of the details,
but the encodings are still there.
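
That byte-order concern is a one-liner to demonstrate with Python's struct
module (a minimal sketch; the value 1 is arbitrary):

>>> import struct
>>> struct.pack('<i', 1)   # 32-bit int, little-endian
b'\x01\x00\x00\x00'
>>> struct.pack('>i', 1)   # the same int in network (big-endian) order
b'\x00\x00\x00\x01'

Same integer, two different encodings.
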
ASCII was once your one companion, it was all that mattered. ASCII was
once a friendly encoding, then your world was shattered. Wishing it
were somehow here again, wishing it were somehow near... sometimes it
seemed, if you just dreamed, somehow it would be here! Wishing you
could use just bytes again, knowing that you never would... dreaming
of it won't help you to do all that you dream you could!
It's time to stop chasing the phantom and start living in the Raoul
world... err, the real world. :)

I thought that "If only bytes were 21+ bits wide" would sound sufficiently
nonsensical, that I did not need to explicitly qualify it as a utopian dream!
 
Marko Rauhamaa

Steven D'Aprano said:
A Unicode string as an abstract data type has no encoding.

Unicode itself is an encoding. See it in action here:

72 101 108 108 111 44 32 119 111 114 108 100
Steven D'Aprano also said:
It is a Platonic ideal, a pure form like the real numbers.

Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

The Unicode/ASCII encoding above represents the same "Platonic" string
as this EBCDIC one:

212 133 147 147 150 107 64 166 150 153 137 132
Steven D'Aprano also said:
A Unicode string like this:

s = u"NOBODY expects the Spanish Inquisition!"

should not be thought of as a bunch of bytes in some encoding,

Encoding is not tied to bytes or even computers. People can speak in
code, after all.
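
Python 3 can illustrate both mappings. A minimal sketch using the cp037
codec (one of many EBCDIC code pages, so a couple of its byte values differ
from the EBCDIC figures quoted above):

>>> [ord(c) for c in "Hello, world"]      # Unicode/ASCII codepoints
[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100]
>>> list("Hello, world".encode('cp037'))  # one EBCDIC variant
[200, 133, 147, 147, 150, 107, 64, 166, 150, 153, 147, 132]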


Marko
 
Chris Angelico

I thought that "If only bytes were 21+ bits wide" would sound sufficiently
nonsensical, that I did not need to explicitly qualify it as a utopian dream!

Humour never dies!

ChrisA
(In case it's not obvious, by the way, everything I said above is a
reference to the Phantom of the Opera.)
 
Marko Rauhamaa

Marko Rauhamaa said:
Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

Well, of course, even the symbols are a code. Letters code sounds and
digits code numbers.

And the sounds and numbers code ideas. Now we are getting close to being
truly Platonic.


Marko
 
Ned Batchelder

Marko Rauhamaa said:
Unicode itself is an encoding. See it in action here:

72 101 108 108 111 44 32 119 111 114 108 100


Far from it. It is a mapping from symbols to integers. The symbols are
the Platonic ones.

The Unicode/ASCII encoding above represents the same "Platonic" string
as this EBCDIC one:

212 133 147 147 150 107 64 166 150 153 137 132


Encoding is not tied to bytes or even computers. People can speak in
code, after all.

Marko, you are right about the broader English meaning of the word
"encoding". The original point here was that "Unicode text" provides no
information about what sequence of bytes is at work.

In the Unicode ecosystem, an encoding is a specification of how the text
will be represented in a byte stream. Saying something is "Unicode"
doesn't provide that information. You have to say "UTF-8" or "UTF-16" or
"UCS-2", etc., in order to know how bytes will be involved.

When Ethan said, "a Unicode string, as a data type, has no encoding," he
meant (as he explained) that a Unicode string doesn't require or imply
any particular mapping to bytes.
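
A minimal sketch of that point in Python 3 (the smiley is just an arbitrary
non-ASCII character): the same Unicode string yields different byte streams
depending on which encoding is named.

>>> s = "hi☺"
>>> s.encode('utf-8')        # one to three bytes per codepoint here
b'hi\xe2\x98\xba'
>>> s.encode('utf-16-le')    # two bytes per codepoint here
b'h\x00i\x00:&'
>>> s.encode('utf-32-le')    # four bytes per codepoint
b'h\x00\x00\x00i\x00\x00\x00:&\x00\x00'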

I'm sure you understand this, I'm just trying to clarify the different
meanings of the word "encoding".
 
wxjmfauth

On Friday, 6 June 2014 at 17:50:50 UTC+2, Chris Angelico wrote:
byte.) Unicode can't, because there are many different pros and cons
to the different encodings, and so we have UCS Transformation Formats
like UTF-8 and UTF-32. Each one is an encoding that maps a codepoint
to a sequence of bytes.

A big NO.

jmf
 
Chris Angelico

high BIT left clear.

That thing. Unless you have bytes inside bytes (byteception?), you'll
only have room for one high bit. Some day I'll get my brain and my
fingers to agree on everything we do... but that day is not today.
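
The corrected statement is easy to check in Python (a minimal sketch):

>>> all(byte < 0x80 for byte in "Hello, world".encode('ascii'))
True
>>> format(ord('A'), '08b')    # 0x41: the high bit is clear
'01000001'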

ChrisA
 
rurpy

[...]
But Linux Unicode support is much better than Windows. Unicode support in
Windows is crippled by continued reliance on legacy code pages, and by
the assumption deep inside the Windows APIs that Unicode means "16 bit
characters". See, for example, the amount of space spent on fixing
Windows Unicode handling here:

http://www.utf8everywhere.org/

While not disagreeing with the general premise of that page, it
has some problems that raise doubts in my mind about taking everything
the author says at face value.

For example

"Q: Why would the Asians give up on UTF-16 encoding, which saves
them 50% the memory per character?"
[...] in fact UTF-8 is used just as often in those [Asian] countries.

That is not my experience, at least for Japan. See my comments in
https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files
found by Google.

He then gives a table with the sizes of the UTF-8- and UTF-16-encoded
contents (i.e. stripped of HTML markup) of an unnamed Japanese Wikipedia
page, to show that even without a lot of (HTML-mandated) ASCII, the space
savings are not very much compared to the theoretical "50%" savings he
stated:

"           Dense text (Δ UTF-8)
 UTF-8  ... 222 KB (0%)
 UTF-16 ... 176 KB (−21%)"

Note that he calculates the space saving as (utf8-utf16)/utf8.
Yet by that metric the theoretical saving is *NOT* 50%, it is 33%.
For example 1000 Japanese characters will use 2000 bytes in utf16
and 3000 in utf8.

I did the same test using
http://ja.wikipedia.org/wiki/織田信長
I stripped HTML tags, JavaScript, and redundant ASCII whitespace characters.
The stripped UTF-8 file was 164946 bytes; the UTF-16 encoding of the same
was 117756 bytes. That gives (using the (utf8-utf16)/utf16 metric he used
to claim the 50% idealized savings) 40%, which is quite a bit closer to the
idealized 50% than his 21%.
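
Both metrics are a one-liner to check (a minimal sketch using the byte
counts above):

>>> u8, u16 = 164946, 117756          # stripped page, UTF-8 vs UTF-16
>>> round((u8 - u16) / u16, 3)        # savings relative to UTF-16
0.401
>>> round((u8 - u16) / u8, 3)         # his metric: relative to UTF-8
0.286
>>> (3000 - 2000) / 2000, (3000 - 2000) / 3000   # idealized 1000 kanji
(0.5, 0.3333333333333333)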

I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy. IOW, just because it's on the internet doesn't
mean it's true.
 
