A few questions about encoding

Mark Lawrence · Jun 20, 2013

On Thursday, 20 June 2013 at 13:43:28 UTC+2, MRAB wrote:

And all these coding schemes have something in common:
they all work with a unique set of code points, more
precisely a unique set of encoded code points (not
the set of implemented code points (bytes)).

That is just what the flexible string representation is not
doing: it artificially divides Unicode into subsets and tries
to handle each subset differently.

On the other side, it is because it is impossible to
work properly with multiple sets of encoded code points
that all these coding schemes exist today. There is simply
no other way.

Even "exotic" schemes like "CID-fonts" used in PDF
are based on that scheme.

jmf

I entirely agree with the viewpoints of jmfauth, Nick the Greek, rr,
Xah Lee and Ilias Lazaridis on the grounds that disagreeing and stating
my beliefs ends up with the Python Mailing List police standing on my
back doorstep. Give me the NSA or GCHQ any day of the week.

--
"Steve is going for the pink ball - and for those of you who are
watching in black and white, the pink is next to the green." Snooker
commentator 'Whispering' Ted Lowe.

Mark Lawrence

wxjmfauth · Jun 23, 2013

On Thursday, 20 June 2013 at 19:17:12 UTC+2, MRAB wrote:

UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!
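
To make the 1-to-4-byte claim quoted above concrete, here is a minimal sketch that encodes one sample code point from each UTF-8 length class. The sample characters and the name `utf8_lengths` are chosen only for illustration and do not come from the thread.

```python
# One sample code point per UTF-8 length class (samples chosen for illustration).
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji

# Map each character to the number of bytes its UTF-8 encoding occupies.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s) in UTF-8")
```

The ASCII letter needs a single byte, and the other three samples need two, three and four bytes respectively.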

Sorry, but no, it does not work that way:
this confuses the set of encoded code points
with their implementation, called code units.

utf-8: how many bytes to hold an "a" in memory?
one byte.

flexible string representation: how many bytes to
hold an "a" in memory? One byte? No, two.
(Funny, it consumes more memory to hold an ascii char
than ascii itself)

utf-8: In a series of bytes implementing the encoded code
points supposed to hold a string, picking a byte and
finding to which encoded code point it belongs is no problem.

flexible string representation: In a series of bytes
implementing the encoded code points supposed to hold a
string, picking a byte and finding to which encoded code
point it belongs is ... impossible !

This is one of the causes of the poor behaviour of this flexible
string representation.

The basics of any coding scheme, unicode included.

jmf

Steven D'Aprano · Jun 23, 2013

utf-8: how many bytes to hold an "a" in memory? one byte.

flexible string representation: how many bytes to hold an "a" in memory?
One byte? No, two. (Funny, it consumes more memory to hold an ascii char
than ascii itself)

Incorrect. Python strings have overhead because they are objects, so
let's see the difference adding a single character makes:

# Python 3.3, with the hated flexible string representation:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
1

# Python 3.2:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
4

How about a French é character? Of course, ASCII cannot store it *at
all*, but let's see what Python can do:

# The hated Python 3.3 again:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
1

# And Python 3.2:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
4

utf-8: In a series of bytes implementing the encoded code points
supposed to hold a string, picking a byte and finding to which encoded
code point it belongs is no problem.

Incorrect. UTF-8 is unsuitable for random access, since it has variable-
width characters, anything from 1 to 4 bytes. So you cannot just jump
directly to character 1000 in a block of text, you have to inspect each
byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.
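
A rough sketch of what that scan looks like, assuming the input is valid UTF-8 and the index is in range. The helper names `utf8_char_width` and `utf8_offset_of` are made up for this example and are not part of any library:

```python
def utf8_char_width(lead_byte: int) -> int:
    """Return the length of a UTF-8 sequence, judged from its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII, one byte
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_offset_of(data: bytes, index: int) -> int:
    """Byte offset of character number `index`, found by walking from the start."""
    offset = 0
    for _ in range(index):
        offset += utf8_char_width(data[offset])
    return offset


# Five characters with UTF-8 widths 1, 2, 3, 4 and 1.
data = "a\u00e9\u20ac\U0001F600z".encode("utf-8")
offset = utf8_offset_of(data, 4)
print(offset, data[offset:offset + 1])
```

The loop runs once per preceding character, so reaching character 1000 costs about 1000 steps, whereas a fixed-width layout computes the offset with one multiplication.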

flexible string representation: In a series of bytes implementing the
encoded code points supposed to hold a string, picking a byte and
finding to which encoded code point it belongs is ... impossible !

Incorrect. It is absolutely trivial. Each string is marked as either 1-
byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one
character. If it is a 2-byte string, then it is just like Python 3.2
narrow build, and each two bytes is a character. If it is a 4-byte
string, then it is just like Python 3.2 wide build, and each four bytes
is a character. Within a single string, the number of bytes per character
is fixed, and random access is easy and fast.
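
The same `sys.getsizeof` trick used earlier can be extended to one sample character per storage width. This is only a sketch: it assumes CPython 3.3 or later, the sample characters are my own choice, and `sys.getsizeof` results are an implementation detail that other interpreters need not reproduce.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Extra memory used by one more copy of `ch` in a long string."""
    return sys.getsizeof(ch * 100) - sys.getsizeof(ch * 99)


# Hypothetical sample set: ASCII, Latin-1 range, rest of the BMP, astral plane.
samples = {
    "ascii": "a",
    "latin-1": "\u00e9",
    "bmp": "\u20ac",
    "astral": "\U0001F600",
}

widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
print(widths)
```

On CPython 3.3 and later I would expect 1, 1, 2 and 4 bytes respectively, matching the 1-byte, 2-byte and 4-byte string kinds described above.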

wxjmfauth · Jun 25, 2013

On Sunday, 23 June 2013 at 18:30:40 UTC+2, Steven D'Aprano wrote:

Incorrect. Python strings have overhead because they are objects, so
let's see the difference adding a single character makes:

# Python 3.3, with the hated flexible string representation:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
1

# Python 3.2:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
4

How about a French é character? Of course, ASCII cannot store it *at
all*, but let's see what Python can do:

# The hated Python 3.3 again:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
1

# And Python 3.2:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
4

Incorrect. UTF-8 is unsuitable for random access, since it has variable-
width characters, anything from 1 to 4 bytes. So you cannot just jump
directly to character 1000 in a block of text, you have to inspect each
byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.

Incorrect. It is absolutely trivial. Each string is marked as either 1-
byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one
character. If it is a 2-byte string, then it is just like Python 3.2
narrow build, and each two bytes is a character. If it is a 4-byte
string, then it is just like Python 3.2 wide build, and each four bytes
is a character. Within a single string, the number of bytes per character
is fixed, and random access is easy and fast.
