A few questions about encoding

  • Thread starter Νικόλαος Κούρας
  • Start date

Mark Lawrence

On Thursday, 20 June 2013 at 13:43:28 UTC+2, MRAB wrote:

And all these coding schemes have something in common:
they all work with a unique set of code points, more
precisely a unique set of encoded code points (not
the set of implemented code points (bytes)).

That is just what the flexible string representation is not
doing: it artificially divides Unicode into subsets and tries
to handle each subset differently.

On the other hand, it is because it is impossible to
work properly with multiple sets of encoded code points
that all these coding schemes exist today. There is simply
no other way.

Even "exotic" schemes like "CID-fonts" used in PDF
are based on that scheme.

jmf

I entirely agree with the viewpoints of jmfauth, Nick the Greek, rr,
Xah Lee and Ilias Lazaridis on the grounds that disagreeing and stating
my beliefs ends up with the Python Mailing List police standing on my
back doorstep. Give me the NSA or GCHQ any day of the week :(

--
"Steve is going for the pink ball - and for those of you who are
watching in black and white, the pink is next to the green." Snooker
commentator 'Whispering' Ted Lowe.

Mark Lawrence
 
wxjmfauth

On Thursday, 20 June 2013 at 19:17:12 UTC+2, MRAB wrote:
UTF-8 divides Unicode into subsets which are encoded in 1, 2, 3, or 4
bytes, and those who previously used ASCII still need only 1 byte per
codepoint!

Sorry, but no, it does not work that way:
you are confusing the set of encoded code points
with their implementation, the so-called code units.

utf-8: how many bytes to hold an "a" in memory?
one byte.

flexible string representation: how many bytes to
hold an "a" in memory? One byte? No, two.
(Funny, it consumes more memory to hold an ascii char
than ascii itself)


utf-8: In a series of bytes implementing the encoded code
points supposed to hold a string, picking a byte and
finding to which encoded code point it belongs is no problem.

flexible string representation: In a series of bytes
implementing the encoded code points supposed to hold a
string, picking a byte and finding to which encoded code
point it belongs is ... impossible !

One of the causes of the poor behaviour of this flexible string
representation.

The basics of any coding scheme, unicode included.

jmf
 
Steven D'Aprano

utf-8: how many bytes to hold an "a" in memory? one byte.

flexible string representation: how many bytes to hold an "a" in memory?
One byte? No, two. (Funny, it consumes more memory to hold an ascii char
than ascii itself)

Incorrect. Python strings have overhead because they are objects, so
let's see the difference adding a single character makes:

# Python 3.3, with the hated flexible string representation:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
1

# Python 3.2:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
4


How about a French é character? Of course, ASCII cannot store it *at
all*, but let's see what Python can do:


# The hated Python 3.3 again:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
1


# And Python 3.2:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
4
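
For anyone who wants to repeat the measurement, here is a minimal sketch that wraps the same subtraction in a helper and tries characters of different widths. The helper name `bytes_per_char` and the sample characters are illustrative choices, not part of the original session, and the result depends on the interpreter: the 1, 1, 2, 4 pattern is what I would expect only on CPython 3.3 or later.

```python
import sys

def bytes_per_char(ch, n=100):
    """Marginal cost in bytes of one more copy of ch in a string of n copies."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# Hypothetical sample set: one character from each storage width.
samples = [
    ("ASCII 'a'", "a"),
    ("Latin-1 e-acute", "\u00e9"),
    ("BMP euro sign", "\u20ac"),
    ("astral U+1F600", "\U0001F600"),
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

Subtracting two sizes cancels the fixed per-object overhead, which is why the difference isolates the cost of a single character.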


utf-8: In a series of bytes implementing the encoded code points
supposed to hold a string, picking a byte and finding to which encoded
code point it belongs is no problem.

Incorrect. UTF-8 is unsuitable for random access, since it has variable-
width characters, anything from 1 to 4 bytes. So you cannot just jump
directly to character 1000 in a block of text, you have to inspect each
byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.
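
To make the cost concrete, here is a rough sketch of what "jump to character N" looks like on raw UTF-8 bytes. The function names `utf8_char_width` and `utf8_offset_of_char` are made up for this illustration, and the code assumes well-formed UTF-8 input; it is not how any particular library does it.

```python
def utf8_char_width(lead_byte):
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % lead_byte)

def utf8_offset_of_char(data, index):
    """Byte offset of the index-th code point, found by a linear scan."""
    offset = 0
    for _ in range(index):
        # Each step must read a lead byte to learn how far to skip.
        offset += utf8_char_width(data[offset])
    return offset

text = "a\u00e9\u20ac\U0001F600z"      # widths 1, 2, 3, 4, 1
data = text.encode("utf-8")
print(len(data))                       # total bytes: 1 + 2 + 3 + 4 + 1
print(utf8_offset_of_char(data, 4))    # offset of the final 'z'
```

The loop runs once per preceding character, so reaching character 1000 means reading 1000 lead bytes first.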

flexible string representation: In a series of bytes implementing the
encoded code points supposed to hold a string, picking a byte and
finding to which encoded code point it belongs is ... impossible !

Incorrect. It is absolutely trivial. Each string is marked as either 1-
byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one
character. If it is a 2-byte string, then it is just like Python 3.2
narrow build, and each two bytes is a character. If it is a 4-byte
string, then it is just like Python 3.2 wide build, and each four bytes
is a character. Within a single string, the number of bytes per character
is fixed, and random access is easy and fast.
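
Here is a small model of that idea, offered as a sketch only. The real buffer is not reachable from pure Python, so this stands in for it by encoding the text as Latin-1, UTF-16-LE or UTF-32-LE depending on the largest code point; the names `fixed_width_buffer`, `char_at` and `char_index_of_byte` are hypothetical, and lone surrogates are ignored.

```python
def fixed_width_buffer(text):
    """Return (buffer, width): every character takes exactly `width` bytes."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        return text.encode("latin-1"), 1
    if largest < 0x10000:
        return text.encode("utf-16-le"), 2   # BMP only, so no surrogate pairs
    return text.encode("utf-32-le"), 4

def char_at(buf, width, index):
    """Random access: the character starts at byte index * width."""
    start = index * width
    return chr(int.from_bytes(buf[start:start + width], "little"))

def char_index_of_byte(byte_offset, width):
    """Which character a given byte belongs to: plain integer division."""
    return byte_offset // width

for text in ["abc", "caf\u00e9", "a\u20acb", "a\U0001F600b"]:
    buf, width = fixed_width_buffer(text)
    print(width, [char_at(buf, width, i) == text[i] for i in range(len(text))])
```

Both directions, index to byte and byte to index, are a single multiplication or division, with no scanning.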
 
wxjmfauth

On Sunday, 23 June 2013 at 18:30:40 UTC+2, Steven D'Aprano wrote:
Incorrect. Python strings have overhead because they are objects, so
let's see the difference adding a single character makes:

# Python 3.3, with the hated flexible string representation:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
1

# Python 3.2:
py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
4

How about a French é character? Of course, ASCII cannot store it *at
all*, but let's see what Python can do:

# The hated Python 3.3 again:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
1

# And Python 3.2:
py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
4

Incorrect. UTF-8 is unsuitable for random access, since it has variable-
width characters, anything from 1 to 4 bytes. So you cannot just jump
directly to character 1000 in a block of text, you have to inspect each
byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.

Incorrect. It is absolutely trivial. Each string is marked as either 1-
byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one
character. If it is a 2-byte string, then it is just like Python 3.2
narrow build, and each two bytes is a character. If it is a 4-byte
string, then it is just like Python 3.2 wide build, and each four bytes
is a character. Within a single string, the number of bytes per character
is fixed, and random access is easy and fast.

:)
 
