Flexible string representation, unicode, typography, ...


wxjmfauth

This is neither a complaint nor a question, just a comment.

In the previous discussion related to the flexible
string representation, Roy Smith added this comment:

http://groups.google.com/group/comp...read/thread/2645504f459bab50/eda342573381ff42

I agree with his sentence:
"Clearly, the world has moved to a 32-bit character set."

Moreover, he used in his comment a very interesting word: "punctuation".

There is a point which is, in my mind, not very well understood,
"digested", underestimated or neglected by many developers:
the relation between the encoding of characters and typography.

Unicode (the consortium) does not only deal with the encoding of
characters; it has also worked on their *classification*.

A deliberately simplistic picture: "letters" sit at the bottom
of the table, at low code points/integers; "typographic characters"
like punctuation, common symbols, ... sit high in the table, at high
code points/integers.

The conclusion is inescapable: if one wishes to work in a "unicode
mode", one is forced to use the whole palette of Unicode
code points; this is the *nature* of Unicode.

Technically, believing that it is possible to optimize only a subrange
of the Unicode code point range is simply an illusion: a lot of
work, probably quite complicated, which in the end solves nothing.

Python, in my mind, fell into this trap.

"Simple is better than complex."
-> hard to maintained
"Flat is better than nested."
-> code points range
"Special cases aren't special enough to break the rules."
-> special unicode code points?
"Although practicality beats purity."
-> or the opposite?
"In the face of ambiguity, refuse the temptation to guess."
-> guessing a user will only work with the "optimmized" char subrange.
....

Small illustration: take an A4 page containing 50 lines of 80 ASCII
characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
and you will see all the optimization efforts destroyed.
8040
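
A minimal sketch of this illustration (that the figure is a
sys.getsizeof measurement is my assumption; the exact byte counts
depend on the CPython version and build):

import sys

ascii_page = 'a' * 80 * 50         # 50 lines of 80 ASCII characters
typo_page = ascii_page + '\u2014'  # add a single EM DASH (U+2014 > 0xFF)

# The one non-latin-1 character pushes the whole string from the
# 1-byte-per-character form into the 2-bytes-per-character form,
# roughly doubling its memory footprint.
print(sys.getsizeof(ascii_page))
print(sys.getsizeof(typo_page))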

Just my 2 € (code point 0x20ac) cents.

jmf
 

Mark Lawrence


I'm looking forward to all the patches you are going to provide to
correct all these (presumably) CPython deficiencies. When do they start
arriving on the bug tracker?
 

MRAB

Neil Hodgson wrote:

This example is still benefiting from shrinking the number of bytes
in half over using 32 bits per character as was the case with Python 3.2:

Perhaps the solution should've been to just switch between 2/4 bytes
instead of 1/2/4 bytes. :)
 

Ian Kelly

Perhaps the solution should've been to just switch between 2/4 bytes
instead of 1/2/4 bytes. :)

Why? You don't lose any complexity by doing that. I can see
arguments for 1/2/4 or for just 4, but I can't see any advantage of
2/4 over either of those.
 

wxjmfauth

On Thursday, 23 August 2012 at 15:57:50 UTC+2, Neil Hodgson wrote:

This example is still benefiting from shrinking the number of bytes
in half over using 32 bits per character as was the case with Python 3.2:
16036

Correct, but how many times does it happen?
Practically never.

In this unicode stuff, I'm fascinated by the obsession
to solve a problem which is, due to the nature of
Unicode, unsolvable.

For every optimization algorithm, for every code
point range you can optimize, it is always possible
to find a case breaking that optimization.

This quasi follows mathematical logic. To prove a law is valid,
you have to prove that all cases are valid. To prove a law is
invalid, it is enough to find one case that breaks it.

Sure, it is possible to optimize Unicode usage
by not using French characters, punctuation, mathematical
symbols, currency symbols, CJK characters...
(select the undesired characters here: http://www.unicode.org/charts/).

In that case, why use Unicode at all?
(A problem not specific to Python.)

jmf
 

Ian Kelly

Correct, but how many times does it happen?
Practically never.

What are you talking about? Surely it happens the same number of
times that your example happens, since it's the same example. By
dismissing this example as being too infrequent to be of any
importance, you dismiss the validity of your own example as well.

In this unicode stuff, I'm fascinated by the obsession
to solve a problem which is, due to the nature of
Unicode, unsolvable.

For every optimization algorithm, for every code
point range you can optimize, it is always possible
to find a case breaking that optimization.

So what? Similarly, for any generalized data compression algorithm,
it is possible to engineer inputs for which the "compressed" output is
as large as or larger than the original input (this is easy to prove).
Does this mean that compression algorithms are useless? I hardly
think so, as evidenced by the widespread popularity of tools like gzip
and WinZip.
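
A minimal sketch of that point (using zlib as an arbitrary example;
random data is essentially incompressible, so the "compressed" output
comes out slightly larger than the input):

import os
import zlib

data = os.urandom(100000)      # random bytes are essentially incompressible
packed = zlib.compress(data)

# The deflate container still adds a few bytes of overhead, so the
# "compressed" output is almost always a little larger than the input.
print(len(data), len(packed))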

You seem to be saying that because we cannot pack all Unicode strings
into 1-byte or 2-byte per character representations, we should just
give up and force everybody to use maximum-width representations for
all strings.  That is absurd.

Sure, it is possible to optimize Unicode usage
by not using French characters, punctuation, mathematical
symbols, currency symbols, CJK characters...
(select the undesired characters here: http://www.unicode.org/charts/).

In that case, why use Unicode at all?
(A problem not specific to Python.)

Obviously, it is because I want to have the *ability* to represent all
those characters in my strings, even if I am not necessarily going to
take advantage of that ability in every single string that I produce.
Not all of the strings I use are going to fit into the 1-byte or
2-byte per character representation. Fine, whatever -- that's part of
the cost of internationalization. However, *most* of the strings that
I work with (this entire email message, for instance) -- and, I think,
most of the strings that any developer works with (identifiers in the
standard library, for instance) -- will fit into at least the 2-byte
per character representation. Why shackle every string everywhere to
4 bytes per character when for a majority of them we can do much
better than that?
 

Mark Lawrence


What do you propose should be used instead, as you appear to be the
resident expert in the field?
 

Ramchandra Apte


The Zen of Python is simply a guideline.
 

rusi


Actually, what exactly are you (jmf) asking for?
It's not clear to anybody, as best as we can see...
 

Mark Lawrence

Actually, what exactly are you (jmf) asking for?
It's not clear to anybody, as best as we can see...

A knee in the temple and a dagger up the <censored> ? :) From another
Monty Python sketch for those who don't know.
 

Antoine Pitrou

Ramchandra Apte said:
The Zen of Python is simply a guideline.

What's more, the Zen guides the language's design, not its implementation.
People who think CPython is a complicated implementation can take a look at PyPy :)

Regards

Antoine.
 

wxjmfauth

On Saturday, 25 August 2012 at 02:24:35 UTC+2, Antoine Pitrou wrote:
What's more, the Zen guides the language's design, not its implementation.
People who think CPython is a complicated implementation can take a look at PyPy :)

Unicode design: a flat table of code points, where all code
points are "equal".
As soon as one attempts to escape from this rule, one has to
"pay" for it.
The creator of this machinery (the flexible string representation)
cannot even benefit from it in his native language (I think
I'm correctly informed).

Hint: Google -> "Das grosse Eszett"
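
A minimal sketch of the point (my own example; the byte counts vary by
build, what matters is which representation each string can use):

import sys

# U+00DF 'ß' is inside latin-1; U+1E9E 'ẞ', the capital Eszett, is not.
print(hex(ord('\u00df')), hex(ord('\u1e9e')))   # 0xdf 0x1e9e

lower = 'eine grosse stra\u00dfe'   # stays in the 1-byte representation
upper = lower + ' \u1e9e'           # one code point above 0xFF ...
print(sys.getsizeof(lower), sys.getsizeof(upper))
# ... and the whole string switches to 2 bytes per character.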

jmf
 
 

Mark Lawrence


It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
still baffled as to the point, if any. Could someone please enlighten me?
 

Frank Millman

It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
still baffled as to the point, if any. Could someone please enlighten me?

Here's what I think he is saying. I am posting this to test the water. I
am also confused, and if I have got it wrong hopefully someone will
correct me.

In Python 3.3, Unicode strings are now stored as follows -
if all characters can be represented by 1 byte, the entire string is
composed of 1-byte characters;
else, if all characters can be represented by 1 or 2 bytes, the entire
string is composed of 2-byte characters;
else, the entire string is composed of 4-byte characters.

There is an overhead in making this choice, to detect the lowest number
of bytes required.
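
A rough sketch of that choice (illustrative Python only, not CPython's
actual C code; the rule is simply the narrowest width that can hold the
largest code point in the string):

def bytes_per_char(s):
    # Pick the narrowest storage width that can hold every code point.
    if not s:
        return 1
    largest = max(ord(c) for c in s)
    if largest < 0x100:      # latin-1 range
        return 1
    if largest < 0x10000:    # Basic Multilingual Plane
        return 2
    return 4                 # astral (non-BMP) characters

print(bytes_per_char('abc'))              # 1
print(bytes_per_char('abc\u2014'))        # 2 (EM DASH)
print(bytes_per_char('abc\U0001F600'))    # 4 (an astral character)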

jmfauth believes that this only benefits 'english-speaking' users, as
the rest of the world will tend to have strings where at least one
character requires 2 or 4 bytes. So they incur the overhead, without
getting any benefit.

Therefore, I think he is saying that he would have preferred that python
standardise on 4-byte characters, on the grounds that the saving in
memory does not justify the performance overhead.

Frank Millman
 

Mark Lawrence


I thought Terry Reedy had shot down any claims about performance
overhead, and that the memory savings in many cases must be substantial
and therefore worthwhile. Or have I misread something? Or what?
 

Chris Angelico

I thought Terry Reedy had shot down any claims about performance overhead,
and that the memory savings in many cases must be substantial and therefore
worthwhile. Or have I misread something? Or what?

My reading of the thread(s) is that there are two reasons for the
debate to continue to rage:

1) Comparisons with a "narrow build" in which most characters take two
bytes but there are one or two characters that get encoded with
surrogates. The new system will allocate four bytes per character for
the whole string (see the sketch after this list).

2) Arguments on the basis of huge strings that represent _all the
data_ that your program's working with, forgetting that there are
numerous strings all through everything that are ASCII-only.
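
A minimal sketch of point 1 (the narrow-build results are quoted from
memory and cannot be reproduced on 3.3, so treat them as an assumption):

import sys

s = 'caf\u00e9 \U0001F600'   # five BMP characters plus one astral character

# Python 3.2 narrow build (from memory, an assumption): the astral
# character was stored as a surrogate pair, so len(s) was 7 and s[-1]
# was half of that pair rather than the character itself.

# Python 3.3 flexible representation: real code points, but the single
# astral character forces 4 bytes per character for the whole string.
print(len(s))                           # 6
print(s[-1] == '\U0001F600')            # True
print(sys.getsizeof(s))                 # 4 bytes per character throughout
print(sys.getsizeof('caf\u00e9 x'))     # same length, latin-1 only: far smaller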

ChrisA
 

Terry Reedy

I thought Terry Reedy had shot down any claims about performance
overhead, and that the memory savings in many cases must be substantial
and therefore worthwhile. Or have I misread something?

No, you have correctly read what I and others have said. Jim appears
not to be interested in dialog. Let's leave it at that.
 

wxjmfauth

On Saturday, 25 August 2012 at 11:46:34 UTC+2, Frank Millman wrote:

Very well explained. Thanks.

More precisely, it is not only the non-'English-speaking'
users who are affected, but all users who use non-latin-1 characters.
(See the title of this topic: ... typography.)

Being latin-1 and Unicode compliant at the same time is
a plain absurdity in the mathematical sense.

---

For those who do not know, the Go language has introduced
the rune type. As far as I know, nobody is complaining; I
have not even seen a discussion related to this subject.


100% Unicode compliant from day 0. Congratulations.

jmf
 
