RE Module Performance

  • Thread starter Devyn Collier Johnson
  • Start date
I

Ian Kelly

jmf's point is more about writing the editor widget (Scintilla, as
opposed to SciTE), which most people will never bother to do. I've
written several text editors, always by embedding someone else's
widget, and therefore not concerning myself with its internal string
representation. Frankly, Python's strings are a *terrible* internal
representation for an editor widget - not because of PEP 393, but
simply because they are immutable, and every keypress would result in
a rebuilding of the string. On the flip side, I could quite plausibly
imagine using a list of strings; whenever text gets inserted, the
string gets split at that point, and a new string created for the
insert (which also means that an Undo operation simply removes one
entire string). In this usage, the FSR is beneficial, as it's possible
to have different strings at different widths.
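
Roughly like this, as a sketch (the class and method names are purely illustrative):

class PieceBuffer:
    def __init__(self, text=""):
        self.pieces = [text] if text else []

    def insert(self, piece_index, offset, text):
        # Split the existing piece and put the new string between the halves.
        # Every string stays immutable; no keypress rebuilds the whole buffer.
        old = self.pieces[piece_index]
        self.pieces[piece_index:piece_index + 1] = [old[:offset], text, old[offset:]]

    def undo_insert(self, piece_index):
        # Undo simply removes one entire piece.
        del self.pieces[piece_index]

    def text(self):
        return "".join(self.pieces)

buf = PieceBuffer("hello world")
buf.insert(0, 5, ",")       # pieces become ["hello", ",", " world"]
print(buf.text())           # hello, world
buf.undo_insert(1)
print(buf.text())           # hello world

And since each piece is an ordinary immutable string, the FSR can give each piece its own width independently.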

But mainly, I'm just wondering how many people here have any basis
from which to argue the point he's trying to make. I doubt most of us
have (a) implemented an editor widget, or (b) tested multiple
different internal representations to learn the true pros and cons of
each. And even if any of us had, that still wouldn't have any bearing
on PEP 393, which is about applications, not editor widgets. As stated
above, Python strings before AND after PEP 393 are poor choices for an
editor, ergo arguing from that standpoint is pretty useless. Not that
that bothers jmf...

I think you've just motivated me to finally get around to writing the
custom output widget for my MUD client. Of course that will be
simpler than a standard rich text editor widget, since it will never
receive input from the user and modifications will (typically) always
come in the form of append operations. I intend to write it in pure
Python (well, wxPython), however.
 
S

Steven D'Aprano

Jeremy Sanders wrote:
"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example,
any ASCII character takes up only 1 byte, a Latin-1 character takes
up 2 bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs
doesn't really exist.

... lolwut?


JMF has explained that it is impossible, impossible I say!, to write an
editor using a flexible string representation. Since Emacs uses such a
flexible string representation, Emacs is impossible, and therefore
Emacs doesn't exist.

QED.

Except that the described representation used by Emacs is a variant of
UTF-8, not an FSR. It doesn't have three different possible encodings
for the letter 'a' depending on what other characters happen to be in
the string.

As I understand it, jmf would be perfectly happy if Python used UTF-8
(or presumably the Emacs variant) as its internal string representation.


UTF-8 uses a flexible representation on a character-by-character basis.
When parsing UTF-8, one needs to look at EVERY character to decide how
many bytes you need to read. In Python 3, the flexible representation is
on a string-by-string basis: once Python has looked at the string header,
it can tell whether the *entire* string takes 1, 2 or 4 bytes per
character, and the string is then fixed-width. You can't do that with
UTF-8.

To put it in terms of pseudo-code:

# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else:
        count = 4
    while not done:
        char = convert(next(count bytes))


# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char
        if b uses 1 byte:
            char = convert(b)
        elif b uses 2 bytes:
            char = convert(b, next(1 byte))
        elif b uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))

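A runnable sketch of that second loop, assuming well-formed input (the helper name is mine):

def utf8_chars(data):
    i = 0
    while i < len(data):
        b = data[i]
        # Decision gets made for every single char, from its lead byte.
        if b < 0x80:
            count = 1      # ASCII
        elif b < 0xE0:
            count = 2      # lead byte 110xxxxx
        elif b < 0xF0:
            count = 3      # lead byte 1110xxxx
        else:
            count = 4      # lead byte 11110xxx
        yield data[i:i + count].decode("utf-8")
        i += count

print(list(utf8_chars("aé€𝄞".encode("utf-8"))))
# ['a', 'é', '€', '𝄞'] -- 1-, 2-, 3- and 4-byte sequences respectively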

So UTF-8 requires much more runtime overhead than Python 3.3, and Emacs's
variation can in fact require more bytes per character than either.
(UTF-8 and Python 3.3 can require up to four bytes, Emacs up to five.)
I'm not surprised that JMF would prefer UTF-8 -- he is completely out of
his depth, and is a fine example of the Dunning-Kruger effect in action.
He is so sure he is right based on so little evidence.

One advantage of UTF-8 is that for some BMP characters, you can get away
with only three bytes instead of four. For transmitting data over the
wire, or storage on disk, that's potentially up to a 25% reduction in
space, which is not to be sneezed at. (Although in practice it's usually
much less than that, since the most common characters are encoded to 1 or
2 bytes, not 4). But that comes at the cost of much more runtime
overhead, which in my opinion makes UTF-8 a second-class string
representation compared to fixed-width representations.
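
A quick check of that size difference:

for ch in "a", "é", "€", "𝄞":
    print(ch, len(ch.encode("utf-8")), len(ch.encode("utf-32-le")))
# a 1 4
# é 2 4
# € 3 4   <- the BMP case where UTF-8 saves a byte over a fixed 4-byte width
# 𝄞 4 4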
 
M

Michael Torrie

Let's start with a simple string containing an em dash or an en dash

26

That's meaningless. You're comparing the overhead of a string object
itself (a one-time cost anyway), not the overhead of storing the actual
characters. This is the only meaningful comparison:

Actually I'm not even sure what your point is after all this time of
railing against FSR. You have said in the past that Python penalizes
users of character sets that require wider byte encodings, but what
would you have us do? use 4-byte characters and penalize everyone
equally? Use 2-byte characters that incorrectly expose surrogate pairs
for some characters? Use UTF-8 in memory and do O(n) indexing? Are your
programs (actual programs, not contrived benchmarks) actually slower
because of FSR? Is FSR incorrect? If so, according to what part of the
unicode standard? I'm not trying to troll, or feed the troll. I'm
actually curious.

I think perhaps you feel that many of us who don't use unicode often
don't understand unicode because some of us don't understand you. If
so, I'm not sure that's actually true.
 
M

Michael Torrie

JMF has explained that it is impossible, impossible I say!, to write an
editor using a flexible string representation. Since Emacs uses such a
flexible string representation, Emacs is impossible, and therefore Emacs
doesn't exist.

Now I'm even more confused. He once pointed to Go as an example of how
unicode should be done in a language. Yet Go uses UTF-8, I think.

But I don't think UTF-8 is what JMF refers to as "flexible string
representation." FSR does use 1,2 or 4 bytes per character, but each
character in the string uses the same width. That's different from
UTF-8 or UTF-16, which is variable width per character.
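
You can observe that per-string width choice from Python itself; the absolute sizes vary by build, but the growth per character shows the 1-, 2- and 4-byte kinds:

import sys

for ch in "a", "€", "𝄞":
    small = sys.getsizeof(ch * 1000)
    large = sys.getsizeof(ch * 2000)
    print(repr(ch), (large - small) // 1000, "byte(s) per character")
# 'a' 1 byte(s) per character
# '€' 2 byte(s) per character
# '𝄞' 4 byte(s) per character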
 
I

Ian Kelly

UTF-8 uses a flexible representation on a character-by-character basis.
When parsing UTF-8, one needs to look at EVERY character to decide how
many bytes you need to read. In Python 3, the flexible representation is
on a string-by-string basis: once Python has looked at the string header,
it can tell whether the *entire* string takes 1, 2 or 4 bytes per
character, and the string is then fixed-width. You can't do that with
UTF-8.

UTF-8 does not use a flexible representation. A codec that is
encoding a string in UTF-8 and examining a particular character does
not have any choice of how to encode that character; there is exactly
one sequence of bits that is the UTF-8 encoding for the character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points. In
contrast, with the FSR there are as many as three different sequences
of bytes that encode a sequence of code points, with one of them (the
shortest) being canonical. That's what makes it flexible.

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
"something like the FSR".
 
W

wxjmfauth

On Thursday, 25 July 2013 at 22:45:38 UTC+2, Ian wrote:
Jeremy Sanders wrote:
"To conserve memory, Emacs does not hold fixed-length 22-bit numbers
that are codepoints of text characters within buffers and strings.
Rather, Emacs uses a variable-length internal representation of
characters, that stores each character as a sequence of 1 to 5 8-bit
bytes, depending on the magnitude of its codepoint[1]. For example,
any ASCII character takes up only 1 byte, a Latin-1 character takes up
2 bytes, etc. We call this representation of text multibyte.

Well, you've just proven what Vim users have always suspected: Emacs
doesn't really exist.

... lolwut?
JMF has explained that it is impossible, impossible I say!, to write an
editor using a flexible string representation. Since Emacs uses such a
flexible string representation, Emacs is impossible, and therefore Emacs
doesn't exist.

QED.

Except that the described representation used by Emacs is a variant of
UTF-8, not an FSR. It doesn't have three different possible encodings
for the letter 'a' depending on what other characters happen to be in
the string.

As I understand it, jmf would be perfectly happy if Python used UTF-8
(or presumably the Emacs variant) as its internal string
representation.

------

And Emacs is probably working smoothly.

Your comment summarized all this stuff very correctly and
very concisely.

utf8/16/32? I do not care. They are all working correctly,
smoothly and efficiently. In fact, these utf's are already
doing correctly what this FSR is doing in a wrong way.

My preference? utf32. Why? It is the simplest and
consequently the best-performing choice. I'm not a narrow-minded
ascii user. (I do not pretend to belong to those who
are solving the quadrature of the circle; I pretend to
belong to those who know the quadrature of the circle
is not solvable.)

Note: text processing tools or tools that have to process
characters — and the tools to build these tools — are all
moving to utf32, if not already done. There are technical
reasons behind this, which go beyond
pure raw unicode. They are, however, still 100% Unicode
compliant.

jmf
 
W

wxjmfauth

On Friday, 26 July 2013 at 05:09:34 UTC+2, Michael Torrie wrote:
Now I'm even more confused. He once pointed to Go as an example of how
unicode should be done in a language. Yet Go uses UTF-8, I think.

But I don't think UTF-8 is what JMF refers to as "flexible string
representation." FSR does use 1,2 or 4 bytes per character, but each
character in the string uses the same width. That's different from
UTF-8 or UTF-16, which is variable width per character.

I have already explained / commented this.

--------


Hint: To understand Unicode (and every coding scheme), you should
understand "utf". The how and the *why*.

jmf
 
W

wxjmfauth

On Friday, 26 July 2013 at 05:20:45 UTC+2, Ian wrote:
UTF-8 does not use a flexible representation. A codec that is
encoding a string in UTF-8 and examining a particular character does
not have any choice of how to encode that character; there is exactly
one sequence of bits that is the UTF-8 encoding for the character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points. In
contrast, with the FSR there are as many as three different sequences
of bytes that encode a sequence of code points, with one of them (the
shortest) being canonical. That's what makes it flexible.

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
"something like the FSR".

--------


BTW, it is not necessary to use an endorsed Unicode coding
scheme (utf*); a string literal would have been possible,
but then one runs into memory issues.

All these utf are following the basic coding scheme.

I repeat again.
A coding scheme works with a unique set of characters
and its implementation works with a unique set of
encoded code points (the utf's, in case of Unicode).

And again, that's why we live today with all these coding
schemes; or, to take the problem from the other side, it is
because one has to work with a unique set of
encoded code points that all these coding schemes had to
be created.

utf's have not been created by newbies ;-)

jmf
 
W

wxjmfauth

On Friday, 26 July 2013 at 05:20:45 UTC+2, Ian wrote:
UTF-8 does not use a flexible representation. A codec that is
encoding a string in UTF-8 and examining a particular character does
not have any choice of how to encode that character; there is exactly
one sequence of bits that is the UTF-8 encoding for the character.
Further, for any given sequence of code points there is exactly one
sequence of bytes that is the UTF-8 encoding of those code points. In
contrast, with the FSR there are as many as three different sequences
of bytes that encode a sequence of code points, with one of them (the
shortest) being canonical. That's what makes it flexible.

Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he
(or anybody else) is referring to when speaking of the FSR or
"something like the FSR".

-----

Let's be clear. I understand perfectly what utf-8 is,
and it is for that precise reason that I put the "editor"
example on the table.

This FSR is not *a* coding scheme. It is more a composite
coding scheme. (And from there, all the problems.)

BTW, I'm pleased to read "sequence of bits" and not bytes.
Again, utf transformers produce sequences of bits,
called Unicode Transformation Units, with lengths of
8/16/32 *bits*, from there the names utf8/16/32.
UCS transformers are (were) producing bytes, from there
the names ucs-2/4.

jmf
 
A

Antoon Pardon

On 26-07-13 15:21, (e-mail address removed) wrote:
Hint: To understand Unicode (and every coding scheme), you should
understand "utf". The how and the *why*.

No you don't. You are mixing up the information with how the information
is coded. utf is like base64, a way of coding the information that is
useful for storage or transfer. But once you have decoded the byte
stream, you no longer need any understanding of base64 to process your
information. Likewise, once you have decoded the bytestream into unicode
information you don't need knowledge of utf to process unicode strings.
 
M

Michael Torrie

I have already explained / commented this.

Maybe it got lost in translation, but I don't understand your point with
that.
Hint: To understand Unicode (and every coding scheme), you should
understand "utf". The how and the *why*.

Hmm, so if python used utf-8 internally to represent unicode strings
wouldn't that punish *all* users (not just non-ascii users), since
searching a string for a certain character position requires an O(n)
operation? UTF-32 I could see (and indeed that's essentially what FSR
uses when necessary, does it not?), but not utf-8 or utf-16.
 
S

Steven D'Aprano

UTF-8 does not use a flexible representation.

I disagree, and so does Jeremy Sanders who first pointed out the
similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the
Emacs documentation again:

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint. For example, any ASCII character takes
up only 1 byte, a Latin-1 character takes up 2 bytes, etc."

And the Python FSR:

"To conserve memory, Python does not hold fixed-length 21-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Python uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 4 8-bit bytes, depending on
the magnitude of the largest codepoint in the string. For example, any
all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
BMP string takes up 2 bytes per character, etc."

See the similarity now? Both flexibly change the width used by
code-points, UTF-8 based on the code-point itself regardless of the rest
of the string, Python based on the largest code-point in the string.


[...]
Anyway, my point was just that Emacs is not a counter-example to jmf's
claim about implementing text editors, because UTF-8 is not what he (or
anybody else) is referring to when speaking of the FSR or "something
like the FSR".

Whether JMF can see the similarities between different implementations of
strings or not is beside the point, those similarities do exist. As do
the differences, of course, but in this case the differences are in
favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8
implementation *cannot know that*, and still has to walk the string
byte-by-byte checking whether the current code point requires 1, 2, 3, or
4 bytes, while a FSR implementation can simply record the fact that the
string is pure Latin1 at creation time, and then treat it as fixed-width
from then on.

JMF claims that FSR is "impossible" to use efficiently, and yet he
supports encoding schemes which are *less* efficient. Go figure. He tells
us he has no problem with any of the established UTF encodings, and yet
the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not
UTF-16, since there are no surrogate pairs. But the difference is
insignificant.)

Having watched this issue from Day One when JMF first complained about
it, I believe this is entirely about denying any benefit to ASCII users.
Had Python implemented a system identical to the current FSR except that
it added a fourth category, "all ASCII", which used an eight-byte
encoding scheme (thus making ASCII strings twice as expensive as strings
including code points from the Supplementary Multilingual Planes), JMF
would be the scheme's number one champion.

I cannot see any other rational explanation for why JMF prefers broken,
buggy Unicode implementations, or implementations which are equally
expensive for all strings, over one which is demonstrably correct,
demonstrably saves memory, and for realistic, non-contrived benchmarks,
demonstrably faster, except that he wants to punish ASCII users more than
he wants to support Unicode users.
 
I

Ian Kelly

See the similarity now? Both flexibly change the width used by code-
points, UTF-8 based on the code-point itself regardless of the rest of
the string, Python based on the largest code-point in the string.

No, I think we're just using the word "flexible" differently. In my
view, simply being variable-width does not make an encoding "flexible"
in the sense of the FSR. But I'm not going to keep repeating myself
in order to argue about it.
Having watched this issue from Day One when JMF first complained about
it, I believe this is entirely about denying any benefit to ASCII users.
Had Python implemented a system identical to the current FSR except that
it added a fourth category, "all ASCII", which used an eight-byte
encoding scheme (thus making ASCII strings twice as expensive as strings
including code points from the Supplementary Multilingual Planes), JMF
would be the scheme's number one champion.

I agree. In fact I made a similar observation back in December:

http://mail.python.org/pipermail/python-list/2012-December/636942.html
 
S

Steven D'Aprano

BTW, I'm pleased to read "sequence of bits" and not bytes. Again, utf
transformers are producing sequence of bits, call Unicode Transformation
Units, with lengths of 8/16/32 *bits*, from there the names utf8/16/32.
UCS transformers are (were) producing bytes, from there the names
ucs-2/4.


Not only does your distinction between bits and bytes make no practical
difference on nearly all hardware in common use today[1], but the Unicode
Consortium disagrees with you, and defines UTF in terms of bytes:

"A Unicode transformation format (UTF) is an algorithmic mapping from
every Unicode code point (except surrogate code points) to a unique byte
sequence."

http://www.unicode.org/faq/utf_bom.html#gen2




[1] There may still be some old supercomputers in use where a byte is more
than 8 bits, but they're unlikely to support Unicode.
 
D

Dennis Lee Bieber

I disagree, and so does Jeremy Sanders who first pointed out the
similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the
Emacs documentation again:

"To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint. For example, any ASCII character takes
up only 1 byte, a Latin-1 character takes up 2 bytes, etc."

And the Python FSR:

"To conserve memory, Python does not hold fixed-length 21-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Python uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 4 8-bit bytes, depending on
the magnitude of the largest codepoint in the string. For example, any
all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-
BMP string takes up 2 bytes per character, etc."

As I read those: Python states "any all-ASCII or all-Latin1 string
takes up only 1 byte per character", etc. I.e., the entire STRING is based
upon the minimal size that can encode all characters in the string.

The EMACS statement doesn't specify a "string", it implies, in "any
ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes,
etc.", that a string can contain mixed length characters.
 
W

wxjmfauth

On Saturday, 27 July 2013 at 04:05:03 UTC+2, Michael Torrie wrote:
Maybe it got lost in translation, but I don't understand your point with
that.

Hmm, so if python used utf-8 internally to represent unicode strings
wouldn't that punish *all* users (not just non-ascii users), since
searching a string for a certain character position requires an O(n)
operation? UTF-32 I could see (and indeed that's essentially what FSR
uses when necessary, does it not?), but not utf-8 or utf-16.

------

Did you read my previous link? Unicode Character Encoding Model.
Did you understand it?

Unicode only - no FSR. (I skip some points and still attempt to
be correct.)

Unicode is a four-step process.
[ {unique set of characters} --> {unique set of code points, the
"labels"} --> {unique set of encoded code points} ] --> implementation
(bytes)

First point to notice: "pure unicode", [...], is different from
the "implementation". *This is a deliberate choice*.

The critical step is the path {unique set of characters} --->
{unique set of encoded code points}, in such a way that
the implementation can "work comfortably" with this *unique* set
of encoded code points. Conceptually, the implementation works
with a unique set of "already prepared encoded code points".
This is a very critical step. To explain it in a dirty way:
in the above chain, this problem is "already" eliminated and
solved, like in a byte/char coding scheme where this step is
a no-op.

Now, if you wish, this is a separate/different problem.
To create this unique set of encoded code points, "Unicode"
uses these "utf(s)". I repeat again, a confusing name for the
process and the result of the process. (I neglect ucs.)
What are these? Chunks of bits, groups of 8/16/32 bits, words.
It is up to the implementation to convert these sequences
of bits into bytes, ***if you wish to convert them into bytes!***
Surprise! Why not put two of the 32-bit words in a 64-bit
"machine"? (See golang / rune / int32.)

Back to utf. utfs are not only elements of a unique set of encoded
code points. They have an interesting feature: each "utf chunk"
holds intrinsically the character (in fact the code point) it is
supposed to represent. In utf-32, the obvious case, it is just
the code point. In utf-8, it is the first chunk which helps, and
utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
implementation using bytes, for any pointer position it is always
possible to find the corresponding encoded code point, and from this
the corresponding character, without any "programmed" information. See
my editor example: how to find the char under the caret? In fact,
a silly example: how can the caret be positioned or moved, if
the underlying corresponding encoded code point cannot be
discerned!
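
For example, a small sketch, assuming well-formed utf-8 bytes:

def char_at_byte(data, pos):
    # Back up over continuation bytes (10xxxxxx) to the lead byte...
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    # ...then read forward to the start of the next character.
    end = pos + 1
    while end < len(data) and (data[end] & 0xC0) == 0x80:
        end += 1
    return data[pos:end].decode("utf-8")

data = "x€y".encode("utf-8")     # b'x\xe2\x82\xacy'
print(char_at_byte(data, 2))     # '€', even though offset 2 is mid-character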

Next step, and another separate problem.
Why all these utf versions? It is always the
same story. Some prefer universality (utf-32) and
some prefer, well, some kind of conservatism. utf-8 is
more complicated; it demands more work and, logically,
in an expected way, some performance regression.
utf-8 is more suited to producing bytes, utf-16/32 to
internal processing. utf-8 had no choice but to lose
indexing. And so on.
Fact: all these coding schemes work with a unique
set of encoded code points (surprise again, it's like a byte
string!). The loss of performance of utf-8 is very minimal
compared to the loss of performance one gets with
a multiple coding scheme. This kind of work has been done,
and if my information is correct, even by the creators
of utf-8. (There are sometimes good scientists.)

There are plenty of advantages in using utf instead of
something else, and advantages in other fields than just
the pure coding.
The utf-16/32 schemes have the advantage of ditching ascii
for ever. The ascii concept no longer exists.

One should also understand that all this stuff has
not been created from scratch. It was a balance between
existing technologies. MS stuck with the idea of no more
ascii, let's use ucs-2, and the *x world put the brakes on
unicode adoption as much as possible. utf-8 is one of the compromises
made for the adoption of Unicode. Retrospectively, a not so good
compromise.

Computer scientists are funny scientists. They do love
to solve the problems they created themselves.

-----

Quickly: sys.getsizeof() in the light of what I explained.

1) As this FSR works with multiple encodings, it has to keep
track of the encoding. It puts it in the overhead of the str
class (overhead = real overhead + encoding). In such
an absurd way that a
40

needs 14 bytes more than a
26

You may vary the length of the str. The problem is
still here. Not bad for a coding scheme.

2) Take a look at this. Get rid of the overhead.
2000040

What does it mean? It means that Python has to
reencode a str every time it is necessary, because
it works with multiple codings.

This FSR is not even a copy of the utf-8.
1000003

utf-8, or any (utf), never needs to and never spends its time
reencoding. (See the sketch after point 3.)

3) Unicode compliance. We know retrospectively that latin-1
was a bad choice. Unusable for 17 European languages.
Believe it or not, 20 years of Unicode incubation is not
long enough to learn it. When once discussing with a French
Python core dev, one with commit access, he did not know one
cannot use latin-1 for the French language! BTW, I proposed
to the French devs to test the FSR with the set of characters
recognized by the "Imprimerie Nationale", some kind of
legal French authority regarding characters and typography.
Never heard back about it. Of course, I did it.
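
Regarding points 1) and 2), the kind of session I mean looks like this
(the exact figures depend on the interpreter build and version):

import sys

print(sys.getsizeof("a"))                 # small: 1-byte-per-char string
print(sys.getsizeof("€"))                 # bigger: 2-byte kind plus extra header fields
big = "a" * 1000000
print(sys.getsizeof(big + "€"))           # the whole result widens to 2 bytes per char
print(len((big + "€").encode("utf-8")))   # utf-8 pays only 3 extra bytes for the '€'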


In short:
FSR = bad performance + bad memory management + non-unicode
compliance.

Good point: FSR is a nice tool for those who wish to teach
Unicode. It is not every day one has such an opportunity.

---------

I'm practically no longer programming or writing applications.
I have still been actively observing, for over a decade, all this
unicode world, languages (go, c#, Python, Ruby), text
processing systems (esp. Unicode TeX engines) and font technology.
Very, very interesting.


jmf
 
I

Ian Kelly

Back to utf. utfs are not only elements of a unique set of encoded
code points. They have an interesting feature. Each "utf chunk"
holds intrisically the character (in fact the code point) it is
supposed to represent. In utf-32, the obvious case, it is just
the code point. In utf-8, that's the first chunk which helps and
utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
implementation using bytes, for any pointer position it is always
possible to find the corresponding encoded code point and from this
the corresponding character without any "programmed" information. See
my editor example, how to find the char under the caret? In fact,
a silly example, how can the caret can be positioned or moved, if
the underlying corresponding encoded code point can not be
dicerned!

Yes, given a pointer location into a utf-8 or utf-16 string, it is
easy to determine the identity of the code point at that location.
But this is not often a useful operation, save for resynchronization
in the case that the string data is corrupted. The caret of an editor
does not conceptually correspond to a pointer location, but to a
character index. Given a particular character index (e.g. 127504), an
editor must be able to determine the identity and/or the memory
location of the character at that index, and for UTF-8 and UTF-16
without an auxiliary data structure that is an O(n) operation.
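
A sketch of that lookup (illustrative code, not from any particular implementation):

def byte_offset_utf8(data, char_index):
    # O(n): walk the bytes, skipping each character's continuation bytes.
    offset = 0
    for _ in range(char_index):
        offset += 1
        while offset < len(data) and (data[offset] & 0xC0) == 0x80:
            offset += 1
    return offset

def byte_offset_fixed(width, char_index):
    # O(1): what UTF-32 or a PEP 393 string can do.
    return char_index * width

data = "naïve€".encode("utf-8")
print(byte_offset_utf8(data, 5))    # 6 -- found only by scanning from the start
print(byte_offset_fixed(4, 5))      # 20 -- simple arithmetic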
2) Take a look at this. Get rid of the overhead.

2000040

What does it mean? It means that Python has to
reencode a str every time it is necessary because
it works with multiple codings.

Large strings in practical usage do not need to be resized like this
often. Python 3.3 has been in production use for months now, and you
still have yet to produce any real-world application code that
demonstrates a performance regression. If there is no real-world
regression, then there is no problem.
3) Unicode compliance. We know retrospectively, latin-1,
is was a bad choice. Unusable for 17 European languages.
Believe of not. 20 years of Unicode of incubation is not
long enough to learn it. When discussing once with a French
Python core dev, one with commit access, he did not know one
can not use latin-1 for the French language!

Probably because for many French strings, one can. As far as I am
aware, the only characters that are missing from Latin-1 are the Euro
sign (an unfortunate victim of history), the ligature œ (I have no
doubt that many users just type oe anyway), and the rare capital Ÿ
(the minuscule version is present in Latin-1). All French strings
that are fortunate enough to lack these characters can be
represented in Latin-1 and so will have a 1-byte width in the FSR.
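
For example:

print("déjà vu".encode("latin-1"))   # works: b'd\xe9j\xe0 vu'
try:
    "cœur".encode("latin-1")
except UnicodeEncodeError as err:
    print(err)                       # 'latin-1' codec can't encode character '\u0153' ...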
 
A

Antoon Pardon

On 27-07-13 20:21, (e-mail address removed) wrote:
Quickly. sys.getsizeof() at the light of what I explained.

1) As this FSR works with multiple encoding, it has to keep
track of the encoding. it puts is in the overhead of str
class (overhead = real overhead + encoding). In such
a absurd way, that a

40

needs 14 bytes more than a

26

You may vary the length of the str. The problem is
still here. Not bad for a coding scheme.

2) Take a look at this. Get rid of the overhead.

2000040

What does it mean? It means that Python has to
reencode a str every time it is necessary because
it works with multiple codings.

So? The same effect can be seen with other datatypes.
16



This FSR is not even a copy of the utf-8.
1000003

Why should it be? Why should a unicode string be a copy
of its utf-8 encoding? That makes as much sense as expecting
that a number would be a copy of its string representation.
utf-8 or any (utf) never need and never spend their time
in reencoding.

So? That python sometimes needs to do some kind of background
processing is not a problem, whether it is garbage collection,
allocating more memory, shuffling around data blocks or reencoding a
string, that doesn't matter. If you've got a real world example where
one of those things noticeably slows your program down or makes the
program misbehave, then you have something that is worthy of
attention.

Until then you are merely harboring a pet peeve.
 
M

Michael Torrie

Good point. FSR, nice tool for those who wish to teach
Unicode. It is not every day, one has such an opportunity.

I had a long e-mail composed, but decided to chop it down; it was still too
long, so I ditched a lot of the context, which jmf also seems to do.
Apologies.

1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
is an official encoding. FSR only differs from UTF-32 in that the
padding zeros are stripped off such that it is stored in the most
compact form that can handle all the characters in string, which is
always known at string creation time. Now you can argue many things,
but to say FSR is not unicode compliant is quite a stretch! What
unicode entities or characters cannot be stored in strings using FSR?
What sequences of bytes in FSR result in invalid Unicode entities?

2. strings in Python *never change*. They are immutable. The +
operator always copies strings character by character into a new string
object, even if Python had used UTF-8 internally. If you're doing a lot
of string concatenations, perhaps you're using the wrong data type. A
byte buffer might be better for you, where you can stuff utf-8 sequences
into it to your heart's content (see the sketch after this list).

3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
slicing a string would be very very slow, and that's unacceptable for
the use cases of python strings. I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any
variable-width encoding are always O(n). A lot slower!

4. Unicode is, well, unicode. You seem to hop all over the place from
talking about code points to bytes to bits, using them all
interchangeably. And now you seem to be claiming that a particular byte
encoding standard is by definition unicode (UTF-8). Or at least that's
how it sounds. And also claim FSR is not compliant with unicode
standards, which appears to me to be completely false.

Is my understanding of these things wrong?
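
On point 2, for what it's worth, the usual idioms look something like this
(illustrative only):

# Build up text with a list and one join, instead of repeated str concatenation:
parts = []
for word in ["many", "small", "appends"]:
    parts.append(word)
text = " ".join(parts)          # one final copy instead of a copy per append

# Or use a mutable byte buffer and stuff UTF-8 sequences into it:
buf = bytearray()
buf += "héllo, ".encode("utf-8")
buf += "wörld".encode("utf-8")
print(text, buf.decode("utf-8"))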
 
