RE Module Performance

  • Thread starter Devyn Collier Johnson
  • Start date
wxjmfauth

On Monday, July 29, 2013 at 13:57:47 UTC+2, Chris Angelico wrote:



Wonder if that might maybe have an impact on the timings.



ChrisA

Good point. I stupidly forgot this.

jmf
 
wxjmfauth

On Sunday, July 28, 2013 at 19:36:00 UTC+2, Terry Reedy wrote:
Not necessarily so. See below.









Slicing is at least O(m) where m is the length of the slice.






I posted about a week ago, in response to Chris A., a method by which

lookup for UTF-16 can be made O(log2 k), or perhaps more accurately,

O(1+log2(k+1)), where k is the number of non-BMP chars in the string.



This uses an auxiliary array of k ints. An auxiliary array of n ints

would make UTF-16 lookup O(1), but then one is using more space than

with UTF-32. Similar comments apply to UTF-8.
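
Terry's auxiliary-array scheme can be sketched in a few lines. This is a hypothetical illustration (the function names are made up, not anyone's posted implementation): record the character indices of the non-BMP code points, then binary-search that array to translate a character index into a UTF-16 code-unit offset.

```python
from bisect import bisect_left

def build_aux(s):
    """Auxiliary array: character indices of the non-BMP code points in s."""
    return [i for i, ch in enumerate(s) if ord(ch) > 0xFFFF]

def utf16_unit_offset(aux, char_index):
    """UTF-16 code-unit offset of the character at char_index.

    Every non-BMP character before char_index occupies two code units
    instead of one, so the offset is char_index plus the number of
    auxiliary entries below it -- O(log2 k) via binary search.
    """
    return char_index + bisect_left(aux, char_index)

s = 'ab\U0001F600cd\U0001F601ef'
aux = build_aux(s)                  # [2, 5]
offset = utf16_unit_offset(aux, 6)  # 'e' sits at code unit 8
```

The auxiliary array costs k ints, so a pure-ASCII string pays nothing, which is the appeal of the scheme.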



The unicode standard says that a single string should use exactly one

coding scheme. It does *not* say that all strings in an application must

use the same scheme. I just rechecked a few days ago. It also does not

say that an application cannot associate additional data with a string

to make processing of the string easier.

To my knowledge, the Unicode doc always speaks about
the misc. utf* coding schemes in an "exclusive or" way.

Having multiple encoded strings is one thing. Manipulating
multiple encoded strings is something else.

Maybe the mistake was to not emphasize the fact that
one has to work with a unique set of encoded code points
(utf-8 or utf-16 or utf-32), because it was considered
too obvious that one cannot work properly with multiple
coding schemes.

You are also right in saying "...application cannot associate
additional data...".
The doc does not specify it either. It is superfluous.


jmf
 
wxjmfauth

On Monday, July 29, 2013 at 13:57:47 UTC+2, Chris Angelico wrote:



Wonder if that might maybe have an impact on the timings.



ChrisA

--------


class C:
    a = 'abc'
    b = 'def'
    def aaa(self):
        pass
    def bbb(self):
        pass
    def ccc(self):
        pass

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("r = dir(C)", setup="from __main__ import C"))


c:\python32\pythonw -u "timitmod.py"
15.258061416225663
Exit code: 0
c:\Python33\pythonw -u "timitmod.py"
17.052203122286194
Exit code: 0


jmf
 
wxjmfauth

On Monday, July 29, 2013 at 16:49:34 UTC+2, Chris Angelico wrote:
Did you even think to check that before you posted timings?



ChrisA

Boom, no! The diff is one.
I have however noticed that if I increase the number
of (ascii) attributes, the timing difference
is very well marked.
I do not draw conclusions. Such a factor for one
unit...

jmf
 
wxjmfauth

On Sunday, July 28, 2013 at 05:53:22 UTC+2, Ian wrote:
Yes, given a pointer location into a utf-8 or utf-16 string, it is

easy to determine the identity of the code point at that location.

But this is not often a useful operation, save for resynchronization

in the case that the string data is corrupted. The caret of an editor

does not conceptually correspond to a pointer location, but to a

character index. Given a particular character index (e.g. 127504), an

editor must be able to determine the identity and/or the memory

location of the character at that index, and for UTF-8 and UTF-16

without an auxiliary data structure that is an O(n) operation.
------

Same conceptual mistake as Steven's example with its buffers:
the buffer does not know it holds characters.
That is not the point under discussion.

-----

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy that the buffer of your
editor does not waste time reencoding the buffer as
soon as you enter an €, the 127505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies a reencoding twice.

Somebody wrote "FSR" is just an optimization. Yes, but in the case
of an editor à la FSR, this optimization takes place every time you
enter a char. Your poor editor, in fact the FSR, ends up
spending its time optimizing and finally optimizes nothing.
(It is even worse.)

If you correctly type a z instead of an €, it is not necessary
to reencode the buffer. Problem: how do you know that you do
not have to reencode? Simple, just check it; and by just checking,
you waste time testing whether you have to optimize or not, and hurt
a little bit more what is supposed to be an optimization.

Do not confuse the process of optimization with the result of
optimization (funny, it's like the utf's).

There is a trick to let the editor know whether it has
to be "optimized": just put a flag somewhere. Then
you fall into the "Houston" syndrome. Houston, we've got a
problem: our buffer consumes many more bytes than expected.

Now the good news. In an editor à la FSR, the
"composition" is not so important. You know,
"practicality beats purity". The hard job
is the text rendering engine and the handling
of the font (even in a raw unicode editor).
And as these tools are luckily not working à la FSR
(probably because they understand the coding
of the characters), your editor is still working
not so badly.

jmf
 
Antoon Pardon

On 30-07-13 16:01, (e-mail address removed) wrote:
I am pretty sure that once you have typed your 127504
ascii characters, you are very happy that the buffer of your
editor does not waste time reencoding the buffer as
soon as you enter an €, the 127505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies a reencoding twice.

Using a single string as an editor buffer is a bad idea in Python
for the simple reason that strings are immutable. So adding
characters would mean continuously copying the string buffer
into a new string with the next character added. Copying
127504 characters into a new string will not make that much
of a difference whether the octets are just copied to octets
or are unpacked into 32 bit words.
Somebody wrote "FSR" is just an optimization. Yes, but in the case
of an editor à la FSR, this optimization takes place every time you
enter a char. Your poor editor, in fact the FSR, ends up
spending its time optimizing and finally optimizes nothing.
(It is even worse.)

Even if you would do it this way, it would *not* take place
every time you enter a char. Once your buffer would contain
a wide character, it would just need to convert the single
character that is added after each keystroke. It would not
need to convert the whole buffer after each key stroke.
If you correctly type a z instead of an €, it is not necessary
to reencode the buffer. Problem: how do you know that you do
not have to reencode? Simple, just check it; and by just checking,
you waste time testing whether you have to optimize or not, and hurt
a little bit more what is supposed to be an optimization.

Your scenario is totally unrealistic. First of all because of
the immutable nature of python strings, second because you
suggest that real time usage would result in frequent conversions
which is highly unlikely.
 

Chris Angelico

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy that the buffer of your
editor does not waste time reencoding the buffer as
soon as you enter an €, the 127505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies a reencoding twice.

You're still thinking that the editor's buffer is a Python string. As
I've shown earlier, this is a really bad idea, and that has nothing to
do with FSR/PEP 393. An immutable string is *horribly* inefficient at
this; if you want to keep concatenating onto a string, the recommended
method is a list of strings that gets join()d at the end, and the same
technique works well here. Here's a little demo class that could make
the basis for such a system:

class EditorBuffer:
    def __init__(self, fn):
        self.fn = fn
        self.buffer = [open(fn).read()]

    def insert(self, pos, char):
        if pos == 0:
            # Special case: insertion at beginning of buffer
            if len(self.buffer[0]) > 1024: self.buffer.insert(0, char)
            else: self.buffer[0] = char + self.buffer[0]
            return
        for idx, part in enumerate(self.buffer):
            l = len(part)
            if pos > l:
                pos -= l
                continue
            if pos < l:
                # Cursor is somewhere inside this string
                splitme = self.buffer[idx]
                self.buffer[idx:idx+1] = splitme[:pos], splitme[pos:]
                l = pos
            # Cursor is now at the end of this string
            if l > 1024: self.buffer[idx:idx+1] = self.buffer[idx], char
            else: self.buffer[idx] += char
            return
        raise ValueError("Cannot insert past end of buffer")

    def __str__(self):
        return ''.join(self.buffer)

    def save(self):
        open(self.fn, "w").write(str(self))

It guarantees that inserts will never need to resize more than 1KB of
text. As a real basis for an editor, it still sucks, but it's purely
to prove this one point.

ChrisA
 
MRAB

On 30-07-13 16:01, (e-mail address removed) wrote:

Using a single string as an editor buffer is a bad idea in Python for
the simple reason that strings are immutable.

Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.
So adding characters would mean continuously copying the string
buffer into a new string with the next character added. Copying
127504 characters into a new string will not make that much of a
difference whether the octets are just copied to octets or are
unpacked into 32 bit words.


Even if you would do it this way, it would *not* take place every
time you enter a char. Once your buffer would contain a wide
character, it would just need to convert the single character that is
added after each keystroke. It would not need to convert the whole
buffer after each key stroke.


Your scenario is totally unrealistic. First of all because of the
immutable nature of python strings, second because you suggest that
real time usage would result in frequent conversions which is highly
unlikely.
What you would have is a list of mutable chunks.

Inserting into a chunk would be fast, and a chunk would be split if
it's already full. Also, small adjacent chunks would be joined together.

Finally, a chunk could use FSR to reduce memory usage.
 
Antoon Pardon

On 30-07-13 18:13, MRAB wrote:
Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.

Not if you use a gap buffer.
 
MRAB

On 30-07-13 18:13, MRAB wrote:

Not if you use a gap buffer.
The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.
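
For readers following along, the gap-buffer idea under discussion can be sketched like this (an illustrative toy, with made-up names, not a production editor buffer): insertion at the cursor is cheap because the free slots sit right there, while moving the cursor shifts characters across the gap, which is exactly MRAB's objection.

```python
class GapBuffer:
    """Minimal gap buffer: the text lives in one list with a run of
    free slots (the gap) at the cursor position."""

    def __init__(self, text='', gap_size=16):
        self.buf = list(text) + [None] * gap_size
        self.gap_start = len(text)      # cursor position
        self.gap_end = len(self.buf)

    def move_cursor(self, pos):
        # Shift characters across the gap until it sits at pos;
        # this is the O(distance) cost MRAB points out.
        while self.gap_start > pos:     # move gap left
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:     # move gap right
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, ch):
        if self.gap_start == self.gap_end:   # gap filled: make a new one
            self.buf[self.gap_start:self.gap_start] = [None] * 16
            self.gap_end += 16
        self.buf[self.gap_start] = ch        # O(1) insert at the cursor
        self.gap_start += 1

    def __str__(self):
        return ''.join(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```

Typing at one spot is fast; jumping from the start of the buffer to the end drags every character across the gap once, so editors typically combine this with chunking.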
 

Tim Delaney

I am pretty sure that once you have typed your 127504
ascii characters, you are very happy that the buffer of your
editor does not waste time reencoding the buffer as
soon as you enter an €, the 127505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and reentering a new char implies a reencoding twice.

And here we come to the root of your complete misunderstanding and
mischaracterisation of the FSR. You don't appear to understand that
strings in Python are immutable and that to add a character to an
existing string requires copying the entire string + new character. In
your hypothetical situation above, you have already performed 127504
copy + new character operations before you ever get to a single widening
operation. The overhead of the copy + new character repeated 127504
times dwarfs the overhead of a single widening operation.
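
Tim's point can be put in numbers. A back-of-the-envelope count, assuming a naive immutable-string append where every keystroke copies the whole string so far (i.e. ignoring CPython's in-place concatenation optimization):

```python
# Building an n-char string one character at a time, copying the
# whole string on each append: total characters copied is
# 1 + 2 + ... + n = n*(n+1)/2.
n = 127_504
chars_copied = n * (n + 1) // 2   # roughly 8.1 billion characters
widenings = 1                     # a single widening when the '€' arrives
```

One widening pass over 127504 characters is noise next to billions of characters copied by the appends themselves.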

Given your misunderstanding, it's no surprise that you are focused on
microbenchmarks that demonstrate that copying entire strings and adding
a character can be slower in some situations than others. When the only
use case you have is implementing the buffer of an editor using an
immutable string I can fully understand why you would be concerned about
the performance of adding and removing individual characters. However,
in that case *you're focused on the wrong problem*.

Until you can demonstrate an understanding that doing the above in any
language which has immutable strings is completely insane you will have
no credibility and the only interest anyone will pay to your posts is
refuting your FUD so that people new to the language are not driven off
by you.

Tim Delaney
 
Joshua Landau

On 30-07-13 18:13, MRAB wrote:



Not if you use a gap buffer.


Additionally, who says a language couldn't use, say, B-Trees for all of its
list-like types, including strings?
 
Antoon Pardon

On 30-07-13 19:14, MRAB wrote:
The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.

So? Why are you making this a point of discussion? I was not aware that
the pros and cons of various editor buffer implementations were relevant
to the point I was trying to make.

If you prefer another data structure in the editor you are working on,
I will not dissuade you.
 
wxjmfauth

Mutable, immutable, copying + xxx, buffering, O(n) ...
Yes, but conceptually the reencoding happens sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (e.g.).
7.160483334521416


And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.
48


jmf
 

Chris Angelico

Mutable, immutable, copying + xxx, buffering, O(n) ...
Yes, but conceptually the reencoding happens sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (e.g.).

But probably not on the entire document. With even a brainless scheme
like I posted code for, no more than 1024 bytes will need to be
recoded at a time (except in some odd edge cases, and even then, no
more than once for any given file).
And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.

48

Yeah, you have a few odd issues like, oh, I dunno, GC overhead,
reference count, object class, and string length, all stored somewhere
there. Honestly jmf, if you want raw assembly you know where to get
it.

ChrisA
 

Terry Reedy

Additionally, who says a language couldn't use, say, B-Trees for all of
its list-like types, including strings?

Tk apparently uses a B-tree in its text widget.
 

Michael Torrie

So? Why are you making this a point of discussion? I was not aware that
the pros and cons of various editor buffer implementations were relevant
to the point I was trying to make.

I for one found it very interesting. In fact this thread caused me to
wonder how one actually does create an efficient editor. Off the
original topic true, but still very interesting.
 

Michael Torrie

Mutable, immutable, copying + xxx, buffering, O(n) ...
Yes, but conceptually the reencoding happens sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (e.g.).

So what major Python project are you working on where you've found FSR
in general to be a problem? Maybe we can help you work out a more
appropriate data structure and algorithm to use.

But if you're not developing something, and not developing in Python,
perhaps you should withdraw and let us use our horrible FSR in peace,
because it doesn't seem to bother the vast majority of Python
programmers, and does not bother some large Python projects out there.
In fact I think most of us welcome integrated, correct, full unicode.
 

Steven D'Aprano

And do not forget, in a pure utf coding scheme, your char or a char will
*never* be larger than 4 bytes.

48

Neither character above is larger than 4 bytes. You forgot to deduct the
size of the object header. Python is a high-level object-oriented
language, if you care about minimizing every possible byte, you should
use a low-level language like C. Then you can give every character 21
bits, and be happy that you don't waste even one bit.
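
Steven's point is easy to check with `sys.getsizeof` (a standard-library call; exact figures vary by CPython version and platform, so none are asserted here):

```python
import sys

# A one-character string carries the object header (refcount, type
# pointer, length, flags...) on top of the character data itself,
# so it is far larger than the character's encoded size.
for ch in ('a', '€', '\U0001F600'):
    encoded = len(ch.encode('utf-32-le'))   # always 4 bytes per code point
    print(ch, encoded, sys.getsizeof(ch))   # the header dominates
```

Whatever "48" bytes jmf measured, only 2 or 4 of them are the character; the rest is per-object bookkeeping that UTF-32 storage would not remove.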
 
