RE Module Performance

  • Thread starter Devyn Collier Johnson

Terry Reedy

3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
slicing a string would be very very slow,

Not necessarily so. See below.
and that's unacceptable for
the use cases of python strings. I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 both are O(1) for slicing and lookups.

Slicing is at least O(m) where m is the length of the slice.
UTF-8, 16 and any variable-width encoding are always O(n).

I posted about a week ago, in response to Chris A., a method by which
lookup for UTF-16 can be made O(log2 k), or perhaps more accurately,
O(1+log2(k+1)), where k is the number of non-BMP chars in the string.

This uses an auxiliary array of k ints. An auxiliary array of n ints
would make UTF-16 lookup O(1), but then one is using more space than
with UTF-32. Similar comments apply to UTF-8.

The unicode standard says that a single string should use exactly one
coding scheme. It does *not* say that all strings in an application must
use the same scheme. I just rechecked a few days ago. It also does not
say that an application cannot associate additional data with a string
to make processing of the string easier.
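
For illustration, a minimal Python sketch of this kind of auxiliary-array lookup
(the names and details here are only an illustration, not necessarily the exact
proposal): the array records the character indices of the non-BMP code points,
and bisect turns a character index into a UTF-16 code-unit offset in O(log k).

from bisect import bisect_left

def astral_index(s):
    # character indices of the code points outside the BMP
    return [i for i, ch in enumerate(s) if ord(ch) > 0xFFFF]

def char_at(utf16le_data, astral, i):
    # utf16le_data is the UTF-16-LE encoding of the original string
    j = bisect_left(astral, i)        # non-BMP characters before index i
    offset = i + j                    # each of them adds one extra code unit
    width = 4 if j < len(astral) and astral[j] == i else 2
    return utf16le_data[2 * offset: 2 * offset + width].decode('utf-16-le')

s = 'a\U0001D11Eb'                    # one astral character in the middle
astral = astral_index(s)
data = s.encode('utf-16-le')
assert char_at(data, astral, 1) == '\U0001D11E'
assert char_at(data, astral, 2) == 'b'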
 

Chris Angelico

I posted about a week ago, in response to Chris A., a method by which lookup
for UTF-16 can be made O(log2 k), or perhaps more accurately,
O(1+log2(k+1)), where k is the number of non-BMP chars in the string.

Which is an optimization choice that favours strings containing very
few non-BMP characters. To justify the extra complexity of out-of-band
storage, you would need to be working with almost exclusively the BMP.
That would drastically improve jmf's microbenchmarks which do exactly
that, but it would penalize strings that are almost exclusively
higher-codepoint characters. Its quality, then, would be based on a
major survey of string usage: are there enough strings with
mostly-BMP-but-a-few-SMP? Bearing in mind that pure BMP is handled
better by PEP 393, so this is only of value when there are actually
those mixed strings.

ChrisA
 

wxjmfauth

On Sunday 28 July 2013 at 05:53:22 UTC+2, Ian wrote:

Yes, given a pointer location into a utf-8 or utf-16 string, it is
easy to determine the identity of the code point at that location.
But this is not often a useful operation, save for resynchronization
in the case that the string data is corrupted. The caret of an editor
does not conceptually correspond to a pointer location, but to a
character index. Given a particular character index (e.g. 127504), an
editor must be able to determine the identity and/or the memory
location of the character at that index, and for UTF-8 and UTF-16
without an auxiliary data structure that is an O(n) operation.

Large strings in practical usage do not need to be resized like this
often. Python 3.3 has been in production use for months now, and you
still have yet to produce any real-world application code that
demonstrates a performance regression. If there is no real-world
regression, then there is no problem.

Probably because for many French strings, one can. As far as I am
aware, the only characters that are missing from Latin-1 are the Euro
sign (an unfortunate victim of history), the ligature œ (I have no
doubt that many users just type oe anyway), and the rare capital Ÿ
(the minuscule version is present in Latin-1). All French strings
that are fortunate enough to lack these characters can be
represented in Latin-1 and so will have a 1-byte width in the FSR.

------

latin-1? that's not even true.


jmf
 

Chris Angelico

Somewhat off topic, but befitting of the triviality of this thread, do I
understand correctly that you are saying garbage collection never causes any
noticeable slowdown in real-world circumstances? That's not remotely true.

If it's done properly, garbage collection shouldn't hurt the *overall*
performance of the app; most of the issues with GC timing are when one
operation gets unexpectedly delayed for a GC run (making performance
measurement hard, and such). It should certainly never cause your
program to behave faultily, though I have seen cases where the GC run
appears to cause the program to crash - something like this:

some_string = buggy_call()
....
gc()
....
print(some_string)

The buggy call mucked up the reference count, so the gc run actually
wiped the string from memory - resulting in a segfault on next usage.
But the GC wasn't at fault, the original call was. (Which, btw, was
quite a debugging search, especially since the function in question
wasn't my code.)

ChrisA
 

Antoon Pardon

On 28-07-13 20:19, Joshua Landau wrote:
On 28 July 2013 09:45, Antoon Pardon <[email protected]

On 27-07-13 20:21, (e-mail address removed) <mailto:[email protected]>
wrote:

utf-8 or any (utf) never need and never spend their time
in reencoding.


So? That python sometimes needs to do some kind of background
processing is not a problem, whether it is garbage collection,
allocating more memory, shuffling around data blocks or reencoding a
string, that doesn't matter. If you've got a real world example where
one of those things noticeably slows your program down or makes the
program behave faultily, then you have something that is worthy of
attention.


Somewhat off topic, but befitting of the triviality of this thread, do I
understand correctly that you are saying garbage collection never causes
any noticeable slowdown in real-world circumstances? That's not remotely
true.

No, that is not what I am saying. But if jmf were complaining about
garbage collection in an analogous way to how he complains about the FSR,
he wouldn't be complaining about real-world circumstances but about
theoretical possibilities and micro-benchmarks. In those circumstances
the "garbage collection problem" wouldn't be worthy of much attention.
 

MRAB

On Sunday 28 July 2013 at 05:53:22 UTC+2, Ian wrote:
1

One byte per codepoint.
1

Also one byte per codepoint.
12

Clearly there's more going on here.

FSR is an optimisation. You'll always be able to find some
circumstances where an optimisation makes things worse, but what
matters is the overall result.
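
One rough way to see the per-codepoint figures being discussed, amortising the
fixed object overhead that dominates short strings (a sketch only; exact numbers
depend on the interpreter build and version):

import sys

def bytes_per_char(ch, n=10000):
    # compare a long string with a one-character string so the fixed
    # header cost mostly cancels out; the result is approximate by design
    return (sys.getsizeof(ch * n) - sys.getsizeof(ch)) / (n - 1)

print(bytes_per_char('a'))            # ASCII/latin-1 storage
print(bytes_per_char('\u20ac'))       # BMP, two bytes per code point
print(bytes_per_char('\U0001D11E'))   # astral, four bytes per code point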
 

Terry Reedy

If it's done properly, garbage collection shouldn't hurt the *overall*
performance of the app;

There are situations, some discussed on this list, where doing gc
'right' means turning off the cycle garbage collector. As I remember, an
example is creating a list of a million tuples, which otherwise triggers
a lot of useless background bookkeeping. The cyclic gc is tuned for
'normal' use patterns.
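
A minimal sketch of that kind of tuning (illustrative only): reference counting
still reclaims the tuples, the cyclic collector is simply kept from rescanning a
huge, cycle-free structure while it is being built.

import gc

gc.disable()                          # pause only the cyclic collector
try:
    pairs = [(i, i * i) for i in range(1000000)]
finally:
    gc.enable()                       # normal collection resumes afterwards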
 

wxjmfauth

On Sunday 28 July 2013 at 17:52:47 UTC+2, Michael Torrie wrote:

I had a long e-mail composed, but decided to chop it down, but still too
long, so I ditched a lot of the context, which jmf also seems to do.
Apologies.

1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32
is an official encoding. FSR only differs from UTF-32 in that the
padding zeros are stripped off such that it is stored in the most
compact form that can handle all the characters in the string, which is
always known at string creation time. Now you can argue many things,
but to say FSR is not unicode compliant is quite a stretch! What
unicode entities or characters cannot be stored in strings using FSR?
What sequences of bytes in FSR result in invalid Unicode entities?

2. strings in Python *never change*. They are immutable. The +
operator always copies strings character by character into a new string
object, even if Python had used UTF-8 internally. If you're doing a lot
of string concatenations, perhaps you're using the wrong data type. A
byte buffer might be better for you, where you can stuff utf-8 sequences
into it to your heart's content.

3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that
slicing a string would be very very slow, and that's unacceptable for
the use cases of python strings. I'm assuming you understand big O
notation, as you talk of experience in many languages over the years.
FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any
variable-width encoding are always O(n). A lot slower!

4. Unicode is, well, unicode. You seem to hop all over the place from
talking about code points to bytes to bits, using them all
interchangeably. And now you seem to be claiming that a particular byte
encoding standard is by definition unicode (UTF-8). Or at least that's
how it sounds. And also claim FSR is not compliant with unicode
standards, which appears to me to be completely false.

Is my understanding of these things wrong?
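
As a small aside, a sketch of the byte-buffer idea in point 2 (illustrative only;
in practice ''.join() is the usual Python idiom for building a string out of many
pieces):

pieces = ['alpha ', '\u03b2\u03b3\u03b4 ', 'x' * 10]   # any mix of str chunks
buf = bytearray()
for piece in pieces:
    buf += piece.encode('utf-8')      # appending to a bytearray is cheap
text = buf.decode('utf-8')            # decode once at the end
assert text == ''.join(pieces)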

------

Compare these (a BDFL example, where I'm using a non-ascii char)

Py 3.2 (narrow build)
42

Tell me which one seems to be more "unicode compliant"?
The goal of Unicode is to handle every char "equally".

Now, the problem: memory. Do not forget that à la "FSR"
mechanism for a non-ascii user is *irrelevant*. As
soon as one uses one single non-ascii, your ascii feature
is lost. (That's why we have all these dedicated coding
schemes, utfs included).
12044

A bit secret. The larger a repertoire of characters
is, the more bits you need.
Secret #2. You can not escape from this.


jmf
 

wxjmfauth

On Sunday 28 July 2013 at 21:04:56 UTC+2, MRAB wrote:

1

One byte per codepoint.

1

Also one byte per codepoint.

12

Clearly there's more going on here.

FSR is an optimisation. You'll always be able to find some
circumstances where an optimisation makes things worse, but what
matters is the overall result.


----

Yes, I know my examples are always wrong, never
real examples.

I can point at long strings, I should point at short strings.
I point at a short string (char), it is not long enough.
Strings as dict keys, no, the problem is in the Python dict.
Performance? No, that's a memory issue.
Memory? No, it's a question of keeping performance.
I am using this char, no, you should not, it's not common.
The nabla operator in a TeX file, who is so stupid as to use
that char?
Many times, I'm just mimicking 'BDFL' examples, just
by replacing "his" ascii chars with non-ascii chars ;-)
And so on.

To be short, this is *never* the FSR, always something
else.

Suggestion. Start by solving all these "micro-benchmarks",
all the memory cases. It's a good start, no?


jmf
 

MRAB

On 28/07/2013 20:23, (e-mail address removed) wrote:
[snip]
Compare these (a BDFL example, where I'm using a non-ascii char)

Py 3.2 (narrow build)
Why are you using a narrow build of Python 3.2? It doesn't treat all
codepoints equally (those outside the BMP can't be stored in one code
unit) and, therefore, it isn't "Unicode compliant"!
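
The difference is easy to see from the interpreter (the 3.2 lines below assume a
narrow build): a supplementary character there is stored as a surrogate pair, so
it does not even count as a single code point.

s = '\U0001D11E'                      # MUSICAL SYMBOL G CLEF, outside the BMP
print(len(s))                         # 1 on Python 3.3 (PEP 393)
# On a 3.2 narrow build the same literal is stored as two code units:
#     >>> len('\U0001D11E')
#     2
#     >>> '\U0001D11E'[0]
#     '\ud834'                        # a lone surrogate, not a character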
 

Antoon Pardon

On 28-07-13 21:23, (e-mail address removed) wrote:
On Sunday 28 July 2013 at 17:52:47 UTC+2, Michael Torrie wrote:

------

Compare these (a BDFL example, where I'm using a non-ascii char)

Py 3.2 (narrow build)

42

Tell me which one seems to be more "unicode compliant"?

Can't tell, you give no relevant information on which one can decide
this question.
The goal of Unicode is to handle every char "equally".

Not to this kind of detail, which is looking at irrelevant
implementation details.
Now, the problem: memory. Do not forget that à la "FSR"
mechanism for a non-ascii user is *irrelevant*. As
soon as one uses one single non-ascii, your ascii feature
is lost. (That's why we have all these dedicated coding
schemes, utfs included).

So? Why should that trouble me? As far as I understand
whether I have an ascii string or not is totally irrelevant
to the application programmer. Within the application I
just process strings and let the programming environment
keep track of these details in a transparent way unless
you start looking at things like getsizeof, which gives
you implementation details that are mostly irrelevant
in deciding whether the behaviour is compliant or not.
12044

A bit secret. The larger a repertoire of characters
is, the more bits you need.
Secret #2. You can not escape from this.

And totally unimportant for deciding compliance.
 

Antoon Pardon

On 28-07-13 21:30, (e-mail address removed) wrote:
To be short, this is *never* the FSR, always something
else.

Suggestion. Start by solving all these "micro-benchmarks",
all the memory cases. It's a good start, no?

There is nothing to solve. Unicode doesn't force implementations
to use the same size of memory for strings of the same length.

So you pointing out examples of same length strings that don't
use the same size of memory doesn't point at something that
must be solved.
 

Lele Gaifax

Suggestion. Start by solving all these "micro-benchmarks",
all the memory cases. It's a good start, no?

Since you seem to be the only one who has this dramatic problem with such
micro-benchmarks, that BTW have nothing to do with "unicode compliance",
I'd suggest *you* should find a better implementation and propose it to
the core devs.

An even better suggestion, with due respect, is to get a life and find
something more interesting to do, or at least better arguments :)

ciao, lele.
 

Steven D'Aprano

Do not forget that à la "FSR" mechanism for a non-ascii user is
*irrelevant*.

You have been told repeatedly, Python's internals are *full* of ASCII-
only strings.

py> dir(list)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__dir__', '__doc__', '__eq__', '__format__', '__ge__',
'__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__',
'__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
'__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy',
'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

There's 45 ASCII-only strings right there, in only one built-in type, out
of dozens. There are dozens, hundreds of ASCII-only strings in Python:
builtin functions and classes, attributes, exceptions, internal
attributes, variable names, and so on.

You already know this, and yet you persist in repeating nonsense.
 

Joshua Landau


If it's done properly, garbage collection shouldn't hurt the *overall*
performance of the app; most of the issues with GC timing are when one
operation gets unexpectedly delayed for a GC run (making performance
measurement hard, and such). It should certainly never cause your
program to behave faultily, though I have seen cases where the GC run
appears to cause the program to crash - something like this:

some_string = buggy_call()
...
gc()
...
print(some_string)

The buggy call mucked up the reference count, so the gc run actually
wiped the string from memory - resulting in a segfault on next usage.
But the GC wasn't at fault, the original call was. (Which, btw, was
quite a debugging search, especially since the function in question
wasn't my code.)

GC does have sometimes severe impact in memory-constrained environments,
though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/,
about half-way down, specifically
http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png

The best verification of these graphs I could find was
https://blog.mozilla.org/nnethercote/category/garbage-collection/, although
it's not immediately clear in Chrome's and Opera's case mainly due to none
of the benchmarks pushing memory usage significantly.

I also don't quite agree with the first post (sealedabstract) because I get
by *fine* on 2GB of memory, so I don't see why you can't on a phone. Maybe iOS
is just really heavy. Nonetheless, the benchmarks aren't lying.
 

Chris Angelico

GC does have sometimes severe impact in memory-constrained environments,
though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/,
about half-way down, specifically
http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png.

The best verification of these graphs I could find was
https://blog.mozilla.org/nnethercote/category/garbage-collection/, although
it's not immediately clear in Chrome's and Opera's case mainly due to none
of the benchmarks pushing memory usage significantly.

I also don't quite agree with the first post (sealedabstract) because I get
by *fine* on 2GB of memory, so I don't see why you can't on a phone. Maybe iOS
is just really heavy. Nonetheless, the benchmarks aren't lying.

The ultimate in non-managed memory (the opposite of a GC) would have
to be the assembly language programming I did in my earlier days,
firing up DEBUG.EXE and writing a .COM file that lived inside a single
64KB segment for everything (256-byte Program Segment Prefix, then
code, then initialized data, then uninitialized data and stack),
crashing the computer with remarkable ease. Everything "higher level"
than that (even malloc/free) has its conveniences and its costs,
usually memory wastage. If you malloc random-sized blocks, free them
at random, and ensure that your total allocated size stays below some
limit, you'll still eventually run yourself out of memory. This is
unsurprising. The only question is, how bad is the wastage and how
much time gets devoted to it?

ChrisA
 

wxjmfauth

On Sunday 28 July 2013 at 22:52:16 UTC+2, Steven D'Aprano wrote:

Do not forget that à la "FSR" mechanism for a non-ascii user is
*irrelevant*.

You have been told repeatedly, Python's internals are *full* of ASCII-
only strings.

py> dir(list)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
'__dir__', '__doc__', '__eq__', '__format__', '__ge__',
'__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__',
'__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__',
'__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__',
'__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy',
'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

There's 45 ASCII-only strings right there, in only one built-in type, out
of dozens. There are dozens, hundreds of ASCII-only strings in Python:
builtin functions and classes, attributes, exceptions, internal
attributes, variable names, and so on.

You already know this, and yet you persist in repeating nonsense.
3.2
22.300465007102908

3.3
27.13981129541519

For the record, I am not posting your example to contradict
you. I was expecting such a result even before testing.

Now, if you do not understand why, you do not understand.
There's nothing wrong.

jmf
 

Chris Angelico

On Sunday 28 July 2013 at 22:52:16 UTC+2, Steven D'Aprano wrote:
3.2

3.2:
>>> len(dir(list))
42

3.3:
>>> len(dir(list))
45

Wonder if that might maybe have an impact on the timings.

ChrisA
 

Heiko Wundram

On 29.07.2013 13:43, (e-mail address removed) wrote:
3.2
27.13981129541519

For the record, I am not posting your example to contradict
you. I was expecting such a result even before testing.

Now, if you do not understand why, you do not understand.
There's nothing wrong.

Please give a single *proof* (not your gut feeling) that this is related
to the FSR, and not rather due to other side-effects such as changes in
how dir() works or (as Chris pointed out) due to more members on the
list type in 3.3. If you can't or won't give that proof, there's no
sense in continuing the discussion.
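
For what it is worth, a sketch of the kind of isolated measurement being asked
for (the chosen operation and sizes are arbitrary): time one fixed, pure-string
operation with timeit so that changes to dir() or to the list type cannot leak
into the numbers, then run the identical snippet on 3.2 and 3.3.

import timeit

setup = "s = 'abc' * 1000 + chr(0x20AC)"     # build the test string once, in setup
stmt = "s.replace('c', 'd')"                 # only this statement is timed
print(min(timeit.repeat(stmt, setup=setup, number=10000, repeat=5)))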
 

Devyn Collier Johnson

On 29.07.2013 13:43, (e-mail address removed) wrote:

Please give a single *proof* (not your gut feeling) that this is
related to the FSR, and not rather due to other side-effects such as
changes in how dir() works or (as Chris pointed out) due to more
members on the list type in 3.3. If you can't or won't give that
proof, there's no sense in continuing the discussion.
Wow! The RE Module thread I created is evolving into Unicode topics.
That thread grew up so fast!

DCJ
 
