RE Module Performance

Chris Angelico · Jul 31, 2013

if you care about minimizing every possible byte, you should
use a low-level language like C. Then you can give every character 21
bits, and be happy that you don't waste even one bit.

Could go better! Since not every character has been assigned, and some
are specifically banned (e.g. U+FFFE and U+D800-U+DFFF), you could cut
them out of your representation system and save memory!

ChrisA
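
As a rough sketch of where the 21-bit figure comes from, assuming Python 3.3 or later: the highest code point is U+10FFFF, and the surrogate range U+D800-U+DFFF is reserved for UTF-16, so the UTF-8 codec refuses it. The helper name can_encode_utf8 is made up for this illustration.

```python
import sys

# The highest Unicode code point is U+10FFFF, which needs 21 bits.
MAX_CODE_POINT = sys.maxunicode          # 0x10FFFF on Python 3.3+
bits_needed = MAX_CODE_POINT.bit_length()

def can_encode_utf8(code_point):
    """Hypothetical helper: True if the code point can be encoded as UTF-8."""
    try:
        chr(code_point).encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False

surrogate_ok = can_encode_utf8(0xD800)   # lone surrogates are rejected
letter_ok = can_encode_utf8(ord('a'))    # ordinary characters are fine
print(bits_needed, surrogate_ok, letter_ok)
```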

Antoon Pardon · Jul 31, 2013

On 31-07-13 05:30, Michael Torrie wrote:

I for one found it very interesting. In fact this thread caused me to
wonder how one actually does create an efficient editor. Off the
original topic true, but still very interesting.

Yes, it can be interesting. But I really think if that is what you want
to discuss, it deserves its own subject thread.

Antoon Pardon · Jul 31, 2013

On 30-07-13 21:09, (e-mail address removed) wrote:

Mutable, immutable, copying + xxx, buffering, O(n) ....
Yes, but conceptually the reencoding happens sometime, somewhere.

Which is a far cry from your previous claim that it happened
every time you enter a char.

This of course makes your case harder to argue, because the
impact of something that happens sometime, somewhere is
vastly less than that of something that happens every time
you enter a char.

The internal "ucs-2" will never automagically be transformed
into "ucs-4" (eg).

It will just start producing wrong results when someone starts
using characters that don't fit into ucs-2.
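
A small sketch of why a fixed 2-byte representation breaks on such characters, assuming Python 3: U+1D11E does not fit in 16 bits, so UTF-16 has to spend two code units (a surrogate pair) on it, which a strict UCS-2 store cannot express. The variable names are illustrative only.

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane.
clef = '\U0001d11e'

# A single 16-bit slot can hold values up to 0xFFFF only.
fits_in_16_bits = ord(clef) <= 0xFFFF

# UTF-16 represents it as a surrogate pair: two 16-bit code units.
utf16 = clef.encode('utf-16-be')
code_units = [int.from_bytes(utf16[i:i + 2], 'big')
              for i in range(0, len(utf16), 2)]

print(fits_in_16_bits)                 # False
print([hex(u) for u in code_units])    # ['0xd834', '0xdd1e']
```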

7.160483334521416

And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.

Nonsense.

18

wxjmfauth · Jul 31, 2013

FSR:
===

The 'a' in 'a€' and 'a\U0001d11e':

['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
'0b00000000', '0b00000001', '0b11010001', '0b00011110']

Has to be done.

sys.getsizeof('a€')
42
sys.getsizeof('a\U0001d11e')
48
sys.getsizeof('aa')
27
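
The sys.getsizeof figures above include the fixed object header, so they do not show the cost of one character. A minimal sketch of one way to cancel the header out, assuming CPython 3.3 or later (other implementations may store strings differently); per_char_bytes is a hypothetical helper name.

```python
import sys

def per_char_bytes(ch, n=1000):
    """Size difference between strings of n + 1 and n copies of ch.

    The fixed object header cancels out, leaving the per-character cost.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

for ch in ('a', '€', '\U0001d11e'):
    print(repr(ch), per_char_bytes(ch))
```

On CPython with the flexible string representation this should report 1, 2 and 4 bytes respectively, so no character costs more than 4 bytes.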

Unicode/utf*
============

i) ("primary key") Create and use a unique set of encoded
code points.
ii) ("secondary key") Depending on the wish,
memory/performance: utf-8/16/32

Two advantages in light of the above example:
iii) The "a" never has to be reencoded.
iv) The size of an "a" never exceeds 4 bytes.

Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

jmf

Antoon Pardon · Jul 31, 2013

On 31-07-13 10:32, (e-mail address removed) wrote:

Unicode/utf*
============

i) ("primary key") Create and use a unique set of encoded
code points.

FSR does this.

st1 = 'a€'
st2 = 'aa'
ord(st1[0])
97
ord(st2[0])
97

ii) ("secondary key") Depending on the wish,
memory/performance: utf-8/16/32

Whose wish? I don't know of any language that allows the
programmer to choose the internal representation of its
strings. If it is the designer's choice, FSR does this;
if it is the programmer's choice, I don't see why
this is necessary for compliance.

Two advantages in light of the above example:
iii) The "a" never has to be reencoded.

FSR: check. Using a container with wider slots is not a reëncoding.
If such widening is encoding then your 'choice' between utf-8/16/32
implies that it will also have to reencode when it changes from
utf-8 to utf-16 or utf-32.
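
One way to see the difference between widening and reencoding, sketched under an assumption: the fixed-width codecs latin-1, utf-16-be and utf-32-be stand in here for 1-, 2- and 4-byte slots. That approximates the idea and is not a dump of CPython's actual internal layout. The value of 'a' is only zero-padded, whereas UTF-8 really transforms the bits of '€'.

```python
a = 'a'
slots = {
    1: a.encode('latin-1'),
    2: a.encode('utf-16-be'),
    4: a.encode('utf-32-be'),
}

# Every width holds the same integer, 97, padded with leading zero bytes.
values = {width: int.from_bytes(raw, 'big') for width, raw in slots.items()}
print(values)

# UTF-8 is different: a non-ASCII code point is transformed, not padded.
euro_utf8 = '€'.encode('utf-8')
print(hex(ord('€')), euro_utf8)
```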

iv) The size of an "a" never exceeds 4 bytes.

FSR: check.

Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

Maybe you should use bytes or bytearrays if that is really what you want.

Michael Torrie · Jul 31, 2013

On 31-07-13 05:30, Michael Torrie wrote:

Yes, it can be interesting. But I really think if that is what you want
to discuss, it deserves its own subject thread.

Subject lines can and should be changed to reflect the ebbs and flows of
the discussion.

In fact this thread's subject should have been changed a long time ago
since the original topic was RE module performance!

Michael Torrie · Jul 31, 2013

Unicode/utf*

Why do you keep using the terms "utf" and "Unicode" interchangeably?
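
To make the distinction concrete, a short sketch assuming Python 3: Unicode assigns the code points, while UTF-8, UTF-16 and UTF-32 are different byte serializations of those same code points. The sample string is arbitrary.

```python
text = 'a€\U0001d11e'

# Unicode: the abstract code points, independent of any byte layout.
code_points = [ord(c) for c in text]

# UTF-8/16/32: three different byte serializations of the same code points.
encoded = {name: text.encode(name)
           for name in ('utf-8', 'utf-16-be', 'utf-32-be')}
lengths = {name: len(raw) for name, raw in encoded.items()}

# Decoding any of them yields the identical string.
round_trips = all(raw.decode(name) == text for name, raw in encoded.items())

print([hex(cp) for cp in code_points])
print(lengths, round_trips)
```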

wxjmfauth · Jul 31, 2013

On Wednesday, 31 July 2013 at 07:45:18 UTC+2, Steven D'Aprano wrote:

Neither character above is larger than 4 bytes. You forgot to deduct the
size of the object header. Python is a high-level object-oriented
language; if you care about minimizing every possible byte, you should
use a low-level language like C. Then you can give every character 21
bits, and be happy that you don't waste even one bit.

.... char never consumes or requires more than 4 bytes ...

jmf

Chris Angelico · Jul 31, 2013

... char never consumes or requires more than 4 bytes ...

The integer 5 should be able to be stored in 3 bits.
14

Clearly Python is doing something really horribly wrong here. In fact,
sys.getsizeof needs to be changed to return a float, to allow it to
more properly reflect these important facts.

ChrisA
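
The same comparison can be sketched for the integer 5, assuming CPython: the value itself needs 3 bits, while sys.getsizeof reports the whole object, header included. The exact byte count depends on the build, so none is asserted here.

```python
import sys

n = 5
payload_bits = n.bit_length()       # bits needed for the value itself
object_bytes = sys.getsizeof(n)     # whole Python object, header included

print(payload_bits, object_bytes)
```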
