Performance of int/long in Python 3

rusi

    It's a reference to a comment by Jamie Zawinski (relatively famous
developer of Netscape Navigator and other things):

And XEmacs (which is famous in the free software world for other
things!)

    "Every program attempts to expand until it can read mail. Those
programs which cannot so expand are replaced by ones which can."

:) Ok got it
 
Steven D'Aprano

Sorting a million string list (all the file paths on a particular
computer) went from 0.4 seconds with Python 3.2 to 0.78 with 3.3 so
we're out of the 'not noticeable by humans' range. Perhaps this is still
a 'micro-benchmark' - I'd just like to avoid adding email access to get
this over the threshold.

I cannot confirm this performance regression. On my laptop (Debian Linux,
not Windows), I can sort a million file names in approximately 1.2
seconds in both Python 3.2 and 3.3. There is no meaningful difference in
speed between the two versions.
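
(For concreteness, a minimal sketch of the kind of micro-benchmark
being discussed, timing a plain sort of a million synthetic ASCII
paths; the data is made up, not either poster's real file list:)

import random, string, time

# Build a million fake ASCII paths, then time list.sort() on
# whichever Python version is being measured.
paths = ["/usr/" + "".join(random.choice(string.ascii_lowercase)
                           for _ in range(40))
         for _ in range(1000000)]
start = time.time()
paths.sort()
print("sort took %.2f seconds" % (time.time() - start))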
 
Terry Jan Reedy

What system, *and* what compiler and compiler options? Unless 3.2 and
3.3 are both compiled with the same compiler and settings, we do not
know the source of the difference.
I cannot confirm this performance regression. On my laptop (Debian Linux,
not Windows), I can sort a million file names in approximately 1.2
seconds in both Python 3.2 and 3.3. There is no meaningful difference in
speed between the two versions.

I am guessing that Neil's undisclosed system (as far as I can see) is
Windows, since other benchmarks have differed more on Windows than on
*nix. Given that we *know* that the 3.2 and 3.3 distributions are
compiled with different compilers and run with different C runtimes, it
is possible that some of the difference comes from that and not from
Python at all.

tjr
 
Chris Angelico

rusi wrote:
"Every program attempts to expand until it can read mail. Those programs
which cannot so expand are replaced by ones which can."

In my personal experience, it's calculators. I put command-line
calculators into *everything*... often in the form of more general
executors, and thus restricted to admins, but it's still a calculator.

For some reason, the ability to type "calc 1+2" and get back 3 is very
satisfying to me. You know, in case I ever forget what one plus two
makes.
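
(A toy sketch of such a command, illustrative only and not Chris's
actual code: it evaluates its command-line arguments as a Python
expression. The bare eval() is why his versions are admin-only.)

import sys

# calc.py: "python calc.py 1+2" prints 3.  Builtins are stripped so
# only plain expressions work; still not safe against hostile input.
print(eval(" ".join(sys.argv[1:]), {"__builtins__": {}}, {}))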

ChrisA
 
Chris Angelico

I cannot confirm this performance regression. On my laptop (Debian Linux,
not Windows), I can sort a million file names in approximately 1.2
seconds in both Python 3.2 and 3.3. There is no meaningful difference in
speed between the two versions.

I'd be curious to know the sorts of characters used. Given that it's
probably a narrow-vs-wide Python difference we're talking about here,
the actual distribution of codepoints may well make a difference.

ChrisA
 
Chris Angelico

Neil Hodgson:

I was going to upload it but then I thought of potential
client-confidentiality problems and the need to audit a list that long.

Hmm. I was about to say "Can you just do a quick collections.Counter()
of the string widths in 3.3, as an easy way of seeing which ones use
BMP or higher characters", but I can't find a simple way to query a
string's width. Can't see it as a method of the string object, nor in
the string or sys modules. It ought to be easy enough at the C level -
just look up the two bits representing 'kind' - but I've not found it
exposed to Python. Is there anything?

ChrisA
 
Ian Kelly

Hmm. I was about to say "Can you just do a quick collections.Counter()
of the string widths in 3.3, as an easy way of seeing which ones use
BMP or higher characters", but I can't find a simple way to query a
string's width. Can't see it as a method of the string object, nor in
the string or sys modules. It ought to be easy enough at the C level -
just look up the two bits representing 'kind' - but I've not found it
exposed to Python. Is there anything?

4 if max(map(ord, s)) > 0xffff else 2 if max(map(ord, s)) > 0xff else 1
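
(Combining Ian's expression with Chris's Counter idea gives a quick
census of widths; width() is a hypothetical helper, and paths is
assumed to be the list of strings under discussion:)

from collections import Counter

def width(s):
    # The widest code point determines the PEP 393 storage width.
    n = max(map(ord, s)) if s else 0
    return 4 if n > 0xffff else 2 if n > 0xff else 1

print(Counter(width(p) for p in paths))  # e.g. Counter({1: ..., 2: ...})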
 
Chris Angelico

4 if max(map(ord, s)) > 0xffff else 2 if max(map(ord, s)) > 0xff else 1

Yeah, that's iterating over the whole string (twice, if it isn't width
4). The system already knows what the size is; I was hoping for an
uber-quick inspection of the string header.

ChrisA
 
Steven D'Aprano

Yeah, that's iterating over the whole string (twice, if it isn't width
4).

Then don't write it as a one-liner :p

n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1


Here's another way:


(sys.getsizeof(s) - sys.getsizeof(''))/len(s)

should work.
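
(Worth sanity-checking, though: the headers of the different string
kinds are not all the same size in CPython 3.3, so the ratio is only
approximate for short strings. A quick check on four strings whose
widest characters are ASCII, Latin-1, BMP and astral respectively,
which would ideally print 1, 1, 2 and 4:)

import sys

for s in ("abcd", "\xe9bcd", "\u20acbcd", "\U0001f600bcd"):
    print(repr(s), (sys.getsizeof(s) - sys.getsizeof('')) / len(s))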


There's probably also a way to do it using ctypes.
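
(For the adventurous, a fragile sketch of the ctypes route. It peeks
at the 'state' bitfield of CPython 3.3's string header, relying on the
layout of PyASCIIObject; the shift-and-mask mirrors what the
PyUnicode_KIND macro does, as the 'shr edx, 2' / 'and edx, 7' in the
disassembly further down confirms. An implementation detail, not an
API:)

import ctypes

def str_kind(s):
    # 'state' sits after PyObject_HEAD (refcount + type pointer),
    # the length (Py_ssize_t) and the cached hash (Py_hash_t).
    assert isinstance(s, str)
    offset = (2 * ctypes.sizeof(ctypes.c_void_p)
              + 2 * ctypes.sizeof(ctypes.c_ssize_t))
    state = ctypes.c_uint.from_address(id(s) + offset).value
    return (state >> 2) & 7   # kind: 1, 2 or 4 bytes per character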


The system already knows what the size is; I was hoping for an
uber-quick inspection of the string header.

I'm not sure that I would want strings to have a method reporting this,
but it might be nice to have a function in the inspect module to do so.
 
Chris Angelico

Here's another way:


(sys.getsizeof(s) - sys.getsizeof(''))/len(s)

should work.

Hmm, I had been under the impression that there was a certain "base
length" below which strings all had the same size. Yes, that also
works; though again, it's something that can be directly queried, at
the C level.
There's probably also a way to do it using ctypes.


I'm not sure that I would want strings to have a method reporting this,
but it might be nice to have a function in the inspect module to do so.

Yeah, that's why I also looked in 'sys'; 'inspect' might well be a
good place for it, too. But it seems such a function doesn't exist,
which is what I was asking.

ChrisA
 
rusi

    Reran the programs taking a bit more care with the encoding of the
file. This had no effect on the speeds. There are only a small number of
paths that don't fit into ASCII:

ASCII 1076101
Latin1 218
BMP 113
Astral 0

# encoding:utf-8
import codecs, os, time
from os.path import join, getsize

with codecs.open("filelist.txt", "r", "utf-8") as f:
    paths = f.read().split("\n")

bucket = [0, 0, 0, 0]
for p in paths:
    if not p:
        continue  # splitting on "\n" leaves an empty final entry
    b = 0
    maxChar = max([ord(ch) for ch in p])
    if maxChar >= 65536:
        b = 3
    elif maxChar >= 256:
        b = 2
    elif maxChar >= 128:
        b = 1
    bucket[b] += 1

print("ASCII", bucket[0])
print("Latin1", bucket[1])
print("BMP", bucket[2])
print("Astral", bucket[3])

    Neil


Can you please try one more experiment, Neil?
Knock off all non-ASCII strings (paths) from your dataset and try
again.

[It should take little more than converting your above code to a
filter (see the sketch below):
if b == 0: print
if b > 0: ignore
]
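
(A sketch of that filter, reusing the shape of Neil's script;
filelist_ascii.txt is a hypothetical output name to feed back into the
sort timing:)

# encoding:utf-8
import codecs

with codecs.open("filelist.txt", "r", "utf-8") as f:
    paths = f.read().split("\n")

# Keep only the b == 0 paths: every character fits in ASCII.
ascii_only = [p for p in paths if p and max(ord(ch) for ch in p) < 128]

with codecs.open("filelist_ascii.txt", "w", "utf-8") as f:
    f.write("\n".join(ascii_only))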
 
Dave Angel

Neil Hodgson:

Results are the same 0.40 (well, 0.001 less but I don't think the
timer is that accurate) for Python 3.2 and 0.78 for Python 3.3.

Neil

That would seem to imply that the speed regression on your data is NOT
caused by the differing size encodings. Perhaps it is the difference in
MSC compiler version, or other changes made between 3.2 and 3.3.

Of course, I can't then explain why Steven didn't get the same results.
Perhaps the difference between 32-bit and 64-bit Python on Windows? Or
perhaps you have significantly more (or significantly fewer)
"collisions" than Steven did.


Before I saw this message, I was thinking of suggesting that you supply
a key= parameter to sort, specifying as the key a string in which every
character is the Unicode character 65536 higher than the original. That
way all the keys to be sorted would be 32 bits in size. If this made the
timings change noticeably, it could be a big clue.
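
(A sketch of that experiment; widen() is a hypothetical helper, and it
assumes no character in the data is within 0x10000 of the 0x10FFFF
ceiling, which holds for BMP-and-below file paths:)

def widen(s):
    # Shift every character up by 0x10000 so the key string must use
    # 4-byte (UCS-4) storage; the shift is monotonic, so the sort
    # order is unchanged.
    return "".join(chr(ord(c) + 0x10000) for c in s)

paths.sort(key=widen)   # compare the timing against plain paths.sort()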
 
Mark Lawrence

This FSR is wrong by design. A naive way to embrace Unicode.

jmf

The hole you're digging for yourself is getting bigger and bigger and
I'm loving it :)
 
Dave Angel

Dave Angel:
That would seem to imply that the speed regression on your data is NOT
caused by the differing size encodings. Perhaps it is the difference in
MSC compiler version, or other changes made between 3.2 and 3.3.

Neil Hodgson:
It's not caused by there actually being different size encodings, but by
the code checking the encoding size 2-4 times for each character.

Back in 3.2 the comparison loop looked like:

while (len1 > 0 && len2 > 0) {
    Py_UNICODE c1, c2;

    c1 = *s1++;
    c2 = *s2++;

    if (c1 != c2)
        return (c1 < c2) ? -1 : 1;

    len1--; len2--;
}

For 3.3 this has changed to:

for (i = 0; i < len1 && i < len2; ++i) {
    Py_UCS4 c1, c2;
    c1 = PyUnicode_READ(kind1, data1, i);
    c2 = PyUnicode_READ(kind2, data2, i);

    if (c1 != c2)
        return (c1 < c2) ? -1 : 1;
}

with PyUnicode_READ being

#define PyUnicode_READ(kind, data, index) \
    ((Py_UCS4) \
    ((kind) == PyUnicode_1BYTE_KIND ? \
        ((const Py_UCS1 *)(data))[(index)] : \
        ((kind) == PyUnicode_2BYTE_KIND ? \
            ((const Py_UCS2 *)(data))[(index)] : \
            ((const Py_UCS4 *)(data))[(index)] \
        ) \
    ))

There are either 1 or 2 kind checks in each call to PyUnicode_READ,
and 2 calls to PyUnicode_READ inside the loop. A compiler may decide to
move the kind checks out of the loop and specialize the loop, but MSVC
2010 appears not to do so.

I don't know how good MSC's template logic is, but it seems this would
be a good case for an explicit template, typed on the 'kind's values.
Or are all C++ features disabled when compiling Python? Failing that,
just code up 9 cases, and do a switch on the kinds.

I'm also puzzled. I thought that the sort algorithm used a hash of all
the items to be sorted, and only reverted to a raw comparison of the
original values when the hash collided. Is that not the case? Or is
the code you post here only used when the hash collides?


Neil Hodgson:
The assembler (32-bit build) for each PyUnicode_READ looks like:

mov ecx, DWORD PTR _kind1$[ebp]
cmp ecx, 1
jne SHORT $LN17@unicode_co@2
lea ecx, DWORD PTR [ebx+eax]
movzx edx, BYTE PTR [ecx+edx]
jmp SHORT $LN16@unicode_co@2
$LN17@unicode_co@2:
cmp ecx, 2
jne SHORT $LN15@unicode_co@2
movzx edx, WORD PTR [ebx+edi]
jmp SHORT $LN16@unicode_co@2
$LN15@unicode_co@2:
mov edx, DWORD PTR [ebx+esi]
$LN16@unicode_co@2:

It appears that the compiler is keeping the three pointers in three
separate registers (eax, esi and edi) even though those are 3 aliases
for the same pointer. This is preventing it from putting other values
in those registers.

It'd probably do better if the C code manipulated the pointers, rather
than using an index i each time. But if it did, perhaps gcc would
generate worse code.

If I were coding the assembler by hand (Intel only), I'd be able to
avoid the multiple cmp operations, simply by comparing first to 2, then
doing a jne and a ja. I dunno whether the compiler would notice if I
coded the equivalent in C. (make both comparisons to 2, one for less,
and one for more)
Neil Hodgson:
The kind1/kind2 variables aren't even going into registers, and at
least one test+branch and a jump are executed for every character (two
tests for the 2- and 4-byte kinds). len1 and len2 don't get to go into
registers either.

Here's the full assembler output for unicode_compare:

; COMDAT _unicode_compare
_TEXT SEGMENT
_kind2$ = -20 ; size = 4
_kind1$ = -16 ; size = 4
_len2$ = -12 ; size = 4
_len1$ = -8 ; size = 4
_data2$ = -4 ; size = 4
_unicode_compare PROC ; COMDAT
; _str1$ = ecx
; _str2$ = eax

; 10417: {

push ebp
mov ebp, esp
sub esp, 20 ; 00000014H
push ebx
push esi
mov esi, eax

; 10418: int kind1, kind2;
; 10419: void *data1, *data2;
; 10420: Py_ssize_t len1, len2, i;
; 10421:
; 10422: kind1 = PyUnicode_KIND(str1);

mov eax, DWORD PTR [ecx+16]
mov edx, eax
shr edx, 2
and edx, 7
push edi
mov DWORD PTR _kind1$[ebp], edx

; 10423: kind2 = PyUnicode_KIND(str2);

mov edx, DWORD PTR [esi+16]
mov edi, edx
shr edi, 2
and edi, 7
mov DWORD PTR _kind2$[ebp], edi

; 10424: data1 = PyUnicode_DATA(str1);

test al, 32 ; 00000020H
je SHORT $LN9@unicode_co@2
test al, 64 ; 00000040H
je SHORT $LN7@unicode_co@2
lea ebx, DWORD PTR [ecx+24]
jmp SHORT $LN10@unicode_co@2
$LN7@unicode_co@2:
lea ebx, DWORD PTR [ecx+36]
jmp SHORT $LN10@unicode_co@2
$LN9@unicode_co@2:
mov ebx, DWORD PTR [ecx+36]
$LN10@unicode_co@2:

; 10425: data2 = PyUnicode_DATA(str2);

test dl, 32 ; 00000020H
je SHORT $LN13@unicode_co@2
test dl, 64 ; 00000040H
je SHORT $LN11@unicode_co@2
lea edx, DWORD PTR [esi+24]
jmp SHORT $LN30@unicode_co@2
$LN11@unicode_co@2:
lea eax, DWORD PTR [esi+36]
mov DWORD PTR _data2$[ebp], eax
mov edx, eax
jmp SHORT $LN14@unicode_co@2
$LN13@unicode_co@2:
mov edx, DWORD PTR [esi+36]
$LN30@unicode_co@2:
mov DWORD PTR _data2$[ebp], edx
$LN14@unicode_co@2:

; 10426: len1 = PyUnicode_GET_LENGTH(str1);

mov edi, DWORD PTR [ecx+8]

; 10427: len2 = PyUnicode_GET_LENGTH(str2);

mov ecx, DWORD PTR [esi+8]

; 10428:
; 10429: for (i = 0; i < len1 && i < len2; ++i) {

xor eax, eax
mov DWORD PTR _len1$[ebp], edi
mov DWORD PTR _len2$[ebp], ecx
test edi, edi
jle SHORT $LN2@unicode_co@2

; 10426: len1 = PyUnicode_GET_LENGTH(str1);

mov esi, edx
mov edi, edx

; 10428:
; 10429: for (i = 0; i < len1 && i < len2; ++i) {

sub ebx, edx
jmp SHORT $LN4@unicode_co@2
$LL28@unicode_co@2:
mov edx, DWORD PTR _data2$[ebp]
$LN4@unicode_co@2:
cmp eax, ecx
jge SHORT $LN29@unicode_co@2

; 10430: Py_UCS4 c1, c2;
; 10431: c1 = PyUnicode_READ(kind1, data1, i);

mov ecx, DWORD PTR _kind1$[ebp]
cmp ecx, 1
jne SHORT $LN17@unicode_co@2
lea ecx, DWORD PTR [ebx+eax]
movzx edx, BYTE PTR [ecx+edx]
jmp SHORT $LN16@unicode_co@2
$LN17@unicode_co@2:
cmp ecx, 2
jne SHORT $LN15@unicode_co@2
movzx edx, WORD PTR [ebx+edi]
jmp SHORT $LN16@unicode_co@2
$LN15@unicode_co@2:
mov edx, DWORD PTR [ebx+esi]
$LN16@unicode_co@2:

; 10432: c2 = PyUnicode_READ(kind2, data2, i);

mov ecx, DWORD PTR _kind2$[ebp]
cmp ecx, 1
jne SHORT $LN21@unicode_co@2
mov ecx, DWORD PTR _data2$[ebp]
movzx ecx, BYTE PTR [eax+ecx]
jmp SHORT $LN20@unicode_co@2
$LN21@unicode_co@2:
cmp ecx, 2
jne SHORT $LN19@unicode_co@2
movzx ecx, WORD PTR [edi]
jmp SHORT $LN20@unicode_co@2
$LN19@unicode_co@2:
mov ecx, DWORD PTR [esi]
$LN20@unicode_co@2:

; 10433:
; 10434: if (c1 != c2)

cmp edx, ecx
jne SHORT $LN31@unicode_co@2
mov ecx, DWORD PTR _len2$[ebp]
inc eax
add edi, 2
add esi, 4
cmp eax, DWORD PTR _len1$[ebp]
jl SHORT $LL28@unicode_co@2
$LN29@unicode_co@2:
mov edi, DWORD PTR _len1$[ebp]
$LN2@unicode_co@2:

; 10436: }
; 10437:
; 10438: return (len1 < len2) ? -1 : (len1 != len2);

cmp edi, ecx
jge SHORT $LN23@unicode_co@2
pop edi
pop esi
or eax, -1
pop ebx

; 10439: }

mov esp, ebp
pop ebp
ret 0
$LN31@unicode_co@2:

; 10435: return (c1 < c2) ? -1 : 1;

sbb eax, eax
pop edi
and eax, -2 ; fffffffeH
pop esi
inc eax
pop ebx

; 10439: }

mov esp, ebp
pop ebp
ret 0
$LN23@unicode_co@2:

; 10436: }
; 10437:
; 10438: return (len1 < len2) ? -1 : (len1 != len2);

xor eax, eax
cmp edi, ecx
pop edi
pop esi
setne al
pop ebx

; 10439: }

mov esp, ebp
pop ebp
ret 0
_unicode_compare ENDP

Neil
 
Roy Smith

Neil Hodgson said:
About 2 minutes. But that's just getting an example data set. Other
data sets may be loaded more quickly from databases or files or be
created by processing. Reading the example data from a file takes around
the same time as sorting.

Fair enough. In fact, given that reading the file from disk is O(n) and
sorting it is O(n log n), at some point, the sort will totally swamp the
input time. Your original example just happened to be one of the
unusual cases where the sort time is not the rate limiting factor in the
overall process.

I remember reading somewhere that more CPU cycles in the entire history
of computing have been spent on sorting than on anything else.
 
Roy Smith

Chris Angelico said:
In my personal experience, it's calculators. I put command-line
calculators into *everything*... often in the form of more general
executors, and thus restricted to admins, but it's still a calculator.

For some reason, the ability to type "calc 1+2" and get back 3 is very
satisfying to me. You know, in case I ever forget what one plus two
makes.

I discovered recently that Spotlight (the OSX built-in search engine)
can do this.
 
Chris Angelico

Fair enough. In fact, given that reading the file from disk is O(n) and
sorting it is O(n log n), at some point, the sort will totally swamp the
input time.

But given the much larger fixed cost of disk access, that might take
an awful lot of strings...

ChrisA
 
Chris Angelico

I discovered recently that Spotlight (the OSX built-in search engine)
can do this.

Good feature, not surprising. Google Search has had that feature for a
while, and it just "feels right" to be able to look up information the
same way regardless of its source.

ChrisA
 
Roy Smith

Steven D'Aprano said:
Then don't write it as a one-liner :p

n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no? I posted (essentially) this
a few days ago:

if all(ord(c) <= 0xffff for c in s):
    return "it's all bmp"
else:
    return "it's got astral crap in it"

I'm reasonably sure all() is smart enough to stop at the first False
value.

(sys.getsizeof(s) - sys.getsizeof(''))/len(s)

I wouldn't trust getsizeof() to return exactly what you're looking for.
 
