Why is this loop-heavy code so slow in Python? Possible Project Euler spoilers


Bruno Desthuilliers

Ivan Wang wrote:
[snip code]

Thanks for that. I realise that improving the algorithm will speed
things up. I wanted to know why my less than perfect algorithm was so
much slower in python than exactly the same algorithm in C. Even when
turning off gcc's optimiser with the -O0 flag, the C version is still
100 times quicker.

Maybe Python is the slowest programming language in the world.
So there is a joke: some python hater said that "python" can only
crawl rather than run. :)

Python is slow because:
(1) dynamic binding

Yes.

(2) it is an interpreted language

Not quite. It's compiled to byte-code - just like Java (would you call
Java an 'interpreted language'?)
 

jwrweatherley

I'm pretty new to python, but am very happy with it. As well as using
it at work I've been using it to solve various puzzles on the Project
Euler site - http://projecteuler.net. So far it has not let me down,
but it has proved surprisingly slow on one puzzle.

The puzzle is: p is the perimeter of a right angle triangle with
integral length sides, {a,b,c}. For which value of p < 1000 is the
number of solutions {a,b,c} maximised?

Here's my python code:

#!/usr/local/bin/python

solutions = [0] * 1001
p = 0

for a in xrange(1, 1000):
    for b in xrange(1, 1000 - a):
        for c in xrange(1, 1000 - a - b):
            p = a + b + c
            if p < 1000:
                if a ** 2 + b ** 2 == c ** 2:
                    solutions[p] += 1

max = 0
maxIndex = 0
index = 0
for solution in solutions:
    if solution > max:
        max = solution
        maxIndex = index
    index += 1

print maxIndex


It takes 2 minutes and twelve seconds on a 2.4GHz Core2Duo MacBook
Pro. Surprised at how slow it was I implemented the same algorithm in
C:

#include <stdio.h>
#include <stdlib.h>

int main() {
    int* solutions = calloc(1000, sizeof(int));

    int p;
    for(int a = 1; a < 1000; ++a) {
        for(int b = 1; b < 1000 - a; ++b) {
            for(int c = 1; c < 1000 - a - b; ++c) {
                p = a + b + c;
                if(p < 1000) {
                    if(a * a + b * b == c * c) {
                        solutions[p] += 1;
                    }
                }
            }
        }
    }

    int max = 0;
    int maxIndex = 0;

    for(int i = 0; i < 1000; ++i) {
        if(solutions[i] > max) {
            max = solutions[i];
            maxIndex = i;
        }
    }
    printf("%d\n", maxIndex);
    return 0;
}


gcc -o 39 -std=c99 -O3 39.c

The resulting executable takes 0.24 seconds to run. I'm not expecting
a scripting language to run faster than native code, but I was
surprised at how much slower it was in this case. Any ideas as to what
is causing python so much trouble in the above code?
 

Arnaud Delobelle

[snip code]
The resulting executable takes 0.24 seconds to run. I'm not expecting
a scripting language to run faster than native code, but I was
surprised at how much slower it was in this case. Any ideas as to what
is causing python so much trouble in the above code?


from math import sqrt

solutions = [0] * 1001
p = 0

for a in xrange(1, 1000):
    a2 = a*a
    for b in xrange(1, 1000 - a):
        c = sqrt(a2 + b*b)
        if c == int(c) and a+b+c < 1000:
            solutions[a+b+int(c)] += 1

max = 0
maxIndex = 0
index = 0
for solution in solutions:
    if solution > max:
        max = solution
        maxIndex = index
    index += 1

print maxIndex
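One caveat worth hedging: sqrt() returns a float. For a, b < 1000 the
sums stay well within the range where IEEE doubles are exact, so the
c == int(c) test happens to be safe here, but a purely integer
verification sidesteps the question entirely. A minimal sketch of that
variant (same algorithm otherwise):

from math import sqrt

solutions = [0] * 1001

for a in xrange(1, 1000):
    a2 = a * a
    for b in xrange(1, 1000 - a):
        c2 = a2 + b * b
        c = int(sqrt(c2) + 0.5)     # nearest-integer candidate hypotenuse
        if c * c == c2 and a + b + c < 1000:    # confirm in exact integer arithmetic
            solutions[a + b + c] += 1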
 

Greg Copeland

[snip quoted code and Arnaud's revised implementation]


For the curious:

   O        O + P      A        A + P
=======   =======   =======   =======
2:22.56   0:25.65   0:00.75   0:00.20

O = Original Implementation
P = Psyco (psyco.full())
A = Arnaud's Revised Implementation
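Greg's exact Psyco harness isn't shown; psyco.full() is the real API,
but the solve() wrapper below is an assumption (Psyco mainly
accelerates code inside functions, so the loops are wrapped in one). A
plausible minimal arrangement:

import psyco
psyco.full()    # JIT-specialize every function defined from here on

def solve():
    solutions = [0] * 1001
    for a in xrange(1, 1000):
        for b in xrange(1, 1000 - a):
            for c in xrange(1, 1000 - a - b):
                p = a + b + c
                if p < 1000 and a ** 2 + b ** 2 == c ** 2:
                    solutions[p] += 1
    return solutions

solutions = solve()
print max(xrange(1001), key=solutions.__getitem__)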
 

rzed

(e-mail address removed) wrote in

[snip code]

for a in xrange(1, 1000):
    for b in xrange(1, 1000 - a):
        for c in xrange(1, 1000 - a - b):
            p = a + b + c
            if p < 1000:
                if a ** 2 + b ** 2 == c ** 2:
                    solutions[p] += 1

Once p >= 1000, it ain't goin' back. If you break out of the
innermost loop here after that happens, you'll save a bunch of
time.
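A sketch of that early exit in the innermost loop (worth hedging: with
the xrange bound used above, p never actually reaches 1000, so here the
break is a safety net; the pattern pays off whenever the bound is
looser):

for c in xrange(1, 1000 - a - b):
    p = a + b + c
    if p >= 1000:
        break    # p only grows with c; no later c can satisfy p < 1000
    if a ** 2 + b ** 2 == c ** 2:
        solutions[p] += 1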
[snip code]


It takes 2 minutes and twelve seconds on a 2.4GHz Core2Duo
MacBook Pro.
[...]
 

jwrweatherley

[snip code]

Thanks for that. I realise that improving the algorithm will speed
things up. I wanted to know why my less than perfect algorithm was so
much slower in python than exactly the same algorithm in C. Even when
turning off gcc's optimiser with the -O0 flag, the C version is still
100 times quicker.
 

Mark Dickinson

[snip code]

Thanks for that. I realise that improving the algorithm will speed
things up. I wanted to know why my less than perfect algorithm was so
much slower in python than exactly the same algorithm in C. Even when
turning off gcc's optimiser with the -O0 flag, the C version is still
100 times quicker.

Well, for one thing, you're creating half a million xrange objects in
the course of the search. All the C code has
to do is increment a few integers.
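(Half a million is easy to verify: one xrange for the a loop, 999 for
the b loops, and one per (a, b) pair for the c loops, i.e.
1 + 999 + 998*999/2 = 499,501 objects.)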

Mark
 

Ivan Wang

[snip code]

Thanks for that. I realise that improving the algorithm will speed
things up. I wanted to know why my less than perfect algorithm was so
much slower in python than exactly the same algorithm in C. Even when
turning off gcc's optimiser with the -O0 flag, the C version is still
100 times quicker.
Maybe Python is the slowest programming language in the world.
So there is a joke: some python hater said that "python" can only
crawl rather than run. :)

Python is slow because:
(1) dynamic binding
(2) it is an interpreted language

For example, in C an integer expression "a+b" compiles directly to an
"add" assembly instruction for x86 processors. A Python interpreter,
on the other hand, needs to detect the types of a and b first, then
choose the right "+" operation for them, and use an evaluation stack
to hold the result.

Psyco is a JIT compiler with just-in-time specialization, which can
mitigate both of these problems.
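You can watch the generic byte code CPython executes for "a+b" with the
standard dis module; a minimal illustration (output from CPython 2.x,
details vary by version):

import dis

def add(a, b):
    return a + b

# BINARY_ADD is generic: the interpreter re-discovers the operand types
# and dispatches to int addition on every single execution.
dis.dis(add)

which prints something like:

  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 RETURN_VALUE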
 

Arnaud Delobelle

On Sep 2, 12:51 pm, (e-mail address removed) wrote:
[...]
The resulting executable takes 0.24 seconds to run. I'm not expecting
a scripting language to run faster than native code, but I was
surprised at how much slower it was in this case. Any ideas as to what
is causing python so much trouble in the above code?

Sorry I didn't answer your question at all in my previous post
(although my modified method gets the answer in about 6s on a measly
PB G4 1GHz :).

Your little code snippet is just about the worst bit of code for
Python to be compared to C: three nested loops, only integer
arithmetic. That's not going to make Python shine!

Nevertheless, as you optimise your C snippet (-O3), there are probably
a few things that the compiler does for you:

* in the inner loop, a*a + b*b never changes, so it is probably
  stored in a register once and for all;
* in the second loop, a*a stays the same, so it might be hoisted out
  and calculated once and for all as well.

Doing the same hoisting by hand in Python gives this:

M = 1000
solutions = [0] * M

def f1():
    "Your original implementation"
    for a in xrange(1, M):
        for b in xrange(1, M - a):
            for c in xrange(1, M - a - b):
                if a**2 + b**2 == c**2:
                    solutions[a+b+c] += 1

def f2():
    "a*a + b*b precalculated"
    for a in xrange(1, M):
        a2 = a*a
        for b in xrange(1, M - a):
            s = a2 + b*b
            for c in xrange(1, M - a - b):
                if s == c*c:
                    solutions[a+b+c] += 1

It doesn't make it as quick as C of course, but more than halves the
execution time.
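That claim is easy to check with a throwaway harness along these lines
(hypothetical; wall-clock, one run per function):

import time

for f in (f1, f2):
    solutions[:] = [0] * M      # reset the shared table between runs
    t = time.time()
    f()
    print "%s: %.1fs" % (f.__name__, time.time() - t)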
 

Cameron Laird

[snip code]

Thanks for that. I realise that improving the algorithm will speed
things up. I wanted to know why my less than perfect algorithm was so
much slower in python than exactly the same algorithm in C. Even when
turning off gcc's optimiser with the -O0 flag, the C version is still
100 times quicker.

Well, for one thing, you're creating half a million xrange objects in
the course of the search. All the C code has
to do is increment a few integers.

Mark

Right: Mr. Dickinson's original question is entirely
legitimate, and it's not adequate to respond, as some
follow-ups did, with ways to improve the Python-coded
algorithm.

The correct answer, which I want to reinforce, is that
the exhibited Python and C versions are NOT "exactly
the same algorithm", at least not without more
qualification. Part of Python expertise is to recognize
that creation of xrange objects, mentioned above, is
far from free. Also, -O3 gives C the opportunity,
also remarked in a follow-up, to factor calculations
outside their loops.
 

Alex Martelli

Mark Dickinson said:
[snip code]

Thanks for that. I realise that improving the algorithm will speed
things up. I wanted to know why my less than perfect algorithm was so
much slower in python than exactly the same algorithm in C. Even when
turning off gcc's optimiser with the -O0 flag, the C version is still
100 times quicker.

Well, for one thing, you're creating half a million xrange objects in
the course of the search. All the C code has
to do is increment a few integers.

I don't think the creation of xrange objects is a meaningful part of
Python's execution time here. Consider:

M = 1000
solutions = [0] * M

def f2():
    "a*a + b*b precalculated"
    for a in xrange(1, M):
        a2 = a*a
        for b in xrange(1, M - a):
            s = a2 + b*b
            for c in xrange(1, M - a - b):
                if s == c*c:
                    solutions[a+b+c] += 1

def f3(M=M, solutions=solutions):
    "pull out all the stops"
    xrs = [xrange(1, k) for k in xrange(0, M+1)]
    for a in xrs[M]:
        a2 = a*a
        for b in xrs[M-a]:
            s = a2 + b*b
            for c in xrs[M-a-b]:
                if s == c*c:
                    solutions[a+b+c] += 1

import time

t = time.time()
f2()
e = time.time()
print e-t, max(xrange(M), key=solutions.__getitem__)

solutions = [0]*M
t = time.time()
f3(M, solutions)
e = time.time()
print e-t, max(xrange(M), key=solutions.__getitem__)


f2 is Arnaud's optimization of the OP's algorithm by simple hoisting; f3
further hoists the xrange creation -- it creates only 1000 such objects
rather than half a million. And yet...:

brain:~/py25 alex$ python puz.py
34.6613101959 840
36.2000119686 840
brain:~/py25 alex$

...which suggests that creating an xrange object is _cheaper_ than
indexing a list...
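That conclusion is easy to probe directly with timeit; a hypothetical
micro-benchmark (numbers will vary by machine and Python version):

import timeit

create = timeit.Timer("xrange(1, 1000)")
index = timeit.Timer("xrs[500]",
                     setup="xrs = [xrange(1, k) for k in xrange(1001)]")

# total seconds for one million iterations of each operation
print "create xrange:", create.timeit(1000000)
print "index list:   ", index.timeit(1000000)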


Alex
 

Alex Martelli

Paul Rubin said:
...which suggests that creating an xrange object is _cheaper_ than
indexing a list...

Why not re-use the xrange instead of keeping a list around?

Python 2.4.4 (#1, Oct 23 2006, 13:58:00)
>>> a = xrange(3)
>>> print list(a)
[0, 1, 2]
>>> print list(a)
[0, 1, 2]

Reusing xranges is exactly what my code was doing -- at each for loop
you need an xrange(1, k) for a different value of k, which is why you
need some container to keep them around (and a list of xrange objects is
the simplest applicable container).

Your suggestion doesn't appear to make any sense in the context of the
optimization problem at hand -- what list(...) calls are you thinking
of?! Please indicate how your suggestion would apply in the context of:

def f3(M=M, solutions=solutions):
    "pull out all the stops"
    xrs = [xrange(1, k) for k in xrange(0, M+1)]
    for a in xrs[M]:
        a2 = a*a
        for b in xrs[M-a]:
            s = a2 + b*b
            for c in xrs[M-a-b]:
                if s == c*c:
                    solutions[a+b+c] += 1


Alex
 

Paul Rubin

Reusing xranges is exactly what my code was doing

Oh cool, I missed that, I was just going by the text description.
Looking at the code, yes, it's re-using the xranges. Thanks.
 

Mark Dickinson

Mark Dickinson said:
Well, for one thing, you're creating half a million xrange objects in
the course of the search. All the C code has
to do is increment a few integers.

I don't think the creation of xrange objects is a meaningful part of
Python's execution time here. Consider:
[...]

Agreed---I just came to the same conclusion after doing some tests.
So maybe it's the billion or so integer objects being created that
dominate the running time? (Not sure how many integer objects
actually are created here: doesn't Python cache *some* small
integers?)

Mark
 
Martin v. Löwis

(2) it is an interpreted language

Not quite. It's compiled to byte-code - just like Java (would you call
Java an 'interpreted language'?)

Python is not implemented like Java. In Java (at least in HotSpot),
the byte code is further compiled to machine code before execution;
in Python, the byte code is interpreted.

Whether this makes Python an interpreter or a compiler,
I don't know.

Regards,
Martin
 

Alex Martelli

Mark Dickinson said:
Mark Dickinson said:
Well, for one thing, you're creating half a million xrange objects in
the course of the search. All the C code has
to do is increment a few integers.

I don't think the creation of xrange objects is a meaningful part of
Python's execution time here. Consider:
[...]

Agreed---I just came to the same conclusion after doing some tests.
So maybe it's the billion or so integer objects being created that
dominate the running time? (Not sure how many integer objects
actually are created here: doesn't Python cache *some* small
integers?)

Yep, some, say -5 to 100 or thereabouts; it also caches on a free-list
all the "empty" integer-objects it ever has (rather than returning the
memory to the system), so I don't think there's much optimization to be
had on that score either.
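The caching is easy to see from the interactive prompt; a quick
illustration (CPython-specific behaviour, not a language guarantee,
and the exact cached range varies by version):

>>> a = 100
>>> b = 10 * 10
>>> a is b        # small results come out of the cache: same object
True
>>> a = 1000
>>> b = 10 ** 3
>>> a is b        # larger results are freshly allocated objects
False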


Alex
 

Wildemar Wildenburger

Martin said:
Python is not implemented like Java. In Java (at least in HotSpot),
the byte code is further compiled to machine code before execution;
in Python, the byte code is interpreted.
OK, good. Naive question now coming to mind: why doesn't Python do the
latter as well?

/W
 

Mark Dickinson

The resulting executable takes 0.24 seconds to run. I'm not expecting
a scripting language to run faster than native code, but I was
surprised at how much slower it was in this case. Any ideas as to what
is causing python so much trouble in the above code?

Below is some code and some timings that show that Python is spending
at least 1/3 of its time computing the squares: a*a, b*b and c*c. So
to explain why Python is so much slower than C, it should be enough to
explain what's going on with these multiplications. Here's my attempt
at an explanation:

To compute a*a, Python has to do the following, all at run-time

(1) Find the type of a.
(2) Look up the corresponding multiplication method for this type.
(3) (In the multiplication method): find the type of the right-hand
side, in this case a again.
(4) Having established that both arguments are ints, worry about
whether the result is going to overflow (in which case a long needs to
be returned).
(5) Do the multiplication.
(6) Allocate a new Python integer to hold the result, and return it.

The C code only has to do step (5), and this boils down to a single
machine instruction.

Mark

Code and timings: (Python 2.5.1/G4.)

def test():
    solutions = [0] * 1000
    for a in xrange(1, 1000):
        for b in xrange(1, 1000 - a):
            for c in xrange(1, 1000 - a - b):
                if a*a + b*b == c*c:
                    solutions[a+b+c] += 1
    return solutions

def test2():
    solutions = [0] * 1000
    squares = [x*x for x in xrange(1000)]
    for a in xrange(1, 1000):
        for b in xrange(1, 1000 - a):
            for c in xrange(1, 1000 - a - b):
                if squares[a] + squares[b] == squares[c]:
                    solutions[a+b+c] += 1
    return solutions

         5 function calls in 299.984 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.010    0.010    0.010    0.010  :0(setprofile)
        1    0.000    0.000  299.973  299.973  <string>:1(<module>)
        1    0.000    0.000  299.984  299.984  profile:0(l = test(); l2 = test2())
        0    0.000             0.000           profile:0(profiler)
        1  182.262  182.262  182.262  182.262  test.py:1(test)
        1  117.711  117.711  117.711  117.711  test.py:10(test2)
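For anyone reproducing the table, it is the output of the stdlib
profiler; judging by the profile:0 line above, the run that produced it
is essentially:

import profile

# profile both versions in one statement; profiler overhead inflates
# the absolute numbers, so the test/test2 ratio is what matters
profile.run('l = test(); l2 = test2()')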
 

Steven D'Aprano

OK, good. Naive question now coming to mind: why doesn't Python do the
latter as well?

/W

There is no single version of Java, and the reference implementation runs
on a virtual machine just like Python. Today there are virtual machine
implementations of Java, native compilers, and Just In Time compilers for
Java, including HotSpot mentioned by Martin, but Java the language was
originally defined to run on a VM.

See, for example, here: http://schmidt.devlib.org/java/compilers.html

There are costs to native compilation, the biggest one of course being
the initial investment in time and effort in creating the native
compiler. Sun and other commercial companies have invested a lot of money
in Java, and I don't think the money invested in Python has come even
close. Consequently, any work on JIT compilation for Python has been done
by volunteers.

Nevertheless, we have Psyco, which is a JIT compiler of sorts; work also
continues on PyPy (Python implemented in Python) which, it is hoped, will
lead to faster Python implementations.

Part of the reason that native compilation for Python is hard is that
Python's dynamic object model makes it very difficult to apply the same
sorts of compiler optimizations that C and Java allow. Comparatively
little of the work done by Python can be safely pushed from run time to
compile time, so it is unlikely that the average Python application will
ever run as fast as the equivalent C code -- although that ignores the
question of what "the equivalent C code" could possibly mean. (If the C
code includes all the dynamic high-level features of Python, it too would
surely run as slowly as Python, and if it didn't, it can hardly be said
to be equivalent.) Nevertheless, by using something like Psyco parts of
your Python code can run at virtually the same speed as C.

A big question mark in my mind is Lisp, which according to aficionados is
just as dynamic as Python, but has native compilers that generate code
running as fast as highly optimized C. I'm not qualified to judge whether
the lessons learnt from Lisp can be applied to Python, but in any case
Lisp is an old, old language -- only Fortran is older. The amount of
development effort and money put into Lisp dwarfs that put into Python
by a factor of possibly a hundred or more.

So... if you'd like to see Python run as fast as C or Lisp, and you have
a few tens of millions of dollars spare to invest in development, I think
the Python Software Foundation would love to hear from you.
 
