# How can I make this piece of code even faster?

P

#### pablobarhamalzas

Ok, I'm working on a predator/prey simulation, which evolve using genetic algorithms. At the moment, they use a quite simple feed-forward neural network, which can change size over time. Each brain "tick" is performed by the following function (inside the Brain class):

def tick(self):
input_num = self.input_num
hidden_num = self.hidden_num
output_num = self.output_num

hidden = *hidden_num
output = *output_num

inputs = self.input
h_weight = self.h_weight
o_weight = self.o_weight

e = math.e

count = -1
for x in range(hidden_num):
temp = 0
for y in range(input_num):
count += 1
temp += inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**(-temp))

count = -1
for x in range(output_num):
temp = 0
for y in range(hidden_num):
count += 1
temp += hidden[y] * o_weight[count]
output[x] = 1/(1+e**(-temp))

self.output = output

The function is actually quite fast (~0.040 seconds per 200 calls, using 10input, 20 hidden and 3 output neurons), and used to be much slower untill I fiddled about with it a bit to make it faster. However, it is still somewhat slow for what I need it.

My question to you is if you an see any obvious (or not so obvious) way of making this faster. I've heard about numpy and have been reading about it, but I really can't see how it could be implemented here.

Cheers!

F

Ok, I'm working on a predator/prey simulation, which evolve using genetic
algorithms. At the moment, they use a quite simple feed-forward neural
network, which can change size over time. Each brain "tick" is performed by
the following function (inside the Brain class):

def tick(self):
input_num = self.input_num
hidden_num = self.hidden_num
output_num = self.output_num

hidden = *hidden_num
output = *output_num

inputs = self.input
h_weight = self.h_weight
o_weight = self.o_weight

e = math.e

count = -1
for x in range(hidden_num):
temp = 0
for y in range(input_num):
count += 1
temp += inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**(-temp))

count = -1
for x in range(output_num):
temp = 0
for y in range(hidden_num):
count += 1
temp += hidden[y] * o_weight[count]
output[x] = 1/(1+e**(-temp))

self.output = output

The function is actually quite fast (~0.040 seconds per 200 calls, using
10 input, 20 hidden and 3 output neurons), and used to be much slower
untill I fiddled about with it a bit to make it faster. However, it is
still somewhat slow for what I need it.

My question to you is if you an see any obvious (or not so obvious) way of
but I really can't see how it could be implemented here.

Cheers!

Low level optimizations:

If you're in Python 2.x (and not 3), you should use xrange() instead of
range(), or maybe even create a local variable and increment it and check
its value within a while (that way you can save a few instructions on
method invocations from xrange/range).

Anyways, if that's not fast enough, just port it to c/c++ (or one of the
alternatives to speed it up while still in python: numba, cython,
shedskin). Or (if you can), try to use PyPy and see if you get more speed
without doing anything.

Cheers,

Fabio

R

#### Roy Smith

Ok, I'm working on a predator/prey simulation, which evolve using genetic
algorithms. At the moment, they use a quite simple feed-forward neural
network, which can change size over time. Each brain "tick" is performed by
the following function (inside the Brain class):

def tick(self):
input_num = self.input_num
hidden_num = self.hidden_num
output_num = self.output_num

hidden = *hidden_num
output = *output_num

inputs = self.input
h_weight = self.h_weight
o_weight = self.o_weight

e = math.e

count = -1
for x in range(hidden_num):
temp = 0
for y in range(input_num):
count += 1
temp += inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**(-temp))

count = -1
for x in range(output_num):
temp = 0
for y in range(hidden_num):
count += 1
temp += hidden[y] * o_weight[count]
output[x] = 1/(1+e**(-temp))

self.output = output

The function is actually quite fast (~0.040 seconds per 200 calls, using 10
input, 20 hidden and 3 output neurons), and used to be much slower untill I
fiddled about with it a bit to make it faster. However, it is still somewhat
slow for what I need it.

My question to you is if you an see any obvious (or not so obvious) way of
but I really can't see how it could be implemented here.

First thing, I would add some instrumentation to see where the most time
is being spent. My guess is in the first set of nested loops, where the
inner loop gets executed hidden_num * input_num (i.e. 10 * 20 = 200)
times. But timing data is better than my guess.

Assuming I'm right, though, you do compute range(input_num) 20 times.
You don't need to do that. You might try xrange(), or you might just
factor out creating the list outside the outer loop. But, none of that
seems like it should make much difference.

What possible values can temp take? If it can only take certain
discrete values and you can enumerate them beforehand, you might want to
build a dict mapping temp -> 1/(1+e**(-temp)) and then all that math
becomes just a table lookup.

P

#### pablobarhamalzas

Hi there.
I'm using python 3, where xrange doesn't exist any more (range is now equivalent). And "temp" doesn't have any fixed discrete values it always takes.

I have tried cython but it doesn't seem to work well (maybe using it wrong?).

Any other ideas?

C

#### Chris Angelico

temp = 0
for y in range(input_num):
count += 1
temp += inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**(-temp))

It's a micro-optimization that'll probably have negligible effect, but
it can't hurt: Instead of adding to temp and raising e to -temp, carry
the value of temp as a negative number:

temp -= inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**temp)

Ditto in the second loop.

Not sure which way performance would go, but would it be more readable
to take an iterator for h_weight and o_weight? Something like this:

# Slot this into your existing structure
inputs = self.input
h_weight = iter(self.h_weight)
o_weight = iter(self.o_weight)

e = math.e

for x in range(hidden_num):
temp = 0
for y in inputs:
temp += y * next(h_weight)
hidden[x] = 1/(1+e**(-temp))

for x in range(output_num):
temp = 0
for y in hidden:
temp += y * next(o_weight)
output[x] = 1/(1+e**(-temp))
# End.

If that looks better, the next change I'd look to make is replacing
the 'for y' loops with sum() calls on generators:

temp = sum(y * next(o_weight) for y in hidden)

And finally replace the entire 'for x' loops with list comps... which
makes for two sizeable one-liners, which I like and many people
detest:

def tick(self):
inputs = self.inputs
h_weight = iter(self.h_weight)
o_weight = iter(self.o_weight)
e = math.e
hidden = [1/(1+e**sum(-y * next(h_weight) for y in inputs))
for _ in range(hidden_num)]
self.output = [1/(1+e**sum(-y * next(o_weight) for y in
hidden)) for _ in range(output_num)]

Up to you to decide whether you find that version more readable, or at
least sufficiently readable, and then to test performance But it's
shorter by quite a margin, which I personally like. Oh, and I'm
relying on you to make sure I've made the translation correctly, which
I can't confirm without a pile of input data to test it on. All I can
say is that it's syntactically correct.

ChrisA

P

#### pablobarhamalzas

Hi there Chris.
Unfortunately, using iterations was about twice as slow as the original implementation, so that's not the solution.
Thank's anyway.

C

#### Chris Angelico

Hi there Chris.
Unfortunately, using iterations was about twice as slow as the original implementation, so that's not the solution.
Thank's anyway.

Fascinating! Well, was worth a try anyhow. But that's a very surprising result.

ChrisA

S

#### Steven D'Aprano

On Sat, 20 Jul 2013 13:22:03 -0700, pablobarhamalzas asked:

"How can I make this piece of code even faster?"

- Use a faster computer.
- Put in more memory.
- If using Unix or Linux, decrease the "nice" priority of the process.

I mention these because sometimes people forget that if you have a choice
between "spend 10 hours at \$70 per hour to optimize code", and "spend
\$200 to put more memory in", putting more memory in may be more cost
effective.

Other than that, what you describe sounds like it could be a good
candidate for PyPy to speed the code up, although PyPy is still (mostly)
Python 2. You could take this question to the pypy mailing list and ask
there.

http://mail.python.org/mailman/listinfo/pypy-dev

You also might like to try Cython or Numba.

As far as pure-Python optimizations, once you have a decent algorithm,
there's probably not a lot of room for major speed ups. But a couple of
thoughts and a possible optimized version come to mind...

1) In general, it is better/faster to iterate over lists directly, than
indirectly by index number:

for item in sequence:
process(item)

rather than:

for i in range(len(sequence)):
item = sequence
process(item)

If you need both the index and the value:

for i, item in enumerate(sequence):
print(i, process(item))

In your specific case, if I have understood your code's logic, you can
just iterate directly over the appropriate lists, once each.

2) You perform an exponentiation using math.e**(-temp). You will probably
find that math.exp(-temp) is both faster and more accurate.

3) If you need to add numbers, it is better to call sum() or math.fsum()
than add them by hand. sum() may be a tiny bit faster, or maybe not, but
fsum() is more accurate for floats.

See below for my suggestion on an optimized version.

Ok, I'm working on a predator/prey simulation, which evolve using
genetic algorithms. At the moment, they use a quite simple feed-forward
neural network, which can change size over time. Each brain "tick" is
performed by the following function (inside the Brain class):

def tick(self):
input_num = self.input_num
hidden_num = self.hidden_num
output_num = self.output_num

hidden = *hidden_num
output = *output_num

inputs = self.input
h_weight = self.h_weight
o_weight = self.o_weight

e = math.e

count = -1
for x in range(hidden_num):
temp = 0
for y in range(input_num):
count += 1
temp += inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**(-temp))

count = -1
for x in range(output_num):
temp = 0
for y in range(hidden_num):
count += 1
temp += hidden[y] * o_weight[count]
output[x] = 1/(1+e**(-temp))

self.output = output

The function is actually quite fast (~0.040 seconds per 200 calls, using
10 input, 20 hidden and 3 output neurons), and used to be much slower
untill I fiddled about with it a bit to make it faster. However, it is
still somewhat slow for what I need it.

My question to you is if you an see any obvious (or not so obvious) way
of making this faster. I've heard about numpy and have been reading
about it, but I really can't see how it could be implemented here.

Here's my suggestion:

def tick(self):
exp = math.exp
sum = math.fsum # more accurate than builtin sum

# This assumes that both inputs and h_weight have exactly
# self.input_num values.
temp = fsum(i*w for (i, w) in zip(self.inputs, self.h_weight))
hidden = [1/(1+exp(-temp))]*self.hidden_num

# This assumes that both outputs and o_weight have exactly
# self.output_num values.
temp = fsum(o*w for (o, w) in zip(self.outputs, self.o_weight))
self.output = [1/(1+exp(-temp))]*self.output_num

I have neither tested that this works the same as your code (or even
works at all!) nor that it is faster, but I would expect that it will be
faster.

Good luck!

P

#### Peter Otten

Ok, I'm working on a predator/prey simulation, which evolve using genetic
algorithms. At the moment, they use a quite simple feed-forward neural
network, which can change size over time. Each brain "tick" is performed
by the following function (inside the Brain class):

def tick(self):
input_num = self.input_num
hidden_num = self.hidden_num
output_num = self.output_num

hidden = *hidden_num
output = *output_num

inputs = self.input
h_weight = self.h_weight
o_weight = self.o_weight

e = math.e

count = -1
for x in range(hidden_num):
temp = 0
for y in range(input_num):
count += 1
temp += inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**(-temp))

count = -1
for x in range(output_num):
temp = 0
for y in range(hidden_num):
count += 1
temp += hidden[y] * o_weight[count]
output[x] = 1/(1+e**(-temp))

self.output = output

The function is actually quite fast (~0.040 seconds per 200 calls, using
10 input, 20 hidden and 3 output neurons), and used to be much slower
untill I fiddled about with it a bit to make it faster. However, it is
still somewhat slow for what I need it.

My question to you is if you an see any obvious (or not so obvious) way of
but I really can't see how it could be implemented here.

Cheers!

Assuming every list is replaced with a numpy.array,

h_weight.shape == (hidden_num, input_num)
o_weight.shape == (output_num, hidden_num)

and as untested as it gets:

def tick(self):
temp = numpy.dot(self.inputs, self.h_weight)
hidden = 1/(1+numpy.exp(-temp))

temp = numpy.dot(hidden, self.o_weight)
self.output = 1/(1+numpy.exp(-temp))

My prediction: this is probably wrong, but if you can fix the code it will
be stinkin' fast S

#### Serhiy Storchaka

20.07.13 23:22, (e-mail address removed) Ð½Ð°Ð¿Ð¸ÑÐ°Ð²(Ð»Ð°):
e = math.e

count = -1
for x in range(hidden_num):
temp = 0
for y in range(input_num):
count += 1
temp += inputs[y] * h_weight[count]
hidden[x] = 1/(1+e**(-temp)) [...]
My question to you is if you an see any obvious (or not so obvious) way of making this faster.

1. Use math.exp() instead of math.e**.

2. I'm not sure that it will be faster, but try to use sum().

temp = sum(inputs[y] * h_weight[count + y] for y in range(input_num))
count += input_num

or

temp = sum(map(operator.mul, inputs, h_weight[count:count+input_num]))
count += input_num

P

#### Paul Rudin

Steven D'Aprano said:
On Sat, 20 Jul 2013 13:22:03 -0700, pablobarhamalzas asked:

"How can I make this piece of code even faster?"

- Use a faster computer.
- Put in more memory.
- If using Unix or Linux, decrease the "nice" priority of the process.

I mention these because sometimes people forget that if you have a choice
between "spend 10 hours at \$70 per hour to optimize code", and "spend
\$200 to put more memory in", putting more memory in may be more cost
effective.

Sure - but it's helpful if programmers understand a little bit about the
computational complexity of algorithms. If it's just a question of
making each basic step of your algorithm a bit faster, then it may well
be better to spend money on better hardware than on squeezing more out
of your code. OTOH if you've got an n^2 implementation and there's
actually an n.log n solution available then you should probably re-code.

Of course if what you've got is actually adequate for your use-case then
it maybe that you don't actually need to do anything at all...

C

#### Chris Angelico

Sure - but it's helpful if programmers understand a little bit about the
computational complexity of algorithms. If it's just a question of
making each basic step of your algorithm a bit faster, then it may well
be better to spend money on better hardware than on squeezing more out
of your code. OTOH if you've got an n^2 implementation and there's
actually an n.log n solution available then you should probably re-code.

I haven't analyzed every suggestion in this thread in detail, but I
don't think any of them affects the algorithmic complexity of the
code. They're all incremental changes.

ChrisA

P

#### pablobarhamalzas

Thank's for all the replies! I've tried some of the imporovements you suggested (using math.exp() and sum() or math.fsum()).
None of that made the code faster, because they are functions you are calling lots of times, and function calling is quite time expensive (same as x**(1/2) is faster than math.sqrt(x)).

I'm going to try to convert tu numpy now (I have no idea how to do it at the moment), thank's again to everyone.

S

#### Steven D'Aprano

Thank's for all the replies! I've tried some of the imporovements you
suggested (using math.exp() and sum() or math.fsum()). None of that made
the code faster, because they are functions you are calling lots of
times, and function calling is quite time expensive (same as x**(1/2) is
faster than math.sqrt(x)).

You are *badly* mistaken. Not only is sqrt more accurate, but it is also
much faster.

[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" "x**0.5"
1000000 loops, best of 3: 0.319 usec per loop
[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" -s "from math import
sqrt" "sqrt(x)"
10000000 loops, best of 3: 0.172 usec per loop

How exactly are you timing the code?

P

#### pablobarhamalzas

El domingo, 21 de julio de 2013 12:31:42 UTC+2, Steven D'Aprano escribió:
Thank's for all the replies! I've tried some of the imporovements you
suggested (using math.exp() and sum() or math.fsum()). None of that made
the code faster, because they are functions you are calling lots of
times, and function calling is quite time expensive (same as x**(1/2) is
faster than math.sqrt(x)).

You are *badly* mistaken. Not only is sqrt more accurate, but it is also

much faster.

[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" "x**0.5"

1000000 loops, best of 3: 0.319 usec per loop

[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" -s "from math import

sqrt" "sqrt(x)"

10000000 loops, best of 3: 0.172 usec per loop

How exactly are you timing the code?

I'm timing the whole program with cProfile. Removing math.sqrt() from a function and using **(1/2) instead cut the execution time for a significant amount (~0.035 to ~0.020). I can't see another explanation for the speed increase...

C

#### Chris Angelico

Thank's for all the replies! I've tried some of the imporovements you
suggested (using math.exp() and sum() or math.fsum()). None of that made
the code faster, because they are functions you are calling lots of
times, and function calling is quite time expensive (same as x**(1/2) is
faster than math.sqrt(x)).

You are *badly* mistaken. Not only is sqrt more accurate, but it is also
much faster.

[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" "x**0.5"
1000000 loops, best of 3: 0.319 usec per loop
[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" -s "from math import
sqrt" "sqrt(x)"
10000000 loops, best of 3: 0.172 usec per loop

Don't forget the cost of attribute lookup, which adds 50% to the
sqrt() figure. Still faster than exponentiation. (Figures from Python
3.4 alpha, but unlikely to be materially different.)

[email protected]:~\$ python3 -m timeit -s "x = 2.357e7" "x**0.5"
1000000 loops, best of 3: 0.239 usec per loop
[email protected]:~\$ python3 -m timeit -s "x = 2.357e7" -s "from math
import sqrt" "sqrt(x)"
10000000 loops, best of 3: 0.102 usec per loop
[email protected]:~\$ python3 -m timeit -s "x = 2.357e7" -s "import math"
"math.sqrt(x)"
10000000 loops, best of 3: 0.155 usec per loop

ChrisA

J

#### Joshua Landau

Ok, I'm working on a predator/prey simulation, which evolve using geneticalgorithms. At the moment, they use a quite simple feed-forward neural network, which can change size over time. Each brain "tick" is performed by the following function (inside the Brain class):
The function is actually quite fast (~0.040 seconds per 200 calls, using 10 input, 20 hidden and 3 output neurons), and used to be much slower untill I fiddled about with it a bit to make it faster. However, it is still somewhat slow for what I need it.

My question to you is if you an see any obvious (or not so obvious) way of making this faster. I've heard about numpy and have been reading about it, but I really can't see how it could be implemented here.

Currently we're just guessing; if you gave us an appropriate stand-in
for "self" (so that we can call the function) we could be helpful much
more easily.

S

#### Stefan Behnel

El domingo, 21 de julio de 2013 12:31:42 UTC+2, Steven D'Aprano escribiÃ³:
[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" "x**0.5"
1000000 loops, best of 3: 0.319 usec per loop
[[email protected] ~]\$ python3.3 -m timeit -s "x = 2.357e7" -s "from math import
sqrt" "sqrt(x)"
10000000 loops, best of 3: 0.172 usec per loop

How exactly are you timing the code?

I'm timing the whole program with cProfile. Removing math.sqrt() from a function and using **(1/2) instead cut the execution time for a significant amount (~0.035 to ~0.020).

With or without the profiler running? Note that profiling will slow down
your code (especially function calls), often significantly and sometimes
even in such an unbalanced way that it visibly changes its execution
profile. Always make sure you validate your code changes with benchmarks,
outside of the profiler.

Stefan

M

#### Michael Torrie

Thank's for all the replies! I've tried some of the imporovements you suggested (using math.exp() and sum() or math.fsum()).
None of that made the code faster, because they are functions you are calling lots of times, and function calling is quite time expensive (same as x**(1/2) is faster than math.sqrt(x)).

I'm going to try to convert tu numpy now (I have no idea how to do it at the moment), thank's again to everyone.

Perhaps you'd have better results if you'd post a runnable piece of
code. Otherwise we're just guessing since no one has the ability to