Microbenchmark: Summing over array of doubles

  • Thread starter Yaroslav Bulatov

Yaroslav Bulatov

I made an array of 10 million floats and timed how long it takes to sum
the elements; here's what I got (in milliseconds):

gcc -O2: 21
Python with numarray: 104
Python with Numeric: 302
java: 325
gcc: 348
Python with Psyco: 1317
Pure Python using sum: 2312
Pure Python: 5631

http://yaroslav.hopto.org/russianwiki/index.php/c-numeric-speed

numarray overtakes Numeric at about 1100 floats.

I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.

Yaroslav
 

Duncan Booth

(e-mail address removed) (Yaroslav Bulatov) wrote in
I made an array of 10 million floats and timed how long it takes to sum
the elements; here's what I got (in milliseconds):

I just had a brief look at the code you posted. Are you not concerned about
accuracy in any of your calculations? Summing a 10 million element array by
simply accumulating each element into a running total is guaranteed to give
a lousy result.

Also, without running it I don't see how most of your Python examples
work, since you seem to call the timing function with the result of the
calculation, which doesn't make sense.
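
For illustration, a minimal sketch of the effect, using the decimal
module as an exact reference (the test value here is arbitrary):

from decimal import Decimal

xs = [0.123456789012345] * 10**6

naive = 0.0
for x in xs:                      # simple running total, as in the benchmark
    naive += x

exact = Decimal('0.123456789012345') * 10**6
print(abs(naive - float(exact)))  # the accumulated rounding error

Each addition rounds against an ever-larger accumulator, so the small
errors pile up instead of cancelling.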
 

Christopher T King

I just had a brief look at the code you posted. Are you not concerned about
accuracy in any of your calculations? Summing a 10 million element array by
simply accumulating each element into a running total is guaranteed to give
a lousy result.

Lousy or not, I believe that's how numarray is implemented internally, so
at least all the benchmarks are the same. If you want accuracy summing
that many numbers, you're going to have to do it in software (likely by
summing each mantissa bit individually and reconstructing the float
afterward), so it will be abysmally slow (obviously not what the OP
wants).
 

Christopher T King

I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.

If you're doing mostly vector calculations as opposed to summing, I've
been doing some work on adding SIMD support to numarray, with pleasing
results (around 2x speedups). I've also done some work adding local
parallel processing support to numarray, with not-so-pleasing results
(mostly due to Python overhead).

Regarding your results:

numarray should be just as fast as the -O2 C version. I was puzzled at
first as to where the speed discrepancy came from, but the culprit is
the -O2 flag: gcc -O2 notices that sum is never used, and thus removes
the loop entirely. As a matter of fact, there isn't even a fadd
instruction in the assembler output:

call clock
movl %eax, %esi
movl $9999999, %ebx
.L11:
decl %ebx
jns .L11
subl $16, %esp
call clock

As you can see, the 21ms you're seeing is the time spent counting down
from 9,999,999 to 0. To obtain correct results, add a line such as
'printf("%f\n",sum);' after the main loop in the C version. This will
force gcc to leave the actual calculation in place and give you accurate
results.

The above fix will likely render numarray faster than the C version.
Using gcc -O3 rather than gcc -O2 will give fairer results, as this is what
numarray uses.

Is there any reason why in the Python/numarray version, you use
Numeric's RandomArray rather than numarray.random_array? It shouldn't
affect your results, but it would speed up initialization time a bit.

There are a few inefficiencies in the pytime module (mostly involving
range() and *args/**kwargs), but I don't think they'll have too big an
impact on your results. Instead, I'd suggest running the numarray/Numeric
tests under Psyco to remove much of the Python overhead.

For completeness, I'd also suggest both running the Java version using a
JIT compiler such as Kaffe, and compiling it natively using gcj (the
latter should approach the speed of C).
 

Duncan Booth

Lousy or not, I believe that's how numarray is implemented internally,
so at least all the benchmarks are the same. If you want accuracy
summing that many numbers, you're going to have to do it in software
(likely by summing each mantissa bit individually and reconstructing
the float afterward), so it will be abysmally slow (obviously not what
the OP wants).

My point being that speed isn't everything. Most applications doing large
floating point calculations should be concerned about accuracy, and how not
to add up a large array of floating point numbers was one of the first
things I was taught in computer science classes. The fastest way to sum 10
million numbers (albeit at some considerable loss of accuracy):

return 0


The 'correct' way to sum a large array of floats is, of course, to
calculate a lot of partial sums and sum them. For example, naively, we
might say that to sum the array we sum the first half, sum the second half
and add the results together. This definition being recursive ensures that
if the inputs are all of a similar size then all the intermediate
calculations involve summing similarly sized values. A more sophisticated
technique might also try to handle the case where not all values are a
similar size.

If you are worried about speed, then calculating partial sums is also the
way to go: the naive technique used by the original poster leaves no scope
for parallel calculation. It should be possible to write a slightly more
complex loop in the C version that runs a lot faster on systems that are
capable of performing multiple floating point instructions in parallel.
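
A minimal recursive sketch of that naive halving scheme, in Python:

def pairwise_sum(arr, lo=0, hi=None):
    # Sum arr[lo:hi] by summing each half recursively, so every
    # addition combines two partial sums of similar magnitude.
    if hi is None:
        hi = len(arr)
    if hi - lo <= 2:
        return sum(arr[lo:hi])
    mid = (lo + hi) // 2
    return pairwise_sum(arr, lo, mid) + pairwise_sum(arr, mid, hi)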
 

beliavsky

I made an array of 10 million floats and timed how long it takes to sum
the elements; here's what I got (in milliseconds):

gcc -O2: 21
Python with numarray: 104
Python with Numeric: 302
java: 325
gcc: 348
Python with Psyco: 1317
Pure Python using sum: 2312
Pure Python: 5631

http://yaroslav.hopto.org/russianwiki/index.php/c-numeric-speed

numarray overtakes Numeric at about 1100 floats.

I'm doing intensive computation on arrays in Python, so if you have
suggestions on Python/C solutions that could push the envelope, please
let me know.

What hardware are you using?

On a 2.8 GHz Intel Pentium 4, a Fortran 95 code compiled with Compaq
Visual Fortran 6.6C using the -optimize:5 option takes 40
milliseconds. Probably you could improve the speed of the C program by
compiling it with the Intel C compiler. The Fortran program speed is
constrained by array accesses, not the additions (which are done in
hardware). A program to compute sum(xx**4) takes the same time as one
to compute sum(xx).

program xsum_double
! sum 1e7 doubles
  implicit none
  integer, parameter :: n = 10000000
  real(kind=8) :: xx(n)
  real(kind=8) :: xsum   ! accumulate in double precision to match the data
  real :: t1, t2
  call random_seed()
  call random_number(xx)
  call cpu_time(t1)
  xsum = sum(xx)
  call cpu_time(t2)
  print *, 1000*(t2-t1), xsum
end program xsum_double
 

Yaroslav Bulatov

Duncan Booth said:
My point being that speed isn't everything. Most applications doing large
floating point calculations should be concerned about accuracy, and how not
to add up a large array of floating point numbers was one of the first
things I was taught in computer science classes. The fastest way to sum 10
million numbers (albeit at some considerable loss of accuracy):

return 0


The 'correct' way to sum a large array of floats is, of course, to
calculate a lot of partial sums and sum them. For example, naively, we
might say that to sum the array we sum the first half, sum the second half
and add the results together. This definition being recursive ensures that
if the inputs are all of a similar size then all the intermediate
calculations involve summing similarly sized values. A more sophisticated
technique might also try to handle the case where not all values are a
similar size.

If you are worried about speed, then calculating partial sums is also the
way to go: the naive technique used by the original poster leaves no scope
for parallel calculation. It should be possible to write a slightly more
complex loop in the C version that runs a lot faster on systems that are
capable of performing multiple floating point instructions in parallel.

You are right, naive summing generates significant accuracy losses.
I estimated the error by summing [0.123456789012345]*10000000 and
comparing the result to 1234567.89012345. All methods have an error of
about 1e-4.

The method below sums the array at the same speed as a regular Python
sum loop, but gives an error < 1e-15.

from math import ceil, log

def sum2(arr):
    # Pairwise summation: on pass itr, add together elements 2**itr
    # apart, so every addition combines similarly sized partial sums.
    # Note: modifies arr in place; the total ends up in arr[0].
    size = len(arr)
    for itr in range(int(ceil(log(size)/log(2)))):
        for i in xrange(0, size, 2**(itr+1)):
            next_i = i + 2**itr
            if next_i < size:
                arr[i] += arr[next_i]
    return arr[0]
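
For example (note that sum2 modifies its argument in place, so pass a copy):

arr = [0.123456789012345] * 10**7
print(abs(sum2(list(arr)) - 1234567.89012345))  # far below the ~1e-4 naive error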


When arbitrary-precision numbers are being used, this method has
performance O(nd) vs. regular summing's O(n(d+log d)), where n is the
size of the array and d is the number of digits of the elements. In
practice I got about a 20% performance increase when summing an array
of 1000 Decimals using the method above.

A better estimate of the error difference would use random digits as
opposed to [x]*10000000, but I don't know how I would calculate the
exact answer in any reasonable amount of time (using Decimals takes
over a second for a 4000-element array, conversion included).

Yaroslav
 

Yaroslav Bulatov

Christopher T King said:
If you're doing mostly vector calculations as opposed to summing, I've
been doing some work on adding SIMD support to numarray, with pleasing
results (around 2x speedups). I've also done some work adding local
parallel processing support to numarray, with not-so-pleasing results
(mostly due to Python overhead).

Regarding your results:

numarray should be just as fast as the -O2 C version. I was puzzled at
first as to where the speed discrepancy came from, but the culprit is
the -O2 flag: gcc -O2 notices that sum is never used, and thus removes
the loop entirely. As a matter of fact, there isn't even a fadd
instruction in the assembler output:

call clock
movl %eax, %esi
movl $9999999, %ebx
.L11:
decl %ebx
jns .L11
subl $16, %esp
call clock

As you can see, the 21ms you're seeing is the time spent counting down
from 9,999,999 to 0. To obtain correct results, add a line such as
'printf("%f\n",sum);' after the main loop in the C version. This will
force gcc to leave the actual calculation in place and give you accurate
results.

The above fix will likely render numarray faster than the C version.
Using gcc -O3 rather than gcc -O2 will give fairer results, as this is what
numarray uses.

You are right, how silly of me! Fixing the script now gives a mean of
130 ms with a standard deviation of 8.42 ms, which is slower than
numarray (104 and 2.6, respectively). I wonder why numarray gives faster
results on such a simple task?

Is there any reason why in the Python/numarray version, you use
Numeric's RandomArray rather than numarray.random_array? It shouldn't
affect your results, but it would speed up initialization time a bit.

There isn't a good reason; I simply didn't know about
numarray.random_array.
 

Tim Peters

[Yaroslav Bulatov]
....
A better estimate of the error difference would use random digits as
opposed to [x]*10000000, but I don't know how I would calculate the
exact answer in any reasonable amount of time (using Decimals takes
over a second for a 4000-element array, conversion included).

Google on

floating point distillation

It's possible to compute the exact sum of a list of floats using float
arithmetic, where, ignoring overflow, the result is a minimal set of
fp numbers whose mathematical (infinite precision) sum exactly equals
the mathematical sum of the inputs.

In the worst case the best of these methods can require O(log N)
passes over the list of N inputs, but in real life they usually
converge to the result in two passes independent of N, sometimes with
a third pass but over a very short list of floats.
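
A minimal Python sketch of the idea, keeping a short list of exact
partials (this is the well-known cookbook recipe; it ignores overflow):

def distill(xs):
    # Maintain a list of non-overlapping partials whose exact
    # (infinite-precision) sum equals the exact sum of the inputs.
    partials = []
    for x in xs:
        i = 0
        for p in partials:
            if abs(x) < abs(p):
                x, p = p, x        # ensure |x| >= |p| for the step below
            hi = x + p
            lo = p - (hi - x)      # exact rounding error of hi = x + p
            if lo:
                partials[i] = lo
                i += 1
            x = hi
        partials[i:] = [x]
    return partials                # e.g. distill([1e100] + [1.0]*9) -> [9.0, 1e100]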

The better-known "Kahan summation" technique is essentially a
partially-broken computation of the two highest-order components of
one distillation pass.

A case like summing [1e100] + [1.0] * (N-1) shows that left-to-right
can deliver arbitrarily bad results (the fp sum of that is 1e100
regardless of N).
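
For contrast, a minimal sketch of Kahan summation, which carries only a
single correction term:

def kahan_sum(xs):
    total = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - c            # apply the correction carried from last time
        t = total + y        # low-order bits of y may be lost here
        c = (t - total) - y  # algebraically zero; recovers what was lost
        total = t
    return total

On the [1e100] + [1.0] * (N-1) case this still returns 1e100: the
compensation term accumulates the missing N-1, but there is no way to
fold it back into a total of that magnitude.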
 

Christopher T King

You are right, how silly of me! Fixing the script now results in 130
millis mean, 8.42 millis standard deviation, which is slower than
numarray (104, 2.6 respectively). I wonder why numarray gives faster
results on such a simple task?

The reason is that your C loop uses array indexing, whereas numarray's
loop increments the array pointer on each iteration and forgoes any
indexing. Even though this translates into about the same number of
opcodes on x86, the latter is slightly faster because it skips the
(implied) bit-shift operation. Also don't forget to compile your C
version with -O3, as this can make a big difference in that little loop
(maybe not in your case, though).

In principle, for simple access patterns like summation, etc., numarray
will always be faster than your "typical" C implementation, since its core
vector functions were designed to be very efficient. Of course, an
efficient C implementation will be as fast or faster, but you'd have to
use those efficiency tricks in every vector operation you perform, which
could be quite tedious without writing your own library. (The above
generalization of course only applies to very large arrays.)

Either way, your results are quite promising -- they show that Python with
numarray is as good or better than C when it comes to vector processing
applications, all while being easier to use. That's good publicity!
 
