Microbenchmark: Summing over array of doubles

Discussion in 'Python' started by Yaroslav Bulatov, Aug 1, 2004.

  1. I made an array of 10 million floats and timed how long it takes to
    sum the elements; here's what I got (milliseconds):

    gcc -O2: 21
    Python with numarray: 104
    Python with Numeric: 302
    java: 325
    gcc: 348
    Python with Psyco: 1317
    Pure Python using sum: 2312
    Pure Python: 5631

    http://yaroslav.hopto.org/russianwiki/index.php/c-numeric-speed

    numarray overtakes Numeric at about 1,100 floats.

    I'm doing intensive computation on arrays in Python, so if you have
    suggestions on Python/C solutions that could push the envelope, please
    let me know.

    Yaroslav
     
    Yaroslav Bulatov, Aug 1, 2004
    #1

  2. Duncan Booth (Guest)

    Yaroslav Bulatov wrote:

    > I made an array of 10 million floats and timed how long it takes to
    > sum the elements; here's what I got (milliseconds):


    I just had a brief look at the code you posted. Are you not concerned about
    accuracy in any of your calculations? Summing a 10 million element array by
    simply accumulating each element into a running total is guaranteed to give
    a lousy result.
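
    A quick way to see the effect (my own illustration, not code from this
    thread): repeatedly add a value with no exact binary representation
    and compare against the mathematically exact total.

    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        int i;
        for (i = 0; i < 10000000; i++)
            sum += 0.1;            /* 0.1 is not exactly representable */
        /* prints a value visibly different from 1000000.000000 */
        printf("%.6f\n", sum);
        return 0;
    }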

    Also, without running it I don't see how most of your Python examples
    work, since you seem to call the timing function with the result of
    the calculation, which doesn't make sense.
     
    Duncan Booth, Aug 1, 2004
    #2

  3. On 1 Aug 2004, Duncan Booth wrote:

    > I just had a brief look at the code you posted. Are you not concerned about
    > accuracy in any of your calculations? Summing a 10 million element array by
    > simply accumulating each element into a running total is guaranteed to give
    > a lousy result.


    Lousy or not, I believe that's how numarray is implemented internally, so
    at least all the benchmarks are the same. If you want accuracy summing
    that many numbers, you're going to have to do it in software (likely by
    summing each mantissa bit individually and reconstructing the float
    afterward), so it will be abysmally slow (obviously not what the OP
    wants).
     
    Christopher T King, Aug 1, 2004
    #3
  4. On 31 Jul 2004, Yaroslav Bulatov wrote:

    > I'm doing intensive computation on arrays in Python, so if you have
    > suggestions on Python/C solutions that could push the envelope, please
    > let me know.


    If you're doing mostly vector calculations as opposed to summing, I've
    been doing some work on adding SIMD support to numarray, with pleasing
    results (around 2x speedups). I've also done some work adding local
    parallel processing support to numarray, with not-so-pleasing results
    (mostly due to Python overhead).

    Regarding your results:

    numarray should be just as fast as the -O2 C version. I was puzzled at
    first as to where the speed discrepancy came from, but the culprit is in
    the -O2 flag: gcc -O2 notices that sum is never used, and thus removes
    the loop entirely. As a matter of fact, there isn't even any fadd
    instruction in the assembler output:

    call clock
    movl %eax, %esi
    movl $9999999, %ebx
    .L11:
    decl %ebx
    jns .L11
    subl $16, %esp
    call clock

    As you can see, the 21ms you're seeing is the time spent counting down
    from 9,999,999 to 0. To obtain correct results, add a line such as
    'printf("%f\n",sum);' after the main loop in the C version. This will
    force gcc to leave the actual calculation in place and give you accurate
    results.

    The above fix will likely render numarray faster than the C version.
    Using gcc -O3 rather than gcc -O2 will get fairer results, as this is what
    numarray uses.
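
    A complete corrected version might look like this (my sketch, not the
    posted script):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double sum = 0.0;
        clock_t t0, t1;
        int i;
        /* fill the array with random doubles in [0, 1] */
        for (i = 0; i < N; i++)
            a[i] = rand() / (double)RAND_MAX;
        t0 = clock();
        for (i = 0; i < N; i++)
            sum += a[i];            /* the timed loop */
        t1 = clock();
        /* printing sum forces gcc to keep the loop */
        printf("%ld ms  sum=%f\n",
               (long)(t1 - t0) * 1000 / CLOCKS_PER_SEC, sum);
        free(a);
        return 0;
    }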

    Is there any reason why in the Python/numarray version, you use
    Numeric's RandomArray rather than numarray.random_array? It shouldn't
    affect your results, but it would speed up initialization time a bit.

    There are a few inefficiencies in the pytime module (mostly involving
    range() and *args/**kwargs), but I don't think they'll have too big of an
    impact on your results. Instead, I'd suggest running the numarray/Numeric
    tests using Psyco to remove much of the Python overhead.

    For completeness, I'd also suggest both running the Java version using a
    JIT compiler such as Kaffe, and compiling it natively using gcj (the
    latter should approach the speed of C).
     
    Christopher T King, Aug 2, 2004
    #4
  5. Duncan Booth (Guest)

    Christopher T King wrote:

    > On 1 Aug 2004, Duncan Booth wrote:
    >
    >> I just had a brief look at the code you posted. Are you not concerned
    >> about accuracy in any of your calculations? Summing a 10 million
    >> element array by simply accumulating each element into a running
    >> total is guaranteed to give a lousy result.

    >
    > Lousy or not, I believe that's how numarray is implemented internally,
    > so at least all the benchmarks are the same. If you want accuracy
    > summing that many numbers, you're going to have to do it in software
    > (likely by summing each mantissa bit individually and reconstructing
    > the float afterward), so it will be abysmally slow (obviously not what
    > the OP wants).
    >


    My point is that speed isn't everything. Most applications doing large
    floating point calculations should be concerned about accuracy, and how not
    to add up a large array of floating point numbers was one of the first
    things I was taught in computer science classes. The fastest way to sum 10
    million numbers (albeit at some considerable loss of accuracy):

    return 0


    The 'correct' way to sum a large array of floats is, of course, to
    calculate a lot of partial sums and sum them. For example, naively, we
    might say that to sum the array we sum the first half, sum the second half
    and add the results together. This definition being recursive ensures that
    if the inputs are all of a similar size then all the intermediate
    calculations involve summing similarly sized values. A more sophisticated
    technique might also try to handle the case where not all values are a
    similar size.
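
    A sketch of that recursive scheme in C (an illustration, not code
    posted in this thread):

    /* pairwise summation: sum each half, then add the two halves */
    double pairwise_sum(const double *a, int n) {
        double s;
        int i, half;
        if (n <= 8) {               /* small base case: plain loop */
            s = 0.0;
            for (i = 0; i < n; i++)
                s += a[i];
            return s;
        }
        half = n / 2;
        return pairwise_sum(a, half) + pairwise_sum(a + half, n - half);
    }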

    If you are worried about speed, then calculating partial sums is also the
    way to go: the naive technique used by the original poster leaves no scope
    for parallel calculation. It should be possible to write a slightly more
    complex loop in the C version that runs a lot faster on systems that are
    capable of performing multiple floating point instructions in parallel.
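
    For instance, a loop with two independent running totals (a sketch;
    how much it helps depends on the FPU pipeline):

    double sum_two_acc(const double *a, int n) {
        double s0 = 0.0, s1 = 0.0;
        int i;
        /* the two adds have no data dependency, so they can overlap */
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n)
            s0 += a[i];             /* odd leftover element */
        return s0 + s1;
    }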
     
    Duncan Booth, Aug 2, 2004
    #5
  6. Guest

    Yaroslav Bulatov wrote:
    > I made an array of 10 million floats and timed how long it takes to
    > sum the elements; here's what I got (milliseconds):
    >
    > gcc -O2: 21
    > Python with numarray: 104
    > Python with Numeric: 302
    > java: 325
    > gcc: 348
    > Python with Psyco: 1317
    > Pure Python using sum: 2312
    > Pure Python: 5631
    >
    > http://yaroslav.hopto.org/russianwiki/index.php/c-numeric-speed
    >
    > numarray overtakes Numeric at about 1,100 floats.
    >
    > I'm doing intensive computation on arrays in Python, so if you have
    > suggestions on Python/C solutions that could push the envelope, please
    > let me know.


    What hardware are you using?

    On a 2.8 GHz Intel Pentium 4, a Fortran 95 code compiled with Compaq
    Visual Fortran 6.6C using the -optimize:5 option takes 40
    milliseconds. Probably you could improve the speed of the C program by
    compiling it with the Intel C compiler. The Fortran program speed is
    constrained by array accesses, not the additions (which are done in
    hardware). A program to compute sum(xx**4) takes the same time as one
    to compute sum(xx).

    program xsum_double
    ! sum 1e7 doubles
    implicit none
    integer, parameter :: n = 10000000
    real(kind=8) :: xx(n)
    real(kind=8) :: xsum  ! double, so the result of sum(xx) is not truncated
    real :: t1, t2
    call random_seed()
    call random_number(xx)
    call cpu_time(t1)
    xsum = sum(xx)
    call cpu_time(t2)
    print*, 1000*(t2-t1), xsum
    end program xsum_double
     
    , Aug 2, 2004
    #6
  7. Duncan Booth wrote:
    > Christopher T King wrote:
    >
    > > On 1 Aug 2004, Duncan Booth wrote:
    > >
    > >> I just had a brief look at the code you posted. Are you not concerned
    > >> about accuracy in any of your calculations? Summing a 10 million
    > >> element array by simply accumulating each element into a running
    > >> total is guaranteed to give a lousy result.

    > >
    > > Lousy or not, I believe that's how numarray is implemented internally,
    > > so at least all the benchmarks are the same. If you want accuracy
    > > summing that many numbers, you're going to have to do it in software
    > > (likely by summing each mantissa bit individually and reconstructing
    > > the float afterward), so it will be abysmally slow (obviously not what
    > > the OP wants).
    > >

    >
    > My point is that speed isn't everything. Most applications doing large
    > floating point calculations should be concerned about accuracy, and how not
    > to add up a large array of floating point numbers was one of the first
    > things I was taught in computer science classes. The fastest way to sum 10
    > million numbers (albeit at some considerable loss of accuracy):
    >
    > return 0
    >
    >
    > The 'correct' way to sum a large array of floats is, of course, to
    > calculate a lot of partial sums and sum them. For example, naively, we
    > might say that to sum the array we sum the first half, sum the second half
    > and add the results together. This definition being recursive ensures that
    > if the inputs are all of a similar size then all the intermediate
    > calculations involve summing similarly sized values. A more sophisticated
    > technique might also try to handle the case where not all values are a
    > similar size.
    >
    > If you are worried about speed, then calculating partial sums is also the
    > way to go: the naive technique used by the original poster leaves no scope
    > for parallel calculation. It should be possible to write a slightly more
    > complex loop in the C version that runs a lot faster on systems that are
    > capable of performing multiple floating point instructions in parallel.


    You are right; naive summing generates significant accuracy losses. I
    estimated the error by summing [0.123456789012345]*1000000 and
    comparing the result to 1234567.89012345. All methods have an error of
    about 1e-4.

    The method below sums the array at the same speed as a regular Python
    sum loop, but has an error of less than 1e-15.

    from math import ceil, log

    def sum2(arr):
        # In-place pairwise summation: on each pass, add the element
        # 2**itr to the right into the element at i, halving the number
        # of live partial sums. Note that this overwrites arr.
        size = len(arr)
        for itr in range(int(ceil(log(size)/log(2)))):
            for i in xrange(0, size, 2**(itr+1)):
                next_i = i + 2**itr
                if next_i < size:
                    arr[i] += arr[next_i]
        return arr[0]


    When arbitrary-precision numbers are used, this method has performance
    O(nd) vs. O(n(d + log d)) for regular summing, where n is the size of
    the array and d is the number of digits of the elements. In practice I
    got about a 20% performance increase when summing an array of 1000
    Decimals using the method above.

    A better estimate of the error would use random values rather than
    [x]*10000000, but I don't know how I would calculate the exact answer
    in any reasonable amount of time. (Using Decimals takes over a second
    for a 4000-element array, including conversion.)

    Yaroslav
     
    Yaroslav Bulatov, Aug 3, 2004
    #7
  8. Christopher T King wrote:
    > On 31 Jul 2004, Yaroslav Bulatov wrote:
    >
    > > I'm doing intensive computation on arrays in Python, so if you have
    > > suggestions on Python/C solutions that could push the envelope, please
    > > let me know.

    >
    > If you're doing mostly vector calculations as opposed to summing, I've
    > been doing some work on adding SIMD support to numarray, with pleasing
    > results (around 2x speedups). I've also done some work adding local
    > parallel processing support to numarray, with not-so-pleasing results
    > (mostly due to Python overhead).
    >
    > Regarding your results:
    >
    > numarray should be just as fast as the -O2 C version. I was puzzled at
    > first as to where the speed discrepancy came from, but the culprit is in
    > the -O2 flag: gcc -O2 notices that sum is never used, and thus removes
    > the loop entirely. As a matter of fact, there isn't even any fadd
    > instruction in the assembler output:
    >
    > call clock
    > movl %eax, %esi
    > movl $9999999, %ebx
    > .L11:
    > decl %ebx
    > jns .L11
    > subl $16, %esp
    > call clock
    >
    > As you can see, the 21ms you're seeing is the time spent counting down
    > from 9,999,999 to 0. To obtain correct results, add a line such as
    > 'printf("%f\n",sum);' after the main loop in the C version. This will
    > force gcc to leave the actual calculation in place and give you accurate
    > results.
    >
    > The above fix will likely render numarray faster than the C version.
    > Using gcc -O3 rather than gcc -O2 will get fairer results, as this is what
    > numarray uses.


    You are right, how silly of me! Fixing the script now gives a mean of
    130 ms (standard deviation 8.42 ms), which is slower than numarray
    (104 and 2.6 respectively). I wonder why numarray gives faster
    results on such a simple task?


    > Is there any reason why in the Python/numarray version, you use
    > Numeric's RandomArray rather than numarray.random_array? It shouldn't
    > affect your results, but it would speed up initialization time a bit.


    There isn't a good reason; I simply didn't know about
    numarray.random_array.

    >
    > There are a few inefficiencies in the pytime module (mostly involving
    > range() and *args/**kwargs), but I don't think they'll have too big of an
    > impact on your results. Instead, I'd suggest running the numarray/Numeric
    > tests using Psyco to remove much of the Python overhead.
    >
    > For completeness, I'd also suggest both running the Java version using a
    > JIT compiler such as Kaffe, and compiling it natively using gcj (the
    > latter should approach the speed of C).
     
    Yaroslav Bulatov, Aug 3, 2004
    #8
  9. Tim Peters (Guest)

    [Yaroslav Bulatov]
    ....
    > A better estimate of error difference would use random digits as
    > opposed to [x]*10000000, but I don't know how I would calculate exact
    > answer in any reasonable amount of time. (using Decimals takes over a
    > second for 4000 array (with conversion))


    Google on

    floating point distillation

    It's possible to compute the exact sum of a list of floats using float
    arithmetic, where, ignoring overflow, the result is a minimal set of
    fp numbers whose mathematical (infinite precision) sum exactly equals
    the mathematical sum of the inputs.

    In the worst case the best of these methods can require O(log N)
    passes over the list of N inputs, but in real life they usually
    converge to the result in two passes independent of N, sometimes with
    a third pass but over a very short list of floats.
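
    The building block of such a pass is an exact two-term sum; this is
    the textbook version (Knuth's TwoSum), not code from the papers being
    alluded to:

    /* s + e == a + b exactly, where s = fl(a + b) and e is the
       rounding error of that one addition */
    void two_sum(double a, double b, double *s, double *e) {
        double t;
        *s = a + b;
        t = *s - a;
        *e = (a - (*s - t)) + (b - t);
    }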

    The better-known "Kahan summation" technique is essentially a
    partially-broken computation of the two highest-order components of
    one distillation pass.
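
    For reference, Kahan summation is short; a standard rendering (note
    that aggressive floating point flags such as gcc's -ffast-math can
    optimize the correction away):

    /* compensated (Kahan) summation: c carries the low-order bits
       lost by each addition and feeds them back into the next one */
    double kahan_sum(const double *a, int n) {
        double sum = 0.0, c = 0.0;
        double y, t;
        int i;
        for (i = 0; i < n; i++) {
            y = a[i] - c;           /* apply the stored correction    */
            t = sum + y;            /* big + small: low bits are lost */
            c = (t - sum) - y;      /* recover what was just lost     */
            sum = t;
        }
        return sum;
    }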

    A case like summing [1e100] + [1.0] * (N-1) shows that left-to-right
    can deliver arbitrarily bad results (the fp sum of that is 1e100
    regardless of N).
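
    That one is easy to reproduce:

    #include <stdio.h>

    int main(void) {
        double s = 1e100;
        long i;
        for (i = 0; i < 1000000; i++)
            s += 1.0;       /* each 1.0 is far below 1e100's precision */
        printf("%g\n", s);  /* prints 1e+100: the million 1.0s vanish  */
        return 0;
    }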
     
    Tim Peters, Aug 3, 2004
    #9
  10. On 2 Aug 2004, Yaroslav Bulatov wrote:

    > You are right, how silly of me! Fixing the script now gives a mean of
    > 130 ms (standard deviation 8.42 ms), which is slower than numarray
    > (104 and 2.6 respectively). I wonder why numarray gives faster
    > results on such a simple task?


    The reason is that your C loop uses array indexing, whereas numarray's
    loop increments the array pointer on each iteration and forgoes any
    indexing. Even though this translates into about the same number of
    opcodes on x86, the latter is slightly faster because it skips the
    (implied) bit-shift operation. Also, don't forget to compile your C
    version with -O3, as this can make a big difference in that little
    loop (though maybe not in your case).
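
    Roughly, these are the two loop shapes being compared (my sketch):

    /* indexed: each access computes base + i*8 */
    double sum_indexed(const double *a, int n) {
        double sum = 0.0;
        int i;
        for (i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* pointer walk: just dereference and bump the pointer */
    double sum_pointer(const double *a, int n) {
        double sum = 0.0;
        const double *p;
        for (p = a; p < a + n; p++)
            sum += *p;
        return sum;
    }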

    In principle, for simple access patterns like summation, numarray will
    always be faster than your "typical" C implementation, since its core
    vector functions were designed to be very efficient. Of course, an
    efficient C implementation will be as fast or faster, but you'd have
    to use those efficiency tricks in every vector operation you perform,
    which could be quite tedious without writing your own library. (The
    above generalization of course applies only to very large arrays.)

    Either way, your results are quite promising -- they show that Python with
    numarray is as good or better than C when it comes to vector processing
    applications, all while being easier to use. That's good publicity!
     
    Christopher T King, Aug 3, 2004
    #10
