64 bit add and accumulate with MMX

Discussion in 'Perl' started by Brian K. Michalk, Sep 20, 2003.

  1. I have a perl app that is calculating the standard deviation of a 4000
    element 16 bit integer array, that has large dynamic content. I.e,
    the range spans a significant portion of the 16 bits.

    I am trying to increase the performance of this critical loop, and
    I've found that I am exceeding the 32 bit registers causing Perl to
    switch to an infinite precision math library. I've rewritten the loop
    such that I now have only one term that overflows 32 bits.

    I am at the point of inlining assembler to speed up this loop.
    However, the MMX terminology is confusing. I see there are 64bit MMX
    registers that can do multiply and accumulate, but I don't see any
    64bit add commands. My loop is something like this:
    for (i=0; i<4000; i++) { sum += array; }
    Of course my loop is slightly bigger than this, but even this toy
    example has terrible performance on real data when sum exceeds 32
    bits.

    Before I go into the time of inlining C (and assembler), are the MMX
    registers cabable of accumulating a 64 bit register with adding a
    16(or 32) bit operand?

    By the way, I've already tried the GMP library, and it did not help.
    I also recompiled everything with the MMX optimizations turned on. Oh
    yeah, this is on Linux.
     
    Brian K. Michalk, Sep 20, 2003
    #1
    1. Advertising

  2. Brian K. Michalk wrote:

    > I have a perl app that is calculating the standard deviation of a 4000
    > element 16 bit integer array, that has large dynamic content. I.e,
    > the range spans a significant portion of the 16 bits.
    >
    > I am trying to increase the performance of this critical loop, and
    > I've found that I am exceeding the 32 bit registers causing Perl to
    > switch to an infinite precision math library. I've rewritten the loop
    > such that I now have only one term that overflows 32 bits.


    Sorry, but that's bogus:

    The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
    cannot have more than 16+12 = 28 bits.

    There must be some other problem.

    Terje

    --
    - <>
    "almost all programming can be viewed as an exercise in caching"
     
    Terje Mathisen, Sep 20, 2003
    #2
    1. Advertising

  3. (Brian K. Michalk) writes:

    > I have a perl app that is calculating the standard deviation of a 4000
    > element 16 bit integer array, that has large dynamic content. I.e,
    > the range spans a significant portion of the 16 bits.
    >
    > I am trying to increase the performance of this critical loop, and
    > I've found that I am exceeding the 32 bit registers causing Perl to
    > switch to an infinite precision math library. I've rewritten the loop
    > such that I now have only one term that overflows 32 bits.
    >
    > I am at the point of inlining assembler to speed up this loop.
    > However, the MMX terminology is confusing. I see there are 64bit MMX
    > registers that can do multiply and accumulate, but I don't see any
    > 64bit add commands. My loop is something like this:
    > for (i=0; i<4000; i++) { sum += array; }
    > Of course my loop is slightly bigger than this, but even this toy
    > example has terrible performance on real data when sum exceeds 32
    > bits.
    >
    > Before I go into the time of inlining C (and assembler), are the MMX
    > registers cabable of accumulating a 64 bit register with adding a
    > 16(or 32) bit operand?
    >
    > By the way, I've already tried the GMP library, and it did not help.
    > I also recompiled everything with the MMX optimizations turned on. Oh
    > yeah, this is on Linux.


    Sounds like you need a better algorithm for calculating the standard
    deviation. The Welford update formula updates the mean (which can be 16
    bits) rather than the usual formula of sum/n (which needs to be 16+log2(n)
    bits)

    For example, have a look at this code

    From
    http://www.cs.trinity.edu/~joldham0/1321/lectures/hashing/mean-var-Welford.pl

    > #! /usr/bin/perl -w
    > ##
    > ## Oldham, Jeffrey D.
    > ## 2000Mar13
    > ## util
    > ##
    > ## Compute the Mean and Variance Using Welford's Formula
    >
    > ## Compute the mean and variance of a stream of floating-point
    > ## numbers. The numbers should appear at the beginning of each line.
    > $mean = 0.0;
    > $variance = 0.0;
    > $count = 0;
    >
    > # Compute the mean and maximum.
    > while (<ARGV>) {
    > ($number, $_) = split;
    > ++$count;
    > $diff = $number - $mean;
    > $mean += $diff / $count;
    > $variance += $diff * $diff * ($count -1) / $count;
    > }
    > $variance /= $count;
    >
    > printf("The mean is %g.\n", $mean);
    > printf("The variance is %g.\n", $variance);
    > printf("The standard deviation is %g.\n", sqrt($variance));




    Also if you're desperate for speed then, since the standard deviation is
    only an estimate, you can double the speed of your code by calculating the
    standard deviation of every other number (ie use 2000 of them) and it will
    be pretty good estimate. You could probably just use a few hundred samples
    and still get a very good approximation.



    P.S.
    If you do code this algorithm in MMX, please post it here as more people
    should understand how to calculate means and SDs with this more accurate
    formula.



    --
    "Thinks: I can't think of a thinks. End of thinks routine": Blue Bottle

    ** Aunty Spam says: Remove the trailing x from the To: field to reply **
     
    Ian McConnell, Sep 20, 2003
    #3
  4. Terje Mathisen <> wrote in message news:<bkh34u$801$>...
    > Brian K. Michalk wrote:
    >
    > > I have a perl app that is calculating the standard deviation of a 4000
    > > element 16 bit integer array, that has large dynamic content. I.e,
    > > the range spans a significant portion of the 16 bits.
    > >
    > > I am trying to increase the performance of this critical loop, and
    > > I've found that I am exceeding the 32 bit registers causing Perl to
    > > switch to an infinite precision math library. I've rewritten the loop
    > > such that I now have only one term that overflows 32 bits.

    >
    > Sorry, but that's bogus:
    >
    > The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
    > cannot have more than 16+12 = 28 bits.
    >
    > There must be some other problem.
    >
    > Terje


    Uhh, you are right. Problem in my example. I am indeed working with
    addition of 32 bit integers, the A/D data is still 16 bits. The loop
    is perhaps a little too complicated here, but it is a trace of 4000
    points. I have a 400 point averaging (high pass) window that iterates
    over the entire 4000 element trace. My goal is to find an A/D
    amplitude(alpha) such that 5% of the points in the averaging window
    are above alpha.

    Brute force works just fine, except that it's a really bad choice. If
    I take the standard deviation of the moving window, and then apply the
    z-transform for 5%, out pops the alpha that I need to filter the data.
    It's an order(N) filter, and is blazingly fast for low amplitude
    signals, because they will stay within 32 bit operations. For large
    deviations, the math exceeds 32 bits. Perhaps there is another
    "almost as good" method for calculating standard deviation?

    I'm using(integer math):
    $std = sqrt(
    ($n*$sumsq)-($sum**2))/
    ($n**2)
    )
    Which I have reduced in the interest of overflows to:
    $std = sqrt(
    ($sumsq/$n)-(($sum/$n)**2)
    )

    $sum = 0;
    $sumsq = 0;
    &some_routine_that_primes_filter_here(); # prime sum and sumsq
    # now iterate ofer the trace
    for ($i=0; $i<3950; $i++) {
    $sum += $ary[$i];
    $sum -= $ary[$i+50];
    $sumsq += ($ary[$i]**2);
    $sumsq -= ($ary[$i+50]**2);
    # calculate standard deviation here
    # take the z-transform here
    # use the result to filter here
    }

    But, my question still stands on the MMX issue. Will MMX registers do
    this? It's more of an assembler question, but if theres an
    alternative that will keep the data in 32 bits I'm interested in that
    as well.
     
    Brian K. Michalk, Sep 20, 2003
    #4
  5. JRS: In article <bkh34u$801$>, seen in
    news:comp.lang.asm.x86, Terje Mathisen <>
    posted at Sat, 20 Sep 2003 10:29:17 :-
    >Brian K. Michalk wrote:
    >
    >> I have a perl app that is calculating the standard deviation of a 4000
    >> element 16 bit integer array, that has large dynamic content. I.e,
    >> the range spans a significant portion of the 16 bits.
    >>
    >> I am trying to increase the performance of this critical loop, and
    >> I've found that I am exceeding the 32 bit registers causing Perl to
    >> switch to an infinite precision math library. I've rewritten the loop
    >> such that I now have only one term that overflows 32 bits.

    >
    >Sorry, but that's bogus:
    >
    >The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
    >cannot have more than 16+12 = 28 bits.
    >
    >There must be some other problem.


    Standard Deviation is Root-Mean-Square deviation, corrected for
    statistics. Therefore 2*16 + 12, = 44 bits. The 16 is actually 15,
    because the sign does not matter in squaring.

    The standard deviation of the data is exactly defined. But if it is to
    be used as an estimate of the SD of an infinite population from which
    the 4000 were drawn, then it has an uncertainty of about 1 in root 4000,
    say 1 in 64. In that case, a small additional systematic/random error
    may be tolerable.

    The SD will be dominated by the larger entries. Take, therefore, 256
    times the SD of the upper bytes (taken as -128..127) of the data,
    possibly rounding in the MSB of the lower byte. We now have less than
    2*8 + 12, <32, bits.

    Needs :
    (a) Test it; emulation in a convenient language will do. If plausible,
    (b) consult a professional statistician or two.


    --
    © John Stockton, Surrey, UK. ?@merlyn.demon.co.uk DOS 3.3, 6.20; Win98. ©
    Web <URL:http://www.merlyn.demon.co.uk/> - FAQqish topics, acronyms & links.
    PAS EXE TXT ZIP via <URL:http://www.merlyn.demon.co.uk/programs/00index.htm>
    My DOS <URL:http://www.merlyn.demon.co.uk/batfiles.htm> - also batprogs.htm.
     
    Dr John Stockton, Sep 20, 2003
    #5
  6. Brian K. Michalk

    Matt Taylor Guest

    MMX does not support 64-bit additions. SSE 2 adds the paddq instruction
    which can do 64-bit addition with MMX registers, but it is only available on
    Pentium 4 and Opteron/Athlon64.

    You can do very fast CPU-neutral 64-bit arithmetic with add/adc and sub/sbb.
    If Perl is compiled for 64-bits, it should already be capable of this,
    though I'm not sure if that means it would use 64-bit arithmetic for
    everything or just things that need it. Certainly multiply, divide, and
    square root are non-trivial for MMX.

    If you were to write this routine in assembly, then I would accumulate the
    integers using general registers and use the x87 with 80-bit precision for a
    "perfect" square root. Any C compiler that supports long long should be able
    to do the first half, and any C compiler that uses 10-byte long doubles can
    do the latter half.

    -Matt

    "Brian K. Michalk" <> wrote in message
    news:...
    > Terje Mathisen <> wrote in message

    news:<bkh34u$801$>...
    > > Brian K. Michalk wrote:
    > >
    > > > I have a perl app that is calculating the standard deviation of a 4000
    > > > element 16 bit integer array, that has large dynamic content. I.e,
    > > > the range spans a significant portion of the 16 bits.
    > > >
    > > > I am trying to increase the performance of this critical loop, and
    > > > I've found that I am exceeding the 32 bit registers causing Perl to
    > > > switch to an infinite precision math library. I've rewritten the loop
    > > > such that I now have only one term that overflows 32 bits.

    > >
    > > Sorry, but that's bogus:
    > >
    > > The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
    > > cannot have more than 16+12 = 28 bits.
    > >
    > > There must be some other problem.
    > >
    > > Terje

    >
    > Uhh, you are right. Problem in my example. I am indeed working with
    > addition of 32 bit integers, the A/D data is still 16 bits. The loop
    > is perhaps a little too complicated here, but it is a trace of 4000
    > points. I have a 400 point averaging (high pass) window that iterates
    > over the entire 4000 element trace. My goal is to find an A/D
    > amplitude(alpha) such that 5% of the points in the averaging window
    > are above alpha.
    >
    > Brute force works just fine, except that it's a really bad choice. If
    > I take the standard deviation of the moving window, and then apply the
    > z-transform for 5%, out pops the alpha that I need to filter the data.
    > It's an order(N) filter, and is blazingly fast for low amplitude
    > signals, because they will stay within 32 bit operations. For large
    > deviations, the math exceeds 32 bits. Perhaps there is another
    > "almost as good" method for calculating standard deviation?
    >
    > I'm using(integer math):
    > $std = sqrt(
    > ($n*$sumsq)-($sum**2))/
    > ($n**2)
    > )
    > Which I have reduced in the interest of overflows to:
    > $std = sqrt(
    > ($sumsq/$n)-(($sum/$n)**2)
    > )
    >
    > $sum = 0;
    > $sumsq = 0;
    > &some_routine_that_primes_filter_here(); # prime sum and sumsq
    > # now iterate ofer the trace
    > for ($i=0; $i<3950; $i++) {
    > $sum += $ary[$i];
    > $sum -= $ary[$i+50];
    > $sumsq += ($ary[$i]**2);
    > $sumsq -= ($ary[$i+50]**2);
    > # calculate standard deviation here
    > # take the z-transform here
    > # use the result to filter here
    > }
    >
    > But, my question still stands on the MMX issue. Will MMX registers do
    > this? It's more of an assembler question, but if theres an
    > alternative that will keep the data in 32 bits I'm interested in that
    > as well.
    >
    >
     
    Matt Taylor, Sep 20, 2003
    #6
  7. Brian K. Michalk wrote:

    > Terje Mathisen <> wrote in message news:<bkh34u$801$>...
    >>Sorry, but that's bogus:
    >>
    >>The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
    >>cannot have more than 16+12 = 28 bits.
    >>
    >>There must be some other problem.
    >>

    >
    > Uhh, you are right. Problem in my example. I am indeed working with
    > addition of 32 bit integers, the A/D data is still 16 bits. The loop
    > is perhaps a little too complicated here, but it is a trace of 4000
    > points. I have a 400 point averaging (high pass) window that iterates
    > over the entire 4000 element trace. My goal is to find an A/D
    > amplitude(alpha) such that 5% of the points in the averaging window
    > are above alpha.
    >
    > Brute force works just fine, except that it's a really bad choice. If
    > I take the standard deviation of the moving window, and then apply the
    > z-transform for 5%, out pops the alpha that I need to filter the data.
    > It's an order(N) filter, and is blazingly fast for low amplitude
    > signals, because they will stay within 32 bit operations. For large
    > deviations, the math exceeds 32 bits. Perhaps there is another
    > "almost as good" method for calculating standard deviation?
    >
    > I'm using(integer math):
    > $std = sqrt(
    > ($n*$sumsq)-($sum**2))/
    > ($n**2)
    > )
    > Which I have reduced in the interest of overflows to:
    > $std = sqrt(
    > ($sumsq/$n)-(($sum/$n)**2)
    > )
    >
    > $sum = 0;
    > $sumsq = 0;
    > &some_routine_that_primes_filter_here(); # prime sum and sumsq
    > # now iterate ofer the trace
    > for ($i=0; $i<3950; $i++) {
    > $sum += $ary[$i];
    > $sum -= $ary[$i+50];
    > $sumsq += ($ary[$i]**2);
    > $sumsq -= ($ary[$i+50]**2);
    > # calculate standard deviation here
    > # take the z-transform here
    > # use the result to filter here
    > }
    >
    > But, my question still stands on the MMX issue. Will MMX registers do
    > this? It's more of an assembler question, but if theres an
    > alternative that will keep the data in 32 bits I'm interested in that
    > as well.


    Since you're posting to c.l.a.x86, you're really wanting high speed
    here, right?

    Your inner loop seems kind of strange: You are adding in $ary[$i] and
    subtracting $ary[$i+50], instead of the opposite: Is this intentional?

    Assuming the logic is correct, it should easily run in about 10-15
    cycles/iteration using integer registers on a PIII, and about the same
    speed or a little slower with fp on a P4.

    With just 4000 table entries, I'd do the square calculation only once,
    instead of twice, which makes it easy to overlap multiple iterations to
    hide the latency of the multiplications.

    The 64-bit intermediate values with integer ops would be held in a pair
    of registers, while fp just works.

    Terje
    --
    - <>
    "almost all programming can be viewed as an exercise in caching"
     
    Terje Mathisen, Sep 20, 2003
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Roedy Green

    Accumulate 1.0

    Roedy Green, Jul 17, 2004, in forum: Java
    Replies:
    0
    Views:
    384
    Roedy Green
    Jul 17, 2004
  2. Soulblazer

    need help with mmx (vc++)

    Soulblazer, Jul 31, 2004, in forum: C++
    Replies:
    1
    Views:
    384
    Victor Bazarov
    Jul 31, 2004
  3. Tom Lynch
    Replies:
    2
    Views:
    2,039
    Tom Lynch
    Oct 12, 2007
  4. dcharno
    Replies:
    1
    Views:
    276
    Lawrence D'Oliveiro
    Oct 15, 2008
  5. Szyk
    Replies:
    1
    Views:
    821
    Victor Bazarov
    Oct 29, 2011
Loading...

Share This Page