64 bit add and accumulate with MMX

Brian K. Michalk · Sep 20, 2003

I have a perl app that is calculating the standard deviation of a 4000
element 16 bit integer array, that has large dynamic content. I.e,
the range spans a significant portion of the 16 bits.

I am trying to increase the performance of this critical loop, and
I've found that I am exceeding the 32 bit registers causing Perl to
switch to an infinite precision math library. I've rewritten the loop
such that I now have only one term that overflows 32 bits.

I am at the point of inlining assembler to speed up this loop.
However, the MMX terminology is confusing. I see there are 64bit MMX
registers that can do multiply and accumulate, but I don't see any
64bit add commands. My loop is something like this:
for (i=0; i<4000; i++) { sum += array; }
Of course my loop is slightly bigger than this, but even this toy
example has terrible performance on real data when sum exceeds 32
bits.

Before I go into the time of inlining C (and assembler), are the MMX
registers cabable of accumulating a 64 bit register with adding a
16(or 32) bit operand?

By the way, I've already tried the GMP library, and it did not help.
I also recompiled everything with the MMX optimizations turned on. Oh
yeah, this is on Linux.

Terje Mathisen · Sep 20, 2003

Brian said:
I have a perl app that is calculating the standard deviation of a 4000
element 16 bit integer array, that has large dynamic content. I.e,
the range spans a significant portion of the 16 bits.

I am trying to increase the performance of this critical loop, and
I've found that I am exceeding the 32 bit registers causing Perl to
switch to an infinite precision math library. I've rewritten the loop
such that I now have only one term that overflows 32 bits.

Sorry, but that's bogus:

The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
cannot have more than 16+12 = 28 bits.

There must be some other problem.

Terje

Ian McConnell · Sep 20, 2003

I have a perl app that is calculating the standard deviation of a 4000
element 16 bit integer array, that has large dynamic content. I.e,
the range spans a significant portion of the 16 bits.

I am trying to increase the performance of this critical loop, and
I've found that I am exceeding the 32 bit registers causing Perl to
switch to an infinite precision math library. I've rewritten the loop
such that I now have only one term that overflows 32 bits.

I am at the point of inlining assembler to speed up this loop.
However, the MMX terminology is confusing. I see there are 64bit MMX
registers that can do multiply and accumulate, but I don't see any
64bit add commands. My loop is something like this:
for (i=0; i<4000; i++) { sum += array; }
Of course my loop is slightly bigger than this, but even this toy
example has terrible performance on real data when sum exceeds 32
bits.

Before I go into the time of inlining C (and assembler), are the MMX
registers cabable of accumulating a 64 bit register with adding a
16(or 32) bit operand?

By the way, I've already tried the GMP library, and it did not help.
I also recompiled everything with the MMX optimizations turned on. Oh
yeah, this is on Linux.

Sounds like you need a better algorithm for calculating the standard
deviation. The Welford update formula updates the mean (which can be 16
bits) rather than the usual formula of sum/n (which needs to be 16+log2(n)
bits)

For example, have a look at this code

From
http://www.cs.trinity.edu/~joldham0/1321/lectures/hashing/mean-var-Welford.pl

#! /usr/bin/perl -w
##
## Oldham, Jeffrey D.
## 2000Mar13
## util
##
## Compute the Mean and Variance Using Welford's Formula

## Compute the mean and variance of a stream of floating-point
## numbers. The numbers should appear at the beginning of each line.
$mean = 0.0;
$variance = 0.0;
$count = 0;

# Compute the mean and maximum.
while (<ARGV>) {
($number, $_) = split;
++$count;
$diff = $number - $mean;
$mean += $diff / $count;
$variance += $diff * $diff * ($count -1) / $count;
}
$variance /= $count;

printf("The mean is %g.\n", $mean);
printf("The variance is %g.\n", $variance);
printf("The standard deviation is %g.\n", sqrt($variance));

Click to expand...

Also if you're desperate for speed then, since the standard deviation is
only an estimate, you can double the speed of your code by calculating the
standard deviation of every other number (ie use 2000 of them) and it will
be pretty good estimate. You could probably just use a few hundred samples
and still get a very good approximation.

P.S.
If you do code this algorithm in MMX, please post it here as more people
should understand how to calculate means and SDs with this more accurate
formula.

Brian K. Michalk · Sep 20, 2003

Terje Mathisen said:
Sorry, but that's bogus:

The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
cannot have more than 16+12 = 28 bits.

There must be some other problem.

Terje

Uhh, you are right. Problem in my example. I am indeed working with
addition of 32 bit integers, the A/D data is still 16 bits. The loop
is perhaps a little too complicated here, but it is a trace of 4000
points. I have a 400 point averaging (high pass) window that iterates
over the entire 4000 element trace. My goal is to find an A/D
amplitude(alpha) such that 5% of the points in the averaging window
are above alpha.

Brute force works just fine, except that it's a really bad choice. If
I take the standard deviation of the moving window, and then apply the
z-transform for 5%, out pops the alpha that I need to filter the data.
It's an order(N) filter, and is blazingly fast for low amplitude
signals, because they will stay within 32 bit operations. For large
deviations, the math exceeds 32 bits. Perhaps there is another
"almost as good" method for calculating standard deviation?

I'm using(integer math):
$std = sqrt(
($n*$sumsq)-($sum**2))/
($n**2)
)
Which I have reduced in the interest of overflows to:
$std = sqrt(
($sumsq/$n)-(($sum/$n)**2)
)

$sum = 0;
$sumsq = 0;
&some_routine_that_primes_filter_here(); # prime sum and sumsq
# now iterate ofer the trace
for ($i=0; $i<3950; $i++) {
$sum += $ary[$i];
$sum -= $ary[$i+50];
$sumsq += ($ary[$i]**2);
$sumsq -= ($ary[$i+50]**2);
# calculate standard deviation here
# take the z-transform here
# use the result to filter here
}

But, my question still stands on the MMX issue. Will MMX registers do
this? It's more of an assembler question, but if theres an
alternative that will keep the data in 32 bits I'm interested in that
as well.

Dr John Stockton · Sep 20, 2003

JRS: In article <[email protected]>, seen in

news:comp.lang.asm.x86 said:
Sorry, but that's bogus:

The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
cannot have more than 16+12 = 28 bits.

There must be some other problem.

Standard Deviation is Root-Mean-Square deviation, corrected for
statistics. Therefore 2*16 + 12, = 44 bits. The 16 is actually 15,
because the sign does not matter in squaring.

The standard deviation of the data is exactly defined. But if it is to
be used as an estimate of the SD of an infinite population from which
the 4000 were drawn, then it has an uncertainty of about 1 in root 4000,
say 1 in 64. In that case, a small additional systematic/random error
may be tolerable.

The SD will be dominated by the larger entries. Take, therefore, 256
times the SD of the upper bytes (taken as -128..127) of the data,
possibly rounding in the MSB of the lower byte. We now have less than
2*8 + 12, <32, bits.

Needs :
(a) Test it; emulation in a convenient language will do. If plausible,
(b) consult a professional statistician or two.

Matt Taylor · Sep 20, 2003

MMX does not support 64-bit additions. SSE 2 adds the paddq instruction
which can do 64-bit addition with MMX registers, but it is only available on
Pentium 4 and Opteron/Athlon64.

You can do very fast CPU-neutral 64-bit arithmetic with add/adc and sub/sbb.
If Perl is compiled for 64-bits, it should already be capable of this,
though I'm not sure if that means it would use 64-bit arithmetic for
everything or just things that need it. Certainly multiply, divide, and
square root are non-trivial for MMX.

If you were to write this routine in assembly, then I would accumulate the
integers using general registers and use the x87 with 80-bit precision for a
"perfect" square root. Any C compiler that supports long long should be able
to do the first half, and any C compiler that uses 10-byte long doubles can
do the latter half.

-Matt

Brian K. Michalk said:
Terje Mathisen <[email protected]> wrote in message

Sorry, but that's bogus:

The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
cannot have more than 16+12 = 28 bits.

There must be some other problem.

Terje

Click to expand...

Uhh, you are right. Problem in my example. I am indeed working with
addition of 32 bit integers, the A/D data is still 16 bits. The loop
is perhaps a little too complicated here, but it is a trace of 4000
points. I have a 400 point averaging (high pass) window that iterates
over the entire 4000 element trace. My goal is to find an A/D
amplitude(alpha) such that 5% of the points in the averaging window
are above alpha.

Brute force works just fine, except that it's a really bad choice. If
I take the standard deviation of the moving window, and then apply the
z-transform for 5%, out pops the alpha that I need to filter the data.
It's an order(N) filter, and is blazingly fast for low amplitude
signals, because they will stay within 32 bit operations. For large
deviations, the math exceeds 32 bits. Perhaps there is another
"almost as good" method for calculating standard deviation?

I'm using(integer math):
$std = sqrt(
($n*$sumsq)-($sum**2))/
($n**2)
)
Which I have reduced in the interest of overflows to:
$std = sqrt(
($sumsq/$n)-(($sum/$n)**2)
)

$sum = 0;
$sumsq = 0;
&some_routine_that_primes_filter_here(); # prime sum and sumsq
# now iterate ofer the trace
for ($i=0; $i<3950; $i++) {
$sum += $ary[$i];
$sum -= $ary[$i+50];
$sumsq += ($ary[$i]**2);
$sumsq -= ($ary[$i+50]**2);
# calculate standard deviation here
# take the z-transform here
# use the result to filter here
}

But, my question still stands on the MMX issue. Will MMX registers do
this? It's more of an assembler question, but if theres an
alternative that will keep the data in 32 bits I'm interested in that
as well.

Terje Mathisen · Sep 20, 2003

Brian said:
Terje Mathisen said:

Sorry, but that's bogus:

The sum of 4000 (which is less than 2^12) elements, each of <= 16 bits,
cannot have more than 16+12 = 28 bits.

There must be some other problem.

Click to expand...

Uhh, you are right. Problem in my example. I am indeed working with
addition of 32 bit integers, the A/D data is still 16 bits. The loop
is perhaps a little too complicated here, but it is a trace of 4000
points. I have a 400 point averaging (high pass) window that iterates
over the entire 4000 element trace. My goal is to find an A/D
amplitude(alpha) such that 5% of the points in the averaging window
are above alpha.

Brute force works just fine, except that it's a really bad choice. If
I take the standard deviation of the moving window, and then apply the
z-transform for 5%, out pops the alpha that I need to filter the data.
It's an order(N) filter, and is blazingly fast for low amplitude
signals, because they will stay within 32 bit operations. For large
deviations, the math exceeds 32 bits. Perhaps there is another
"almost as good" method for calculating standard deviation?

I'm using(integer math):
$std = sqrt(
($n*$sumsq)-($sum**2))/
($n**2)
)
Which I have reduced in the interest of overflows to:
$std = sqrt(
($sumsq/$n)-(($sum/$n)**2)
)

$sum = 0;
$sumsq = 0;
&some_routine_that_primes_filter_here(); # prime sum and sumsq
# now iterate ofer the trace
for ($i=0; $i<3950; $i++) {
$sum += $ary[$i];
$sum -= $ary[$i+50];
$sumsq += ($ary[$i]**2);
$sumsq -= ($ary[$i+50]**2);
# calculate standard deviation here
# take the z-transform here
# use the result to filter here
}

But, my question still stands on the MMX issue. Will MMX registers do
this? It's more of an assembler question, but if theres an
alternative that will keep the data in 32 bits I'm interested in that
as well.

Since you're posting to c.l.a.x86, you're really wanting high speed
here, right?

Your inner loop seems kind of strange: You are adding in $ary[$i] and
subtracting $ary[$i+50], instead of the opposite: Is this intentional?

Assuming the logic is correct, it should easily run in about 10-15
cycles/iteration using integer registers on a PIII, and about the same
speed or a little slower with fp on a P4.

With just 4000 table entries, I'd do the square calculation only once,
instead of twice, which makes it easy to overlap multiple iterations to
hide the latency of the multiplications.

The 64-bit intermediate values with integer ops would be held in a pair
of registers, while fp just works.

Terje

32/64 bit cc differences	110	Jan 10, 2014
64-bit Python for Solaris	0	May 21, 2013
ruby + Qt + Windows + 64 bit	0	Jan 24, 2014
py2exe and 64/32 bit windows	3	Apr 9, 2013
64 bit design	2	Oct 15, 2010
64-bit KISS RNGs	23	Feb 28, 2009
64-bit gotcha	13	Dec 3, 2011
EXCEPTION_ACCESS_VIOLATION on Windows 7 64 bit OS	2	Jun 3, 2010

64 bit add and accumulate with MMX

Brian K. Michalk

Terje Mathisen

Ian McConnell

Brian K. Michalk

Dr John Stockton

Matt Taylor

Terje Mathisen

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads