Computing a distance matrix using SSE2


Stefano Teso

Hi,

First of all, I don't know if this is the right group to post to; sorry
if it is not. Any suggestion on the right place for this kind of
discussion is warmly welcome.

I am writing a stochastic optimization algorithm which, as part of the
energy calculation procedure, requires computing the squared distance
between a fixed 3D vector and each vector in an array of n other 3D
vectors. This is the most computationally intensive part of the
algorithm, so I wrote it using SSE2 intrinsics as defined in
<emmintrin.h>, using GCC 4.4. The vector array is stored in SoA format,
with all components (x, y, z) aligned to the 16-byte boundary, and the
same goes for the distance array.

The current code looks basically as follows:

static void
calc_distance_row (double *darray, const double *xarray,
                   const double *yarray, const double *zarray,
                   const int i, const int lo, const int hi)
{
    /* computes the *square* of the distance between a fixed vector i and
       all vectors j in the range [lo, hi), whose coords are specified in
       xarray, yarray, zarray */

    /* NOTE: dist2 is a rather obvious C function that uses no intrinsics */

    int j, lo_aligned, hi_aligned;
    __m128d ix, iy, iz, jx, jy, jz, dx, dy, dz, mx, my, mz, s0, s1;
    const double *x, *y, *z;
    double *d;

    lo_aligned = (lo & ~3);
    hi_aligned = (hi & ~3);

    for (j = lo; j < lo_aligned; j++)
        darray[j] = dist2 (xarray[i], yarray[i], zarray[i],
                           xarray[j], yarray[j], zarray[j]);

    x = &xarray[lo_aligned];
    y = &yarray[lo_aligned];
    z = &zarray[lo_aligned];
    d = &darray[lo_aligned];

    ix = _mm_load1_pd (&xarray[i]);
    iy = _mm_load1_pd (&yarray[i]);
    iz = _mm_load1_pd (&zarray[i]);

    for (j = lo_aligned; j < hi_aligned; j += 4) {

        jx = _mm_load_pd (&x[0]);
        jy = _mm_load_pd (&y[0]);
        jz = _mm_load_pd (&z[0]);

        dx = _mm_sub_pd (ix, jx);
        dy = _mm_sub_pd (iy, iy);
        dz = _mm_sub_pd (iz, iz);

        mx = _mm_mul_pd (dx, dx);
        my = _mm_mul_pd (dy, dy);
        mz = _mm_mul_pd (dz, dz);

        s0 = _mm_add_pd (mx, my);
        s1 = _mm_add_pd (s0, mz);

        _mm_store_pd (&d[0], s1);

        /* the above chunk of code is replicated once with '2' in
           place of '0' in the array indices */

        x = &x[4];
        y = &y[4];
        z = &z[4];
        d = &d[4];
    }

    for (j = hi_aligned; j < hi; j++)
        darray[j] = dist2 (xarray[i], yarray[i], zarray[i],
                           xarray[j], yarray[j], zarray[j]);
}

(There may be typos in the above routine; in any case, the real code was
tested against a simple-minded C version and proved to work as expected,
outperforming it by about 2x.)

The first and third loops are used to handle non-unrollable cases,
while the second loop -- the only one using SSE2 intrinsics -- is
unrolled twice, for a whopping 4 square distances computed per
iteration. The average array size is 100 to 1000 vectors, so no huge
number crunching here.

Is this the best way to perform this operation? Is there any obvious
pitfall in the above code? In particular, how do I check whether I am
incurring unaligned access penalties, and if that is the case, how do
I avoid them? Moreover, is there any way I could improve the
performance by better cache usage / prefetching? I already tried
unrolling the middle loop more, but it brought no improvement.

I would like to squeeze as much performance as possible from this
routine. Any suggestion is appreciated.

Thanks in advance,
 

Hendrik van der Heijden

Stefano said:
> I would like to squeeze as much performance as possible from this
> routine. Any suggestion is appreciated.

> dy = _mm_sub_pd (iy, iy);
> dz = _mm_sub_pd (iz, iz);

^ jy, jz ?!

Assembler output would have been nice.

The biggest improvement I can think of would be switching
from doubles to floats, completely or just for the distance
calculation. Do you need double precision?
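For illustration, a single-precision inner loop could look roughly like
this (only a sketch, assuming the SoA arrays were converted to float,
kept 16-byte aligned, and padded so that lo and hi are multiples of 4;
the names just mirror your post):

#include <xmmintrin.h>

static void
calc_dist_row_ps (float *d, const float *x, const float *y,
                  const float *z, int i, int lo, int hi)
{
    __m128 ix = _mm_set1_ps (x[i]);
    __m128 iy = _mm_set1_ps (y[i]);
    __m128 iz = _mm_set1_ps (z[i]);
    int j;

    /* 4 squared distances per iteration instead of 2 */
    for (j = lo; j < hi; j += 4) {
        __m128 dx = _mm_sub_ps (_mm_load_ps (&x[j]), ix);
        __m128 dy = _mm_sub_ps (_mm_load_ps (&y[j]), iy);
        __m128 dz = _mm_sub_ps (_mm_load_ps (&z[j]), iz);

        _mm_store_ps (&d[j],
                      _mm_add_ps (_mm_add_ps (_mm_mul_ps (dx, dx),
                                              _mm_mul_ps (dy, dy)),
                                  _mm_mul_ps (dz, dz)));
    }
}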

A small thing: by explicitly loading into temporaries with _mm_load_pd,
you may keep the compiler from using a memory operand instruction.
The _mm_sub_pd ops could take a memory operand instead of jx, jy, jz.
Try dx = _mm_sub_pd(ix, *(__m128d *)(&x[0])).

Another one: If you do pointer arithmetic like x = &x[4], you
have 5 running variables (x,y,z,d,hi_aligned-j) in the loop
instead of 1 or 2.


Hendrik vdH
 

Stefano Teso

Hi,

just a quick update. The code posted is indeed bugged, since lo_aligned
is always <= lo. I managed to eliminate the first and last loops by
allowing the computation to go lower than lo (down to lo_aligned) and
higher than hi (up to the newly defined hi_aligned = (hi + 3) & ~3),
and by adding extra padding to the distance and coord arrays to handle
the otherwise out-of-bounds memory accesses when j > hi. The new loop
looks like:

lo_aligned = lo & ~3;
hi_aligned = (hi + 3) & ~3;

for (j = lo_aligned; j < hi_aligned; j += 4) {
    <distance code>
}
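For reference, the padded allocation now looks roughly like this (a
sketch, not the exact code; n is the number of vectors):

/* round the element count up to a multiple of 4 so the vector loop can
   run past hi without touching memory it doesn't own; the return values
   of posix_memalign should of course be checked */
int n_padded = (n + 3) & ~3;
double *xarray, *yarray, *zarray, *darray;

posix_memalign ((void **) &xarray, 16, n_padded * sizeof (double));
posix_memalign ((void **) &yarray, 16, n_padded * sizeof (double));
posix_memalign ((void **) &zarray, 16, n_padded * sizeof (double));
posix_memalign ((void **) &darray, 16, n_padded * sizeof (double));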

Any suggestion on how to improve the code is still very welcome.

Thanks,
 

user923005

> First of all, I don't know if this is the right group to post to; sorry
> if it is not. Any suggestion on the right place for this kind of
> discussion is warmly welcome.
> [...]
> I would like to squeeze as much performance as possible from this
> routine. Any suggestion is appreciated.

You want for questions about numerical
algorithms.
 

tni

Stefano said:
> for (j = lo_aligned; j < hi_aligned; j += 4) {
>
>     jx = _mm_load_pd (&x[0]);
>     jy = _mm_load_pd (&y[0]);
>     jz = _mm_load_pd (&z[0]);
>
>     dx = _mm_sub_pd (ix, jx);
>     dy = _mm_sub_pd (iy, iy);
>     dz = _mm_sub_pd (iz, iz);

Don't do that, the load isn't necessary. Use proper indexing for the
arrays. (Get rid of x, y, z.)

for (j = lo_aligned; j < hi_aligned; j += 2) { // 2 with no unrolling
    dx = _mm_sub_pd(ix, *(__m128d *)&xarray[j]);
    ....

You did notice your bug (dy and dz are 0), right? I wouldn't be
surprised if most of your loop was optimized away because of that, so
your performance numbers might be complete nonsense.
> mx = _mm_mul_pd (dx, dx);
> my = _mm_mul_pd (dy, dy);
> mz = _mm_mul_pd (dz, dz);
>
> s0 = _mm_add_pd (mx, my);
> s1 = _mm_add_pd (s0, mz);

Put all that stuff into a single statement (so that the compiler won't
be tempted to do unnecessary loads/stores):

s = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
               _mm_mul_pd(dz, dz));

> _mm_store_pd (&d[0], s1);

If you don't need the result right away, do a non-temporal store:
_mm_stream_pd(d + j, s);

Actually, get rid of s:
_mm_stream_pd(d + j, _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx),
_mm_mul_pd(dy, dy)), _mm_mul_pd(dz, dz)));
> /* the above chunk of code is replicated once with '2' in
>    place of '0' in the array indices */
>
> x = &x[4];
> y = &y[4];
> z = &z[4];
> d = &d[4];

Use proper array indexing. GCC is reasonably intelligent and will likely
make use of the complex addressing modes - no need for 4 pointer
increments.
> The first and third loops are used to handle non-unrollable cases,
> while the second loop -- the only one using SSE2 intrinsics -- is
> unrolled twice, for a whopping 4 square distances computed per
> iteration.

Loop unrolling will likely not buy you anything. On some CPUs, it may
even hurt performance (they have optimizations for small loops).
> Is this the best way to perform this operation? Is there any obvious
> pitfall in the above code? In particular, how do I check whether I am
> incurring unaligned access penalties, and if that is the case, how do
> I avoid them?

If there is an unaligned access, your program will simply crash.
> Moreover, is there any way I could improve the
> performance by better cache usage / prefetching?

It's unlikely that prefetching will buy you anything (it may even hurt
performance).

If you control the memory allocation, you should allocate aligned memory
(_mm_malloc, _mm_free) in the first place.
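For example (just a sketch; with GCC the _mm_malloc/_mm_free
declarations come in through the <mm_malloc.h> header pulled in by
<emmintrin.h>, and the names here are only illustrative):

double *x = _mm_malloc (n * sizeof (double), 16);
double *y = _mm_malloc (n * sizeof (double), 16);
double *z = _mm_malloc (n * sizeof (double), 16);
double *d = _mm_malloc (n * sizeof (double), 16);

/* ... check for NULL, use the arrays ... */

_mm_free (x);
_mm_free (y);
_mm_free (z);
_mm_free (d);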

You should really post the assembly. Disassemble your code with objdump.
 

sfuerst

One trick is to notice that |x-y|^2 == |x|^2 + |y|^2 - 2 x.y

This means that if you can pre-process your data, then you can remove
the subtractions from the inner loop and shrink the critical path.

Simply construct the dot product of x and (-2y), and then add the
result to the saved values of |x|^2 and |y|^2. There are extra
additions, yes, but they may be done in parallel with the multiplies.

Since you only have on the order of a thousand points, the data will
fit in your cache. The extra |x|^2 values hopefully won't add too much
extra memory pressure.

Of course, you probably should benchmark this - sometimes the "obvious"
way is faster.
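In scalar form, just to make the algebra concrete (a rough sketch under
my assumptions, not tuned code; the names are made up):

/* |p_i - p_j|^2 = |p_i|^2 + |p_j|^2 - 2 p_i.p_j
   norm2[] holds the precomputed squared norms of all points */
static void precompute_norm2 (double *norm2, const double *x,
                              const double *y, const double *z, int n)
{
    int j;
    for (j = 0; j < n; j++)
        norm2[j] = x[j] * x[j] + y[j] * y[j] + z[j] * z[j];
}

static void dist_row_expanded (double *d, const double *x,
                               const double *y, const double *z,
                               const double *norm2, int i, int n)
{
    double mx = -2.0 * x[i], my = -2.0 * y[i], mz = -2.0 * z[i];
    int j;
    for (j = 0; j < n; j++)
        d[j] = norm2[i] + norm2[j] + mx * x[j] + my * y[j] + mz * z[j];
}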

Steven
 

Stefano Teso

> There is an aligned load and an unaligned load. The aligned load will
> crash if your data isn't aligned, so you will notice if your data
> isn't aligned. malloc () will typically return nicely aligned
> pointers.

I am using posix_memalign to allocate the coord data, so the
_mm_load_pd should never trigger any exceptions.
> You should worry mostly about making your code cache friendly when you
> call this multiple times. And depending on what you do with the
> results, there may be some mathematical ways to save operations.

Ok, but I don't know how, under GNU/Linux, to find by profiling the
code paths that trigger cache misses... so I cannot really judge
whether the code will be cache friendly or not.

Thanks for your reply.
 

Stefano Teso

Hi,

Stefano said:
> for (j = lo_aligned; j < hi_aligned; j += 4) {
>     jx = _mm_load_pd (&x[0]);
>     jy = _mm_load_pd (&y[0]);
>     jz = _mm_load_pd (&z[0]);
>     dx = _mm_sub_pd (ix, jx);
>     dy = _mm_sub_pd (iy, iy);
>     dz = _mm_sub_pd (iz, iz);
>
> Don't do that, the load isn't necessary. Use proper indexing for the
> arrays. (Get rid of x, y, z.)

Done.


> for (j = lo_aligned; j < hi_aligned; j += 2) { // 2 with no unrolling
>     dx = _mm_sub_pd(ix, *(__m128d *)&xarray[j]);
> ...

a) I removed the unrolling since it didn't improve the performance at
all (actually it didn't worsen it either, but why keep useless code in
the source?).

b) What is
> You did notice your bug (dy and dz are 0), right? I wouldn't be
> surprised if most of your loop was optimized away because of that, so
> your performance numbers might be complete nonsense.

The bug was introduced when posting the code; it was not present in
the original source.
> Put all that stuff into a single statement (so that the compiler won't
> be tempted to do unnecessary loads/stores):
>
> s = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
>                _mm_mul_pd(dz, dz));

I tried this and it doesn't make any difference, performance wise, and
the assembly output looks unaffected. The compiler was probably
already smart enough.
> _mm_store_pd (&d[0], s1);
>
> If you don't need the result right away, do a non-temporal store:
> _mm_stream_pd(d + j, s);
>
> Actually, get rid of s:
> _mm_stream_pd(d + j, _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx),
>     _mm_mul_pd(dy, dy)), _mm_mul_pd(dz, dz)));

Unfortunately I need the values of the distance array right after this
function is called, but thanks for the tip. Using it BTW slows down
the function by about 100 cycles.
> Loop unrolling will likely not buy you anything. On some CPUs, it may
> even hurt performance (they have optimizations for small loops).

I just checked and you are right; I re-rolled the loop.
> You should really post the assembly. Disassemble your code with objdump.

Here you go. C source code:

/* static */ void
calc_dist_row (const double *x,
               const double *y,
               const double *z,
               const int i,
               const int bandlo,
               const int bandhi)
{
    __m128d ix, iy, iz;
    __m128d dx, dy, dz;
    __m128d mx, my, mz;
    __m128d dd;
    int j, _bandlo, _bandhi;

    _bandlo = bandlo & ~1;
    _bandhi = bandhi;

    ix = _mm_load1_pd (&x[i]);
    iy = _mm_load1_pd (&y[i]);
    iz = _mm_load1_pd (&z[i]);

    for (j = _bandlo; j < _bandhi; j += 2) {

        dx = _mm_sub_pd (ix, _mm_load_pd (&x[j]));
        dy = _mm_sub_pd (iy, _mm_load_pd (&y[j]));
        dz = _mm_sub_pd (iz, _mm_load_pd (&z[j]));

        mx = _mm_mul_pd (dx, dx);
        my = _mm_mul_pd (dy, dy);
        mz = _mm_mul_pd (dz, dz);

        dd = _mm_sqrt_pd (_mm_add_pd (_mm_add_pd (mx, my), mz));

        _mm_store_pd (&dist_row[j], dd);
    }
}

And the assembly:

0000000000000a90 <calc_dist_row>:
a90: 48 63 c9 movsxd rcx,ecx
a93: 41 ff c1 inc r9d
a96: 41 83 e0 fe and r8d,0xfffffffffffffffe
a9a: 41 83 e1 fe and r9d,0xfffffffffffffffe
a9e: f2 0f 12 2c cf movddup xmm5,QWORD PTR [rdi+rcx*8]
aa3: f2 0f 12 24 ce movddup xmm4,QWORD PTR [rsi+rcx*8]
aa8: f2 0f 12 1c ca movddup xmm3,QWORD PTR [rdx+rcx*8]
aad: 45 39 c8 cmp r8d,r9d
ab0: 7d 67 jge b19 <calc_dist_row+0x89>
ab2: 49 63 c0 movsxd rax,r8d
ab5: 48 c1 e0 03 shl rax,0x3
ab9: 48 01 c7 add rdi,rax
abc: 48 01 c6 add rsi,rax
abf: 48 01 c2 add rdx,rax
ac2: 48 03 05 00 00 00 00 add rax,QWORD PTR [rip+0x0]
# ac9 <calc_dist_row+0x39>
ac9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
ad0: 66 0f 28 c5 movapd xmm0,xmm5
ad4: 66 0f 28 d4 movapd xmm2,xmm4
ad8: 66 0f 5c 07 subpd xmm0,XMMWORD PTR [rdi]
adc: 66 0f 5c 16 subpd xmm2,XMMWORD PTR [rsi]
ae0: 66 0f 28 cb movapd xmm1,xmm3
ae4: 66 0f 59 c0 mulpd xmm0,xmm0
ae8: 66 0f 5c 0a subpd xmm1,XMMWORD PTR [rdx]
aec: 66 0f 59 d2 mulpd xmm2,xmm2
af0: 66 0f 59 c9 mulpd xmm1,xmm1
af4: 66 0f 58 c2 addpd xmm0,xmm2
af8: 41 83 c0 02 add r8d,0x2
afc: 66 0f 58 c1 addpd xmm0,xmm1
b00: 48 83 c7 10 add rdi,0x10
b04: 66 0f 29 00 movapd XMMWORD PTR [rax],xmm0
b08: 48 83 c6 10 add rsi,0x10
b0c: 48 83 c2 10 add rdx,0x10
b10: 48 83 c0 10 add rax,0x10
b14: 45 39 c1 cmp r9d,r8d
b17: 7f b7 jg ad0 <calc_dist_row+0x40>
b19: f3 c3 repz ret
b1b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]

Thanks,
 

Stefano Teso

> dd = _mm_sqrt_pd (_mm_add_pd (_mm_add_pd (mx, my), mz));

Whoops! Ignore the square root here please, pasted the wrong code. The
assembly is the right one though.
 

sfuerst

After some benchmarking, it looks like this trick actually helps,
making the code 33% faster.

#include <emmintrin.h>

#define SIZE 1024

static double dist_row[SIZE];

__attribute__((noinline)) void calc_dist_row1(const double *x,
        const double *y, const double *z, int i, int bandlo, int bandhi)
{
    __m128d ix, iy, iz;
    __m128d dx, dy, dz;
    __m128d mx, my, mz;
    __m128d dd;
    int j, _bandlo, _bandhi;

    _bandlo = bandlo & ~1;
    _bandhi = bandhi;

    ix = _mm_load1_pd(&x[i]);
    iy = _mm_load1_pd(&y[i]);
    iz = _mm_load1_pd(&z[i]);

    for (j = _bandlo; j < _bandhi; j += 2)
    {
        dx = _mm_sub_pd(ix, _mm_load_pd(&x[j]));
        dy = _mm_sub_pd(iy, _mm_load_pd(&y[j]));
        dz = _mm_sub_pd(iz, _mm_load_pd(&z[j]));

        mx = _mm_mul_pd(dx, dx);
        my = _mm_mul_pd(dy, dy);
        mz = _mm_mul_pd(dz, dz);

        dd = _mm_add_pd(_mm_add_pd(mx, my), mz);

        _mm_store_pd(&dist_row[j], dd);
    }
}

__attribute__((noinline)) void calc_dist_row2(const double *x,
        const double *y, const double *z, int i, int bandlo, int bandhi)
{
    typedef double v2df __attribute__((vector_size(16)));

    v2df ix, iy, iz;
    v2df dx, dy, dz;
    v2df mx, my, mz;
    v2df dd;
    int j, _bandlo, _bandhi;

    _bandlo = bandlo & ~1;
    _bandhi = bandhi;

    ix = (v2df) {x[i], x[i]};
    iy = (v2df) {y[i], y[i]};
    iz = (v2df) {z[i], z[i]};

    for (j = _bandlo; j < _bandhi; j += 2)
    {
        dx = ix - *((v2df *) &x[j]);
        dy = iy - *((v2df *) &y[j]);
        dz = iz - *((v2df *) &z[j]);

        mx = dx * dx;
        my = dy * dy;
        mz = dz * dz;

        *((v2df *) &dist_row[j]) = mx + my + mz;
    }
}

/* Slower! */
__attribute__((noinline)) void calc_dist_row3(const double *x,
        const double *y, const double *z, const double *r, int i,
        int bandlo, int bandhi)
{
    typedef double v2df __attribute__((vector_size(16)));

    v2df ix, iy, iz, ir;
    v2df dx, dy, dz, rr;
    v2df mx, my, mz;
    v2df dd;
    v2df mtwo = {-2, -2};

    int j, _bandlo, _bandhi;

    _bandlo = bandlo & ~1;
    _bandhi = bandhi;

    ix = (v2df) {x[i], x[i]};
    iy = (v2df) {y[i], y[i]};
    iz = (v2df) {z[i], z[i]};
    ir = (v2df) {r[i], r[i]};

    ix *= mtwo;
    iy *= mtwo;
    iz *= mtwo;

    for (j = _bandlo; j < _bandhi; j += 2)
    {
        rr = *((v2df *) &r[j]) + ir;
        dx = *((v2df *) &x[j]) * ix;
        dy = *((v2df *) &y[j]) * iy;
        dz = *((v2df *) &z[j]) * iz;

        *((v2df *) &dist_row[j]) = dx + dy + dz + rr;
    }
}

int main(void)
{
    int i, j;

    double x[SIZE];
    double y[SIZE];
    double z[SIZE];
    double r[SIZE];

    for (i = 0; i < SIZE; i++)
    {
        x[i] = i;
        y[i] = i * i;
        z[i] = i * i - 137;
        r[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
    }

    for (j = 0; j < 1000; j++)
    {
        for (i = 0; i < 1000; i++)
        {
            /* 3.33s */
            //calc_dist_row1(x, y, z, i, 0, SIZE);

            /* 3.33s */
            //calc_dist_row2(x, y, z, i, 0, SIZE);

            /* 2.25s */
            calc_dist_row3(x, y, z, r, i, 0, SIZE);
        }
    }
}

The first subroutine is the original version (without the slow sqrt
call which otherwise dominates the timing). The __attribute__
((noinline)) is to prevent the compiler from optimizing the function
calls away. The second subroutine is the same algorithm translated
into gcc's vector types, and thus much easier to read. The final
subroutine uses the distance formula expansion to reduce the critical
path, and allow more calculation in parallel.

In gcc 4.4, the resulting times for the program to run are given in
the comments in main(). Note that gcc 4.3 optimizes less well, and
the third subroutine is actually slower than the first two... (Also,
note that you should obviously compile with -O3 to get the best times)


Here is the resulting asm for the inner loop:

Originally, it was:

L1:
movapd %xmm5,%xmm0
movapd %xmm4,%xmm2
movapd %xmm3,%xmm1
subpd (%rdi),%xmm0
add $0x10,%rdi
subpd (%rsi),%xmm2
add $0x10,%rsi
subpd (%rdx),%xmm1
add $0x10,%rdx
mulpd %xmm0,%xmm0
mulpd %xmm2,%xmm2
mulpd %xmm1,%xmm1
addpd %xmm2,%xmm0
addpd %xmm1,%xmm0
movapd %xmm0,(%rax)
add $0x10,%rax
cmp %rcx,%rax
jne L1

After the algorithm change:
L1:
movapd (%rdi),%xmm0
add $0x10,%rdi
movapd (%rsi),%xmm1
add $0x10,%rsi
mulpd %xmm4,%xmm0
mulpd %xmm3,%xmm1
addpd %xmm1,%xmm0
movapd (%rdx),%xmm1
add $0x10,%rdx
mulpd %xmm2,%xmm1
addpd %xmm1,%xmm0
movapd (%r8),%xmm1
add $0x10,%r8
addpd %xmm5,%xmm1
addpd %xmm1,%xmm0
movapd %xmm0,(%r10)
add $0x10,%r10
cmp %rax,%r8
jne L1


Steven
 

sfuerst

Ah - looks like nearly all of the speedup of the algorithm change can
be gotten by a simple change.

Just change the lines
dx = _mm_sub_pd(ix, _mm_load_pd(&x[j]));
dy = _mm_sub_pd(iy, _mm_load_pd(&y[j]));
dz = _mm_sub_pd(iz, _mm_load_pd(&z[j]));

into
dx = _mm_sub_pd(_mm_load_pd(&x[j]), ix);
dy = _mm_sub_pd(_mm_load_pd(&y[j]), iy);
dz = _mm_sub_pd(_mm_load_pd(&z[j]), iz);

(x-y)**2 == (y-x)**2

This causes gcc to emit much better code, with the benchmark program
in the previous post taking 2.33s to run.

In case anyone is interested, the new inner loop is:
L1:
movapd (%rdi),%xmm0
add $0x10,%rdi
movapd (%rsi),%xmm2
add $0x10,%rsi
subpd %xmm5,%xmm0
movapd (%rdx),%xmm1
add $0x10,%rdx
subpd %xmm4,%xmm2
subpd %xmm3,%xmm1
mulpd %xmm0,%xmm0
mulpd %xmm2,%xmm2
mulpd %xmm1,%xmm1
addpd %xmm2,%xmm0
addpd %xmm1,%xmm0
movapd %xmm0,(%rax)
add $0x10,%rax
cmp %rcx,%rax
jne L1

Steven
 

tni

Stefano said:
> I tried this and it doesn't make any difference, performance wise, and
> the assembly output looks unaffected. The compiler was probably
> already smart enough.

If you ever use Visual Studio, it can make a difference in some cases.
The register allocator often gets confused when temporary values are
there (and reloads stuff).

BTW, the assembly code looks reasonable, it's unlikely there is more to
be gained. Not sure why GCC isn't using complex addressing, but most
current CPUs can do something like 3 adds per clock, so it likely
doesn't really matter.

(If you really do a sqrt in the loop, it's going to dominate the timing
anyway.)

VS does use complex addressing in this case:

0000000140001070 movapd xmm1,xmm3
0000000140001074 movapd xmm2,xmm4
0000000140001078 movapd xmm0,xmm5
000000014000107C add r11,10h
0000000140001080 sub rdx,1
0000000140001084 subpd xmm2,xmmword ptr [r11-10h]
000000014000108A subpd xmm0,xmmword ptr [r10+r11-10h]
0000000140001091 subpd xmm1,xmmword ptr [r9+r11-10h]
0000000140001098 mulpd xmm2,xmm2
000000014000109C mulpd xmm0,xmm0
00000001400010A0 mulpd xmm1,xmm1
00000001400010A4 addpd xmm2,xmm1
00000001400010A8 addpd xmm2,xmm0
00000001400010AC sqrtpd xmm0,xmm2
00000001400010B0 movapd xmmword ptr [rax+r11-10h],xmm0
00000001400010B7 jne main+70h (140001070h)
 

Stefano Teso

Hi,
> After some benchmarking, it looks like this trick actually helps,
> making the code 33% faster.

Thanks for the tip and the code!
> int main(void)
> {
> [...]
>     for (j = 0; j < 1000; j++)
>     {
>         for (i = 0; i < 1000; i++)
>         {
>             /* 3.33s */
>             //calc_dist_row1(x, y, z, i, 0, SIZE);
>
>             /* 3.33s */
>             //calc_dist_row2(x, y, z, i, 0, SIZE);
>
>             /* 2.25s */
>             calc_dist_row3(x, y, z, r, i, 0, SIZE);
>         }
>     }
> }


Does the timing you reported refer to the whole preprocessing + core
loop thing or only to the core loop? Because the data I compute the
distance on is generated dynamically and changes at every call to my
subroutine (actually the subroutine is called *only* if the data
changed, for efficiency), so I would have to preprocess the data each
time.
> The first subroutine is the original version (without the slow sqrt
> call which otherwise dominates the timing).

I discovered that calling the sqrt in this routine instead of
scattering sqrt () calls in other parts of the code actually improves
the global performance, mainly because _mm_sqrt_pd can handle 2 square
roots at a time (and with the switch to single precision, that goes up
to 4) while sqrt () handles only one. So I think I'll stick with the
_mm_sqrt_pd () call here...

Regards,
 

sfuerst

> Does the timing you reported refer to the whole preprocessing + core
> loop thing or only to the core loop? Because the data I compute the
> distance on is generated dynamically and changes at every call to my
> subroutine (actually the subroutine is called *only* if the data
> changed, for efficiency), so I would have to preprocess the data each
> time.
>
> I discovered that calling the sqrt in this routine instead of
> scattering sqrt () calls in other parts of the code actually improves
> the global performance, mainly because _mm_sqrt_pd can handle 2 square
> roots at a time (and with the switch to single precision, that goes up
> to 4) while sqrt () handles only one. So I think I'll stick with the
> _mm_sqrt_pd () call here...

The timing refers to the whole program run... however the
preprocessing only occurs once, whilst the squared distance routine is
called many times in that program. If you are constantly changing the
distance array, then the preprocessing cost will make that technique a
loser. (It all depends on exactly how many times you re-use a
point... so some benchmarking is in order.)

Note that you can get a 30% speedup from just reordering your
subtractions though. That is nearly all the improvement that the
preprocessed algorithm can give, without any of the drawbacks.

Another trick is to notice that often you don't actually want the
distance... you want 1/distance for normalization etc. Calculating
the reciprocal square root is faster than getting the square root, so
changing the routine to calculate 1/distance, and converting divisions
scattered about your codebase into multiplications might be a net win.
A quick search online will show you many optimized rsqrt() routines.

Steven
 

Stefano Teso

> The timing refers to the whole program run... however the
> preprocessing only occurs once, whilst the squared distance routine is
> called many times in that program. If you are constantly changing the
> distance array, then the preprocessing cost will make that technique a
> loser. (It all depends on exactly how many times you re-use a
> point... so some benchmarking is in order.)

Sure.

> Another trick is to notice that often you don't actually want the
> distance... you want 1/distance for normalization etc. Calculating
> the reciprocal square root is faster than getting the square root, so
> changing the routine to calculate 1/distance, and converting divisions
> scattered about your codebase into multiplications might be a net win.
> A quick search online will show you many optimized rsqrt() routines.

Thanks, but unfortunately I need the distance, not its inverse. I am
aware that there are various special rsqrt () tricks around.

Thanks again,
 

James Van Buskirk

> Thanks, but unfortunately I need the distance, not its inverse. I am
> aware that there are various special rsqrt () tricks around.

Computing rsqrt(x) is no faster than computing sqrt(x).
Assuming 3 iterations is enough:

y = transfer(magic-ishft(transfer(x,magic),-1),y)

a = x*y
b = a*y
c = 0.5*y
d = 3.0-b
y = c*d


a = x*y
b = a*y
c = 0.5*y
d = 3.0-b
y = c*d


a = x*y
b = a*y
c = 0.5*a
d = 3.0-b
y = c*d

The extra multiplication by x can be simply woven into the algorithm
at the last iteration at no cost in latency or throughput. Well, you
can mess around with the latency a little to get:

y = transfer(magic-ishft(transfer(x,magic),-1),y)

a = x*y
b = a*y
c = 0.5*y
d = 0.5*a
e = 3.0-b
y = c*e

a = d*e
b = a*y
c = 0.5*y
d = 0.5*a
e = 3.0-b
y = c*e

a = d*e
b = a*y
d = 0.5*a
e = 3.0-b
y = d*e
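In the SSE2 intrinsics used elsewhere in this thread, the whole thing
might look roughly like this (only a sketch: it seeds from rsqrtps
instead of the integer trick above, and with just two iterations the
result falls a little short of full double precision, so add a third
plain iteration if you need it; x == 0 would also need special-casing):

#include <emmintrin.h>

static __m128d sqrt_nr_pd (__m128d x)
{
    const __m128d half  = _mm_set1_pd (0.5);
    const __m128d three = _mm_set1_pd (3.0);
    __m128d y, a;

    /* initial ~12-bit estimate of 1/sqrt(x) */
    y = _mm_cvtps_pd (_mm_rsqrt_ps (_mm_cvtpd_ps (x)));

    /* Newton step: y = 0.5*y*(3 - x*y*y) */
    y = _mm_mul_pd (_mm_mul_pd (half, y),
                    _mm_sub_pd (three,
                                _mm_mul_pd (_mm_mul_pd (x, y), y)));

    /* last step with the multiply by x woven in:
       0.5*(x*y)*(3 - (x*y)*y) ~= sqrt(x) */
    a = _mm_mul_pd (x, y);
    return _mm_mul_pd (_mm_mul_pd (half, a),
                       _mm_sub_pd (three, _mm_mul_pd (a, y)));
}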
 
