S

#### Stefano Teso

First of all I don't know if this is the right group to post, sorry if

it is not; any suggestions on the right place for this kind of

discussion is warmly welcome.

I am writing a stochastic optimization algorithm which, as a part of

the energy calculation procedure, requires the computation of the

squared distance between a 3D vector from an array of n other 3D

vectors. This is the most computational intensive part of the

algorithm, so I wrote it using SSE2 intrinsics as defined in

<emmintr.h>, using GCC 4.4. The vector array is stored in SoA format,

with all components (x, y, z) aligned to a the 16-byte boundary, and

the same goes for the distance array.

The current code looks basically as follows:

static void

calc_distance_row (double *d, const double *xarray, const double

*yarray, const double *zarray, const int i, const int lo, const int

hi)

{

/* computes the *square* of the distance between a fixed vector i and

all vectors j in the range [lo, hi), whose coords are specified in

xarray, yarray, zarray */

/* NOTE: dist2 is a rather obvious C function that uses no intrinsics

*/

int j, lo_aligned, hi_aligned;

__m128d ix, iy, ix, jx, jy, jz, dx, dy, dz, mx, my, mz, s0, s1;

double *x, *y, *z;

lo_aligned = (lo & ~3);

hi_aligned = (hi & ~3);

for (j = lo; j < lo_aligned; j++)

darray[j] = dist2 (xarray

*, yarray*

*, zarray**, xarray[j],*

yarray[j], zarray[j]);

x = &xarray[lo_aligned];

y = &yarray[lo_aligned];

z = &zarray[lo_aligned];

d = &darray[lo_aligned];

ix = _mm_load1_pd (&xarrayyarray[j], zarray[j]);

x = &xarray[lo_aligned];

y = &yarray[lo_aligned];

z = &zarray[lo_aligned];

d = &darray[lo_aligned];

ix = _mm_load1_pd (&xarray

*);*

iy = _mm_load1_pd (&yarrayiy = _mm_load1_pd (&yarray

*);*

iz = _mm_load1_pd (&zarrayiz = _mm_load1_pd (&zarray

*);*

for (j = lo_aligned; j < hi_aligned; j++) {

jx = _mm_load_pd (&x[0]);

jy = _mm_load_pd (&y[0]);

jz = _mm_load_pd (&z[0]);

dx = _mm_sub_pd (ix, jx);

dy = _mm_sub_pd (iy, iy);

dz = _mm_sub_pd (iz, iz);

mx = _mm_mul_pd (dx, dx);

my = _mm_mul_pd (dy, dy);

mz = _mm_mul_pd (dz, dz);

s0 = _mm_add_pd (mx, my);

s1 = _mm_add_pd (s0, mz);

_mm_store_pd (&d[0], s1);

/* the above chunk of code is replicated once with '2' in

place of '0' in the array indices */

x = &x[4];

y = &y[4];

z = &z[4];

d = &d[4];

}

for (j = hi_aligned; j < hi; j++)

d[j] = dist2 (xarrayfor (j = lo_aligned; j < hi_aligned; j++) {

jx = _mm_load_pd (&x[0]);

jy = _mm_load_pd (&y[0]);

jz = _mm_load_pd (&z[0]);

dx = _mm_sub_pd (ix, jx);

dy = _mm_sub_pd (iy, iy);

dz = _mm_sub_pd (iz, iz);

mx = _mm_mul_pd (dx, dx);

my = _mm_mul_pd (dy, dy);

mz = _mm_mul_pd (dz, dz);

s0 = _mm_add_pd (mx, my);

s1 = _mm_add_pd (s0, mz);

_mm_store_pd (&d[0], s1);

/* the above chunk of code is replicated once with '2' in

place of '0' in the array indices */

x = &x[4];

y = &y[4];

z = &z[4];

d = &d[4];

}

for (j = hi_aligned; j < hi; j++)

d[j] = dist2 (xarray

*, yarray**, zarray**, xarray[j],*

yarray[j], zarray[j]);

}

(There may be typos in the above routine, anyway the real code was

tested against a simple minded C version, and proved to work as

expected, outperforming it by about 2x.)

The first and third loops are used to handle non-unrollable cases,

while the second loop -- the only one using SSE2 intrinsics -- is

unrolled twice, for a whopping 4 square distances computed per

iteration. The average array size is 100 to 1000 vectors, so no huge

number crunching here.

Is this the best way to perform this operation? Is there any obvious

pitfall in the above code? In particular, how do I check that I am

incurring in unaligned access penalties, and if that is the case, how

do I avoid them? And even more, is there any way I could improve the

performance by better cache usage / prefetching? I already tried

unrolling the middle loop more, but it held no improvement.

I would like to squeeze as much performance as possible from this

routine. Any suggestion is appreciated.

Thanks in advance,yarray[j], zarray[j]);

}

(There may be typos in the above routine, anyway the real code was

tested against a simple minded C version, and proved to work as

expected, outperforming it by about 2x.)

The first and third loops are used to handle non-unrollable cases,

while the second loop -- the only one using SSE2 intrinsics -- is

unrolled twice, for a whopping 4 square distances computed per

iteration. The average array size is 100 to 1000 vectors, so no huge

number crunching here.

Is this the best way to perform this operation? Is there any obvious

pitfall in the above code? In particular, how do I check that I am

incurring in unaligned access penalties, and if that is the case, how

do I avoid them? And even more, is there any way I could improve the

performance by better cache usage / prefetching? I already tried

unrolling the middle loop more, but it held no improvement.

I would like to squeeze as much performance as possible from this

routine. Any suggestion is appreciated.

Thanks in advance,