Computing a distance matrix using SSE2

Discussion in 'C Programming' started by Stefano Teso, Aug 4, 2009.

  1. Stefano Teso

    Stefano Teso Guest

    Hi,

First of all, I don't know if this is the right group to post to; sorry if
    it is not. Any suggestion on the right place for this kind of
    discussion is warmly welcome.

I am writing a stochastic optimization algorithm which, as part of
    the energy calculation procedure, requires computing the squared
    distances between a 3D vector and each vector in an array of n other
    3D vectors. This is the most computationally intensive part of the
    algorithm, so I wrote it using SSE2 intrinsics as defined in
    <emmintrin.h>, using GCC 4.4. The vector array is stored in SoA format,
    with all components (x, y, z) aligned to a 16-byte boundary, and
    the same goes for the distance array.
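
    For concreteness, the layout is something along these lines (an
    illustrative sketch, names included; posix_memalign is one way to
    obtain the alignment):

    /* SoA layout: one 16-byte-aligned array per coordinate, plus the
       output row; names are illustrative */
    double *xarray, *yarray, *zarray, *darray;
    posix_memalign ((void **) &xarray, 16, n * sizeof (double));
    posix_memalign ((void **) &yarray, 16, n * sizeof (double));
    posix_memalign ((void **) &zarray, 16, n * sizeof (double));
    posix_memalign ((void **) &darray, 16, n * sizeof (double));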

    The current code looks basically as follows:

    static void
    calc_distance_row (double *darray, const double *xarray,
                       const double *yarray, const double *zarray,
                       const int i, const int lo, const int hi)
    {
        /* computes the *square* of the distance between a fixed vector i
           and all vectors j in the range [lo, hi), whose coords are
           specified in xarray, yarray, zarray */

        /* NOTE: dist2 is a rather obvious C function that uses no
           intrinsics */

        int j, lo_aligned, hi_aligned;
        __m128d ix, iy, iz, jx, jy, jz, dx, dy, dz, mx, my, mz, s0, s1;
        const double *x, *y, *z;
        double *d;

        lo_aligned = (lo & ~3);
        hi_aligned = (hi & ~3);

        for (j = lo; j < lo_aligned; j++)
            darray[j] = dist2 (xarray[i], yarray[i], zarray[i],
                               xarray[j], yarray[j], zarray[j]);

        x = &xarray[lo_aligned];
        y = &yarray[lo_aligned];
        z = &zarray[lo_aligned];
        d = &darray[lo_aligned];

        ix = _mm_load1_pd (&xarray[i]);
        iy = _mm_load1_pd (&yarray[i]);
        iz = _mm_load1_pd (&zarray[i]);

        for (j = lo_aligned; j < hi_aligned; j++) {

            jx = _mm_load_pd (&x[0]);
            jy = _mm_load_pd (&y[0]);
            jz = _mm_load_pd (&z[0]);

            dx = _mm_sub_pd (ix, jx);
            dy = _mm_sub_pd (iy, iy);
            dz = _mm_sub_pd (iz, iz);

            mx = _mm_mul_pd (dx, dx);
            my = _mm_mul_pd (dy, dy);
            mz = _mm_mul_pd (dz, dz);

            s0 = _mm_add_pd (mx, my);
            s1 = _mm_add_pd (s0, mz);

            _mm_store_pd (&d[0], s1);

            /* the above chunk of code is replicated once with '2' in
               place of '0' in the array indices */

            x = &x[4];
            y = &y[4];
            z = &z[4];
            d = &d[4];
        }

        for (j = hi_aligned; j < hi; j++)
            darray[j] = dist2 (xarray[i], yarray[i], zarray[i],
                               xarray[j], yarray[j], zarray[j]);
    }

(There may be typos in the above routine; in any case, the real code
    was tested against a simple-minded C version and proved to work as
    expected, outperforming it by about 2x.)
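
    For reference, dist2 is just the obvious scalar version, something
    like the following sketch (reconstructed from the call sites, so the
    exact signature is a guess):

    static double
    dist2 (double xi, double yi, double zi, double xj, double yj, double zj)
    {
        /* plain C squared distance between points i and j, no intrinsics */
        double dx = xi - xj, dy = yi - yj, dz = zi - zj;
        return dx * dx + dy * dy + dz * dz;
    }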

    The first and third loops are used to handle non-unrollable cases,
    while the second loop -- the only one using SSE2 intrinsics -- is
    unrolled twice, for a whopping 4 square distances computed per
    iteration. The average array size is 100 to 1000 vectors, so no huge
    number crunching here.

Is this the best way to perform this operation? Is there any obvious
    pitfall in the above code? In particular, how do I check whether I am
    incurring unaligned-access penalties, and if that is the case, how
    do I avoid them? Even more, is there any way I could improve
    performance through better cache usage / prefetching? I already tried
    unrolling the middle loop further, but it brought no improvement.

    I would like to squeeze as much performance as possible from this
    routine. Any suggestion is appreciated.

    Thanks in advance,

    ----
    Stefano
    Stefano Teso, Aug 4, 2009
    #1

  2. Hendrik van der Heijden

    Hendrik van der Heijden Guest

    Stefano Teso wrote:
    > I would like to squeeze as much performance as possible from this
    > routine. Any suggestion is appreciated.


    > dy = _mm_sub_pd (iy, iy);
    > dz = _mm_sub_pd (iz, iz);


    ^ jy,jz ?!

    Assembler output would have been nice.

    The biggest improvement I can think of would be switching
    from doubles to floats, completely or just for the distance
    calculation. Do you need double precision?
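
    The single-precision inner step would look something like this
    (untested sketch; it assumes float arrays xf, yf, zf, df with the same
    16-byte alignment, which are my names, not yours):

    __m128 ix = _mm_set1_ps (xf[i]);  /* 4 lanes per register, not 2 */
    __m128 iy = _mm_set1_ps (yf[i]);
    __m128 iz = _mm_set1_ps (zf[i]);
    __m128 dx, dy, dz;

    for (j = lo_aligned; j < hi_aligned; j += 4) {
        dx = _mm_sub_ps (ix, _mm_load_ps (&xf[j]));
        dy = _mm_sub_ps (iy, _mm_load_ps (&yf[j]));
        dz = _mm_sub_ps (iz, _mm_load_ps (&zf[j]));
        _mm_store_ps (&df[j],
                      _mm_add_ps (_mm_add_ps (_mm_mul_ps (dx, dx),
                                              _mm_mul_ps (dy, dy)),
                                  _mm_mul_ps (dz, dz)));
    }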

A small thing: by explicitly using _mm_load_pd, you disallow the
    compiler from using a memory-operand instruction.
    The _mm_sub_pd ops could use a memory operand instead of jx, jy, jz.
    Try dx = _mm_sub_pd(ix, *(__m128d *)(&x[0])).

    Another one: If you do pointer arithmetic like x = &x[4], you
    have 5 running variables (x,y,z,d,hi_aligned-j) in the loop
    instead of 1 or 2.


    Hendrik vdH
    Hendrik van der Heijden, Aug 4, 2009
    #2

  3. Stefano Teso

    Stefano Teso Guest

    Hi,

just a quick update. The code posted is indeed bugged, since lo_aligned
    is always <= lo, so the first loop never runs. I managed to eliminate
    the first and last loops by allowing the computation to go lower than
    lo (down to lo_aligned) and higher than hi (up to the newly defined
    hi_aligned = (hi + 3) & ~3), and by adding extra padding to the
    distance and coord arrays to handle the otherwise-out-of-bounds memory
    accesses when j > hi. The new loop looks like:

    lo_aligned = lo & ~3;
    hi_aligned = (hi + 3) & ~3;

    for (j = lo_aligned; j < hi_aligned; j += 4) {
        <distance code>
    }
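
    The padding itself is just a matter of rounding the allocations up,
    e.g. (a sketch; n_pad is an illustrative name):

    /* round up to a multiple of 4 doubles so the loop can overrun
       [lo, hi) down to lo_aligned and up to hi_aligned without touching
       unowned memory */
    n_pad = (n + 3) & ~3;
    posix_memalign ((void **) &darray, 16, n_pad * sizeof (double));

    and likewise for the coordinate arrays.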

    Any suggestion on how to improve the code is still very welcome.

    Thanks,

    ----
    Stefano
    Stefano Teso, Aug 4, 2009
    #3
  4. user923005

    user923005 Guest

    On Aug 4, 7:41 am, Stefano Teso
    <> wrote:
    > [the original post, quoted in full; snipped]


    You want news:sci.math.num-analysis for questions about numerical
    algorithms.
    user923005, Aug 4, 2009
    #4
  5. tni

    tni Guest

    Stefano Teso wrote:

    > for (j = lo_aligned; j < hi_aligned; j++) {
    >
    > jx = _mm_load_pd (&x[0]);
    > jy = _mm_load_pd (&y[0]);
    > jz = _mm_load_pd (&z[0]);
    >
    > dx = _mm_sub_pd (ix, jx);
    > dy = _mm_sub_pd (iy, iy);
    > dz = _mm_sub_pd (iz, iz);


    Don't do that, the load isn't necessary. Use proper indexing for the
    arrays. (Get rid of x, y, z.)

    for (j = lo_aligned; j < hi_aligned; j += 2) { // 2 with no unrolling
        dx = _mm_sub_pd(ix, *(__m128d *) &xarray[j]);
    ...

You did notice your bug (dy and dz are 0), right? I wouldn't be
    surprised if most of your loop was optimized away because of that, so
    your performance numbers might be complete nonsense.

    > mx = _mm_mul_pd (dx, dx);
    > my = _mm_mul_pd (dy, dy);
    > mz = _mm_mul_pd (dz, dz);
    >
    > s0 = _mm_add_pd (mx, my);
    > s1 = _mm_add_pd (s0, mz);


    Put all that stuff into a single statement (so that the compiler won't
    be tempted to do unnecessary loads/stores):

s = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
    _mm_mul_pd(dz, dz));

    > _mm_store_pd (&d[0], s1);


    If you don't need the result right away, do a non-temporal store:
    _mm_stream_pd(d + j, s);

    Actually, get rid of s:
    _mm_stream_pd(d + j, _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx),
    _mm_mul_pd(dy, dy)), _mm_mul_pd(dz, dz)));

    > /* the above chunk of code is replicated once with '2' in
    > place of '0' in the array indices */
    >
    > x = &x[4];
    > y = &y[4];
    > z = &z[4];
    > d = &d[4];


    Use proper array indexing. GCC is reasonably intelligent and will likely
    make use of the complex addressing modes - no need for 4 pointer
    increments.

    > The first and third loops are used to handle non-unrollable cases,
    > while the second loop -- the only one using SSE2 intrinsics -- is
    > unrolled twice, for a whopping 4 square distances computed per
    > iteration.


    Loop unrolling will likely not buy you anything. On some CPUs, it may
    even hurt performance (they have optimizations for small loops).

    > Is this the best way to perform this operation? Is there any obvious
    > pitfall in the above code? In particular, how do I check that I am
    > incurring in unaligned access penalties, and if that is the case, how
    > do I avoid them?


    If there is an unaligned access, your program will simply crash.

    > And even more, is there any way I could improve the
    > performance by better cache usage / prefetching?


    It's unlikely that prefetching will buy you anything (it may even hurt
    performance).

    If you control the memory allocation, you should allocate aligned memory
    (_mm_malloc, _mm_free) in the first place.
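
    For example (a sketch):

    /* 16-byte-aligned allocation and matching free, from the
       intrinsics headers */
    double *x = _mm_malloc (n * sizeof (double), 16);
    /* ... */
    _mm_free (x);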

    You should really post the assembly. Disassemble your code with objdump.
    tni, Aug 5, 2009
    #5
  6. sfuerst

    sfuerst Guest

    One trick is to notice that |x-y|^2 == |x|^2 + |y|^2 - 2 x.y.

    This means that if you can pre-process your data, then you can remove
    the subtractions from the inner loop, and shrink the critical path.

    Simply construct the dot product of x and (-2y), and then add the
    result to the saved values of |x|^2 and |y|^2. There are extra
    additions, yes, but they may be done in parallel with the multiplies.

    Since you only have on the order of a thousand points, the data will
    fit in your cache. The extra |x|^2 values hopefully won't add too much
    extra memory pressure.

    Of course, you probably should benchmark this - sometimes the
    "obvious" way is faster.
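
    A sketch of the preprocessing step (the names are mine; a full
    routine appears later in the thread):

    /* once per data change: r[j] = |v_j|^2 */
    for (j = 0; j < n; j++)
        r[j] = x[j] * x[j] + y[j] * y[j] + z[j] * z[j];

    /* then: d2(i,j) = r[i] + r[j] - 2*(x[i]*x[j] + y[i]*y[j] + z[i]*z[j]) */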

    Steven
    sfuerst, Aug 5, 2009
    #6
  7. Stefano Teso

    Stefano Teso Guest

    On 4 Aug, 21:54, "christian.bau"
    <> wrote:
    > There is an aligned load and an unaligned load. The aligned load will
    > crash if your data isn't aligned, so you will notice if your data
    > isn't aligned. malloc () will typically return nicely aligned
    > pointers.


    I am using posix_memalign to allocate the coord data, so the
    _mm_load_pd should never trigger any exceptions.

    > You should worry mostly about making your code cache friendly when you
    > call this multiple times. And depending on what you do with the
    > results, there may be some mathematical ways to save operations.


Ok, but I don't know how to find, under GNU/Linux, the code paths that
    trigger cache misses by profiling... so I cannot really judge whether
    the code will be cache friendly or not.

    Thanks for your reply.

    ----
    Stefano
    Stefano Teso, Aug 5, 2009
    #7
  8. Stefano Teso

    Stefano Teso Guest

    Hi,

    On 5 Aug, 02:07, tni <> wrote:
    > Stefano Teso wrote:
    > > for (j = lo_aligned; j < hi_aligned; j++) {
    >
    > > jx = _mm_load_pd (&x[0]);
    > > jy = _mm_load_pd (&y[0]);
    > > jz = _mm_load_pd (&z[0]);
    >
    > > dx = _mm_sub_pd (ix, jx);
    > > dy = _mm_sub_pd (iy, iy);
    > > dz = _mm_sub_pd (iz, iz);

    >
    > Don't do that, the load isn't necessary. Use proper indexing for the
    > arrays. (Get rid of x, y, z.)


    Done.

    >
    > for (j = lo_aligned; j < hi_aligned; j += 2) { // 2 with no unrolling
    >     dx = _mm_sub_pd(ix, *(__m128d *) &xarray[j]);
    > ...


    a) I removed the unrolling since it didn't improve the performance at
    all (actually it didn't worsen it either, but why keep useless code in
    the source?).

b) Regarding the bug:

    > You did notice your bug (dy and dz are 0), right? I wouldn't be
    > surprised, if most of your loop was optimized away because of that, so
    > your performance numbers might be complete nonsense.


The bug was introduced when posting the code; it was not present in
    the original source.

    > > mx = _mm_mul_pd (dx, dx);
    > > my = _mm_mul_pd (dy, dy);
    > > mz = _mm_mul_pd (dz, dz);
    >
    > > s0 = _mm_add_pd (mx, my);
    > > s1 = _mm_add_pd (s0, mz);
    >
    > Put all that stuff into a single statement (so that the compiler won't
    > be tempted to do unnecessary loads/stores):
    >
    > s = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
    > _mm_mul_pd(dz, dz));


I tried this and it doesn't make any difference performance-wise, and
    the assembly output looks unaffected. The compiler was probably
    already smart enough.

    > > _mm_store_pd (&d[0], s1);

    >
    > If you don't need the result right away, do a non-temporal store:
    > _mm_stream_pd(d + j, s);
    >
    > Actually, get rid of s:
    > _mm_stream_pd(d + j, _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx),
    > _mm_mul_pd(dy, dy)), _mm_mul_pd(dz, dz)));


Unfortunately I need the values of the distance array right after this
    function is called, but thanks for the tip. Using it, by the way,
    slows down the function by about 100 cycles.

    > Loop unrolling will likely not buy you anything. On some CPUs, it may
    > even hurt performance (they have optimizations for small loops).


    I just checked and you are right; I re-rolled the loop.

    > You should really post the assembly. Disassemble your code with objdump.


    Here you go. C source code:

    /* static */ void
    calc_dist_row (const double *x,
                   const double *y,
                   const double *z,
                   const int i,
                   const int bandlo,
                   const int bandhi)
    {
        __m128d ix, iy, iz;
        __m128d dx, dy, dz;
        __m128d mx, my, mz;
        __m128d dd;
        int j, _bandlo, _bandhi;

        _bandlo = bandlo & ~1;
        _bandhi = bandhi;

        ix = _mm_load1_pd (&x[i]);
        iy = _mm_load1_pd (&y[i]);
        iz = _mm_load1_pd (&z[i]);

        for (j = _bandlo; j < _bandhi; j += 2) {

            dx = _mm_sub_pd (ix, _mm_load_pd (&x[j]));
            dy = _mm_sub_pd (iy, _mm_load_pd (&y[j]));
            dz = _mm_sub_pd (iz, _mm_load_pd (&z[j]));

            mx = _mm_mul_pd (dx, dx);
            my = _mm_mul_pd (dy, dy);
            mz = _mm_mul_pd (dz, dz);

            dd = _mm_sqrt_pd (_mm_add_pd (_mm_add_pd (mx, my), mz));

            _mm_store_pd (&dist_row[j], dd);
        }
    }

    And the assembly:

    0000000000000a90 <calc_dist_row>:
    a90: 48 63 c9 movsxd rcx,ecx
    a93: 41 ff c1 inc r9d
    a96: 41 83 e0 fe and r8d,0xfffffffffffffffe
    a9a: 41 83 e1 fe and r9d,0xfffffffffffffffe
    a9e: f2 0f 12 2c cf movddup xmm5,QWORD PTR [rdi+rcx*8]
    aa3: f2 0f 12 24 ce movddup xmm4,QWORD PTR [rsi+rcx*8]
    aa8: f2 0f 12 1c ca movddup xmm3,QWORD PTR [rdx+rcx*8]
    aad: 45 39 c8 cmp r8d,r9d
    ab0: 7d 67 jge b19 <calc_dist_row+0x89>
    ab2: 49 63 c0 movsxd rax,r8d
    ab5: 48 c1 e0 03 shl rax,0x3
    ab9: 48 01 c7 add rdi,rax
    abc: 48 01 c6 add rsi,rax
    abf: 48 01 c2 add rdx,rax
    ac2: 48 03 05 00 00 00 00 add rax,QWORD PTR [rip+0x0]
    # ac9 <calc_dist_row+0x39>
    ac9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
    ad0: 66 0f 28 c5 movapd xmm0,xmm5
    ad4: 66 0f 28 d4 movapd xmm2,xmm4
    ad8: 66 0f 5c 07 subpd xmm0,XMMWORD PTR [rdi]
    adc: 66 0f 5c 16 subpd xmm2,XMMWORD PTR [rsi]
    ae0: 66 0f 28 cb movapd xmm1,xmm3
    ae4: 66 0f 59 c0 mulpd xmm0,xmm0
    ae8: 66 0f 5c 0a subpd xmm1,XMMWORD PTR [rdx]
    aec: 66 0f 59 d2 mulpd xmm2,xmm2
    af0: 66 0f 59 c9 mulpd xmm1,xmm1
    af4: 66 0f 58 c2 addpd xmm0,xmm2
    af8: 41 83 c0 02 add r8d,0x2
    afc: 66 0f 58 c1 addpd xmm0,xmm1
    b00: 48 83 c7 10 add rdi,0x10
    b04: 66 0f 29 00 movapd XMMWORD PTR [rax],xmm0
    b08: 48 83 c6 10 add rsi,0x10
    b0c: 48 83 c2 10 add rdx,0x10
    b10: 48 83 c0 10 add rax,0x10
    b14: 45 39 c1 cmp r9d,r8d
    b17: 7f b7 jg ad0 <calc_dist_row+0x40>
    b19: f3 c3 repz ret
    b1b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]

    Thanks,

    ----
    Stefano
    Stefano Teso, Aug 5, 2009
    #8
  9. Stefano Teso

    Stefano Teso Guest


    > dd = _mm_sqrt_pd (_mm_add_pd (_mm_add_pd (mx, my), mz));

Whoops! Ignore the square root here please; I pasted the wrong code.
    The assembly is the right one, though.

    ----
    Stefano
    Stefano Teso, Aug 5, 2009
    #9
  10. sfuerst

    sfuerst Guest

    After some benchmarking, it looks like this trick actually helps,
    making the code 33% faster.

    #include <emmintrin.h>

    #define SIZE 1024

    static double dist_row[SIZE];

    __attribute__((noinline)) void calc_dist_row1(const double *x, const
    double *y, const double *z, int i, int bandlo, int bandhi)
    {
        __m128d ix, iy, iz;
        __m128d dx, dy, dz;
        __m128d mx, my, mz;
        __m128d dd;
        int j, _bandlo, _bandhi;

        _bandlo = bandlo & ~1;
        _bandhi = bandhi;

        ix = _mm_load1_pd(&x[i]);
        iy = _mm_load1_pd(&y[i]);
        iz = _mm_load1_pd(&z[i]);

        for (j = _bandlo; j < _bandhi; j += 2)
        {
            dx = _mm_sub_pd(ix, _mm_load_pd(&x[j]));
            dy = _mm_sub_pd(iy, _mm_load_pd(&y[j]));
            dz = _mm_sub_pd(iz, _mm_load_pd(&z[j]));

            mx = _mm_mul_pd(dx, dx);
            my = _mm_mul_pd(dy, dy);
            mz = _mm_mul_pd(dz, dz);

            dd = _mm_add_pd(_mm_add_pd(mx, my), mz);

            _mm_store_pd(&dist_row[j], dd);
        }
    }

    __attribute__((noinline)) void calc_dist_row2(const double *x, const
    double *y, const double *z, int i, int bandlo, int bandhi)
    {
        typedef double v2df __attribute__((vector_size(16)));

        v2df ix, iy, iz;
        v2df dx, dy, dz;
        v2df mx, my, mz;
        int j, _bandlo, _bandhi;

        _bandlo = bandlo & ~1;
        _bandhi = bandhi;

        ix = (v2df) {x[i], x[i]};
        iy = (v2df) {y[i], y[i]};
        iz = (v2df) {z[i], z[i]};

        for (j = _bandlo; j < _bandhi; j += 2)
        {
            dx = ix - *((v2df *) &x[j]);
            dy = iy - *((v2df *) &y[j]);
            dz = iz - *((v2df *) &z[j]);

            mx = dx * dx;
            my = dy * dy;
            mz = dz * dz;

            *((v2df *) &dist_row[j]) = mx + my + mz;
        }
    }

    /* Slower! */
    __attribute__((noinline)) void calc_dist_row3(const double *x, const
    double *y, const double *z, const double *r, int i, int bandlo, int
    bandhi)
    {
        typedef double v2df __attribute__((vector_size(16)));

        v2df ix, iy, iz, ir;
        v2df dx, dy, dz, rr;
        v2df mtwo = {-2, -2};

        int j, _bandlo, _bandhi;

        _bandlo = bandlo & ~1;
        _bandhi = bandhi;

        ix = (v2df) {x[i], x[i]};
        iy = (v2df) {y[i], y[i]};
        iz = (v2df) {z[i], z[i]};
        ir = (v2df) {r[i], r[i]};

        ix *= mtwo;
        iy *= mtwo;
        iz *= mtwo;

        for (j = _bandlo; j < _bandhi; j += 2)
        {
            rr = *((v2df *) &r[j]) + ir;
            dx = *((v2df *) &x[j]) * ix;
            dy = *((v2df *) &y[j]) * iy;
            dz = *((v2df *) &z[j]) * iz;

            *((v2df *) &dist_row[j]) = dx + dy + dz + rr;
        }
    }

    int main(void)
    {
        int i, j;

        double x[SIZE];
        double y[SIZE];
        double z[SIZE];
        double r[SIZE];

        for (i = 0; i < SIZE; i++)
        {
            x[i] = i;
            y[i] = i * i;
            z[i] = i * i - 137;
            r[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
        }

        for (j = 0; j < 1000; j++)
        {
            for (i = 0; i < 1000; i++)
            {
                /* 3.33s */
                //calc_dist_row1(x, y, z, i, 0, SIZE);

                /* 3.33s */
                //calc_dist_row2(x, y, z, i, 0, SIZE);

                /* 2.25s */
                calc_dist_row3(x, y, z, r, i, 0, SIZE);
            }
        }
    }

    The first subroutine is the original version (without the slow sqrt
    call which otherwise dominates the timing). The __attribute__
    ((noinline)) is to prevent the compiler from optimizing the function
    calls away. The second subroutine is the same algorithm translated
    into gcc's vector types, and thus much easier to read. The final
    subroutine uses the distance formula expansion to reduce the critical
    path, and allow more calculation in parallel.

    In gcc 4.4, the resulting times for the program to run are given in
    the comments in main(). Note that gcc 4.3 optimizes less well, and
    the third subroutine is actually slower than the first two... (Also,
    note that you should obviously compile with -O3 to get the best times)


    Here is the resulting asm for the inner loop:

    Originally, it was:

    L1:
    movapd %xmm5,%xmm0
    movapd %xmm4,%xmm2
    movapd %xmm3,%xmm1
    subpd (%rdi),%xmm0
    add $0x10,%rdi
    subpd (%rsi),%xmm2
    add $0x10,%rsi
    subpd (%rdx),%xmm1
    add $0x10,%rdx
    mulpd %xmm0,%xmm0
    mulpd %xmm2,%xmm2
    mulpd %xmm1,%xmm1
    addpd %xmm2,%xmm0
    addpd %xmm1,%xmm0
    movapd %xmm0,(%rax)
    add $0x10,%rax
    cmp %rcx,%rax
    jne L1

    After the algorithm change:
    L1:
    movapd (%rdi),%xmm0
    add $0x10,%rdi
    movapd (%rsi),%xmm1
    add $0x10,%rsi
    mulpd %xmm4,%xmm0
    mulpd %xmm3,%xmm1
    addpd %xmm1,%xmm0
    movapd (%rdx),%xmm1
    add $0x10,%rdx
    mulpd %xmm2,%xmm1
    addpd %xmm1,%xmm0
    movapd (%r8),%xmm1
    add $0x10,%r8
    addpd %xmm5,%xmm1
    addpd %xmm1,%xmm0
    movapd %xmm0,(%r10)
    add $0x10,%r10
    cmp %rax,%r8
    jne L1


    Steven
    sfuerst, Aug 5, 2009
    #10
  11. sfuerst

    sfuerst Guest

    Ah - looks like nearly all of the speedup of the algorithm change can
    be gotten by a simple change.

    Just change the lines
    dx = _mm_sub_pd(ix, _mm_load_pd(&x[j]));
    dy = _mm_sub_pd(iy, _mm_load_pd(&y[j]));
    dz = _mm_sub_pd(iz, _mm_load_pd(&z[j]));

    into
    dx = _mm_sub_pd(_mm_load_pd(&x[j]), ix);
    dy = _mm_sub_pd(_mm_load_pd(&y[j]), iy);
    dz = _mm_sub_pd(_mm_load_pd(&z[j]), iz);

    (x-y)**2 == (y-x)**2

    This causes gcc to emit much better code, with the benchmark program
    in the previous post taking 2.33s to run.

    In case anyone is interested, the new inner loop is:
    L1:
    movapd (%rdi),%xmm0
    add $0x10,%rdi
    movapd (%rsi),%xmm2
    add $0x10,%rsi
    subpd %xmm5,%xmm0
    movapd (%rdx),%xmm1
    add $0x10,%rdx
    subpd %xmm4,%xmm2
    subpd %xmm3,%xmm1
    mulpd %xmm0,%xmm0
    mulpd %xmm2,%xmm2
    mulpd %xmm1,%xmm1
    addpd %xmm2,%xmm0
    addpd %xmm1,%xmm0
    movapd %xmm0,(%rax)
    add $0x10,%rax
    cmp %rcx,%rax
    jne L1

    Steven
    sfuerst, Aug 5, 2009
    #11
  12. tni

    tni Guest

    Stefano Teso wrote:
    > Hi,
    >
    > On 5 Aug, 02:07, tni <> wrote:


    >>> mx = _mm_mul_pd (dx, dx);
    >>> my = _mm_mul_pd (dy, dy);
    >>> mz = _mm_mul_pd (dz, dz);
    >>> s0 = _mm_add_pd (mx, my);
    >>> s1 = _mm_add_pd (s0, mz);

    >> Put all that stuff into a single statement (so that the compiler won't
    >> be tempted to do unnecessary loads/stores):
    >>
    >> s = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx, dx), _mm_mul_pd(dy, dy)),
    >> _mm_mul_pd(dz, dz));

    >
    > I tried this and it doesn't make any difference performance-wise, and
    > the assembly output looks unaffected. The compiler was probably
    > already smart enough.


    If you ever use Visual Studio, it can make a difference in some cases.
    The register allocator often gets confused when temporary values are
    there (and reloads stuff).

BTW, the assembly code looks reasonable; it's unlikely there is more to
    be gained. Not sure why GCC isn't using complex addressing, but most
    current CPUs can do something like 3 adds per clock, so it likely
    doesn't really matter.

    (If you really do a sqrt in the loop, it's going to dominate the timing
    anyway.)

    VS does use complex addressing in this case:

    0000000140001070 movapd xmm1,xmm3
    0000000140001074 movapd xmm2,xmm4
    0000000140001078 movapd xmm0,xmm5
    000000014000107C add r11,10h
    0000000140001080 sub rdx,1
    0000000140001084 subpd xmm2,xmmword ptr [r11-10h]
    000000014000108A subpd xmm0,xmmword ptr [r10+r11-10h]
    0000000140001091 subpd xmm1,xmmword ptr [r9+r11-10h]
    0000000140001098 mulpd xmm2,xmm2
    000000014000109C mulpd xmm0,xmm0
    00000001400010A0 mulpd xmm1,xmm1
    00000001400010A4 addpd xmm2,xmm1
    00000001400010A8 addpd xmm2,xmm0
    00000001400010AC sqrtpd xmm0,xmm2
    00000001400010B0 movapd xmmword ptr [rax+r11-10h],xmm0
    00000001400010B7 jne main+70h (140001070h)
    tni, Aug 5, 2009
    #12
  13. Stefano Teso

    Stefano Teso Guest

    Hi,

    > After some benchmarking, it looks like this trick actually helps,
    > making the code 33% faster.


    Thanks for the tip and the code!

    > int main(void)
    > {
    >     int i, j;
    >
    >     double x[SIZE];
    >     double y[SIZE];
    >     double z[SIZE];
    >     double r[SIZE];
    >
    >     for (i = 0; i < SIZE; i++)
    >     {
    >         x[i] = i;
    >         y[i] = i * i;
    >         z[i] = i * i - 137;
    >         r[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
    >     }
    >
    >     for (j = 0; j < 1000; j++)
    >     {
    >         for (i = 0; i < 1000; i++)
    >         {
    >             /* 3.33s */
    >             //calc_dist_row1(x, y, z, i, 0, SIZE);
    >
    >             /* 3.33s */
    >             //calc_dist_row2(x, y, z, i, 0, SIZE);
    >
    >             /* 2.25s */
    >             calc_dist_row3(x, y, z, r, i, 0, SIZE);
    >         }
    >     }
    > }

    Does the timing you reported refer to the whole preprocessing + core
    loop thing or only to the core loop? Because the data I compute the
    distance on is generated dynamically and changes at every call to my
    subroutine (actually the subroutine is called *only* if the data
    changed, for efficiency), so I would have to preprocess the data each
    time.

    > The first subroutine is the original version (without the slow sqrt
    > call which otherwise dominates the timing).


I discovered that calling the sqrt in this routine instead of
    scattering sqrt () calls in other parts of the code actually improves
    the global performance, mainly because _mm_sqrt_pd can handle 2 square
    roots at a time (and with the switch to single precision, that goes up
    to 4) while sqrt () handles only one. So I think I'll stick with the
    _mm_sqrt_pd () call here...
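
    (Concretely, the batched version amounts to a pass like this sketch
    over the finished row of squared distances:

    /* two square roots per sqrtpd instruction */
    for (j = _bandlo; j < _bandhi; j += 2)
        _mm_store_pd (&dist_row[j],
                      _mm_sqrt_pd (_mm_load_pd (&dist_row[j])));

    rather than one scalar sqrt () per element elsewhere.)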

    Regards,

    ----
    Stefano
    Stefano Teso, Aug 7, 2009
    #13
  14. sfuerst

    sfuerst Guest

    On Aug 7, 3:21 am, Stefano Teso
    <> wrote:
    > Hi,
    >
    > > After some benchmarking, it looks like this trick actually helps,
    > > making the code 33% faster.
    >
    > Thanks for the tip and the code!
    >
    > Does the timing you reported refer to the whole preprocessing + core
    > loop thing or only to the core loop? Because the data I compute the
    > distance on is generated dynamically and changes at every call to my
    > subroutine (actually the subroutine is called *only* if the data
    > changed, for efficiency), so I would have to preprocess the data each
    > time.
    >
    > > The first subroutine is the original version (without the slow sqrt
    > > call which otherwise dominates the timing).
    >
    > I discovered that calling the sqrt in this routine instead of
    > scattering sqrt () calls in other parts of the code actually improves
    > the global performance, mainly because _mm_sqrt_pd can handle 2 square
    > roots at a time (and with the switch to single precision, that goes up
    > to 4) while sqrt () handles only one. So I think I'll stick with the
    > _mm_sqrt_pd () call here...
    >
    > Regards,
    >
    > ----
    > Stefano


    The timing refers to the whole program run... however the
    preprocessing only occurs once, whilst the squared distance routine is
    called many times in that program. If you are constantly changing the
    distance array, then the preprocessing cost will make that technique a
    loser. (It all depends on exactly how many times you re-use a
    point... so some benchmarking is in order.)

    Note that you can get a 30% speedup from just reordering your
    subtractions though. That is nearly all the improvement that the
    preprocessed algorithm can give, without any of the drawbacks.

Another trick is to notice that often you don't actually want the
    distance... you want 1/distance for normalization etc. Calculating
    the reciprocal square root is faster than getting the square root, so
    changing the routine to calculate 1/distance, and converting divisions
    scattered about your codebase into multiplications, might be a net win.
    A quick search online will show you many optimized rsqrt() routines.

    Steven
    sfuerst, Aug 7, 2009
    #14
  15. Stefano Teso

    Stefano Teso Guest


    > The timing refers to the whole program run... however the
    > preprocessing only occurs once, whilst the squared distance routine is
    > called many times in that program. If you are constantly changing the
    > distance array, then the preprocessing cost will make that technique a
    > loser. (It all depends on exactly how many times you re-use a
    > point... so some benchmarking is in order.)


    Sure.

    > Another trick is to notice that often you don't actually want the
    > distance... you want 1/distance for normalization etc. Calculating
    > the reciprocal square root is faster than getting the square root, so
    > changing the routine to calculate 1/distance, and converting divisions
    > scattered about your codebase into multiplications, might be a net win.
    > A quick search online will show you many optimized rsqrt() routines.


    Thanks, but unfortunately I need the distance, not its inverse. I am
    aware that there are various special rsqrt () tricks around.

    Thanks again,

    ----
    Stefano
    Stefano Teso, Aug 8, 2009
    #15
  16. "Stefano Teso" <> wrote in message
    news:4a7d6720$0$5639$...

    >> Another trick is to notice that often you don't actually want the
    >> distance... you want 1/distance for normalization etc. Calculating
    >> the reciprocal square root is faster than getting the square root, so
    >> changing the routine to calculate 1/distance, and converting divisions
    >> scattered about your codebase into multiplications, might be a net win.
    >> A quick search online will show you many optimized rsqrt() routines.


    > Thanks, but unfortunately I need the distance, not its inverse. I am
    > aware that there are various special rsqrt () tricks around.


    Computing rsqrt(x) is no faster than computing sqrt(x).
    Assuming 3 iterations is enough:

    y = transfer(magic-ishft(transfer(x,magic),-1),y)

    a = x*y
    b = a*y
    c = 0.5*y
    d = 3.0-b
    y = c*d


    a = x*y
    b = a*y
    c = 0.5*y
    d = 3.0-b
    y = c*d


    a = x*y
    b = a*y
    c = 0.5*a
    d = 3.0-b
    y = c*d

    The extra multiplication by x can be simply woven into the algorithm
    at the last iteration at no cost in latency or throughput. Well, you
    can mess around with the latency a little to get:

    y = transfer(magic-ishft(transfer(x,magic),-1),y)

    a = x*y
    b = a*y
    c = 0.5*y
    d = 0.5*a
    e = 3.0-b
    y = c*e

    a = d*e
    b = a*y
    c = 0.5*y
    d = 0.5*a
    e = 3.0-b
    y = c*e

    a = d*e
    b = a*y
    d = 0.5*a
    e = 3.0-b
    y = d*e
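
    In C, the same scheme reads roughly as follows (a sketch; the 64-bit
    magic constant is one published choice for doubles, an assumption on
    my part rather than something from this thread):

    #include <stdint.h>
    #include <string.h>

    static double sqrt_via_rsqrt (double x)
    {
        uint64_t bits;
        double y, a;

        /* initial guess:
           y = transfer(magic - ishft(transfer(x, magic), -1), y) */
        memcpy (&bits, &x, sizeof bits);
        bits = 0x5fe6eb50c7b537a9ULL - (bits >> 1);
        memcpy (&y, &bits, sizeof y);

        y = 0.5 * y * (3.0 - x * y * y);    /* Newton step 1 */
        y = 0.5 * y * (3.0 - x * y * y);    /* Newton step 2 */
        a = x * y;                          /* fold in the multiply by x */
        return 0.5 * a * (3.0 - a * y);     /* = x * rsqrt(x) = sqrt(x) */
    }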

    --
    write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
    6.0134700243160014d-154/),(/'x'/)); end
    James Van Buskirk, Aug 8, 2009
    #16