Counter Intuitive Results: Optimising an FFT routine for Speed

Discussion in 'C Programming' started by Paul Brown, Oct 7, 2003.

  1. Paul Brown

    Paul Brown Guest

    OK - I feel somewhat exonerated.

    I tried various combinations of options with the high iteration code
    segment. Indeed the "pointerized" version DID RUN SIGNIFICANTLY FASTER
    (as in 21%) when I revert to 32 bit floating point here instead of
    double (for those entering the thread here I repeat the salient code).

    This is strange but credible, I understood that all internal
    arithmetic was carried out with doubles, and casts to and from float
    performed as necessary to match the variables data types. This does
    not appear to be the case with Turbo-C

    Now I can put my present trials to rest and start hunting through
    FFTW.

    BTW - I could not get CYCLE.h to work, Turbo-C does not support
    __INLINE__ or __ASM__. Good thing I spent the time to get my
    ****PORTABLE**** timer to work.



    /*
    _1 curReal = wr *data[ ixj ] -wi *data[ ixj
    +1];
    _1 curImag = wr *data[ ixj+1 ] +wi *data[ ixj
    ];
    _1 data[ ixj] = data[ ixData ] -curReal;
    _1 data[ ixj+1] = data[ ixData+1] -curImag;
    _1 data[ ixData] += curReal;
    _1 data[ ixData+1] += curImag;
    */

    #if defined(FFT_FLOAT)
    # define srcPtr0_ floatPtr_1__
    # define srcPtr1_ floatPtr_2__
    # define tgtPtr0_ floatPtr_3__
    # define tgtPtr1_ floatPtr_4__
    #endif

    srcPtr1_ = srcPtr0_ = &( data[ixj]); ++srcPtr1_;
    tgtPtr1_ = tgtPtr0_ = &( data[ixData]); ++tgtPtr1_;
    curReal = wr **srcPtr0_ -wi **srcPtr1_;
    curImag= wr **srcPtr1_ +wi **srcPtr0_;
    *srcPtr0_ = *tgtPtr0_ -curReal;
    *srcPtr1_ = *tgtPtr1_ -curImag;
    *tgtPtr0_ += curReal;
    *tgtPtr1_ += curImag;

    # undef srcPtr0_
    # undef srcPtr1_
    # undef tgtPtr0_
    # undef tgtPtr1_


    This is a typical execution time with full double variables

    FFT_NRC starting to execute 5000 loops
    FFT_NRC time = 923647 ±88(3.5s) ns



    Here is the comparison between 32 bit floats, FFT_NRC_prev is the _1
    code commented out above and FFT_NRC is the pointerized code :

    FFT_NRC starting to execute 5000 loops
    FFT_NRC time = 519175 ±88(3.5s) ns

    FFT_NRC_prev starting to execute 5000 loops
    FFT_NRC_prev time = 658008 ±88(3.5s) ns

    FFT_NRC time change :FFT_NRC_Prev = -138.833 ± 0.124(3.5s) µs
    (-21.10%)

    FFT_NRC_prev starting to execute 1000 loops
    FFT_NRC_prev time = 654509 ±451(3.5s) ns

    FFT_NRC starting to execute 1000 loops
    FFT_NRC time = 515313 ±451(3.5s) ns
     
    Paul Brown, Oct 7, 2003
    #1
    1. Advertising

  2. Paul Brown wrote:
    > This is strange but credible, I understood that all internal
    > arithmetic was carried out with doubles, and casts to and from float
    > performed as necessary to match the variables data types. This does
    > not appear to be the case with Turbo-C


    The x86 registers are extended precision (or at least double, depending
    upon the mode), so your compiler doesn't have much choice in the matter.
    Nor does this affect the speed.

    Single precision is almost always faster than double precision, though,
    at least for significant transform sizes, simply because of cache usage.

    (Double precision variables also need to be 8-byte aligned on x86 or the
    speed dies horribly and unpredictably; not all compilers do this.)

    > BTW - I could not get CYCLE.h to work, Turbo-C does not support
    > __INLINE__ or __ASM__. Good thing I spent the time to get my
    > ****PORTABLE**** timer to work.


    You asked for something with more resolution than clock(), so I told you
    the best resolution you can get; if you're happy with clock(), why did
    you ask? Unfortunately, accessing the cycle counter is necessarily
    compiler/CPU-specific. (You also inexplicably grabbed the code from
    cycle.h labeled as gcc-specific.)

    BTW, clock() may be portable, but any assumptions about its resolution
    are not.

    ....

    Not that any of this is relevant to the DSP chip you eventually plan to
    run on.
     
    Steven G. Johnson, Oct 7, 2003
    #2
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. aj
    Replies:
    3
    Views:
    4,243
    Jaakko Varteva
    Nov 24, 2005
  2. Paul Brown
    Replies:
    7
    Views:
    387
    Christian Bau
    Oct 10, 2003
  3. Paul Brown
    Replies:
    4
    Views:
    412
    Randy Howard
    Oct 9, 2003
  4. Paul Brown
    Replies:
    2
    Views:
    310
    Keith Thompson
    Oct 9, 2003
  5. Replies:
    8
    Views:
    356
    infobahn
    Feb 5, 2005
Loading...

Share This Page