Counter Intuitive Results: Optimising an FFT routine for Speed

P

Paul Brown

OK - I feel somewhat exonerated.

I tried various combinations of options with the high iteration code
segment. Indeed the "pointerized" version DID RUN SIGNIFICANTLY FASTER
(as in 21%) when I revert to 32 bit floating point here instead of
double (for those entering the thread here I repeat the salient code).

This is strange but credible, I understood that all internal
arithmetic was carried out with doubles, and casts to and from float
performed as necessary to match the variables data types. This does
not appear to be the case with Turbo-C

Now I can put my present trials to rest and start hunting through
FFTW.

BTW - I could not get CYCLE.h to work, Turbo-C does not support
__INLINE__ or __ASM__. Good thing I spent the time to get my
****PORTABLE**** timer to work.



/*
_1 curReal = wr *data[ ixj ] -wi *data[ ixj
+1];
_1 curImag = wr *data[ ixj+1 ] +wi *data[ ixj
];
_1 data[ ixj] = data[ ixData ] -curReal;
_1 data[ ixj+1] = data[ ixData+1] -curImag;
_1 data[ ixData] += curReal;
_1 data[ ixData+1] += curImag;
*/

#if defined(FFT_FLOAT)
# define srcPtr0_ floatPtr_1__
# define srcPtr1_ floatPtr_2__
# define tgtPtr0_ floatPtr_3__
# define tgtPtr1_ floatPtr_4__
#endif

srcPtr1_ = srcPtr0_ = &( data[ixj]); ++srcPtr1_;
tgtPtr1_ = tgtPtr0_ = &( data[ixData]); ++tgtPtr1_;
curReal = wr **srcPtr0_ -wi **srcPtr1_;
curImag= wr **srcPtr1_ +wi **srcPtr0_;
*srcPtr0_ = *tgtPtr0_ -curReal;
*srcPtr1_ = *tgtPtr1_ -curImag;
*tgtPtr0_ += curReal;
*tgtPtr1_ += curImag;

# undef srcPtr0_
# undef srcPtr1_
# undef tgtPtr0_
# undef tgtPtr1_


This is a typical execution time with full double variables

FFT_NRC starting to execute 5000 loops
FFT_NRC time = 923647 ±88(3.5s) ns



Here is the comparison between 32 bit floats, FFT_NRC_prev is the _1
code commented out above and FFT_NRC is the pointerized code :

FFT_NRC starting to execute 5000 loops
FFT_NRC time = 519175 ±88(3.5s) ns

FFT_NRC_prev starting to execute 5000 loops
FFT_NRC_prev time = 658008 ±88(3.5s) ns

FFT_NRC time change :FFT_NRC_Prev = -138.833 ± 0.124(3.5s) µs
(-21.10%)

FFT_NRC_prev starting to execute 1000 loops
FFT_NRC_prev time = 654509 ±451(3.5s) ns

FFT_NRC starting to execute 1000 loops
FFT_NRC time = 515313 ±451(3.5s) ns
 
S

Steven G. Johnson

Paul said:
This is strange but credible, I understood that all internal
arithmetic was carried out with doubles, and casts to and from float
performed as necessary to match the variables data types. This does
not appear to be the case with Turbo-C

The x86 registers are extended precision (or at least double, depending
upon the mode), so your compiler doesn't have much choice in the matter.
Nor does this affect the speed.

Single precision is almost always faster than double precision, though,
at least for significant transform sizes, simply because of cache usage.

(Double precision variables also need to be 8-byte aligned on x86 or the
speed dies horribly and unpredictably; not all compilers do this.)
BTW - I could not get CYCLE.h to work, Turbo-C does not support
__INLINE__ or __ASM__. Good thing I spent the time to get my
****PORTABLE**** timer to work.

You asked for something with more resolution than clock(), so I told you
the best resolution you can get; if you're happy with clock(), why did
you ask? Unfortunately, accessing the cycle counter is necessarily
compiler/CPU-specific. (You also inexplicably grabbed the code from
cycle.h labeled as gcc-specific.)

BTW, clock() may be portable, but any assumptions about its resolution
are not.

....

Not that any of this is relevant to the DSP chip you eventually plan to
run on.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,774
Messages
2,569,596
Members
45,128
Latest member
ElwoodPhil
Top