Counter Intuitive Results: Optimising an FFT routine for Speed

Paul Brown · Oct 7, 2003

OK - I feel somewhat exonerated.

I tried various combinations of options with the high iteration code
segment. Indeed the "pointerized" version DID RUN SIGNIFICANTLY FASTER
(as in 21%) when I revert to 32 bit floating point here instead of
double (for those entering the thread here I repeat the salient code).

This is strange but credible, I understood that all internal
arithmetic was carried out with doubles, and casts to and from float
performed as necessary to match the variables data types. This does
not appear to be the case with Turbo-C

Now I can put my present trials to rest and start hunting through
FFTW.

BTW - I could not get CYCLE.h to work, Turbo-C does not support
__INLINE__ or __ASM__. Good thing I spent the time to get my
****PORTABLE**** timer to work.

/*
_1 curReal = wr *data[ ixj ] -wi *data[ ixj
+1];
_1 curImag = wr *data[ ixj+1 ] +wi *data[ ixj
];
_1 data[ ixj] = data[ ixData ] -curReal;
_1 data[ ixj+1] = data[ ixData+1] -curImag;
_1 data[ ixData] += curReal;
_1 data[ ixData+1] += curImag;
*/

#if defined(FFT_FLOAT)
# define srcPtr0_ floatPtr_1__
# define srcPtr1_ floatPtr_2__
# define tgtPtr0_ floatPtr_3__
# define tgtPtr1_ floatPtr_4__
#endif

srcPtr1_ = srcPtr0_ = &( data[ixj]); ++srcPtr1_;
tgtPtr1_ = tgtPtr0_ = &( data[ixData]); ++tgtPtr1_;
curReal = wr **srcPtr0_ -wi **srcPtr1_;
curImag= wr **srcPtr1_ +wi **srcPtr0_;
*srcPtr0_ = *tgtPtr0_ -curReal;
*srcPtr1_ = *tgtPtr1_ -curImag;
*tgtPtr0_ += curReal;
*tgtPtr1_ += curImag;

# undef srcPtr0_
# undef srcPtr1_
# undef tgtPtr0_
# undef tgtPtr1_

This is a typical execution time with full double variables

FFT_NRC starting to execute 5000 loops
FFT_NRC time = 923647 ±88(3.5s) ns

Here is the comparison between 32 bit floats, FFT_NRC_prev is the _1
code commented out above and FFT_NRC is the pointerized code :

FFT_NRC starting to execute 5000 loops
FFT_NRC time = 519175 ±88(3.5s) ns

FFT_NRC_prev starting to execute 5000 loops
FFT_NRC_prev time = 658008 ±88(3.5s) ns

FFT_NRC time change :FFT_NRC_Prev = -138.833 ± 0.124(3.5s) µs
(-21.10%)

FFT_NRC_prev starting to execute 1000 loops
FFT_NRC_prev time = 654509 ±451(3.5s) ns

FFT_NRC starting to execute 1000 loops
FFT_NRC time = 515313 ±451(3.5s) ns

Steven G. Johnson · Oct 7, 2003

Paul said:
This is strange but credible, I understood that all internal
arithmetic was carried out with doubles, and casts to and from float
performed as necessary to match the variables data types. This does
not appear to be the case with Turbo-C

The x86 registers are extended precision (or at least double, depending
upon the mode), so your compiler doesn't have much choice in the matter.
Nor does this affect the speed.

Single precision is almost always faster than double precision, though,
at least for significant transform sizes, simply because of cache usage.

(Double precision variables also need to be 8-byte aligned on x86 or the
speed dies horribly and unpredictably; not all compilers do this.)

BTW - I could not get CYCLE.h to work, Turbo-C does not support
__INLINE__ or __ASM__. Good thing I spent the time to get my
****PORTABLE**** timer to work.

You asked for something with more resolution than clock(), so I told you
the best resolution you can get; if you're happy with clock(), why did
you ask? Unfortunately, accessing the cycle counter is necessarily
compiler/CPU-specific. (You also inexplicably grabbed the code from
cycle.h labeled as gcc-specific.)

BTW, clock() may be portable, but any assumptions about its resolution
are not.

....

Not that any of this is relevant to the DSP chip you eventually plan to
run on.

Counter Intuitive Results: Optimising an FFT routine for Speed	7	Oct 6, 2003
Counter Intuitive Results: Optimising an FFT routine for Speed	4	Oct 7, 2003
Counter Intuitive Results: Optimising an FFT routine for Speed	2	Oct 8, 2003
Speed up creation of combo box options	2	Mar 10, 2006
a c project for help!	0	Jan 5, 2005
Newbie help for using multiprocessing and subprocess packages forcreating child processes	0	Jun 16, 2009
SINGAPORE PRIVATE CONDO / APT FOR SALE / Singapore New Upcoming Residential Projects	5	Dec 16, 2006
sysread(socket..) problem on perl 5.8.0 linux	2	Nov 23, 2004

Counter Intuitive Results: Optimising an FFT routine for Speed

Paul Brown

Steven G. Johnson

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads