Counter Intuitive Results: Optimising an FFT routine for Speed

P

Paul Brown

I am confused - I have gone through a two day exercise tuning up an
FFT routine, and at the last stage I find that where the biggest gains
were expected, all my conventional understanding of efficient C seems
to be turned on its head.

Up until this point I have achieved good speed improvements following
an evolutionary approach, starting from the routine entry point I have
modified code, run, validated and timed a test case and follow the
steepest descent optimisation approach.

When I started, with a Cooley-Tukey routine, I was at 2.7ms per 1024
point complex FFT (1GHz Celeron - Win2000). I am still using a baby C
system - Turbo C, but would be surprised if this is the cause of the
problems.

I have reduced this time to 0.6ms per FFT call, when I reached the
heart of the code, the innermost of 3 loops, where the correlation
between complex exponentials and data occurs :

tempr=wr*data[ixj]-wi*data[ixj+1];
tempi=wr*data[ixj+1]+wi*data[ixj];
data[ixj]=data[ixData]-tempr;
data[ixj+1]=data[ixData+1]-tempi;
data[ixData] += tempr;
data[ixData+1] += tempi;

This piece of code is executed 5120 times (ie. N Log_2(N) ) - so a lot
of opportunity to improve things. I have removed superfluous array
indexing and changed to full 64 bit arithmetic (which has proved to be
faster than 32bit in previous tweaking) as follows

#if defined(FFT_DOUBLE)
# define srcPtr0_ doublePtr_1__
# define srcPtr1_ doublePtr_2__
# define tgtPtr0_ doublePtr_3__
# define tgtPtr1_ doublePtr_4__
#endif
#if defined(FFT_FLOAT)
# define srcPtr0_ floatPtr_1__
# define srcPtr1_ floatPtr_2__
# define tgtPtr0_ floatPtr_3__
# define tgtPtr1_ floatPtr_4__
#endif

srcPtr1_ = srcPtr0_ = &( data[ixj]); ++srcPtr1_;
tgtPtr1_ = tgtPtr0_ = &( data[ixData]); ++tgtPtr1_;

curReal = wr **srcPtr0_ -wi **srcPtr1_;
curImag= wr **srcPtr1_ +wi **srcPtr0_;
*srcPtr0_ = *tgtPtr0_ -curReal;
*srcPtr1_ = *tgtPtr1_ -curImag;
*tgtPtr0_ += curReal;
*tgtPtr1_ += curImag;

# undef srcPtr0_
# undef srcPtr1_
# undef tgtPtr0_
# undef tgtPtr1_

However now I find the time per call has gone up significantly.

Using the FFT_FLOAT option above, the call time increased by 39%
compared with the original code.

Using the FFT_DOUBLE option, it is a bit quicker, only 20% longer. So
by rights I should just use the original code, with all variables
changed to doubles, but I am loathe to do so just yet until I
understand why the current bottleneck exists.

Why should accessing vectorised data using pointer references be so
much slower than indexing - surely Turbo-C cannot have pointer
de-referencing so badly wrong?


Some other ancillary points that I would be grateful for feedback on :



I have written a software timer that squeezes better resolution out of
clock() than the CLK_TCK (supposedly 18.2ms), which is normally
available.

I have timed some runs with a stopwatch, this is what I got

_(" This is the standard value, ms per each tick in clock()")
#undef CLK_TCK 18.2

_(" This measured over 45s on a Celeron 1GHz, ms per each tick in
clock()")
#define CLK_TCK 56.12


On a 850MHz Athlon (Win-98) system I can get 600ns resolution, however
with all the system overheads in Win-2k, even using a 1GHz Celeron, I
only achieve around 1ms resolution.

Is there any easy way to disable or reduce a lot of the Win-2k system
workload to get more accurate timing? I remember reading about 5
years ago of a "software stopwatch" that could read a far higher
resolution CPU timer, anyone have a reference to this?


These are typically the sort of results I am getting

FFT_NRC starting to execute 5000 loops
FFT_NRC time = 934456 ±88(3.5s) ns

FFT_NRC_prev starting to execute 5000 loops
FFT_NRC_prev time = 776079 ±88(3.5s) ns

FFT_NRC time change :FFT_NRC_Prev = 158.377 ± 0.124(3.5s) µs
(20.41%)

FFT_NRC_prev starting to execute 1000 loops
FFT_NRC_prev time = 777115 ±451(3.5s) ns

FFT_NRC starting to execute 1000 loops
FFT_NRC time = 935104 ±451(3.5s) ns

Thanks,
Paul.
 
T

Tristan Miller

Greetings.

[massive post about optimization using Turbo C on Windows snipped]

Holy crossposting, Batman! What does this message have to do with Oberon,
functional programming, or ISO/ANSI C?

For algorithm questions, try an algorithm or programming newsgroup. (I
suppose sci.math.num-analysis would be appropriate for this.)

For questions about programming for specific compilers or operating systems,
post to a group dedicated to your particular compiler and/or operating
system.

comp.lang.c is for discussion of standard C only, and not how to tweak
specific compiler or system settings.

comp.lang.functional is, I assume, about programming languages such as LISP
and Scheme, not procedural languages like C.

I have no idea what comp.lang.oberon is doing in the headers.

Regards,
Tristan
 
E

Eric Sosman

Paul said:
I am confused - I have gone through a two day exercise tuning up an
FFT routine, and at the last stage I find that where the biggest gains
were expected, all my conventional understanding of efficient C seems
to be turned on its head.

[Replaced array-index operations with explicit pointers, and
the code got slower.]

Why should accessing vectorised data using pointer references be so
much slower than indexing - surely Turbo-C cannot have pointer
de-referencing so badly wrong?

You've posted this to quite an array of newsgroups. Here in
comp.lang.c (where I've set the followups so as to spare the rest
of the wide audience), the question is officially off-topic because
the C Standard makes no promises about the speeds or even relative
speeds of different constructs. On the other hand, this off-topic
question of efficiency is of great interest, so it winds up getting
written about quite a lot ...

At the simplest level, your question is number 20.14 in the
comp.lang.c Frequently Asked Questions (FAQ) list

http://www.eskimo.com/~scs/C-faq/top.html

.... but you may be unsatisfied with the answer, which boils down
to "It depends." For a concrete example of why array indexing
might (*might*) win in your situation, consider this: In code
like your original (much snipped):

data[ixj]=data[ixData]-tempr;
data[ixj+1]=data[ixData+1]-tempi;

.... it is fairly easy for the compiler to prove to itself that
data[ixj] and data[ixj+1] are distinct objects, and data[ixData]
and data[ixData+1] are likewise distinct, and that all these are
distinct from tempr and tempi. (A really smart optimizer may even
be able to prove ixj!=ixData.) If the compiler can determine that
these data objects do not overlap or otherwise interfere, it can
perhaps rearrange the computation for more efficient use of the
numeric pipelines or of the memory access path. Perhaps it can
fetch both data[ixj] and data[ixj+1] in one operation, operate on
both of them in registers or other CPU-internal fast locations,
and use just one operation to store both results.

But when you rewrite with explicit pointers (paraphrased):

*out_r = *in_r - tempr;
*out_i = *in_i - tempi;

.... it may not be so easy for the compiler to determine that the
pointers are aimed at distinct data objects. And if there's any
possibility that some of these pointers address the same data
locations, the compiler's freedom to rearrange things dwindles.
For example, if the compiler thinks it might be possible that
out_r==&tempi, it dares not fetch the value of tempi until after
it's stored the result of the first assignment -- that is, the
compiler may not be able to hold tempr and tempi in registers,
and may generate code to fetch them anew every time.

Of course, this is speculation: It's only meant to offer a
plausible example of the kind of thing that *might* be going on.
You'd have to examine the actual generated code to have a hope
of figuring out what's really happening in any particular case.
And, of course, the conclusions would only be good for that one
implementation, and under that one set of circumstances: add a
perfectly innocent printf() somewhere and things might change.
 
S

Steven G. Johnson

Paul said:
When I started, with a Cooley-Tukey routine, I was at 2.7ms per 1024
point complex FFT (1GHz Celeron - Win2000). I am still using a baby C
system - Turbo C, but would be surprised if this is the cause of the
problems.

I have reduced this time to 0.6ms per FFT call, when I reached the
heart of the code, the innermost of 3 loops, where the correlation
between complex exponentials and data occurs :

You realize, of course, that your first mistake was starting with the
radix-2 Numerical Recipes in C code and spending your time attempting
micro-optimizations. The biggest speed gains in an FFT (and most
algorithms) come from much higher-level transformations. For example,
our FFTW (www.fftw.org) performs the same size-1024 FFT
(double-precision) almost 10 times faster than you (0.06-0.07ms on a
1GHz P-III, gcc 3.3).

I'm not going to go into the possible transformations that one can do;
we have papers on our web site, and better yet you can just go and read
the code for some of the faster FFTs (see www.fftw.org/speed for
benchmarks) to see their tricks. Such codes aren't as short as
Numerical Recipes, I'm afraid, because you trade complexity for
performance (as well as generality, e.g. arbitrary sizes). If you just
want a fast FFT and are not doing this as a learning experience, you'll
be much better off downloading an existing optimized code.

Cordially,
Steven G. Johnson

PS. With regards to your second question, about timing, you should use a
different routine than clock(). The best resolution comes from reading
the CPU cycle counter directly (see www.fftw.org/download.html for code
to do this on many CPUs).

PPS. A decent compiler should transform array references a[j] in a loop
into separate incremented pointers for you, if it's advantageous to do
so. By being clever you can easily just confuse the compiler's
optimizer unless you know what you're doing.
 
K

Kurt Krueger

(lots of stuff snipped)
Paul Brown wrote:

Steven G. Johnson said:
PPS. A decent compiler should transform array references a[j] in a loop
into separate incremented pointers for you, if it's advantageous to do
so. By being clever you can easily just confuse the compiler's
optimizer unless you know what you're doing.

I've done the above many a time. The only way I've been able to cope is
by learning the instruction set and reading the (equivalent) assembly code.
It's pretty easy to tell when you've confused the optimizer.

But be aware that modern architectures can execute instructions out of the
original sequence and the optimizer is aware of this. So it may refuse to
organize code in the 'best' sequence, knowing that instruction pipelining
will take care of it. Forcing what appears to be 'best' might goof up the
rest of the optimization.

Most of my experience has been on SPARC, but Pentium architectures share
a lot of thefeatures.
 
M

Malcolm

Paul Brown said:
Why should accessing vectorised data using pointer references be so
much slower than indexing - surely Turbo-C cannot have pointer
de-referencing so badly wrong?
x86 has a shortage of registers. By forcing the compiler to store
intermediate results in pointers, you might be running it out of registers
and causing thrashing of the register file.
Just a guess - you'd need to examine the assembly to see where it is really
going wrong.
If you need to optimise code, try not to use an old compiler for a new
target. It can't take advantage of any changes to the architecture since
when it was written.
 
J

John R. Strohm

Paul Brown said:
tempr=wr*data[ixj]-wi*data[ixj+1];
tempi=wr*data[ixj+1]+wi*data[ixj];
data[ixj]=data[ixData]-tempr;
data[ixj+1]=data[ixData+1]-tempi;
data[ixData] += tempr;
data[ixData+1] += tempi;

It has been quite a few years since I last looked at FFT code, and that was
a casual look during some self-education time. This is the fundamental
radix-2 butterfly, isn't it?
 
C

Christian Bau

"John R. Strohm said:
Paul Brown said:
tempr=wr*data[ixj]-wi*data[ixj+1];
tempi=wr*data[ixj+1]+wi*data[ixj];
data[ixj]=data[ixData]-tempr;
data[ixj+1]=data[ixData+1]-tempi;
data[ixData] += tempr;
data[ixData+1] += tempi;

It has been quite a few years since I last looked at FFT code, and that was
a casual look during some self-education time. This is the fundamental
radix-2 butterfly, isn't it?

Looks like it, and programmed in a way that is guaranteed to run slow by
preventing optimisations.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top