Counter Intuitive Results: Optimising an FFT routine for Speed

Paul Brown · Oct 8, 2003

Thanks for various responses - my feeling is that the cost benefit
ratio is now approaching the noise level asymptote, though there was
at least one good point :

compressed plain-text copy of the whole thing at
<ftp://ftp.eskimo.com/u/s/scs/C-faq/faq.Z>
Apparently it's the most up-to-date version available.

Well its plain text, and it is contiguous, which is a lot better than
my 331 html files totalling 982kB, but up to date it is not :

Date: 8 Feb 1999 04:56:00 GMT
Approved: (e-mail address removed)
Expires: 3 Mar 1999 00:00:00 GMT
X-Last-Modified: February 7, 1999

Oh and by the way, since I have been cross-examined on cross-posting :

Newsgroups: comp.lang.c,comp.lang.c.moderated,
alt.comp.lang.learn.c-c++,comp.answers,alt.answers,news.answers

is the general reader of
alt.answers,news.answers
going to be interested in the finer points of C?

Whoa, whoa, hold on, you want this to be fast on a DSP chip? But you're
planning to benchmark and optimize it on a general-purpose CPU?

Absolutely right - I am doing high level system engineeting here, I
will pass on the algorithm implementation and source code (probably
FFTW derived, but may still be Numerical Recipes version enhanced on
several counts) to the S/W engineer, he implements a version in Visual
C++ for the client (PC) workstation, and the embedded version gets
ported to whatever DSP is chosen for the new hardware platform. The
DSP speed is the most critical, but speed on the PC client is also
important. The lessons learned probably remain with me.

The OP should find a group or mailing list specific to the processor .....

The reality of the situation is that this won't happen.

The x86 registers are extended precision (or at least double, depending
upon the mode), so your compiler doesn't have much choice in the matter.
Nor does this affect the speed.

Surely the implication of this is that float variable arithmetic
should be slower than double arithmetic, because of the extra casts?

Single precision is almost always faster than double precision, though,
at least for significant transform sizes, simply because of cache usage.

Well, if 1024 x 2 doubles fit in the cache, the memory reads/writes
should be the same speed as for float (there should be cast overheads
though to slow the float version down).

You asked for something with more resolution than clock(), so I told you
the best resolution you can get; if you're happy with clock(), why did
you ask?
You also inexplicably grabbed the code from cycle.h labeled as gcc-specific.

1. Thanks for the reference to CYCLE.h, it is a useful tool for the
toolbox, when I have something apart from Turbo-C I am sure I will use
it.

2. What resolution is this, are we talking of about 1ns, 10ns or 100ns
(and I do remember that you or someone else said it is for relative
comparisons anyway, but there will be a precise answer for a fixed CPU
and clock speed)?

3. I am not happy with clock, I am distinctly unhappy with same. Not
only is 18.2ms an age for most fast applications, it only actually
ticks at 56.12ms. So I wrote a timer based on clock(), and with this I
can get a resolution of
666ns (900MHz Athlon, Win-98)
and worst case
1145µs (998MHz Celeron, Win-2k).

4. GCC specific? - I grabbed the code for

/*
* Pentium cycle counter
*/
#if (defined(__GNUC__) || defined(__ICC)) && defined(__i386__) ...

__asm__ __volatile__("rdtsc": "=A" (ret));

I have also tried

/*
* X86-64 cycle counter
*/
#if defined(__GNUC__) && defined(__x86_64__) &&
!defined(HAVE_TICK_COUNTER)

asm volatile("rdtsc" : "=a" (a), "=d" (d));

This at least generates the compiler error that "in-line assembly is
not allowed", the __ASM__ keyword not being recognised at all in the
Pentium example.

Any chance of being able to get the assembled hex codes for these
assembler statements and patching these into the executable. I
imagine inserting a unique sequence of straight C statements that a
search and replace executable could replace with the opcodes for
"rdtsc": "=A" (ret)

Seems bizarre I know, but pragmatism was never thus handicapped.

____________________________________________________________________
Another great Dane has made free
With a question of Be or Not be.
Now might Schrödinger's puss,
In descending by Schuss,
Leave one track on each side of a tree?
____________________________________________________________________
Quantum mechanics, hmmm. You put a cat in a box, along with a hammer
and
some poison and a radioactive isotope ... I forget exactly how this
goes.
Anyway, keep some bandages on hand, because I guarantee the cat won't
be
happy.
____________________________________________________________________

Steven G. Johnson · Oct 9, 2003

Paul said:
Absolutely right - I am doing high level system engineeting here

The point is that optimizing for a DSP chip (with all sorts of
special-purpose instructions etc.) is completely different from
optimizing for a general-purpose CPU, and the same code, or even the
same *style* of code, is unlikely to be fast on both kinds of architecture.

Surely the implication of this is that float variable arithmetic
should be slower than double arithmetic, because of the extra casts?

There isn't a one-to-one mapping from C operators to hardware
operations. The "cast" from float to double (or actually, to 80-bit
extended precision) is done "costlessly" in hardware as part of the
load/store.

4. GCC specific? - I grabbed the code for
#if (defined(__GNUC__) || defined(__ICC)) && defined(__i386__) ...

Nota bene: __GNUC__. (If you scroll down in the file, you'll see
alternate code for Visual C++.) Anyway, I think that this is the least
of your worries.

Good luck to you. I hereby declare this thread dead.

Keith Thompson · Oct 9, 2003

Thanks for various responses - my feeling is that the cost benefit
ratio is now approaching the noise level asymptote, though there was
at least one good point :

Well its plain text, and it is contiguous, which is a lot better than
my 331 html files totalling 982kB, but up to date it is not :

I said it seems to be the most up-to-date-version available; as far as
I can tell, it is.

Counter Intuitive Results: Optimising an FFT routine for Speed	4	Oct 7, 2003
Counter Intuitive Results: Optimising an FFT routine for Speed	7	Oct 6, 2003
Counter Intuitive Results: Optimising an FFT routine for Speed	1	Oct 7, 2003
Counter-intuitive io vs no-io time readings	6	Apr 9, 2014
Jet optimising Compiler	0	Oct 24, 2012
Dual Edged Counter	6	Apr 17, 2013
FT and FFT	3	Mar 28, 2010
Speed Test for Comparison	14	Dec 8, 2011

Counter Intuitive Results: Optimising an FFT routine for Speed

Paul Brown

Steven G. Johnson

Keith Thompson

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads