Counter-Intuitive Results: Optimising an FFT Routine for Speed

Discussion in 'C Programming' started by Paul Brown, Oct 6, 2003.

  1. Paul Brown

    Paul Brown Guest

    I am confused - I have gone through a two-day exercise tuning up an
    FFT routine, and at the last stage I find that, where the biggest
    gains were expected, all my conventional understanding of efficient C
    seems to be turned on its head.

    Up until this point I have achieved good speed improvements following
    an evolutionary approach: starting from the routine entry point, I
    have modified code, then run, validated, and timed a test case,
    following a steepest-descent optimisation approach.

    When I started, with a Cooley-Tukey routine, I was at 2.7ms per 1024
    point complex FFT (1GHz Celeron - Win2000). I am still using a baby C
    system - Turbo C, but would be surprised if this is the cause of the
    problems.

    I had reduced this time to 0.6ms per FFT call by the time I reached
    the heart of the code, the innermost of 3 loops, where the
    correlation between complex exponentials and data occurs:

    tempr=wr*data[ixj]-wi*data[ixj+1];
    tempi=wr*data[ixj+1]+wi*data[ixj];
    data[ixj]=data[ixData]-tempr;
    data[ixj+1]=data[ixData+1]-tempi;
    data[ixData] += tempr;
    data[ixData+1] += tempi;

    This piece of code is executed 5120 times per FFT (i.e. (N/2)*log2(N)
    for N = 1024) - so a lot of opportunity to improve things. I have
    removed superfluous array indexing and changed to full 64-bit
    arithmetic (which has proved to be faster than 32-bit in previous
    tweaking) as follows

    #if defined(FFT_DOUBLE)
    # define srcPtr0_ doublePtr_1__
    # define srcPtr1_ doublePtr_2__
    # define tgtPtr0_ doublePtr_3__
    # define tgtPtr1_ doublePtr_4__
    #endif
    #if defined(FFT_FLOAT)
    # define srcPtr0_ floatPtr_1__
    # define srcPtr1_ floatPtr_2__
    # define tgtPtr0_ floatPtr_3__
    # define tgtPtr1_ floatPtr_4__
    #endif

    srcPtr1_ = srcPtr0_ = &data[ixj];    ++srcPtr1_;
    tgtPtr1_ = tgtPtr0_ = &data[ixData]; ++tgtPtr1_;

    curReal = wr * *srcPtr0_ - wi * *srcPtr1_;
    curImag = wr * *srcPtr1_ + wi * *srcPtr0_;
    *srcPtr0_ = *tgtPtr0_ - curReal;
    *srcPtr1_ = *tgtPtr1_ - curImag;
    *tgtPtr0_ += curReal;
    *tgtPtr1_ += curImag;

    # undef srcPtr0_
    # undef srcPtr1_
    # undef tgtPtr0_
    # undef tgtPtr1_

    However now I find the time per call has gone up significantly.

    Using the FFT_FLOAT option above, the call time increased by 39%
    compared with the original code.

    Using the FFT_DOUBLE option, it is a bit quicker, only 20% longer. So
    by rights I should just use the original code, with all variables
    changed to doubles, but I am loath to do so just yet until I
    understand why the current bottleneck exists.

    Why should accessing vectorised data using pointer references be so
    much slower than indexing - surely Turbo-C cannot have pointer
    de-referencing so badly wrong?
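    For reference, the original butterfly can be wrapped in a minimal,
    self-contained form for experimentation (the function name and its
    packaging are illustrative, not from the original routine; the body
    is the indexed code quoted above, verbatim):

```c
/* One radix-2 butterfly on interleaved (re,im) data, exactly as in the
   inner loop above.  (wr, wi) is the twiddle factor; ixj and ixData
   index the two complex points being combined. */
void butterfly(double *data, int ixj, int ixData, double wr, double wi)
{
    double tempr = wr * data[ixj]     - wi * data[ixj + 1];
    double tempi = wr * data[ixj + 1] + wi * data[ixj];
    data[ixj]     = data[ixData]     - tempr;
    data[ixj + 1] = data[ixData + 1] - tempi;
    data[ixData]     += tempr;
    data[ixData + 1] += tempi;
}
```

    With a unit twiddle (wr=1, wi=0) this reduces to a plain
    sum/difference butterfly, which makes a convenient sanity check when
    rearranging the code.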


    Some other ancillary points that I would be grateful for feedback on :



    I have written a software timer that squeezes better resolution out
    of clock() than the normally available CLK_TCK tick (supposedly
    18.2ms).
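    One portable way to calibrate such a timer is to measure the actual
    tick period of clock() directly, by spinning until its return value
    changes (a sketch only; on a multitasking OS the measured period can
    still jitter):

```c
#include <time.h>

/* Estimate the real granularity of clock(), in milliseconds, by
   waiting for two consecutive tick transitions.  The busy-wait burns
   CPU time so that clock() keeps advancing. */
double measure_tick_ms(void)
{
    clock_t prev, cur;

    prev = clock();
    while ((cur = clock()) == prev)   /* wait for the first transition */
        ;
    prev = cur;
    while ((cur = clock()) == prev)   /* ... and for the next one */
        ;
    return (cur - prev) * 1000.0 / CLOCKS_PER_SEC;
}
```

    Averaging over many transitions would reduce the jitter further.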

    I have timed some runs with a stopwatch, this is what I got

    /* This is the standard value, ms per each tick in clock() */
    #undef CLK_TCK            /* was 18.2 */

    /* This measured over 45s on a Celeron 1GHz, ms per each tick
       in clock() */
    #define CLK_TCK 56.12


    On a 850MHz Athlon (Win-98) system I can get 600ns resolution, however
    with all the system overheads in Win-2k, even using a 1GHz Celeron, I
    only achieve around 1ms resolution.

    Is there any easy way to disable or reduce a lot of the Win-2k
    system workload to get more accurate timing? I remember reading,
    about five years ago, of a "software stopwatch" that could read a
    far higher resolution CPU timer - does anyone have a reference to
    this?


    These are typically the sort of results I am getting

    FFT_NRC starting to execute 5000 loops
    FFT_NRC time = 934456 ±88(3.5s) ns

    FFT_NRC_prev starting to execute 5000 loops
    FFT_NRC_prev time = 776079 ±88(3.5s) ns

    FFT_NRC time change :FFT_NRC_Prev = 158.377 ± 0.124(3.5s) µs
    (20.41%)

    FFT_NRC_prev starting to execute 1000 loops
    FFT_NRC_prev time = 777115 ±451(3.5s) ns

    FFT_NRC starting to execute 1000 loops
    FFT_NRC time = 935104 ±451(3.5s) ns

    Thanks,
    Paul.
     
    Paul Brown, Oct 6, 2003
    #1

  2. Tristan Miller

    Greetings.

    In article <>, Paul Brown wrote:
    [massive post about optimization using Turbo C on Windows snipped]

    Holy crossposting, Batman! What does this message have to do with Oberon,
    functional programming, or ISO/ANSI C?

    For algorithm questions, try an algorithm or programming newsgroup. (I
    suppose sci.math.num-analysis would be appropriate for this.)

    For questions about programming for specific compilers or operating systems,
    post to a group dedicated to your particular compiler and/or operating
    system.

    comp.lang.c is for discussion of standard C only, and not how to tweak
    specific compiler or system settings.

    comp.lang.functional is, I assume, about programming languages such as LISP
    and Scheme, not procedural languages like C.

    I have no idea what comp.lang.oberon is doing in the headers.

    Regards,
    Tristan

    --
    _
    _V.-o Tristan Miller [en,(fr,de,ia)] >< Space is limited
    / |`-' -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= <> In a haiku, so it's hard
    (7_\\ http://www.nothingisreal.com/ >< To finish what you
     
    Tristan Miller, Oct 6, 2003
    #2

  3. Eric Sosman

    Eric Sosman Guest

    Paul Brown wrote:
    >
    > I am confused - I have gone through a two day exercise tuning up an
    > FFT routine, and at the last stage I find that where the biggest gains
    > were expected, all my conventional understanding of efficient C seems
    > to be turned on its head.
    >
    > [Replaced array-index operations with explicit pointers, and
    > the code got slower.]
    >
    > Why should accessing vectorised data using pointer references be so
    > much slower than indexing - surely Turbo-C cannot have pointer
    > de-referencing so badly wrong?


    You've posted this to quite an array of newsgroups. Here in
    comp.lang.c (where I've set the followups so as to spare the rest
    of the wide audience), the question is officially off-topic because
    the C Standard makes no promises about the speeds or even relative
    speeds of different constructs. On the other hand, this off-topic
    question of efficiency is of great interest, so it winds up getting
    written about quite a lot ...

    At the simplest level, your question is number 20.14 in the
    comp.lang.c Frequently Asked Questions (FAQ) list

    http://www.eskimo.com/~scs/C-faq/top.html

    .... but you may be unsatisfied with the answer, which boils down
    to "It depends." For a concrete example of why array indexing
    might (*might*) win in your situation, consider this: In code
    like your original (much snipped):

    data[ixj]=data[ixData]-tempr;
    data[ixj+1]=data[ixData+1]-tempi;

    .... it is fairly easy for the compiler to prove to itself that
    data[ixj] and data[ixj+1] are distinct objects, and data[ixData]
    and data[ixData+1] are likewise distinct, and that all these are
    distinct from tempr and tempi. (A really smart optimizer may even
    be able to prove ixj!=ixData.) If the compiler can determine that
    these data objects do not overlap or otherwise interfere, it can
    perhaps rearrange the computation for more efficient use of the
    numeric pipelines or of the memory access path. Perhaps it can
    fetch both data[ixj] and data[ixj+1] in one operation, operate on
    both of them in registers or other CPU-internal fast locations,
    and use just one operation to store both results.

    But when you rewrite with explicit pointers (paraphrased):

    *out_r = *in_r - tempr;
    *out_i = *in_i - tempi;

    .... it may not be so easy for the compiler to determine that the
    pointers are aimed at distinct data objects. And if there's any
    possibility that some of these pointers address the same data
    locations, the compiler's freedom to rearrange things dwindles.
    For example, if the compiler thinks it might be possible that
    out_r==&tempi, it dares not fetch the value of tempi until after
    it's stored the result of the first assignment -- that is, the
    compiler may not be able to hold tempr and tempi in registers,
    and may generate code to fetch them anew every time.

    Of course, this is speculation: It's only meant to offer a
    plausible example of the kind of thing that *might* be going on.
    You'd have to examine the actual generated code to have a hope
    of figuring out what's really happening in any particular case.
    And, of course, the conclusions would only be good for that one
    implementation, and under that one set of circumstances: add a
    perfectly innocent printf() somewhere and things might change.
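    As a footnote to the aliasing point: C99 later added the restrict
    qualifier precisely so the programmer can make the no-overlap
    promise explicit (a sketch only - restrict is not available in
    Turbo C, and the function below is illustrative, not from the
    original routine):

```c
/* With restrict, the compiler may assume out and in never alias each
   other (or the locals), so it is free to keep tempr and tempi in
   registers and to reorder the two stores. */
void store_diff(double *restrict out, const double *restrict in,
                double tempr, double tempi)
{
    out[0] = in[0] - tempr;
    out[1] = in[1] - tempi;
}
```

    Without restrict, the compiler must allow for out == in and emit
    the loads and stores in strict order.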

    --
     
    Eric Sosman, Oct 6, 2003
    #3
  4. Steven G. Johnson

    Paul Brown wrote:
    > When I started, with a Cooley-Tukey routine, I was at 2.7ms per 1024
    > point complex FFT (1GHz Celeron - Win2000). I am still using a baby C
    > system - Turbo C, but would be surprised if this is the cause of the
    > problems.
    >
    > I have reduced this time to 0.6ms per FFT call, when I reached the
    > heart of the code, the innermost of 3 loops, where the correlation
    > between complex exponentials and data occurs :


    You realize, of course, that your first mistake was starting with the
    radix-2 Numerical Recipes in C code and spending your time attempting
    micro-optimizations. The biggest speed gains in an FFT (and most
    algorithms) come from much higher-level transformations. For example,
    our FFTW (www.fftw.org) performs the same size-1024 FFT
    (double-precision) almost 10 times faster than you (0.06-0.07ms on a
    1GHz P-III, gcc 3.3).

    I'm not going to go into the possible transformations that one can do;
    we have papers on our web site, and better yet you can just go and read
    the code for some of the faster FFTs (see www.fftw.org/speed for
    benchmarks) to see their tricks. Such codes aren't as short as
    Numerical Recipes, I'm afraid, because you trade complexity for
    performance (as well as generality, e.g. arbitrary sizes). If you just
    want a fast FFT and are not doing this as a learning experience, you'll
    be much better off downloading an existing optimized code.

    Cordially,
    Steven G. Johnson

    PS. With regards to your second question, about timing, you should use a
    different routine than clock(). The best resolution comes from reading
    the CPU cycle counter directly (see www.fftw.org/download.html for code
    to do this on many CPUs).
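    On x86 the cycle counter in question is the time-stamp counter
    (TSC), read by the RDTSC instruction. With GCC-style compilers
    there is an intrinsic for it (an assumption about the toolchain -
    Turbo C has no equivalent, and the busy loop is illustrative):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(): GCC/Clang/ICC, x86 only */

/* Count CPU cycles spent in a small busy loop.  Raw TSC readings need
   care on modern CPUs (frequency scaling, core migration), so treat
   this as a rough stopwatch rather than a precise one. */
uint64_t cycles_for_busy_loop(void)
{
    uint64_t start = __rdtsc();
    volatile int sink = 0;
    int i;
    for (i = 0; i < 1000; i++)
        sink += i;       /* some work to time */
    (void)sink;
    return __rdtsc() - start;
}
```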

    PPS. A decent compiler should transform array references a[j] in a loop
    into separate incremented pointers for you, if it's advantageous to do
    so. By being clever you can easily just confuse the compiler's
    optimizer unless you know what you're doing.
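    The transformation described in the PPS can be seen by comparing
    the two equivalent loop forms below (illustrative code, not from
    the original post) - a decent optimiser turns the first into the
    second by itself:

```c
/* Indexed form: a[j] is recomputed from the base on paper; in
   practice the optimiser strength-reduces it to a moving pointer. */
double sum_indexed(const double *a, int n)
{
    double s = 0.0;
    int j;
    for (j = 0; j < n; j++)
        s += a[j];
    return s;
}

/* Hand-rewritten pointer form: what the optimiser generates anyway
   from the version above when it is advantageous. */
double sum_pointer(const double *a, int n)
{
    double s = 0.0;
    const double *p, *end = a + n;
    for (p = a; p != end; ++p)
        s += *p;
    return s;
}
```

    Both compute the same result; the hand rewrite mostly just hides
    the loop structure from the compiler.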
     
    Steven G. Johnson, Oct 6, 2003
    #4
  5. Kurt Krueger

    Kurt Krueger Guest

    (lots of stuff snipped)

    > Paul Brown wrote:


    > >
    > > I have reduced this time to 0.6ms per FFT call, when I reached the
    > > heart of the code, the innermost of 3 loops, where the correlation
    > > between complex exponentials and data occurs :


    "Steven G. Johnson" wrote:
    >
    > PPS. A decent compiler should transform array references a[j] in a loop
    > into separate incremented pointers for you, if it's advantageous to do
    > so. By being clever you can easily just confuse the compiler's
    > optimizer unless you know what you're doing.


    I've done the above many a time. The only way I've been able to cope is
    by learning the instruction set and reading the (equivalent) assembly code.
    It's pretty easy to tell when you've confused the optimizer.

    But be aware that modern architectures can execute instructions out of the
    original sequence and the optimizer is aware of this. So it may refuse to
    organize code in the 'best' sequence, knowing that instruction pipelining
    will take care of it. Forcing what appears to be 'best' might goof up the
    rest of the optimization.

    Most of my experience has been on SPARC, but Pentium architectures
    share a lot of the features.
     
    Kurt Krueger, Oct 6, 2003
    #5
  6. Malcolm

    Malcolm Guest

    "Paul Brown" <> wrote in message
    >
    > Why should accessing vectorised data using pointer references be so
    > much slower than indexing - surely Turbo-C cannot have pointer
    > de-referencing so badly wrong?
    >

    x86 has a shortage of registers. By forcing the compiler to store
    intermediate results in pointers, you might be running it out of registers
    and causing thrashing of the register file.
    Just a guess - you'd need to examine the assembly to see where it is really
    going wrong.
    If you need to optimise code, try not to use an old compiler for a new
    target. It can't take advantage of any changes to the architecture since
    when it was written.
     
    Malcolm, Oct 6, 2003
    #6
  7. "Paul Brown" <> wrote in message
    news:...
    > tempr=wr*data[ixj]-wi*data[ixj+1];
    > tempi=wr*data[ixj+1]+wi*data[ixj];
    > data[ixj]=data[ixData]-tempr;
    > data[ixj+1]=data[ixData+1]-tempi;
    > data[ixData] += tempr;
    > data[ixData+1] += tempi;


    It has been quite a few years since I last looked at FFT code, and that was
    a casual look during some self-education time. This is the fundamental
    radix-2 butterfly, isn't it?
     
    John R. Strohm, Oct 10, 2003
    #7
  8. In article <bm59sj$>,
    "John R. Strohm" <> wrote:

    > "Paul Brown" <> wrote in message
    > news:...
    > > tempr=wr*data[ixj]-wi*data[ixj+1];
    > > tempi=wr*data[ixj+1]+wi*data[ixj];
    > > data[ixj]=data[ixData]-tempr;
    > > data[ixj+1]=data[ixData+1]-tempi;
    > > data[ixData] += tempr;
    > > data[ixData+1] += tempi;

    >
    > It has been quite a few years since I last looked at FFT code, and that was
    > a casual look during some self-education time. This is the fundamental
    > radix-2 butterfly, isn't it?


    Looks like it, and programmed in a way that is guaranteed to run slow by
    preventing optimisations.
     
    Christian Bau, Oct 10, 2003
    #8
