Counter Intuitive Results: Optimising an FFT routine for Speed

Discussion in 'C Programming' started by Paul Brown, Oct 7, 2003.

  1. Paul Brown

    Paul Brown Guest

    Thanks for the replies Tristan, Eric, Steven & Kurt. They have given
    me some good leads.

    I present justification for a lot of the comments that drew
    (constructive) criticism below. Firstly, let me summarise the feedback
    that has pointed me in the right direction :-

    >> Steven

    > 20.13: What's the best way of making my program efficient?
    > A: By picking good algorithms and implementing them carefully.


    <hand up> Count me in for that, there was no intention or suggestion
    of micro-optimisation or use of inline assembler in my post. What I am
    trying to find out is what algorithm is effective and what C syntax is
    more efficient, albeit only on the small Turbo-C development system
    that I have.
    </hand up>

    > http://www.eskimo.com/~scs/C-faq/top.html

    Thanks, I am busy downloading the whole list by sequentially stepping
    through all the URLs at this site.

    > The biggest speed gains in an FFT (and most algorithms) come from much higher-
    > level transformations. For example, our FFTW (www.fftw.org) ...


    I see that I previously downloaded
    FFTW 3.0.1 is the latest official version of FFTW

    a while back, I am overwhelmed with volume :
    458 *.c files (2.9MB)
    42 *.h files

    Could take me a while to assimilate all of this.


    > you should use a different routine than clock()

    I thought I did quite well on the Athlon /Win98, synchronising on the
    20ms ticks and counting in between I could improve resolution to
    600ns. Still dependent on the system time sharing on the CPU though.

    > The best resolution comes from reading the CPU cycle counter directly (see
    > www.fftw.org/download.html for code to do this on many CPUs).

    Yes, I knew there was something called the "software stopwatch",
    hopefully I will come right with CYCLE.H.
    I can't go and peep at this URL at the mo' as my C-FAQ download is
    still stepping through each answer page. CYCLE.H has quite a bit of
    code to it, I suppose I just chuck out most of the CPU specific code
    and keep

    _( Pentium cycle counter)
    typedef unsigned long long ticks;
    static __inline__ ticks getticks(void) {
    ticks ret;
    __asm__ __volatile__("rdtsc": "=A" (ret));
    return ret; }



    >> Tristan

    I followed my usual method of
    QUERY GOOGLE -> READ -> FILTER -> POST

    and using this I selected the four newsgroups that had hits on
    C Speed Optimisation FFT

    > comp.lang.oberon

    Subject: Re: Optimizing array bounds checking
    Newsgroups: comp.lang.oberon

    > comp.lang.functional

    Subject: Re: Academia isn't interested...
    Newsgroups: comp.lang.functional
    .....
    But what you *can* do is generate low-level code from high-level
    specifications. This has already been well-demonstrated in e.g. the
    FFTW
    (written in OCaml), which generates extremely efficient C-algorithms
    for the FFT from specifications and specialises them for the given
    architecture. This, however, is high-level programming again, not
    low-level programming. There is no escape from imperativeness if you
    work with von-Neumann architectures on a low level.


    >> Eric


    > You'd have to examine the actual generated code to have a hope
    > of figuring out what's really happening in any particular case.

    Don't want to do that, anyway my solution has to be fairly portable,
    from me (design engineer) it goes to software engineer, rebuilt under
    Visual C++, and then targetted to TMS320 (or equiv) DSP.

    >> Steven


    > You realize, of course, that your first mistake was starting with the
    > radix-2 Numerical Recipes in C code and spending your time attempting
    > micro-optimizations. The biggest speed gains in an FFT (and most
    > algorithms) come from much higher-level transformations.

    Actually I started with someone else's Fortran->C implementation of an
    algorithm that looks like it came from the Cooley-Tukey paper. Then I
    stepped to the Numerical Recipes (C) algorithm (Danielson - Lanczos?)
    (which by the way still carries the legacy of Fortran array indexing)
    This took -72% off the time.

    I have refined this in about five major steps, so that at the point of
    posting I was
    -23% off the time of the NumRep routine as published (only high
    level tweaks).

    I will perform my last trial of converting the original code to double
    without changing the indexing, just to see what I would have got,
    before ...

    Now, however, I will trawl through all the source in FFTW and see what
    I can find.


    > If you just want a fast FFT and are not doing this as a learning experience, you'll
    > be much better off downloading an existing optimized code.

    Even though I will probably do better with an FFTW routine, the
    lessons learned will be valuable for other pieces of time critical
    code.

    > A decent compiler should transform array references a[j] in a loop
    > into separate incremented pointers for you, if it's advantageous to do
    > so. By being clever you can easily just confuse the compiler's
    > optimizer unless you know what you're doing.

    The two sets of indices do not change sequentially, which is why I
    hoped that me forcing the use of pointers would be faster.


    >> Kurt


    > I've done the above many a time. The only way I've been able to cope is
    > by learning the instruction set and reading the (equivalent) assembly code.
    > It's pretty easy to tell when you've confused the optimizer.

    This would be interesting given unlimited time, I do know that the DSP
    code was definitely hand tweaked in the past as portability and
    maintainability were far lower priority than performance.

    Regards,
    Paul
    Paul Brown, Oct 7, 2003
    #1
    1. Advertising

  2. Paul Brown wrote:
    > Don't want to do that, anyway my solution has to be fairly portable,
    > from me (design engineer) it goes to software engineer, rebuilt under
    > Visual C++, and then targetted to TMS320 (or equiv) DSP.


    Whoa, whoa, hold on, you want this to be fast on a DSP chip? But you're
    planning to benchmark and optimize it on a general-purpose CPU?

    Sigh...
    Steven G. Johnson, Oct 7, 2003
    #2
    1. Advertising

  3. Paul Brown

    Mark Gordon Guest

    On Tue, 07 Oct 2003 13:43:11 -0400
    "Steven G. Johnson" <> wrote:

    > Paul Brown wrote:
    > > Don't want to do that, anyway my solution has to be fairly portable,
    > > from me (design engineer) it goes to software engineer, rebuilt
    > > under Visual C++, and then targetted to TMS320 (or equiv) DSP.

    >
    > Whoa, whoa, hold on, you want this to be fast on a DSP chip? But
    > you're planning to benchmark and optimize it on a general-purpose CPU?
    >
    > Sigh...


    Optimising for the TMS320 processors (they vary a *lot* BTW) is *way* OT
    for this group since on the ones I've used there are all sorts of trick
    you can play in assembler that you can't (and the compiler I used did
    not) use.

    The OP should find a group or mailing list specific to the processor he
    wants to use and probably obtain a library written in assembler to make
    full use of the facilities the processor has. Also read the processor
    manual, ISTR the TMS320C1x/2x/5x books had example FFT code...
    --
    Mark Gordon
    Paid to be a Geek & a Senior Software Developer
    Although my email address says spamtrap, it is real and I read it.
    Mark Gordon, Oct 7, 2003
    #3
  4. (Paul Brown) writes:
    [...]
    > > http://www.eskimo.com/~scs/C-faq/top.html

    > Thanks, I am busy downloading the whole list by sequentially stepping
    > through all the URLs at this site.


    The "other versions" link, <http://www.eskimo.com/~scs/C-faq/versions.html>,
    points you to a compressed plain-text copy of the whole thing at
    <ftp://ftp.eskimo.com/u/s/scs/C-faq/faq.Z>. You can uncompress it
    with "uncompress" or "gunzip". Apparently it's the most up-to-date
    version available.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://www.sdsc.edu/~kst>
    Schroedinger does Shakespeare: "To be *and* not to be"
    Keith Thompson, Oct 8, 2003
    #4
  5. Paul Brown

    Randy Howard Guest

    In article <>,
    says...
    > > The best resolution comes from reading the CPU cycle counter directly (see
    > > www.fftw.org/download.html for code to do this on many CPUs).

    > Yes, I knew there was something called the "software stopwatch",
    > hopefully I will come right with CYCLE.H.
    > I can't go and peep at this URL at the mo' as my C-FAQ download is
    > still stepping through each answer page. CYCLE.H has quite a bit of
    > code to it, I suppose I just chuck out most of the CPU specific code
    > and keep
    >
    > _( Pentium cycle counter)
    > typedef unsigned long long ticks;
    > static __inline__ ticks getticks(void) {
    > ticks ret;
    > __asm__ __volatile__("rdtsc": "=A" (ret));
    > return ret; }


    This, and most of this thread is OT here, but since you've already
    posted this, I don't want others to be bit by this in the future.

    Danger: The above is TOTALLY UNRELIABLE on SMP systems. Getting
    dependent upon this will bite you, as even notebook computers
    are now available with dual CPUs. There is no synchronization
    between the TSCs on the individual CPUs. So, you are as likely
    as not to get bogus time values if your process gets bounced
    between processors.

    --
    Randy Howard _o
    2reply remove FOOBAR \<,
    ______________________()/ ()______________________________________________
    SCO Spam-magnet:
    Randy Howard, Oct 9, 2003
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. aj
    Replies:
    3
    Views:
    4,158
    Jaakko Varteva
    Nov 24, 2005
  2. Paul Brown
    Replies:
    7
    Views:
    364
    Christian Bau
    Oct 10, 2003
  3. Paul Brown
    Replies:
    1
    Views:
    304
    Steven G. Johnson
    Oct 7, 2003
  4. Paul Brown
    Replies:
    2
    Views:
    295
    Keith Thompson
    Oct 9, 2003
  5. Replies:
    8
    Views:
    339
    infobahn
    Feb 5, 2005
Loading...

Share This Page