Counter Intuitive Results: Optimising an FFT routine for Speed

Discussion in 'C Programming' started by Paul Brown, Oct 8, 2003.

  1. Paul Brown

    Paul Brown Guest

    Thanks for various responses - my feeling is that the cost benefit
    ratio is now approaching the noise level asymptote, though there was
    at least one good point :

    > compressed plain-text copy of the whole thing at
    > <ftp://ftp.eskimo.com/u/s/scs/C-faq/faq.Z>
    > Apparently it's the most up-to-date version available.


    Well its plain text, and it is contiguous, which is a lot better than
    my 331 html files totalling 982kB, but up to date it is not :

    > Date: 8 Feb 1999 04:56:00 GMT
    > Approved:
    > Expires: 3 Mar 1999 00:00:00 GMT
    > X-Last-Modified: February 7, 1999


    Oh and by the way, since I have been cross-examined on cross-posting :
    > Newsgroups: comp.lang.c,comp.lang.c.moderated,
    > alt.comp.lang.learn.c-c++,comp.answers,alt.answers,news.answers


    is the general reader of
    alt.answers,news.answers
    going to be interested in the finer points of C?




    > Whoa, whoa, hold on, you want this to be fast on a DSP chip? But you're
    > planning to benchmark and optimize it on a general-purpose CPU?


    Absolutely right - I am doing high level system engineeting here, I
    will pass on the algorithm implementation and source code (probably
    FFTW derived, but may still be Numerical Recipes version enhanced on
    several counts) to the S/W engineer, he implements a version in Visual
    C++ for the client (PC) workstation, and the embedded version gets
    ported to whatever DSP is chosen for the new hardware platform. The
    DSP speed is the most critical, but speed on the PC client is also
    important. The lessons learned probably remain with me.


    > The OP should find a group or mailing list specific to the processor .....

    The reality of the situation is that this won't happen.


    > The x86 registers are extended precision (or at least double, depending
    > upon the mode), so your compiler doesn't have much choice in the matter.
    > Nor does this affect the speed.


    Surely the implication of this is that float variable arithmetic
    should be slower than double arithmetic, because of the extra casts?


    > Single precision is almost always faster than double precision, though,
    > at least for significant transform sizes, simply because of cache usage.


    Well, if 1024 x 2 doubles fit in the cache, the memory reads/writes
    should be the same speed as for float (there should be cast overheads
    though to slow the float version down).


    > You asked for something with more resolution than clock(), so I told you
    > the best resolution you can get; if you're happy with clock(), why did
    > you ask?
    > You also inexplicably grabbed the code from cycle.h labeled as gcc-specific.


    1. Thanks for the reference to CYCLE.h, it is a useful tool for the
    toolbox, when I have something apart from Turbo-C I am sure I will use
    it.

    2. What resolution is this, are we talking of about 1ns, 10ns or 100ns
    (and I do remember that you or someone else said it is for relative
    comparisons anyway, but there will be a precise answer for a fixed CPU
    and clock speed)?

    3. I am not happy with clock, I am distinctly unhappy with same. Not
    only is 18.2ms an age for most fast applications, it only actually
    ticks at 56.12ms. So I wrote a timer based on clock(), and with this I
    can get a resolution of
    666ns (900MHz Athlon, Win-98)
    and worst case
    1145µs (998MHz Celeron, Win-2k).

    4. GCC specific? - I grabbed the code for

    /*
    * Pentium cycle counter
    */
    #if (defined(__GNUC__) || defined(__ICC)) && defined(__i386__) ...

    __asm__ __volatile__("rdtsc": "=A" (ret));

    I have also tried

    /*
    * X86-64 cycle counter
    */
    #if defined(__GNUC__) && defined(__x86_64__) &&
    !defined(HAVE_TICK_COUNTER)

    asm volatile("rdtsc" : "=a" (a), "=d" (d));

    This at least generates the compiler error that "in-line assembly is
    not allowed", the __ASM__ keyword not being recognised at all in the
    Pentium example.

    Any chance of being able to get the assembled hex codes for these
    assembler statements and patching these into the executable. I
    imagine inserting a unique sequence of straight C statements that a
    search and replace executable could replace with the opcodes for
    "rdtsc": "=A" (ret)

    Seems bizarre I know, but pragmatism was never thus handicapped.

    ____________________________________________________________________
    Another great Dane has made free
    With a question of Be or Not be.
    Now might Schrödinger's puss,
    In descending by Schuss,
    Leave one track on each side of a tree?
    ____________________________________________________________________
    Quantum mechanics, hmmm. You put a cat in a box, along with a hammer
    and
    some poison and a radioactive isotope ... I forget exactly how this
    goes.
    Anyway, keep some bandages on hand, because I guarantee the cat won't
    be
    happy.
    ____________________________________________________________________
     
    Paul Brown, Oct 8, 2003
    #1
    1. Advertising

  2. Paul Brown wrote:
    >SGJ wrote:
    >>Whoa, whoa, hold on, you want this to be fast on a DSP chip? But you're
    >>planning to benchmark and optimize it on a general-purpose CPU?

    >
    > Absolutely right - I am doing high level system engineeting here


    The point is that optimizing for a DSP chip (with all sorts of
    special-purpose instructions etc.) is completely different from
    optimizing for a general-purpose CPU, and the same code, or even the
    same *style* of code, is unlikely to be fast on both kinds of architecture.

    > Surely the implication of this is that float variable arithmetic
    > should be slower than double arithmetic, because of the extra casts?


    There isn't a one-to-one mapping from C operators to hardware
    operations. The "cast" from float to double (or actually, to 80-bit
    extended precision) is done "costlessly" in hardware as part of the
    load/store.

    > 4. GCC specific? - I grabbed the code for
    > #if (defined(__GNUC__) || defined(__ICC)) && defined(__i386__) ...


    Nota bene: __GNUC__. (If you scroll down in the file, you'll see
    alternate code for Visual C++.) Anyway, I think that this is the least
    of your worries.

    Good luck to you. I hereby declare this thread dead.
     
    Steven G. Johnson, Oct 9, 2003
    #2
    1. Advertising

  3. (Paul Brown) writes:
    > Thanks for various responses - my feeling is that the cost benefit
    > ratio is now approaching the noise level asymptote, though there was
    > at least one good point :
    >
    > > compressed plain-text copy of the whole thing at
    > > <ftp://ftp.eskimo.com/u/s/scs/C-faq/faq.Z>
    > > Apparently it's the most up-to-date version available.

    >
    > Well its plain text, and it is contiguous, which is a lot better than
    > my 331 html files totalling 982kB, but up to date it is not :
    >
    > > Date: 8 Feb 1999 04:56:00 GMT
    > > Approved:
    > > Expires: 3 Mar 1999 00:00:00 GMT
    > > X-Last-Modified: February 7, 1999


    I said it seems to be the most up-to-date-version available; as far as
    I can tell, it is.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://www.sdsc.edu/~kst>
    Schroedinger does Shakespeare: "To be *and* not to be"
     
    Keith Thompson, Oct 9, 2003
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. aj
    Replies:
    3
    Views:
    4,243
    Jaakko Varteva
    Nov 24, 2005
  2. Paul Brown
    Replies:
    7
    Views:
    387
    Christian Bau
    Oct 10, 2003
  3. Paul Brown
    Replies:
    4
    Views:
    412
    Randy Howard
    Oct 9, 2003
  4. Paul Brown
    Replies:
    1
    Views:
    319
    Steven G. Johnson
    Oct 7, 2003
  5. Replies:
    8
    Views:
    356
    infobahn
    Feb 5, 2005
Loading...

Share This Page