Is this the optimal FIR filter on all platforms?

Discussion in 'C Programming' started by Johan Bergman, Oct 25, 2003.

  1. Hi,

    Maybe someone can help me to optimize this C/C++ implementation of a FIR
    filter (although I realize that I should probably consider an FFT approach
    instead.)

    The example below calculates the output signal for 10000 lags from an FIR
    filter with 10000 taps. The input signal and the filter coefficients is just
    rubbish in this example. For my intended usage of this filter, it is not
    necessary to store the state of the filter.

    My problem is that this implementation performs very well in Cygwin on my
    laptop, but less good in Linux and even worse in SunOS. I have tried to
    calculate the number of mega-cycles that the example program consumes on the
    different platforms (execution time in seconds multiplied by CPU clock
    frequency in MHz):

    Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
    Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
    SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)

    The performance within parentheses was obtained when a certain line was
    commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see there
    were very different behaviors on the different platforms!

    The SunOS program performs extremely well without the critial line. Note
    that the 103 Mcycles means that only one cycle was spent per iteration in
    the inner for loop! But it performs terribly when the critical line is
    included, about 28 times slower!

    As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
    clock cycles per iteration in the inner loop, without the critical line and
    only about twice as much with the critical line.

    Can someone help me to speed up the program on the Linux and SunOS
    platforms?

    Regards,
    Johan


    #include <malloc.h>
    int main()
    {
    const int nrof_lags = 10000;
    const int nrof_taps = 10000;
    const int * const coeff_ptr =
    (const int * const) malloc(nrof_taps*sizeof(int));
    const int * const input_ptr =
    (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
    int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
    const int * tmp_coeff_ptr;
    const int * tmp_input_ptr;
    int * tmp_output_ptr;
    int sum;
    int lag, tap;
    tmp_output_ptr = (int *) output_ptr;
    for (lag=0; lag<nrof_lags; lag++)
    {
    tmp_coeff_ptr = (const int *) coeff_ptr;
    tmp_input_ptr = (const int *) input_ptr + lag;
    sum = 0;
    for (tap=0; tap<nrof_taps; tap++)
    {
    sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    }
    *tmp_output_ptr++ = sum;
    }
    }
     
    Johan Bergman, Oct 25, 2003
    #1
    1. Advertising

  2. Johan Bergman

    Tim Prince Guest

    Johan Bergman wrote:

    > Hi,
    >
    > Maybe someone can help me to optimize this C/C++ implementation of a FIR
    > filter (although I realize that I should probably consider an FFT approach
    > instead.)
    >
    > The example below calculates the output signal for 10000 lags from an FIR
    > filter with 10000 taps. The input signal and the filter coefficients is
    > just rubbish in this example. For my intended usage of this filter, it is
    > not necessary to store the state of the filter.
    >
    > My problem is that this implementation performs very well in Cygwin on my
    > laptop, but less good in Linux and even worse in SunOS. I have tried to
    > calculate the number of mega-cycles that the example program consumes on
    > the different platforms (execution time in seconds multiplied by CPU clock
    > frequency in MHz):
    >
    > Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
    > Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
    > SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)
    >
    > The performance within parentheses was obtained when a certain line was
    > commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see
    > there were very different behaviors on the different platforms!
    >
    > The SunOS program performs extremely well without the critial line. Note
    > that the 103 Mcycles means that only one cycle was spent per iteration in
    > the inner for loop! But it performs terribly when the critical line is
    > included, about 28 times slower!
    >
    > As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
    > clock cycles per iteration in the inner loop, without the critical line
    > and only about twice as much with the critical line.
    >
    > Can someone help me to speed up the program on the Linux and SunOS
    > platforms?
    >
    > Regards,
    > Johan
    >
    >
    > #include <malloc.h>
    > int main()
    > {
    > const int nrof_lags = 10000;
    > const int nrof_taps = 10000;
    > const int * const coeff_ptr =
    > (const int * const) malloc(nrof_taps*sizeof(int));
    > const int * const input_ptr =
    > (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
    > int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
    > const int * tmp_coeff_ptr;
    > const int * tmp_input_ptr;
    > int * tmp_output_ptr;
    > int sum;
    > int lag, tap;
    > tmp_output_ptr = (int *) output_ptr;
    > for (lag=0; lag<nrof_lags; lag++)
    > {
    > tmp_coeff_ptr = (const int *) coeff_ptr;
    > tmp_input_ptr = (const int *) input_ptr + lag;
    > sum = 0;
    > for (tap=0; tap<nrof_taps; tap++)
    > {
    > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > }
    > *tmp_output_ptr++ = sum;
    > }
    > }

    Assuming that you are using basically the same compiler and the same
    optimization options in each case, you must be running on different
    hardware. You've given no reason to believe that the OS is responsible.
    If one or more of your platforms supports SSE/SSE2 instructions, and you
    can change the operands to float/double, you should be able to get much
    better performance. Split the sum into 4 or 8 partial sums, thus
    parallelizing the sum reduction. This becomes OT if you use a C compiler
    whichs supports parallel instructions only in asm.
    --
    Tim Prince
     
    Tim Prince, Oct 25, 2003
    #2
    1. Advertising

  3. [Warning: long style critique ahead]

    On Sat, 25 Oct 2003, Johan Bergman wrote:
    >
    > Maybe someone can help me to optimize this C/C++ implementation of a FIR
    > filter (although I realize that I should probably consider an FFT approach
    > instead.)


    What in the world is an FIR filter? (Finite impulse-response, yes,
    yes, I know, I looked it up. Point is, practically nobody in this
    newsgroup will know, either, and they'll *all* have to look it up
    if they really want to help you. You're making it harder for people
    to help you when you don't define your terms.)

    > My problem is that this implementation performs very well in Cygwin on my
    > laptop, but less good in Linux and even worse in SunOS. I have tried to
    > calculate the number of mega-cycles that the example program consumes on the
    > different platforms (execution time in seconds multiplied by CPU clock
    > frequency in MHz):
    >
    > Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
    > Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
    > SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)


    > Can someone help me to speed up the program on the Linux and SunOS
    > platforms?


    Well, first let's clean up your code and see what it looks like
    then. Clean code is always easier to operate on. :)


    > #include <malloc.h>


    Non-standard header. 'malloc' is declared in <stdlib.h>.

    #include <stdlib.h>

    > int main()


    [Some regulars strongly believe in 'int main(void)'. I
    don't have a strong preference either way, but it's wise
    to consider why you picked the style you did, even on
    little things like this.]

    > {
    > const int nrof_lags = 10000;
    > const int nrof_taps = 10000;
    > const int * const coeff_ptr =


    Okay, yeah, 'const' is great. But really, this is ridiculous.
    You don't think that all these consts are somehow making your
    program more efficient, do you? (They're not hurting, but
    over-constifying is Bad mostly because it's Ugly, and Ugly code
    is hard to make Right.)

    > (const int * const) malloc(nrof_taps*sizeof(int));


    malloc(nrof_taps * sizeof *coeff_ptr);

    No chance for error using this idiom.
    Also note that casting a malloc() result to 'const' anything
    is just silly, because free() takes a non-const pointer
    argument. How are you going to free() this memory, hmm?


    > const int * const input_ptr =
    > (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
    > int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
    > const int * tmp_coeff_ptr;
    > const int * tmp_input_ptr;
    > int * tmp_output_ptr;
    > int sum;
    > int lag, tap;


    Way too many similarly-named variables, in my opinion.
    Again, purely a stylistic issue, but again, clean code is
    nice code. Note that in the cleaned-up version, I've pulled
    in their scope to where they're actually used. This may
    actually help with the efficiency of the program, since some
    optimizers look for "windows" in which objects can be moved
    into registers and suchlike. The wider their scopes are, the
    harder it is for the optimizer to see them.

    > tmp_output_ptr = (int *) output_ptr;


    Casting away constness is Bad. Especially when you've just
    finished jumping through hoops to make it const in the first
    place. Geez.

    > for (lag=0; lag<nrof_lags; lag++)
    > {
    > tmp_coeff_ptr = (const int *) coeff_ptr;
    > tmp_input_ptr = (const int *) input_ptr + lag;


    STOP PERFORMING RANDOM CASTS! For Pete's sake!
    Get a hold of yourself here!

    > sum = 0;
    > for (tap=0; tap<nrof_taps; tap++)
    > {
    > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;


    This line (and the one below) are just begging to be
    simplified. Let's expand it so we can see the order
    of operations.

    > }
    > *tmp_output_ptr++ = sum;
    > }


    /* always */ return 0; /* from main */

    > }


    Okay, let's put it all together: no casts, clean names,
    clean mallocs, and see what we get!

    You see how all those tmp_*_ptr objects just melt away
    once we start to see the outline of the program clearly.
    In fact, we uncover a subtle bug in the memory allocation,
    and fix it ('input' was being malloc'ed one element short)!
    Finally, we strive for idiomatic identifiers: using 'i' and
    'j' for the nested loop controls, and losing the silly
    '_ptr' suffixes on the main data arrays.


    #include <stdlib.h>

    int main(void)
    {
    int nrof_lags = 10000; /* Could both of these be #defines? */
    int nrof_taps = 10000; /* That would help a little. */

    int *coeffs = malloc(nrof_taps * sizeof *coeffs);
    int *input = malloc((nrof_taps+nrof_lags) * sizeof *input);
    int *output = malloc(nrof_lags * sizeof *output);
    int i, j;

    for (i=0; i < nrof_lags; ++i)
    {
    int sum = 0;
    for (j=0; j < nrof_taps; ++j)
    sum += coeffs[j] * input[i+j];
    output = sum;
    }

    free(coeffs);
    free(input);
    free(output);
    return 0;
    }


    There -- isn't that a heck of a lot simpler and easier to
    read? I bet it performs comparably fast on all your machines.
    If not, you might try changing 'int' to 'unsigned int' (in
    fact, you might do that anyway, to guard against overflow
    conditions); reversing the order of either or both loops;
    if many of the 'coeffs[j]' are zero, try pre-processing
    'coeffs' a little bit; little tweaks like that.

    I think most of your performance differences are due to
    different cache layouts on the different machines, but I
    could easily be wrong.

    How's the code do now?

    HTH,
    -Arthur
     
    Arthur J. O'Dwyer, Oct 25, 2003
    #3
  4. Johan Bergman

    nobody Guest

    [OT] Re: Is this the optimal FIR filter on all platforms?

    "Johan Bergman" <> wrote in message
    news:_ylmb.34856$...
    > Hi,
    >
    > Maybe someone can help me to optimize this C/C++ implementation of a FIR
    > filter (although I realize that I should probably consider an FFT approach
    > instead.)
    >
    > The example below calculates the output signal for 10000 lags from an FIR
    > filter with 10000 taps. The input signal and the filter coefficients is

    just
    > rubbish in this example. For my intended usage of this filter, it is not
    > necessary to store the state of the filter.
    >
    > My problem is that this implementation performs very well in Cygwin on my
    > laptop, but less good in Linux and even worse in SunOS. I have tried to
    > calculate the number of mega-cycles that the example program consumes on

    the
    > different platforms (execution time in seconds multiplied by CPU clock
    > frequency in MHz):
    >
    > Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum;)
    > Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum;)
    > SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum;)
    >
    > The performance within parentheses was obtained when a certain line was
    > commented out (i.e. the line *tmp_output_ptr++ = sum;). As you can see

    there
    > were very different behaviors on the different platforms!
    >
    > The SunOS program performs extremely well without the critial line. Note
    > that the 103 Mcycles means that only one cycle was spent per iteration in
    > the inner for loop! But it performs terribly when the critical line is
    > included, about 28 times slower!
    >
    > As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
    > clock cycles per iteration in the inner loop, without the critical line

    and
    > only about twice as much with the critical line.
    >
    > Can someone help me to speed up the program on the Linux and SunOS
    > platforms?
    >
    > Regards,
    > Johan
    >
    >
    > #include <malloc.h>
    > int main()
    > {
    > const int nrof_lags = 10000;
    > const int nrof_taps = 10000;
    > const int * const coeff_ptr =
    > (const int * const) malloc(nrof_taps*sizeof(int));
    > const int * const input_ptr =
    > (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
    > int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
    > const int * tmp_coeff_ptr;
    > const int * tmp_input_ptr;
    > int * tmp_output_ptr;
    > int sum;
    > int lag, tap;
    > tmp_output_ptr = (int *) output_ptr;
    > for (lag=0; lag<nrof_lags; lag++)
    > {
    > tmp_coeff_ptr = (const int *) coeff_ptr;
    > tmp_input_ptr = (const int *) input_ptr + lag;
    > sum = 0;
    > for (tap=0; tap<nrof_taps; tap++)
    > {
    > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > }
    > *tmp_output_ptr++ = sum;
    > }
    > }
    >

    No idea what the problem is, but if you comment out
    *tmp_output_ptr++ = sum;
    compiler may optimize away calculation of sum in previous statement,
    which could explain performance differences. Try
    1. optimize code little bit, instead of
    sum = 0;
    for (tap=0; tap<nrof_taps; tap++)
    {
    sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    }
    *tmp_output_ptr++ = sum;
    do
    for (tap=0; tap<nrof_taps; tap++)
    *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    (sum is cleared in each outer iteration and therefore I assume it's
    not used somewhere later)
    2. use calloc() instead of malloc(). I'm not sure what are alignment
    implications (some additional instructions for acces to data via
    dereferenced
    ptr in malloc'd block?) in case of malloc() use on any particular platform.

    I would appreciate posting of proper explanation and solution,
    when you find it.
     
    nobody, Oct 25, 2003
    #4
  5. Johan Bergman

    CBFalconer Guest

    Johan Bergman wrote:
    >

    .... snip ...
    >
    > Can someone help me to speed up the program on the Linux and SunOS
    > platforms?
    >
    > #include <malloc.h>


    No such standard header. Use stdlib.h.

    > int main()
    > {
    > const int nrof_lags = 10000;
    > const int nrof_taps = 10000;
    > const int * const coeff_ptr =
    > (const int * const) malloc(nrof_taps*sizeof(int));


    Don't cast malloc output.

    > const int * const input_ptr =
    > (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));


    Same here.

    > int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));


    Same here.

    > const int * tmp_coeff_ptr;
    > const int * tmp_input_ptr;
    > int * tmp_output_ptr;
    > int sum;
    > int lag, tap;
    > tmp_output_ptr = (int *) output_ptr;
    > for (lag=0; lag<nrof_lags; lag++)
    > {
    > tmp_coeff_ptr = (const int *) coeff_ptr;


    Uninitialized data. Anything can happen.

    > tmp_input_ptr = (const int *) input_ptr + lag;
    > sum = 0;
    > for (tap=0; tap<nrof_taps; tap++)
    > {
    > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > }
    > *tmp_output_ptr++ = sum;
    > }
    > }


    I suggest you first create a legal program before measuring
    performance.

    --
    Chuck F () ()
    Available for consulting/temporary embedded and systems.
    <http://cbfalconer.home.att.net> USE worldnet address!
     
    CBFalconer, Oct 25, 2003
    #5
  6. Re: [OT] Re: Is this the optimal FIR filter on all platforms?

    On Sat, 25 Oct 2003, nobody wrote:
    >
    > "Johan Bergman" <> wrote...
    > >
    > > Maybe someone can help me to optimize this C/C++ implementation of a FIR
    > > filter (although I realize that I should probably consider an FFT approach
    > > instead.)


    > Try
    > 1. optimize code little bit, instead of
    > sum = 0;
    > for (tap=0; tap<nrof_taps; tap++)
    > {
    > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > }
    > *tmp_output_ptr++ = sum;
    > do
    > for (tap=0; tap<nrof_taps; tap++)
    > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > (sum is cleared in each outer iteration and therefore I assume it's
    > not used somewhere later)


    That will almost certainly slow down the code immensely; remember
    that 'tmp_output_ptr' points to a 100000-element array, which is highly
    unlikely to be readily accessible. 'sum', on the other hand, should
    have been optimized by the compiler to be stored very accessibly in
    a register or suchlike.

    > 2. use calloc() instead of malloc(). I'm not sure what are alignment
    > implications (some additional instructions for acces to data via
    > dereferenced
    > ptr in malloc'd block?) in case of malloc() use on any particular platform.


    On many modern systems, there won't even be a significant difference
    between the performance of calloc and malloc -- but since malloc doesn't
    need to write zeroes everywhere, it's faster in theory. OTOH, if the
    OP used calloc, he would have a program that actually exhibited defined,
    reproducible behavior. :)

    -Arthur
     
    Arthur J. O'Dwyer, Oct 25, 2003
    #6
  7. Johan Bergman

    nobody Guest

    Re: [OT] Re: Is this the optimal FIR filter on all platforms?

    "Arthur J. O'Dwyer" <> wrote in message
    news:p...
    >
    > On Sat, 25 Oct 2003, nobody wrote:
    > >
    > > "Johan Bergman" <> wrote...
    > > >
    > > > Maybe someone can help me to optimize this C/C++ implementation of a

    FIR
    > > > filter (although I realize that I should probably consider an FFT

    approach
    > > > instead.)

    >
    > > Try
    > > 1. optimize code little bit, instead of
    > > sum = 0;
    > > for (tap=0; tap<nrof_taps; tap++)
    > > {
    > > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > > }
    > > *tmp_output_ptr++ = sum;
    > > do
    > > for (tap=0; tap<nrof_taps; tap++)
    > > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > > (sum is cleared in each outer iteration and therefore I assume it's
    > > not used somewhere later)

    >
    > That will almost certainly slow down the code immensely; remember
    > that 'tmp_output_ptr' points to a 100000-element array, which is highly
    > unlikely to be readily accessible. 'sum', on the other hand, should
    > have been optimized by the compiler to be stored very accessibly in
    > a register or suchlike.
    >

    I was thinking of that too, but array has 10K elemnts, not 100K, so
    assuming 32 bit inegers, there is malloc'd something below 160K
    from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
    OP is running SunOS on hardware with more memory than that.
    But nonwithstanding, you may be right.

    > > 2. use calloc() instead of malloc(). I'm not sure what are alignment
    > > implications (some additional instructions for acces to data via
    > > dereferenced
    > > ptr in malloc'd block?) in case of malloc() use on any particular

    platform.
    >
    > On many modern systems, there won't even be a significant difference
    > between the performance of calloc and malloc -- but since malloc doesn't
    > need to write zeroes everywhere, it's faster in theory. OTOH, if the


    That is true, but allocation is done outside of loop so this difference
    would be IMHO practically inmesurable in given program (10K
    multiplications and additions in inner loop times 10K iterations
    and additions in outer loop. But I was really speculating, because
    I can't explain those huge time execution differences (assuming
    correct methodology was used to measure this time).

    > OP used calloc, he would have a program that actually exhibited defined,
    > reproducible behavior. :)
    >
    > -Arthur
    >
     
    nobody, Oct 25, 2003
    #7
  8. Re: [OT] Re: Is this the optimal FIR filter on all platforms?

    On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:

    > "Arthur J. O'Dwyer" <> wrote in message
    > news:p...
    >>
    >> On Sat, 25 Oct 2003, nobody wrote:
    >> >
    >> > Try
    >> > 1. optimize code little bit, instead of
    >> > sum = 0;
    >> > for (tap=0; tap<nrof_taps; tap++)
    >> > {
    >> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    >> > }
    >> > *tmp_output_ptr++ = sum;
    >> > do
    >> > for (tap=0; tap<nrof_taps; tap++)
    >> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    >> > (sum is cleared in each outer iteration and therefore I assume it's
    >> > not used somewhere later)

    >>
    >> That will almost certainly slow down the code immensely; remember
    >> that 'tmp_output_ptr' points to a 100000-element array, which is highly
    >> unlikely to be readily accessible. 'sum', on the other hand, should
    >> have been optimized by the compiler to be stored very accessibly in
    >> a register or suchlike.
    >>

    > I was thinking of that too, but array has 10K elemnts, not 100K, so
    > assuming 32 bit inegers, there is malloc'd something below 160K
    > from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
    > OP is running SunOS on hardware with more memory than that.
    > But nonwithstanding, you may be right.


    For what it's worth, I compiled both versions on my machine (which
    has plenty of memory). The version with sum was faster:

    with sum:
    [sheldon@wsxyz]$ gcc -Wall -W -O3 -o test1 test1.c
    [sheldon@wsxyz]$ time ./test1

    real 0m0.627s
    user 0m0.560s
    sys 0m0.000s


    without sum:
    [sheldon@wsxyz]$ gcc -Wall -W -O3 -o test2 test2.c
    [sheldon@wsxyz]$ time ./test2

    real 0m0.710s
    user 0m0.630s
    sys 0m0.000s
     
    Sheldon Simms, Oct 25, 2003
    #8
  9. Johan Bergman

    nobody Guest

    Re: [OT] Re: Is this the optimal FIR filter on all platforms?

    "Sheldon Simms" <> wrote in message
    news:p...
    > On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:
    >
    > > "Arthur J. O'Dwyer" <> wrote in message
    > > news:p...
    > >>
    > >> On Sat, 25 Oct 2003, nobody wrote:
    > >> >
    > >> > Try
    > >> > 1. optimize code little bit, instead of
    > >> > sum = 0;
    > >> > for (tap=0; tap<nrof_taps; tap++)
    > >> > {
    > >> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > >> > }
    > >> > *tmp_output_ptr++ = sum;
    > >> > do
    > >> > for (tap=0; tap<nrof_taps; tap++)
    > >> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > >> > (sum is cleared in each outer iteration and therefore I assume it's
    > >> > not used somewhere later)
    > >>
    > >> That will almost certainly slow down the code immensely; remember
    > >> that 'tmp_output_ptr' points to a 100000-element array, which is highly
    > >> unlikely to be readily accessible. 'sum', on the other hand, should
    > >> have been optimized by the compiler to be stored very accessibly in
    > >> a register or suchlike.
    > >>

    > > I was thinking of that too, but array has 10K elemnts, not 100K, so
    > > assuming 32 bit inegers, there is malloc'd something below 160K
    > > from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
    > > OP is running SunOS on hardware with more memory than that.
    > > But nonwithstanding, you may be right.

    >
    > For what it's worth, I compiled both versions on my machine (which
    > has plenty of memory). The version with sum was faster:
    >
    > with sum:
    > [sheldon@wsxyz]$ gcc -Wall -W -O3 -o test1 test1.c
    > [sheldon@wsxyz]$ time ./test1
    >
    > real 0m0.627s
    > user 0m0.560s
    > sys 0m0.000s
    >
    >
    > without sum:
    > [sheldon@wsxyz]$ gcc -Wall -W -O3 -o test2 test2.c
    > [sheldon@wsxyz]$ time ./test2
    >
    > real 0m0.710s
    > user 0m0.630s
    > sys 0m0.000s
    >

    Thanks. So I was wrong. My sincere apologies. Had I SunOS box
    at my disposal, I would test it before posting :)
     
    nobody, Oct 25, 2003
    #9
  10. Re: [OT] Re: Is this the optimal FIR filter on all platforms?

    On Sat, 25 Oct 2003, nobody wrote:
    >
    > "Arthur J. O'Dwyer" <> wrote...
    > > On Sat, 25 Oct 2003, nobody wrote:
    > > > optimize code little bit, instead of
    > > > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > > > do
    > > > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;

    > >
    > > That will almost certainly slow down the code immensely; remember
    > > that 'tmp_output_ptr' points to a 100000-element array, which is highly
    > > unlikely to be readily accessible. 'sum', on the other hand, should
    > > have been optimized by the compiler to be stored very accessibly in
    > > a register or suchlike.

    >
    > I was thinking of that too, but array has 10K elemnts, not 100K, so
    > assuming 32 bit inegers, there is malloc'd something below 160K
    > from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
    > OP is running SunOS on hardware with more memory than that.


    More main memory, yes. More cache memory, no. I was referring to
    the size of the cache in this case. You're right, though, that if
    the arrays get really ungodly huge, then the OP might need to look
    at the effects of whatever virtual memory facilities his OS is
    trying to use.

    This is way OT; I'll stop now.

    -Arthur
     
    Arthur J. O'Dwyer, Oct 25, 2003
    #10
  11. Johan Bergman

    Matteo Frigo Guest

    "Johan Bergman" <> writes:

    > const int * const coeff_ptr =
    > (const int * const) malloc(nrof_taps*sizeof(int));
    > const int * const input_ptr =
    > (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
    > int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));


    You are not initializing the input. Apart from the fact that the
    behavior is undefined, as pointed out by previous posters, you ought
    to be aware that floating-point arithmetic on infinities and NaN's on
    Intel processors is much slower than floating-point arithmetic on
    finite floating-point numbers. Your observed slowdown is likely
    caused by a floating-point NaN that happens to be in the uninitialized
    array.

    Independently of this problem, to answer the question in the title,
    your FIR filter is not optimal on *any* platform that I can think of,
    but the reason is way off-topic for this newsgroup. (It is also
    suboptimal from a roundoff error point of view.)

    Cheers,
    Matteo Frigo
     
    Matteo Frigo, Oct 26, 2003
    #11
  12. Hi Tim, thanks for your reply. Yes, I am running on different hardware:

    Cygwin on a 400-MHz x86
    Linux on a 1620-MHz x86
    SunOS on a 333-MHz sparc

    I compiled with gcc -O3. I get the similar behaviors with gcc versions
    2.95.3 and 3.2.2. I check the execution time with the 'time' command.

    Thanks for the tip to change from int to float/double! Here are the results:

    Linux: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for double
    SunOS: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for double
    Cygwin: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for double

    So changing from int to float/double brought down the execution time a lot
    in Linux and SunOS. Now I have similar performance on the two x86 platforms,
    and the sparc platform is now only two times worse in the float case. :)

    Regards,
    Johan

    > Assuming that you are using basically the same compiler and the same
    > optimization options in each case, you must be running on different
    > hardware. You've given no reason to believe that the OS is responsible.
    > If one or more of your platforms supports SSE/SSE2 instructions, and you
    > can change the operands to float/double, you should be able to get much
    > better performance. Split the sum into 4 or 8 partial sums, thus
    > parallelizing the sum reduction. This becomes OT if you use a C compiler
    > whichs supports parallel instructions only in asm.
    > --
    > Tim Prince
     
    Johan Bergman, Oct 26, 2003
    #12
  13. Hi Arthur, thanks for your reply!

    > What in the world is an FIR filter? (Finite impulse-response, yes,
    > yes, I know, I looked it up. Point is, practically nobody in this
    > newsgroup will know, either, and they'll *all* have to look it up
    > if they really want to help you. You're making it harder for people
    > to help you when you don't define your terms.)


    Sorry, maybe I should have described the background a bit better. An FIR
    filter is one of the basic filter types in digital signal processing, and my
    example shows one of the ways to realize it. For FIR filter this long (10000
    taps), it would actually make more sence to choose another realization based
    on FFTs (Fast Fourier Transforms). But first I would like to see how fast I
    can make the realization in the example.

    > Well, first let's clean up your code and see what it looks like
    > then. Clean code is always easier to operate on. :)


    OK then! :)

    > #include <stdlib.h>
    >
    > int main(void)
    > {
    > int nrof_lags = 10000; /* Could both of these be #defines? */
    > int nrof_taps = 10000; /* That would help a little. */
    >
    > int *coeffs = malloc(nrof_taps * sizeof *coeffs);
    > int *input = malloc((nrof_taps+nrof_lags) * sizeof *input);
    > int *output = malloc(nrof_lags * sizeof *output);
    > int i, j;
    >
    > for (i=0; i < nrof_lags; ++i)
    > {
    > int sum = 0;
    > for (j=0; j < nrof_taps; ++j)
    > sum += coeffs[j] * input[i+j];
    > output = sum;
    > }
    >
    > free(coeffs);
    > free(input);
    > free(output);
    > return 0;
    > }


    Thanks for the cleanup!

    > How's the code do now?


    Sorry, it is slightly slower than my (admittedly terrible) original code.

    BR,
    Johan
     
    Johan Bergman, Oct 26, 2003
    #13
  14. Re: [OT] Re: Is this the optimal FIR filter on all platforms?

    > No idea what the problem is, but if you comment out
    > *tmp_output_ptr++ = sum;
    > compiler may optimize away calculation of sum in previous statement,
    > which could explain performance differences.


    You are right, of course!

    > Try
    > 1. optimize code little bit, instead of
    > sum = 0;
    > for (tap=0; tap<nrof_taps; tap++)
    > {
    > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > }
    > *tmp_output_ptr++ = sum;
    > do
    > for (tap=0; tap<nrof_taps; tap++)
    > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > (sum is cleared in each outer iteration and therefore I assume it's
    > not used somewhere later)


    You probably meant something like this (since we don't want to step
    tmp_output_ptr inside the inner loop):

    for (tap=0; tap<nrof_taps; tap++)
    *tmp_output_ptr += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    *tmp_output_ptr++;

    I tried this, but unfortunately the result was about 40% longer execution
    time (at least on the Sun platform).

    > 2. use calloc() instead of malloc(). I'm not sure what are alignment
    > implications (some additional instructions for acces to data via
    > dereferenced
    > ptr in malloc'd block?) in case of malloc() use on any particular

    platform.

    Ok, I tried changing malloc() to calloc() but it didn't affect the execution
    time.

    > I would appreciate posting of proper explanation and solution,
    > when you find it.


    I will. See in my answer to Tim Prince about the results of changing int to
    float/double.

    BR,
    Johan
     
    Johan Bergman, Oct 26, 2003
    #14
  15. Re: [OT] Re: Is this the optimal FIR filter on all platforms?

    Hi Sheldon, what kind of processor do you have? What is the clock frequency?

    BR,
    Johan

    "Sheldon Simms" <> wrote in message
    news:p...
    > On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:
    >
    > > "Arthur J. O'Dwyer" <> wrote in message
    > > news:p...
    > >>
    > >> On Sat, 25 Oct 2003, nobody wrote:
    > >> >
    > >> > Try
    > >> > 1. optimize code little bit, instead of
    > >> > sum = 0;
    > >> > for (tap=0; tap<nrof_taps; tap++)
    > >> > {
    > >> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > >> > }
    > >> > *tmp_output_ptr++ = sum;
    > >> > do
    > >> > for (tap=0; tap<nrof_taps; tap++)
    > >> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    > >> > (sum is cleared in each outer iteration and therefore I assume it's
    > >> > not used somewhere later)
    > >>
    > >> That will almost certainly slow down the code immensely; remember
    > >> that 'tmp_output_ptr' points to a 100000-element array, which is highly
    > >> unlikely to be readily accessible. 'sum', on the other hand, should
    > >> have been optimized by the compiler to be stored very accessibly in
    > >> a register or suchlike.
    > >>

    > > I was thinking of that too, but array has 10K elemnts, not 100K, so
    > > assuming 32 bit inegers, there is malloc'd something below 160K
    > > from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
    > > OP is running SunOS on hardware with more memory than that.
    > > But nonwithstanding, you may be right.

    >
    > For what it's worth, I compiled both versions on my machine (which
    > has plenty of memory). The version with sum was faster:
    >
    > with sum:
    > [sheldon@wsxyz]$ gcc -Wall -W -O3 -o test1 test1.c
    > [sheldon@wsxyz]$ time ./test1
    >
    > real 0m0.627s
    > user 0m0.560s
    > sys 0m0.000s
    >
    >
    > without sum:
    > [sheldon@wsxyz]$ gcc -Wall -W -O3 -o test2 test2.c
    > [sheldon@wsxyz]$ time ./test2
    >
    > real 0m0.710s
    > user 0m0.630s
    > sys 0m0.000s
    >
    >
     
    Johan Bergman, Oct 26, 2003
    #15
  16. Hi again,

    I said in my post to you that your program performs slightly worse (at least
    on the Sun platform). However, when I changed from int to float/double, as
    proposed by Tim Prince, there was no longer any difference in execution time
    between your code and mine. Funny!

    By the way, here are the results (with my code):

    Linux/x86: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for
    double
    SunOS/sparc: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for
    double
    Cygwin/x86: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for
    double

    So with float/double I get the same performance on the two x86 platforms.
    And in the float case, the sparc platform doesn't fall too much behind.

    Regards,
    Johan
     
    Johan Bergman, Oct 26, 2003
    #16
  17. Hi again, again! Now I have tried your code and mine on the Cygwin/x86
    platform as well, and this time your code was more than two times slower
    than mine.

    So it seems like this:
    sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
    is often more optimal than this:
    sum += coeffs[tap] * input[lag+tap];

    Regards,
    Johan
     
    Johan Bergman, Oct 26, 2003
    #17
  18. In article <bngson$o0j$>, Johan Bergman wrote:
    >
    > Cygwin on a 400-MHz x86
    > Linux on a 1620-MHz x86
    > SunOS on a 333-MHz sparc
    >
    > I compiled with gcc -O3. I get the similar behaviors with gcc versions
    > 2.95.3 and 3.2.2. I check the execution time with the 'time' command.
    >
    > Linux: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for double
    > SunOS: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for double
    > Cygwin: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for double


    Comparing clock cycles confuses the issue. Processor speeds, cache
    speeds/sizes, and main memory access/write times are not necessarily correlated.
    You have to get off your clock-cycle kick, and learn about memory subsystem
    performance. There's a lot to learn! And operating systems have small
    or zero effect on the result, as you could verify if you took the time
    to boot Linux on that 400-MHz x86, Windows on that 1620-MHz x86, or Linux
    on that 333-MHz sparc.

    - Larry
     
    Larry Doolittle, Oct 26, 2003
    #18
  19. > Comparing clock cycles confuses the issue. Processor speeds, cache
    > speeds/sizes, and main memory access/write times are not necessarily

    correlated.
    > You have to get off your clock-cycle kick, and learn about memory

    subsystem
    > performance. There's a lot to learn! And operating systems have small
    > or zero effect on the result, as you could verify if you took the time
    > to boot Linux on that 400-MHz x86, Windows on that 1620-MHz x86, or Linux
    > on that 333-MHz sparc.


    It's not like I am saying that clock frequency is all that matters... I am
    just saying that this program takes this many clock cycles to execute on a
    that platform.

    What I wanted was a program that runs fast on any platform, but it seemed to
    perform very differently on the platforms that I tried out. In order to
    convince myself of that, I had to at least take the different clock
    frequencies into account, right? And when I had convinced myself, I posted
    my question in this group, to see if someone could help me optimize my code.

    Regards,
    Johan
     
    Johan Bergman, Oct 26, 2003
    #19
  20. Hi Matteo, thanks for your reply!

    > You are not initializing the input. Apart from the fact that the
    > behavior is undefined, as pointed out by previous posters, you ought
    > to be aware that floating-point arithmetic on infinities and NaN's on
    > Intel processors is much slower than floating-point arithmetic on
    > finite floating-point numbers. Your observed slowdown is likely
    > caused by a floating-point NaN that happens to be in the uninitialized
    > array.


    But in my example, I only worked with integers! (By the way, float/double
    actually proved to be a lot fast than int on all tested platforms.)

    > Independently of this problem, to answer the question in the title,
    > your FIR filter is not optimal on *any* platform that I can think of,
    > but the reason is way off-topic for this newsgroup. (It is also
    > suboptimal from a roundoff error point of view.)


    Well, as I wrote, I realize that an FFT approach would be beneficial for
    such a long FIR filter. But apart from that, are there any other
    optimizations that you can think of? In that case I would be most interested
    in hearing them! The best would be to get a piece of code, of course! :)

    Regards,
    Johan
     
    Johan Bergman, Oct 26, 2003
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Hari

    FIR Filter

    Hari, Jan 27, 2004, in forum: VHDL
    Replies:
    3
    Views:
    946
  2. Yttrium

    FIR filter design + COE file

    Yttrium, Feb 9, 2004, in forum: VHDL
    Replies:
    1
    Views:
    2,988
  3. dhaanya nair
    Replies:
    0
    Views:
    3,936
    dhaanya nair
    Feb 26, 2004
  4. Wiener, Norbert
    Replies:
    4
    Views:
    598
    Jerry Avins
    Nov 18, 2004
  5. Terry Reedy
    Replies:
    1
    Views:
    370
    cassiope
    Jun 2, 2010
Loading...

Share This Page