# Is this the optimal FIR filter on all platforms?

Discussion in 'C Programming' started by Johan Bergman, Oct 25, 2003.

1. ### Johan BergmanGuest

Hi,

Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach

The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients is just
rubbish in this example. For my intended usage of this filter, it is not
necessary to store the state of the filter.

My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):

Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum

The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum. As you can see there
were very different behaviors on the different platforms!

The SunOS program performs extremely well without the critial line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is

As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line and
only about twice as much with the critical line.

Can someone help me to speed up the program on the Linux and SunOS
platforms?

Regards,
Johan

#include <malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}

Johan Bergman, Oct 25, 2003

2. ### Tim PrinceGuest

Johan Bergman wrote:

> Hi,
>
> Maybe someone can help me to optimize this C/C++ implementation of a FIR
> filter (although I realize that I should probably consider an FFT approach
>
> The example below calculates the output signal for 10000 lags from an FIR
> filter with 10000 taps. The input signal and the filter coefficients is
> just rubbish in this example. For my intended usage of this filter, it is
> not necessary to store the state of the filter.
>
> My problem is that this implementation performs very well in Cygwin on my
> laptop, but less good in Linux and even worse in SunOS. I have tried to
> calculate the number of mega-cycles that the example program consumes on
> the different platforms (execution time in seconds multiplied by CPU clock
> frequency in MHz):
>
> Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
> Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
> SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum
>
> The performance within parentheses was obtained when a certain line was
> commented out (i.e. the line *tmp_output_ptr++ = sum. As you can see
> there were very different behaviors on the different platforms!
>
> The SunOS program performs extremely well without the critial line. Note
> that the 103 Mcycles means that only one cycle was spent per iteration in
> the inner for loop! But it performs terribly when the critical line is
> included, about 28 times slower!
>
> As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
> clock cycles per iteration in the inner loop, without the critical line
> and only about twice as much with the critical line.
>
> Can someone help me to speed up the program on the Linux and SunOS
> platforms?
>
> Regards,
> Johan
>
>
> #include <malloc.h>
> int main()
> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =
> (const int * const) malloc(nrof_taps*sizeof(int));
> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;
> tmp_output_ptr = (int *) output_ptr;
> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;
> tmp_input_ptr = (const int *) input_ptr + lag;
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> }
> }

Assuming that you are using basically the same compiler and the same
optimization options in each case, you must be running on different
hardware. You've given no reason to believe that the OS is responsible.
If one or more of your platforms supports SSE/SSE2 instructions, and you
can change the operands to float/double, you should be able to get much
better performance. Split the sum into 4 or 8 partial sums, thus
parallelizing the sum reduction. This becomes OT if you use a C compiler
whichs supports parallel instructions only in asm.
--
Tim Prince

Tim Prince, Oct 25, 2003

3. ### Arthur J. O'DwyerGuest

On Sat, 25 Oct 2003, Johan Bergman wrote:
>
> Maybe someone can help me to optimize this C/C++ implementation of a FIR
> filter (although I realize that I should probably consider an FFT approach

What in the world is an FIR filter? (Finite impulse-response, yes,
yes, I know, I looked it up. Point is, practically nobody in this
newsgroup will know, either, and they'll *all* have to look it up
if they really want to help you. You're making it harder for people

> My problem is that this implementation performs very well in Cygwin on my
> laptop, but less good in Linux and even worse in SunOS. I have tried to
> calculate the number of mega-cycles that the example program consumes on the
> different platforms (execution time in seconds multiplied by CPU clock
> frequency in MHz):
>
> Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
> Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
> SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum

> Can someone help me to speed up the program on the Linux and SunOS
> platforms?

Well, first let's clean up your code and see what it looks like
then. Clean code is always easier to operate on.

> #include <malloc.h>

Non-standard header. 'malloc' is declared in <stdlib.h>.

#include <stdlib.h>

> int main()

[Some regulars strongly believe in 'int main(void)'. I
don't have a strong preference either way, but it's wise
to consider why you picked the style you did, even on
little things like this.]

> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =

Okay, yeah, 'const' is great. But really, this is ridiculous.
You don't think that all these consts are somehow making your
program more efficient, do you? (They're not hurting, but
over-constifying is Bad mostly because it's Ugly, and Ugly code
is hard to make Right.)

> (const int * const) malloc(nrof_taps*sizeof(int));

malloc(nrof_taps * sizeof *coeff_ptr);

No chance for error using this idiom.
Also note that casting a malloc() result to 'const' anything
is just silly, because free() takes a non-const pointer
argument. How are you going to free() this memory, hmm?

> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;

Way too many similarly-named variables, in my opinion.
Again, purely a stylistic issue, but again, clean code is
nice code. Note that in the cleaned-up version, I've pulled
in their scope to where they're actually used. This may
actually help with the efficiency of the program, since some
optimizers look for "windows" in which objects can be moved
into registers and suchlike. The wider their scopes are, the
harder it is for the optimizer to see them.

> tmp_output_ptr = (int *) output_ptr;

Casting away constness is Bad. Especially when you've just
finished jumping through hoops to make it const in the first
place. Geez.

> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;
> tmp_input_ptr = (const int *) input_ptr + lag;

STOP PERFORMING RANDOM CASTS! For Pete's sake!
Get a hold of yourself here!

> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;

This line (and the one below) are just begging to be
simplified. Let's expand it so we can see the order
of operations.

> }
> *tmp_output_ptr++ = sum;
> }

/* always */ return 0; /* from main */

> }

Okay, let's put it all together: no casts, clean names,
clean mallocs, and see what we get!

You see how all those tmp_*_ptr objects just melt away
once we start to see the outline of the program clearly.
In fact, we uncover a subtle bug in the memory allocation,
and fix it ('input' was being malloc'ed one element short)!
Finally, we strive for idiomatic identifiers: using 'i' and
'j' for the nested loop controls, and losing the silly
'_ptr' suffixes on the main data arrays.

#include <stdlib.h>

int main(void)
{
int nrof_lags = 10000; /* Could both of these be #defines? */
int nrof_taps = 10000; /* That would help a little. */

int *coeffs = malloc(nrof_taps * sizeof *coeffs);
int *input = malloc((nrof_taps+nrof_lags) * sizeof *input);
int *output = malloc(nrof_lags * sizeof *output);
int i, j;

for (i=0; i < nrof_lags; ++i)
{
int sum = 0;
for (j=0; j < nrof_taps; ++j)
sum += coeffs[j] * input[i+j];
output = sum;
}

free(coeffs);
free(input);
free(output);
return 0;
}

There -- isn't that a heck of a lot simpler and easier to
If not, you might try changing 'int' to 'unsigned int' (in
fact, you might do that anyway, to guard against overflow
conditions); reversing the order of either or both loops;
if many of the 'coeffs[j]' are zero, try pre-processing
'coeffs' a little bit; little tweaks like that.

I think most of your performance differences are due to
different cache layouts on the different machines, but I
could easily be wrong.

How's the code do now?

HTH,
-Arthur

Arthur J. O'Dwyer, Oct 25, 2003
4. ### nobodyGuest

[OT] Re: Is this the optimal FIR filter on all platforms?

"Johan Bergman" <> wrote in message
news:_ylmb.34856\$...
> Hi,
>
> Maybe someone can help me to optimize this C/C++ implementation of a FIR
> filter (although I realize that I should probably consider an FFT approach
>
> The example below calculates the output signal for 10000 lags from an FIR
> filter with 10000 taps. The input signal and the filter coefficients is

just
> rubbish in this example. For my intended usage of this filter, it is not
> necessary to store the state of the filter.
>
> My problem is that this implementation performs very well in Cygwin on my
> laptop, but less good in Linux and even worse in SunOS. I have tried to
> calculate the number of mega-cycles that the example program consumes on

the
> different platforms (execution time in seconds multiplied by CPU clock
> frequency in MHz):
>
> Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
> Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
> SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum
>
> The performance within parentheses was obtained when a certain line was
> commented out (i.e. the line *tmp_output_ptr++ = sum. As you can see

there
> were very different behaviors on the different platforms!
>
> The SunOS program performs extremely well without the critial line. Note
> that the 103 Mcycles means that only one cycle was spent per iteration in
> the inner for loop! But it performs terribly when the critical line is
> included, about 28 times slower!
>
> As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
> clock cycles per iteration in the inner loop, without the critical line

and
> only about twice as much with the critical line.
>
> Can someone help me to speed up the program on the Linux and SunOS
> platforms?
>
> Regards,
> Johan
>
>
> #include <malloc.h>
> int main()
> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =
> (const int * const) malloc(nrof_taps*sizeof(int));
> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;
> tmp_output_ptr = (int *) output_ptr;
> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;
> tmp_input_ptr = (const int *) input_ptr + lag;
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> }
> }
>

No idea what the problem is, but if you comment out
*tmp_output_ptr++ = sum;
compiler may optimize away calculation of sum in previous statement,
which could explain performance differences. Try
1. optimize code little bit, instead of
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
do
for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
(sum is cleared in each outer iteration and therefore I assume it's
not used somewhere later)
2. use calloc() instead of malloc(). I'm not sure what are alignment
implications (some additional instructions for acces to data via
dereferenced
ptr in malloc'd block?) in case of malloc() use on any particular platform.

I would appreciate posting of proper explanation and solution,
when you find it.

nobody, Oct 25, 2003
5. ### CBFalconerGuest

Johan Bergman wrote:
>

.... snip ...
>
> Can someone help me to speed up the program on the Linux and SunOS
> platforms?
>
> #include <malloc.h>

No such standard header. Use stdlib.h.

> int main()
> {
> const int nrof_lags = 10000;
> const int nrof_taps = 10000;
> const int * const coeff_ptr =
> (const int * const) malloc(nrof_taps*sizeof(int));

Don't cast malloc output.

> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));

Same here.

> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));

Same here.

> const int * tmp_coeff_ptr;
> const int * tmp_input_ptr;
> int * tmp_output_ptr;
> int sum;
> int lag, tap;
> tmp_output_ptr = (int *) output_ptr;
> for (lag=0; lag<nrof_lags; lag++)
> {
> tmp_coeff_ptr = (const int *) coeff_ptr;

Uninitialized data. Anything can happen.

> tmp_input_ptr = (const int *) input_ptr + lag;
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> }
> }

I suggest you first create a legal program before measuring
performance.

--
Chuck F () ()
Available for consulting/temporary embedded and systems.

CBFalconer, Oct 25, 2003
6. ### Arthur J. O'DwyerGuest

Re: [OT] Re: Is this the optimal FIR filter on all platforms?

On Sat, 25 Oct 2003, nobody wrote:
>
> "Johan Bergman" <> wrote...
> >
> > Maybe someone can help me to optimize this C/C++ implementation of a FIR
> > filter (although I realize that I should probably consider an FFT approach

> Try
> 1. optimize code little bit, instead of
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> do
> for (tap=0; tap<nrof_taps; tap++)
> *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> (sum is cleared in each outer iteration and therefore I assume it's
> not used somewhere later)

That will almost certainly slow down the code immensely; remember
that 'tmp_output_ptr' points to a 100000-element array, which is highly
unlikely to be readily accessible. 'sum', on the other hand, should
have been optimized by the compiler to be stored very accessibly in
a register or suchlike.

> 2. use calloc() instead of malloc(). I'm not sure what are alignment
> implications (some additional instructions for acces to data via
> dereferenced
> ptr in malloc'd block?) in case of malloc() use on any particular platform.

On many modern systems, there won't even be a significant difference
between the performance of calloc and malloc -- but since malloc doesn't
need to write zeroes everywhere, it's faster in theory. OTOH, if the
OP used calloc, he would have a program that actually exhibited defined,
reproducible behavior.

-Arthur

Arthur J. O'Dwyer, Oct 25, 2003
7. ### nobodyGuest

Re: [OT] Re: Is this the optimal FIR filter on all platforms?

"Arthur J. O'Dwyer" <> wrote in message
news...
>
> On Sat, 25 Oct 2003, nobody wrote:
> >
> > "Johan Bergman" <> wrote...
> > >
> > > Maybe someone can help me to optimize this C/C++ implementation of a

FIR
> > > filter (although I realize that I should probably consider an FFT

approach

>
> > Try
> > 1. optimize code little bit, instead of
> > sum = 0;
> > for (tap=0; tap<nrof_taps; tap++)
> > {
> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> > }
> > *tmp_output_ptr++ = sum;
> > do
> > for (tap=0; tap<nrof_taps; tap++)
> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> > (sum is cleared in each outer iteration and therefore I assume it's
> > not used somewhere later)

>
> That will almost certainly slow down the code immensely; remember
> that 'tmp_output_ptr' points to a 100000-element array, which is highly
> unlikely to be readily accessible. 'sum', on the other hand, should
> have been optimized by the compiler to be stored very accessibly in
> a register or suchlike.
>

I was thinking of that too, but array has 10K elemnts, not 100K, so
assuming 32 bit inegers, there is malloc'd something below 160K
from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
OP is running SunOS on hardware with more memory than that.
But nonwithstanding, you may be right.

> > 2. use calloc() instead of malloc(). I'm not sure what are alignment
> > implications (some additional instructions for acces to data via
> > dereferenced
> > ptr in malloc'd block?) in case of malloc() use on any particular

platform.
>
> On many modern systems, there won't even be a significant difference
> between the performance of calloc and malloc -- but since malloc doesn't
> need to write zeroes everywhere, it's faster in theory. OTOH, if the

That is true, but allocation is done outside of loop so this difference
would be IMHO practically inmesurable in given program (10K
multiplications and additions in inner loop times 10K iterations
and additions in outer loop. But I was really speculating, because
I can't explain those huge time execution differences (assuming
correct methodology was used to measure this time).

> OP used calloc, he would have a program that actually exhibited defined,
> reproducible behavior.
>
> -Arthur
>

nobody, Oct 25, 2003
8. ### Sheldon SimmsGuest

Re: [OT] Re: Is this the optimal FIR filter on all platforms?

On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:

> "Arthur J. O'Dwyer" <> wrote in message
> news...
>>
>> On Sat, 25 Oct 2003, nobody wrote:
>> >
>> > Try
>> > 1. optimize code little bit, instead of
>> > sum = 0;
>> > for (tap=0; tap<nrof_taps; tap++)
>> > {
>> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
>> > }
>> > *tmp_output_ptr++ = sum;
>> > do
>> > for (tap=0; tap<nrof_taps; tap++)
>> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
>> > (sum is cleared in each outer iteration and therefore I assume it's
>> > not used somewhere later)

>>
>> That will almost certainly slow down the code immensely; remember
>> that 'tmp_output_ptr' points to a 100000-element array, which is highly
>> unlikely to be readily accessible. 'sum', on the other hand, should
>> have been optimized by the compiler to be stored very accessibly in
>> a register or suchlike.
>>

> I was thinking of that too, but array has 10K elemnts, not 100K, so
> assuming 32 bit inegers, there is malloc'd something below 160K
> from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
> OP is running SunOS on hardware with more memory than that.
> But nonwithstanding, you may be right.

For what it's worth, I compiled both versions on my machine (which
has plenty of memory). The version with sum was faster:

with sum:
[sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test1 test1.c
[sheldon@wsxyz]\$ time ./test1

real 0m0.627s
user 0m0.560s
sys 0m0.000s

without sum:
[sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test2 test2.c
[sheldon@wsxyz]\$ time ./test2

real 0m0.710s
user 0m0.630s
sys 0m0.000s

Sheldon Simms, Oct 25, 2003
9. ### nobodyGuest

Re: [OT] Re: Is this the optimal FIR filter on all platforms?

"Sheldon Simms" <> wrote in message
news...
> On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:
>
> > "Arthur J. O'Dwyer" <> wrote in message
> > news...
> >>
> >> On Sat, 25 Oct 2003, nobody wrote:
> >> >
> >> > Try
> >> > 1. optimize code little bit, instead of
> >> > sum = 0;
> >> > for (tap=0; tap<nrof_taps; tap++)
> >> > {
> >> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> >> > }
> >> > *tmp_output_ptr++ = sum;
> >> > do
> >> > for (tap=0; tap<nrof_taps; tap++)
> >> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> >> > (sum is cleared in each outer iteration and therefore I assume it's
> >> > not used somewhere later)
> >>
> >> That will almost certainly slow down the code immensely; remember
> >> that 'tmp_output_ptr' points to a 100000-element array, which is highly
> >> unlikely to be readily accessible. 'sum', on the other hand, should
> >> have been optimized by the compiler to be stored very accessibly in
> >> a register or suchlike.
> >>

> > I was thinking of that too, but array has 10K elemnts, not 100K, so
> > assuming 32 bit inegers, there is malloc'd something below 160K
> > from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
> > OP is running SunOS on hardware with more memory than that.
> > But nonwithstanding, you may be right.

>
> For what it's worth, I compiled both versions on my machine (which
> has plenty of memory). The version with sum was faster:
>
> with sum:
> [sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test1 test1.c
> [sheldon@wsxyz]\$ time ./test1
>
> real 0m0.627s
> user 0m0.560s
> sys 0m0.000s
>
>
> without sum:
> [sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test2 test2.c
> [sheldon@wsxyz]\$ time ./test2
>
> real 0m0.710s
> user 0m0.630s
> sys 0m0.000s
>

Thanks. So I was wrong. My sincere apologies. Had I SunOS box
at my disposal, I would test it before posting

nobody, Oct 25, 2003
10. ### Arthur J. O'DwyerGuest

Re: [OT] Re: Is this the optimal FIR filter on all platforms?

On Sat, 25 Oct 2003, nobody wrote:
>
> "Arthur J. O'Dwyer" <> wrote...
> > On Sat, 25 Oct 2003, nobody wrote:
> > > optimize code little bit, instead of
> > > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> > > do
> > > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;

> >
> > That will almost certainly slow down the code immensely; remember
> > that 'tmp_output_ptr' points to a 100000-element array, which is highly
> > unlikely to be readily accessible. 'sum', on the other hand, should
> > have been optimized by the compiler to be stored very accessibly in
> > a register or suchlike.

>
> I was thinking of that too, but array has 10K elemnts, not 100K, so
> assuming 32 bit inegers, there is malloc'd something below 160K
> from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
> OP is running SunOS on hardware with more memory than that.

More main memory, yes. More cache memory, no. I was referring to
the size of the cache in this case. You're right, though, that if
the arrays get really ungodly huge, then the OP might need to look
at the effects of whatever virtual memory facilities his OS is
trying to use.

This is way OT; I'll stop now.

-Arthur

Arthur J. O'Dwyer, Oct 25, 2003
11. ### Matteo FrigoGuest

"Johan Bergman" <> writes:

> const int * const coeff_ptr =
> (const int * const) malloc(nrof_taps*sizeof(int));
> const int * const input_ptr =
> (const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
> int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));

You are not initializing the input. Apart from the fact that the
behavior is undefined, as pointed out by previous posters, you ought
to be aware that floating-point arithmetic on infinities and NaN's on
Intel processors is much slower than floating-point arithmetic on
finite floating-point numbers. Your observed slowdown is likely
caused by a floating-point NaN that happens to be in the uninitialized
array.

Independently of this problem, to answer the question in the title,
your FIR filter is not optimal on *any* platform that I can think of,
but the reason is way off-topic for this newsgroup. (It is also
suboptimal from a roundoff error point of view.)

Cheers,
Matteo Frigo

Matteo Frigo, Oct 26, 2003
12. ### Johan BergmanGuest

Hi Tim, thanks for your reply. Yes, I am running on different hardware:

Cygwin on a 400-MHz x86
Linux on a 1620-MHz x86
SunOS on a 333-MHz sparc

I compiled with gcc -O3. I get the similar behaviors with gcc versions
2.95.3 and 3.2.2. I check the execution time with the 'time' command.

Thanks for the tip to change from int to float/double! Here are the results:

Linux: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for double
SunOS: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for double
Cygwin: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for double

So changing from int to float/double brought down the execution time a lot
in Linux and SunOS. Now I have similar performance on the two x86 platforms,
and the sparc platform is now only two times worse in the float case.

Regards,
Johan

> Assuming that you are using basically the same compiler and the same
> optimization options in each case, you must be running on different
> hardware. You've given no reason to believe that the OS is responsible.
> If one or more of your platforms supports SSE/SSE2 instructions, and you
> can change the operands to float/double, you should be able to get much
> better performance. Split the sum into 4 or 8 partial sums, thus
> parallelizing the sum reduction. This becomes OT if you use a C compiler
> whichs supports parallel instructions only in asm.
> --
> Tim Prince

Johan Bergman, Oct 26, 2003
13. ### Johan BergmanGuest

> What in the world is an FIR filter? (Finite impulse-response, yes,
> yes, I know, I looked it up. Point is, practically nobody in this
> newsgroup will know, either, and they'll *all* have to look it up
> if they really want to help you. You're making it harder for people

Sorry, maybe I should have described the background a bit better. An FIR
filter is one of the basic filter types in digital signal processing, and my
example shows one of the ways to realize it. For FIR filter this long (10000
taps), it would actually make more sence to choose another realization based
on FFTs (Fast Fourier Transforms). But first I would like to see how fast I
can make the realization in the example.

> Well, first let's clean up your code and see what it looks like
> then. Clean code is always easier to operate on.

OK then!

> #include <stdlib.h>
>
> int main(void)
> {
> int nrof_lags = 10000; /* Could both of these be #defines? */
> int nrof_taps = 10000; /* That would help a little. */
>
> int *coeffs = malloc(nrof_taps * sizeof *coeffs);
> int *input = malloc((nrof_taps+nrof_lags) * sizeof *input);
> int *output = malloc(nrof_lags * sizeof *output);
> int i, j;
>
> for (i=0; i < nrof_lags; ++i)
> {
> int sum = 0;
> for (j=0; j < nrof_taps; ++j)
> sum += coeffs[j] * input[i+j];
> output = sum;
> }
>
> free(coeffs);
> free(input);
> free(output);
> return 0;
> }

Thanks for the cleanup!

> How's the code do now?

Sorry, it is slightly slower than my (admittedly terrible) original code.

BR,
Johan

Johan Bergman, Oct 26, 2003
14. ### Johan BergmanGuest

Re: [OT] Re: Is this the optimal FIR filter on all platforms?

> No idea what the problem is, but if you comment out
> *tmp_output_ptr++ = sum;
> compiler may optimize away calculation of sum in previous statement,
> which could explain performance differences.

You are right, of course!

> Try
> 1. optimize code little bit, instead of
> sum = 0;
> for (tap=0; tap<nrof_taps; tap++)
> {
> sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> }
> *tmp_output_ptr++ = sum;
> do
> for (tap=0; tap<nrof_taps; tap++)
> *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> (sum is cleared in each outer iteration and therefore I assume it's
> not used somewhere later)

You probably meant something like this (since we don't want to step
tmp_output_ptr inside the inner loop):

for (tap=0; tap<nrof_taps; tap++)
*tmp_output_ptr += *tmp_coeff_ptr++ * *tmp_input_ptr++;
*tmp_output_ptr++;

I tried this, but unfortunately the result was about 40% longer execution
time (at least on the Sun platform).

> 2. use calloc() instead of malloc(). I'm not sure what are alignment
> implications (some additional instructions for acces to data via
> dereferenced
> ptr in malloc'd block?) in case of malloc() use on any particular

platform.

Ok, I tried changing malloc() to calloc() but it didn't affect the execution
time.

> I would appreciate posting of proper explanation and solution,
> when you find it.

I will. See in my answer to Tim Prince about the results of changing int to
float/double.

BR,
Johan

Johan Bergman, Oct 26, 2003
15. ### Johan BergmanGuest

Re: [OT] Re: Is this the optimal FIR filter on all platforms?

Hi Sheldon, what kind of processor do you have? What is the clock frequency?

BR,
Johan

"Sheldon Simms" <> wrote in message
news...
> On Sat, 25 Oct 2003 18:01:57 +0000, nobody wrote:
>
> > "Arthur J. O'Dwyer" <> wrote in message
> > news...
> >>
> >> On Sat, 25 Oct 2003, nobody wrote:
> >> >
> >> > Try
> >> > 1. optimize code little bit, instead of
> >> > sum = 0;
> >> > for (tap=0; tap<nrof_taps; tap++)
> >> > {
> >> > sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> >> > }
> >> > *tmp_output_ptr++ = sum;
> >> > do
> >> > for (tap=0; tap<nrof_taps; tap++)
> >> > *tmp_output_ptr++ += *tmp_coeff_ptr++ * *tmp_input_ptr++;
> >> > (sum is cleared in each outer iteration and therefore I assume it's
> >> > not used somewhere later)
> >>
> >> That will almost certainly slow down the code immensely; remember
> >> that 'tmp_output_ptr' points to a 100000-element array, which is highly
> >> unlikely to be readily accessible. 'sum', on the other hand, should
> >> have been optimized by the compiler to be stored very accessibly in
> >> a register or suchlike.
> >>

> > I was thinking of that too, but array has 10K elemnts, not 100K, so
> > assuming 32 bit inegers, there is malloc'd something below 160K
> > from heap (2 arrays of 10K ints, 1 of 20K ints. I assumed that
> > OP is running SunOS on hardware with more memory than that.
> > But nonwithstanding, you may be right.

>
> For what it's worth, I compiled both versions on my machine (which
> has plenty of memory). The version with sum was faster:
>
> with sum:
> [sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test1 test1.c
> [sheldon@wsxyz]\$ time ./test1
>
> real 0m0.627s
> user 0m0.560s
> sys 0m0.000s
>
>
> without sum:
> [sheldon@wsxyz]\$ gcc -Wall -W -O3 -o test2 test2.c
> [sheldon@wsxyz]\$ time ./test2
>
> real 0m0.710s
> user 0m0.630s
> sys 0m0.000s
>
>

Johan Bergman, Oct 26, 2003
16. ### Johan BergmanGuest

Hi again,

I said in my post to you that your program performs slightly worse (at least
on the Sun platform). However, when I changed from int to float/double, as
proposed by Tim Prince, there was no longer any difference in execution time
between your code and mine. Funny!

By the way, here are the results (with my code):

Linux/x86: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for
double
SunOS/sparc: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for
double
Cygwin/x86: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for
double

So with float/double I get the same performance on the two x86 platforms.
And in the float case, the sparc platform doesn't fall too much behind.

Regards,
Johan

Johan Bergman, Oct 26, 2003
17. ### Johan BergmanGuest

Hi again, again! Now I have tried your code and mine on the Cygwin/x86
platform as well, and this time your code was more than two times slower
than mine.

So it seems like this:
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
is often more optimal than this:
sum += coeffs[tap] * input[lag+tap];

Regards,
Johan

Johan Bergman, Oct 26, 2003
18. ### Larry DoolittleGuest

In article <bngson\$o0j\$>, Johan Bergman wrote:
>
> Cygwin on a 400-MHz x86
> Linux on a 1620-MHz x86
> SunOS on a 333-MHz sparc
>
> I compiled with gcc -O3. I get the similar behaviors with gcc versions
> 2.95.3 and 3.2.2. I check the execution time with the 'time' command.
>
> Linux: 1113 Mcycles for int, 522 Mcycles for float, 531 Mcycles for double
> SunOS: 2774 Mcycles for int, 982 Mcycles for float, 1282 Mcycles for double
> Cygwin: 452 Mcycles for int, 500 Mcycles for float, 568 Mcycles for double

Comparing clock cycles confuses the issue. Processor speeds, cache
speeds/sizes, and main memory access/write times are not necessarily correlated.
You have to get off your clock-cycle kick, and learn about memory subsystem
performance. There's a lot to learn! And operating systems have small
or zero effect on the result, as you could verify if you took the time
to boot Linux on that 400-MHz x86, Windows on that 1620-MHz x86, or Linux
on that 333-MHz sparc.

- Larry

Larry Doolittle, Oct 26, 2003
19. ### Johan BergmanGuest

> Comparing clock cycles confuses the issue. Processor speeds, cache
> speeds/sizes, and main memory access/write times are not necessarily

correlated.
> You have to get off your clock-cycle kick, and learn about memory

subsystem
> performance. There's a lot to learn! And operating systems have small
> or zero effect on the result, as you could verify if you took the time
> to boot Linux on that 400-MHz x86, Windows on that 1620-MHz x86, or Linux
> on that 333-MHz sparc.

It's not like I am saying that clock frequency is all that matters... I am
just saying that this program takes this many clock cycles to execute on a
that platform.

What I wanted was a program that runs fast on any platform, but it seemed to
perform very differently on the platforms that I tried out. In order to
convince myself of that, I had to at least take the different clock
frequencies into account, right? And when I had convinced myself, I posted
my question in this group, to see if someone could help me optimize my code.

Regards,
Johan

Johan Bergman, Oct 26, 2003
20. ### Johan BergmanGuest

> You are not initializing the input. Apart from the fact that the
> behavior is undefined, as pointed out by previous posters, you ought
> to be aware that floating-point arithmetic on infinities and NaN's on
> Intel processors is much slower than floating-point arithmetic on
> finite floating-point numbers. Your observed slowdown is likely
> caused by a floating-point NaN that happens to be in the uninitialized
> array.

But in my example, I only worked with integers! (By the way, float/double
actually proved to be a lot fast than int on all tested platforms.)

> Independently of this problem, to answer the question in the title,
> your FIR filter is not optimal on *any* platform that I can think of,
> but the reason is way off-topic for this newsgroup. (It is also
> suboptimal from a roundoff error point of view.)

Well, as I wrote, I realize that an FFT approach would be beneficial for
such a long FIR filter. But apart from that, are there any other
optimizations that you can think of? In that case I would be most interested
in hearing them! The best would be to get a piece of code, of course!

Regards,
Johan

Johan Bergman, Oct 26, 2003