J
Johan Bergman
Hi,
Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)
The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients is just
rubbish in this example. For my intended usage of this filter, it is not
necessary to store the state of the filter.
My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):
Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum
The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum
. As you can see there
were very different behaviors on the different platforms!
The SunOS program performs extremely well without the critial line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is
included, about 28 times slower!
As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line and
only about twice as much with the critical line.
Can someone help me to speed up the program on the Linux and SunOS
platforms?
Regards,
Johan
#include <malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}
Maybe someone can help me to optimize this C/C++ implementation of a FIR
filter (although I realize that I should probably consider an FFT approach
instead.)
The example below calculates the output signal for 10000 lags from an FIR
filter with 10000 taps. The input signal and the filter coefficients is just
rubbish in this example. For my intended usage of this filter, it is not
necessary to store the state of the filter.
My problem is that this implementation performs very well in Cygwin on my
laptop, but less good in Linux and even worse in SunOS. I have tried to
calculate the number of mega-cycles that the example program consumes on the
different platforms (execution time in seconds multiplied by CPU clock
frequency in MHz):
Cygwin: 448 Mcycles (216 Mcycles without *tmp_output_ptr++ = sum
Linux: 1090 Mcycles (151 Mcycles without *tmp_output_ptr++ = sum
SunOS: 2800 Mcycles (103 Mcycles without *tmp_output_ptr++ = sum
The performance within parentheses was obtained when a certain line was
commented out (i.e. the line *tmp_output_ptr++ = sum
were very different behaviors on the different platforms!
The SunOS program performs extremely well without the critial line. Note
that the 103 Mcycles means that only one cycle was spent per iteration in
the inner for loop! But it performs terribly when the critical line is
included, about 28 times slower!
As a comparison, the Cygwin program consumes 216 Mcycles, i.e. about two
clock cycles per iteration in the inner loop, without the critical line and
only about twice as much with the critical line.
Can someone help me to speed up the program on the Linux and SunOS
platforms?
Regards,
Johan
#include <malloc.h>
int main()
{
const int nrof_lags = 10000;
const int nrof_taps = 10000;
const int * const coeff_ptr =
(const int * const) malloc(nrof_taps*sizeof(int));
const int * const input_ptr =
(const int * const) malloc((nrof_taps-1+nrof_lags)*sizeof(int));
int const * output_ptr = (int const *) malloc(nrof_lags*sizeof(int));
const int * tmp_coeff_ptr;
const int * tmp_input_ptr;
int * tmp_output_ptr;
int sum;
int lag, tap;
tmp_output_ptr = (int *) output_ptr;
for (lag=0; lag<nrof_lags; lag++)
{
tmp_coeff_ptr = (const int *) coeff_ptr;
tmp_input_ptr = (const int *) input_ptr + lag;
sum = 0;
for (tap=0; tap<nrof_taps; tap++)
{
sum += *tmp_coeff_ptr++ * *tmp_input_ptr++;
}
*tmp_output_ptr++ = sum;
}
}