Well, do so!
-- Richard
Can changing the function to a macro definition make things faster? I
know that calling a function, without executing its body, takes only a
tiny amount of time. Suppose it takes 1e-5 secs (10 microseconds) per
call; calling it 1,000,000 times would then take 10 secs.
10usec per call?
More like 10nsec, 1000 times faster. It would make your program 10msec
faster.
If possible, call your routine and comment out the product loop. Or
just call it 1000000 times with no content, to see how much the
overhead actually is.
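For example, a throwaway measurement along these lines shows the overhead
directly. Everything here (the empty foo_empty function, the fixed
1,000,000-call loop, the clock() timing) is an illustrative sketch, not code
from the thread:

#include <stdio.h>
#include <time.h>

/* Same signature as foo(), but an empty body, so the loop below
   measures only call overhead rather than the dot-product work. */
double foo_empty(double *a, double *b, int sz)
{
    (void)a; (void)b; (void)sz;
    return 0.0;
}

int main(void)
{
    double a[1] = {0}, b[1] = {0};
    volatile double sink = 0.0;   /* keeps the calls from being optimized away */
    clock_t t0, t1;
    long i;

    /* Note: with optimisation on, the compiler may still inline the empty
       function and delete the loop; putting foo_empty in a separate file
       gives a more honest number. */
    t0 = clock();
    for (i = 0; i < 1000000L; i++)
        sink += foo_empty(a, b, 1);
    t1 = clock();

    printf("1,000,000 empty calls: %.3f sec\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}

On current hardware this typically reports on the order of ten milliseconds
or less, consistent with the 10 nsec per call estimate above.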
When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.
double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < sz; i++)
        sum += a[i] * b[i];
    return sum;
}
Setting the "-O2 -funroll-loops" flags, I compiled my program
again. Using the optimization flags made my program 1 sec faster.
The program compiled without these optimization flags took 72 secs to
run. I used MinGW to compile my program.
Can changing the function to a macro definition make things faster? I
know that calling a function, without executing its body, takes only a
tiny amount of time. Suppose it takes 1e-5 secs (10 microseconds) per
call; calling it 1,000,000 times would then take 10 secs.
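For reference, a function-like macro version would look roughly like the
sketch below. The macro name and the do/while wrapper are illustrative, not
something from the thread; a multi-statement macro cannot return a value in
portable C, so the result is written into the variable named by out.

#define DOT_PRODUCT(out, a, b, sz)                      \
    do {                                                \
        double dp_sum_ = 0.0;                           \
        int dp_i_;                                      \
        for (dp_i_ = 0; dp_i_ < (sz); dp_i_++)          \
            dp_sum_ += (a)[dp_i_] * (b)[dp_i_];         \
        (out) = dp_sum_;                                \
    } while (0)

/* usage:  DOT_PRODUCT(result, a, b, sz);  */

As the replies above point out, though, the call overhead is on the order of
nanoseconds per call, so with sz near 5000 the loop body dwarfs it and a
macro (or an inlined static function) is unlikely to buy anything noticeable.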
Paul Hsieh said: This is a very standard optimization problem: the dot product. Real
improvements to it require that you relax your accuracy constraints.
I.e., you have to allow for the products to be added in a different
order. The other suggestions made so far are humorous at best and
will do nothing to improve the performance. However, this:
static double foo (const double *a, const double *b, int sz) {
    double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int i;
    for (i = 0; i < sz - 3; i += 4) {
        sum0 += a[i]   * b[i];
        sum1 += a[i+1] * b[i+1];
        sum2 += a[i+2] * b[i+2];
        sum3 += a[i+3] * b[i+3];
    }
    for (; i < sz; i++) {
        sum0 += a[i] * b[i];
    }
    return (sum1 + sum0) + (sum3 + sum2);
}
will lead to a dramatic improvement in performance.
Setting the "-O2 -funroll-loops" flags, I compiled my program
again. Using the optimization flags made my program 1 sec faster.
The program compiled without these optimization flags took 72 secs to
run. I used MinGW to compile my program.
Can changing the function to a macro definition make things faster? I
know that calling a function, without executing its body, takes only a
tiny amount of time. Suppose it takes 1e-5 secs (10 microseconds) per
call; calling it 1,000,000 times would then take 10 secs.
Elsewhere you said that there are a million calls, and that the number
of elements is less than 5,000. 72 seconds sounds like a long time for
that - about 10 billion floating point operations - on a modern
computer. I used the program below to test it on my Mac (which uses
gcc). It takes about 23 seconds with no optimisation, and about 5.7
seconds with -O or -O2; -funroll-loops doesn't make much difference.
If you're using a current machine, I would reconsider your conclusion
that it's the floating point calculation loop that is taking all the
time.
#include <stdio.h>
#include <stdlib.h>

double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < sz; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(int argc, char **argv)
{
    int size = atoi(argv[1]);
    int rep = atoi(argv[2]);
    double *a = malloc(size * sizeof(*a));
    double *b = malloc(size * sizeof(*b));
    int i;
    double res = 0.0;
    for (i = 0; i < size; i++)
    {
        a[i] = random();
        b[i] = random();
    }
    for (i = 0; i < rep; i++)
        res += foo(a, b, size);
    printf("res=%g\n", res);
    return 0;
}
-- Richard
[email protected] said: When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.
double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < sz; i++)
        sum += a[i] * b[i];
    return sum;
}
Suppose in my application I know that a[i]*b[i] can only take a
limited set of values. For example, a[i] can only be 0, 0.25, 0.5,
or 0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
[...]
Then a[] and b[] should probably be arrays of unsigned char,
encoding the a's and b's as four and five times their nominal
values, respectively. You compute the sum of products as before
(but using an integer sum), and then divide the total by 20.0
before returning it. This might produce speed gains on two
fronts: first, integer multiplications and additions are faster
than F-P operations on many machines, and second, fetching one-
eighth (probably) as much data from memory will save time on
many machines.
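A minimal sketch of that encoding (the names foo_encoded, ea, and eb are
mine, not from the post), assuming the caller has already stored a[i]*4 in
ea[i] and b[i]*5 in eb[i]:

double foo_encoded(const unsigned char *ea, const unsigned char *eb, int sz)
{
    long total = 0;   /* integer accumulator; each product is at most
                         3*4 = 12, so with sz below 5000 it cannot overflow */
    int i;
    for (i = 0; i < sz; i++)
        total += ea[i] * eb[i];
    return total / 20.0;   /* undo the 4*5 scaling once, at the end */
}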
Still, I think your first step should be to turn up the
optimization level on your compiler; it seems you've been using
no optimization at all. If that's not enough (or maybe even
if it *is* enough), try some of the higher-level algorithmic
suggestions various people have mentioned: arrange to call
foo() less often, or on shorter vectors, or both.
(e-mail address removed) wrote:
) Thanks. The foo I posted previously was a simplified version of my
) real foo. The real foo function, which
) computes a function's value and its first derivative given x, is
) below. The foo function is defined as
Ah, so you're asking how to optimize one function, when in fact
the function you want optimized is a *very different* one ?
Jeez.
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
I tried Paul Hsieh's idea. But I got almost the same running time.
[email protected] said: [...]
sum += log(z1) - log(z2);
[...]
Looks like some low-hanging fruit could be plucked
here: Try sum += log(z1/z2) instead, and see what
happens.
I thought the general problem was the same. I did not want to mess
with too many details.
Eric said: (e-mail address removed) wrote:
[...]
sum += log(z1) - log(z2);
[...]
Looks like some low-hanging fruit could be plucked
here: Try sum += log(z1/z2) instead, and see what
happens.
This is a remarkable improvement. The running time is 51 secs now,
almost 30% faster.
Perhaps this may serve as an object lesson: When you
ask how to improve a piece of code, provide the actual
code you want improved and not a half-baked paraphrase.
#define exp0 2.71828
static double foo(Vector *x, Vector *df, double *fip0, double *fip1, double *fip2,
                  double *firp0, double *firp1, double *firp2, unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;
    fip0 = para->prob0;
    fip1 = para->prob1;
    fip2 = para->prob2;
    firp0 = para->prob3;
    firp1 = para->prob4;
    firp2 = para->prob5;
    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);
    for (i = 0; i < sz; i++) {
        x0 = exp0*firp0[i];
        x1 = exp1*firp1[i];
        x2 = exp2*firp2[i];
        z1 = x0 + x1 + x2;
        y0 = exp0*fip0[i];
        y1 = exp1*fip1[i];
        y2 = exp2*fip2[i];
        z2 = y0 + y1 + y2;
        sum += log(z1) - log(z2);
        df->data[0] += x1/z1 - y1/z2;
        df->data[1] += x2/z1 - y2/z2;
    }
    return sum;
}
There are some other oddities in the code you (finally)
provided, and cleaning them up might get you some further
improvements. For example, six of the nine arguments to the
function are not used at all, so whatever effort the caller
expended to pass those arguments to the function is wasted
and can be done away with.
There are far too many `register'
declarations; many compilers will ignore them, but on those
that do not it may turn out that you'd get better speed with
fewer register variables. And the loop accumulates three
sums, one in a local variable but the other two in memory
accessed via a pointer; using three local variables and a
pair of stores after the loop may well be faster.
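To illustrate those points, here is one way the loop might be restructured:
three local accumulators with a pair of stores after the loop, the register
declarations dropped, and the two log() calls collapsed into one as suggested
earlier. The Vector layout shown and the idea of using the probability arrays
passed as arguments (instead of reloading them from para) are assumptions for
the sketch, not the poster's actual code.

#include <math.h>

#define EXP0 2.71828   /* e, as in the original #define exp0 */

typedef struct { double data[2]; } Vector;   /* assumed layout */

static double foo2(const Vector *x, Vector *df,
                   const double *fip0,  const double *fip1,  const double *fip2,
                   const double *firp0, const double *firp1, const double *firp2,
                   unsigned long sz)
{
    double sum = 0.0, d0 = 0.0, d1 = 0.0;   /* local accumulators */
    double exp1 = exp(x->data[0]);
    double exp2 = exp(x->data[1]);
    unsigned long i;

    for (i = 0; i < sz; i++) {
        double x1 = exp1 * firp1[i];
        double x2 = exp2 * firp2[i];
        double z1 = EXP0 * firp0[i] + x1 + x2;
        double y1 = exp1 * fip1[i];
        double y2 = exp2 * fip2[i];
        double z2 = EXP0 * fip0[i] + y1 + y2;
        sum += log(z1 / z2);              /* one log call instead of two */
        d0  += x1 / z1 - y1 / z2;
        d1  += x2 / z1 - y2 / z2;
    }
    df->data[0] = d0;                     /* store once, after the loop */
    df->data[1] = d1;
    return sum;
}

Whether this helps in practice depends on the compiler and the data, so it is
worth timing against the original rather than assuming an improvement.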