How to make it faster?

istillshine

Well do so!

-- Richard

Setting the "-O2 -funroll-loops" flags, I compiled my program
again. Using the optimization flags made my program 1 sec faster.
The program compiled without these optimization flags took 72 secs to
run. I used MinGW to compile my program.

Can changing the function to a macro definition make things faster? I
know that calling a function itself, without executing its body, takes
a tiny amount of time. Suppose it took 1e-5 secs per call: calling
it 1,000,000 times would take 10 secs.
 
Bart

Can changing the function to a macro definition make things faster?  I
know that calling a function itself, without executing its body, takes
a tiny amount of time.  Suppose it took 1e-5 secs per call: calling
it 1,000,000 times would take 10 secs.

10usec per call?

More like 10nsec, 1000 times faster. It would make your program 10msec
faster.

If possible, call your routine and comment out the product loop. Or
just call it 1000000 times with no content, to see how much the
overhead actually is.
 
Paul Hsieh

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < sz; i++)
        sum += a[i]*b[i];

    return sum;
}


This is a very standard optimization problem: the dot product. Real
improvements to it require that you relax your accuracy constraints.
I.e., you have to allow for the products to be added in a different
order. The other suggestions made so far are humorous at best and
will do nothing to improve the performance. However this:

static double foo (const double *a, const double *b, int sz) {
    double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int i;

    for (i = 0; i < sz - 3; i += 4) {
        sum0 += a[i  ]*b[i  ];
        sum1 += a[i+1]*b[i+1];
        sum2 += a[i+2]*b[i+2];
        sum3 += a[i+3]*b[i+3];
    }

    for (; i < sz; i++) {
        sum0 += a[i]*b[i];
    }

    return (sum1 + sum0) + (sum3 + sum2);
}

will lead to a dramatic improvement in performance. The reason is
that you are bottlenecked by the "sum += " part. I.e., you are
constantly using sum as your input and output operand, and therefore
you cannot start each addition before the previous one is done. In
the above breakdown the 4 additions in the main loop will execute
totally in parallel if there is execution bandwidth to spare.

With the Intel compiler, it's even possible that the loop is vectorized
using SSE2, but it's hard to be sure whether it would kick into that mode.

But because the additions are done in a different order, you are
likely to get slightly different answers (all the round-offs will be
different).

You will see techniques like this shown on my optimization page here:

http://www.pobox.com/~qed/optimize.html
 
Richard Tobin

Setting the "-O2 -funroll-loops" flags, I compiled my program
again. Using the optimization flags made my program 1 sec faster.
The program compiled without these optimization flags took 72 secs to
run. I used MinGW to compile my program.

Can changing the function to a macro definition make things faster? I
know that calling a function itself, without executing its body, takes
a tiny amount of time. Suppose it took 1e-5 secs per call: calling
it 1,000,000 times would take 10 secs.

Elsewhere you said that there are a million calls, and that the number
of elements is less than 5,000. 72 seconds sounds like a long time for
that - about 10 billion floating point operations - on a modern
computer. I used the program below to test it on my Mac (which uses
gcc). It takes about 23 seconds with no optimisation, and about 5.7
seconds with -O or -O2; -funroll-loops doesn't make much difference.

If you're using a current machine, I would reconsider your conclusion
that it's the floating point calculation loop that is taking all the
time.

#include <stdio.h>
#include <stdlib.h>

double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < sz; i++)
        sum += a[i]*b[i];

    return sum;
}

int main(int argc, char **argv)
{
    int size = atoi(argv[1]);
    int rep = atoi(argv[2]);
    double *a = malloc(size * sizeof(*a));
    double *b = malloc(size * sizeof(*b));
    int i;
    double res = 0;

    for (i = 0; i < size; i++)
    {
        a[i] = random();
        b[i] = random();
    }

    for (i = 0; i < rep; i++)
        res += foo(a, b, size);

    printf("res=%g\n", res);

    return 0;
}

-- Richard
 
Richard Tobin

Paul Hsieh said:
This is a very standard optimization problem: the dot product. Real
improvements to it require that you relax your accuracy constraints.
I.e., you have to allow for the products to be added in a different
order. The other suggestions made so far are humorous at best and
will do nothing to improve the performance. However this:

static double foo (const double *a, const double *b, int sz) {
    double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    int i;

    for (i = 0; i < sz - 3; i += 4) {
        sum0 += a[i  ]*b[i  ];
        sum1 += a[i+1]*b[i+1];
        sum2 += a[i+2]*b[i+2];
        sum3 += a[i+3]*b[i+3];
    }

    for (; i < sz; i++) {
        sum0 += a[i]*b[i];
    }

    return (sum1 + sum0) + (sum3 + sum2);
}

will lead to a dramatic improvement in performance.


I took the test program from the article I just posted and replaced
the original foo() with yours. The time taken varied quite oddly
depending on optimisation, but was between 5.0 and 6.4 seconds compared
with 5.7 seconds for the original.

-- Richard
 
istillshine

Setting the "-O2 -funroll-loops" flags, I compiled my program
again. Using the optimization flags made my program 1 sec faster.
The program compiled without these optimization flags took 72 secs to
run. I used MinGW to compile my program.
Can changing the function to a macro definition make things faster? I
know that calling a function itself, without executing its body, takes
a tiny amount of time. Suppose it took 1e-5 secs per call: calling
it 1,000,000 times would take 10 secs.

Elsewhere you said that there are a million calls, and that the number
of elements is less than 5,000. 72 seconds sounds like a long time for
that - about 10 billion floating point operations - on a modern
computer. I used the program below to test it on my Mac (which uses
gcc). It takes about 23 seconds with no optimisation, and about 5.7
seconds with -O or -O2; -funroll-loops doesn't make much difference.

If you're using a current machine, I would reconsider your conclusion
that it's the floating point calculation loop that is taking all the
time.

#include <stdio.h>
#include <stdlib.h>

double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < sz; i++)
        sum += a[i]*b[i];

    return sum;
}

int main(int argc, char **argv)
{
    int size = atoi(argv[1]);
    int rep = atoi(argv[2]);
    double *a = malloc(size * sizeof(*a));
    double *b = malloc(size * sizeof(*b));
    int i;
    double res = 0;

    for (i = 0; i < size; i++)
    {
        a[i] = random();
        b[i] = random();
    }

    for (i = 0; i < rep; i++)
        res += foo(a, b, size);

    printf("res=%g\n", res);

    return 0;
}

-- Richard


Thanks. The foo I posted previously was a simplified version of my
real foo. The real foo function, which computes a function's value
and its first derivative given x, is shown below. The foo function
is defined as

\sum_{i=0}^{sz-1} [ log(exp0*f0[i] + exp1*f1[i] + exp2*f2[i]) -
log(exp0*g0[i] + exp1*g1[i] + exp2*g2[i]) ].


typedef struct vector_t {
    double *data;
    unsigned long size;
} Vector;


/*
* Computes function value and its first derivative.
* x and df both have size 2.
*/
#define exp0 2.71828
static double foo(Vector *x, Vector *df, double *fip0, double *fip1,
                  double *fip2,
                  double *firp0, double *firp1, double *firp2,
                  unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;

    fip0 = para->prob0;
    fip1 = para->prob1;
    fip2 = para->prob2;
    firp0 = para->prob3;
    firp1 = para->prob4;
    firp2 = para->prob5;

    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);

    for (i = 0; i < sz; i++) {
        x0 = exp0*firp0[i];
        x1 = exp1*firp1[i];
        x2 = exp2*firp2[i];
        z1 = x0 + x1 + x2;

        y0 = exp0*fip0[i];
        y1 = exp1*fip1[i];
        y2 = exp2*fip2[i];
        z2 = y0 + y1 + y2;

        sum += log(z1) - log(z2);
        df->data[0] += x1/z1 - y1/z2;
        df->data[1] += x2/z1 - y2/z2;
    }

    return sum;
}

I tried Paul Hsieh's idea. But I got almost the same running time.
 
Willem

(e-mail address removed) wrote:
) Thanks. The foo I posted previously was a simplified version of my
) real foo. The real foo function, which
) computes a function's value and its first derivative given x, is
) below. The foo function is defined as

Ah, so you're asking how to optimize one function, when in fact
the function you want optimized is a *very different* one ?

Jeez.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
istillshine

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.
double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < sz; i++)
        sum += a[i]*b[i];
    return sum;
}

Suppose in my application I know that a[i]*b[i] can only take a
limited set of values. For example, a[i] can only be 0, 0.25, 0.5,
or 0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
[...]


Then a[] and b[] should probably be arrays of unsigned char,
encoding the a's and b's as four and five times their nominal
values, respectively. You compute the sum of products as before
(but using an integer sum), and then divide the total by 20.0
before returning it. This might produce speed gains on two
fronts: first, integer multiplications and additions are faster
than F-P operations on many machines, and second, fetching one-
eighth (probably) as much data from memory will save time on
many machines.

Still, I think your first step should be to turn up the
optimization level on your compiler; it seems you've been using
no optimization at all. If that's not enough (or maybe even
if it *is* enough), try some of the higher-level algorithmic
suggestions various people have mentioned: arrange to call
foo() less often, or on shorter vectors, or both.


Thanks. The foo I posted previously was a simplified version of my
real foo. The real foo function, which computes a function's value
and its first derivative, given x, is shown below. The foo function
is defined as

\sum_{i=0}^{sz-1} [ log(exp0*f0[i] + exp1*f1[i] + exp2*f2[i]) -
log(exp0*g0[i] + exp1*g1[i] + exp2*g2[i]) ].

I tried Paul Hsieh's idea. But I got almost the same running time.

-----------------------------------------------------------------------
typedef struct vector_t {
    double *data;
    unsigned long size;
} Vector;

/*
* Computes function value and its first derivative.
* x and df both have size 2.
*/
#define exp0 2.71828
static double foo(Vector *x, Vector *df,
                  double *f0, double *f1, double *f2,
                  double *g0, double *g1, double *g2,
                  unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;

    f0 = para->prob0;
    f1 = para->prob1;
    f2 = para->prob2;
    g0 = para->prob3;
    g1 = para->prob4;
    g2 = para->prob5;

    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);

    for (i = 0; i < sz; i++) {
        x0 = exp0*g0[i];
        x1 = exp1*g1[i];
        x2 = exp2*g2[i];
        z1 = x0 + x1 + x2;

        y0 = exp0*f0[i];
        y1 = exp1*f1[i];
        y2 = exp2*f2[i];
        z2 = y0 + y1 + y2;

        sum += log(z1) - log(z2);
        df->data[0] += x1/z1 - y1/z2;
        df->data[1] += x2/z1 - y2/z2;
    }

    return sum;
}
 
istillshine

(e-mail address removed) wrote:

) Thanks. The foo I posted previously was a simplified version of my
) real foo. The real foo function, which
) computes a function's value and its first derivative given x, is
) below. The foo function is defined as

Ah, so you're asking how to optimize one function, when in fact
the function you want optimized is a *very different* one ?

Jeez.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

I thought the general problem was the same. I did not want to mess
with too many details.
 
istillshine

Setting the "-O2 -funroll-loops" flags, I compiled my program
again. Using the optimization flags made my program 1 sec faster.
The program compiled without these optimization flags took 72 secs to
run. I used MinGW to compile my program.
Can changing the function to a macro definition make things faster? I
know that calling a function itself, without executing its body, takes
a tiny amount of time. Suppose it took 1e-5 secs per call: calling
it 1,000,000 times would take 10 secs.

Elsewhere you said that there are a million calls, and that the number
of elements is less than 5,000. 72 seconds sounds like a long time for
that - about 10 billion floating point operations - on a modern
computer. I used the program below to test it on my Mac (which uses
gcc). It takes about 23 seconds with no optimisation, and about 5.7
seconds with -O or -O2; -funroll-loops doesn't make much difference.

If you're using a current machine, I would reconsider your conclusion
that it's the floating point calculation loop that is taking all the
time.

#include <stdio.h>
#include <stdlib.h>

double foo(double *a, double *b, int sz)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < sz; i++)
        sum += a[i]*b[i];

    return sum;
}

int main(int argc, char **argv)
{
    int size = atoi(argv[1]);
    int rep = atoi(argv[2]);
    double *a = malloc(size * sizeof(*a));
    double *b = malloc(size * sizeof(*b));
    int i;
    double res = 0;

    for (i = 0; i < size; i++)
    {
        a[i] = random();
        b[i] = random();
    }

    for (i = 0; i < rep; i++)
        res += foo(a, b, size);

    printf("res=%g\n", res);

    return 0;
}

-- Richard


Thanks. The foo I posted previously was a simplified version of my
real foo. The real foo function, which computes a function's value
and its first derivative, given x, is shown below. The foo function
is defined as

\sum_{i=0}^{sz-1} [ log(exp0*f0[i] + exp1*f1[i] + exp2*f2[i]) -
log(exp0*g0[i] + exp1*g1[i] + exp2*g2[i]) ].

I tried Paul Hsieh's idea. But I got almost the same running time.

-----------------------------------------------------------------------
typedef struct vector_t {
    double *data;
    unsigned long size;
} Vector;

/*
* Computes function value and its first derivative.
* x and df both have size 2.
*/
#define exp0 2.71828
static double foo(Vector *x, Vector *df,
                  double *f0, double *f1, double *f2,
                  double *g0, double *g1, double *g2,
                  unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;

    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);

    for (i = 0; i < sz; i++) {
        x0 = exp0*g0[i];
        x1 = exp1*g1[i];
        x2 = exp2*g2[i];
        z1 = x0 + x1 + x2;

        y0 = exp0*f0[i];
        y1 = exp1*f1[i];
        y2 = exp2*f2[i];
        z2 = y0 + y1 + y2;

        sum += log(z1) - log(z2);
        df->data[0] += x1/z1 - y1/z2;
        df->data[1] += x2/z1 - y2/z2;
    }

    return sum;
}
 
Walter Roberson

This might produce speed gains on two
fronts: first, integer multiplications and additions are faster
than F-P operations on many machines,

With exceptions. The MIPS R8000/R10000/R12000 floating point
operations were, if I recall correctly, faster than the
integer operations -- most noticeably so for the R8000.
 
 
jacob navia

I tried Paul Hsieh's idea. But I got almost the same running time.

Told you so.

C is no use here.

Why not try assembler?

:)

Obviously I am an interested party :)
 
Flash Gordon

Please don't quote people's signatures (the bit typically after the
"-- ") unless you are commenting on them.
I thought the general problem was the same. I did not want to mess
with too many details.

How on earth can the problem be the same when the two functions are
completely different? If you had measles, would you send a friend
with a mole on his nose to the doctor to ask how to cure that, and
expect the same treatment to work on your measles?

As evidence, Eric easily spotted what could be a major piece of
optimisation in your (I hope) real code which simply did not exist in
your "sample". I'm not going to bother analysing your code further,
because if this still is not your real code I would just be wasting my
time.
 
jacob navia

Eric said:
(e-mail address removed) wrote:
[...]
sum += log(z1) - log(z2);
[...]
Looks like some low-hanging fruit could be plucked
here: Try sum += log(z1/z2) instead, and see what
happens.

This is a remarkable improvement. The running time is 51 secs now,
almost 30% faster.

Perhaps this may serve as an object lesson: When you
ask how to improve a piece of code, provide the actual
code you want improve and not a half-baked paraphrase.

Yes. It is amazing how much time we would all have saved if
we had had the real code from the beginning...

The first version posted was practically impossible to
optimize.
 
Bart

On Apr 8, 1:14 pm, (e-mail address removed) wrote:

[istillsh's later post didn't make it to google, replying to earlier
post]
#define exp0 2.71828
static double foo(Vector *x, Vector *df, double *fip0, double *fip1,
                  double *fip2,
                  double *firp0, double *firp1, double *firp2,
                  unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;

    fip0 = para->prob0;
    fip1 = para->prob1;
    fip2 = para->prob2;
    firp0 = para->prob3;
    firp1 = para->prob4;
    firp2 = para->prob5;

    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);

    for (i = 0; i < sz; i++) {
        x0 = exp0*firp0[i];
        x1 = exp1*firp1[i];
        x2 = exp2*firp2[i];
        z1 = x0 + x1 + x2;

        y0 = exp0*fip0[i];
        y1 = exp1*fip1[i];
        y2 = exp2*fip2[i];
        z2 = y0 + y1 + y2;

        sum += log(z1) - log(z2);
        df->data[0] += x1/z1 - y1/z2;
        df->data[1] += x2/z1 - y2/z2;
    }

    return sum;
}


I like the way you think the above code has any points of similarity
to sum += a[i]*b[i] !

There's so much maths in there that function call and loop overheads
are an irrelevance.

And what was all that stuff about a = 0, 0.25, 0.5, 0.75?

You might be able to squeeze some extra performance using tight
assembler, but the code will be unmaintainable, if it needs to do all
that.

Otherwise, you need to look more globally. Why call this thing 1000000
times and why have 5000 data points? Also look at the maths operations
involved, maybe they can be tweaked, but they mean nothing to me.

Even if you get a speedup from 72 seconds to 36 seconds, 36 seconds
is not exactly interactive, so it will still be pretty slow.
 
istillshine

I found your advice really useful - really helpful, in fact. And
sorry for giving a simplistic foo; I did not mean to fool you,
because in my opinion the previous one shared some similar patterns
with the later one. You can disagree with me, but I don't have to
agree with you.

There are some other oddities in the code you (finally)
provided, and cleaning them up might get you some further
improvements. For example, six of the nine arguments to the
function are not used at all, so whatever effort the caller
expended to pass those arguments to the function is wasted
and can be done away with.

I don't understand this. All nine arguments were used in my foo
function body.
There are far too many `register'
declarations; many compilers will ignore them, but on those
that do not it may turn out that you'd get better speed with
fewer register variables. And the loop accumulates three
sums, one in a local variable but the other two in memory
accessed via a pointer; using three local variables and a
pair of stores after the loop may well be faster.

I now keep only three register variables, as follows; df0 and df1
hold df->data[0] and df->data[1], as you suggested. By doing this I
gained a 2 sec speedup. Even 2 secs is useful in my application,
since my program will be run on hundreds of different data sets, so
it would save several minutes overall.

register double sum;
register double df0, df1;
 
