How to make it faster?


Ben Bacarisse

I found your advice really useful. Magic advice. Really
helpful. And sorry for giving a simplistic foo; I did not mean
to fool you, because in my opinion the previous version shared some
similar patterns with the later one. You can disagree with me, but I
don't have to agree with you.



I don't understand this. All nine arguments were used in my foo
function body.

Yes, but not as passed, so why pass them? Here is the code:

static double foo(Vector *x, Vector *df,
                  double *f0, double *f1, double *f2,
                  double *g0, double *g1, double *g2,
                  unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;

    f0 = para->prob0;
    f1 = para->prob1;
    f2 = para->prob2;
    g0 = para->prob3;
    g1 = para->prob4;
    g2 = para->prob5;

You immediately assign to these parameters. This has no effect in the
calling function (if you intended it to, you have more work to do) and
any value calculated and passed is lost, so why pass them?

    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);

    for (i = 0; i < sz; i++) {

        x0 = exp0*g0;
        x1 = exp1*g1;
        x2 = exp2*g2;
        z1 = x0 + x1 + x2;

You could try:

z1 = exp0*g0 + x1 + x2;

here since x0 is not used later. Many compilers will have eliminated
the store for you already so you won't gain, but it is worth a try.

        y0 = exp0*f0;
        y1 = exp1*f1;
        y2 = exp2*f2;
        z2 = y0 + y1 + y2;

Same here.

        sum += log(z1) - log(z2);
        df->data[0] += x1/z1 - y1/z2;
        df->data[1] += x2/z1 - y2/z2;

You've already done Eric's other suggestion (using locals for these
sums) so I won't repeat it.

    }
    return sum;
}
 

istillshine

True, but a 30% speedup from a one-line change seems like
evidence that lends more support to my view than to yours ...

True. I have a sense of admiration for you.
 

istillshine

for (i = 0; i < sz; i++) {

    x0 = exp0*g0;
    x1 = exp1*g1;
    x2 = exp2*g2;
    z1 = x0 + x1 + x2;

You could try:

z1 = exp0*g0 + x1 + x2;

here since x0 is not used later. Many compilers will have eliminated
the store for you already so you won't gain, but it is worth a try.

    y0 = exp0*f0;
    y1 = exp1*f1;
    y2 = exp2*f2;
    z2 = y0 + y1 + y2;

Same here.

    sum += log(z1) - log(z2);
    df->data[0] += x1/z1 - y1/z2;
    df->data[1] += x2/z1 - y2/z2;


1 sec faster, on average.
 

istillshine

And what was all that stuff about a = 0, 0.25, 0.5, 0.75 all about?


In the following function foo, the values of arguments g0, g1 and g2
are limited to 0, 0.25, 0.5, or 0.75. But I guess this property has
no use because exp1 and exp2 change in every call. However, exp0 does
remain the same for all calls.


/*
* Computes function value and its first derivative.
* x and df both have size 2.
*/
#define exp0 2.71828
static double foo(Vector *x, Vector *df,
                  double *f0, double *f1, double *f2,
                  double *g0, double *g1, double *g2,
                  unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;

    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);

    for (i = 0; i < sz; i++) {
        x0 = exp0*g0;
        x1 = exp1*g1;
        x2 = exp2*g2;
        z1 = x0 + x1 + x2;

        y0 = exp0*f0;
        y1 = exp1*f1;
        y2 = exp2*f2;
        z2 = y0 + y1 + y2;

        sum += log(z1) - log(z2);
        df->data[0] += x1/z1 - y1/z2;
        df->data[1] += x2/z1 - y2/z2;
    }

    return sum;
}
 

Hallvard B Furuseth

Eric said:
[...]
sum += log(z1) - log(z2);
[...]

Looks like some low-hanging fruit could be plucked
here: Try sum += log(z1/z2) instead, and see what
happens.

Or compute exp(sum) instead of sum, if that doesn't overflow:
double exp_sum = 1.0;
loop: ... exp_sum *= z1/z2; ...
return log(exp_sum);


Can also get rid of the multiplications with exp0:
Instead of exp1, exp2, x0, ... z2, keep variables
with values (current exp1)/exp0 ... (current z2)/exp0.

Both suggestions affect the accuracy of the computation (in which
direction I couldn't guess), but since the program says
#define exp0 2.71828
which I presume is e, I don't suppose the OP is too concerned with
accuracy anyway.
 

istillshine

Or compute exp(sum) instead of sum, if that doesn't overflow:
double exp_sum = 1.0;
loop: ... exp_sum *= z1/z2; ...
return log(exp_sum);

The running time has dropped to 19 secs now. Compared with the original
71 secs, that is more than 3 times faster. It looks like log() is an
expensive function to call. The remaining problem is how to
handle overflow. I think "long double exp_sum = 1.0" is safer than
"double exp_sum = 1.0", but still not very safe.
Can also get rid of the multiplications with exp0:
Instead of exp1, exp2, x0, ... z2, keep variables
with values (current exp1)/exp0 ... (current z2)/exp0.

I don't understand this.
which I presume is e, I don't suppose the OP is too concerned with
accuracy anyway.

Six digits after the decimal point are accurate enough for my application.
 

Hallvard B Furuseth

I said:
Can also get rid of the multiplications with exp0:
Instead of exp1, exp2, x0, ... z2, keep variables
with values (current exp1)/exp0 ... (current z2)/exp0.

I.e.
exp1 = exp(x->data[0] - 1.0);
exp2 = exp(x->data[1] - 1.0);
if exp0 is supposed to be e.

Yet another point: I assume the other pointer arguments will not point
into df->data? The compiler cannot know that. So when the loop updates
df->data[], the compiler cannot know that g0, g1, etc do not change.
That may prevent it from optimizing the accesses to these variables.

So compute the df->data values in local variables and copy those into
df->data afterwards. Then the compiler can treat the other variables
as read-only during the loop. (Assuming it knows log() has no side
effects, anyway.)

Though C99 has a way to tell the compiler that, at least sometimes:
use the 'restrict' keyword on the arguments. "double *restrict f0"
means something like: during the execution of the function, the object
f0 points at will not be accessed through pointers other than f0.
The semantics are a bit hairy; I haven't looked too closely at them yet.
 

istillshine

The running time has dropped to 19 secs now. Compared with the original
71 secs, that is more than 3 times faster. It looks like log() is an
expensive function to call. The remaining problem is how to
handle overflow. I think "long double exp_sum = 1.0" is safer than
"double exp_sum = 1.0", but still not very safe.

I came up with a solution.

exp_sum *= (z1/z2)*HUGE_VAL;
return sz*log(HUGE_VAL) + log(exp_sum)

How large should this HUGE_VAL be?
 

Hallvard B Furuseth

I don't understand this.

exp1 = exp(x->data[0] - 1.0);
exp2 = exp(x->data[1] - 1.0);
for (i = 0; i < sz; i++) {
    x0 = g0;
    y0 = f0;
    /* no other changes in the loop */
}

Variables <exp,x,y,z><0,1,2> now have values (original value / exp0).
sum and df->data do not change (except due to rounding), since
they get values (foo/exp0) / (bar/exp0) instead of foo/bar.
 

istillshine

The running time has dropped to 19 secs now. Compared with the original
71 secs, that is more than 3 times faster. It looks like log() is an
expensive function to call. The remaining problem is how to
handle overflow. I think "long double exp_sum = 1.0" is safer than
"double exp_sum = 1.0", but still not very safe.


I came up with a solution.

exp_sum *= (z1/z2)/HUGE_VAL;

return sz*log(HUGE_VAL) + log(exp_sum);

How large should this HUGE_VAL be?
 

Hallvard B Furuseth

I came up with a solution.
exp_sum *= (z1/z2)*HUGE_VAL;
return sz*log(HUGE_VAL) + log(exp_sum)

How large should this HUGE_VAL be?

Depends on your data set and the range of result values. Looks to me
like if z1/z2 are normally about the same size, HUGE_VAL should be 1...
I have little experience with numerical programming, but I suppose a
good value would be one which normally yields exp_sum == ca 1, i.e.
HUGE_VAL = exp(result/sz). But if you need this, maybe that means
the data set is too variable and you had better do log() in the loop.
 

Hallvard B Furuseth

exp_sum *= (z1/z2)*HUGE_VAL;
return sz*log(HUGE_VAL) + log(exp_sum)

Eh. I'm getting tired, but shouldn't that be exp_sum *= (z1/z2) /
HUGE_VAL or return log(exp_sum) - sz*log(HUGE_VAL)?

And a similar error carried over to my answer. (The idea was to
try to collect a normal value of the answer in HUGE_VAL so that
log(exp_sum) returns approximately 0 for a normal data set.)
 

istillshine

I don't understand this.

exp1 = exp(x->data[0] - 1.0);
exp2 = exp(x->data[1] - 1.0);
for (i = 0; i < sz; i++) {
    x0 = g0;
    y0 = f0;
    /* no other changes in the loop */
}

Variables <exp,x,y,z><0,1,2> now have values (original value / exp0).
sum and df->data do not change (except due to rounding), since
they get values (foo/exp0) / (bar/exp0) instead of foo/bar.


Got it. So smart.
 

Paul Hsieh

And what was all that stuff about a = 0, 0.25, 0.5, 0.75 all about?


In the following function foo, the values of arguments g0, g1 and g2
are limited to 0, 0.25, 0.5, or 0.75. But I guess this property has
no use because exp1 and exp2 change in every call. However, exp0 does
remain the same for all calls.

/*
* Computes function value and its first derivative.
* x and df both have size 2.
*/
#define exp0 2.71828
static double foo(Vector *x, Vector *df,
                  double *f0, double *f1, double *f2,
                  double *g0, double *g1, double *g2,
                  unsigned long sz)
{
    unsigned long i;
    register double sum;
    register double x0, x1, x2, y0, y1, y2;
    register double z1, z2;
    register double exp1, exp2;

    sum = 0.0;
    df->data[0] = 0.0;
    df->data[1] = 0.0;
    exp1 = exp(x->data[0]);
    exp2 = exp(x->data[1]);

    for (i = 0; i < sz; i++) {
        x0 = exp0*g0;
        x1 = exp1*g1;
        x2 = exp2*g2;
        z1 = x0 + x1 + x2;

        y0 = exp0*f0;
        y1 = exp1*f1;
        y2 = exp2*f2;
        z2 = y0 + y1 + y2;

        sum += log(z1) - log(z2);


Oh for crying out loud!!!! Remove the above line with two calls to
the transcendental function log() and substitute this:

sumpn *= z1;
sumpd *= z2;

(after declaring sumpn and sumpd and initializing them to 1.0 outside
the loop)
df->data[0] += x1/z1 - y1/z2;
df->data[1] += x2/z1 - y2/z2;

This will be computed faster as:

df->data[0] += (x1*z2 - y1*z1) / (z1 * z2);
df->data[1] += (x2*z2 - y2*z1) / (z1 * z2);

Then add this:

sum = log (sumpn/sumpd);
return sum;

}

If you get overflows, then in the inner loop use:

sump *= z1/z2;

and at the end just compute sum as log(sump).

The two calls to log in your inner loop are brutal (they are about 200
clocks each.) You need to move your calls to log to the outside of
your loop by any means necessary.

You want to *avoid* performing divides in your inner loop if you can
help it. Overflows may prevent you, so I have shown you an
alternative with a divide in the inner loop. However, one thing you
can do, is accumulate the products for a while, then "normalize them"
after, say, 100 iterations, by dividing both by sumpd (i.e., dividing
sumpn by sumpd, and setting sumpd to 1.)
 

istillshine

df->data[0] += x1/z1 - y1/z2;
df->data[1] += x2/z1 - y2/z2;

This will be computed faster as:

df->data[0] += (x1*z2 - y1*z1) / (z1 * z2);
df->data[1] += (x2*z2 - y2*z1) / (z1 * z2);

Works. Saved 3 secs.

The two calls to log in your inner loop are brutal (they are about 200
clocks each.)

Good to know this.

You want to *avoid* performing divides in your inner loop if you can
help it. Overflows may prevent you, so I have shown you an
alternative with a divide in the inner loop. However, one thing you
can do, is accumulate the products for a while, then "normalize them"
after, say, 100 iterations, by dividing both by sumpd (i.e., dividing
sumpn by sumpd, and setting sumpd to 1.)

#define ITER 5


if (i % ITER == 0) {
    sumpn /= sumpd;
    sumpd = 1.0;
}

Combined with the previous 3 secs' saving, the running time has
dropped from 71 secs to 11 secs. Thank you. I really learned
something new and useful.
 

istillshine

Combined with the previous 3 secs' saving, the running time has
dropped from 71 secs to 11 secs. Thank you. I really learned
something new and useful.

After setting the -O2 flag when compiling, it only took 7 secs. The
running time is satisfactory. Your ideas are indispensable.
 

Paul Hsieh

df->data[0] += x1/z1 - y1/z2;
df->data[1] += x2/z1 - y2/z2;
This will be computed faster as:
df->data[0] += (x1*z2 - y1*z1) / (z1 * z2);
df->data[1] += (x2*z2 - y2*z1) / (z1 * z2);

Works. Saved 3 secs.
The two calls to log in your inner loop are brutal (they are about 200
clocks each.)

Good to know this.
You want to *avoid* performing divides in your inner loop if you can
help it. Overflows may prevent you, so I have shown you an
alternative with a divide in the inner loop. However, one thing you
can do, is accumulate the products for a while, then "normalize them"
after, say, 100 iterations, by dividing both by sumpd (i.e., dividing
sumpn by sumpd, and setting sumpd to 1.)

#define ITER 5

Change this to:

#define ITER 64

Just trust me, and try it out.
if (i % ITER == 0) {
    sumpn /= sumpd;
    sumpd = 1.0;
}

Combined with the previous 3 secs' saving, the running time has
dropped from 71 secs to 11 secs. Thank you. I really learned
something new and useful.

Glad to hear it. Transcendental calls are the worst in terms of
performance. After those come divides and modulos. After those come
branch mis-predictions. If you can optimize all that, you will at
least be within 2x-4x of optimal (barring algorithm and IO
improvements) pretty much all the time.
 

Keith Thompson

After setting the -O2 flag when compiling, it only took 7 secs. The
running time is satisfactory. Your ideas are indispensable.

Check your compiler's documentation. It's likely that "-O2" is not
the highest optimization level available. (gcc supports at least
"-O3", plus a number of other options for specific optimizations that
aren't necessarily included in "-ON" for whatever value of N.)

The details are implementation-specific, so we can't necessarily help
you decide which option(s) to use. Read your compiler's
documentation.
 

istillshine

Change this to:

#define ITER 64

Just trust me, and try it out.

I tried using "#define ITER 64". Overflow occurred with this setting:
I got "1.#J", and I don't know what "1.#J" stands for. When I changed
the type of sump and sumd to long double, the problem disappeared.
The program saved 1 sec because of the larger ITER.



I also tried another way:

#define ITER 64

register double sum = 0.0;

if (i % ITER == 0) {
    sum += log(prodn/prodd); /* introducing log, but not so frequent */
    prodn = 1.0;
    prodd = 1.0;
}

This way, the running time remained the same: 11 secs before setting
-O2, and 7 secs after setting -O2. However, the return value was a
little different.
 
