How to make it faster?

istillshine

When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}
 
istillshine

When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Making it twice as fast would help a lot.
 
Willem

(e-mail address removed) wrote:
) When I used gprof to see which function consumed most running time, I
) identified the following one. sz was less than 5000 on average, but
) foo had been called about 1,000,000 times. I have tried using
) "register sum = 0.0" and saw some improvement. My question is how to
) improve foo further to make it faster.
)
) double foo(double *a, double *b, int sz)
) {
)
) double sum = 0.0;
) int i;
)
) for (i = 0; i < sz; i++)
) sum += a[i]*b[i];
)
) return sum;
) }

This depends on sooo many factors...
CPU type ? compiler ? platform ? etc, etc...

A first step would be to look at the generated assembly.

One thing you can try is to calculate two sums in the same loop and add
them together at the end, which might improve pipelining on modern CPUs.
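
For example (a sketch of the body of foo, untested):

  double sum0 = 0.0, sum1 = 0.0;
  int i;
  for (i = 0; i + 2 <= sz; i += 2) {
    sum0 += a[i]   * b[i];     /* even-indexed elements */
    sum1 += a[i+1] * b[i+1];   /* odd-indexed elements */
  }
  if (i < sz)                  /* odd sz: one element left over */
    sum0 += a[i] * b[i];
  return sum0 + sum1;          /* combine the partial sums */

The point is that the two accumulators break the dependency chain on a
single sum, so consecutive additions can overlap in the FPU pipeline.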

Another would be to count sz down to zero, but that may even hurt
performance because modern RAM might be optimized for sequential
incremental access.

As a third option, you could try writing it with some extended
instruction sets, such as 3DNow! or SSE or whatever is the latest fad.
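
For instance, an SSE2 version might look roughly like this (a sketch;
assumes an x86 compiler providing <emmintrin.h>, and uses unaligned
loads so the arrays need no special alignment):

#include <emmintrin.h>   /* SSE2 intrinsics */

double foo_sse2(const double *a, const double *b, int sz)
{
  __m128d acc = _mm_setzero_pd();
  double tmp[2], sum;
  int i;

  for (i = 0; i + 2 <= sz; i += 2) {
    __m128d va = _mm_loadu_pd(a + i);          /* two doubles from a */
    __m128d vb = _mm_loadu_pd(b + i);          /* two doubles from b */
    acc = _mm_add_pd(acc, _mm_mul_pd(va, vb)); /* accumulate products */
  }
  _mm_storeu_pd(tmp, acc);
  sum = tmp[0] + tmp[1];                       /* fold the two lanes */
  for (; i < sz; i++)                          /* leftover element */
    sum += a[i] * b[i];
  return sum;
}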

Of course, on different systems, different rules apply, so YMMV.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Walter Roberson

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.
double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;
  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}

There might not be any effective way to make it faster -- possibly
with the compiler options you are using, it is already compiled to
as tight a code as you can feasibly get. But that's going to depend
on the intelligence of the optimizer, and on which (platform-
dependent) flags you passed it.

If you are using C99, then -possibly- using 'restrict' on the
pointer arguments might result in a speed-up.
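
For example (C99 only; restrict promises the compiler that the two
arrays don't overlap, which can enable better scheduling):

double foo(double * restrict a, double * restrict b, int sz);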

Possibly "loop unrolling" might help. Something like, as long as
there are still at least 4 values left to do,
sum += a*b + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
i+=4;
and when there are fewer left, switch back to adding them individually.
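
Spelled out, the whole function might become something like this
(a sketch, untested):

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i = 0;

  for (; i + 4 <= sz; i += 4)   /* main loop: 4 products per iteration */
    sum += a[i]*b[i] + a[i+1]*b[i+1]
         + a[i+2]*b[i+2] + a[i+3]*b[i+3];
  for (; i < sz; i++)           /* up to 3 leftover elements */
    sum += a[i]*b[i];

  return sum;
}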

The optimal unrolling could depend upon the number of floating point
registers available... which could even depend upon the compile options,
as well as the architecture.

A good optimizer would do loop unrolling.

When you are really pushing vectorized arithmetic, it can help
a lot to know that the elements you are working with do not share
cache lines. Cache line "aliasing" rules get pretty system specific:
the *exact same* CPU in two different systems might have different
cache line widths because they might have been physically populated
with different amounts of cache. And determining the alignment
of a particular address is not officially portable, because
officially addresses don't "mean anything"... though -usually- on
flat memory architectures, addresses can be interpreted as integral
and alignment unofficially checked that way. On segmented
architectures, it depends: some segmented architectures only
allow segments to start on large aligned boundaries, and some
segmented architectures allow segments to start on arbitrary
boundaries (e.g., to prevent array overflow, a compiler could choose
to have a segment describing a char array start on what happened
to be an odd boundary [e.g., a substring] with the hardware-tagged
segment size limited to the size declared at compile time.)

For the kind of operations you are doing, instead of spending a lot
of time implementing your own optimized functions, it is often
a good idea to link against the NAG or BLAS or SCSL libraries:
when they are available for your systems, they are often
optimized to heck and back.
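
For instance, with a CBLAS interface installed (header name and link
flags vary by system), the whole function collapses to a single call
to the standard ddot routine:

#include <cblas.h>

double foo(const double *a, const double *b, int sz)
{
  /* dot product of a and b, stepping through both with stride 1 */
  return cblas_ddot(sz, a, 1, b, 1);
}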


Another poster suggested switching to float instead of double
precision. It is true that on some systems, float is faster
than double; it is also true that on some systems, double
is faster than float. One thing about float is that if you
are on a system with SSE3 and you have a good fast graphics board,
then float (but not double) can be offloaded to the graphics board,
apparently with great improvements in speed. The mechanisms
for triggering that sort of optimization would definitely be
system dependent though!
 
Bartc

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Making it twice as fast would help a lot.


Suggestions have already been made for improving the loop.

Reducing the number of calls from 1,000,000 would help a lot more; for
that, the overall strategy needs to be looked at.

Then there's the data: does sz need to be that high? Is there anything
special about the a and b data (like mostly zeros or ones)?

How accurate does the sum have to be? If a and b change slowly, perhaps
sum only every other pair (then double the result)?

Have you thought about using integers (fixed point)? This might need a
global change.
 
Flash Gordon

Malcolm McLean wrote, On 07/04/08 18:50:
double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Make a and b float arrays, and sum a float too. On some architectures
that will give about a 2 * speedup.


On other architectures it will slow it down.
Of course you'll have to use single
precision throughout the program.
The code is so simple that there is unlikely to be any sort of other
micro-optimisation you can do. However, it may be possible to call foo()
less often by rearranging the algorithm.

Or using a different algorithm.

Malcolm is correct that the algorithm is the best place to start
looking. Before now I've easily had an order of magnitude improvement by
rearranging the algorithm (instead of taking a square root when
processing each pixel, take the square of the number you would compare
against; implement running windows by subtracting a number
corresponding to the old starting pixel and adding one corresponding to
the new end pixel, instead of recalculating the entire window; etc.).
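
To illustrate the running-window trick with a sketch (hypothetical
names, not the OP's code; assumes 1 <= w <= n and out has n-w+1 slots):

/* Sums over every width-w window of x[0..n-1]: O(n) total work
   instead of the O(n*w) of recomputing each window from scratch. */
void window_sums(const double *x, int n, int w, double *out)
{
  double win = 0.0;
  int j;

  for (j = 0; j < w; j++)
    win += x[j];              /* sum of the first window */
  out[0] = win;

  for (j = w; j < n; j++) {
    win += x[j] - x[j - w];   /* add the new element, drop the old */
    out[j - w + 1] = win;     /* sum of x[j-w+1] .. x[j] */
  }
}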

If you need help with your algorithm then comp.programming or a group
specific to your type of algorithm would be the place to post (there are
groups dedicated to graphics for example).
 
jacob navia

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


How to make it faster?

You pay me one day of consulting and I will make it
twice as fast in assembler.

x86, PowerPC, SPARC. (Each platform is
one day of consulting.)

C can't go any further.
 
Fei Liu

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Nobody has mentioned "loop unrolling"? Intel compilers also support
embedded prefetch directives if your target platform is Intel-based.

Fei
 
Walter Roberson

Nobody has mentioned "loop unrolling"?

Eric, Willem, and I all mentioned loop unrolling; Eric and I
mentioned it directly under that name, and Willem described the
technique without naming it.
 
istillshine

When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Suppose in my application I know that a[i]*b[i] can only take a
limited set of values. For example, a[i] can only be 0, 0.25, 0.5, or
0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
I thought I could pre-compute their products and fetch them given
certain a[i] and b[i]. For example,

if (a[i] == 0.25 && b[i] == 0.2)
  sum += 0.05;
else if (a[i] == 0.25 && b[i] == 0.4)
  sum += 0.1;
....

But the above solution made the running time considerably longer. I
speculated that conditional statements were more expensive than
multiplication.

Are better solutions possible?
 
Richard Tobin

Suppose in my application I know that a[i]*b[i] can only take a
limited set of values. For example, a[i] can only be 0, 0.25, 0.5, or
0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
I thought I could pre-compute their products and fetch them given
certain a[i] and b[i]. For example,

if (a[i] == 0.25 && b[i] == 0.2)
  sum += 0.05;
else if (a[i] == 0.25 && b[i] == 0.4)
  sum += 0.1;
...

But the above solution made the running time considerably longer. I
speculated that conditional statements were more expensive than
multiplication.


Yes, that's quite likely. I would expect a modern compiler to generate
good code for a simple loop of multiplications.

If the values are from some small set, perhaps it would be better not
to store them as floating point numbers. Can you use integers instead
by multiplying by a constant?
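
For instance, with the quarters and fifths quoted above, scaling a by 4
and b by 5 gives exact small integers. A sketch, assuming the data can
be generated in that form (ia and ib are hypothetical pre-scaled arrays):

/* ia[i] == 4*a[i], in 0..3; ib[i] == 5*b[i], in 0..4 */
double foo_fixed(const int *ia, const int *ib, int sz)
{
  long isum = 0;
  int i;

  for (i = 0; i < sz; i++)
    isum += ia[i] * ib[i];    /* exact integer products */

  return isum / 20.0;         /* undo the 4*5 scaling once at the end */
}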

Or you could store the values as small integers even if they are not
simple multiples, and have a 2-dimensional array in which you look up
the answers.

Or perhaps you could split your problem into multiple threads.

Of course, if your sequence of multiplications is really so huge that
it is important to speed it up, you should also reconsider your
algorithm.

-- Richard
 
Richard Tobin

Presumably you mean "register double sum = 0.0".

I'm surprised that register makes a difference. Are you telling
the compiler to optimise?

-- Richard
 
Wolfgang Draxinger

Bartc said:
How accurate does the sum have to be? If a,b change slowly,
perhaps sum only every other pair (then double the result)?

Bad idea if the OP's intention was implementing an inner product
(aka scalar product, aka dot product).

I know, it's a very dirty trick, but may I suggest Duff's
Device? Tom Duff invented it for quite a similar situation.

Have a look at the Wikipedia page for how it's done:
http://en.wikipedia.org/w/index.php?title=Duff's_device&oldid=201047882
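
Applied to this loop it might look something like the following sketch
(untested, and a good optimizer may well beat it, so measure):

double foo_duff(const double *a, const double *b, int sz)
{
  double sum = 0.0;
  int i = 0;

  if (sz > 0) {
    int n = (sz + 3) / 4;     /* number of passes, 4 products per pass */
    switch (sz % 4) {
    case 0: do { sum += a[i]*b[i]; i++;
    case 3:      sum += a[i]*b[i]; i++;
    case 2:      sum += a[i]*b[i]; i++;
    case 1:      sum += a[i]*b[i]; i++;
            } while (--n > 0);
    }
  }
  return sum;
}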

Wolfgang Draxinger
 
istillshine

Presumably you mean "register double sum = 0.0".
Yes.


I'm surprised that register makes a difference. Are you telling
the compiler to optimise?

-- Richard

I did not set the -O2 flag.
 
Keith Thompson

Presumably you mean "register double sum = 0.0".

I'm surprised that register makes a difference. Are you telling
the compiler to optimise?

If he really did use "register sum = 0.0;", assuming a pre-C99
compiler, then he might have seen some speed-up because sum is
implicitly an int rather than a double. (Or it might slow it down,
depending on the relative speed of various operations.)
 
Keith Thompson

I did not set -O2 flag.

Or any other optimization flags?

You should use the maximum optimization level offered by your compiler
and measure the resulting performance before you even *think* about
optimizing the source code.
 
user923005

Or any other optimization flags?

You should use the maximum optimization level offered by your compiler
and measure the resulting performance before you even *think* about
optimizing the source code.

And you should run a profile after you have decided that there is a
definite need to speed things up.

If we do not know where the program is slow, we might speed up a
routine by a factor of 1000 and yet the speedup for the entire
application might not even be discernible with careful measurement.
 
Bart

When I used gprof to see which function consumed most running time, I
identified the following one.  sz was less than 5000 on average, but
foo had been called about 1,000,000 times.  I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.
double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;
  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}

Suppose in my application I know that a[i]*b[i] can only take a
limited set of values.  For example, a[i] can only be 0, 0.25, 0.5, or
0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
I thought I could pre-compute their products and fetch them given
certain a[i] and b[i].  For example,

if (a[i] == 0.25 && b[i] == 0.2)
  sum += 0.05;
else if (a[i] == 0.25 && b[i] == 0.4)
  sum += 0.1;
...

But the above solution made the running time considerably longer.  I
speculated that conditional statements were more expensive than
multiplication.


Messing about with logic and floating point compares is likely slower
than just blindly doing your original calculation.

How fast can you convert a from 0.0, 0.25, 0.5, 0.75 to 0, 1, 2, 3 (or
0, 25, 50, 75)? And similarly b?

With indices of 0,1,2,3 and 0,1,2,3,4 you can just index into a 2-dim
array for the product.

This array can be double, or int (convert to double at end of loop).

Or represent a,b as fractions: 0/4 to 3/4, 0/5 to 4/5 (you only need
the top part). Plenty of scope if your data is really that limited.

But it's necessary to look at how the data is generated: best to
directly generate the fast form otherwise you will waste time
converting.
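
Concretely, the table idea might look like this sketch (assumes the
data is already generated as the small indices described above,
ia[i] in 0..3 and ib[i] in 0..4):

/* prod[x][y] == (x/4.0) * (y/5.0); filled in once at startup */
static double prod[4][5];

void init_prod(void)
{
  int x, y;
  for (x = 0; x < 4; x++)
    for (y = 0; y < 5; y++)
      prod[x][y] = (x / 4.0) * (y / 5.0);
}

double foo_table(const unsigned char *ia, const unsigned char *ib, int sz)
{
  double sum = 0.0;
  int i;
  for (i = 0; i < sz; i++)
    sum += prod[ia[i]][ib[i]];   /* table fetch instead of multiply */
  return sum;
}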
 
