How to make it faster?

istillshine

When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}
 
istillshine

When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Making it twice as fast would help a lot.
 
Willem

(e-mail address removed) wrote:
) When I used gprof to see which function consumed most running time, I
) identified the following one. sz was less than 5000 on average, but
) foo had been called about 1,000,000 times. I have tried using
) "register sum = 0.0" and saw some improvement. My question is how to
) improve foo further to make it faster.
)
) double foo(double *a, double *b, int sz)
) {
)
) double sum = 0.0;
) int i;
)
) for (i = 0; i < sz; i++)
) sum += a[i]*b[i];
)
) return sum;
) }

This depends on sooo many factors...
CPU type ? compiler ? platform ? etc, etc...

A first step would be to look at the generated assembly.

One thing you can try is to calculate two sums in the same loop and add
them together at the end, which might improve pipelining on modern CPUs.
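
For example (a sketch of the body of foo, untested):

  double sum0 = 0.0, sum1 = 0.0;
  int i;
  for (i = 0; i + 2 <= sz; i += 2) {
    sum0 += a[i]   * b[i];     /* even-indexed elements */
    sum1 += a[i+1] * b[i+1];   /* odd-indexed elements */
  }
  if (i < sz)                  /* odd sz: one element left over */
    sum0 += a[i] * b[i];
  return sum0 + sum1;          /* combine the partial sums */

The point is that the two accumulators break the dependency chain on a
single sum, so consecutive additions can overlap in the FPU pipeline.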

Another would be to count sz down to zero, but that may even hurt
performance because modern RAM might be optimized for sequential
incremental access.

As a third option, you could try writing it with some extended
instruction sets, such as 3DNow! or SSE or whatever is the latest fad.
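
For instance, an SSE2 version might look roughly like this (a sketch;
assumes an x86 compiler providing <emmintrin.h>, and uses unaligned
loads so the arrays need no special alignment):

#include <emmintrin.h>   /* SSE2 intrinsics */

double foo_sse2(const double *a, const double *b, int sz)
{
  __m128d acc = _mm_setzero_pd();
  double tmp[2], sum;
  int i;

  for (i = 0; i + 2 <= sz; i += 2) {
    __m128d va = _mm_loadu_pd(a + i);          /* two doubles from a */
    __m128d vb = _mm_loadu_pd(b + i);          /* two doubles from b */
    acc = _mm_add_pd(acc, _mm_mul_pd(va, vb)); /* accumulate products */
  }
  _mm_storeu_pd(tmp, acc);
  sum = tmp[0] + tmp[1];                       /* fold the two lanes */
  for (; i < sz; i++)                          /* leftover element */
    sum += a[i] * b[i];
  return sum;
}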

Of course, on different systems, different rules apply, so YMMV.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Walter Roberson

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.
double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;
  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}

There might not be any effective way to make it faster -- possibly
with the compiler options you are using, it is already compiled to
as tight a code as you can feasibly get. But that's going to depend
on the intelligence of the optimizer, and on which (platform-
dependent) flags you passed it.

If you are using C99, then -possibly- using 'restrict' on the
pointer arguments might result in a speed-up.
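
For example (C99 only; restrict promises the compiler that the two
arrays don't overlap, which can enable better scheduling):

double foo(double * restrict a, double * restrict b, int sz);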

Possibly "loop unrolling" might help. Something like, as long as
there are still at least 4 values left to do,
sum += a*b + a[i+1]*b[i+1] + a[i+2]*b[i+2] + a[i+3]*b[i+3];
i+=4;
and when there are fewer left, switch back to adding them individually.
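
Spelled out, the whole function might become something like this
(a sketch, untested):

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i = 0;

  for (; i + 4 <= sz; i += 4)   /* main loop: 4 products per iteration */
    sum += a[i]*b[i] + a[i+1]*b[i+1]
         + a[i+2]*b[i+2] + a[i+3]*b[i+3];
  for (; i < sz; i++)           /* up to 3 leftover elements */
    sum += a[i]*b[i];

  return sum;
}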

The optimal unrolling could depend upon the number of floating point
registers available... which could even depend upon the compile options,
as well as the architecture.

A good optimizer would do loop unrolling.

When you are really pushing vectorized arithmetic, it can help
a lot to know that the elements you are working with do not share
cache lines. Cache line "aliasing" rules get pretty system specific:
the *exact same* CPU in two different systems might have different
cache line widths because they might have been physically populated
with different amounts of cache. And determining the alignment
of a particular address is not officially portable, because
officially addresses don't "mean anything"... though -usually- on
flat memory architectures, addresses can be interpreted as integral
and alignment unofficially checked that way. On segmented
architectures, it depends: some segmented architectures only
allow segments to start on large aligned boundaries, and some
segmented architectures allow segments to start on arbitrary
boundaries (e.g., to prevent array overflow, a compiler could choose
to have a segment describing a char array start on what happened
to be an odd boundary [e.g., a substring] with the hardware-tagged
segment size limited to the size declared at compile time.)

For the kind of operations you are doing, instead of spending a lot
of time implementing your own optimized functions, it is often
a good idea to link against the NAG or BLAS or SCSL libraries:
when they are available for your systems, they are often
optimized to heck and back.
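
For instance, with a CBLAS interface installed (header name and link
flags vary by system), the whole function collapses to a single call
to the standard ddot routine:

#include <cblas.h>

double foo(const double *a, const double *b, int sz)
{
  /* dot product of a and b, stepping through both with stride 1 */
  return cblas_ddot(sz, a, 1, b, 1);
}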


Another poster suggested switching to float instead of double
precision. It is true that on some systems, float is faster
than double; it is also true that on some systems, double
is faster than float. One thing about float is that if you
are on a system with SSE3 and you have a good fast graphics board,
then float (but not double) can be offloaded to the graphics board,
apparently with great improvements in speed. The mechanisms
for triggering that sort of optimization would definitely be
system dependent though!
 
Bartc

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Making it twice as fast would help a lot.


Suggestions have already been made for improving the loop.

Reducing the number of calls from 1,000,000 would help a lot more; for
that, the overall strategy needs to be looked at.

Then there's the data: does sz need to be that high? Is there anything
special about the a and b data (like mostly zeros or ones)?

How accurate does the sum have to be? If a and b change slowly, perhaps
sum only every other pair (then double the result)?

Have you thought about using integers (fixed point)? This might need a
global change.
 
Flash Gordon

Malcolm McLean wrote, On 07/04/08 18:50:
double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Make a and b float arrays, and sum a float too. On some architectures
that will give about a 2 * speedup.


On other architectures it will slow it down.
Of course you'll have to use single
precision throughout the program.
The code is so simple that there is unlikely to be any sort of other
micro-optimisation you can do. However, it may be possible to call foo()
less often by rearranging the algorithm.

Or using a different algorithm.

Malcolm is correct that the algorithm is the best place to start
looking. Before now I've easily had an order of magnitude improvement by
rearranging the algorithm (instead of taking a square root when
processing each pixel, take the square of the number you would compare
against; implement running windows by subtracting a number
corresponding to the old starting pixel and adding one corresponding to
the new end pixel, instead of recalculating the entire window; etc.).
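
To illustrate the running-window trick with a sketch (hypothetical
names, not the OP's code; assumes 1 <= w <= n and out has n-w+1 slots):

/* Sums over every width-w window of x[0..n-1]: O(n) total work
   instead of the O(n*w) of recomputing each window from scratch. */
void window_sums(const double *x, int n, int w, double *out)
{
  double win = 0.0;
  int j;

  for (j = 0; j < w; j++)
    win += x[j];              /* sum of the first window */
  out[0] = win;

  for (j = w; j < n; j++) {
    win += x[j] - x[j - w];   /* add the new element, drop the old */
    out[j - w + 1] = win;     /* sum of x[j-w+1] .. x[j] */
  }
}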

If you need help with your algorithm then comp.programming or a group
specific to your type of algorithm would be the place to post (there are
groups dedicated to graphics for example).
 
jacob navia

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


How to make it faster?

You pay me one day of consulting and I will make it
twice as fast in assembler.

x86, PowerPC, SPARC. (Each platform is
one day of consulting.)

C can't go any further.
 
Fei Liu

When I used gprof to see which function consumed most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Nobody has mentioned "loop unrolling"? Intel compilers also support
embedded prefetch directives if your target platform is Intel-based.

Fei
 
Walter Roberson

Nobody has mentioned "loop unrolling"?

Eric, Willem, and I all mentioned loop unrolling; Eric and I
mentioned it directly under that name, and Willem described the
technique without naming it.
 
istillshine

When I used gprof to see which function consumed the most running time, I
identified the following one. sz was less than 5000 on average, but
foo had been called about 1,000,000 times. I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.

double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;

  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}


Suppose in my application I know that a[i]*b[i] can only take a
limited set of values. For example, a[i] can only be 0, 0.25, 0.5, or
0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
I thought I could pre-compute their products and fetch them given
certain a[i] and b[i]. For example,

if (a[i] == 0.25 && b[i] == 0.2)
  sum += 0.05;
else if (a[i] == 0.25 && b[i] == 0.4)
  sum += 0.1;
....

But the above solution made the running time considerably longer. I
speculated that conditional statements were more expensive than
multiplication.

Are better solutions possible?
 
Richard Tobin

Suppose in my application I know that a[i]*b[i] can only take a
limited set of values. For example, a[i] can only be 0, 0.25, 0.5, or
0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
I thought I could pre-compute their products and fetch them given
certain a[i] and b[i]. For example,

if (a[i] == 0.25 && b[i] == 0.2)
  sum += 0.05;
else if (a[i] == 0.25 && b[i] == 0.4)
  sum += 0.1;
...

But the above solution made the running time considerably longer. I
speculated that conditional statements were more expensive than
multiplication.


Yes, that's quite likely. I would expect a modern compiler to generate
good code for a simple loop of multiplications.

If the values are from some small set, perhaps it would be better not
to store them as floating point numbers. Can you use integers instead
by multiplying by a constant?
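
For instance, with the quarters and fifths quoted above, scaling a by 4
and b by 5 gives exact small integers. A sketch, assuming the data can
be generated in that form (ia and ib are hypothetical pre-scaled arrays):

/* ia[i] == 4*a[i], in 0..3; ib[i] == 5*b[i], in 0..4 */
double foo_fixed(const int *ia, const int *ib, int sz)
{
  long isum = 0;
  int i;

  for (i = 0; i < sz; i++)
    isum += ia[i] * ib[i];    /* exact integer products */

  return isum / 20.0;         /* undo the 4*5 scaling once at the end */
}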

Or you could store the values as small integers even if they are not
simple multiples, and have a 2-dimensional array in which you look up
the answers.

Or perhaps you could split your problem into multiple threads.

Of course, if your sequence of multiplications is really so huge that
it is important to speed it up, you should also reconsider your
algorithm.

-- Richard
 
Richard Tobin

Presumably you mean "register double sum = 0.0".

I'm surprised that register makes a difference. Are you telling
the compiler to optimise?

-- Richard
 
Wolfgang Draxinger

Bartc said:
How accurate does the sum have to be? If a,b change slowly,
perhaps sum only every other pair (then double the result)?

Bad idea if the OP's intention was implementing an inner product
(aka scalar product, aka dot product).

I know, it's a very dirty trick, but may I suggest Duff's
Device? Tom Duff invented it for quite a similar situation.

Have a look at the Wikipedia page for how it's done:
http://en.wikipedia.org/w/index.php?title=Duff's_device&oldid=201047882
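
Applied to this loop it might look something like the following sketch
(untested, and a good optimizer may well beat it, so measure):

double foo_duff(const double *a, const double *b, int sz)
{
  double sum = 0.0;
  int i = 0;

  if (sz > 0) {
    int n = (sz + 3) / 4;     /* number of passes, 4 products per pass */
    switch (sz % 4) {
    case 0: do { sum += a[i]*b[i]; i++;
    case 3:      sum += a[i]*b[i]; i++;
    case 2:      sum += a[i]*b[i]; i++;
    case 1:      sum += a[i]*b[i]; i++;
            } while (--n > 0);
    }
  }
  return sum;
}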

Wolfgang Draxinger
 
istillshine

Presumably you mean "register double sum = 0.0".
Yes.


I'm surprised that register makes a difference. Are you telling
the compiler to optimise?

-- Richard

I did not set the -O2 flag.
 
Keith Thompson

Presumably you mean "register double sum = 0.0".

I'm surprised that register makes a difference. Are you telling
the compiler to optimise?

If he really did use "register sum = 0.0;", assuming a pre-C99
compiler, then he might have seen some speed-up because sum is
implicitly an int rather than a double. (Or it might slow it down,
depending on the relative speed of various operations.)
 
Keith Thompson

I did not set -O2 flag.

Or any other optimization flags?

You should use the maximum optimization level offered by your compiler
and measure the resulting performance before you even *think* about
optimizing the source code.
 
user923005

Or any other optimization flags?

You should use the maximum optimization level offered by your compiler
and measure the resulting performance before you even *think* about
optimizing the source code.

And you should run a profile after you have decided that there is a
definite need to speed things up.

If we do not know where the program is slow, we might speed up a
routine by a factor of 1000 and yet the speedup for the entire
application might not even be discernible with careful measurement.
 
Bart

When I used gprof to see which function consumed most running time, I
identified the following one.  sz was less than 5000 on average, but
foo had been called about 1,000,000 times.  I have tried using
"register sum = 0.0" and saw some improvement. My question is how to
improve foo further to make it faster.
double foo(double *a, double *b, int sz)
{
  double sum = 0.0;
  int i;
  for (i = 0; i < sz; i++)
    sum += a[i]*b[i];

  return sum;
}

Suppose in my application I know that a[i]*b[i] can only take a
limited set of values.  For example, a[i] can only be 0, 0.25, 0.5, or
0.75; b[i] can only be 0, 0.2, 0.4, 0.6, or 0.8.
I thought I could pre-compute their products and fetch them given
certain a[i] and b[i].  For example,

if (a[i] == 0.25 && b[i] == 0.2)
  sum += 0.05;
else if (a[i] == 0.25 && b[i] == 0.4)
  sum += 0.1;
...

But the above solution made the running time considerably longer.  I
speculated that conditional statements were more expensive than
multiplication.


Messing about with logic and floating point compares is likely slower
than just blindly doing your original calculation.

How fast can you convert a from 0.0, 0.25, 0.5, 0.75 to 0, 1, 2, 3 (or
0, 25, 50, 75)? And similarly b?

With indices of 0,1,2,3 and 0,1,2,3,4 you can just index into a 2-dim
array for the product.

This array can be double, or int (convert to double at end of loop).

Or represent a,b as fractions: 0/4 to 3/4, 0/5 to 4/5 (you only need
the top part). Plenty of scope if your data is really that limited.

But it's necessary to look at how the data is generated: best to
directly generate the fast form otherwise you will waste time
converting.
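
Concretely, the table idea might look like this sketch (assumes the
data is already generated as the small indices described above,
ia[i] in 0..3 and ib[i] in 0..4):

/* prod[x][y] == (x/4.0) * (y/5.0); filled in once at startup */
static double prod[4][5];

void init_prod(void)
{
  int x, y;
  for (x = 0; x < 4; x++)
    for (y = 0; y < 5; y++)
      prod[x][y] = (x / 4.0) * (y / 5.0);
}

double foo_table(const unsigned char *ia, const unsigned char *ib, int sz)
{
  double sum = 0.0;
  int i;
  for (i = 0; i < sz; i++)
    sum += prod[ia[i]][ib[i]];   /* table fetch instead of multiply */
  return sum;
}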
 
