inline power function replacement


chris.fairles

Just want an opinion. I have an algorithm that needs to run as fast as
possible; in fact, it's already too slow. I've made as many algorithmic
changes as I can to reduce the amount of code, so now I'm turning to
micro-optimizations.

One function that gets run a bazillion times is the pow() function from
math.h. However, I've realized that probably 90% of the time the exponent
will be 0, and 99.9% of the time it will lie between -3 and 3.
So I read that switches can sometimes be optimized into jump tables if
the case expressions are nearly contiguous integers, and I coded up this
little ditty:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement? Or is there
perhaps a more efficient solution?
 

Ancient_Hacker

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement? Or is there
perhaps a more efficient solution?

Well, it depends on how smart your compiler is. It's not unimaginable
that a smart compiler would already be checking for the simple cases
and generating inline code for this. The only way to find out is to try
it.

If your compiler is somewhat dumb, it may help to make this a macro,
so no call gets generated.

And oh, you'll get some flamitude from folks who consider any kind of
speculation about run-times to have nothing to do with C.
 

websnarf

Just want an opinion. I have an algorithm that needs to run as fast as
possible; in fact, it's already too slow. I've made as many algorithmic
changes as I can to reduce the amount of code, so now I'm turning to
micro-optimizations.

I'm not sure this is the right newsgroup for such a question. Not a
lot of people in this group can fathom what exactly you are saying
here.
One function that gets run a bazillion times is the pow() function from
math.h. However, I've realized that probably 90% of the time the exponent
will be 0, and 99.9% of the time it will lie between -3 and 3.
So I read that switches can sometimes be optimized into jump tables if
the case expressions are nearly contiguous integers, and I coded up this
little ditty:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement?

Well, there are numerical considerations (i.e., your results will not
be identical to just calling pow()), but your idea is very well
motivated. There really should be a powint(double x, int y) function
in C (one that does even more). The problem of doing this as quickly as
possible for all integers is actually an interesting one.

Oh yeah, and the order of the cases should not matter if the compiler
has transformed this into a jump table. It only matters in the cases
where it doesn't, in which case you should just put the most
common cases as early as possible.
Or is there perhaps a more efficient solution?

Indeed. If y is mostly zero, then you should do this:

#define fastPowInt(x,y) ((0 == (y)) ? 1.0 : mult_power ((x), (y)))

The reason is that the single conditional on top can leverage a
modern microprocessor's branch prediction, and thus costs the
least possible overhead in the vast majority of cases. A
switch() does indeed use jump tables in many compilers, but
unless you are almost always switching to the same case, it has
roughly the same overhead as an indirect function call (calling through
a function pointer). So the surrounding conditional shown above
should improve performance further still.
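A minimal sketch of such a powint, using exponentiation by squaring (the name and interface are hypothetical; C provides no such function):

```c
/* Hypothetical powint: x raised to an integer power by repeated
   squaring -- O(log |y|) multiplications instead of O(|y|). */
static double powint(double x, int y)
{
    /* Negate through unsigned so y == INT_MIN does not overflow. */
    unsigned int n = (y < 0) ? -(unsigned int)y : (unsigned int)y;
    double result = 1.0;

    while (n != 0) {
        if (n & 1u)
            result *= x;   /* fold in the current exponent bit */
        x *= x;            /* square for the next bit */
        n >>= 1;
    }
    return (y < 0) ? 1.0 / result : result;
}
```

For small |y| this costs a handful of multiplies; the switch in mult_power is still likely cheaper for the -3..3 cases, so the two approaches are complementary rather than competing.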
 

Frederick Gotham

(e-mail address removed) posted:
case -3 : return 1.0 / ((x)*(x)*(x));


Why the copious use of parentheses? You're not writing a macro.

return 1.0 / (x*x*x);

Or maybe:

return 1.0 / x / x / x;

Or is there perhaps a more efficient solution?


Presumably "pow" should be checking for integer exponents... ?
 

Eric Sosman

Just want an opinion. I have an algorithm that needs to run as fast as
possible; in fact, it's already too slow. I've made as many algorithmic
changes as I can to reduce the amount of code, so now I'm turning to
micro-optimizations.

One function that gets run a bazillion times is the pow() function from
math.h. However, I've realized that probably 90% of the time the exponent
will be 0, and 99.9% of the time it will lie between -3 and 3.

It sounds like you're approaching things correctly: You've
tried changing the algorithms, rearranging the calculations,
and other such structural changes (that's where the big wins
are), and found yourself still a little short of the required
performance. And you've made measurements that finger pow()
as one of the big time sinks, and you've counted the number
of times it's called, and you've even made an empirical study
of the exponent values.

(You HAVE, right? The "bazillion" and "90%" and "99.9%"
are actual measurements, right? Or at least they're rough
statements backed by measurements, right? If so, fine -- but
if not, go back and make the measurements before proceeding.)

Once you've got measurements that finger pow() as the
culprit and give some actual information about the exponent
distribution, your inline function seems a good way to exploit
the knowledge you've acquired about the program, knowledge
that pow() itself does not have.

Whether this will save "enough" time is something you
can only find out by measuring. You may be able to tell just
by considering your initial measurements, though: If you find
that pow() takes 20% of the running time, replacing pow() with
something that takes *zero* time can only speed up the program
by 25%. If you need a 10% speedup, it might be within reach --
but if you need to double the program's speed, this isn't
going to do the trick.
So I read that switches can sometimes be optimized into jump tables if
the case expressions are nearly contiguous integers, and I coded up this
little ditty:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement? Or is there
perhaps a more efficient solution?

The C language itself says nothing about the speeds of
the constructs you write in it, so questions of efficiency
must always refer to the platform or platforms. Still, it
is quite unlikely that the speed will be affected noticeably
by the order in which the case labels appear in the source
code. ((Also, the speed probably won't be affected if you
get rid of some of those needless parentheses ...))

If your "zero 90% of the time" figure is literally true,
you might (*might*) gain a milliquiver or so by testing for
it before doing the switch:

    if (y == 0)
        return 1.0;
    switch (y) {
        ...
    }

Differences due to this sort of thing are usually "down in
the noise," but a bazillion milliquivers is a hectotwitch
(IIRC), which might be noticeable. Keep in mind, though,
that this "optimization" might actually slow you down --
measure, measure, measure!
 

chris.fairles

I'm not sure this is the right newsgroup for such a question. Not a
lot of people in this group can fathom what exactly you are saying
here.

I see your point. I suppose the heart of this question is "what machine
code does this C code get translated into," which obviously is decided by
the compiler (gcc 4.1.1 in my case).

Actually, this is more of an architecture-specific question, because
really the question is "what should I do to optimize this function for
amd64 processors," heh. But then again, you did give some good insight
despite the wrong newsgroup! So thanks!

Chris
 

Keith Thompson

Ancient_Hacker said:
Well it depends on how smart your compiler is. It's not unimaginable
that a smart compiler would already be checking for the simple cases
and generating in-line code for this. Only way to find out is to try
it.

If your compiler is somewhat dumb, it may help to make this a macro,
so no call gets generated.

He declared the function as "static inline"; it's unlikely that making
it a macro will improve performance beyond that (assuming the compiler
really does inline it).

If this truly is a bottleneck, both measuring the code's actual
performance and looking at assembly listings could be helpful.
 

Eric Sosman

[...]
> There really should be a powint(double x, int y) function
> in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

NEVER SURRENDER!
-- Dyed-in-the-wool FORTRAN II programmer (always
speaks in capital letters)
 

Nick Vatamaniuc

Chris,
Your problem is architecture- and compiler-specific. You have to become
"good friends" with your compiler and architecture to optimize this.
It seems like a good thing to do is to hint the branch predictor that
most of the time the value will be 0. So check out
__builtin_expect() for GCC (there are a bunch of other __builtins for
constant folding and such...)

If that is too much of a headache, try the -fprofile-arcs compile
option, run the code, then recompile with -fbranch-probabilities.
On some architectures you will see anywhere from no improvement to a
noticeable one. For example, between the latest AMD and Intel parts I
would expect a bigger improvement on Intel, as it has longer pipelines
(the longer the pipeline, the greater the penalty for a mispredicted
branch -- therefore the better the prediction, the better the chance
of a speed improvement).

Hope this helps,
Nick Vatamaniuc
 

Keith Thompson

Eric Sosman said:
[...]
There really should be a powint(double x, int y) function
in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?
 

Michael Mair

Keith said:
Eric Sosman said:
There really should be a powint(double x, int y) function
in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?

Maybe we could abuse static once more...
^static
should do nicely.

Cheers
Michael
 

Flash Gordon

Keith said:
Eric Sosman said:
[...]
There really should be a powint(double x, int y) function
in C (that does even more.) [...]
Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?

You could use ^|^
 

Ulrich Eckhardt

One function that gets run a bazillion times is the pow() function from
math.h. [...too slow ...] So I coded up this little diddy:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

Someone already noted that the (x) can be just x because you are properly
using an inline function and not a macro. Bad habits die hard, eh?
I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement?

I don't see a case where this would not be sound, but I'd indeed put the
zero in the middle because it just belongs there and might make it easier
for the compiler to figure things out. Anyhow, whatever you do will
probably depend on the compiler and target platform, so if you target
multiple platforms, leave yourself room to switch implementations
behind the scenes.
Or is there perhaps a more efficient solution?

Hard to tell. However, there is one thing I would at least try, and that is
using single-precision math. It's not so much that it is necessarily
faster (I have often seen powf() just cast and forward to pow()) but that
it puts less load on the memory bus, i.e. you can fit more values into
the CPU's cache than with double. You then need to use float, the *f()
functions, and the f-suffix on literals to keep the whole thing consistent.

HTH

Uli
 

Richard Heathfield

Flash Gordon said:

You could use ^|^

...which would not break any strictly conforming code.

Or maybe you could abandon artificial associations with existing
symbol/semantics relationships and go for :) which, as far as I can tell,
would also not break any existing strictly conforming code, and which might
even cheer up the compiler.
 

Eric Sosman

Flash said:
Keith said:
Eric Sosman said:
(e-mail address removed) wrote:

[...]

There really should be a powint(double x, int y) function

in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)


If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?


You could use ^|^

... or even ;-)
 

chris.fairles

Ulrich said:
One function that gets run a bazillion times is the pow() function from
math.h. [...too slow ...] So I coded up this little diddy:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

Someone already noted that the (x) can be just x because you are properly
using an inline function and not a macro. Bad habits die hard, eh?
I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement?

I don't see a case where this would not be sound, but I'd indeed put the
zero in the middle because it just belongs there and might make it easier
for the compiler to figure things out. Anyhow, whatever you do will
probably depend on the compiler and target platform, so if you target
multiple platforms, leave yourself room to switch implementations
behind the scenes.
Or is there perhaps a more efficient solution?

Hard to tell. However, there is one thing I would at least try, and that is
using single-precision math. It's not so much that it is necessarily
faster (I have often seen powf() just cast and forward to pow()) but that
it puts less load on the memory bus, i.e. you can fit more values into
the CPU's cache than with double. You then need to use float, the *f()
functions, and the f-suffix on literals to keep the whole thing consistent.

I actually have it in my task list to determine when and where using
float is appropriate. The problem with this power function in
particular is that it's used to evaluate binomial distributions, so the
"double x" is a probability < 1 and I'm multiplying a whole bunch (10
or so) together, so I worry about the error propagating to the point
where I have only a couple of significant digits. I have yet to do a
formal analysis of the error difference when multiplying/dividing a
series of floats together vs. doubles.

And as for all the (x)'s, I had it as a macro before ;) but sometimes
I'm dividing by the result and sometimes multiplying depending on some
condition, so I didn't want to introduce another test condition within
the macro to determine whether to use *= or /= .

i.e.

    if (k > 0)
        pdf *= mult_power(probability, some_int);
    else
        pdf /= mult_power(probability, some_int);
 

chris.fairles

Nick said:
Chris,
Your problem is architectures & compiler specific. You have to become
"good friends" with your compiler and architecture to optimize this.
It seems like a good thing to do is to hint the branch predictor that
most of the time the value will be 0. So check out the
__builtin_expect() for GCC (there are a bunch of other __builtins for
constant folding and such...)

Wow, I never got that far in the GCC manual, but very interesting. I think
the __builtin_constant_p() function should be useful too, because I have a
bunch of constants that are read from a config file during an init
function called before processing. I'm not sure if the compiler can
determine that, once read in, they are never altered and can therefore be
treated as constants, but it's worth investigating! Thanks.

Chris
 

Eric Sosman

[...]
I actually have it in my task list to determine when and where using
float is appropriate. The problem with this power function in
particular is that it's used to evaluate binomial distributions, so the
"double x" is a probability < 1 and I'm multiplying a whole bunch (10
or so) together, so I worry about the error propagating to the point
where I have only a couple of significant digits. I have yet to do a
formal analysis of the error difference when multiplying/dividing a
series of floats together vs. doubles.

<topicality level=marginal>

Roughly speaking, floating-point multiplication and
division don't increase the relative error very much. It's
addition and subtraction that are usually the troublemakers.

</topicality>

You say you're evaluating binomial distributions: What
exactly do you mean by that? There may be better ways to
proceed than by adding up the terms individually.
 

Walter Roberson

I have yet to do a formal
analysis on the error diff when multiplying/dividing a series of floats
together vs. doubles.

float and double present a great many bands in which the error is
absolute, but the absolute error for any one band is approximately
relative. Makes for a messy calculation.

Taking the slightly simpler case where the error is always relative,
f +/- f/c (e.g., the error is 1/10000 times the original value),
and simplifying even further to f - f/c > 0, then

max(f - f/c, f + f/c) = f + f/c
min(f - f/c, f + f/c) = f - f/c

and then the maximum product is product(f + f/c, i=1..n)
and the minimum product is product(f - f/c, i=1..n).

Subtracting the second from the first and simplifying, the result is

product(f, i=1..n) * ( (c+1)^n - (c-1)^n ) / c^n

Since ((c+1)^n - (c-1)^n) / c^n is approximately 2n/c for small n/c, if
we say that c = 10^7 and take 20 terms, the relative error is about
4e-6 times the real product; for c = 10^15 and 20 terms, the relative
error is about 4e-14 times the real product.
 

Tak-Shing Chan

[optimizations snipped]

The problem with this power function in
particular is that its used to evaluate binomal distributions so the

Have you investigated the normal approximation ``algorithm''
before attempting your micro-optimizations? I am still
unconvinced by all the micro-optimization stuff posted in
this thread so far.

Tak-Shing
 
