inline power function replacement


chris.fairles

Just want an opinion. I have an algorithm that needs to run as fast as
possible; in fact, it's already too slow. I've made as many algorithmic
changes as I can to reduce the amount of code, so now I'm turning to
micro-optimizations.

One function that gets run a bazillion times is the pow() function from
math.h. However, I've realized that probably 90% of the time the exponent
will be 0, and 99.9% of the time it will lie between -3 and 3.
So I read that switches can sometimes be optimized into jump tables if
the case expressions are nearly contiguous integers, and I coded up this
little ditty:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement? Or is there
perhaps a more efficient solution?
 

Ancient_Hacker

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement? Or is there
perhaps a more efficient solution?

Well, it depends on how smart your compiler is. It's not unimaginable
that a smart compiler would already be checking for the simple cases
and generating inline code for this. The only way to find out is to try
it.

If your compiler is somewhat dumb, it may help to make this a macro,
so no call gets generated.

And oh, you'll get some flamitude from folks who consider any kind of
speculation about run-times to have nothing to do with C.
 

websnarf

Just want an opinion. I have an algorithm that needs to run as fast as
possible; in fact, it's already too slow. I've made as many algorithmic
changes as I can to reduce the amount of code, so now I'm turning to
micro-optimizations.

I'm not sure this is the right newsgroup for such a question. Not a
lot of people in this group can fathom what exactly you are saying
here.
One function that gets run a bazillion times is the pow() function from
math.h. However, I've realized that probably 90% of the time the exponent
will be 0, and 99.9% of the time it will lie between -3 and 3.
So I read that switches can sometimes be optimized into jump tables if
the case expressions are nearly contiguous integers, and I coded up this
little ditty:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement?

Well, there are numerical considerations (i.e., your results will not
be identical to just calling pow()), but your idea is very well
motivated. There really should be a powint(double x, int y) function
in C (one that does even more). The problem of doing this as quickly as
possible for all integers is actually an interesting one.

Oh yeah, and the order of the cases should not matter if the compiler
has transformed this into a jump table. It only matters in the cases
where it doesn't, in which case you should just put the most
common cases as early as possible.
Or is there perhaps a more efficient solution?

Indeed. If y is mostly zero, then you should do this:

#define fastPowInt(x,y) ((0 == (y)) ? 1.0 : mult_power ((x), (y)))

The reason is that the single conditional on top can leverage a
modern microprocessor's branch prediction, and thus costs the
least possible overhead in the vast majority of cases. A
switch() does indeed use jump tables in many compilers, but
unless you are almost always switching to the same case, it has
roughly the same overhead as an indirect function call (calling through
a function pointer). So the surrounding conditional shown above
should improve performance further still.
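A minimal sketch of such a powint, using exponentiation by squaring (the name and interface are hypothetical; C provides no such function):

```c
/* Hypothetical powint: x raised to an integer power by repeated
   squaring -- O(log |y|) multiplications instead of O(|y|). */
static double powint(double x, int y)
{
    /* Negate through unsigned so y == INT_MIN does not overflow. */
    unsigned int n = (y < 0) ? -(unsigned int)y : (unsigned int)y;
    double result = 1.0;

    while (n != 0) {
        if (n & 1u)
            result *= x;   /* fold in the current exponent bit */
        x *= x;            /* square for the next bit */
        n >>= 1;
    }
    return (y < 0) ? 1.0 / result : result;
}
```

For small |y| this costs a handful of multiplies; the switch in mult_power is still likely cheaper for the -3..3 cases, so the two approaches are complementary rather than competing.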
 

Frederick Gotham

(e-mail address removed) posted:
case -3 : return 1.0 / ((x)*(x)*(x));


Why the copious use of parentheses? You're not writing a macro.

return 1.0 / (x*x*x);

Or maybe:

return 1.0 / x / x / x;

Or is there perhaps a more efficient solution?


Presumably "pow" should be checking for integer exponents... ?
 

Eric Sosman

Just want an opinion. I have an algorithm that needs to run as fast as
possible; in fact, it's already too slow. I've made as many algorithmic
changes as I can to reduce the amount of code, so now I'm turning to
micro-optimizations.

One function that gets run a bazillion times is the pow() function from
math.h. However, I've realized that probably 90% of the time the exponent
will be 0, and 99.9% of the time it will lie between -3 and 3.

It sounds like you're approaching things correctly: You've
tried changing the algorithms, rearranging the calculations,
and other such structural changes (that's where the big wins
are), and found yourself still a little short of the required
performance. And you've made measurements that finger pow()
as one of the big time sinks, and you've counted the number
of times it's called, and you've even made an empirical study
of the exponent values.

(You HAVE, right? The "bazillion" and "90%" and "99.9%"
are actual measurements, right? Or at least they're rough
statements backed by measurements, right? If so, fine -- but
if not, go back and make the measurements before proceeding.)

Once you've got measurements that finger pow() as the
culprit and give some actual information about the exponent
distribution, your inline function seems a good way to exploit
the knowledge you've acquired about the program, knowledge
that pow() itself does not have.

Whether this will save "enough" time is something you
can only find out by measuring. You may be able to tell just
by considering your initial measurements, though: If you find
that pow() takes 20% of the running time, replacing pow() with
something that takes *zero* time can only speed up the program
by 25%. If you need a 10% speedup, it might be within reach --
but if you need to double the program's speed, this isn't
going to do the trick.
So I read that switches can sometimes be optimized into jump tables if
the case expressions are nearly contiguous integers, and I coded up this
little ditty:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement? Or is there
perhaps a more efficient solution?

The C language itself says nothing about the speeds of
the constructs you write in it, so questions of efficiency
must always refer to the platform or platforms. Still, it
is quite unlikely that the speed will be affected noticeably
by the order in which the case labels appear in the source
code. ((Also, the speed probably won't be affected if you
get rid of some of those needless parentheses ...))

If your "zero 90% of the time" figure is literally true,
you might (*might*) gain a milliquiver or so by testing for
it before doing the switch:

    if (y == 0)
        return 1.0;
    switch (y) {
        ...
    }

Differences due to this sort of thing are usually "down in
the noise," but a bazillion milliquivers is a hectotwitch
(IIRC), which might be noticeable. Keep in mind, though,
that this "optimization" might actually slow you down --
measure, measure, measure!
 

chris.fairles

I'm not sure this is the right newsgroup for such a question. Not a
lot of people in this group can fathom what exactly you are saying
here.

I see your point. I suppose the heart of this question is "what machine
code does this C code get translated into," which obviously is decided by
the compiler (gcc 4.1.1 in my case).

Actually, this is more of an architecture-specific question, because
really the question is "what should I do to optimize this function for
amd64 processors," heh. But then again, you did give some good insight
despite the wrong newsgroup! So thanks!

Chris
 

Keith Thompson

Ancient_Hacker said:
Well it depends on how smart your compiler is. It's not unimaginable
that a smart compiler would already be checking for the simple cases
and generating in-line code for this. Only way to find out is to try
it.

If your compiler is somewhat dumb, it may help to make this a macro,
so no call gets generated.

He declared the function as "static inline"; it's unlikely that making
it a macro will improve performance beyond that (assuming the compiler
really does inline it).

If this truly is a bottleneck, both measuring the code's actual
performance and looking at assembly listings could be helpful.
 

Eric Sosman

[...]
> There really should be a powint(double x, int y) function
> in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

NEVER SURRENDER!
-- Dyed-in-the-wool FORTRAN II programmer (always
speaks in capital letters)
 

Nick Vatamaniuc

Chris,
Your problem is architecture- and compiler-specific. You have to become
"good friends" with your compiler and architecture to optimize this.
It seems like a good thing to do is to hint the branch predictor that
most of the time the value will be 0. So check out
__builtin_expect() for GCC (there are a bunch of other __builtins for
constant folding and such...)

If that is too much of a headache, try the -fprofile-arcs compile
option, run the code, then recompile with -fbranch-probabilities.
On some architectures you will see anywhere from no improvement to a
noticeable one. For example, between the latest AMD and Intel parts I
would expect a bigger improvement on Intel, as it has longer pipelines
(the longer the pipeline, the greater the penalty for a mispredicted
branch -- therefore the better the prediction, the better the chance
of a speed improvement).

Hope this helps,
Nick Vatamaniuc
 

Keith Thompson

Eric Sosman said:
[...]
There really should be a powint(double x, int y) function
in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?
 

Michael Mair

Keith said:
Eric Sosman said:
There really should be a powint(double x, int y) function
in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?

Maybe we could abuse static once more...
^static
should do nicely.

Cheers
Michael
 

Flash Gordon

Keith said:
Eric Sosman said:
[...]
There really should be a powint(double x, int y) function
in C (that does even more.) [...]
Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)

If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?

You could use ^|^
 

Ulrich Eckhardt

One function that gets run a bazillion times is the pow() function from
math.h. [...too slow ...] So I coded up this little diddy:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

Someone already noted that the (x) can be just x because you are properly
using an inline function and not a macro. Bad habits die hard, eh?
I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement?

I don't see a case where this would not be sound, but I'd indeed put the
zero in the middle because it just belongs there and might make it easier
for the compiler to figure things out. Anyhow, whatever you do will
probably depend on the compiler and target platform, so if you target
multiple platforms, leave yourself room to switch implementations
behind the scenes.
Or is there perhaps a more efficient solution?

Hard to tell. However, there is one thing I would at least try, and that is
using single-precision math. It's not so much that it is necessarily
faster (I have often seen powf() just cast and forward to pow()) but that
it puts less load on the memory bus, i.e. you can fit more values into
the CPU's cache than with double. You then need to use float, the *f()
functions, and the f-suffix on literals to keep the whole thing consistent.

HTH

Uli
 

Richard Heathfield

Flash Gordon said:

You could use ^|^

...which would not break any strictly conforming code.

Or maybe you could abandon artificial associations with existing
symbol/semantics relationships and go for :) which, as far as I can tell,
would also not break any existing strictly conforming code, and which might
even cheer up the compiler.
 

Eric Sosman

Flash said:
Keith said:
Eric Sosman said:
(e-mail address removed) wrote:

[...]

There really should be a powint(double x, int y) function

in C (that does even more.) [...]

Wimp! Puling almsbeggar! Settle-for-half pantywaist!

C should have an exponentiation *operator*! Since the
obvious ^ has already been co-opted for another purpose and
since the traditional ** would be ambiguous due to other
overloadings, I propose the ^^ operator for exponentiation.
(Or, if folks would prefer it, the **^ operator -- best of
both worlds, don'tcha know.)


If you use ^^ for exponentiation, what am I going to use for
short-circuit xor?


You could use ^|^

... or even ;-)
 

chris.fairles

Ulrich said:
One function that gets run a bazillion times is the pow() function from
math.h. [...too slow ...] So I coded up this little diddy:

static inline double
mult_power(double x, int y)
{
    switch ((y))
    {
        case 0  : return 1.0;
        case -3 : return 1.0 / ((x)*(x)*(x));
        case -2 : return 1.0 / ((x)*(x));
        case -1 : return 1.0 / (x);
        case 1  : return (x);
        case 2  : return (x)*(x);
        case 3  : return (x)*(x)*(x);
        default : return pow((x), (y));
    }
}

Someone already noted that the (x) can be just x because you are properly
using an inline function and not a macro. Bad habits die hard, eh?
I actually haven't tested this vs. sticking the 0 in the middle (which I
will be doing shortly), but is this a sound replacement?

I don't see a case where this would not be sound, but I'd indeed put the
zero in the middle because it just belongs there and might make it easier
for the compiler to figure things out. Anyhow, whatever you do will
probably depend on the compiler and target platform, so if you target
multiple platforms, leave yourself room to switch implementations
behind the scenes.
Or is there perhaps a more efficient solution?

Hard to tell. However, there is one thing I would at least try, and that is
using single-precision math. It's not so much that it is necessarily
faster (I have often seen powf() just cast and forward to pow()) but that
it puts less load on the memory bus, i.e. you can fit more values into
the CPU's cache than with double. You then need to use float, the *f()
functions, and the f-suffix on literals to keep the whole thing consistent.

I actually have it in my task list to determine when and where using
float is appropriate. The problem with this power function in
particular is that it's used to evaluate binomial distributions, so the
"double x" is a probability < 1 and I'm multiplying a whole bunch (10
or so) together, so I worry about the error propagating to the point
where I have only a couple of significant digits. I have yet to do a
formal analysis of the error difference when multiplying/dividing a
series of floats together vs. doubles.

And as for all the (x)'s, I had it as a macro before ;) but sometimes
I'm dividing by the result and sometimes multiplying depending on some
condition, so I didn't want to introduce another test condition within
the macro to determine whether to use *= or /= .

i.e.

    if (k > 0)
        pdf *= mult_power(probability, some_int);
    else
        pdf /= mult_power(probability, some_int);
 

chris.fairles

Nick said:
Chris,
Your problem is architectures & compiler specific. You have to become
"good friends" with your compiler and architecture to optimize this.
It seems like a good thing to do is to hint the branch predictor that
most of the time the value will be 0. So check out the
__builtin_expect() for GCC (there are a bunch of other __builtins for
constant folding and such...)

Wow, I never got that far in the GCC manual, but very interesting. I think
the __builtin_constant_p() function should be useful too, because I have a
bunch of constants that are read from a config file during an init
function called before processing. I'm not sure if the compiler can
determine that, once read in, they are never altered and can therefore be
treated as constants, but it's worth investigating! Thanks.

Chris
 

Eric Sosman

[...]
I actually have it in my task list to determine when and where using
float is appropriate. The problem with this power function in
particular is that it's used to evaluate binomial distributions, so the
"double x" is a probability < 1 and I'm multiplying a whole bunch (10
or so) together, so I worry about the error propagating to the point
where I have only a couple of significant digits. I have yet to do a
formal analysis of the error difference when multiplying/dividing a
series of floats together vs. doubles.

<topicality level=marginal>

Roughly speaking, floating-point multiplication and
division don't increase the relative error very much. It's
addition and subtraction that are usually the troublemakers.

</topicality>

You say you're evaluating binomial distributions: What
exactly do you mean by that? There may be better ways to
proceed than by adding up the terms individually.
 

Walter Roberson

I have yet to do a formal
analysis on the error diff when multiplying/dividing a series of floats
together vs. doubles.

float and double present a great many bands in which the error is
absolute, but the absolute error for any one band is approximately
relative. Makes for a messy calculation.

Taking the slightly simpler case where the error is always relative,
f +/- f/c (e.g., the error is 1/10000 times the original value),
and simplifying even further to f - f/c > 0, then

max(f - f/c, f + f/c) = f + f/c
min(f - f/c, f + f/c) = f - f/c

and then the maximum product is product(f + f/c, i=1..n)
and the minimum product is product(f - f/c, i=1..n).

Subtracting the second from the first and simplifying, the result is

product(f, i=1..n) * ( (c+1)^n - (c-1)^n ) / c^n

Since ((c+1)^n - (c-1)^n) / c^n is approximately 2n/c for small n/c, if
we say that c = 10^7 and take 20 terms, the relative error is about
4e-6 times the real product; for c = 10^15 and 20 terms, the relative
error is about 4e-14 times the real product.
 

Tak-Shing Chan

[optimizations snipped]

The problem with this power function in
particular is that its used to evaluate binomal distributions so the

Have you investigated the normal approximation ``algorithm''
before attempting your micro-optimizations? I am still
unconvinced by all the micro-optimization stuff posted in
this thread so far.

Tak-Shing
 
