fast power function

Chris Forone

hello group,

is there some reference implementation of a fast power function? how
fast is the power function in <cmath>?

thanks & hand, chris
 
Tomás Ó hÉilidhe

Chris Forone:
hello group,

is there some reference implementation of a fast power function? how
fast is the power function in <cmath>?

thanks & hand, chris


Very probably the fastest method of calculating powers for that platform.
 
Juha Nieminen

Tomás Ó hÉilidhe said:
Very probably the fastest method of calculating powers for that platform.

Not necessarily in all cases.

For example, on Intel architectures pow() is rather slow because
there's no such FPU opcode in the 387. We are probably talking about
many hundreds, if not over a thousand, clock cycles even on a Pentium.
While compilers may try to optimize the pow() call away when they can,
they often can't.

In some cases it may be faster to "open up" a calculation than
calling pow(). For example, in many cases it may be slower to perform a
"pow(x, 1.5)" than a "x*sqrt(x)" (many compilers are unable to optimize
the former into the latter).

But of course this is more related to architectures and compilers than
to C++, and thus slightly off-topic.
 
Tomás Ó hÉilidhe

Would it not make sense for the standard library to have a pow function
that deals only with integer exponents?

unsigned pow_int(unsigned const x, unsigned exp)
{
    unsigned retval = 1;

    for ( ; exp; --exp) retval *= x;

    return retval;
}
 
Pete Becker

Would it not make sense for the standard library to have a pow function
that deals only with integer exponents?

It has three: pow(float, int), pow(double, int), and pow(long double, int).
 
Juha Nieminen

Tomás Ó hÉilidhe said:
Would it not make sense for the standard library to have a pow function
that deals only with integer exponents?

Compilers do have specialized pow() functions when the exponent is
integer.
unsigned pow_int(unsigned const x, unsigned exp)
{
    unsigned retval = 1;

    for ( ; exp; --exp) retval *= x;

    return retval;
}

That's *not* the fastest way of calculating pow(x, integer).
 
Jerry Coffin

[ ... ]
For example, on Intel architectures pow() is rather slow because
there's no such FPU opcode in the 387. We are probably talking about
many hundreds, if not over a thousand, clock cycles even on a Pentium.
While compilers may try to optimize the pow() call away when they can,
they often can't.

I don't know of any processor that would let you implement pow in a
single instruction -- but Intel floating point includes instructions for
logarithm and inverse logarithm, which make pow pretty easy to
implement.

As far as speed goes, you should be looking at about 150-190 CPU cycles
on a reasonably modern CPU (depending somewhat on data). I haven't found
any data for which it takes 200 cycles on my machine, and it's a few
years old -- I believe current CPUs are typically at least 20-30% faster
at the same clock speed.
 
Jerry Coffin

hello group,

is there some reference implementation of a fast power function? how
fast is the power function in <cmath>?

The implementation in the library is likely to be oriented more toward
being general purpose than the fastest for a given situation. You can
probably do substantially better if (and only if) you know a fair amount
about the data you're working with, so you can write something more
specialized.
 
Juha Nieminen

Jerry said:
I don't know of any processor that would let you implement pow in a
single instruction -- but Intel floating point includes instructions for
logarithm and inverse logarithm, which make pow pretty easy to
implement.

Actually it's not "pretty easy", as there's a twist.

pow(x, y) = pow(2, y*log2(x)), and the 387 happens to have both an
opcode for calculating y*log2(x), fyl2x, and one for calculating
pow(2, x)-1, f2xm1. The twist is, however, that the argument to f2xm1
must be inside the range from -1.0 to +1.0. This means that the
intermediate value y*log2(x) cannot be fed to f2xm1 as it is, but must
first be split into integer and fractional parts using the property
2^(a+b) = 2^a*2^b.

The complete asm code for calculating pow(x, y), assuming x and y have
already been loaded into the FPU, is something like (with reported clock
cycles for the pentium):

fyl2x ; 22-111
fld st ; 1
frndint ; 9-20
fsub st(1), st ; 3
fxch st(2) ; 0-1
f2xm1 ; 13-57
fld1 ; 2
fadd ; 3
fscale ; 20-31
fstp st(1) ; 1

I suppose the absolute best case scenario is a total of 74 clock
cycles, and the worst is 230 clock cycles, with possible stalls.

I suppose I overestimated the required clock cycles.
 
Jerry Coffin

Actually it's not "pretty easy", as there's a twist.

Actually, I'm well aware of the "twists". At least by my figuring, it
still qualifies as "pretty easy".
pow(x, y) = pow(2, y*log2(x)), and the 387 happens to have both an
opcode for calculating y*log2(x), fyl2x, and one for calculating
pow(2, x)-1, f2xm1. The twist is, however, that the argument to f2xm1
must be inside the range from -1.0 to +1.0. This means that the
intermediate value y*log2(x) cannot be fed to f2xm1 as it is, but must
first be split into integer and fractional parts using the property
2^(a+b) = 2^a*2^b.

Yes -- and that doesn't stop it from being pretty easy.
The complete asm code for calculating pow(x, y), assuming x and y have
already been loaded into the FPU, is something like (with reported clock
cycles for the pentium):

fyl2x ; 22-111
fld st ; 1
frndint ; 9-20
fsub st(1), st ; 3
fxch st(2) ; 0-1
f2xm1 ; 13-57
fld1 ; 2
fadd ; 3
fscale ; 20-31
fstp st(1) ; 1

That's certainly similar to the code I've always used:

fyl2x
fld st(0)
frndint
fld st(1)
fsub st(0),st(1)
f2xm1
fld1
faddp st(1),st(0)
fxch st(1)
fld1
fscale
fxch st(1)
fstp st(0)
fmulp st(1),st(0)
I suppose the absolute best case scenario is a total of 74 clock
cycles, and the worst is 230 clock cycles, with possible stalls.

I suppose I overestimated the required clock cycles.

Especially since you seem to be basing your clock counts on the Pentium,
which is an _ancient_ processor. The Pentium was discontinued more than
a decade ago. Even if you obtain your computers from other people's
trash, you can get something a lot newer by now.

To put things in perspective, using the code above, it took about half
again longer to load X and Y from memory into the FPU than it did to
compute X^Y after they were loaded.
 
Jerry Coffin

Would it not make sense for the standard library to have a pow function
that deals only with integer exponents?

It already has such overloads, though there's no guarantee that they
work any differently than for floating point exponents.
unsigned pow_int(unsigned const x, unsigned exp)
{
    unsigned retval = 1;

    for ( ; exp; --exp) retval *= x;

    return retval;
}

This isn't a particularly good algorithm unless exp is _quite_ small
(low double digits at most). For an N-bit exponent, this takes a maximum
of 2^N-1 multiplications. Assuming values of exp were randomly
distributed, it would take about 2^(N-1) multiplications on average.

You can do a lot better than that, taking roughly 2N multiplications in
the worst case, and something like 1.5N multiplications in the typical
case (actually, it should be somewhat less than that, but I haven't
bothered to figure out the exact number). In any case, it's linear in
the number of bits instead of exponential.

Here's an implementation with some test code and instrumentation:

#include <vector>
#include <numeric>
#include <climits>   // for CHAR_BIT

unsigned mults;

template <class T>
T mult(T a, T b) {
    mults++;
    return a * b;
}

#define bits(object) (sizeof(object)*CHAR_BIT)

template <class T>
T my_pow(T X, unsigned Y) {
    std::vector<T> sq;

    sq.reserve(bits(Y));

    T current_square(X);
    mults = 0;

    while (Y) {
        if (Y & 1)
            sq.push_back(current_square);
        Y >>= 1;
        current_square *= current_square;
        mults++;
    }
    T result = std::accumulate(sq.begin(), sq.end(), T(1), mult<T>);

    return result;
}

#ifdef TEST

#include <iostream>
#include <cmath>

int main() {
    unsigned max_mults = 0;
    unsigned total_mults = 0;
    unsigned numbers = 0;

    for (double i = 0; i < 10; i++)
        for (int j = 0; j < 300; j++) {
            std::cout << i << "^" << j << "=" << my_pow(i, j);
            std::cout << "\t" << mults << " multiplications\n";
            if (std::pow(i, j) != my_pow(i, j))
                std::cout << "^Error!\n";
            if (mults > max_mults)
                max_mults = mults;
            numbers++;
            total_mults += mults;
        }
    std::cout << "Maximum Multiplications: " << max_mults;
    std::cout << "\nAverage Multiplications: "
              << (double)total_mults / numbers;
    return 0;
}

#endif

When I run this, I get a worst-case of 16 and an average of 11.23
multiplications per call. In every case, the result matches the standard
library exactly -- which gives a strong indication that the standard
library is implementing the code the same way. I'd expect to see at
least a few minor differences if the standard library were doing the job
by taking the logarithm and multiplying.
 
James Kanze

It may or may not be...

It's not, by far.
Surely it's guaranteed 100% accurate - it's integer arithmetic.

Oops. I missed that; I read the line "Compilers do have
specialized pow() functions when the exponent is integer", and
didn't look at the actual function parameters.

Using this function if the base is a floating point type, of
course, will result in very poor accuracy.
 
