Strange performance problem in C++ program


mike3

Hi.

(posted to comp.lang.c++ as well in case some C++-specific questions
get raised here)
I've got this routine from a bignum package I'm writing in C++, and
I'm having trouble with the performance. It's supposed to add two big
floating point numbers together nice and fast. But, by commenting out
everything sans the actual math loops to see the speed of the
components, I noticed that the non-math-loop stuff accounts for like
70%+ of the time! Many of those functions called are trivial in what
they do, they're not super-complicated. Even just the calls to
"GetExp()" which just _return a value_ of a data item (yes, something
THAT silly and trivial) take *25%* of the time, which I determined by
simply commenting them out and substituting the values they should
return in their place. That's just *Crazy*. I'm utterly, completely
mystified as to why this is. Is it because they're member functions of
objects? What's wrong with this picture?

---

/* Add two big floating point numbers, unsigned. */
/* These unsigned MP routines treat both operands as positive
 * and the sign of the result is given accordingly.
 */
/* Special high-performance addition routine. This is much
 * faster than the one in fladdu.cpp and is designed for use
 * in the time-critical fractal-generating code. All operands --
 * inputs and output -- must be of the same length.
 * No arithmetic from rawint.cpp is used here for the actual
 * addition loop. Very fast -- 8 seconds for 100 million operations
 * using the Microsoft C++ compiler with all optimizations turned on.
 */
FG3DError BigFloat::FastAddUnsigned(const BigFloat *a, const BigFloat *b)
{
    /* Choke if the lengths are not equal */
    if((length != a->length) || (a->length != b->length))
        return(FG3DError(FG3D_MP_LENGTHS_UNEQUAL));

    /* Handle exceptional/"error" values (infinity, NaNs) */
    if(IsErr() || a->IsErr() || b->IsErr())
    {
        bool aIsInf(a->IsInfinity()), bIsInf(b->IsInfinity());
        bool aIsQNaN(a->IsQNaN()), bIsQNaN(b->IsQNaN());
        bool aIsSNaN(a->IsSNaN()), bIsSNaN(b->IsSNaN());

        /* Anything + NaN = NaN */
        if((aIsQNaN || bIsQNaN) && !(aIsSNaN || bIsSNaN))
        {
            MakeQNaN(SIGN_POSITIVE);
            return(FG3DError(FG3D_SUCCESS));
        } else if(aIsSNaN || bIsSNaN) {
            MakeSNaN(SIGN_POSITIVE);
            return(FG3DError(FG3D_BAD_FLOAT,
                             aIsSNaN ? reinterpret_cast<DWORD>(a)
                                     : reinterpret_cast<DWORD>(b)));
        }

        /* Infinity + Anything = Infinity */
        if(aIsInf || bIsInf)
        {
            MakeInfinity(SIGN_POSITIVE);
            return(FG3DError(FG3D_SUCCESS));
        }
    }

    /* Next, fetch direct pointers to the digit data. We are
     * going to get down and dirty with real live arrays and
     * pointers here. Kids, don't try this at home! We're
     * trying to squeeze every last ounce of speed there
     * is!
     */
    Digit *rPtr(&digits[0]);
    const Digit *aPtr(&a->digits[0]), *bPtr(&b->digits[0]);

    /* Now get the exponents and make sure b's is the smaller. */
    s32 aExp(a->signexp.GetExp()), bExp(b->signexp.GetExp());
    if(aExp < bExp)
    {
        std::swap(aPtr, bPtr);
        std::swap(aExp, bExp);
    }

    /* Compute the difference of the exponents and the shifts. */
    s32 expDiff(aExp - bExp);
    s32 wsh(expDiff/DIGIT_SIZE); // word shift
    s32 bsh(expDiff%DIGIT_SIZE); // bit shift
    if(wsh > length) wsh = length; // safety

    /* Apply the word shift. */
    bPtr += wsh;

    /* Now work through the digits, shifting b and adding it
     * to our result as we go.
     */
    Digit sumCarry(0);
    for(int i(0); i < length-wsh-1; i++)
    {
        /* Do combined add-and-shift */
        Digit aVal(*aPtr), bVal(*bPtr);
        Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);
        Digit bRsh((bVal >> bsh) | rshCarry);
        Digit sum(aVal + bRsh + sumCarry);
        sumCarry = sumCarry ? (sum <= aVal) : (sum < aVal);
        *rPtr = sum;

        /* Increment pointers */
        rPtr++;
        aPtr++;
        bPtr++;
    }

    /* Next, take care of the last digit of b, which has no shift
     * carry.
     */
    if(wsh < length)
    {
        Digit aVal(*aPtr), bRsh(*bPtr >> bsh);
        Digit sum(aVal + bRsh + sumCarry);
        sumCarry = sumCarry ? (sum <= aVal) : (sum < aVal);
        *rPtr = sum;
        rPtr++;
        aPtr++;
        bPtr++;
    }

    /* Finally, ripple the addition carry through the remaining
     * digits.
     */
    for(int i(length-wsh); i < length; i++)
    {
        Digit sum(*aPtr + sumCarry);
        sumCarry = sumCarry ? (sum == 0) : 0;
        *rPtr = sum;
        rPtr++;
        aPtr++;
    }

    /* Set exponent */
    FG3DError err;
    ECHK_RET(err, signexp.SetExp(aExp));

    /* Shift in any remaining carries */
    if(sumCarry)
    {
        ECHK_RET(err, signexp.Inc());

        Digit rshCarry(DIGIT_MSB_MASK);
        rPtr = (&digits[0] + (length-1));
        for(int i(length-1); i >= 0; i--)
        {
            Digit rVal(*rPtr);
            Digit tmp((rVal >> 1) | rshCarry);
            rshCarry = rVal << (DIGIT_SIZE-1); // was hardcoded "<< 31"
            *rPtr = tmp;
            rPtr--;
        }
    }

    /* Set sign */
    signexp.SetSign(SIGN_POSITIVE);

    /* Return success. */
    return(FG3DError(FG3D_SUCCESS));
}
---
 

Marco Manfredini

mike3 wrote:
[...]I noticed that the non-math-loop stuff accounts for like
70%+ of the time! Many of those functions called are trivial in what
they do, they're not super-complicated. Even just the calls to
"GetExp()" which just _return a value_ of a data item (yes, something
THAT silly and trivial) take *25%* of the time, which I determined by
simply commenting them out and substituting the values they should
return in their place. That's just *Crazy*.

It's not crazy: substituting the actual values means that the compiler
can perform constant folding on the rest of your code. Apparently you
derive all kinds of shift values from what GetExp() returns, and these
also become constants for the compiler. This line, for example:
> Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);

contains a branch plus a runtime shift, but it becomes a simple
constant shift if bsh is known in advance. By the way, the conditional
is a non-optimization, because the shift path is taken in most cases
anyway: you pay for an expensive branch to avoid the unavoidable.
Also, I wonder why you need to shift your digits at all. Are you
storing the exponents in a different base than the digits?
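Loosely illustrated (a made-up fragment, not your code; the names and
numbers here are mine): with a substituted constant the dependent
wsh/bsh computations fold away entirely, while an opaque runtime value
forces them to stay in the generated code.

---
#include <cstdio>

// Reading through a volatile keeps the optimizer from treating the
// value as a compile-time constant, the way a substituted literal is.
volatile int exp_source = 20; // stands in for what GetExp() returns

int main()
{
    // int e = 20;          // constant: wsh/bsh fold to 0 and 20
    int e = exp_source;     // runtime value: the computations remain
    int wsh = e / 32, bsh = e % 32;
    std::printf("wsh=%d bsh=%d\n", wsh, bsh);
    return 0;
}
---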

I'm utterly, completely
mystified as to why this is. Is it because they're member functions of
objects? What's wrong with this picture?

The routine is overly complicated for its task.
 

mike3

mike3 wrote:

[...]I noticed that the non-math-loop stuff accounts for like
70%+ of the time! Many of those functions called are trivial in what
they do, they're not super-complicated. Even just the calls to
"GetExp()" which just _return a value_ of a data item (yes, something
THAT silly and trivial) take *25%* of the time, which I determined by
simply commenting them out and substituting the values they should
return in their place. That's just *Crazy*.

Marco Manfredini wrote:
It's not crazy: substituting the actual values means that the compiler
can perform constant folding on the rest of your code. Apparently you
derive all kinds of shift values from what GetExp() returns, and these
also become constants for the compiler. This line, for example:
> Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);

contains a branch plus a runtime shift, but it becomes a simple
constant shift if bsh is known in advance. By the way, the conditional
is a non-optimization, because the shift path is taken in most cases
anyway: you pay for an expensive branch to avoid the unavoidable.
Also, I wonder why you need to shift your digits at all. Are you
storing the exponents in a different base than the digits?

Yes, the digits must be shifted because the floating
point numbers are stored with an exponent to the base
two (binary). This avoids the problem of "wobbling
precision" that plagues using the digit size as the
exponent base. Furthermore I _know_ this can be made
fast (see the end of this message for why).

The reason for the conditional is that shifting, say, a 32-bit
"unsigned long" left by 32 bits (if "Digit" is defined as "unsigned
long", as it is in my program, and we're on a 32-bit machine,
DIGIT_SIZE is 32 bits -- the "digits" take up an entire machine word)
does _not_ produce zero like you'd expect. Try this:

---
#include <iostream>

using namespace std;

int main()
{
    unsigned int a(1245); // should be 32 bits

    cout << "a = " << a << endl;
    a <<= 32; // shift by the full width -- not guaranteed to give 0
    cout << "a shifted left by 32 bits = " << a << endl;

    return 0;
}
---

When I run it, I don't get 0, as would be needed in the adder; I get
1245.


The routine is overly complicated for its task.

The mystifying thing was that when I had a more "C-like" routine --
namely, the "BigFloat" object used plain arrays instead of
std::vector, *and* the exponent/sign combination was not encapsulated,
plus the routine was not a member function -- it performed much
better. (The drawback, though, was that bounds checking for exponents
and arrays then had to be scattered all over the place, in all sorts
of different routines, dramatically hurting code readability.) And
there is virtually no difference between the algorithms used -- no
more or fewer shifts.
 

aku ankka

mike3 wrote:
[...]I noticed that the non-math-loop stuff accounts for like
70%+ of the time! Many of those functions called are trivial in what
they do, they're not super-complicated. Even just the calls to
"GetExp()" which just _return a value_ of a data item (yes, something
THAT silly and trivial) take *25%* of the time, which I determined by
simply commenting them out and substituting the values they should
return in their place. That's just *Crazy*.
Marco Manfredini wrote:
It's not crazy: substituting the actual values means that the compiler
can perform constant folding on the rest of your code. Apparently you
derive all kinds of shift values from what GetExp() returns, and these
also become constants for the compiler. This line, for example:
> Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);

contains a branch plus a runtime shift, but it becomes a simple
constant shift if bsh is known in advance. By the way, the conditional
is a non-optimization, because the shift path is taken in most cases
anyway: you pay for an expensive branch to avoid the unavoidable.
Also, I wonder why you need to shift your digits at all. Are you
storing the exponents in a different base than the digits?

Yes, the digits must be shifted because the floating
point numbers are stored with an exponent to the base
two (binary). This avoids the problem of "wobbling
precision" that plagues using the digit size as the
exponent base. Furthermore I _know_ this can be made
fast (see the end of this message for why).

The reason for the conditional is that shifting, say, a 32-bit
"unsigned long" left by 32 bits (if "Digit" is defined as "unsigned
long", as it is in my program, and we're on a 32-bit machine,
DIGIT_SIZE is 32 bits -- the "digits" take up an entire machine word)
does _not_ produce zero like you'd expect. Try this:

---
#include <iostream>

using namespace std;

int main()
{
    unsigned int a(1245); // should be 32 bits

    cout << "a = " << a << endl;
    a <<= 32; // shift by the full width -- not guaranteed to give 0
    cout << "a shifted left by 32 bits = " << a << endl;

    return 0;
}
---

For example, on the x86 the shift count is masked to five significant
bits in 32-bit mode.

32 & 31 = 0, which amounts to a no-op, so the compiler doesn't have to
emit an ALU instruction at all for that statement if it can see the
(a << 0) at compilation time.

The standard document has more information on why this behavior is
legal. :D
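One branch-free workaround, by the way (just a sketch, assuming Digit
is an unsigned 32-bit type and DIGIT_SIZE is exactly its width): split
the shift in two, so every shift count stays strictly below the width
and the bsh == 0 case falls out as 0.

---
typedef unsigned int Digit;   // assumed: exactly 32 bits here
const int DIGIT_SIZE = 32;

/* Hypothetical helper, not from the original program: computes
 * x << (DIGIT_SIZE - bsh) for bsh in [1, DIGIT_SIZE-1] and yields 0
 * for bsh == 0. Both shift counts stay below DIGIT_SIZE, so the
 * behavior is well defined everywhere. */
inline Digit ShiftCarryLeft(Digit x, int bsh)
{
    return (x << (DIGIT_SIZE - 1 - bsh)) << 1;
}
---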
 

Marco Manfredini

mike3 said:
Yes, the digits must be shifted because the floating
point numbers are stored with an exponent to the base
two (binary). This avoids the problem of "wobbling
precision" that plagues using the digit size as the
exponent base. Furthermore I _know_ this can be made
fast (see the end of this message for why).

IIRC, the wobbling-precision issue is about the one digit in your
representation which stores less information in higher bases. That was
a big deal in the '60s, but I hardly see why it matters if a 4000-bit
floating value effectively has only 3980 bits of real precision -- in
particular if you always have to bit-shift one operand for the extra
20 bits.
The reason for the conditional is that shifting, say, a 32-bit
"unsigned long" left by 32 bits (if "Digit" is defined as "unsigned
long", as it is in my program, and we're on a 32-bit machine,
DIGIT_SIZE is 32 bits -- the "digits" take up an entire machine word)
does _not_ produce zero like you'd expect. Try this:

So, why the heck do you do this in the inner loop? What about:

if (bsh == 0)
{
    /* ...loop without any shifting... */
}
else
{
    /* ...loop with the shift... */
}
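Spelled out against your posted routine (a sketch only -- the same
variables as in FastAddUnsigned, just with the test hoisted):

---
/* The bsh test moves out of the loop; each branch runs a loop
 * specialized for its case. */
Digit sumCarry(0);
if(bsh == 0)
{
    for(int i(0); i < length-wsh-1; i++)
    {
        Digit aVal(*aPtr), sum(aVal + *bPtr + sumCarry);
        sumCarry = sumCarry ? (sum <= aVal) : (sum < aVal);
        *rPtr++ = sum; aPtr++; bPtr++;
    }
}
else
{
    for(int i(0); i < length-wsh-1; i++)
    {
        Digit aVal(*aPtr);
        Digit bRsh((*bPtr >> bsh) | (*(bPtr+1) << (DIGIT_SIZE-bsh)));
        Digit sum(aVal + bRsh + sumCarry);
        sumCarry = sumCarry ? (sum <= aVal) : (sum < aVal);
        *rPtr++ = sum; aPtr++; bPtr++;
    }
}
---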

The mystifying thing was that when I had a more "C-like" routine --
namely, the "BigFloat" object used plain arrays instead of
std::vector, *and* the exponent/sign combination was not encapsulated,
plus the routine was not a member function -- it performed much
better. (The drawback, though, was that bounds checking for exponents
and arrays then had to be scattered all over the place, in all sorts
of different routines, dramatically hurting code readability.) And
there is virtually no difference between the algorithms used -- no
more or fewer shifts.

If "digits" is a std::vector, then you should not take the address of
the first element, but an iterator of course. The encapsulation
shouldn't introduce an overhead, provided that the compiler inlines the
calls. For the transformation into a member function: Your problem might
rather be, that you are simply running out of registers. Your core
addition loop for example works with (lenght-wsh-1), sumCarry, aPtr,
bPtr, rPtr which is much for a x86. It literally carries sumCarry. Now
compare with the following approach

while(true)
{
    do
    {
        if (i==end) goto out;
        carry = add_without_carry(a, b, res);
        i++;
    }
    while (!carry);

    do
    {
        if (i==end) goto out;
        i++;
        carry = add_with_carry(a, b, res);
    }
    while (carry);
}
out:

The compiler can easily put carry in a scratch register now, and the
add_* routines can be coded with advance knowledge of the carry.
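The helpers might look like this (hypothetical, matching the names in
the pseudocode above; with unsigned digits, wraparound detects the
carry):

---
typedef unsigned int Digit; // assumed 32-bit unsigned, as in the thread

inline bool add_without_carry(const Digit *a, const Digit *b,
                              Digit *res, int i)
{
    Digit sum = a[i] + b[i];
    res[i] = sum;
    return sum < a[i];  // a wrapped sum is smaller than its operand
}

inline bool add_with_carry(const Digit *a, const Digit *b,
                           Digit *res, int i)
{
    Digit sum = a[i] + b[i] + 1;
    res[i] = sum;
    return sum <= a[i]; // with a carry in, equality also means overflow
}
---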

Also consider a storage layout where all the digits are *within* the
(variable-sized) bignum: you save the extra pointer to the digits in
all your computations, plus there are the caching benefits.

And learn to read the assembler output of your compiler. For bignum
addition, the limiting factor is usually the memory transfer
capability of your system.
 

mike3

IIRC, the wobbling-precision issue is about the one digit in your
representation which stores less information in higher bases. That was
a big deal in the '60s, but I hardly see why it matters if a 4000-bit
floating value effectively has only 3980 bits of real precision -- in
particular if you always have to bit-shift one operand for the extra
20 bits.

But at 4000 bits one will be waiting days, weeks even, for the
fractals to render -- that corresponds to a magnification factor of
some 10^1200 or so. Bignum is required at around 10^14 magnification.
A 64-bit value effectively having only 32 bits of precision, or a
96-bit value effectively having only 64 bits -- now THAT's something.
Plus my original routine, which also used binary exponents, did not
have the performance trouble that I am experiencing now.
So, why the heck do you do this in the inner loop? What about:

if (bsh == 0)
{
    /* ...loop without any shifting... */
}
else
{
    /* ...loop with the shift... */
}

Like I said, though, it seems that commenting out the inner loop just
to see how it affects the time required shows that the majority of the
time is still going into that stuff _outside_ the loop, such as
fetching the exponent using GetExp().
If "digits" is a std::vector, then you should not take the address of
the first element, but an iterator of course. The encapsulation
shouldn't introduce an overhead, provided that the compiler inlines the
calls.

Can you count on that inlining happening with any reasonably decent
compiler? I don't know whether I may change compilers in the future.
As for the transformation into a member function: your problem might
rather be that you are simply running out of registers. Your core
addition loop, for example, works with (length-wsh-1), sumCarry, aPtr,
bPtr, and rPtr, which is a lot for an x86. It literally carries
sumCarry. Now compare with the following approach:

The original routine from my previous BigFloat attempt that I
mentioned used about the same thing, although it indexed the arrays
directly (a, for example) instead of incrementing 3 pointers. In fact,
the 3 pointers were actually an attempt to make it go faster, since it
_still_ had the same performance problem even with the arrays done
that way.
while(true)
{
    do
    {
        if (i==end) goto out;
        carry = add_without_carry(a, b, res);
        i++;
    }
    while (!carry);

    do
    {
        if (i==end) goto out;
        i++;
        carry = add_with_carry(a, b, res);
    }
    while (carry);
}
out:

The compiler can easily put carry in a scratch register now, and the
add_* routines can be coded with advance knowledge of the carry.

Also consider a storage layout where all the digits are *within* the
(variable-sized) bignum: you save the extra pointer to the digits in
all your computations, plus there are the caching benefits.

And learn to read the assembler output of your compiler. For bignum
addition, the limiting factor is usually the memory transfer
capability of your system.
 

Marco Manfredini

mike3 said:
Like I said, though, it seems that commenting out the inner loop just
to see how it affects the time required shows that the majority of the
time is still going into that stuff _outside_ the loop, such as
fetching the exponent using GetExp().

I told you before that replacing GetExp() with the "values they should
return" makes the compiler create completely different code, because
it sees constant values now. You can't conclude that GetExp() eats 20%
just by replacing it with a constant; this is not how benchmarking in
the presence of an optimizer works.

If you instead replaced GetExp() with a direct member access, then
look at the assembler code to see whether there's still a call to
GetExp(). If yes, find out why it's not inlined. Is it virtual? Are
your optimization settings OK (i.e. inline control)? Are you using a
Visual Studio *Express* Edition (these do not optimize, IIRC)? And so
on. These are the suggestions that can be made from the code fragment
you've posted.
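For reference, a minimal sketch (guessing at your class layout, which
I haven't seen) of an accessor that any optimizer can inline once it
sees the definition:

---
typedef int s32; // guessing at the project's typedef

/* Illustration only, not the real class: a member function defined
 * inside the class body is implicitly inline, so with optimizations
 * on, no call should survive. */
class SignExp
{
    s32 exp;
public:
    s32 GetExp() const { return exp; } // compiles to a plain load
};
---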
Can you count on that inlining happening with any reasonably decent
compiler? I don't know whether I may change compilers in the future.

A reasonable compiler will inline when he realizes that inlining
produces faster code and you want him to create faster code.
 

mike3

I told you before that replacing GetExp() with the "values they should
return" makes the compiler create completely different code, because
it sees constant values now. You can't conclude that GetExp() eats 20%
just by replacing it with a constant; this is not how benchmarking in
the presence of an optimizer works.

Then how can you test to see where the problem is? Any good,
_free_ profiler? (I don't have a lot of money.)
If you instead replaced GetExp() with a direct member access, then
look at the assembler code to see whether there's still a call to
GetExp(). If yes, find out why it's not inlined.

Because it's not declared inline. I just tried inlining but
it still doesn't seem to help any.
Is it virtual? Are your optimization settings OK (i.e. inline
control)? Are you using a Visual Studio *Express* Edition (these do
not optimize, IIRC)? And so on.

Yes, I was using the "express" edition (I cannot afford the other
editions), but it does seem to have an optimizer. (Maybe the last time
you used it, it didn't? Maybe that's something new?) Also, the problem
persists across compilers.
These are the suggestions that can be made from the code fragment
you've posted.


A reasonable compiler will inline when he realizes that inlining
produces faster code and you want him to create faster code.

Who's "he", anyway?
 

Phlip

mike3 said:
Then how can you test to see where the problem is? Any good,
_free_ profiler? (I don't have a lot of money.)

Write unit tests (for C++ with UnitTest++), and then time-test the target
methods, with simple loop statements.

This will also reduce your time spent debugging, and improve your design.
Your objects must decouple enough to permit testing.
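A minimal sketch of the loop-timing idea, with nothing but the
standard library (the loop body is a stand-in for whatever method
you're testing):

---
#include <ctime>
#include <cstdio>

int main()
{
    const long iterations = 100000000L;
    volatile long sink = 0; // keeps the work from being optimized away

    std::clock_t start = std::clock();
    for (long i = 0; i < iterations; i++)
        sink += i & 31;     // stand-in for the method under test
    std::clock_t stop = std::clock();

    std::printf("%.2f seconds\n",
                double(stop - start) / CLOCKS_PER_SEC);
    return 0;
}
---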
Because it's not declared inline. I just tried inlining but
it still doesn't seem to help any.

The 'inline' keyword, and the default inline inside classes, are no
guarantees of performance. A modern compiler will inline many methods that
you declared out-of-line, if it thinks this will improve things.

The first place to start with optimization is your general algorithm; never
the specific details of its implementation.

Here's a good thread on this subject:

http://www.beyond3d.com/content/articles/8/

It dissects this horror:

float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}
Yes, I was using the "express" edition (I cannot afford the other
editions),

Get cygwin and g++. The best software on Earth is free.
Who's "he", anyway?

The internet is international, and many newcomers to English transliterate
their native idioms. Some languages have no gender-neutral pronouns, so a
compiler is "he".
 

Marco Manfredini

Phlip said:
The internet is international, and many newcomers to English transliterate
their native idioms. Some languages have no gender-neutral pronouns, so a
compiler is "he".
A German peculiarity (where 'compiler', 'computer', etc. are all male
bastards... er, nouns), which leads to an anthropomorphic perception
and a certain intentional unwillingness to correct the usage of
pronouns in other languages. Sorry.
 

mike3

A German peculiarity (where 'compiler', 'computer', etc. are all male
bastards... er, nouns), which leads to an anthropomorphic perception
and a certain intentional unwillingness to correct the usage of
pronouns in other languages. Sorry.

I understand. Thanks for the clarification. I just got
confused, that's all.

What is this "intentional" unwillingness, anyway?
 

mike3

Write unit tests (for C++ with UnitTest++), and then time-test the target
methods, with simple loop statements.

Well, I'll see what I get.
This will also reduce your time spent debugging, and improve your design.
Your objects must decouple enough to permit testing.


The 'inline' keyword, and the default inline inside classes, are no
guarantees of performance. A modern compiler will inline many methods that
you declared out-of-line, if it thinks this will improve things.

The first place to start with optimization is your general algorithm; never
the specific details of its implementation.

There isn't really any better algorithm than that -- it's just
addition like you do with pen and paper, only on a computer and in
base two instead of base ten (though working in chunks of digits, so
you could also think of it as working in base 2^32 or similar). And
like I said, a previous implementation in a more "C" style was quite
fast (about as fast as an integer addition! Now that's fast). The
algorithm used there was exactly the same; just the implementation
differed, and not by much -- the principal differences were hands-on
manipulation of the exponents and not being a member function. So
what gives?
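(For the record, the pen-and-paper carry idea in base 2^32, as a tiny
self-contained sketch -- the digit values here are made up:)

---
#include <cstdint>
#include <cstdio>

int main()
{
    // Two 2-digit numbers, least significant digit first:
    // a = 1 * 2^32 + 0xFFFFFFFF, b = 1.
    const int N = 2;
    uint32_t a[N] = {0xFFFFFFFFu, 1u};
    uint32_t b[N] = {1u, 0u};
    uint32_t r[N];
    uint32_t carry = 0;

    for (int i = 0; i < N; i++)
    {
        uint32_t sum = a[i] + b[i] + carry;
        // Unsigned overflow wraps, so a sum smaller than an operand
        // (or equal, when a carry came in) means a carry goes out.
        carry = carry ? (sum <= a[i]) : (sum < a[i]);
        r[i] = sum;
    }
    std::printf("result = %08X %08X, carry out = %u\n",
                r[1], r[0], carry);
    return 0;
}
---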
Here's a good thread on this subject:

http://www.beyond3d.com/content/articles/8/

It dissects this horror:

float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}

Eww! That's awful! Talk about optimization and
clear code being in opposition! Holy cow.
Get cygwin and g++. The best software on Earth is free.



The internet is international, and many newcomers to English transliterate
their native idioms. Some languages have no gender-neutral pronouns, so a
compiler is "he".

Ah. I see.
 

Erik Wikström

Yes, I was using the "express" edition (I cannot afford the other
editions), but it does seem to have an optimizer. (Maybe the last time
you used it, it didn't? Maybe that's something new?) Also, the problem
persists across compilers.

The compiler (and optimiser) is (and always has been) the same as in
the non-free versions; what you miss out on is the tools, such as the
profiler, and some libraries.
 

mike3

The compiler (and optimiser) is (and always has been) the same as in
the non-free versions; what you miss out on is the tools, such as the
profiler, and some libraries.

That's right. That's the actual difference. If you want the tools, you
must pay a small fortune ($700).
 

Phlip

mike3 said:
Eww! That's awful! Talk about optimization and
clear code being in opposition! Holy cow.

That is the Alpha and Omega of Postmature Optimization.

(Look up "premature optimization" to see what I mean.)

The method is inside a function, InvSqrt, so you don't need to type all that
just to collect inverse square roots. That is exemplary, not awful.

But note that modern architectures will probably run 'double' faster than
'float', when the former matches the FPU's bus.
 

mike3

Well, just wanted to inform you that I fixed my problem by removing
calls to the helper functions and instead working on the sign/exponent
data directly. Although this made the code messier, unfortunately(*),
it gave me a 2.5x performance gain -- the routine now runs at around
50% of the speed of integer addition. I doubt it is possible to make
it any faster than that, given the bit shift, unless perhaps one
programs it in assembler, but that's a whole new story.

(*) Fortunately, though, there are only 5 of these "messy" fast
routines, one for each of the 5 basic numerical operations (addition,
subtraction, multiplication, division, and comparison).
 

Philip Potter

Phlip said:
The method is inside a function, InvSqrt, so you don't need to type all that
just to collect inverse square roots. That is exemplary, not awful.

It's a terrible name, too. To me, an "inverse square root" is a square.
The function calculates reciprocal square roots.
 

Phlip

It's a terrible name, too. To me, an "inverse square root" is a square.
The function calculates reciprocal square roots.

Point. I _wish_ I could have written the innards, but I certainly
would have found a better name!
 
