Strange performance problem in C++ program

Discussion in 'C++' started by mike3, Nov 18, 2007.

  1. mike3

    mike3 Guest

    Hi.

    (posted to comp.lang.c++ as well in case some C++-specific questions
    get raised here)
    I've got this routine from a bignum package I'm writing in C++, and
    I'm having trouble with the performance. It's supposed to add two big
    floating point numbers together nice and fast. But, by commenting out
    everything sans the actual math loops to see the speed of the
    components, I noticed that the non-math-loop stuff accounts for like
    70%+ of the time! Many of those functions called are trivial in what
    they do, they're not super-complicated. Even just the calls to
    "GetExp()" which just _return a value_ of a data item (yes, something
    THAT silly and trivial) take *25%* of the time, which I determined by
    simply commenting them out and substituting the values they should
    return in their place. That's just *Crazy*. I'm utterly, completely
    mystified as to why this is. Is it because they're member functions of
    objects? What's wrong with this picture?

    ---

    /* Add two big floating point numbers, unsigned. */
    /* These unsigned MP routines treat both operands as positive
    * and the sign of the result is given accordingly.
    */
    /* Special high-performance addition routine. This is much
    * faster than the one in fladdu.cpp and is designed for use
    * in the time-critical fractal-generating code. All operands --
* inputs and output -- must be of the same length.
    * No arithmetic from rawint.cpp is used here for the actual
    * addition loop. Very fast -- 8 seconds for 100 million operations
    * using Microsoft C++ compiler with all optimizations turned on.
    */
    FG3DError BigFloat::FastAddUnsigned(const BigFloat *a, const BigFloat
    *b)
    {
    /* Choke if the lengths are not equal */
    if((length != a->length) || (a->length != b->length))
    return(FG3DError(FG3D_MP_LENGTHS_UNEQUAL));

    /* Handle exceptional/"error" values (infinity, NaNs) */
    if(IsErr() || a->IsErr() || b->IsErr())
    {
    bool aIsInf(a->IsInfinity()), bIsInf(b->IsInfinity());
    bool aIsQNaN(a->IsQNaN()), bIsQNaN(b->IsQNaN());
    bool aIsSNaN(a->IsSNaN()), bIsSNaN(b->IsSNaN());

    /* Anything + NaN = NaN */
    if((aIsQNaN || bIsQNaN) && !(aIsSNaN || bIsSNaN))
    {
    MakeQNaN(SIGN_POSITIVE);
    return(FG3DError(FG3D_SUCCESS));
    } else if(aIsSNaN || bIsSNaN) {
    MakeSNaN(SIGN_POSITIVE);
    return(FG3DError(FG3D_BAD_FLOAT, aIsSNaN ?
    reinterpret_cast<DWORD>(a)
    :
    reinterpret_cast<DWORD>(b)));
    }

    /* Infinity + Anything = Infinity */
    if(aIsInf || bIsInf)
    {
    MakeInfinity(SIGN_POSITIVE);
    return(FG3DError(FG3D_SUCCESS));
    }
    }

    /* Next, fetch direct pointers to the digit data. We are
    * going to get down and dirty with real live arrays and
    * pointers here. Kids, don't try this at home! We're
    * trying to squeeze every last ounce of speed there
    * is!
    */
    Digit *rPtr(&digits[0]);
    const Digit *aPtr(&a->digits[0]), *bPtr(&b->digits[0]);

    /* Now get the exponents and make sure b's is the smaller. */
    s32 aExp(a->signexp.GetExp()), bExp(b->signexp.GetExp());
    if(aExp < bExp)
    {
    std::swap(aPtr, bPtr);
    std::swap(aExp, bExp);
    }

    /* Compute the difference of the exponents and the shifts. */
    s32 expDiff(aExp - bExp);
    s32 wsh(expDiff/DIGIT_SIZE); // word shift
    s32 bsh(expDiff%DIGIT_SIZE); // bit shift
    if(wsh > length) wsh = length; // safety

    /* Apply the word shift. */
    bPtr += wsh;

    /* Now work through the digits, shifting b and adding it
    * to our result as we go.
    */
    Digit sumCarry(0);
    for(int i(0);i<length-wsh-1;i++)
    {
    /* Do combined add-and-shift */
    Digit aVal(*aPtr), bVal(*bPtr);
    Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);
    Digit bRsh((bVal >> bsh) | rshCarry);
    Digit sum(aVal + bRsh + sumCarry);
    sumCarry = sumCarry ? (sum <= aVal) : (sum < aVal);
    *rPtr = sum;

    /* Increment pointers */
    rPtr++;
    aPtr++;
    bPtr++;
    }

    /* Next, take care of the last digit of b, which has no shift
    * carry.
    */
    if(wsh < length)
    {
    Digit aVal(*aPtr), bRsh(*bPtr >> bsh);
    Digit sum(aVal + bRsh + sumCarry);
    sumCarry = sumCarry ? (sum <= aVal) : (sum < aVal);
    *rPtr = sum;
    rPtr++;
    aPtr++;
    bPtr++;
    }

    /* Finally, ripple the addition carry through the remaining
    digits. */
    for(int i(length-wsh);i<length;i++)
    {
    Digit sum(*aPtr + sumCarry);
    sumCarry = sumCarry ? (sum == 0) : 0;
    *rPtr = sum;
    rPtr++;
    aPtr++;
    }

    /* Set exponent */
    FG3DError err;
    ECHK_RET(err, signexp.SetExp(aExp));

    /* Shift in any remaining carries */
    if(sumCarry)
    {
    ECHK_RET(err, signexp.Inc());

    Digit rshCarry(DIGIT_MSB_MASK);
    rPtr = (&digits[0]+(length-1));
    for(int i(length-1);i>=0;i--)
    {
    Digit rVal(*rPtr);
    Digit tmp((rVal >> 1) | rshCarry);
    rshCarry = rVal << 31;
    *rPtr = tmp;
    rPtr--;
    }
    }

    /* Set sign */
    signexp.SetSign(SIGN_POSITIVE);

    /* Return success. */
    return(FG3DError(FG3D_SUCCESS));
    }
    ---
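(For anyone puzzling over the `sumCarry = sumCarry ? (sum <= aVal) : (sum < aVal);` lines: the wrap-around test can be checked in isolation. A minimal sketch -- `Digit` narrowed to a 32-bit unsigned here as an assumption, since the real typedef isn't shown:)

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the carry-detection idiom used in FastAddUnsigned. With
// unsigned arithmetic, a + b wrapped iff the sum is smaller than an
// operand; with an incoming carry of 1, equality also indicates a wrap.
using Digit = std::uint32_t;

Digit add_digits(Digit a, Digit b, Digit &carry)
{
    Digit sum = a + b + carry;
    carry = carry ? (sum <= a) : (sum < a);
    return sum;
}
```

For example, 0xFFFFFFFF + 1 with no incoming carry yields 0 and sets the carry.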
     
    mike3, Nov 18, 2007
    #1

  2. mike3 wrote:
    [...]I noticed that the non-math-loop stuff accounts for like
    > 70%+ of the time! Many of those functions called are trivial in what
    > they do, they're not super-complicated. Even just the calls to
    > "GetExp()" which just _return a value_ of a data item (yes, something
    > THAT silly and trivial) take *25%* of the time, which I determined by
    > simply commenting them out and substituting the values they should
    > return in their place. That's just *Crazy*.


    It's not crazy, substituting the actual values means that the compiler
    can perform constant folding on the rest of your code. Apparently you
    derive all kinds of shift values from what GetExp() returns, which also
    become constants for the compiler. This line, for example:

    > Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);


    Contains a branch+runtime shift, but becomes a simple constant shift if
    bsh is known in advance. Btw. the conditional is a non-optimization,
because the shift path is taken anyway in most cases. You have an
expensive branch to avoid the unavoidable. Also, I wonder why you need
to shift your digits anyway. Are you storing the exponents for a different
    base than the digits?

    I'm utterly, completely
    > mystified as to why this is. Is it because they're member functions of
    > objects? What's wrong with this picture?


The routine is overly complicated for its task.
     
    Marco Manfredini, Nov 18, 2007
    #2

  3. mike3

    mike3 Guest

    On Nov 17, 9:12 pm, Marco Manfredini <> wrote:
    > mike3 wrote:
    >
    > [...]I noticed that the non-math-loop stuff accounts for like
    >
    > > 70%+ of the time! Many of those functions called are trivial in what
    > > they do, they're not super-complicated. Even just the calls to
    > > "GetExp()" which just _return a value_ of a data item (yes, something
    > > THAT silly and trivial) take *25%* of the time, which I determined by
    > > simply commenting them out and substituting the values they should
    > > return in their place. That's just *Crazy*.

    >
    > It's not crazy, substituting the actual values means that the compiler
    > can perform constant folding on the rest of your code. Apparently you
    > derive all kinds of shift values from what GetExp() returns, which also
    > become constants for the compiler. This line, for example:
    >
    > > Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);

    >
    > Contains a branch+runtime shift, but becomes a simple constant shift if
    > bsh is known in advance. Btw. the conditional is a non-optimization,
> because the shift path is taken anyway in most cases. You have an
> expensive branch to avoid the unavoidable. Also, I wonder why you need
> to shift your digits anyway. Are you storing the exponents for a different
    > base than the digits?
    >


    Yes, the digits must be shifted because the floating
    point numbers are stored with an exponent to the base
    two (binary). This avoids the problem of "wobbling
    precision" that plagues using the digit size as the
    exponent base. Furthermore I _know_ this can be made
    fast (see the end of this message for why).

    The reason for the conditional is because shifting
    say a 32-bit "unsigned long" type left by 32 bits
    (if "Digit" is defined as "unsigned long" like it is
    in my program and we're on a 32-bit machine, DIGIT_SIZE
is 32 bits -- the "digits" take up an entire machine word),
    does _not_ produce zero like you'd expect. Try this:

    ---
    #include <iostream>

    using namespace std;

    int main()
    {
    unsigned int a(1245); // should be 32 bits

    cout << "a = " << a << endl;
    a <<= 32;
    cout << "a shifted left by 32 bits = " << a << endl;

    return 0;
    }
    ---

When I run it, I don't get 0 as would be needed
in the adder; I get 1245.
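(A guard like the one in my adder can be packaged up so the shift is well defined for the whole range 0 to DIGIT_SIZE. A sketch, not the actual routine -- `Digit` and `DIGIT_SIZE` here follow my program's conventions:)

```cpp
#include <cassert>
#include <cstdint>

using Digit = std::uint32_t;
const int DIGIT_SIZE = 32;   // digits fill a full machine word

// A right shift that is well defined for counts 0..DIGIT_SIZE. The raw
// >> operator is undefined when the count equals (or exceeds) the
// width of the operand, which is exactly the case the adder must dodge.
Digit safe_shr(Digit v, int n)
{
    return (n >= DIGIT_SIZE) ? 0 : (v >> n);
}
```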

    > I'm utterly, completely
    >
    > > mystified as to why this is. Is it because they're member functions of
    > > objects? What's wrong with this picture?

    >
> The routine is overly complicated for its task.


    The mystifying thing was that when I had a more "C-like"
    routine -- namely, the "BigFloat" object used pure
    arrays, instead of std::vector, *and* the exponent/sign
    combination were not encapsulated, plus the routine
    was not a member function -- it performed much better.
(The drawback, though, was that bounds checking for exponents
and arrays had to be all over the place (in all sorts
of different routines), dramatically killing code
readability.) And there is virtually no difference between
    the algorithms used, no more or less shifts.
     
    mike3, Nov 18, 2007
    #3
  4. aku ankka

    aku ankka Guest

    On Nov 18, 7:46 am, mike3 <> wrote:
    > On Nov 17, 9:12 pm, Marco Manfredini <> wrote:
    >
    >
    >
    > > mike3 wrote:

    >
    > > [...]I noticed that the non-math-loop stuff accounts for like

    >
    > > > 70%+ of the time! Many of those functions called are trivial in what
    > > > they do, they're not super-complicated. Even just the calls to
    > > > "GetExp()" which just _return a value_ of a data item (yes, something
    > > > THAT silly and trivial) take *25%* of the time, which I determined by
    > > > simply commenting them out and substituting the values they should
    > > > return in their place. That's just *Crazy*.

    >
    > > It's not crazy, substituting the actual values means that the compiler
    > > can perform constant folding on the rest of your code. Apparently you
    > > derive all kinds of shift values from what GetExp() returns, which also
    > > become constants for the compiler. This line, for example:

    >
    > > > Digit rshCarry(bsh ? (*(bPtr+1) << (DIGIT_SIZE-bsh)) : 0);

    >
    > > Contains a branch+runtime shift, but becomes a simple constant shift if
    > > bsh is known in advance. Btw. the conditional is a non-optimization,
> > because the shift path is taken anyway in most cases. You have an
> > expensive branch to avoid the unavoidable. Also, I wonder why you need
> > to shift your digits anyway. Are you storing the exponents for a different
    > > base than the digits?

    >
    > Yes, the digits must be shifted because the floating
    > point numbers are stored with an exponent to the base
    > two (binary). This avoids the problem of "wobbling
    > precision" that plagues using the digit size as the
    > exponent base. Furthermore I _know_ this can be made
    > fast (see the end of this message for why).
    >
    > The reason for the conditional is because shifting
    > say a 32-bit "unsigned long" type left by 32 bits
    > (if "Digit" is defined as "unsigned long" like it is
    > in my program and we're on a 32-bit machine, DIGIT_SIZE
> is 32 bits -- the "digits" take up an entire machine word),
    > does _not_ produce zero like you'd expect. Try this:
    >
    > ---
    > #include <iostream>
    >
    > using namespace std;
    >
    > int main()
    > {
    > unsigned int a(1245); // should be 32 bits
    >
    > cout << "a = " << a << endl;
    > a <<= 32;
    > cout << "a shifted left by 32 bits = " << a << endl;
    >
    > return 0;}
    >
    > ---
    >
> When I run it, I don't get 0 as would be needed
> in the adder; I get 1245.


For example, on x86 the shift count is masked to five significant
bits in 32-bit mode.

32 & 31 = 0, which amounts to a no-op, so the compiler doesn't have to
emit an ALU instruction at all to implement that statement if it can
see the (a << 0) at compilation time.

The standard document has more information on why this behavior is
legal. :D
     
    aku ankka, Nov 18, 2007
    #4
  5. mike3 wrote:

    >
    > Yes, the digits must be shifted because the floating
    > point numbers are stored with an exponent to the base
    > two (binary). This avoids the problem of "wobbling
    > precision" that plagues using the digit size as the
    > exponent base. Furthermore I _know_ this can be made
    > fast (see the end of this message for why).


    IIRC, the wobbling precision issue is about one digit in your
    representation which stores less information in higher bases. That was a
    big deal in the 60's but I hardly see why this is important if a 4000
    bit floating value effectively has only 3980 bits of real precision? In
    particular if you always have to bit-shift one operand for the extra 20
    bits?

    >
    > The reason for the conditional is because shifting
    > say a 32-bit "unsigned long" type left by 32 bits
    > (if "Digit" is defined as "unsigned long" like it is
    > in my program and we're on a 32-bit machine, DIGIT_SIZE
> is 32 bits -- the "digits" take up an entire machine word),
    > does _not_ produce zero like you'd expect. Try this:


    So, why the heck do you do this in the inner loop? What about
    if (bsh==0)
    {
    }
    else
    {
    }
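Spelled out as a sketch (the names shifted_copy, Digit, DIGIT_SIZE are made up here; only the structure matters): hoisting the test means the inner loops never see a shift count of 0 or 32.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Digit = std::uint32_t;
const int DIGIT_SIZE = 32;

// Right-shift a digit string by bsh bits (0 <= bsh < DIGIT_SIZE), with
// the bsh == 0 test hoisted out of the loop. The loop bodies contain
// no branch, and DIGIT_SIZE - bsh is always a legal shift count.
void shifted_copy(const std::vector<Digit> &src, std::vector<Digit> &dst,
                  int bsh)
{
    const std::size_t n = src.size();
    dst.resize(n);
    if (n == 0) return;
    if (bsh == 0) {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i];                       // plain copy, no shifts
    } else {
        for (std::size_t i = 0; i + 1 < n; ++i)    // bsh is 1..31 here
            dst[i] = (src[i] >> bsh) | (src[i + 1] << (DIGIT_SIZE - bsh));
        dst[n - 1] = src[n - 1] >> bsh;            // top digit: no carry-in
    }
}
```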


    > The mystifying thing was that when I had a more "C-like"
    > routine -- namely, the "BigFloat" object used pure
    > arrays, instead of std::vector, *and* the exponent/sign
    > combination were not encapsulated, plus the routine
    > was not a member function -- it performed much better.
> (The drawback, though, was that bounds checking for exponents
> and arrays had to be all over the place (in all sorts
> of different routines), dramatically killing code
> readability.) And there is virtually no difference between
    > the algorithms used, no more or less shifts.
    >


    If "digits" is a std::vector, then you should not take the address of
    the first element, but an iterator of course. The encapsulation
    shouldn't introduce an overhead, provided that the compiler inlines the
calls. As for the transformation into a member function: your problem might
rather be that you are simply running out of registers. Your core
addition loop, for example, works with (length-wsh-1), sumCarry, aPtr,
bPtr, and rPtr, which is a lot for x86. It literally carries sumCarry. Now
compare with the following approach:

    while(true)
    {
    do
    {
    if (i==end) goto out;
    carry=add_without_carry(a,b,res);
    i++;
    }
    while (!carry);

    do
    {
    if (i==end) goto out;
    i++;
    carry=add_with_carry(a,b,res);
    }
    while(carry);
    }
    out:

The compiler can easily put carry in a scratch register now, and the
add_* routines can be coded with advance knowledge of the carry.
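In concrete form (a sketch only: add_without_carry and add_with_carry are hypothetical helpers, and I've normalized the increment order so every element is added exactly once; the carry out of the top digit is simply dropped here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using Digit = std::uint32_t;

// Each helper knows the incoming carry at compile time, so the carry
// never has to live across the loop body as a variable.
inline bool add_without_carry(const Digit *a, const Digit *b, Digit *res,
                              std::size_t i)
{
    Digit sum = a[i] + b[i];
    res[i] = sum;
    return sum < a[i];            // wrapped => carry out
}

inline bool add_with_carry(const Digit *a, const Digit *b, Digit *res,
                           std::size_t i)
{
    Digit sum = a[i] + b[i] + 1;
    res[i] = sum;
    return sum <= a[i];           // with a carry-in, equality also wraps
}

void add_split(const Digit *a, const Digit *b, Digit *res, std::size_t end)
{
    std::size_t i = 0;
    bool carry = false;
    while (true) {
        do {                      // run of digits with no carry pending
            if (i == end) return;
            carry = add_without_carry(a, b, res, i);
            ++i;
        } while (!carry);
        do {                      // run of digits with a carry pending
            if (i == end) return;
            carry = add_with_carry(a, b, res, i);
            ++i;
        } while (carry);
    }
}
```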

Also consider a storage layout where all the digits are *within* the
(variable-sized) bignum: you save the extra pointers for the
digits in all your computations, plus there are the caching benefits.

And learn to read the assembler output of your compiler. For bignum
addition, the limiting factor is usually the memory transfer capability
of your system.
     
    Marco Manfredini, Nov 18, 2007
    #5
  6. mike3

    mike3 Guest

    On Nov 18, 9:27 am, Marco Manfredini <> wrote:
    > mike3 wrote:
    >
    > > Yes, the digits must be shifted because the floating
    > > point numbers are stored with an exponent to the base
    > > two (binary). This avoids the problem of "wobbling
    > > precision" that plagues using the digit size as the
    > > exponent base. Furthermore I _know_ this can be made
    > > fast (see the end of this message for why).

    >
    > IIRC, the wobbling precision issue is about one digit in your
    > representation which stores less information in higher bases. That was a
    > big deal in the 60's but I hardly see why this is important if a 4000
    > bit floating value effectively has only 3980 bits of real precision? In
    > particular if you always have to bit-shift one operand for the extra 20
    > bits?
    >


    But at 4000 bits one will be waiting days, weeks even for the fractals
    to render -- that corresponds to a magnification factor of some
    10^1200 or so. Bignum is required at around 10^14 magnification or so.
    A 64-bit value effectively having only 32 bits of precision, or a 96-
    bit value effectively having only 64 bits, now THAT's something. Plus
    my original routine which also used binary exponents did not have the
    performance trouble that I am experiencing now.

    >
    >
    > > The reason for the conditional is because shifting
    > > say a 32-bit "unsigned long" type left by 32 bits
    > > (if "Digit" is defined as "unsigned long" like it is
    > > in my program and we're on a 32-bit machine, DIGIT_SIZE
> > is 32 bits -- the "digits" take up an entire machine word),
    > > does _not_ produce zero like you'd expect. Try this:

    >
    > So, why the heck do you do this in the inner loop? What about
    > if (bsh==0)
    > {}
    >
    > else
    > {
    >
    > }


    Like I said, though, it seems that commenting out the inner loop just
    to see how it affects the time required shows that the majority of the
    time is still going into that stuff _outside_ the loop, such as
    fetching the exponent using GetExp().

    > > The mystifying thing was that when I had a more "C-like"
    > > routine -- namely, the "BigFloat" object used pure
    > > arrays, instead of std::vector, *and* the exponent/sign
    > > combination were not encapsulated, plus the routine
    > > was not a member function -- it performed much better.
> > (The drawback, though, was that bounds checking for exponents
> > and arrays had to be all over the place (in all sorts
> > of different routines), dramatically killing code
> > readability.) And there is virtually no difference between
    > > the algorithms used, no more or less shifts.

    >
    > If "digits" is a std::vector, then you should not take the address of
    > the first element, but an iterator of course. The encapsulation
    > shouldn't introduce an overhead, provided that the compiler inlines the
    > calls.


Can you count on that inlining happening with any reasonably
decent compiler? I ask because I don't know whether I may
change compilers in the future.

> As for the transformation into a member function: your problem might
> rather be that you are simply running out of registers. Your core
> addition loop, for example, works with (length-wsh-1), sumCarry, aPtr,
> bPtr, and rPtr, which is a lot for x86. It literally carries sumCarry. Now
> compare with the following approach:
    >


    The original routine used in my previous BigFloat attempt
    that I mentioned used about the same thing, although it
    indexed the arrays directly (a, for example) instead of
    incrementing 3 pointers. In fact the 3 pointers being added
    was actually an attempt to make it go faster, since it
    _still_ had the same performance problem even with the arrays
    done that way.

    > while(true)
    > {
    > do
    > {
    > if (i==end) goto out;
    > carry=add_without_carry(a,b,res);
    > i++;}
    >
    > while (!carry);
    >
    > do
    > {
    > if (i==end) goto out;
    > i++;
    > carry=add_with_carry(a,b,res);}
    > while(carry);
    > }
    >
    > out:
    >
> The compiler can easily put carry in a scratch register now, and the
> add_* routines can be coded with advance knowledge of the carry.
    >
> Also consider a storage layout where all the digits are *within* the
> (variable-sized) bignum: you save the extra pointers for the
> digits in all your computations, plus there are the caching benefits.
>
> And learn to read the assembler output of your compiler. For bignum
> addition, the limiting factor is usually the memory transfer capability
> of your system.
     
    mike3, Nov 18, 2007
    #6
  7. mike3 wrote:

    > Like I said, though, it seems that commenting out the inner loop just
    > to see how it affects the time required shows that the majority of the
    > time is still going into that stuff _outside_ the loop, such as
    > fetching the exponent using GetExp().


I told you before that replacing GetExp() with the "values they should
return" makes the compiler create completely different code, because it
sees constant values now. You can't conclude that GetExp() eats 20% just
by replacing it with a constant; this is not how benchmarking in the
presence of an optimizer works.
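If you want to time the routine with a fixed exponent without the compiler constant-folding everything downstream, launder the value through something opaque, e.g. a volatile read. A sketch, with exp_source standing in for whatever your GetExp() reads:

```cpp
#include <cassert>
#include <cstdint>

// Reading through a volatile forces a real load at run time, so the
// optimizer cannot treat the exponent as a compile-time constant and
// pre-compute the word/bit shifts derived from it.
volatile std::int32_t exp_source = 17;   // stand-in for the real exponent

std::int32_t get_exp_opaque()
{
    return exp_source;   // the compiler must emit an actual load here
}
```

Derive wsh and bsh from get_exp_opaque() in the timed build and the shift code stays inside the measurement instead of being folded away.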
If you instead replace GetExp() with a direct member
access, then look at the assembler code to see if there's a call to GetExp().
If yes, find out why it's not inlined. Is it virtual? Are your
optimization settings OK (i.e. inline control)? Are you using a Visual
Studio *Express* Edition (these do not optimize, IIRC)? And so on. These
are the suggestions that can be made from the code fragment you've posted.

> Can you count on that inlining happening with any reasonably
> decent compiler? I ask because I don't know whether I may
> change compilers in the future.


A reasonable compiler will inline when he realizes that inlining
produces faster code and you want him to create faster code.
     
    Marco Manfredini, Nov 18, 2007
    #7
  8. mike3

    mike3 Guest

    On Nov 18, 2:33 pm, Marco Manfredini <> wrote:
    > mike3 wrote:
    > > Like I said, though, it seems that commenting out the inner loop just
    > > to see how it affects the time required shows that the majority of the
    > > time is still going into that stuff _outside_ the loop, such as
    > > fetching the exponent using GetExp().

    >
> I told you before that replacing GetExp() with the "values they should
> return" makes the compiler create completely different code, because it
> sees constant values now. You can't conclude that GetExp() eats 20% just
> by replacing it with a constant; this is not how benchmarking in the
> presence of an optimizer works.


    Then how can you test to see where the problem is? Any good,
    _free_ profiler? (I don't have a lot of money.)

> If you instead replace GetExp() with a direct member
> access, then look at the assembler code to see if there's a call to GetExp().
> If yes, find out why it's not inlined.


    Because it's not declared inline. I just tried inlining but
    it still doesn't seem to help any.

    > Is it virtual? Are your
    > optimization settings OK (ie. inline control)? Are you using a Visual
    > Studio *Express* Edition (These do not optimize, IIRC) and so on.


    Yes, I was using the "express" edition (I cannot afford the other
    editions), but it does seem to have an optimizer though. (Maybe the
    last time you used it it didn't? Maybe that's something new?) Also,
    the problem persists across compilers.

    > These
    > are my suggestions that can be made from the code fragment you've posted.
    >
> > Can you count on that inlining happening with any reasonably
> > decent compiler? I ask because I don't know whether I may
> > change compilers in the future.

    >
> A reasonable compiler will inline when he realizes that inlining
> produces faster code and you want him to create faster code.


    Who's "he", anyway?
     
    mike3, Nov 19, 2007
    #8
  9. Phlip

    Phlip Guest

    mike3 wrote:

    > Then how can you test to see where the problem is? Any good,
    > _free_ profiler? (I don't have a lot of money.)


    Write unit tests (for C++ with UnitTest++), and then time-test the target
    methods, with simple loop statements.
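A time-test can be as simple as this sketch, using the standard C clock(); the function under test is whatever functor you pass in (UnitTest++ is not needed for this part):

```cpp
#include <cassert>
#include <ctime>

// Run f() 'iterations' times and return the elapsed CPU time in seconds.
// Coarse, but good enough to compare two implementations of a routine.
template <typename F>
double time_loop(F f, long iterations)
{
    std::clock_t start = std::clock();
    for (long i = 0; i < iterations; ++i)
        f();
    return static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
}
```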

    This will also reduce your time spent debugging, and improve your design.
    Your objects must decouple enough to permit testing.

    > Because it's not declared inline. I just tried inlining but
    > it still doesn't seem to help any.


    The 'inline' keyword, and the default inline inside classes, are no
    guarantees of performance. A modern compiler will inline many methods that
    you declared out-of-line, if it thinks this will improve things.

    The first place to start with optimization is your general algorithm; never
    the specific details of its implementation.

    Here's a good thread on this subject:

    http://www.beyond3d.com/content/articles/8/

    It dissects this horror:

    float InvSqrt (float x){
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1);
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
    }
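(As an aside: the pointer casts there break the aliasing rules, so the same bit trick is usually written with memcpy instead, which has defined behaviour. A sketch with the magic constant and the single Newton step unchanged from the code above:)

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Same bit-level trick as InvSqrt above, but moving the float's bits
// through memcpy instead of a pointer cast.
float InvSqrtSafe(float x)
{
    float xhalf = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);    // read the float's bits as an integer
    i = 0x5f3759df - (i >> 1);        // initial guess via the exponent trick
    std::memcpy(&x, &i, sizeof x);    // write the bits back as a float
    x = x * (1.5f - xhalf * x * x);   // one Newton-Raphson step to refine
    return x;
}
```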

    > Yes, I was using the "express" edition (I cannot afford the other
    > editions),


    Get cygwin and g++. The best software on Earth is free.

>> A reasonable compiler will inline when he realizes that inlining
>> produces faster code and you want him to create faster code.

    >
    > Who's "he", anyway?


    The internet is international, and many newcomers to English transliterate
    their native idioms. Some languages have no gender-neutral pronouns, so a
    compiler is "he".

    --
    Phlip
     
    Phlip, Nov 19, 2007
    #9
  10. Phlip wrote:
    > mike3 wrote:
>>> A reasonable compiler will inline when he realizes that inlining
>>> produces faster code and you want him to create faster code.

    >> Who's "he", anyway?

    >
    > The internet is international, and many newcomers to English transliterate
    > their native idioms. Some languages have no gender-neutral pronouns, so a
    > compiler is "he".
    >

    A German peculiarity (where 'compiler', 'computer' etc. are all male
    bastards,..er nouns) which leads to an anthropomorphic perception and a
    certain intentional unwillingness to correct the usage of pronouns in
    other languages. Sorry.
     
    Marco Manfredini, Nov 19, 2007
    #10
  11. mike3

    mike3 Guest

    On Nov 19, 12:23 am, Marco Manfredini <> wrote:
    > Phlip wrote:
    > > mike3 wrote:
> >>> A reasonable compiler will inline when he realizes that inlining
> >>> produces faster code and you want him to create faster code.
    > >> Who's "he", anyway?

    >
    > > The internet is international, and many newcomers to English transliterate
    > > their native idioms. Some languages have no gender-neutral pronouns, so a
    > > compiler is "he".

    >
    > A German peculiarity (where 'compiler', 'computer' etc. are all male
    > bastards,..er nouns) which leads to an anthropomorphic perception and a
    > certain intentional unwillingness to correct the usage of pronouns in
    > other languages. Sorry.


    I understand. Thanks for the clarification. I just got
    confused, that's all.

    What is this "intentional" unwillingness, anyway?
     
    mike3, Nov 19, 2007
    #11
  12. mike3

    mike3 Guest

    On Nov 18, 11:15 pm, "Phlip" <> wrote:
    > mike3 wrote:
    > > Then how can you test to see where the problem is? Any good,
    > > _free_ profiler? (I don't have a lot of money.)

    >
    > Write unit tests (for C++ with UnitTest++), and then time-test the target
    > methods, with simple loop statements.
    >


    Well, I'll see what I get.

    > This will also reduce your time spent debugging, and improve your design.
    > Your objects must decouple enough to permit testing.
    >
    > > Because it's not declared inline. I just tried inlining but
    > > it still doesn't seem to help any.

    >
    > The 'inline' keyword, and the default inline inside classes, are no
    > guarantees of performance. A modern compiler will inline many methods that
    > you declared out-of-line, if it thinks this will improve things.
    >
    > The first place to start with optimization is your general algorithm; never
    > the specific details of its implementation.
    >


There isn't really any better algorithm than that -- it's
just the addition you do with pen and paper, only on a
computer and using base two instead of base ten (although
working in chunks of digits, so you could also think of it
as working in base 2^32 or similar). And like I said, a
    previous implementation in a more "C" style was quite fast
    (about as fast as an integer addition! Now that's fast.).
    The algorithm used there was exactly the same, just the
    implementation differed. So what gives? And it didn't differ
    much -- the principal difference was hands-on manipulations
    of the exponents and not being a member function itself.
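(For the record, the pen-and-paper scheme in base 2^32 is tiny when a 64-bit intermediate type is available -- a sketch, not my actual routine, which also has to handle the bit shift:)

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Schoolbook addition in base 2^32: the high word of the 64-bit sum of
// two digits (plus carry) is exactly the carry into the next column.
void add_base_2_32(const std::uint32_t *a, const std::uint32_t *b,
                   std::uint32_t *res, std::size_t n)
{
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t sum = std::uint64_t(a[i]) + b[i] + carry;
        res[i] = static_cast<std::uint32_t>(sum);  // low 32 bits: the digit
        carry = sum >> 32;                         // high bits: the carry
    }
}
```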

    > Here's a good thread on this subject:
    >
    > http://www.beyond3d.com/content/articles/8/
    >
> It dissects this horror:
    >
    > float InvSqrt (float x){
    > float xhalf = 0.5f*x;
    > int i = *(int*)&x;
    > i = 0x5f3759df - (i>>1);
    > x = *(float*)&i;
    > x = x*(1.5f - xhalf*x*x);
    > return x;
    >
    > }


    Eww! That's awful! Talk about optimization and
    clear code being in opposition! Holy cow.

    > > Yes, I was using the "express" edition (I cannot afford the other
    > > editions),

    >
    > Get cygwin and g++. The best software on Earth is free.
    >
> >> A reasonable compiler will inline when he realizes that inlining
> >> produces faster code and you want him to create faster code.

    >
    > > Who's "he", anyway?

    >
    > The internet is international, and many newcomers to English transliterate
    > their native idioms. Some languages have no gender-neutral pronouns, so a
    > compiler is "he".
    >


    Ah. I see.

    > --
    > Phlip
     
    mike3, Nov 19, 2007
    #12
  13. On 2007-11-19 03:41, mike3 wrote:
    > On Nov 18, 2:33 pm, Marco Manfredini <> wrote:
    >> mike3 wrote:
    >> Is it virtual? Are your
    >> optimization settings OK (ie. inline control)? Are you using a Visual
    >> Studio *Express* Edition (These do not optimize, IIRC) and so on.

    >
    > Yes, I was using the "express" edition (I cannot afford the other
    > editions), but it does seem to have an optimizer though. (Maybe the
    > last time you used it it didn't? Maybe that's something new?) Also,
    > the problem persists across compilers.


The compiler (and optimiser) is (and always has been) the same as in
the non-free versions; what you miss out on is tools, such as the
profiler, and libraries.

    --
    Erik Wikström
     
    Erik Wikström, Nov 19, 2007
    #13
  14. mike3

    mike3 Guest

    On Nov 19, 10:04 am, Erik Wikström <> wrote:
    > On 2007-11-19 03:41, mike3 wrote:
    >
    > > On Nov 18, 2:33 pm, Marco Manfredini <> wrote:
    > >> mike3 wrote:
    > >> Is it virtual? Are your
    > >> optimization settings OK (ie. inline control)? Are you using a Visual
    > >> Studio *Express* Edition (These do not optimize, IIRC) and so on.

    >
    > > Yes, I was using the "express" edition (I cannot afford the other
    > > editions), but it does seem to have an optimizer though. (Maybe the
    > > last time you used it it didn't? Maybe that's something new?) Also,
    > > the problem persists across compilers.

    >
> The compiler (and optimiser) is (and always has been) the same as in
> the non-free versions; what you miss out on is tools, such as the
> profiler, and libraries.
    >


    That's right. That's the actual difference. If you want the tools, you
    must pay a small fortune ($700).

    > --
    > Erik Wikström
     
    mike3, Nov 19, 2007
    #14
  15. Phlip

    Phlip Guest

    mike3 wrote:

    >> float InvSqrt (float x){
    >> float xhalf = 0.5f*x;

    ....
    >> }

    >
    > Eww! That's awful! Talk about optimization and
    > clear code being in opposition! Holy cow.


    That is the Alpha and Omega of Postmature Optimization.

    (Look up "premature optimization" to see what I mean.)

    The method is inside a function, InvSqrt, so you don't need to type all that
    just to collect inverse square roots. That is exemplary, not awful.

    But note that modern architectures will probably run 'double' faster than
    'float', when the former matches the FPU's bus.

    --
    Phlip
     
    Phlip, Nov 20, 2007
    #15
  16. mike3

    mike3 Guest

    Problem fixed (was Re: Strange performance problem in C++ program)

Well, just wanted to inform you that I fixed my problem by removing
the calls to the helper functions and instead working on the
sign/exponent data directly. Although this made the code messier,
unfortunately(*), it gave me a 2.5x performance gain -- the routine now
runs at around 50% of the speed of integer addition. I doubt it is
possible to make it any faster than that due to the bit shift, unless
perhaps one programs it in assembler, but that's a whole new story.

(*) Fortunately, though, there are only 5 of these "messy" fast
routines, for the 5 basic numerical operations (addition, subtraction,
multiplication, division, and comparison).
     
    mike3, Nov 20, 2007
    #16
  17. Phlip wrote:
    > mike3 wrote:
    >
    >>> float InvSqrt (float x){
    >>> float xhalf = 0.5f*x;

    > ...
    >>> }

    > The method is inside a function, InvSqrt, so you don't need to type all that
    > just to collect inverse square roots. That is exemplary, not awful.


    It's a terrible name, too. To me, an "inverse square root" is a square.
    The function calculates reciprocal square roots.
     
    Philip Potter, Nov 20, 2007
    #17
  18. Phlip

    Phlip Guest

    Philip Potter wrote:

    > >>> float InvSqrt (float x){


    > It's a terrible name, too. To me, an "inverse square root" is a square.
    > The function calculates reciprocal square roots.


    Point. I _wish_ I could have written the innards, but I certainly
    would have found a better name!

    --
    Phlip
     
    Phlip, Nov 21, 2007
    #18
