Re: Code optimisation

Discussion in 'C++' started by Rob Williscroft, Aug 27, 2003.

  1. Pete wrote in news:bii1hq$9hg$:

    > Hello, hope the following in on-topic
    >
    > I'm trying to optimise some code for speed.
    > Essentially I have a big loop with a kernel of the form
    >
    > double *out, *in;
    > int *test;
    >
    > if(test)
    > out = 4*in
    > else
    > out = 0.0;
    >
    > On my machine I find that about the same performance is obtained with
    >
    > out = test*(4*in);


    On meny, but not all, modern CPU's executing branches is quite
    expensive.

    Also it may be that your FPU having loaded a 0 (from test) will
    simply ignore the rest of the expresion.

    Even so both of the above could be changed to the other if your
    compilers optimizer is smart enough.

    Fortunatly (*) the day's where you could look at two pieces of
    code and say which is faster are gone. Today you need to test
    and before you do that you may need to profile your code to
    find the bottle neck, there's no point optimizing code that's
    hardy ever run's :).

    (*) I say Fortunate because (a) now we don't have to do it and
    (b) because its a consiquence of how fast/clever mordern CPU's
    and compiers are.

    >
    > Is this to be expected with all compilers and h/w and is there a
    > faster way of doing this?
    >


    <£0.02>
    The only optimization a programmer should be doing is
    reducing the compexity of an algorithm, reducing the
    total number of operation's your algorithm does.

    Leave making individual operation's faster to your
    compilers optimizer.
    </£0.02>

    Rob.
    --
    http://www.victim-prime.dsl.pipex.com/
    Rob Williscroft, Aug 27, 2003
    #1
    1. Advertising

  2. Rob Williscroft <> wrote in message news:<Xns93E493852EF72ukcoREMOVEfreenetrtw@195.129.110.201>...
    > Pete wrote in news:bii1hq$9hg$:
    >
    > > Hello, hope the following in on-topic
    > >
    > > I'm trying to optimise some code for speed.
    > > Essentially I have a big loop with a kernel of the form
    > >
    > > double *out, *in;
    > > int *test;
    > >
    > > if(test)
    > > out = 4*in
    > > else
    > > out = 0.0;
    > >
    > > On my machine I find that about the same performance is obtained with
    > >
    > > out = test*(4*in);

    >
    > On meny, but not all, modern CPU's executing branches is quite
    > expensive.


    I wonder how much this is still true. The advent of pipelining
    made branches expensive, but most modern CPUs offset this using
    branch prediction and/or by executing both branches simultaneously.

    > Also it may be that your FPU having loaded a 0 (from test) will
    > simply ignore the rest of the expresion.
    >
    > Even so both of the above could be changed to the other if your
    > compilers optimizer is smart enough.


    Hmm. In this case the compiler would have to be know that test
    can only contain zeros and ones. Maybe if test were a bool* the
    compiler could do this.

    > <£0.02>
    > The only optimization a programmer should be doing is
    > reducing the compexity of an algorithm, reducing the
    > total number of operation's your algorithm does.
    >
    > Leave making individual operation's faster to your
    > compilers optimizer.
    > </£0.02>


    I agree with your two pence, but just to fully experience the
    futility of this sort of low-level optimisation I decided to
    play around with it a little bit.

    First I changed test to an array of bool. This reduced run time
    very slightly, probably by making better use of the processor
    cache.

    Next I changed the loop to use pointer arithmetic instead of
    an integer subscript. Somewhat to my surprise, this also made
    for a (very slight) performance improvement. Here's the code
    I used:

    void Test(
    double const* in,
    bool const* test,
    int size,
    double* out)
    {
    double const* const end = in + size;
    for (; in != end; ++in, ++test, ++out)
    {
    *out = *test ? 4 * (*in) : 0.0L;
    }
    }

    Next I tried unwinding the loop four times, then eight times,
    but this made NO difference! (Apparently branching isn't very
    costly on my hardware, a Pentium 4.)

    Finally, I did a version in which test is a bit array. I
    unwound the loop eight times so as to process one byte of
    test with each iteration. Here's the very ugly result (BYTE
    is a typedef for unsigned char):

    void Test2(
    double const* in,
    BYTE const* test,
    int size,
    double* out)
    {
    for (; size >= 8;
    in += 8, ++test, out += 8, size -= 8)
    {
    out[0] = *test & 0x01 ? 4 * in[0] : 0.0L;
    out[1] = *test & 0x02 ? 4 * in[1] : 0.0L;
    out[2] = *test & 0x04 ? 4 * in[2] : 0.0L;
    out[3] = *test & 0x08 ? 4 * in[3] : 0.0L;
    out[4] = *test & 0x10 ? 4 * in[4] : 0.0L;
    out[5] = *test & 0x20 ? 4 * in[5] : 0.0L;
    out[6] = *test & 0x40 ? 4 * in[6] : 0.0L;
    out[7] = *test & 0x80 ? 4 * in[7] : 0.0L;
    }

    switch (size)
    {
    case 7: out[6] = *test & 0x40 ? 4 * in[6] : 0.0L;
    case 6: out[5] = *test & 0x20 ? 4 * in[5] : 0.0L;
    case 5: out[4] = *test & 0x10 ? 4 * in[4] : 0.0L;
    case 4: out[3] = *test & 0x08 ? 4 * in[3] : 0.0L;
    case 3: out[2] = *test & 0x04 ? 4 * in[2] : 0.0L;
    case 2: out[1] = *test & 0x02 ? 4 * in[1] : 0.0L;
    case 1: out[0] = *test & 0x01 ? 4 * in[0] : 0.0L;
    }
    }

    What did all this extra complexity buy me? About a 5% reduction
    in run time, probably due to better use of the processor cache
    (since we already found that loop unwinding made no difference).
    To put this in perspective, we're talking about a difference of
    1.6 milliseconds for 1,000,000 elements!

    These specific results are of course VERY platform dependent
    (I'm using a Pentium 4 and MSVC7.1). The more general lesson is
    what Pete already said: it's better to optimise "in the large"
    by choosing good algorithms, and let the compiler and hardware
    worry about optimising "in the small".

    Another lesson might be that one should minimize memory usage,
    and choose data structures with good memory locality.

    --Nick
    Niklas Borson, Aug 27, 2003
    #2
    1. Advertising

  3. > > > I'm trying to optimise some code for speed.
    > > > Essentially I have a big loop with a kernel of the form
    > > >
    > > > double *out, *in;
    > > > int *test;
    > > >
    > > > if(test)
    > > > out = 4*in
    > > > else
    > > > out = 0.0;
    > > >
    > > > On my machine I find that about the same performance is obtained

    with
    > > >
    > > > out = test*(4*in);

    > >
    > > On meny, but not all, modern CPU's executing branches is quite
    > > expensive.

    >
    > I wonder how much this is still true. The advent of pipelining
    > made branches expensive, but most modern CPUs offset this using
    > branch prediction and/or by executing both branches simultaneously.


    Unfortunately branch prediction is not perfect. If the contents of the
    test[] array are rather random the branch predictor is quite likely to
    bet on the wrong code path. OTOH if the vast majority of the elements in
    the test[] array are for example 0, then the branch predictor will be
    correct in most cases. Even if both branches are executed
    simultaneously, it means that execution resources are taken away for
    more useful tasks. It all depends, but the OP does not provide
    sufficient information to make a reasonable guess what the bottleneck
    is.

    <snip>
    > Next I tried unwinding the loop four times, then eight times,
    > but this made NO difference! (Apparently branching isn't very
    > costly on my hardware, a Pentium 4.)


    That is because the branch for the loop is easilly predictable. Only in
    the last iteration of the loop you will have a miss predict penaltly.

    > Another lesson might be that one should minimize memory usage,
    > and choose data structures with good memory locality.


    That might also be the problem, unfortunately the OP didn't provide
    information about the size of the dataset.

    --
    Peter van Merkerk
    peter.van.merkerk(at)dse.nl
    Peter van Merkerk, Aug 28, 2003
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Agent Mulder

    Re: Code optimisation

    Agent Mulder, Aug 27, 2003, in forum: C++
    Replies:
    1
    Views:
    301
    Peter van Merkerk
    Aug 27, 2003
  2. Rob Williscroft

    Re: Code optimisation

    Rob Williscroft, Aug 27, 2003, in forum: C++
    Replies:
    1
    Views:
    338
    Peter van Merkerk
    Aug 27, 2003
  3. Peter van Merkerk

    Re: Code optimisation

    Peter van Merkerk, Aug 27, 2003, in forum: C++
    Replies:
    1
    Views:
    343
    Alan Sung
    Aug 27, 2003
  4. mjm

    Re: Code optimisation

    mjm, Aug 29, 2003, in forum: C++
    Replies:
    2
    Views:
    332
    Peter van Merkerk
    Aug 29, 2003
  5. Farraige
    Replies:
    4
    Views:
    275
    Farraige
    Nov 8, 2006
Loading...

Share This Page