code optimisation?

Discussion in 'C Programming' started by Jeff, Jan 10, 2007.

  1. Jeff

    Jeff Guest

    Hello

    Sorry if this isn't entirely a C language question - perhaps someone could
    suggest a more appropriate group?

    I'm running the appended code on a Mips R12000 processor and am getting very
    confused about why the use of the temporary array (tempC) can give such a
    large speed-up (50% for size<128).

    The alternative to using tempC is to write the result of the inner dot
    product directly into C at the end of each inner loop. As far as I can see
    this is not a caching issue. My hunch is that this is related to the fact
    that every element of tempC can be reached from a single base register with
    an immediate offset, whereas C itself is far too large for that.

    Is anyone familiar with this issue?

    Many Thanks
    Jeff


    for(i=0; i<size; i++)
    {
        for(j=0; j<size; j++)
        {
            rowBPosition = size*j;                 // start of row j of B
            x = 0;
            for(k=0; k<size; k++)                  // dot product: row i of A . row j of B
            {
                x += _pA[k] * pB[rowBPosition+k];
            }
            tempC[j] = x;                          // stash the result in the temporary row
        }

        // write tempC into a row of C
        while(_tempC < tempCEnd)
            *_pC++ = *_tempC++;

        _pA += size;                               // advance to the next row of A
        _tempC = tempC;                            // rewind the temporary row pointer
    }
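
    For comparison, the direct-write alternative looks roughly like this (same
    variables as above, just storing each dot product straight into the current
    row of C instead of going through tempC):

    for(i=0; i<size; i++)
    {
        for(j=0; j<size; j++)
        {
            rowBPosition = size*j;
            x = 0;
            for(k=0; k<size; k++)
            {
                x += _pA[k] * pB[rowBPosition+k];
            }
            *_pC++ = x;                            // no temporary row
        }

        _pA += size;                               // advance to the next row of A
    }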
     
    Jeff, Jan 10, 2007
    #1

  2. In article <>,
    Jeff <> wrote:

    >Sorry if this isn't entirely a C language question - perhaps someone could
    >suggest a more appropriate group?


    comp.unix.programming ? comp.sys.sgi.misc ?


    >I'm running the appended code on a Mips R12000 processor and am getting very
    >confused about why the use of the temporary array (tempC) can give such a
    >large speed-up (50% for size<128).


    >The alternative to using tempC is to write the result of the inner dot
    >product directly into C at the end of each inner loop. As far as I can see
    >this is not a caching issue. My hunch is that this is related to the fact
    >that the size of tempC is well within a single base+offset load whereas the
    >size of C itself is much larger than this.


    >Is anyone familiar with this issue?


    You do not happen to mention the platform. If it were SGI IRIX MipsPro
    then I could answer about what -does- happen; as it is, I can only
    offer what -might- happen.

    When you use fixed arrays, the SGI MipsPro compiler optimizes
    the starting location of the array to reduce cache-line thrashing: it
    deliberately arranges the array and the operations to maximize
    pipelining, especially if you have turned the MipsPro
    LNO (Loop Nest Optimizer) options way up.

    Your alternate code writes via a pointer instead, and the compiler
    cannot know what the alignment of that destination memory is. It
    thus cannot assume that no cache-line thrashing is taking place
    and so cannot generate code that is as tight; and if cache-line
    thrashing -does- take place, then you could get substantial speed
    reductions.
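
    To make that concrete, compare something like the two routines below (a
    generic sketch with made-up names, nothing MipsPro-specific). In
    store_via_local() the destination of the inner stores is a fixed local
    array whose size and alignment the compiler controls, so it is free to lay
    the array out and schedule around it; in store_via_pointer() it only sees
    an arbitrary pointer and has to stay conservative about alignment and
    possible aliasing.

    #define N 128

    void store_via_local(const double *a, const double *b, double *c)
    {
        double temp[N];              /* local: known size, compiler-chosen alignment */
        int j, k;

        for (j = 0; j < N; j++) {
            double x = 0.0;
            for (k = 0; k < N; k++)
                x += a[k] * b[j*N + k];
            temp[j] = x;             /* store into the well-understood local array */
        }
        for (j = 0; j < N; j++)      /* one contiguous copy-out at the end */
            c[j] = temp[j];
    }

    void store_via_pointer(const double *a, const double *b, double *c)
    {
        int j, k;

        for (j = 0; j < N; j++) {
            double x = 0.0;
            for (k = 0; k < N; k++)
                x += a[k] * b[j*N + k];
            c[j] = x;                /* c is an arbitrary pointer: unknown alignment,
                                        and it might even alias a or b */
        }
    }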

    With a dot-product size of < 128, you are likely fitting the
    entire math operation into the primary cache, to the point
    where the limiting factor might become the write of the
    result to somewhere off-cache. As such, even a single
    conservative main-memory access left in place could come close
    to halving the net speed of your calculation.
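
    To put rough numbers on that (assuming 8-byte doubles, and the 32 KB
    primary data cache I recall the R10000/R12000 line having): at size = 128
    the temporary row is only 128 * 8 = 1 KB and easily stays resident in the
    primary cache, whereas a full 128 x 128 matrix is already 128 KB. So any
    store that has to go beyond the primary cache is expensive relative to the
    in-cache arithmetic around it.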


    If you do happen to be using the IRIX MipsPro compilers, then push up
    the logging level on LNO: it can be surprising how tiny little
    program tweaks can improve the optimization.

    --
    "No one has the right to destroy another person's belief by
    demanding empirical evidence." -- Ann Landers
     
    Walter Roberson, Jan 10, 2007
    #2

  3. Barry

    Barry Guest

    "Jeff" <> wrote in message
    news:...
    > Hello
    >
    > Sorry if this isn't entirely a C language question - perhaps someone could
    > suggest a more appropriate group?
    >
    > I'm running the appended code on a Mips R12000 processor and am getting
    > very confused about why the use of the temporary array (tempC) can give
    > such a large speed-up (50% for size<128).
    >
    > The alternative to using tempC is to write the result of the inner dot
    > product directly into C at the end of each inner loop. As far as I can see
    > this is not a caching issue. My hunch is that this is related to the fact
    > that the size of tempC is well within a single base+offset load whereas
    > the size of C itself is much larger than this.
    >
    > Is anyone familiar with this issue?
    >
    > Many Thanks
    > Jeff
    >
    >
    > for(i=0; i<size; i++)
    > {
    > for(j=0; j<size; j++)
    > {
    > rowBPosition = size*j;
    > x=0;
    > for(k=0; k<size; k++)
    > {
    > x+=_pA[k] * pB[rowBPosition+k];
    > }
    > tempC[j]=x;
    > }
    >
    > // write tempC into a row of C
    > while(_tempC<tempCEnd)
    > *_pC++=*_tempC++;
    >
    > _pA+=size;
    > _tempC=tempC;
    > }
    >


    I believe most readers of this group are quite familiar
    with off-topic posts.

    And if someone doesn't reply "42", I will quit reading
    or answering.

    <<Off Topic
    Ask the folks who wrote the compiler. Look at
    the assembly blah blah....
    Off Topic>>
     
    Barry, Jan 10, 2007
    #3
  4. Jeff

    Jeff Guest

    "Walter Roberson" <-cnrc.gc.ca> wrote in message
    news:eo3lfu$fmf$...
    > In article <>,
    > Jeff <> wrote:
    >
    >>Sorry if this isn't entirely a C language question - perhaps someone could
    >>suggest a more appropriate group?

    >
    > comp.unix.programming ? comp.sys.sgi.misc ?
    >


    Thanks, the latter group is probably the most appropriate.

    And many thanks for your detailed response.
    Yes, I'm using the SGI MipsPro compiler and have all optimisations turned up
    fully.

    I'm running a utility called perfex which gives me a count of instructions
    and cache misses. Cache misses seem very similar whether I use an
    intermediate array when calculating a row of C or whether I write each
    element directly into C. Instruction counts, however, seem to be halved for
    smaller matrices by using the temporary row-array - even though the two
    methods are semantically almost identical. It would seem that this must be
    due to some sort of dynamic scheduling at run time rather than to the code
    the compiler generates, since the compiler can't know in advance how big
    things will be.
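
    For what it's worth, here is roughly the kind of wall-clock harness one
    could use alongside perfex to compare the two variants (a simplified,
    self-contained sketch with index-based versions of the two loops rather
    than my actual pointer code, and made-up sizes):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* variant 1: accumulate each output row in a temporary, then copy it out */
    static void multiply_via_temp(const double *a, const double *b, double *c,
                                  double *temp, int size)
    {
        int i, j, k;
        for (i = 0; i < size; i++) {
            for (j = 0; j < size; j++) {
                double x = 0.0;
                for (k = 0; k < size; k++)
                    x += a[i*size + k] * b[j*size + k];
                temp[j] = x;
            }
            for (j = 0; j < size; j++)
                c[i*size + j] = temp[j];
        }
    }

    /* variant 2: write each dot product straight into C */
    static void multiply_direct(const double *a, const double *b, double *c,
                                int size)
    {
        int i, j, k;
        for (i = 0; i < size; i++) {
            for (j = 0; j < size; j++) {
                double x = 0.0;
                for (k = 0; k < size; k++)
                    x += a[i*size + k] * b[j*size + k];
                c[i*size + j] = x;
            }
        }
    }

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        int size = 96, reps = 50, i, r;
        double *a = malloc(size * size * sizeof *a);
        double *b = malloc(size * size * sizeof *b);
        double *c = malloc(size * size * sizeof *c);
        double *temp = malloc(size * sizeof *temp);
        double t0, t1, t2;

        for (i = 0; i < size * size; i++) { a[i] = i % 7; b[i] = i % 11; }

        t0 = seconds();
        for (r = 0; r < reps; r++) multiply_via_temp(a, b, c, temp, size);
        t1 = seconds();
        for (r = 0; r < reps; r++) multiply_direct(a, b, c, size);
        t2 = seconds();

        printf("via temp row: %.3f s   direct: %.3f s\n", t1 - t0, t2 - t1);
        free(a); free(b); free(c); free(temp);
        return 0;
    }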

    Anyway, maybe I should move over to comp.sys.sgi.misc ...

    Many thanks once again

    Jeff
     
    Jeff, Jan 10, 2007
    #4
