code optimisation?


Jeff

Hello

Sorry if this isn't entirely a C language question - perhaps someone could
suggest a more appropriate group?

I'm running the appended code on a MIPS R12000 processor and am getting very
confused about why the use of the temporary array (tempC) can give such a
large speed-up (50% for size < 128).

The alternative to using tempC is to write the result of the inner dot
product directly into C at the end of each inner loop. As far as I can see
this is not a caching issue. My hunch is that it is related to the fact
that tempC is small enough for every element to be reached with a single
base+offset load, whereas C itself is much larger than that.

Is anyone familiar with this issue?

Many Thanks
Jeff


/* Compute one row of C at a time: each element is the dot product of the
 * current row pointed to by _pA with a row of pB.  The row is accumulated
 * in the small temporary tempC and only then copied out into C. */
for (i = 0; i < size; i++)
{
    for (j = 0; j < size; j++)
    {
        rowBPosition = size * j;
        x = 0;
        for (k = 0; k < size; k++)
        {
            x += _pA[k] * pB[rowBPosition + k];
        }
        tempC[j] = x;              /* stage the result in the temporary row */
    }

    /* write tempC into a row of C */
    while (_tempC < tempCEnd)
        *_pC++ = *_tempC++;

    _pA += size;                   /* next row of A */
    _tempC = tempC;                /* rewind the temporary for the next row */
}
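
For comparison, the direct-write variant described above is roughly as follows (a sketch reconstructed from the description, not the exact code that was timed):

/* Sketch of the alternative: store each dot product straight into C instead
 * of staging it in tempC.  Reconstructed from the description above, so the
 * details may differ from the code that was actually benchmarked. */
for (i = 0; i < size; i++)
{
    for (j = 0; j < size; j++)
    {
        rowBPosition = size * j;
        x = 0;
        for (k = 0; k < size; k++)
        {
            x += _pA[k] * pB[rowBPosition + k];
        }
        _pC[j] = x;                /* write directly into the current row of C */
    }

    _pA += size;
    _pC += size;                   /* advance to the next row of C */
}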
 

Walter Roberson

Jeff said:
Sorry if this isn't entirely a C language question - perhaps someone could
suggest a more appropriate group?

comp.unix.programming ? comp.sys.sgi.misc ?

I'm running the appended code on a MIPS R12000 processor and am getting very
confused about why the use of the temporary array (tempC) can give such a
large speed-up (50% for size < 128).
The alternative to using tempC is to write the result of the inner dot
product directly into C at the end of each inner loop. As far as I can see
this is not a caching issue. My hunch is that it is related to the fact
that tempC is small enough for every element to be reached with a single
base+offset load, whereas C itself is much larger than that.
Is anyone familiar with this issue?

You do not happen to mention the platform. If it were SGI IRIX MipsPro
then I could answer about what -does- happen; as it is, I can only
offer what -might- happen.

When you use fixed arrays, the SGI MipsPro compiler optimizes
the starting location of the array to reduce cache-line thrashing: it
deliberately arranges the array and the operations to maximize
pipelining, especially if you have turned the MipsPro
LNO (Loop Nest Optimizer) options way up.

Your alternate code writes via a pointer instead, and the compiler
cannot know what the alignment of that destination memory is. It
thus cannot assume that no cache-line thrashing is taking place
and so cannot generate code that is as tight; and if cache-line
thrashing -does- take place, then you could get substantial speed
reductions.
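
A minimal illustration of the difference (hypothetical code, not from the post above): the compiler sees everything about an array it lays out itself, but it has to stay conservative about memory it only reaches through a pointer.

/* Hypothetical example: both loops perform the same stores, but the compiler
 * chooses tempC's placement and alignment itself, whereas pC could point
 * anywhere, at any alignment, and could alias other data. */
#define SIZE 128

static double tempC[SIZE];            /* definition visible to the compiler */

void fill_fixed(double x)
{
    int j;
    for (j = 0; j < SIZE; j++)
        tempC[j] = x;                 /* target fully known: easier to pipeline */
}

void fill_through_pointer(double *pC, double x)
{
    int j;
    for (j = 0; j < SIZE; j++)
        pC[j] = x;                    /* alignment and aliasing unknown here */
}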

With a dot-product size of < 128, you are likely fitting the
entire math operation into the primary cache, to the point
where your time constraint might become the write of the
result to somewhere off-cache. As such, even a single
conservative main-memory access left in place could come close
to halving the net speed of your calculation.
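
To put rough numbers on that (assuming 8-byte doubles and a 32 KB primary data cache on the R12000, both of which are my assumptions rather than anything stated above): at size = 64 the whole of pB alone is 64 * 64 * 8 = 32 KB, roughly one primary data cache's worth, so for matrices in that range the operands largely stay cached and the stores into C become the main traffic that has to leave the cache.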


If you do happen to be using the IRIX MipsPro compilers, then push up
the logging level on LNO: it can be surprising how much tiny little
program tweaks can improve the optimization.
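
If it helps, an invocation along these lines lets you compare what the compiler actually emits for the two variants (the flag spellings are from memory, so treat this as a sketch and check cc(1) and the -LNO: option group on your system):

# assumed MipsPro command lines; the file names are placeholders
cc -64 -mips4 -O3 -S matmul_temp.c      # emits matmul_temp.s for the tempC version
cc -64 -mips4 -O3 -S matmul_direct.c    # emits matmul_direct.s for the direct-write version
# diff the two .s files to see how differently the inner loops were scheduled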
 

Barry

Jeff said:
Hello

Sorry if this isn't entirely a C language question - perhaps someone could
suggest a more appropriate group?


I believe most readers of this group are quite familiar
with off-topic posts.

And if someone doesn't reply "42", I will quit reading
or answering.

<<Off Topic
Ask the folks who wrote the compiler. Look at
the assembly blah blah....
Off Topic>>
 

Jeff

Walter Roberson said:
comp.unix.programming ? comp.sys.sgi.misc ?

Thanks, the latter group is probably the most appropriate.

And many thanks for your detailed response.
Yes, I'm using the SGI MipsPro compiler and have all optimisations turned up
fully.

I'm running a utility called Perfex which gives me a count of instructions
and cache misses. Cache misses seem very similar whether I use an
intermediate array when calculating a row of C or whether I write each
element directly into C. Instruction counts, however, seem to be halved for
smaller matrices by using the temporary row-array - even though the two
methods are semantically almost identical. It would seem that this must be
due to some sort of dynamic scheduling/compiling rather than assembly code
generation, since the compiler can't know how big things will be.
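
For anyone reproducing this, something along these lines should multiplex over all the R12000 event counters (option spellings are from memory rather than checked against perfex(1), so treat them as assumptions):

# assumed perfex usage; the program names are placeholders
perfex -a -y ./matmul_temp 64       # count all events, with estimated costs
perfex -a -y ./matmul_direct 64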

Anyway, maybe I should move over to comp.sys.sgi.misc ...

Many thanks once again

Jeff
 
