code optimisation?


Jeff

Hello

Sorry if this isn't entirely a C language question - perhaps someone could
suggest a more appropriate group?

I'm running the appended code on a MIPS R12000 processor and am getting very
confused about why the use of the temporary array (tempC) can give such a
large speed-up (50% for size < 128).

The alternative to using tempC is to write the result of the inner dot
product directly into C at the end of each inner loop. As far as I can see
this is not a caching issue. My hunch is that it is related to the fact
that tempC is small enough for every element to be reached with a single
base+offset load, whereas C itself is much larger than that.

Is anyone familiar with this issue?

Many Thanks
Jeff


/* Compute one row of C at a time: each element is the dot product of the
 * current row pointed to by _pA with a row of pB.  The row is accumulated
 * in the small temporary tempC and only then copied out into C. */
for (i = 0; i < size; i++)
{
    for (j = 0; j < size; j++)
    {
        rowBPosition = size * j;
        x = 0;
        for (k = 0; k < size; k++)
        {
            x += _pA[k] * pB[rowBPosition + k];
        }
        tempC[j] = x;              /* stage the result in the temporary row */
    }

    /* write tempC into a row of C */
    while (_tempC < tempCEnd)
        *_pC++ = *_tempC++;

    _pA += size;                   /* next row of A */
    _tempC = tempC;                /* rewind the temporary for the next row */
}
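
For comparison, the direct-write variant described above is roughly as follows (a sketch reconstructed from the description, not the exact code that was timed):

/* Sketch of the alternative: store each dot product straight into C instead
 * of staging it in tempC.  Reconstructed from the description above, so the
 * details may differ from the code that was actually benchmarked. */
for (i = 0; i < size; i++)
{
    for (j = 0; j < size; j++)
    {
        rowBPosition = size * j;
        x = 0;
        for (k = 0; k < size; k++)
        {
            x += _pA[k] * pB[rowBPosition + k];
        }
        _pC[j] = x;                /* write directly into the current row of C */
    }

    _pA += size;
    _pC += size;                   /* advance to the next row of C */
}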
 

Walter Roberson

Jeff said:
Sorry if this isn't entirely a C language question - perhaps someone could
suggest a more appropriate group?

comp.unix.programming ? comp.sys.sgi.misc ?

I'm running the appended code on a MIPS R12000 processor and am getting very
confused about why the use of the temporary array (tempC) can give such a
large speed-up (50% for size < 128).
The alternative to using tempC is to write the result of the inner dot
product directly into C at the end of each inner loop. As far as I can see
this is not a caching issue. My hunch is that it is related to the fact
that tempC is small enough for every element to be reached with a single
base+offset load, whereas C itself is much larger than that.
Is anyone familiar with this issue?

You do not happen to mention the platform. If it were SGI IRIX MipsPro
then I could answer about what -does- happen; as it is, I can only
offer what -might- happen.

When you use fixed arrays, the SGI MipsPro compiler optimizes
the starting location of the array to reduce cache-line thrashing: it
deliberately arranges the array and the operations to maximize
pipelining, especially if you have turned the MipsPro
LNO (Loop Nest Optimizer) options way up.

Your alternate code writes via a pointer instead, and the compiler
cannot know what the alignment of that destination memory is. It
thus cannot assume that no cache-line thrashing is taking place
and so cannot generate code that is as tight; and if cache-line
thrashing -does- take place, then you could get substantial speed
reductions.
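
A minimal illustration of the difference (hypothetical code, not from the post above): the compiler sees everything about an array it lays out itself, but it has to stay conservative about memory it only reaches through a pointer.

/* Hypothetical example: both loops perform the same stores, but the compiler
 * chooses tempC's placement and alignment itself, whereas pC could point
 * anywhere, at any alignment, and could alias other data. */
#define SIZE 128

static double tempC[SIZE];            /* definition visible to the compiler */

void fill_fixed(double x)
{
    int j;
    for (j = 0; j < SIZE; j++)
        tempC[j] = x;                 /* target fully known: easier to pipeline */
}

void fill_through_pointer(double *pC, double x)
{
    int j;
    for (j = 0; j < SIZE; j++)
        pC[j] = x;                    /* alignment and aliasing unknown here */
}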

With a dot-product size of < 128, you are likely fitting the
entire math operation into the primary cache, to the point
where your time constraint might become the write of the
result to somewhere off-cache. As such, even a single
conservative main-memory access left in place could come close
to halving the net speed of your calculation.
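
To put rough numbers on that (assuming 8-byte doubles and a 32 KB primary data cache on the R12000, both of which are my assumptions rather than anything stated above): at size = 64 the whole of pB alone is 64 * 64 * 8 = 32 KB, roughly one primary data cache's worth, so for matrices in that range the operands largely stay cached and the stores into C become the main traffic that has to leave the cache.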


If you do happen to be using the IRIX MipsPro compilers, then push up
the logging level on LNO: it can be surprising how much tiny little
program tweaks can improve the optimization.
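
If it helps, an invocation along these lines lets you compare what the compiler actually emits for the two variants (the flag spellings are from memory, so treat this as a sketch and check cc(1) and the -LNO: option group on your system):

# assumed MipsPro command lines; the file names are placeholders
cc -64 -mips4 -O3 -S matmul_temp.c      # emits matmul_temp.s for the tempC version
cc -64 -mips4 -O3 -S matmul_direct.c    # emits matmul_direct.s for the direct-write version
# diff the two .s files to see how differently the inner loops were scheduled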
 

Barry

Jeff said:
Hello

Sorry if this isn't entirely a C language question - perhaps someone could
suggest a more appropriate group?


I believe most readers of this group are quite familiar
with off-topic posts.

And if someone doesn't reply "42", I will quit reading
or answering.

<<Off Topic
Ask the folks who wrote the compiler. Look at
the assembly blah blah....
Off Topic>>
 

Jeff

Walter Roberson said:
comp.unix.programming ? comp.sys.sgi.misc ?

Thanks, the latter group is probably the most appropriate.

And many thanks for your detailed response.
Yes, I'm using the SGI MipsPro compiler and have all optimisations turned up
fully.

I'm running a utility called Perfex which gives me a count of instructions
and cache misses. Cache misses seem very similar whether I use an
intermediate array when calculating a row of C or whether I write each
element directly into C. Instruction counts, however, seem to be halved for
smaller matrices by using the temporary row-array - even though the two
methods are semantically almost identical. It would seem that this must be
due to some sort of dynamic scheduling/compiling rather than assembly code
generation, since the compiler can't know how big things will be.
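
For anyone reproducing this, something along these lines should multiplex over all the R12000 event counters (option spellings are from memory rather than checked against perfex(1), so treat them as assumptions):

# assumed perfex usage; the program names are placeholders
perfex -a -y ./matmul_temp 64       # count all events, with estimated costs
perfex -a -y ./matmul_direct 64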

Anyway, maybe I should move over to comp.sys.sgi.misc ...

Many thanks once again

Jeff
 
