Cache

Thomas Kowalski

Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

Thanks in advance,
Thomas Kowalski
 
Victor Bazarov

Thomas said:
I would like to know which high level techniques can improve the
number of cache hits. I am not trying to rewrite some code in
assembler, just want to know which things might be considered to help
a good compiler "understanding" my code better. Some examples or
links would be very helpful.

The last time I came across recommendations of this sort, they were
mostly about placing objects in such a way that when you follow the
references (not C++ references, but generally pointers, indices, etc.)
from one object to another, the objects found (resolved) end up close by.
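A minimal sketch of that idea (my own illustration, not from Victor's
post): walking objects that sit next to each other in one contiguous
block tends to hit the cache far more often than chasing individually
allocated nodes scattered around the heap.

#include <list>
#include <vector>

struct Node { int value; };

// Scattered: each list node is allocated separately, so following the
// links can jump all over the heap and miss the cache frequently.
long sum_scattered(const std::list<Node>& nodes)
{
    long s = 0;
    for (std::list<Node>::const_iterator it = nodes.begin();
         it != nodes.end(); ++it)
        s += it->value;
    return s;
}

// Contiguous: neighbouring elements share cache lines, so the same
// traversal is mostly cache hits (and easy for hardware to prefetch).
long sum_contiguous(const std::vector<Node>& nodes)
{
    long s = 0;
    for (std::vector<Node>::const_iterator it = nodes.begin();
         it != nodes.end(); ++it)
        s += it->value;
    return s;
}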

Beware, those are always platform-specific, and so is the advice you'll
receive, I am fairly certain. Try a good book, like "Efficient C++".

V
 
Alan Woodland

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.
My reply is perhaps slightly OT here: Have you tried profiling it with
valgrind's cachegrind, or the frontend to it, kcachegrind? I can't give
many hard and fast rules (except for the case of linear algebra), but I
imagine kcachegrind could give you a somewhat better view of what's
going on.
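If you go that route, a typical run looks something like this (the
program name is just a placeholder; cachegrind writes its results to a
cachegrind.out.<pid> file, which the annotation tool or the GUI can read):

    valgrind --tool=cachegrind ./myprogram
    cg_annotate cachegrind.out.<pid>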

Alan
 
Jacek Dziedzic

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

I imagine this would be hard to do at the source level, but one piece
of advice springs to mind (similar to what Victor pointed out) -- it
is usually beneficial to access memory addresses that are close
together, in increasing order. For instance

int arr[1000][1000];
long int sum=0;

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[i][j];

would probably run significantly faster than

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[j][i];

because in the first case the memory occupied by arr is
traversed linearly. Therefore it is sometimes beneficial
to have a transposed copy of an array that you need to
access in the latter fashion.
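A rough sketch of that last point (the names are mine, just for
illustration). Note that building the transpose itself costs one
strided pass, so it only pays off if the column-wise access is repeated.

static int arr[1000][1000];
static int arr_t[1000][1000];     // transposed copy of arr

// Build the transpose once (this pass is strided, so it is the
// one-off price you pay)...
void build_transpose()
{
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < 1000; ++j)
            arr_t[j][i] = arr[i][j];
}

// ...then summing a "column" of arr becomes a linear scan over a
// row of arr_t, which walks memory in increasing order.
long int column_sum(int k)
{
    long int sum = 0;
    for (int j = 0; j < 1000; ++j)
        sum += arr_t[k][j];
    return sum;
}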

<OT>
Other than that, I remember that aligning objects to
some boundaries may speed things up. Try looking up
"alignment" or __declspec(align(n)) in your compiler's
manual. For example on the Itanium 2 it is beneficial
to align variables at 16-byte boundaries. The cache
also usually has the property that certain
alignments are ill-advised, because they interfere
with the cache's internal hashing algorithm. For
instance on the Itanium 2 accessing variables in
succession that all lie at addresses divisible by 2048
causes very poor performance.
</OT>
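To illustrate the alignment point above (compiler-specific, so treat
this as a sketch and check your own manual; these are the MSVC and GCC
spellings):

#ifdef _MSC_VER
__declspec(align(16)) float coeffs[4];                 // MSVC spelling
#else
float coeffs[4] __attribute__((aligned(16)));          // GCC spelling
#endif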

As a last note -- definitely use a profiler to
check where the bottleneck lies. It quite often
lies far from where you'd expect it to be by
inspecting the source code.

HTH,
- J.
 
Bart

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

As others have pointed out, it's obviously very platform-specific.
Ideally, the part of your data structure that you are accessing should
fit into cache. There are various tricks already mentioned by others
like alignment, but again, these are just too specific to the
particular processor you're using.

Regards,
Bart.
 
Bart

Jacek said:
I imagine this would be hard on the source level, but one
advice springs to mind (similar to what Victor had pointed out)
-- it is usually beneficial to access memory addresses that
are close together, in an increasing order. For instance

int arr[1000][1000];
long int sum=0;

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[i][j];

would probably run significantly faster than

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[j][i];

because in the first case the memory occupied by arr is
traversed linearly.


That's entirely dependent on the situation. If the whole structure fits
into cache then that wouldn't be the case. Okay, arguably not too many
computers have 1MB of cache, but the point is that you have to analyze
each case separately. General advice on this particular topic is rarely
useful and gets obsolete very quickly.

Regards,
Bart.
 
Bart

Bart said:
Jacek said:
I imagine this would be hard on the source level, but one
advice springs to mind (similar to what Victor had pointed out)
-- it is usually beneficial to access memory addresses that
are close together, in an increasing order. For instance

int arr[1000][1000];
long int sum=0;

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[i][j];

would probably run significantly faster than

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[j][i];

because in the first case the memory occupied by arr is
traversed linearly.


That's entirely dependent on the situation. If the whole structure fits
into cache then that wouldn't be the case. Okay, arguably not too many
computers have 1MB of cache,


For some reason, I'm somewhat careless today. I meant 1 million ints
but wrote 1MB instead.

Sorry for that,

Regards,
Bart.
 
Gianni Mariani

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

One of the most effective techniques is "blocking" or taking chunks of
your data and operating on the chunks.

e.g. if I were doing this:

for ( i = 0; i < 1000; ++i ) {
    for ( j = 0; j < 1000; ++j ) {
        magic....
    }
}

it can be turned into:

for ( i = 0; i < 1000; i += 10 ) {
    for ( j = 0; j < 1000; j += 10 ) {
        type temp_data[10];
        for ( ii = i; ii < i + 10; ++ii ) {
            for ( jj = j; jj < j + 10; ++jj ) {
                temp_data[ii-i] += magic....
            }
        }
    }
}

This is very simplistic, but the basic idea is to keep as much of the
working data in cache as possible. I have seen 100x speed increases
using this technique on large data sets.
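For a self-contained (if contrived) sketch of the same idea, here is a
blocked matrix transpose; the names, sizes, and the block size of 64
are my own choices, and the block size would normally be tuned so a
couple of tiles fit in your cache.

const int N = 1024;       // assumed to be divisible by BLOCK
const int BLOCK = 64;     // tune so two BLOCK x BLOCK tiles fit in cache

static double a[N][N], b[N][N];

void transpose_blocked()
{
    for (int i = 0; i < N; i += BLOCK)
        for (int j = 0; j < N; j += BLOCK)
            // Work on one BLOCK x BLOCK tile at a time; the rows of 'a'
            // and the columns of 'b' touched here stay resident in cache
            // for the duration of the tile.
            for (int ii = i; ii < i + BLOCK; ++ii)
                for (int jj = j; jj < j + BLOCK; ++jj)
                    b[jj][ii] = a[ii][jj];
}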
 
