Cache

Thomas Kowalski

Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

Thanks in advance,
Thomas Kowalski
 
Victor Bazarov

Thomas said:
I would like to know which high level techniques can improve the
number of cache hits. I am not trying to rewrite some code in
assembler, just want to know which things might be considered to help
a good compiler "understanding" my code better. Some examples or
links would be very helpful.

The last time I came across recommendations of this sort, they were
mostly about placing objects in such a way that when you follow the
references (not C++ references, but generally pointers, indices, etc.)
from one object to another, the objects found (resolved) end up close by.
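A minimal sketch of that idea (my own illustration, not from Victor's
post): walking objects that sit next to each other in one contiguous
block tends to hit the cache far more often than chasing individually
allocated nodes scattered around the heap.

#include <list>
#include <vector>

struct Node { int value; };

// Scattered: each list node is allocated separately, so following the
// links can jump all over the heap and miss the cache frequently.
long sum_scattered(const std::list<Node>& nodes)
{
    long s = 0;
    for (std::list<Node>::const_iterator it = nodes.begin();
         it != nodes.end(); ++it)
        s += it->value;
    return s;
}

// Contiguous: neighbouring elements share cache lines, so the same
// traversal is mostly cache hits (and easy for hardware to prefetch).
long sum_contiguous(const std::vector<Node>& nodes)
{
    long s = 0;
    for (std::vector<Node>::const_iterator it = nodes.begin();
         it != nodes.end(); ++it)
        s += it->value;
    return s;
}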

Beware, those are always platform-specific, and so is the advice you'll
receive, I am fairly certain. Try a good book, like "Efficient C++".

V
 
Alan Woodland

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.
My reply is perhaps slightly OT here: Have you tried profiling it with
valgrind's cachegrind, or the frontend to it, kcachegrind? I can't give
many hard and fast rules (except for the case of linear algebra), but I
imagine kcachegrind could give you a somewhat better view of what's
going on.
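If you go that route, a typical run looks something like this (the
program name is just a placeholder; cachegrind writes its results to a
cachegrind.out.<pid> file, which the annotation tool or the GUI can read):

    valgrind --tool=cachegrind ./myprogram
    cg_annotate cachegrind.out.<pid>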

Alan
 
Jacek Dziedzic

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

I imagine this would be hard to do at the source level, but one piece
of advice springs to mind (similar to what Victor pointed out) -- it
is usually beneficial to access memory addresses that are close
together, in increasing order. For instance

int arr[1000][1000];
long int sum=0;

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[i][j];

would probably run significantly faster than

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[j][i];

because in the first case the memory occupied by arr is
traversed linearly. Therefore it is sometimes beneficial
to have a transposed copy of an array that you need to
access in the latter fashion.
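A rough sketch of that last point (the names are mine, just for
illustration). Note that building the transpose itself costs one
strided pass, so it only pays off if the column-wise access is repeated.

static int arr[1000][1000];
static int arr_t[1000][1000];     // transposed copy of arr

// Build the transpose once (this pass is strided, so it is the
// one-off price you pay)...
void build_transpose()
{
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < 1000; ++j)
            arr_t[j][i] = arr[i][j];
}

// ...then summing a "column" of arr becomes a linear scan over a
// row of arr_t, which walks memory in increasing order.
long int column_sum(int k)
{
    long int sum = 0;
    for (int j = 0; j < 1000; ++j)
        sum += arr_t[k][j];
    return sum;
}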

<OT>
Other than that, I remember that aligning objects to
some boundaries may speed things up. Try looking up
"alignment" or __declspec(align(n)) in your compiler's
manual. For example on the Itanium 2 it is beneficial
to align variables at 16-byte boundaries. The cache
also usually has the property that certain
alignments are ill-advised, because they interfere
with the cache's internal hashing algorithm. For
instance on the Itanium 2 accessing variables in
succession that all lie at addresses divisible by 2048
causes very poor performance.
</OT>
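To illustrate the alignment point above (compiler-specific, so treat
this as a sketch and check your own manual; these are the MSVC and GCC
spellings):

#ifdef _MSC_VER
__declspec(align(16)) float coeffs[4];                 // MSVC spelling
#else
float coeffs[4] __attribute__((aligned(16)));          // GCC spelling
#endif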

As a last note -- definitely use a profiler to
check where the bottleneck lies. It quite often
lies far from where you'd expect it to be by
inspecting the source code.

HTH,
- J.
 
Bart

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

As others have pointed out, it's obviously very platform-specific.
Ideally, the part of your data structure that you are accessing should
fit into cache. There are various tricks already mentioned by others
like alignment, but again, these are just too specific to the
particular processor you're using.

Regards,
Bart.
 
Bart

Jacek said:
I imagine this would be hard on the source level, but one
advice springs to mind (similar to what Victor had pointed out)
-- it is usually beneficial to access memory addresses that
are close together, in an increasing order. For instance

int arr[1000][1000];
long int sum=0;

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[i][j];

would probably run significantly faster than

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[j][i];

because in the first case the memory occupied by arr is
traversed linearly.


That's entirely dependent on the situation. If the whole structure fits
into cache then that wouldn't be the case. Okay, arguably not too many
computers have 1MB of cache, but the point is that you have to analyze
each case separately. General advice on this particular topic is rarely
useful and gets obsolete very quickly.

Regards,
Bart.
 
Bart

Bart said:
Jacek said:
I imagine this would be hard on the source level, but one
advice springs to mind (similar to what Victor had pointed out)
-- it is usually beneficial to access memory addresses that
are close together, in an increasing order. For instance

int arr[1000][1000];
long int sum=0;

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[i][j];

would probably run significantly faster than

for(int i=0; i<1000; ++i)
    for(int j=0; j<1000; ++j)
        sum+=arr[j][i];

because in the first case the memory occupied by arr is
traversed linearly.


That's entirely dependent on the situation. If the whole structure fits
into cache then that wouldn't be the case. Okay, arguably not too many
computers have 1MB of cache,


For some reason, I'm somewhat careless today. I meant 1 million ints
but wrote 1MB instead.

Sorry for that,

Regards,
Bart.
 
Gianni Mariani

Thomas said:
Hi,
I would like to know which high level techniques can improve the number
of cache hits. I am not trying to rewrite some code in assembler, just
want to know which things might be considered to help a good compiler
"understanding" my code better. Some examples or links would be very
helpful.

One of the most effective techniques is "blocking" or taking chunks of
your data and operating on the chunks.

e.g. if I were doing this:

for ( i = 0; i < 1000; ++i ) {
    for ( j = 0; j < 1000; ++j ) {
        magic....
    }
}

it can be turned into:

for ( i = 0; i < 1000; i += 10 ) {
    for ( j = 0; j < 1000; j += 10 ) {
        type temp_data[10];
        for ( ii = i; ii < i + 10; ++ii ) {
            for ( jj = j; jj < j + 10; ++jj ) {
                temp_data[ii-i] += magic....
            }
        }
    }
}

This is very simplistic, but the basic idea is to keep as much of the
working data in cache as possible. I have seen 100x speed increases
using this technique on large data sets.
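For a self-contained (if contrived) sketch of the same idea, here is a
blocked matrix transpose; the names, sizes, and the block size of 64
are my own choices, and the block size would normally be tuned so a
couple of tiles fit in your cache.

const int N = 1024;       // assumed to be divisible by BLOCK
const int BLOCK = 64;     // tune so two BLOCK x BLOCK tiles fit in cache

static double a[N][N], b[N][N];

void transpose_blocked()
{
    for (int i = 0; i < N; i += BLOCK)
        for (int j = 0; j < N; j += BLOCK)
            // Work on one BLOCK x BLOCK tile at a time; the rows of 'a'
            // and the columns of 'b' touched here stay resident in cache
            // for the duration of the tile.
            for (int ii = i; ii < i + BLOCK; ++ii)
                for (int jj = j; jj < j + BLOCK; ++jj)
                    b[jj][ii] = a[ii][jj];
}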
 
