loop optimization

Jayden Shui · Feb 2, 2012

Dear All,

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

#include <time.h>
#include <iostream>
using namespace std;

int main()
{
float a[100][100][8];

int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2;

clock_t t0 = clock();

// Loop to count CPU time.
for (int m = 0; m < 100000; ++m) {

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
for (int i = ib; i <= ie; ++i)
{
for (int j = jb; j <= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)
{
a[j][k0] = a[j][k1];
}
}
}

}

clock_t t1 = clock();
cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC << "s
\n";

return 0;
}

Jayden Shui · Feb 2, 2012

Dear All,

Click to expand...

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow.

Click to expand...

Compared to what? It appears that you had to repeat this 100000 times
just in order to get some meaningful timings - does not look slow to me.

Other than that, try to increase your compiler optimization level. The
inner loop is 2 iterations and should be rolled out, some compilers may
require special flags for doing that (-funroll-loops for gcc, for
example).

hth
Paavo

Code is attached. Any ways to
optimize it? Thanks a lot.

Click to expand...

#include <time.h>
#include <iostream>
using namespace std;

Click to expand...

int main()
{
float a[100][100][8];

Click to expand...

int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2;

Click to expand...

clock_t t0 = clock();

Click to expand...

// Loop to count CPU time.
for (int m = 0; m < 100000; ++m) {

Click to expand...

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
for (int i = ib; i <= ie; ++i)
{
for (int j = jb; j <= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)
{
a[j][k0] = a[j][k1];
}
}
}

Click to expand...

}

Click to expand...

clock_t t1 = clock();
cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC << "s
\n";

Click to expand...

return 0;
}

Click to expand...

This is testing code for CPU time of the operations. My code has lots
of such similar operations. I hope using some loop optimization
techniques in http://en.wikipedia.org/wiki/Loop_optimization to speed
it up. I think the testing code should be run in less than 0.5 s in i7
after loop optimization (-O2 or O3 is really not enough) according to
my experience.

Victor Bazarov · Feb 2, 2012

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

#include<time.h>
#include<iostream>
using namespace std;

int main()
{
float a[100][100][8];

int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2;

clock_t t0 = clock();

// Loop to count CPU time.
for (int m = 0; m< 100000; ++m) {

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
for (int i = ib; i<= ie; ++i)
{
for (int j = jb; j<= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0<= ke0; ++k0, --k1)
{
a[j][k0] = a[j][k1];
}
}
}

}

clock_t t1 = clock();
cout<< "CPU Time = "<< double(t1 - t0) / CLOCKS_PER_SEC<< "s
\n";

return 0;
}

Some compilers I've seen (don't recall which, though) could generate
better code if you helped them reduce the indexing by extracting the
"common subexpression" outside of the inner loop. Something like

for (int i = ...
{
float (&ai)[100][8] = a;
for (int j = ...
{
float (&aij)[8] = ai[j];
for (int k0 ...
{
aij[k0] = aij[k1];
}
}
}

Better compilers do that on their own. Try it and see if you get any
improvement. If not, you could (a) parallelize the outer loop (since
there is no overlap) but keep in mind it's only worth if the overhead of
starting worker threads is much smaller than the actual work each has to
perform, or (b) change your algorithm and the data structure.

Barring those choices, you're pretty much stuck with what you have.

V

copx · Feb 2, 2012

Dear All,

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

[snip code]

I compiled this and ran it on my desktop computer which is not very
fast:

Unoptimized build ("g++ loop.cpp"): 15.762s
Optimized build ("g++ -O2 loop.cpp"): 0s

Yes, that's a zero. It runs so fast you cannot even measure it using
your method. Did you forget to enable optimization?

Jayden Shui · Feb 2, 2012

On 02.02.2012 15:55, Jayden Shui wrote:> Dear All,

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

Click to expand...

[snip code]

I compiled this and ran it on my desktop computer which is not very
fast:

Unoptimized build ("g++ loop.cpp"): 15.762s
Optimized build ("g++ -O2 loop.cpp"): 0s

Yes, that's a zero. It runs so fast you cannot even measure it using
your method. Did you forget to enable optimization?

This is due to there is no output and the code in fact is not run in -
O2. Please add

int i, j, k;
cin >> i >> j >> k;
cin << a[j][k];

before return 0;

Victor Bazarov · Feb 2, 2012

Am 02.02.12 15:55, schrieb Jayden Shui:

Dear All,

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]

Click to expand...

I don't really understand what kind of copy this performs. But if you
can reformulate this operation to copy (large) contiguous blocks of
memory, you could use memcpy, which is probably optimized better. As
others have suggested, this problem is probably memory bandwidth
limited. This means you can gain something by vectorizing as large
blocks as possible at once.

The problem is that the order of elements in the destination block is
inverted (see the "-1"?) That prevents memcpy (or memmove) from being used.

V

Juha Nieminen · Feb 2, 2012

Christian Gollwitzer said:
I don't really understand what kind of copy this performs. But if you
can reformulate this operation to copy (large) contiguous blocks of
memory, you could use memcpy, which is probably optimized better. As
others have suggested, this problem is probably memory bandwidth
limited. This means you can gain something by vectorizing as large
blocks as possible at once.

When dealing with multidimensional arrays, if you need to handle
(read, copy, modify...) large portions of it at adjacent indices,
it's always much more efficient to perform the operations on elements
that are in contiguous memory locations rather than jumping around.
This makes it important which index of a multidimensional array is
modified most often.

The reason for this is, obviously, caching. When you access an element
in the array, the CPU will load adjacent memory data to the cache. If
you then access an adjacent element of the array, it will most probably
be already loaded into the cache (closest to the CPU), which will make
it really fast. If, however, you jumped around the RAM by incrementing
the wrong index, you will get lots of cache misses.

For that reason eg. this:

for(unsigned i1 = 0; i1 < size1; ++i1)
for(unsigned i2 = 0; i2 < size2; ++i2)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

will usually be much faster than:

for(unsigned i2 = 0; i2 < size2; ++i2)
for(unsigned i1 = 0; i1 < size1; ++i1)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

memcpy() is often very fast not only because it has been optimized to
death during the last 30 years, but also because it accesses consecutive
memory locations, which is cache-friendly.

Asger Joergensen · Feb 2, 2012

Hi Jayden

Jayden said:
int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2; .....
for (int j = jb; j <= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)

putting in the numbers:
for (int k0 = 0, k1 = 3; k0 <= ke0; ++k0, --k1)

so You are essentially only swapping the first four floats, right ?

well to my solution:

no need for all these [][][] they are just to make things slower
and in some cases simpler to understand.

// first set some sizes
const int lineSize = 8;
const int squareSize = 800;
const int cubeSize = 80000;

// get one chuck of memory of the right size
float cube[ cubeSize ];

float *square = cube;
float* squaresEnd = cube + cubeSize;

while( square < squaresEnd )
{
float* line = square;
float* linesEnd = square + squareSize;

while( line < linesEnd )
{
float* lineBegin = line;
float* lineEnd = line + 3; // lineSize ****

for(; lineBegin < lineEnd; ++lineBegin, --lineEnd )//**
*lineBegin = *lineEnd;

line += lineSize;
}

square += squareSize;
}

this is close to double speed on my machine, but I only have an old
Borland compiler, so it might not be the same for You.

****
I don't know if the 3 was what You intended, if not replace with
lineSize.
**
I removed the = from <= if they are pointing to the same float there
is really no need to swap. ;-)

Best regards
Asger-P

"Forward/backward loop" - problem	3	Feb 19, 2025
Fibonacci	0	May 13, 2023
Help with Loop	0	Mar 30, 2023
Code help please	4	May 19, 2023
SENTINEL CONTROL LOOP WHEN DEALING WITH TWO ARRAYS	1	Oct 26, 2023
RSA implementation issues in public key pem loader function	0	May 21, 2025
C program: memory leak/ segmentation fault/ memory limit exceeded	0	Nov 12, 2022
Universal BMP Steganography Tool (AES-128-CTR + SP800-90A CSPRNG) Full Encoder/Decoder with 3LSB Payload, PasswordDerived Key & External Key File	4	Mar 26, 2026

loop optimization

Jayden Shui

Jayden Shui

Victor Bazarov

copx

Jayden Shui

Victor Bazarov

Juha Nieminen

Asger Joergensen

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads