time factor for loops and arrays

Patrick

Hello,

I noticed a time difference in the following problem:
Example: an array with size 32*100000

Now I iterate through the array with two for-loops:

for (i = 0; i < 32; i++) {
    for (j = 0; j < 100000; j++) {
        // store a value in the array
    }
}

is 3 times slower than:

for (i = 0; i < 100000; i++) {
    for (j = 0; j < 32; j++) {
        // store a value in the array
    }
}

But why? The statement inside the loops is executed 32*100000 times in both cases.
I'm working with JBuilder 10.
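For reference, here's a complete version of what I'm timing. I'm assuming the array is declared double[100000][32]; the stored values are arbitrary:

```java
public class LoopOrder {
    static final int ROWS = 100000, COLS = 32;

    public static void main(String[] args) {
        double[][] a = new double[ROWS][COLS];

        long t0 = System.nanoTime();
        for (int i = 0; i < COLS; i++) {
            for (int j = 0; j < ROWS; j++) {
                a[j][i] = i + j;  // column-wise: jumps between rows
            }
        }
        long t1 = System.nanoTime();

        for (int i = 0; i < ROWS; i++) {
            for (int j = 0; j < COLS; j++) {
                a[i][j] = i + j;  // row-wise: walks each row sequentially
            }
        }
        long t2 = System.nanoTime();

        System.out.println("column-wise: " + ((t1 - t0) / 1000000) + " ms");
        System.out.println("row-wise:    " + ((t2 - t1) / 1000000) + " ms");
    }
}
```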

Greetings and TIA
 
John B. Matthews

Patrick said:
Hello,

I noticed that there is time difference for the following problem:
Example: Array with size 32*100000

Now I iterate through the array with two for-loops:

for(i=0; i<32;i++){
for(j=0; j<100000; j++){
//store a value in the array
}
}

is 3 times slower than:

for(i=0; i<100000;i++){
for(j=0; j<32; j++){
//store a value in the array
}
}

but why?
the command within the loops will be executed 32*100000 in both cases.
I'm working with JBuilder10

Greetings and TIA

In the latter, it's feasible to unroll the inner loop: only 32
repetitions versus 100,000. In addition, the JVM may be able to
optimize bus bandwidth: initializing 32 bytes may require eight
32-bit store instructions on one processor and four 64-bit
instructions on another. Numerous other optimizations are possible
depending on the architecture.
 
Tom McGlynn

Patrick said:
Hello,

I noticed that there is time difference for the following problem:
Example: Array with size 32*100000

Now I iterate through the array with two for-loops:

for(i=0; i<32;i++){
for(j=0; j<100000; j++){
//store a value in the array
}
}

is 3 times slower than:

for(i=0; i<100000;i++){
for(j=0; j<32; j++){
//store a value in the array
}
}

but why?
the command within the loops will be executed 32*100000 in both cases.
I'm working with JBuilder10

Greetings and TIA


It may well depend upon what the command you've represented as a comment
does. E.g.,

double[][] array = new double[100000][32];

for (int i = 0; i < 100000; i++) {
    for (int j = 0; j < 32; j++) {
        array[i][j] = i + j;
    }
}

will likely have much better memory behavior than the same loops in the
other direction. As written, the inner loop deals with each 32-element
row once and for all, while reversing the loop order makes it stride
through memory 32 times and likely requires many more transfers between
cache and main memory.

If the limits were something like 4096x4096, you might get into a
situation where the inefficient route starts thrashing memory and paging
to disk. Then the differences can be factors of thousands, not just two
or three.

All of this depends upon the underlying architecture of the
system, but generally, if you have multidimensional arrays, you want
the innermost loop to run over the last index.
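One wrinkle worth noting: in Java, a double[100000][32] is really 100000 separate row arrays, so the same advice applies if you flatten it to a single array by hand (a common trick; the layout below is my own illustration):

```java
// Flattened layout: element (i, j) lives at offset i * 32 + j,
// exactly the row-major order a 2D array conceptually has.
double[] flat = new double[100000 * 32];

for (int i = 0; i < 100000; i++) {
    for (int j = 0; j < 32; j++) {
        flat[i * 32 + j] = i + j;  // sequential offsets: 0, 1, 2, ...
    }
}
```

Swapping the loops here would make consecutive writes land 32*8 bytes apart, defeating the cache the same way the 2D version does.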

Regards,
Tom McGlynn
 
Yamin

Tom McGlynn said:

It may well depend upon what the command you've represented as a comment
does. E.g.,

double[][] array = new double[100000][32];

for (int i = 0; i < 100000; i++) {
    for (int j = 0; j < 32; j++) {
        array[i][j] = i + j;
    }
}

will likely have much better memory behavior than the same loops in the
other direction. As written, the inner loop deals with each 32-element
row once and for all, while reversing the loop order makes it stride
through memory 32 times and likely requires many more transfers between
cache and main memory.


Yeah, that's what I was thinking. I'll detail it a bit for curiosity's
sake. It might be cache thrashing. Memory is linear, so suppose
the compiler lays out the array of doubles as follows:
a[x][y]
a[0][0] a[0][1] a[0][2] ... a[0][31] a[1][0] a[1][1] ... a[99999][0] ... a[99999][31]
As we see in this layout, all the y values for a particular x are
laid out sequentially.

Let's take an imaginary processor cache line size of 128 bytes. I'm
being generous for this example and assuming sizeof(double) == 4, so
each cache line can hold exactly 32 doubles :) And let's assume there
is only one line in the cache.

Doing it this way, when the processor loads the value a[0][0], the
values a[0][0] all the way to a[0][31] are cached. So as you go
through the inner loop, you are very nicely accessing all these
values... which are cached. When you move to the next 'x' index, you
throw out these cached results and load in a[1][0] to a[1][31]. It's
very nice for locality. Therefore, you only change the cache line x
times... which in your case is 100000.

Now, for the sake of argument, let's say the array is laid out exactly
the same as above, but you go through the loops in the other direction:
1. you access a[0][0]
2. a[0][0] to a[0][31] are all loaded in
3. you access a[1][0]
4. a[1][0] is not in the cache, so it throws that out and loads in
a[1][0] to a[1][31]
5. you access a[2][0]
6. a[2][0] is not in the cache, so it throws that out and loads in
a[2][0] to a[2][31]

Notice how at every access you're throwing out what's in the cache
and loading a new line back in. How wasteful :) Of course, a real
processor has a much larger cache, and the cache line size is not going
to line up exactly on even bounds with your array... but that's the
gist of the idea. Here, you are swapping the cache line 100000*32
times. That's very bad :)
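Under this toy model (a 128-byte line holding one full 32-element row, a single-line cache), the line-load counts for the two orders work out like this (the arithmetic is mine, following the walkthrough above):

```java
// Toy model: each cache line holds exactly one 32-element row,
// and the cache holds only one line at a time.
int rows = 100000, cols = 32;

// Row-wise order: one line load per row, reused for all 32 accesses.
int rowWiseLoads = rows;

// Column-wise order: every access lands on a different row,
// so every single access loads a new line.
int colWiseLoads = rows * cols;

System.out.println("row-wise loads:    " + rowWiseLoads);
System.out.println("column-wise loads: " + colWiseLoads);
System.out.println("ratio: " + (colWiseLoads / rowWiseLoads));
```

A real cache holds many lines, so the observed gap is nowhere near this 32x worst case, but the direction of the effect is the same.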

Yamin
 
