unrolling nested for-loop

V

V

Hello:

Consider the following nested for loop:

uint64 TABLE[8][256];

for (i=0; i<=7; i++)
for (j=1; j<=7; j++)
for (k=1; k<=(1<<j)-1; k++)
TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);

I understand that unrolling just the inner most for-loop would give me
best performance and this is required for my project. But I'm unable
to figure out how to unroll just the innermost for-loop.

Can someone please help!

Thanks!
 
I

Ian Collins

V said:
Hello:

Consider the following nested for loop:

uint64 TABLE[8][256];
Better to use standard types, uint64_t.

All caps for variable names isn't a good idea, all caps is usually used
for macros.
for (i=0; i<=7; i++)
for (j=1; j<=7; j++)
for (k=1; k<=(1<<j)-1; k++)
TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);

I understand that unrolling just the inner most for-loop would give me
best performance and this is required for my project. But I'm unable
to figure out how to unroll just the innermost for-loop.

Have you profiled the code?
Can someone please help!
Your compiler probably can. Loop unrolling is a common optimisation.
Check the assembly output for your code with various optimisation settings.
 
V

V

I have tried to use compile time option to unroll this and checked its
assembly code, but it doesn't seem to have unrolled the loop.

Any suggestions?

Thanks.
 
I

Ian Collins

V said:
I have tried to use compile time option to unroll this and checked its
assembly code, but it doesn't seem to have unrolled the loop.

Any suggestions?
A better compiler?

The most important thing to do is to profile your application and see if
there really is a bottleneck in this code. At the moment, manual
unrolling of this loop looks like premature micro-optimisation.

If there is an issue and your compiler isn't up the job, just repeat the
inner assignment N times and change the inner loop to increment its
counter by N. This assumes the limit is a multiple of N. Something
like (untested)

for (k=1; k<=(1<<j)-1; k += 2)
{
TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
TABLE [((1<<j)+k+1)] = (TABLE[(1<<j)]) ^ (TABLE[k+1]);
}
 
A

Antoninus Twink

This assumes the limit is a multiple of N. Something like (untested)

for (k=1; k<=(1<<j)-1; k += 2)
{
TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
TABLE [((1<<j)+k+1)] = (TABLE[(1<<j)]) ^ (TABLE[k+1]);
}


This fails when j=1.

The problem is that the number of iterations is variable, and ranges
between 1 and 127. Finding common divisors of (powers of 2 minus 1) as
an unrolling factor is problematic... (Since the number of iterations is
growing exponentially, even just unrolling the last loop is attractive,
but 127 is prime, so there'll be some fiddling needed.

Just unpacking what on earth the table looks like is an effort. The
outer index i is irrelevant, as there's no interaction between these
outer "layers". The inner 256-long arrays are arranged like this:

j
-1 | 0 |
0 | 1 |
1 | 2 | 3 |
2 | 4 | 5 | 6 | 7 |
3 | 8 | 9 |10 |11 |12 |13 |14 |15 |
4 ...
5 ...
6 ...
7 |128|129| ... |255|

The left-hand column is an input column. To fill in a given row, look at
the value in the left-hand column and xor it in turn with the elements
earlier in the array, in the order 0,1,2,3,...

Given the way these dependencies work, it is also possible to fill in
columns in turn from the left, rather than working row-wise. This
offers some potential for unrolling, but would still be complicated and
the resulting code would be large.
 
B

Bart

I have tried to use compile time option to unroll this and checked its
assembly code, but it doesn't seem to have unrolled the loop.

Any suggestions?

Thanks.

I'm not surprised. The inner loop is very complicated. Is there any
way of simplifying it?

Perhaps a rethink of what you're trying to achieve might help.

What factor of speedup do you need?
 
I

Ian Collins

Bart said:
I'm not surprised. The inner loop is very complicated. Is there any
way of simplifying it?
The first compiler I tried was happy to unroll the loop, it's not that
hard (for a machine!).
 
I

Ian Collins

Antoninus said:
This assumes the limit is a multiple of N. Something like (untested)

for (k=1; k<=(1<<j)-1; k += 2)
{
TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
TABLE [((1<<j)+k+1)] = (TABLE[(1<<j)]) ^ (TABLE[k+1]);
}


This fails when j=1.

I know, that why I said "This assumes the limit is a multiple of N", N
== 2 in this case.
 
B

Bart

The first compiler I tried was happy to unroll the loop, it's not that
hard (for a machine!).

Looking at the code again, it just initialies a 2048-element table (of
64-bit values).

If that's all it ever does, then it only needs to be done once, and
speed should not be an issue.

I think more context is needed.
 
W

Willem

Bart wrote:
) Looking at the code again, it just initialies a 2048-element table (of
) 64-bit values).

Not quite. If you look closer you will see that it depends on the first
elements of each of the 8 arrays.

However, unrolling the outermost loop might improve efficiency
because it is a constant (8-fold) unrolling and could give rise to better
pipelining.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
W

Willem

V wrote:
) Hello:
)
) Consider the following nested for loop:
)
) uint64 TABLE[8][256];
)
) for (i=0; i<=7; i++)
) for (j=1; j<=7; j++)
) for (k=1; k<=(1<<j)-1; k++)
) TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
)
) I understand that unrolling just the inner most for-loop would give me
) best performance and this is required for my project. But I'm unable
) to figure out how to unroll just the innermost for-loop.

In this case, it may be that unrolling the outermost loop would be best,
because the inner to loops don't depend on the outer.

(I took the liberty of adding another peephole optimization in how
j is used, even though the compiler should recognize it by itself)

for (j=2; j<=128; j<<=1) {
for (k=1; k<=j-1; k++) {
TABLE[0][j+k] = TABLE[0][j] ^ TABLE[0][k]
TABLE[1][j+k] = TABLE[1][j] ^ TABLE[1][k]
/* ... */
TABLE[7][j+k] = TABLE[7][j] ^ TABLE[7][k]
}
}

This has the added advantage of removing a level of indirection in
the table accesses, because TABLE[0] through TABLE[7] now are constant.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,904
Latest member
HealthyVisionsCBDPrice

Latest Threads

Top