unrolling nested for-loop

Discussion in 'C Programming' started by V, May 10, 2008.

  1. V

    V Guest

    Hello:

    Consider the following nested for loop:

    uint64 TABLE[8][256];

    for (i=0; i<=7; i++)
    for (j=1; j<=7; j++)
    for (k=1; k<=(1<<j)-1; k++)
    TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);

    I understand that unrolling just the inner most for-loop would give me
    best performance and this is required for my project. But I'm unable
    to figure out how to unroll just the innermost for-loop.

    Can someone please help!

    Thanks!
    V, May 10, 2008
    #1
    1. Advertising

  2. V

    Ian Collins Guest

    V wrote:
    > Hello:
    >
    > Consider the following nested for loop:
    >
    > uint64 TABLE[8][256];
    >

    Better to use standard types, uint64_t.

    All caps for variable names isn't a good idea, all caps is usually used
    for macros.

    > for (i=0; i<=7; i++)
    > for (j=1; j<=7; j++)
    > for (k=1; k<=(1<<j)-1; k++)
    > TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
    >
    > I understand that unrolling just the inner most for-loop would give me
    > best performance and this is required for my project. But I'm unable
    > to figure out how to unroll just the innermost for-loop.
    >

    Have you profiled the code?

    > Can someone please help!
    >

    Your compiler probably can. Loop unrolling is a common optimisation.
    Check the assembly output for your code with various optimisation settings.

    --
    Ian Collins.
    Ian Collins, May 10, 2008
    #2
    1. Advertising

  3. V

    V Guest

    I have tried to use compile time option to unroll this and checked its
    assembly code, but it doesn't seem to have unrolled the loop.

    Any suggestions?

    Thanks.
    V, May 10, 2008
    #3
  4. V

    Ian Collins Guest

    V wrote:
    > I have tried to use compile time option to unroll this and checked its
    > assembly code, but it doesn't seem to have unrolled the loop.
    >
    > Any suggestions?
    >

    A better compiler?

    The most important thing to do is to profile your application and see if
    there really is a bottleneck in this code. At the moment, manual
    unrolling of this loop looks like premature micro-optimisation.

    If there is an issue and your compiler isn't up the job, just repeat the
    inner assignment N times and change the inner loop to increment its
    counter by N. This assumes the limit is a multiple of N. Something
    like (untested)

    for (k=1; k<=(1<<j)-1; k += 2)
    {
    TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
    TABLE [((1<<j)+k+1)] = (TABLE[(1<<j)]) ^ (TABLE[k+1]);
    }

    --
    Ian Collins.
    Ian Collins, May 10, 2008
    #4
  5. On 10 May 2008 at 8:02, Ian Collins wrote:
    > This assumes the limit is a multiple of N. Something like (untested)
    >
    > for (k=1; k<=(1<<j)-1; k += 2)
    > {
    > TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
    > TABLE [((1<<j)+k+1)] = (TABLE[(1<<j)]) ^ (TABLE[k+1]);
    > }


    This fails when j=1.

    The problem is that the number of iterations is variable, and ranges
    between 1 and 127. Finding common divisors of (powers of 2 minus 1) as
    an unrolling factor is problematic... (Since the number of iterations is
    growing exponentially, even just unrolling the last loop is attractive,
    but 127 is prime, so there'll be some fiddling needed.

    Just unpacking what on earth the table looks like is an effort. The
    outer index i is irrelevant, as there's no interaction between these
    outer "layers". The inner 256-long arrays are arranged like this:

    j
    -1 | 0 |
    0 | 1 |
    1 | 2 | 3 |
    2 | 4 | 5 | 6 | 7 |
    3 | 8 | 9 |10 |11 |12 |13 |14 |15 |
    4 ...
    5 ...
    6 ...
    7 |128|129| ... |255|

    The left-hand column is an input column. To fill in a given row, look at
    the value in the left-hand column and xor it in turn with the elements
    earlier in the array, in the order 0,1,2,3,...

    Given the way these dependencies work, it is also possible to fill in
    columns in turn from the left, rather than working row-wise. This
    offers some potential for unrolling, but would still be complicated and
    the resulting code would be large.
    Antoninus Twink, May 10, 2008
    #5
  6. V

    Bart Guest

    On May 10, 7:44 am, V <> wrote:
    > I have tried to use compile time option to unroll this and checked its
    > assembly code, but it doesn't seem to have unrolled the loop.
    >
    > Any suggestions?
    >
    > Thanks.


    I'm not surprised. The inner loop is very complicated. Is there any
    way of simplifying it?

    Perhaps a rethink of what you're trying to achieve might help.

    What factor of speedup do you need?

    --
    Bartc
    Bart, May 10, 2008
    #6
  7. V

    Ian Collins Guest

    Bart wrote:
    > On May 10, 7:44 am, V <> wrote:
    >> I have tried to use compile time option to unroll this and checked its
    >> assembly code, but it doesn't seem to have unrolled the loop.
    >>
    >> Any suggestions?
    >>
    >> Thanks.

    >
    > I'm not surprised. The inner loop is very complicated. Is there any
    > way of simplifying it?
    >

    The first compiler I tried was happy to unroll the loop, it's not that
    hard (for a machine!).

    --
    Ian Collins.
    Ian Collins, May 10, 2008
    #7
  8. V

    Ian Collins Guest

    Antoninus Twink wrote:
    > On 10 May 2008 at 8:02, Ian Collins wrote:
    >> This assumes the limit is a multiple of N. Something like (untested)
    >>
    >> for (k=1; k<=(1<<j)-1; k += 2)
    >> {
    >> TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
    >> TABLE [((1<<j)+k+1)] = (TABLE[(1<<j)]) ^ (TABLE[k+1]);
    >> }

    >
    > This fails when j=1.
    >

    I know, that why I said "This assumes the limit is a multiple of N", N
    == 2 in this case.

    --
    Ian Collins.
    Ian Collins, May 10, 2008
    #8
  9. V

    Bart Guest

    On May 10, 12:35 pm, Ian Collins <> wrote:
    > Bart wrote:
    > > On May 10, 7:44 am, V <> wrote:
    > >> I have tried to use compile time option to unroll this and checked its
    > >> assembly code, but it doesn't seem to have unrolled the loop.

    >
    > >> Any suggestions?

    >
    > >> Thanks.

    >
    > > I'm not surprised. The inner loop is very complicated. Is there any
    > > way of simplifying it?

    >
    > The first compiler I tried was happy to unroll the loop, it's not that
    > hard (for a machine!).


    Looking at the code again, it just initialies a 2048-element table (of
    64-bit values).

    If that's all it ever does, then it only needs to be done once, and
    speed should not be an issue.

    I think more context is needed.

    --
    Bartc
    Bart, May 10, 2008
    #9
  10. V

    Willem Guest

    Bart wrote:
    ) Looking at the code again, it just initialies a 2048-element table (of
    ) 64-bit values).

    Not quite. If you look closer you will see that it depends on the first
    elements of each of the 8 arrays.

    However, unrolling the outermost loop might improve efficiency
    because it is a constant (8-fold) unrolling and could give rise to better
    pipelining.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
    Willem, May 10, 2008
    #10
  11. V

    Willem Guest

    V wrote:
    ) Hello:
    )
    ) Consider the following nested for loop:
    )
    ) uint64 TABLE[8][256];
    )
    ) for (i=0; i<=7; i++)
    ) for (j=1; j<=7; j++)
    ) for (k=1; k<=(1<<j)-1; k++)
    ) TABLE [((1<<j)+k)] = (TABLE[(1<<j)]) ^ (TABLE[k]);
    )
    ) I understand that unrolling just the inner most for-loop would give me
    ) best performance and this is required for my project. But I'm unable
    ) to figure out how to unroll just the innermost for-loop.

    In this case, it may be that unrolling the outermost loop would be best,
    because the inner to loops don't depend on the outer.

    (I took the liberty of adding another peephole optimization in how
    j is used, even though the compiler should recognize it by itself)

    for (j=2; j<=128; j<<=1) {
    for (k=1; k<=j-1; k++) {
    TABLE[0][j+k] = TABLE[0][j] ^ TABLE[0][k]
    TABLE[1][j+k] = TABLE[1][j] ^ TABLE[1][k]
    /* ... */
    TABLE[7][j+k] = TABLE[7][j] ^ TABLE[7][k]
    }
    }

    This has the added advantage of removing a level of indirection in
    the table accesses, because TABLE[0] through TABLE[7] now are constant.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
    Willem, May 10, 2008
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. John Edwards
    Replies:
    5
    Views:
    371
    Thomas Matthews
    Jul 7, 2003
  2. =?ISO-8859-1?Q?Per_Nordl=F6w?=

    g++ loop unrolling performance

    =?ISO-8859-1?Q?Per_Nordl=F6w?=, Aug 31, 2004, in forum: C++
    Replies:
    1
    Views:
    1,001
    Jack Klein
    Sep 1, 2004
  3. Richard Cavell

    Unrolling a loop

    Richard Cavell, Feb 23, 2005, in forum: C++
    Replies:
    3
    Views:
    2,224
    Phillip Jordan
    Feb 23, 2005
  4. mark

    ultra-fast loop unrolling with g++ -O3

    mark, Jun 12, 2008, in forum: C Programming
    Replies:
    2
    Views:
    740
    santosh
    Jun 12, 2008
  5. Isaac Won
    Replies:
    9
    Views:
    364
    Ulrich Eckhardt
    Mar 4, 2013
Loading...

Share This Page