Loop Un Rolling

root · Nov 18, 2009

Hi I have a loop in C/C++ that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Thanks.

/root

Ben Pfaff · Nov 18, 2009

root said:
Hi I have a loop in C/C++ that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Why don't you just unroll the loop yourself, by e.g. repeating
code or using a macro?

root · Nov 18, 2009

Ben

This would not be acceptable to me as it would make the code too messy to
be readable.

The loop may have thousands of iterations.

/root

Ben Pfaff · Nov 18, 2009

root said:
This would not be acceptable to me as it would make the code too messy to
be readable.

The loop may have thousands of iterations.

I really doubt that unrolling a loop thousands of times will have
performance benefits.

Daniel Pitts · Nov 18, 2009

root said:
Hi I have a loop in C/C++ that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Forcing unrolling (or any "optimization") is likely to cause inefficient
binaries. Smaller caches, and other situations, can negate any
performance gained from unrolled loops. It is better to let the
compiler decide whether to unroll or not.

Alf P. Steinbach · Nov 18, 2009

* root:

Hi I have a loop in C/C++

There is no such hybrid language (wrong).

So from the outset, you're saying that you don't know what you're talking about.

This is also indicated by cross-posting to [comp.lang.c] and [comp.lang.c++]
(wrong), and by not posting any of the relevant code (wrong).

that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler

Telepathy may be even better. Just think very very hard what you want compiler
to do. Sometimes appears to work!

However, in C++ you can use the Boost preprocessor macros to help repeat code.
Check out Duff's device for the code you'd want the macros to generate. Since
the preprocessor is largely the same in C and C++ there is a possibility but no
guarantee that you can also use those macros in C.

However, unrolling a loop at the source code level may have the opposite effect
of what you want.

by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Most inefficiency is of the algorithmic kind.

That's where you get most return on your optimization effort.

Cheers & hth.,

- Alf

Moi · Nov 19, 2009

Ben

This would not be acceptable to me as it would make the code too messy
to be readable.

The loop may have thousands of iterations.

Yes. if it had only one iteration, it would not be a loop.
For examples of macroization, you could look at GNU's libc
string function implementation.
You can easily conclude from that that it will take a lot
of time (probably 10-100 hours per LOC)

IMHO the rules of thumb for (loop-)optimization:
1) avoid "fat" loops. Don't do too much in one loop.
2) avoid (conditional) jumps
3) try to deal with register- or cache- line aligned chunks
of data as much as possible.

BTW: show us the loop.

HTH,
AvK

bartc · Nov 19, 2009

root said:
Ben

This would not be acceptable to me as it would make the code too messy to
be readable.

The loop may have thousands of iterations.

You don't need to unroll the loop completely. If the loop executes 1000
times, then unrolling one iteration means the loop only needs to execute 500
times:

for (i=0; i<1000; ++i) {
<yourcode>
}

for (i=0; i<500; ++i) {
<yourcode>
<yourcode>
}

Or are you saying the body of the loop is complex? In that case loop
overheads may not be significant.

White Wolf · Nov 19, 2009

Paavo said:
Duff's device?

Maybe. Most probably not. He did not say *anything* about what the
loop does. Duff's device is not loop unrolling, and Duff's devices is a
very specific idiom applicable in a very specific case: copying a lot of
bytes to hardware that can only efficiently copy words, while using a
compiler does doesn't do Duff's device on its own as an optimization
(gcc does, AFAIK).

To the OP:

1.) Follow Michael Jackson's rules of optimization

2.) Chances are high that the most stupid compiler optimizer is still
better at loop unrolling than you will ever be. Not because you are
stupid, but because those guys writing the compilers have 25 years of
head start on you.

3.) So to reword rule #1 and #2 from the Jackson rules: Don't sweat it.

BR, WW

Hamiral · Nov 19, 2009

root said:
But the code has to target a variety of plat forms.

It's nearly impossible to unroll a loop efficiently for a variety of
platforms, because they can have a lot of differences regarding cache etc.
The best optimization IMHO would be to trust your compiler and to have
an efficient algorithm.

Ham

root · Nov 19, 2009

Ben

Perhaps you are confused about what loop un rolling does.

The efficiency is because of avoided jumps. It is the size of the loop
body not the number of iter ations that determine if loop un rolling is
beneficial.

In this case the loop body is very short.

Also if the number of iterations is a compile time constant but defined
in a complicated way then your method will not work.

/root

Keith Thompson · Nov 19, 2009

White Wolf said:
2.) Chances are high that the most stupid compiler optimizer is still
better at loop unrolling than you will ever be. Not because you are
stupid, but because those guys writing the compilers have 25 years of
head start on you.

[...]

And because the compiler can do a tremendous amount of analysis
*every time* you compile your code, and can redo all the same
analysis from scratch when you change a single line and recompile.

Willem · Nov 19, 2009

root wrote:
) Perhaps you are confused about what loop un rolling does.

There is no reason to try to be patronizing.

) The efficiency is because of avoided jumps. It is the size of the loop
) body not the number of iter ations that determine if loop un rolling is
) beneficial.

The *in*efficiency is because of caching issues. Jumps are almost free on
any reasonably modern CPU, as opposed to having to read new instructions
from memory all the time.

What platform are you targeting where you believe loop unrolling will be
beneficial ?

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Ben Pfaff · Nov 19, 2009

root said:
The efficiency is because of avoided jumps. It is the size of the loop
body not the number of iter ations that determine if loop un rolling is
beneficial.

In this case the loop body is very short.

Unrolling the loop thousands of times, regardless of its length,
is unlikely to be beneficial. Even if it is only a single
instruction, thousands of copies of that instruction will
probably not fit in your CPU cache or, if they do, then they will
evict other instructions from it, which will cause a high cost
for other code and possibly cause it to be a net loss.

Also if the number of iterations is a compile time constant but defined
in a complicated way then your method will not work.

Then you need a switch statement. See Duff's Device (in an
obsolete form of C, from Wikipedia):

send(to, from, count)
register short *to, *from;
register count;
{
register n=(count+7)/8;
switch(count%8){
case 0: do{ *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
}while(--n>0);
}
}

White Wolf · Nov 19, 2009

Keith said:
White Wolf said:

2.) Chances are high that the most stupid compiler optimizer is still
better at loop unrolling than you will ever be. Not because you are
stupid, but because those guys writing the compilers have 25 years of
head start on you.

Click to expand...

[...]

And because the compiler can do a tremendous amount of analysis
*every time* you compile your code, and can redo all the same
analysis from scratch when you change a single line and recompile.

Yep.

The good old assembler versus C debate.

BR, WW

Walter Banks · Nov 19, 2009

Paavo said:
Duff's device?

There might be a general case for Duff's device that
could eliminate a lot of conditional checking.

The power of Duff's device is to do a a single
check in such a way that periodic checks are needed
less frequently.

The core of Duff's algorithm might even be able
to be implemented in macro's with some reasonable
assumptions and produce significant execution
improvements.

Regards,

Walter Banks

root · Nov 19, 2009

Richard

That's good advice, thanks.

I always try to give the compiler hints for optimization ie by using
register variables. It can make all the difference in a tight inner loop.

/root

--
Learn more or lose your rights!
http://www.guncontrolkills.com/
http://www.gunbanobama.com/
Learn more or lose your rights!

Keith Thompson · Nov 19, 2009

root said:
I always try to give the compiler hints for optimization ie by using
register variables. It can make all the difference in a tight inner loop.

[...]

Please don't top-post. See:
http://www.caliburn.nl/topposting.html
http://www.cpax.org.uk/prg/writings/topposting.php

The common wisdom these days is that the compiler can usually do
a better job of register allocation than you can, and that the
"register" keyword isn't likely to help much; in fact, it may
hinder optimization.

If you're finding that the "register" keyword actually improves
performance measurably, perhaps the common wisdom is sometimes wrong.

(You're telling your compiler to use its maximum optimization level,
right?)

Noob · Nov 20, 2009

Richard said:
Remember that the only thing you save by unrolling the loop is the
time spent on the loop logic.

This statement is not universally true.

Interleaving several iterations may "hide" some latency.

Consider e.g a naive memcpy:

for (i=0; i < N; ++i)
{
temp = src;
dst = temp; /* STALL WAITING FOR src */
}

Unroll the loop 4 times.

for (i=0; i < N; i += 4)
{
temp0 = src[i+0];
temp1 = src[i+1];
temp2 = src[i+2];
temp3 = src[i+3];
dst[i+0] = temp0;
dst[i+1] = temp1;
dst[i+2] = temp2;
dst[i+3] = temp3;
}

Depending on the size of the Icache, the latency of memory loads, etc,
the unrolled version may perform much better.

At this point, comp.arch would be more appropriate.

To the original poster, Wikipedia is a good reference.
http://en.wikipedia.org/wiki/Loop_unwinding
http://en.wikipedia.org/wiki/Software_pipelining

While loop unclear, can someone help?	4	Dec 6, 2023
Help with Loop	0	Mar 30, 2023
Rolling files in C++	3	Mar 20, 2007
Partially completed coding of loop script, need help finishing.	0	Oct 7, 2022
Rolling String Buffer needed	5	Oct 2, 2009
Regexp.escape with un-escapes	4	Dec 6, 2009
Neel Help	3	May 20, 2014
How do I use Find and Loop in VBA for Excel to identify, delete, and insert blank row for values greater than 6?	0	Feb 28, 2022

Loop Un Rolling

root

Ben Pfaff

root

Ben Pfaff

Daniel Pitts

Alf P. Steinbach

Moi

bartc

White Wolf

Hamiral

root

Keith Thompson

Willem

Ben Pfaff

White Wolf

Walter Banks

root

Keith Thompson

Noob

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads