# Faster for() loops?

Discussion in 'C Programming' started by Neo, Sep 26, 2005.

1. ### NeoGuest

Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
states:for( i=0; i<10; i++){ ... }i loops through the values
0,1,2,3,4,5,6,7,8,9 If you don't care about the order of the loop counter,
you can do this instead: for( i=10; i--; ) { ... }Using this code, i loops
through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster. This
works because it is quicker to process "i--" as the test condition, which
says "is i non-zero? If so, decrement it and continue.". For the original
code, the processor has to calculate "subtract i from 10. Is the result
non-zero? if so, increment i and continue.". In tight loops, this make a
considerable difference.
How far it holds true.. in the light of modern optimizing compilers? and
will it make a significant difference in case of embedded systems???

Thanks,
-Neo
"Do U really think, what U think real is really real?"

Neo, Sep 26, 2005

2. ### Al BorowskiGuest

Hi,

Neo wrote:
[...] In tight loops, this make a
> considerable difference.
> How far it holds true.. in the light of modern optimizing compilers? and
> will it make a significant difference in case of embedded systems???

There is nothing like an experiment to test a theory. I just tried with
AVRGCC

void countDown(void){
int i;
for(i=10; i!=0; i--) doSomething();
}
void countUp(void){
int i;
for(i=0;i<10;i++) doSomething();
}

The generated code is

000000ce <countDown>:
}

void countDown(void){
ce: cf 93 push r28
d0: df 93 push r29
int i;
for(i=10; i!=0; i--) doSomething();
d2: ca e0 ldi r28, 0x0A ; 10
d4: d0 e0 ldi r29, 0x00 ; 0
d6: 0e 94 5d 00 call 0xba
da: 21 97 sbiw r28, 0x01 ; 1
dc: e1 f7 brne .-8 ; 0xd6
de: df 91 pop r29
e0: cf 91 pop r28
e2: 08 95 ret

000000e4 <countUp>:
}
void countUp(void){
e4: cf 93 push r28
e6: df 93 push r29
e8: c9 e0 ldi r28, 0x09 ; 9
ea: d0 e0 ldi r29, 0x00 ; 0
int i;
for(i=0;i<10;i++) doSomething();
ec: 0e 94 5d 00 call 0xba
f0: 21 97 sbiw r28, 0x01 ; 1
f2: d7 ff sbrs r29, 7
f4: fb cf rjmp .-10 ; 0xec
f6: df 91 pop r29
f8: cf 91 pop r28
fa: 08 95 ret

Counting down instead of up saves one whole instruction. It could make a
difference I suppose.

However, the compiler cannot optimise as well if anything in the loop
depends on the value of 'i'.

void countDown(void){
int i;
for(i=10; i!=0; i--) doSomething(i);
}
void countUp(void){
int i;
for(i=0;i<10;i++) doSomething(i);
}

Becomes

void countDown(void){
ce: cf 93 push r28
d0: df 93 push r29
int i;
for(i=10; i!=0; i--) doSomething(i);
d2: ca e0 ldi r28, 0x0A ; 10
d4: d0 e0 ldi r29, 0x00 ; 0
d6: ce 01 movw r24, r28
d8: 0e 94 5d 00 call 0xba
dc: 21 97 sbiw r28, 0x01 ; 1
de: d9 f7 brne .-10 ; 0xd6
e0: df 91 pop r29
e2: cf 91 pop r28
e4: 08 95 ret

000000e6 <countUp>:
}
void countUp(void){
e6: cf 93 push r28
e8: df 93 push r29
int i;
for(i=0;i<10;i++) doSomething(i);
ea: c0 e0 ldi r28, 0x00 ; 0
ec: d0 e0 ldi r29, 0x00 ; 0
ee: ce 01 movw r24, r28
f0: 0e 94 5d 00 call 0xba
f4: 21 96 adiw r28, 0x01 ; 1
f6: ca 30 cpi r28, 0x0A ; 10
f8: d1 05 cpc r29, r1
fa: cc f3 brlt .-14 ; 0xee
fc: df 91 pop r29
fe: cf 91 pop r28
100: 08 95 ret

This time there are a whole 2 extra instructions. I don't think this is
such a big deal. Unrolling the loop would give a better result.

cheers,

Al

Al Borowski, Sep 26, 2005

3. ### Ian BellGuest

Neo wrote:

> Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
> states:for( i=0; i<10; i++){ ... }i loops through the values
> 0,1,2,3,4,5,6,7,8,9 If you don't care about the order of the loop counter,
> you can do this instead: for( i=10; i--; ) { ... }Using this code, i loops
> through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster.
> This works because it is quicker to process "i--" as the test condition,
> which says "is i non-zero? If so, decrement it and continue.". For the
> original code, the processor has to calculate "subtract i from 10. Is the
> result non-zero? if so, increment i and continue.". In tight loops, this
> make a considerable difference.
> How far it holds true.. in the light of modern optimizing compilers? and
> will it make a significant difference in case of embedded systems???
>

Many micros have a decrement jmp if zero (or non zero) machine instruction
so a decent optimising compiler should know this and use it in count down
to zero loops. Counting up often needs a compare followed by a jmp zero (or
non zero) which will be a tad slower.

Ian

Ian Bell, Sep 26, 2005
4. ### Peter BushellGuest

"Neo" <> wrote in message
news:...
> Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
> states:for( i=0; i<10; i++){ ... }i loops through the values
> 0,1,2,3,4,5,6,7,8,9 If you don't care about the order of the loop counter,
> you can do this instead: for( i=10; i--; ) { ... }Using this code, i loops
> through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster.
> This works because it is quicker to process "i--" as the test condition,
> which says "is i non-zero? If so, decrement it and continue.". For the
> original code, the processor has to calculate "subtract i from 10. Is the
> result non-zero? if so, increment i and continue.". In tight loops, this
> make a considerable difference.
> How far it holds true.. in the light of modern optimizing compilers? and
> will it make a significant difference in case of embedded systems???
>
> Thanks,
> -Neo
> "Do U really think, what U think real is really real?"
>

A major advantage of writing in C is that you can, if you choose, write
understandable, maintainable code. This kind of hand-optimisation has the
opposite effect. If you really need to care about exactly how many
instruction cycle a loop takes, code it in assembly language. Otherwise, for
leave the compiler to do the optimisation. These days, most compilers can
optimise almost as well as you can, for most "normal" operations.

Regards,
--
Peter Bushell
http://www.software-integrity.com/

Peter Bushell, Sep 26, 2005
5. ### SkarmanderGuest

Neo wrote:
> Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
> states:for( i=0; i<10; i++){ ... }i loops through the values
> 0,1,2,3,4,5,6,7,8,9 If you don't care about the order of the loop counter,
> you can do this instead: for( i=10; i--; ) { ... }Using this code, i loops
> through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster. This
> works because it is quicker to process "i--" as the test condition, which
> says "is i non-zero? If so, decrement it and continue.". For the original
> code, the processor has to calculate "subtract i from 10. Is the result
> non-zero? if so, increment i and continue.". In tight loops, this make a
> considerable difference.
> How far it holds true.. in the light of modern optimizing compilers? and
> will it make a significant difference in case of embedded systems???
>

Regardless of the performance issue, I'd like to point out that after
for( i=10; i--; ) finishes, i will have the value -1, since the
decrement is performed even if i is zero. This is counterintuitive, so
it's worth noting. It also means the following is not equivalent:

for (i = 10; i != 0; --i)

Since here one less decrement is performed. Incidentally, my
compiler/platform generates better code with this version -- it compares
i to -1 in the other, which is no better than comparing it to 10! If you
want to count down, I suggest writing what you mean and separating the
test and decrement parts -- it has the added bonus of making things more
readable. The rest is best left to the compiler.

S.

Skarmander, Sep 26, 2005
6. ### Scott MooreGuest

Neo wrote On 09/25/05 23:41,:
> Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
> states:for( i=0; i<10; i++){ ... }i loops through the values
> 0,1,2,3,4,5,6,7,8,9 If you don't care about the order of the loop counter,
> you can do this instead: for( i=10; i--; ) { ... }Using this code, i loops
> through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster. This
> works because it is quicker to process "i--" as the test condition, which
> says "is i non-zero? If so, decrement it and continue.". For the original
> code, the processor has to calculate "subtract i from 10. Is the result
> non-zero? if so, increment i and continue.". In tight loops, this make a
> considerable difference.
> How far it holds true.. in the light of modern optimizing compilers? and
> will it make a significant difference in case of embedded systems???
>
> Thanks,
> -Neo
> "Do U really think, what U think real is really real?"
>
>

Unroll it completely.

Scott Moore, Sep 26, 2005
7. ### Roberto WaltmanGuest

"Neo" wrote:

>Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
>states:for( i=0; i<10; i++){ ... }i loops through the values
>0,1,2,3,4,5,6,7,8,9 If you don't care about the order of the loop counter,
>you can do this instead: for( i=10; i--; ) { ... }Using this code, i loops
>through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster.
>....
>How far it holds true.. in the light of modern optimizing compilers? and
>will it make a significant difference in case of embedded systems???

It may or not save a couple of assembly language instructions, (of
course depending on the compiler and processor used,) but I doubt this
"noptimization" will make any noticeable change in the performance of
a program, unless your code consist mainly of empty for() loops.

What impact can a minuscule reduction in the time required to decide
if the loop has ended or not have, if the body of the loop, for
example, call functions that format a CAN message, deliver it, wait
for a response, retry if there were errors or timeouts, decode the
response, store the values in a serial EEPROM, and based on them start
a few motors, open pneumatic valves, optionally sending an email
message to Katmandu.

That is not an optimization, but a total waste of time. Read the first
example in "Elements of programming style" and learn...

Roberto Waltman

[ return address is invalid. ]

Roberto Waltman, Sep 26, 2005
8. ### Joe ButlerGuest

>
> That is not an optimization, but a total waste of time. Read the first
> example in "Elements of programming style" and learn...

What if the difference is between fitting into memory and not?

Joe Butler, Sep 26, 2005
9. ### Mark McIntyreGuest

On Mon, 26 Sep 2005 12:11:23 +0530, in comp.lang.c , "Neo"
<> wrote:

>Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
>states

(that reversing loop order is faster)

The page is talking rot. It *may* be faster. It *may* be slower. The
only way to know is to benchmark your particular implementation in the
specific case you're examining.

>How far it holds true.. in the light of modern optimizing compilers? and
>will it make a significant difference in case of embedded systems???

Benchmark.
--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>

----== Posted via Newsfeeds.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeeds.com The #1 Newsgroup Service in the World! 120,000+ Newsgroups
----= East and West-Coast Server Farms - Total Privacy via Encryption =----

Mark McIntyre, Sep 26, 2005
10. ### Kevin D. QuittGuest

Depends what you're doing. If you're accessing a large chunk of memory on a system with
cache, you want to go through incrementing addresses to maximize the use of cache.
Decrementing through memory is generally pessimal.

--
#include <standard.disclaimer>
_
Kevin D Quitt USA 91387-4454 96.37% of all statistics are made up

Kevin D. Quitt, Sep 26, 2005
11. ### Flash GordonGuest

Joe Butler wrote:
>>That is not an optimization, but a total waste of time. Read the first
>>example in "Elements of programming style" and learn...

>
> What if the difference is between fitting into memory and not?

If you are hunting for that small an amount to get the program to fit
then you are in trouble anyway. Something will need changing making it
no longer fit!
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

Flash Gordon, Sep 26, 2005
12. ### Christian BauGuest

In article <dh8bgd\$g5o\$>,
Ian Bell <> wrote:

> Many micros have a decrement jmp if zero (or non zero) machine instruction
> so a decent optimising compiler should know this and use it in count down
> to zero loops. Counting up often needs a compare followed by a jmp zero (or
> non zero) which will be a tad slower.

The Pentium processors have a loop instruction. Every decent compiler
knows it and avoids it like hell because it runs slower than a subtract
+ compare + conditional branch

Christian Bau, Sep 26, 2005
13. ### Christian BauGuest

In article <>,
"Peter Bushell" <> wrote:

> These days, most compilers can
> optimise almost as well as you can, for most "normal" operations.

Question: How can I optimise code better than the compiler?

Christian Bau, Sep 26, 2005
14. ### Joe ButlerGuest

I think I disagree.

If you can fit something into a cheaper processor model because you save a
couple of bytes by changing 1 or two loops, then you are not in trouble
anymore.

"Flash Gordon" <> wrote in message
news:-gordon.me.uk...
> Joe Butler wrote:
> >>That is not an optimization, but a total waste of time. Read the first
> >>example in "Elements of programming style" and learn...

> >
> > What if the difference is between fitting into memory and not?

>
> If you are hunting for that small an amount to get the program to fit
> then you are in trouble anyway. Something will need changing making it
> no longer fit!
> --
> Flash Gordon
> Living in interesting times.
> Although my email address says spam, it is real and I read it.

Joe Butler, Sep 26, 2005
15. ### Flash GordonGuest

Joe Butler wrote:

Don't top post. Replies belong after the text you are replying to.

> "Flash Gordon" <> wrote in message
> news:-gordon.me.uk...
>
>>Joe Butler wrote:
>>
>>>>That is not an optimization, but a total waste of time. Read the first
>>>>example in "Elements of programming style" and learn...
>>>
>>>What if the difference is between fitting into memory and not?

>>
>>If you are hunting for that small an amount to get the program to fit
>>then you are in trouble anyway. Something will need changing making it
>>no longer fit!
>>--
>>Flash Gordon
>>Living in interesting times.
>>Although my email address says spam, it is real and I read it.

Don't include peoples signatures unless you are commenting on them.

> I think I disagree.
>
> If you can fit something into a cheaper processor model because you
> save a couple of bytes by changing 1 or two loops, then you are not in
> trouble anymore.

I'll be more explicit then. EVERY SINGLE TIME I have come across a
system where people have tried to squeeze the code in believing it will
just about fit (either size or speed) one of the following has happened:

1) Customer required a subsequent change which proved to be impossible
unless the card was redesigned because there was no space for the new
code.
2) A bug fix requires some additional code and oops, there is no more
space.
3) By the time all the required stuff was added that the person who
thought it would only just fit had forgotten it did NOT fit by a mile
so it did not even come close to meeting the customers requirements
4) It turned out there were massive savings to be had else where because
of higher level problems allowing me to save far more space/time than
you could possibly save by such micro-optimisations.

Only with the third of those possibilities was it possible to meet the
requirements using the existing hardware, and meeting the requirements
involved fixing the algorithms or doing large scale changes where the
coding was just plain atrocious.

So my experience is that it is never worth bothering with such
micro-optimisations.
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

Flash Gordon, Sep 27, 2005
16. ### Joe ButlerGuest

OK, point taken. Although, when working with very small memories, it can
make all the difference if a byte can be saved here and there. Afterall, 50
such 'optimisations' could amount to 10% of the total memory available. I'm
not necessarily suggesting this should be done from day 1, but have found it
useful just to get a feel for what the compiler works best with.

"Flash Gordon" <> wrote in message
news:-gordon.me.uk...
> Joe Butler wrote:
>
> Don't top post. Replies belong after the text you are replying to.
>
> > "Flash Gordon" <> wrote in message
> > news:-gordon.me.uk...
> >
> >>Joe Butler wrote:
> >>
> >>>>That is not an optimization, but a total waste of time. Read the first
> >>>>example in "Elements of programming style" and learn...
> >>>
> >>>What if the difference is between fitting into memory and not?
> >>
> >>If you are hunting for that small an amount to get the program to fit
> >>then you are in trouble anyway. Something will need changing making it
> >>no longer fit!
> >>--
> >>Flash Gordon
> >>Living in interesting times.
> >>Although my email address says spam, it is real and I read it.

>
> Don't include peoples signatures unless you are commenting on them.
>
> > I think I disagree.
> >
> > If you can fit something into a cheaper processor model because you
> > save a couple of bytes by changing 1 or two loops, then you are not in
> > trouble anymore.

>
> I'll be more explicit then. EVERY SINGLE TIME I have come across a
> system where people have tried to squeeze the code in believing it will
> just about fit (either size or speed) one of the following has happened:
>
> 1) Customer required a subsequent change which proved to be impossible
> unless the card was redesigned because there was no space for the new
> code.
> 2) A bug fix requires some additional code and oops, there is no more
> space.
> 3) By the time all the required stuff was added that the person who
> thought it would only just fit had forgotten it did NOT fit by a mile
> so it did not even come close to meeting the customers requirements
> 4) It turned out there were massive savings to be had else where because
> of higher level problems allowing me to save far more space/time than
> you could possibly save by such micro-optimisations.
>
> Only with the third of those possibilities was it possible to meet the
> requirements using the existing hardware, and meeting the requirements
> involved fixing the algorithms or doing large scale changes where the
> coding was just plain atrocious.
>
> So my experience is that it is never worth bothering with such
> micro-optimisations.
> --
> Flash Gordon
> Living in interesting times.
> Although my email address says spam, it is real and I read it.

Joe Butler, Sep 27, 2005
17. ### Mark VandeWetteringGuest

On 2005-09-26, Neo <> wrote:
> Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
> states:for( i=0; i<10; i++){ ... }i loops through the values
> 0,1,2,3,4,5,6,7,8,9 If you don't care about the order of the loop counter,
> you can do this instead: for( i=10; i--; ) { ... }Using this code, i loops
> through the values 9,8,7,6,5,4,3,2,1,0, and the loop should be faster. This
> works because it is quicker to process "i--" as the test condition, which
> says "is i non-zero? If so, decrement it and continue.". For the original
> code, the processor has to calculate "subtract i from 10. Is the result
> non-zero? if so, increment i and continue.". In tight loops, this make a
> considerable difference.
> How far it holds true.. in the light of modern optimizing compilers? and
> will it make a significant difference in case of embedded systems???

You could just test it.

I think it's a mistake to obfuscate your loops just for the sake of
(what is probably) executing one more instruction which in all likelihood
isn't on the critical path of your application _anyway_. If, as you say,
you don't use the loop index, you could indeed do without the one
extra compare instruction, but you'd probably benefit from loop unrolling
too.

Premature optimization is a hindrance to software development.

Mark

> Thanks,
> -Neo
> "Do U really think, what U think real is really real?"

Mark VandeWettering, Sep 27, 2005
18. ### David BrownGuest

Mark McIntyre wrote:
> On Mon, 26 Sep 2005 12:11:23 +0530, in comp.lang.c , "Neo"
> <> wrote:
>
>
>>Hi Folks,http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR Page
>>states

>
>
> (that reversing loop order is faster)
>
> The page is talking rot. It *may* be faster. It *may* be slower. The
> only way to know is to benchmark your particular implementation in the
> specific case you're examining.
>

Actually, the page is talking rubbish about a great deal more than just
this case. It's full of generalisations that depend highly on the
compiler and target in question (the post is cross-posted to
comp.arch.embedded, so we are looking at a wide range of targets). "Use
switch instead of if...else..." (varies widely according to
target/compiler and the size of the switch), "Avoid ++, -- in while ()
expressions" (good compilers work well with such expressions), "Use
word-size variables instead of chars" (great for PPC, indifferent for
msp430, terrible for AVR), "Addition is faster than multiplication - use
'val + val + val' instead of 'val * 3' " (wrong for most compiler/target
combinations).

It's a nice idea to try to list such tips, but the page is badly out of
date, and makes all sorts of unwarranted assumptions.

So, as Mark says, benchmark your implementation. Also examine the
generated assembly code (you do understand the generated assembly? If
not, forget about such minor "optimisations".) And remember Knuth's
rules regarding such code-level optimisations:

1. Don't do it.
2. (For experts only) Don't do it yet.

David Brown, Sep 27, 2005
19. ### Anonymous 7843Guest

In article <>,
Neo <> wrote:
> for( i=10; i--; )
> { ... }

I tend to avoid this kind of loop because it's a bit less
intuitive to use with unsigned loop counters. After the
loop is done, an unsigned i would be set to some very high
implementation-defined number.

There is not much to be gained on loops that only count to
10... that extra instruction 10 times through the loop
would only add an extra 10 nanoseconds. This is likely to
pale in significance to any useful work done in the body of
the loop.

Loops that range over memory should never count backwards,
at least not when speed is important. For better or worse,
operating systems and memory caches only prefetch when

Anonymous 7843, Sep 27, 2005
20. ### Dave HansenGuest

On Tue, 27 Sep 2005 18:21:59 GMT, (Anonymous
7843) wrote:

>In article <>,
>Neo <> wrote:
>> for( i=10; i--; )
>> { ... }

>
>I tend to avoid this kind of loop because it's a bit less
>intuitive to use with unsigned loop counters. After the
>loop is done, an unsigned i would be set to some very high
>implementation-defined number.

FWIW, my bit-bang SPI output function looks something like

bit_ctr = 8;
do
{
Set_IO(SPI_DATA, (data&0x80) != 0);
Set_IO(SPI_CLOCK, 1);
data <<= 1;
Set_IO(SPI_CLOCK, 0);

} while (--bit_ctr);

which seems intuitive for the function at hand, and generates nearly
optimal assembly on all the platforms it's used on.

Regards,

-=Dave
--
Change is inevitable, progress is not.

Dave Hansen, Sep 27, 2005