Faster for() loops?

Neo

Hi Folks,

The page at http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR states:

for( i=0; i<10; i++ ){ ... }

i loops through the values 0,1,2,3,4,5,6,7,8,9. If you don't care about the
order of the loop counter, you can do this instead:

for( i=10; i--; ) { ... }

Using this code, i loops through the values 9,8,7,6,5,4,3,2,1,0, and the loop
should be faster. This works because it is quicker to process "i--" as the
test condition, which says "is i non-zero? If so, decrement it and continue."
For the original code, the processor has to calculate "subtract i from 10. Is
the result non-zero? If so, increment i and continue." In tight loops, this
makes a considerable difference.

How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

Thanks,
-Neo
"Do U really think, what U think real is really real?"
 
Al Borowski

Hi,

Neo wrote:
[...] In tight loops, this makes a considerable difference.
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

There is nothing like an experiment to test a theory. I just tried it with
AVRGCC:

void countDown(void){
    int i;
    for(i=10; i!=0; i--) doSomething();
}

void countUp(void){
    int i;
    for(i=0; i<10; i++) doSomething();
}

The generated code is

000000ce <countDown>:
}

void countDown(void){
ce: cf 93 push r28
d0: df 93 push r29
int i;
for(i=10; i!=0; i--) doSomething();
d2: ca e0 ldi r28, 0x0A ; 10
d4: d0 e0 ldi r29, 0x00 ; 0
d6: 0e 94 5d 00 call 0xba
da: 21 97 sbiw r28, 0x01 ; 1
dc: e1 f7 brne .-8 ; 0xd6
de: df 91 pop r29
e0: cf 91 pop r28
e2: 08 95 ret

000000e4 <countUp>:
}
void countUp(void){
e4: cf 93 push r28
e6: df 93 push r29
e8: c9 e0 ldi r28, 0x09 ; 9
ea: d0 e0 ldi r29, 0x00 ; 0
int i;
for(i=0;i<10;i++) doSomething();
ec: 0e 94 5d 00 call 0xba
f0: 21 97 sbiw r28, 0x01 ; 1
f2: d7 ff sbrs r29, 7
f4: fb cf rjmp .-10 ; 0xec
f6: df 91 pop r29
f8: cf 91 pop r28
fa: 08 95 ret

Counting down instead of up saves one whole instruction. It could make a
difference I suppose.

However, the compiler cannot optimise as well if anything in the loop
depends on the value of 'i'.


void countDown(void){
    int i;
    for(i=10; i!=0; i--) doSomething(i);
}

void countUp(void){
    int i;
    for(i=0; i<10; i++) doSomething(i);
}

Becomes

void countDown(void){
ce: cf 93 push r28
d0: df 93 push r29
int i;
for(i=10; i!=0; i--) doSomething(i);
d2: ca e0 ldi r28, 0x0A ; 10
d4: d0 e0 ldi r29, 0x00 ; 0
d6: ce 01 movw r24, r28
d8: 0e 94 5d 00 call 0xba
dc: 21 97 sbiw r28, 0x01 ; 1
de: d9 f7 brne .-10 ; 0xd6
e0: df 91 pop r29
e2: cf 91 pop r28
e4: 08 95 ret

000000e6 <countUp>:
}
void countUp(void){
e6: cf 93 push r28
e8: df 93 push r29
int i;
for(i=0;i<10;i++) doSomething(i);
ea: c0 e0 ldi r28, 0x00 ; 0
ec: d0 e0 ldi r29, 0x00 ; 0
ee: ce 01 movw r24, r28
f0: 0e 94 5d 00 call 0xba
f4: 21 96 adiw r28, 0x01 ; 1
f6: ca 30 cpi r28, 0x0A ; 10
f8: d1 05 cpc r29, r1
fa: cc f3 brlt .-14 ; 0xee
fc: df 91 pop r29
fe: cf 91 pop r28
100: 08 95 ret

This time there are a whole 2 extra instructions. I don't think this is
such a big deal. Unrolling the loop would give a better result.
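
For illustration, a fully unrolled countUp() might look like this (an added
sketch, not from the original post; countUpUnrolled is a made-up name):

extern void doSomething(int i);   /* the stub from the second example above */

void countUpUnrolled(void){
    /* Ten explicit calls instead of a counted loop: no counter, no compare,
     * no branch -- just more code. */
    doSomething(0);
    doSomething(1);
    doSomething(2);
    doSomething(3);
    doSomething(4);
    doSomething(5);
    doSomething(6);
    doSomething(7);
    doSomething(8);
    doSomething(9);
}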

cheers,

Al
 
Ian Bell

Neo said:
Hi Folks, http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR states that
for( i=10; i--; ) { ... } should be faster than for( i=0; i<10; i++ ) { ... },
because it is quicker to process "i--" as the test condition ("is i non-zero?
If so, decrement it and continue.") than to compare i against 10 on every
iteration. [...]
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

Many micros have a decrement-and-jump-if-zero (or non-zero) machine
instruction, so a decent optimising compiler should know this and use it in
count-down-to-zero loops. Counting up often needs a compare followed by a
jump-if-zero (or non-zero), which will be a tad slower.

Ian
 
Peter Bushell

Neo said:
Hi Folks, http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR states that
for( i=10; i--; ) { ... } should be faster than for( i=0; i<10; i++ ) { ... },
because "i--" is quicker to process as the test condition. [...]
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

The answer is "implementation-dependent".

A major advantage of writing in C is that you can, if you choose, write
understandable, maintainable code. This kind of hand-optimisation has the
opposite effect. If you really need to care about exactly how many
instruction cycles a loop takes, code it in assembly language. Otherwise, for
the sake of those who come after you, please write your C readably and
leave the compiler to do the optimisation. These days, most compilers can
optimise almost as well as you can, for most "normal" operations.

Regards,
 
Skarmander

Neo said:
Hi Folks, http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR states that
for( i=10; i--; ) { ... } should be faster than for( i=0; i<10; i++ ) { ... },
because "i--" is quicker to process as the test condition. [...]
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

Regardless of the performance issue, I'd like to point out that after
for( i=10; i--; ) finishes, i will have the value -1, since the
decrement is performed even when the tested value is zero. This is
counterintuitive, so it's worth noting. It also means the following is
not equivalent:

for (i = 10; i != 0; --i)

since here one fewer decrement is performed. Incidentally, my
compiler/platform generates better code with this version -- with the
other it compares i to -1, which is no better than comparing it to 10!
If you want to count down, I suggest writing what you mean and separating
the test and decrement parts -- it has the added bonus of making things
more readable. The rest is best left to the compiler.
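
For illustration, a small standalone sketch of the two loops and their final
values (an added example, not from the original post):

#include <stdio.h>

int main(void)
{
    int i;

    /* Test-and-decrement: the decrement also runs on the failing test,
     * so the body sees 9..0 and i finishes at -1. */
    for (i = 10; i--; )
        ;
    printf("for (i = 10; i--; )       -> i == %d\n", i);  /* -1 */

    /* Explicit test, decrement in the third clause: the body sees 10..1
     * and i finishes at 0. */
    for (i = 10; i != 0; --i)
        ;
    printf("for (i = 10; i != 0; --i) -> i == %d\n", i);  /* 0 */

    return 0;
}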

S.
 
Scott Moore

Neo wrote on 09/25/05 23:41:
Hi Folks, http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR states that
for( i=10; i--; ) { ... } should be faster than for( i=0; i<10; i++ ) { ... },
because "i--" is quicker to process as the test condition. [...]
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

Unroll it completely.
 
Roberto Waltman

Neo said:
Hi Folks, http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR states that
for( i=10; i--; ) { ... } should be faster than for( i=0; i<10; i++ ) { ... }.
[...]
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

It may or may not save a couple of assembly language instructions (depending,
of course, on the compiler and processor used), but I doubt this
"noptimization" will make any noticeable change in the performance of
a program, unless your code consists mainly of empty for() loops.

What impact can a minuscule reduction in the time required to decide
whether the loop has ended have, if the body of the loop, for example,
calls functions that format a CAN message, deliver it, wait
for a response, retry if there were errors or timeouts, decode the
response, store the values in a serial EEPROM, and based on them start
a few motors, open pneumatic valves, and optionally send an email
message to Katmandu?

That is not an optimization, but a total waste of time. Read the first
example in "The Elements of Programming Style" and learn...

Roberto Waltman

[ Please reply to the group, ]
[ return address is invalid. ]
 
Joe Butler

That is not an optimization, but a total waste of time. Read the first
example in "The Elements of Programming Style" and learn...

What if the difference is between fitting into memory and not?
 
Mark McIntyre


(that reversing loop order is faster)

The page is talking rot. It *may* be faster. It *may* be slower. The
only way to know is to benchmark your particular implementation in the
specific case you're examining.
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

Benchmark.
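
Something like the following gives a first number (an added sketch; it
assumes a hosted environment with <time.h> and stores to a volatile variable
so the compiler can't delete the loops):

#include <stdio.h>
#include <time.h>

volatile unsigned sink;   /* keeps the loop bodies observable */

static void count_up(void)   { unsigned i; for (i = 0; i < 10; i++) sink = i; }
static void count_down(void) { unsigned i; for (i = 10; i--; )      sink = i; }

int main(void)
{
    enum { REPS = 10000000 };
    clock_t t0, t1;
    long r;

    t0 = clock();
    for (r = 0; r < REPS; r++) count_up();
    t1 = clock();
    printf("count_up:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (r = 0; r < REPS; r++) count_down();
    t1 = clock();
    printf("count_down: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}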
 
Kevin D. Quitt

It depends on what you're doing. If you're accessing a large chunk of memory
on a system with a cache, you want to step through incrementing addresses to
maximize the use of the cache. Decrementing through memory is generally
pessimal.
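
For example, compare these two ways of summing a buffer (an added sketch;
sum_forward and sum_backward are made-up names):

#include <stddef.h>

/* Ascending addresses -- the access pattern caches and hardware
 * prefetchers are tuned for. */
long sum_forward(const int *buf, size_t n)
{
    size_t i;
    long sum = 0;
    for (i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* The same work over descending addresses -- the count-down trick applied
 * to a memory scan, which can defeat prefetching. */
long sum_backward(const int *buf, size_t n)
{
    size_t i;
    long sum = 0;
    for (i = n; i-- > 0; )
        sum += buf[i];
    return sum;
}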
 
Flash Gordon

Joe said:
What if the difference is between fitting into memory and not?

If you are hunting for that small an amount to get the program to fit,
then you are in trouble anyway. Something will need changing, making it
no longer fit!
 
Christian Bau

Ian Bell said:
Many micros have a decrement jmp if zero (or non zero) machine instruction
so a decent optimising compiler should know this and use it in count down
to zero loops. Counting up often needs a compare followed by a jmp zero (or
non zero) which will be a tad slower.

The Pentium processors have a loop instruction. Every decent compiler
knows it and avoids it like hell because it runs slower than a subtract
+ compare + conditional branch :)
 
Christian Bau

"Peter Bushell said:
These days, most compilers can
optimise almost as well as you can, for most "normal" operations.

Question: How can I optimise code better than the compiler?
Answer: If you ask, then you can't.
 
Joe Butler

I think I disagree.

If you can fit something into a cheaper processor model because you save a
couple of bytes by changing one or two loops, then you are not in trouble
anymore.
 
Flash Gordon

Joe Butler wrote:

Don't top post. Replies belong after the text you are replying to.

Don't include peoples signatures unless you are commenting on them.
> I think I disagree.
>
> If you can fit something into a cheaper processor model because you
> save a couple of bytes by changing 1 or two loops, then you are not in
> trouble anymore.

I'll be more explicit then. EVERY SINGLE TIME I have come across a
system where people have tried to squeeze the code in believing it would
just about fit (either size or speed), one of the following has happened:

1) The customer required a subsequent change which proved to be impossible
unless the card was redesigned, because there was no space for the new
code.
2) A bug fix required some additional code and, oops, there was no more
space.
3) By the time all the required stuff that the person who thought it would
only just fit had forgotten about was added, it did NOT fit by a mile,
so it did not even come close to meeting the customer's requirements.
4) It turned out there were massive savings to be had elsewhere because
of higher-level problems, allowing me to save far more space/time than
you could possibly save with such micro-optimisations.

Only with the last of those possibilities was it possible to meet the
requirements using the existing hardware, and meeting the requirements
involved fixing the algorithms or doing large-scale changes where the
coding was just plain atrocious.

So my experience is that it is never worth bothering with such
micro-optimisations.
 
Joe Butler

OK, point taken. Although, when working with very small memories, it can
make all the difference if a byte can be saved here and there. After all, 50
such 'optimisations' could amount to 10% of the total memory available. I'm
not necessarily suggesting this should be done from day 1, but I have found
it useful just to get a feel for what the compiler works best with.
 
Mark VandeWettering

["Followup-To:" header set to comp.arch.embedded.]
Hi Folks, http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR states that
for( i=10; i--; ) { ... } should be faster than for( i=0; i<10; i++ ) { ... },
because "i--" is quicker to process as the test condition. [...]
How far does this hold true in light of modern optimizing compilers? And will
it make a significant difference on embedded systems?

You could just test it.

I think it's a mistake to obfuscate your loops just for the sake of saving
(what is probably) one instruction which in all likelihood isn't on the
critical path of your application _anyway_. If, as you say, you don't use
the loop index, you could indeed do without the one extra compare
instruction, but you'd probably benefit from loop unrolling too.

Premature optimization is a hindrance to software development.

Mark
 
David Brown

Mark said:
(that reversing loop order is faster)

The page is talking rot. It *may* be faster. It *may* be slower. The
only way to know is to benchmark your particular implementation in the
specific case you're examining.

Actually, the page is talking rubbish about a great deal more than just
this case. It's full of generalisations that depend highly on the
compiler and target in question (the post is cross-posted to
comp.arch.embedded, so we are looking at a wide range of targets). "Use
switch instead of if...else..." (varies widely according to
target/compiler and the size of the switch), "Avoid ++, -- in while ()
expressions" (good compilers work well with such expressions), "Use
word-size variables instead of chars" (great for PPC, indifferent for
msp430, terrible for AVR), "Addition is faster than multiplication - use
'val + val + val' instead of 'val * 3' " (wrong for most compiler/target
combinations).

It's a nice idea to try to list such tips, but the page is badly out of
date, and makes all sorts of unwarranted assumptions.

So, as Mark says, benchmark your implementation. Also examine the
generated assembly code (you do understand the generated assembly? If
not, forget about such minor "optimisations".) And remember Knuth's
rules regarding such code-level optimisations:

1. Don't do it.
2. (For experts only) Don't do it yet.
 
Anonymous 7843

for( i=10; i--; )
{ ... }

I tend to avoid this kind of loop because it's a bit less
intuitive to use with unsigned loop counters. After the
loop is done, an unsigned i would be set to the maximum
value of its type (UINT_MAX for an unsigned int).
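
A minimal illustration (an added sketch, not from the post):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    unsigned int i;

    for (i = 10; i--; )
        ;   /* body sees 9..0 */

    /* The final failing test still decrements i, and unsigned arithmetic
     * wraps, so i ends up at UINT_MAX rather than -1. */
    printf("i == %u (UINT_MAX == %u)\n", i, UINT_MAX);
    return 0;
}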

There is not much to be gained on loops that only count to
10... that extra instruction 10 times through the loop
would only add an extra 10 nanoseconds. This is likely to
pale in significance to any useful work done in the body of
the loop.

Loops that range over memory should never count backwards,
at least not when speed is important. For better or worse,
operating systems and memory caches only prefetch when
reading ascending addresses.
 
Dave Hansen

I tend to avoid this kind of loop because it's a bit less
intuitive to use with unsigned loop counters. After the
loop is done, an unsigned i would be set to the maximum
value of its type (UINT_MAX for an unsigned int).

FWIW, my bit-bang SPI output function looks something like

    bit_ctr = 8;
    do
    {
        Set_IO(SPI_DATA, (data & 0x80) != 0);
        Set_IO(SPI_CLOCK, 1);
        data <<= 1;
        Set_IO(SPI_CLOCK, 0);
    } while (--bit_ctr);

which seems intuitive for the function at hand, and generates nearly
optimal assembly on all the platforms it's used on.

Regards,

-=Dave
 
