I've seen the former variant quite often in JDK source code (e.g. the BitSet
implementation of and), but I don't understand why it should be faster.
Both use the same number of instructions, differing only in an addition
versus a subtraction. Can you explain why the first one optimizes
better, or point me to a web page that explains this?
The short answer is that computers have shortcuts for comparing with 0
that are faster than comparing with N. When you go down in a for loop,
under the hood, loop termination is determined by comparing the index
with 0; when you go up, by comparing it with N.
Up loops are particularly painful if N gets recomputed each time round
the loop, e.g. ArrayList.size(). Happily, 0, the termination value when
you work down, never gets recomputed.
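For example, in Java (list and process here are stand-ins of my own,
not anything from the JDK):

import java.util.List;

class LoopDirection {
    // Counting up: a naive compiler re-evaluates list.size() on every
    // iteration, since the list could change inside the loop.
    static void up(List<String> list) {
        for (int i = 0; i < list.size(); i++) {
            process(list.get(i));
        }
    }

    // Counting down: size() is evaluated exactly once, and from then on
    // the termination test is a comparison against the constant 0.
    static void down(List<String> list) {
        for (int i = list.size() - 1; i >= 0; i--) {
            process(list.get(i));
        }
    }

    static void process(String s) { /* placeholder loop body */ }
}

A JIT may hoist the size() call itself when it can prove the list does
not change, but the downward loop does not need that proof.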
To explain a little more deeply, here is a quick course in Intel
assembler.
[add-register] is a single instruction that adds two 32-bit registers
a and b and leaves the result in a. As a side effect it sets a magic
register called the condition code that records whether the result was
positive, negative, zero, overflowed, etc. You get this information
for free as a side effect of an arithmetic instruction.
So if you said something like this in Java,
boolean biggerThanZero = ( a - b ) > 0;
you could, in a sense, do the whole thing with one machine instruction,
[sub-register], which takes only one clock cycle.
if you wrote
if ( ( a - b ) > 0 ) // or almost equivalently if ( a > b )
{
dothis();
}
else
{
dothat();
}
the if part compiles to two machine instructions like this:
[subtract] a, b
[jump if result <= 0] to else code
Typically the conditional jump is much faster when it falls through to
the if code than when it jumps to the else clause. It barrels along
where it was planning to go anyway, rather than taking a detour and
throwing a monkey wrench into its plans for the future. The conditional
jump just interrogates the precomputed magic condition-code register,
which we got for free as a side effect of the subtract.
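Incidentally, the "almost equivalently" above matters: in Java, a - b
can overflow, and when it does the two forms disagree. A throwaway demo
of my own:

public class OverflowDemo {
    public static void main(String[] args) {
        int a = Integer.MAX_VALUE;
        int b = -1;
        // a - b wraps around to Integer.MIN_VALUE, which is negative,
        // so the subtract-and-test form gives the wrong answer here.
        System.out.println(( a - b ) > 0); // prints false
        System.out.println(a > b);         // prints true
    }
}

This is also why a real compare instruction, described below, still
gives the right answer when a plain subtract would overflow.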
Now let's have a look at the code in an i++ style loop.
I am explaining the principle, not the Intel architecture, so I feel
free to lie slightly for the sake of simplicity.
i++ is done with an [increment] instruction that adds 1 in a single
cycle.
Then you do a compare instruction with N, which sets the condition
code. A compare is like a subtract, except you discard the result and
it works even when there is overflow, and it also provides both signed
and unsigned comparison results in the condition code register. Then
you do a conditional jump out of the loop. Otherwise you fall through
and execute the loop body one more time.
So, for looping UP, a total of 3 instructions, with a nice fall-through
on the conditional jump in the more common case.
Now let's say you count down. You subtract 1. As a side effect you get
the condition code telling you if the result is +ve, 0 or -ve.
You do a conditional jump out of the loop if it has gone negative.
So, for looping DOWN, a total of 2 instructions, with a nice
fall-through on the conditional jump in the usual case.
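Side by side in Java, with the loop-control cost from above spelled out
in comments (body is a placeholder of mine):

class LoopCost {
    static int work;

    static void body() { work++; } // placeholder loop body

    static void up(int n) {
        // per iteration: [increment] i, [compare] i with n,
        // [conditional jump] -- 3 instructions of loop control
        for (int i = 0; i < n; i++) {
            body();
        }
    }

    static void down(int n) {
        // per iteration: [subtract] 1 from i, which sets the condition
        // code, then [conditional jump if negative] -- 2 instructions
        for (int i = n - 1; i >= 0; i--) {
            body();
        }
    }
}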
People familiar with Intel assembler will chime in that the [decrement]
instruction fails to set the carry flag in the condition code. You can
subtract one in one cycle too, without using the [decrement]
instruction, by using [subtract] with a register containing a
convenient 1, or by subtracting the literal 1, which DOES set the
condition code. (Much of my dicey coding in 8086 assembler for the BBL
Forth engine revolved around always having a convenient 1 sitting in a
register somewhere to use for a condition-code-setting increment or
decrement.)
This one-cycle overhead penalty for upward-counted loops is only going
to matter on a very short loop body. Further, it could be offset by
preemptive RAM-caching hardware that presumes you generally work up
through memory.
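To see whether the penalty is measurable on your machine, here is a
rough timing harness of my own (not a serious benchmark; a modern JIT
may well compile both loops identically):

public class UpVsDown {
    static long up(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    static long down(int n) {
        long sum = 0;
        for (int i = n - 1; i >= 0; i--) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        int n = 100_000_000;
        // warm up so the JIT compiles both methods before timing
        for (int w = 0; w < 5; w++) { up(n); down(n); }

        long t0 = System.nanoTime();
        long a = up(n);
        long t1 = System.nanoTime();
        long b = down(n);
        long t2 = System.nanoTime();

        System.out.println("up:   " + (t1 - t0) / 1_000_000 + " ms, sum=" + a);
        System.out.println("down: " + (t2 - t1) / 1_000_000 + " ms, sum=" + b);
    }
}

Run it a few times; a difference of less than a few milliseconds is
mostly noise.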
Other factors in loop optimisation: If the loop body is small enough,
it may be cached totally on the CPU chip greatly increasing its speed.
So, oddly, it sometimes pays to break a loop up into two shorter
passes, running over all the elements twice. This is all complicated
immensely by how RAM for operands and code is cached and preemptively
cached. Nowadays all you can do is code it several ways and see which
works faster. Jet, an AOT compiler, does versioning, creating multiple
loop bodies, each with a different set of presumptions. It decides at
the top of the loop which body to use, so it does not have to test
conditions repeatedly as it goes through the loop body.
E.g. Jet would convert code like this:
for ( i=0; i<n; i++ )
{
if ( i % 2 != 0 )
{
dothing1();
}
else
{
dothing2();
}
sum += i;
if ( i % 2 != 0 )
{
dothing3();
}
else
{
dothing4();
}
}
to code like this:
for ( i=0; i<n; i++ )
{
if ( ( i & 1 ) != 0 )
{
dothing1();
sum += i;
dothing3();
}
else
{
dothing2();
sum += i;
dothing4();
}
}
or maybe even to code like this, if the side effects can be proved to
be the same:
for ( i=1; i<n; i+=2 )
{
dothing1();
sum += i;
dothing3();
}
for ( i=0; i<n; i+=2 )
{
dothing2();
sum += i;
dothing4();
}
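To convince yourself the split version does the same work, a small
self-check helps (dothing1 through dothing4 here are stand-in counters
of my own, whose only side effect is incrementing):

public class SplitLoopCheck {
    static int c1, c2, c3, c4;

    static void dothing1() { c1++; }
    static void dothing2() { c2++; }
    static void dothing3() { c3++; }
    static void dothing4() { c4++; }

    public static void main(String[] args) {
        int n = 1000;

        // original fused loop
        long sum = 0;
        for (int i = 0; i < n; i++) {
            if (( i & 1 ) != 0) { dothing1(); } else { dothing2(); }
            sum += i;
            if (( i & 1 ) != 0) { dothing3(); } else { dothing4(); }
        }
        int[] fused = { c1, c2, c3, c4 };
        c1 = c2 = c3 = c4 = 0;

        // split version: all odd iterations, then all even iterations
        long sum2 = 0;
        for (int i = 1; i < n; i += 2) {
            dothing1();
            sum2 += i;
            dothing3();
        }
        for (int i = 0; i < n; i += 2) {
            dothing2();
            sum2 += i;
            dothing4();
        }

        System.out.println(sum == sum2
                && fused[0] == c1 && fused[1] == c2
                && fused[2] == c3 && fused[3] == c4); // prints true
    }
}

The transformation is only safe precisely because, for bodies like
these, the order of the side effects does not matter.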