Even old x86 machines support an 'xchg' opcode that swaps two
registers in a single instruction, so a good compiler can
handle the triple assignment much better than you mentioned
when both ints have register storage. The xor trick has less
chance of such optimization, but it is a joyful solution to
a programming dilemma (I like it more than the add/subtract
solution).
I seem to recall some 8086 compilers actually recognizing the
classical swap idiom (with the temporary) and generating the
xchg instruction to implement it. Modern x86 compilers don't,
however, probably because on more recent x86 processors, xchg
with a memory operand has an implied lock prefix, which acts as
a memory fence (which in turn means that the instruction is
considerably slower than it would be otherwise).
I just ran some quick benchmarks on an Intel-based Linux machine
here (using g++ 4.1.0, -O3), using the following "swappers":
struct SwapperClassic
{
    void operator()( int& a, int& b )
    {
        int tmp = a ;
        a = b ;
        b = tmp ;
    }
} ;
struct SwapperXor
{
    void operator()( int& a, int& b )
    {
        a ^= b ;
        b ^= a ;
        a ^= b ;
    }
} ;
struct SwapperAsm
{
    void operator()( int& a, int& b )
    {
        asm ( "movl %[a], %%eax\n xchgl %%eax,%[b]\n movl %%eax,%[a]"
              : [a] "+m" (a), [b] "+m" (b) : : "%eax" ) ;
    }
} ;
On this particular machine (I'm not sure of its actual specs:
which processor, or the clock frequency), I got:
                  ns per    machine    memory
                  iter.     instr.     accesses
SwapperClassic:     1.7        5          5
SwapperXor:         2.5        9          7
SwapperAsm:        84.4        4          4
Tests were run over 500 million iterations. In this case, the
actual function was invoked from within a virtual member
function, and swapped two member variables. The last two columns
do not include the standard function prologue or epilogue.
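
For anyone who wants to repeat the measurement, a setup like the one
described could look roughly like the sketch below. This is not the
code that produced the numbers above: TestBase, SwapTest and
nsPerIteration are names made up for illustration, and std::chrono
assumes a much newer compiler than g++ 4.1.0.

```cpp
#include <chrono>

struct SwapperClassic
{
    void operator()( int& a, int& b )
    {
        int tmp = a ;
        a = b ;
        b = tmp ;
    }
} ;

// Hypothetical harness: the swap is reached through a virtual
// member function and operates on two member variables.
struct TestBase
{
    virtual ~TestBase() {}
    virtual void step() = 0 ;
} ;

template< typename Swapper >
struct SwapTest : TestBase
{
    int first ;
    int second ;

    SwapTest() : first( 1 ), second( 2 ) {}

    virtual void step()
    {
        Swapper()( first, second ) ;
    }
} ;

// Returns the average wall-clock time per call to step(), in
// nanoseconds.  The figure includes loop and call overhead.
double nsPerIteration( TestBase& test, long iterations )
{
    typedef std::chrono::steady_clock Clock ;
    Clock::time_point start = Clock::now() ;
    for ( long i = 0 ; i < iterations ; ++ i ) {
        test.step() ;
    }
    std::chrono::duration< double, std::nano > elapsed
        = Clock::now() - start ;
    return elapsed.count() / iterations ;
}
```

Calling nsPerIteration on a SwapTest< SwapperClassic > with the
iteration count (500 million here) gives a per-iteration figure.
A sufficiently clever optimizer may devirtualize or inline the
call, so it is worth checking the generated assembler before
trusting any result.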
A quick glance at the generated assembler showed that none of
the three versions used any local variables.
And as you can see, the cost of the memory fence due to the
implicit lock prefix on the xchgl instruction is very, very
high. Compared with the classic version, the generated code uses
one fewer instruction, and has one fewer memory access, but
requires roughly 50 times more time to run.
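
One further point about the xor version, separate from the timings:
it only works when the two references name different objects. If
both refer to the same int, the first xor sets it to zero and the
value is lost. A small sketch (xorSwap and xorSwapGuarded are
illustrative names, not part of the benchmark above):

```cpp
// Same body as SwapperXor, written as a free function.
void xorSwap( int& a, int& b )
{
    a ^= b ;
    b ^= a ;
    a ^= b ;
}

// Hypothetical guarded variant: does nothing when both references
// name the same object, at the cost of an extra compare and branch.
void xorSwapGuarded( int& a, int& b )
{
    if ( &a != &b ) {
        xorSwap( a, b ) ;
    }
}
```

The classic version with the temporary has no such restriction.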