This comes down to the relative costs of division, shifts, and branch
mispredictions. On the obsolete G3 and G4 architectures, both of
which are comparable to the AMD K6 in this respect, the branch
misprediction penalty is only slightly larger than the cost of a
shift and far smaller than the cost of a division. So the speed of
the inner loop of the binary method, on average, makes up for the
increased number of iterations -- in fact, on small numbers the binary
method should be faster than Euclid's method (this is what I saw on
the K6 and P5 architectures).
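For concreteness, here is a minimal C sketch of the two inner loops
being compared (my own illustration, not code from any particular
implementation):

#include <stdint.h>

/* Euclid's method: one (slow) division per iteration, few iterations. */
uint64_t gcd_euclid(uint64_t a, uint64_t b) {
    while (b != 0) {
        uint64_t t = a % b;   /* the expensive step on most CPUs */
        a = b;
        b = t;
    }
    return a;
}

/* Binary method: only shifts, subtractions and (mispredictable)
 * branches per iteration, but more iterations overall. */
uint64_t gcd_binary(uint64_t a, uint64_t b) {
    int shift = 0;
    if (a == 0) return b;
    if (b == 0) return a;
    while (((a | b) & 1) == 0) {   /* strip common factors of two */
        a >>= 1; b >>= 1; shift++;
    }
    while ((a & 1) == 0) a >>= 1;  /* a is now odd */
    do {
        while ((b & 1) == 0) b >>= 1;           /* make b odd too */
        if (a > b) { uint64_t t = a; a = b; b = t; }
        b -= a;                                 /* odd - odd = even */
    } while (b != 0);
    return a << shift;
}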
More modern CPUs all have high branch mispredict penalties -- this
pushes the advantage back toward Euclid's method. The AMD Athlon has
an unusually short pipeline, though it pays an extra clock penalty
for decoding the variable-length x86 instructions; but even its
modest mispredict penalty (modest for an otherwise fully pipelined
processor) tilts the algorithm choice in favor of Euclid's algorithm.
On the Pentium 4 (with the additional handicap of slow shifters) the
advantage of Euclid's algorithm is probably fairly large. The G5 also
has a high branch miss penalty -- couple this with its extra 1-clock
dependency penalty, and I would be surprised if Euclid's algorithm
wasn't a lot faster (but I don't know how fast the G5's integer
divider is, so I can't say for sure).
All this being said, the x86 architecture has bit scan instructions (I
think the PPC architecture does as well) and conditional move
instructions. Look at the last code sequence in the example on my
webpage here:
http://www.pobox.com/~qed/asmexample.html
This implements the binary GCD algorithm in a way that removes the
branch mispredictions. On the Athlon, this easily swings the
advantage back to the binary GCD algorithm -- notwithstanding the fact
that the Athlon's bit scan instructions are not particularly fast
(they are "vector path" instructions). It's very likely that on other
architectures with bit scan instructions and long pipelines, this
kind of loop will be the fastest solution.
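I can't paste the asm from that page here, but the idea translates to
C roughly like this (a sketch of mine, using GCC's __builtin_ctzll as
a stand-in for the bit scan instruction, and relying on the compiler
turning the min/max into conditional moves):

#include <stdint.h>

/* Branch-reduced binary GCD: the trailing-zero count replaces the
 * inner shift loops, and the min/max swap can compile to conditional
 * moves, leaving the loop-exit test as the only data-dependent branch. */
uint64_t gcd_ctz(uint64_t a, uint64_t b) {
    if (a == 0) return b;
    if (b == 0) return a;
    int shift = __builtin_ctzll(a | b);  /* common factors of two */
    a >>= __builtin_ctzll(a);            /* make a odd */
    do {
        b >>= __builtin_ctzll(b);        /* make b odd */
        uint64_t lo = a < b ? a : b;
        uint64_t hi = a < b ? b : a;
        a = lo;
        b = hi - lo;                     /* even (or zero), never negative */
    } while (b != 0);
    return a << shift;
}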
And the difference is largest for numbers like the ones above (where
the quotient is always 1). There are other pairs of numbers where
Euclid's (!) algorithm will outperform the binary algorithm,
especially if one number is a multiple of the other. Take for
instance some odd number and its product with 127: one step for
Euclid, seven steps for binary. (E.g. gcd(5, 635): Euclid finds
635 mod 5 = 0 immediately, while the binary method needs seven
subtract-and-halve rounds to work 635 down to 5.)
For every special case you find that favors Euclid, there are many
others that favor binary GCD. On average binary GCD does fairly well
by this kind of measure.
It is different when division is done in software (as with big
numbers). But even there you may do well with a form of Euclid's
algorithm that uses only a 16-bit estimate of the quotient.
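To illustrate the idea on 64-bit words rather than true bignums (the
structure and bounds here are my own sketch; in a real bignum setting
the estimate would come from the leading words of the operands):

#include <stdint.h>

static int bit_length(uint64_t x) {
    int n = 0;
    while (x) { n++; x >>= 1; }
    return n;
}

/* Euclid's algorithm with a 16-bit quotient estimate. Rounding the
 * shifted divisor up guarantees q never exceeds the true quotient,
 * so a - q*b stays nonnegative; if q undershoots, the next pass
 * simply subtracts a little more. */
uint64_t gcd_qest(uint64_t a, uint64_t b) {
    for (;;) {
        if (a < b) { uint64_t t = a; a = b; b = t; }
        if (b == 0) return a;
        int s = bit_length(b) > 16 ? bit_length(b) - 16 : 0;
        uint64_t q = (a >> s) / ((b >> s) + 1);  /* underestimate of a/b */
        if (q == 0) q = 1;   /* a >= b here, so one subtraction is safe */
        a -= q * b;          /* gcd(a - q*b, b) == gcd(a, b) */
    }
}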
The binary GCD algorithm's primary advantage over Euclid's algorithm
is that it approximates the divide with a 1-bit-accurate divide, on
the assumption that this can be done much faster than an actual full
divide. For bignum libraries, the full divide routine is going to be
very slow, but the 1-bit approximation will have a lot of really
horrible bad cases. The point is that a 1 *digit* divide in a bignum
system can be performed at roughly the speed of a single machine
integer divide. So a GCD algorithm that performed "approximate
divides" would probably do very well for bignum libraries.
Of course, a*b = gcd(a,b)*lcm(a,b). So lcm(a,b) = (a/gcd(a,b))*b (you
need to do it like this to avoid a potential a*b overflow), and so in
theory you only need gcd().
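In code that is just (a one-liner sketch, reusing any gcd() from
above):

#include <stdint.h>

/* lcm(a,b) = (a / gcd(a,b)) * b; dividing first avoids the a*b
 * overflow (the result itself can still overflow, of course). */
uint64_t lcm(uint64_t a, uint64_t b) {
    if (a == 0 || b == 0) return 0;      /* conventional choice */
    return (a / gcd_euclid(a, b)) * b;
}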