I would suggest that there is basically no processor anywhere where
that isn't true. From Cray vector processors to digital watches -- it's
the same deal.
[...] I agree the former takes more silicon, but it seems
like most chips have gone ahead and devoted the silicon to it
to make it fast.
- Logan
most CPUs nowadays have at least two integer and FP divide units,
Excuse me? The Alpha processor had a single multi-staged integer
divide unit (i.e., it could compute about 4 bits of integer result per
clock), because those guys were macho. I'm not sure if there are any
others that you could classify as a high performance CPU.
In general CPUs reuse many of their other functional units
(specifically their multipliers and adders) to simulate a divide over
multiple clocks. In fact, most CPUs only provide a floating point
division state machine and use it to generate the integer divide. It's
just a matter of making the best use of your transistors. The
moment you try to make a divider in your CPU microarchitecture, it
becomes a better idea to turn those transistors into extra multipliers
and adders (on average there will just be a bigger performance impact
for doing so). I.e., it never has been, and never will be, a
particularly good idea to make a custom divider, and as a consequence
divide will never come down to a single clock (or 2), which I don't
think is possible anyway.
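
To make the "integer divide through the floating point path" idea concrete, here is a rough software analogue in C. It is only a sketch and not how any particular CPU's microcode works. The names udiv_result and udiv_via_fp are made up for illustration, and it assumes 32-bit operands and a nonzero divisor. Note that the modulo still needs a multiply and a subtract after the divide has finished.

```c
#include <stdint.h>

/* Hypothetical helper type: quotient and remainder of an unsigned divide. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udiv_result;

/* Sketch: 32-bit unsigned divide and modulo routed through the
   floating point divider. Assumes d != 0. */
udiv_result udiv_via_fp(uint32_t n, uint32_t d)
{
    udiv_result r;

    /* Both operands are exact in a double (53-bit significand), and for
       32-bit inputs the rounded quotient truncates to the true integer
       quotient. */
    r.quot = (uint32_t)((double)n / (double)d);

    /* The remainder costs a further multiply and subtract, and it cannot
       start until the divide has produced its result. */
    r.rem = n - r.quot * d;
    return r;
}
```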
In the Itanium, for example, rather than making a division unit, they
decided to make two parallel multiply and add units instead. It's
clearly better to do things that way.
[...] and
can do a divide in two clock cycles, and overlapped with other
operations as well.
CPU manufacturers have, at various times, done clever things with their
divide mechanisms -- but escaping their inevitable slowness is not one
of them. About the fastest dividers in existence are the SIMD
reciprocal dividers in the latest x86 processors that can compute 4
parallel 32-bit floating point approximate results in about 12 clocks.
But you can't shrink the 12 clocks, you can't get fully IEEE accurate
results, and you can't get fewer than 4 of them done at a time (except
by ignoring the extra ones you compute.) It certainly isn't going to
help you compute an isolated integer modulo.
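
For reference, this is roughly what using the SIMD reciprocal looks like from C via the SSE intrinsics in <xmmintrin.h>. It is a sketch assuming an x86 target with SSE. The function name div4_approx is made up, and the single Newton-Raphson refinement step is my addition to show what it takes to get from the raw estimate to something near single precision. The results are still not guaranteed to match IEEE division bit for bit, and you get four lanes whether you want them or not.

```c
#include <xmmintrin.h>   /* SSE intrinsics; assumes an x86 target with SSE */

/* Sketch: divide four floats by four floats using the approximate
   reciprocal instruction plus one Newton-Raphson refinement step.
   Results are close to, but not guaranteed equal to, a[i] / b[i]. */
void div4_approx(const float a[4], const float b[4], float out[4])
{
    __m128 va  = _mm_loadu_ps(a);
    __m128 vb  = _mm_loadu_ps(b);
    __m128 two = _mm_set1_ps(2.0f);

    /* Raw estimate of 1/b in all four lanes, good to roughly 12 bits. */
    __m128 x = _mm_rcp_ps(vb);

    /* One Newton step, x = x * (2 - b * x), roughly doubles the
       number of accurate bits. */
    x = _mm_mul_ps(x, _mm_sub_ps(two, _mm_mul_ps(vb, x)));

    /* Quotient = a * (1/b), computed for all four lanes at once. */
    _mm_storeu_ps(out, _mm_mul_ps(va, x));
}
```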
The AMD Athlon did a neat thing where they would allow other FP
instructions to use the few dead slots in the FPU while it executed the
microcode for its division. But this does nothing for serial
calculations such as computing the offset into a hash (the software
just has to wait for the result before it can proceed.)
[....] So divide and mod are not the 48-cycle monsters
they were just a few silicon generations ago.
Well, good ones are about 20 clocks (the Goldschmidt formula is an
ingenious way of reducing a divide to a critical path of about 5 serial
floating point operations). But it's *really* hard (and not worth the
effort) to get them to go any faster.
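
Assuming the formula meant above is Goldschmidt's division algorithm, here is a minimal sketch of it in C. The function name div_goldschmidt, the linear first guess and the count of five steps are my own choices for illustration. It assumes d is finite and nonzero and that scaling n by a power of two does not overflow or underflow. The point to notice is that the two multiplies in each step are independent of each other, so the critical path is one multiply (plus a cheap subtract) per step.

```c
#include <math.h>

/* Sketch: approximate n / d with Goldschmidt's method. Numerator and
   denominator are multiplied by the same factor until the denominator
   converges to 1, at which point the numerator is the quotient.
   Assumes d is finite and nonzero. Not guaranteed correctly rounded. */
double div_goldschmidt(double n, double d)
{
    int exp;
    double den = frexp(fabs(d), &exp);   /* |d| = den * 2^exp, den in [0.5, 1) */
    double num = ldexp(n, -exp);         /* scale n by the same power of two */
    if (d < 0.0)
        num = -num;

    /* First factor: linear estimate of 1/den, 48/17 - (32/17) * den. */
    double f = 2.8235294117647058 - 1.8823529411764706 * den;

    /* Five serial steps. Within a step, the two multiplies do not depend
       on each other and can be issued side by side. */
    for (int i = 0; i < 5; i++) {
        num *= f;
        den *= f;          /* den = 1 - e, with e squared at each step */
        f = 2.0 - den;     /* next factor, 1 + e */
    }
    return num;
}
```

Unlike Newton-Raphson, rounding errors here are not corrected by later steps, so a real implementation needs extra internal precision or a final fix-up to get an IEEE-accurate result.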