Bit shifting can't replace division unless the divisor is a power
of two, not just even.
"Simply check" is likely to take extra time. Also, it may be a
low-percentage optimization that just doesn't get used that much,
in the opinion of the CPU designer, probably backed up by a bunch
of statistics on code in real programs.
Even worse, it will screw up the pipeline on modern/large CPUs. At
least for multipliers, these tend to be very fixed - IOW, it's five
cycles through the multiplier, no matter the operands. Making that
variable length requires a whole bunch of other stuff to be added, and
may make it much more difficult to pipeline several sequential
multiplications (IOW, the multiplier may take five cycles, but it can
start a new multiplication every cycle).
Now older/smaller CPUs often implement mulch smaller multipliers
(often the hardware equivalent of shift-and-add), and on those, early
out is often implemented (IOW, when you run out of one bits in the
multiplier, you can stop), and the shift cycle is sometimes faster for
zero bits in the multiplier). But those take many cycles in any
event, and making them take a variable number is often not a big deal.
General purpose dividers are a huge pain, and while some variability
is not uncommon in full hardware divider implementations, for the many
CPUs where there is no hardware divide (just approximate plus Newton-
Raphson), I doubt there's a useful way you could implement such a
thing without requiring a conditional branch in the divide routine
(which would be horrible). Again, iterative dividers on small
implementations often do have some sorts of optimizations, and some do
fast shift past low zero bit in the divisor (basically the division
equivalent of early-out).
But in short, at least for multipliers on large implementations, this
would likely slow general multiplications down far more than you'd
gain from a few special cases going faster. And dividers are going to
be an even bigger mess.
Probably a better optimization is to provide short multiplier/divisor
operations. Say up to eight bits. Those would cover a lot of actual
operations, and a 32x8 multiplier could be made to run rather faster
than a full 32x32 one.