Kaz Kylheku said:
So what exactly is it that you are trying to benchmark? Your CPU's
instructions for computing a quotient and remainder?
In this case, whether a solution involving a mod operation and a table
lookup is, in general, faster or slower than one using a bunch of shifts and
masks.
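For readers unfamiliar with the two approaches, here is a minimal sketch of each for a 32-bit value with exactly one bit set. The mod version relies on the fact that 2^0 through 2^31 leave 32 distinct, nonzero remainders modulo 37, so a 37-entry table maps the remainder back to the bit index. The mask version is one common form in which each test determines one bit of the index. The function and table names are hypothetical, and this is not necessarily the exact code being benchmarked in this thread.

```c
#include <stdint.h>

/* Hypothetical names. Maps (2^k mod 37) back to k. The powers 2^0..2^31
   give 32 distinct, nonzero remainders mod 37, so no entries collide. */
static unsigned char mod37_table[37];

/* Fill the table once before use (computed rather than typed in, to avoid
   transcription errors). */
static void init_mod37_table(void) {
    for (unsigned k = 0; k < 32; k++)
        mod37_table[(UINT32_C(1) << k) % 37] = (unsigned char)k;
}

/* Mod-and-lookup: x must have exactly one bit set. */
static unsigned bit_index_mod(uint32_t x) {
    return mod37_table[x % 37];
}

/* Masks: each test contributes one bit of the index.
   x must have exactly one bit set. */
static unsigned bit_index_mask(uint32_t x) {
    unsigned n = 0;
    if (x & 0xFFFF0000u) n |= 16; /* bit lies in the upper half-word   */
    if (x & 0xFF00FF00u) n |= 8;  /* bit lies in byte 1 or byte 3      */
    if (x & 0xF0F0F0F0u) n |= 4;  /* bit lies in an upper nibble       */
    if (x & 0xCCCCCCCCu) n |= 2;  /* bit position has index bit 1 set  */
    if (x & 0xAAAAAAAAu) n |= 1;  /* bit position is odd               */
    return n;
}
```

The mod version trades the five tests for one division and one memory load, which is exactly the trade-off the benchmark is meant to measure.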
I want to find out the efficiency of my C compiler in generating code for
mods, lookups, shifts and masks, not how clever it is in eliminating those
operations!
A simple benchmark often repeats a simple operation millions of times, and
gcc in particular tries to avoid those millions of iterations, often
resulting in runtimes of essentially zero.
Sure, I can turn off optimisation completely, but that gives misleading
results too.
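One portable middle ground between -O0 and letting the optimiser delete the loop is to read the input through a volatile object and store each result to another volatile object, so the compiler can neither treat the input as a constant nor discard the result. A minimal sketch, with hypothetical names (g_input, g_sink, time_loop_body); note that the volatile load and store are themselves part of what gets timed.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
volatile uint32_t g_input = 1; /* volatile read: the compiler can't assume the input is constant */
volatile unsigned g_sink;      /* volatile write: every result is observable and must be stored */

/* Index of the single set bit in x; x must be nonzero. */
static unsigned index_of_bit(uint32_t x) {
    unsigned n = 0;
    while (!(x & 1u)) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Each iteration reloads the input and stores the result, so the optimiser
   may still compile index_of_bit as well as it can, but cannot hoist the
   work out of the loop or remove the loop entirely. */
static void time_loop_body(long reps) {
    for (long i = 0; i < reps; i++) {
        g_sink = index_of_bit(g_input);
    }
}
```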
(I work on my own language projects and spend a lot of time comparing
runtimes against C. If I have a recursive benchmark, I want to see how my
call/return code stacks up against the call/return code that a good C
compiler will generate. Having the C compiler remove call/return operations
is not very helpful. So it does tail-recursion elimination; I could do that
too, but that's not what I'm testing...)
If that's what you are after, then C is the wrong interface,
don't you think? Write the exact instructions you want and
benchmark that.
On my CPU there's an instruction, BSF, that does exactly this task: return
the bit index of the only (or lowest) set bit.
A good test is how well these C routines compare against that.
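As a sketch of that comparison, assuming gcc or clang: on x86 the hypothetical helper below issues BSF through inline assembly, and on other targets it falls back to gcc's __builtin_ctz builtin, or to a plain loop for other compilers. BSF leaves its destination undefined when the input is zero, and __builtin_ctz(0) is undefined, so callers must pass a nonzero value.

```c
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit of x.
   x must be nonzero: BSF leaves its destination undefined for 0,
   and __builtin_ctz(0) is undefined behaviour. */
static inline unsigned lowest_set_bit(uint32_t x) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    uint32_t r;
    /* AT&T syntax: bsf source, destination */
    __asm__("bsfl %1, %0" : "=r"(r) : "rm"(x) : "cc");
    return r;
#elif defined(__GNUC__) || defined(__clang__)
    /* Compiler builtin for non-x86 targets
       (on x86, gcc typically lowers it to bsf or tzcnt). */
    return (unsigned)__builtin_ctz(x);
#else
    /* Portable fallback for other compilers. */
    unsigned n = 0;
    while (!(x & 1u)) {
        x >>= 1;
        n++;
    }
    return n;
#endif
}
```

Writing the asm by hand pins down the exact instruction, which matches the "write the exact instructions you want" suggestion; the builtin lets the compiler choose, which is what you would normally ship.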
A benchmark that finds the set bit for 1, 2, 4, 8, ..., 2147483648 in
turn, executing the inner loop body ten times to reduce the effect of loop
overheads, and doing the whole thing 10 million times, took 670ms using
inline assembler.
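The benchmark described above might be structured roughly as follows. This is a hypothetical reconstruction, not the poster's actual code: index_of_bit stands in for whichever routine is under test, the loop over k covers the 32 single-bit values, the body is written out ten times, and volatile objects keep gcc from computing the results at compile time or merging the ten calls into one. Timing uses the standard clock() function and is reported in milliseconds of CPU time.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical routine under test: index of the single set bit in x (x != 0).
   Replace with the mod/lookup, shift/mask or BSF version being compared. */
static unsigned index_of_bit(uint32_t x) {
    unsigned n = 0;
    while (!(x & 1u)) {
        x >>= 1;
        n++;
    }
    return n;
}

volatile uint32_t one = 1; /* reloaded on every call, so inputs are never constants */
volatile unsigned sink;    /* every result is stored, so no call can be discarded */

/* One call to the routine under test. Rereading `one` each time stops the
   compiler from computing index_of_bit(1 << k) once and reusing it. */
#define BODY(k) (sink = index_of_bit(one << (k)))

/* Runs reps passes over all 32 single-bit values; returns CPU time in ms. */
static double run_benchmark(long reps) {
    clock_t start = clock();
    for (long r = 0; r < reps; r++) {
        for (unsigned k = 0; k < 32; k++) {
            /* Loop body written out ten times to dilute loop overhead. */
            BODY(k); BODY(k); BODY(k); BODY(k); BODY(k);
            BODY(k); BODY(k); BODY(k); BODY(k); BODY(k);
        }
    }
    clock_t end = clock();
    return (double)(end - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling run_benchmark(10000000L) matches the repetition count quoted above. Timings from a harness like this still include one volatile load and store per call, so they are best used to compare routines against each other rather than as absolute instruction costs.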
C compilers other than gcc took between 3500ms and 6600ms (using the
mask/shift code).
gcc 3.4.5 with -O0 took 7100ms.
With -O1, -O2 and -O3 it took just 62ms. That's incredible! gcc, even at
-O1, outperforming tight inline assembler by more than ten to one. And an
optimiser that speeds things up by over 100 times!
That's why care has to be taken, especially with gcc, which can be too
clever for its own good.