Jorgen Grahn said:
"Investigating the cost of an operation in cycles within a real
world, i.e., no peak, performance measurement of C#, C++, Java,
Fortran and JavaScript"
It's on topic, meaningless, and will be discussed here ad nauseam,
just like all other benchmarks have in the past.
I'll start... The reason to examine computation overhead is an overriding
concern about system performance. I care far more about total throughput
and latency than about the cycle counts of individual arithmetic operations.
The benchmark is simplistic, uninteresting, and unenlightening because it
fails to exercise, or account for, memory bandwidth, cache misses, and the
instruction-reordering optimizations available on modern processors.
Real-world computation-intensive applications will run in multiple threads,
not all of them doing computation. Hyperthreading might or might not prove
meaningful in the analysis. A truly good optimizer can exploit the
consecutive locations of parallel arrays of values to minimize stalls on
cache misses. Throughput will almost certainly differ with more than the 3
values involved, and differ again if the related values are stored as
cache-line-aligned structs rather than as separate arrays.
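To make the two layouts concrete, here is a minimal sketch of the contrast:
separate parallel arrays (structure-of-arrays) versus one cache-line-aligned
struct per record (array-of-structures). The names, the toy three-value
kernel, and the 64-byte alignment are my own illustration; 64 bytes is merely
a common cache-line size, not a measured property of any particular machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays: the three related values live in separate contiguous
// arrays, so a streaming loop walks each array sequentially.
struct SoA {
    std::vector<double> a, b, c;
};

// Array-of-structures: the three related values share one struct, padded out
// to a full (assumed) 64-byte cache line so each record starts on a line.
struct alignas(64) Triple {
    double a, b, c;
};

// The same toy kernel over both layouts; only the memory-access pattern differs.
double sum_soa(const SoA& s) {
    double total = 0.0;
    for (std::size_t i = 0; i < s.a.size(); ++i)
        total += s.a[i] * s.b[i] + s.c[i];
    return total;
}

double sum_aos(const std::vector<Triple>& v) {
    double total = 0.0;
    for (const Triple& t : v)
        total += t.a * t.b + t.c;
    return total;
}
```

Which layout wins depends on exactly the things the benchmark ignores: whether
the loop touches all three values (favoring the struct) or only one (favoring
the separate arrays), and how the prefetcher handles each stream.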
More to the point, and not at all incidentally, the benchmark test is
trivially and ideally suited to implementation in CUDA or another GPGPU
framework. It falls into the category of embarrassingly parallel problems.
Even a clumsy implementation effectively measures memory bandwidth rather
than computation cost.
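"Embarrassingly parallel" here means each output element depends on exactly
one input index, so the work splits into independent chunks with no
communication between them. A CUDA version would map one GPU thread per
element; the sketch below uses plain std::thread chunks to stand in for that.
The SAXPY-style kernel is my own stand-in for the benchmark's arithmetic, not
the benchmark itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each iteration reads and writes only index i: no shared state, no ordering
// constraints, hence trivially parallel.
void saxpy_range(const std::vector<float>& x, std::vector<float>& y,
                 float a, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        y[i] = a * x[i] + y[i];
}

// Split [0, n) into contiguous chunks, one per thread (nthreads >= 1).
// No locks, no communication: the chunks never touch the same element.
void saxpy_parallel(const std::vector<float>& x, std::vector<float>& y,
                    float a, unsigned nthreads) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (x.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(x.size(), begin + chunk);
        if (begin >= end) break;
        pool.emplace_back(saxpy_range, std::cref(x), std::ref(y),
                          a, begin, end);
    }
    for (std::thread& th : pool) th.join();
}
```

Note what this measures on any realistic problem size: two streaming reads and
one write per element against a handful of FLOPs, i.e., memory bandwidth.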
With this as context, the better metric of efficiency is the percentage of
available memory bandwidth occupied, not the cycle cost of individual
arithmetic operations. A benchmark that would interest me is one that relates
memory bandwidth to the throughput of various implementations: CUDA, parallel
arrays, packed structs, and CPU (not GPU) threading schemes.
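A rough sketch of that metric: achieved bandwidth is bytes actually moved
divided by elapsed seconds, and occupancy is that figure divided by the
platform's peak. The peak value passed in below is a placeholder you would
take from the hardware spec or a STREAM-style measurement, not a number I am
asserting; the struct and function names are likewise my own.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct BandwidthResult {
    double achieved_gb_per_s;  // bytes moved / elapsed seconds / 1e9
    double occupancy;          // fraction of the assumed peak
    double checksum;           // forces the copy to actually happen
};

// Time a plain copy of n doubles and relate it to an assumed peak bandwidth.
BandwidthResult measure_copy(std::size_t n, double assumed_peak_gb_per_s) {
    std::vector<double> src(n, 1.0), dst(n, 0.0);

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];                      // one read + one write per element
    const auto t1 = std::chrono::steady_clock::now();

    // Summing dst afterwards keeps the optimizer from deleting the loop.
    double checksum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        checksum += dst[i];

    const double secs  = std::chrono::duration<double>(t1 - t0).count();
    const double bytes = 2.0 * n * sizeof(double);  // read src + write dst
    const double gbps  = bytes / secs / 1e9;
    return {gbps, gbps / assumed_peak_gb_per_s, checksum};
}
```

Run the same harness over the CUDA, parallel-array, packed-struct, and
CPU-threaded variants and the occupancy column tells you which implementation
is actually leaving bandwidth on the table.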