Lionel said:
That all sounds very impressive... could you possibly explain what it
means, roughly, to a non-assembler/microprocessor architecture expert?
Also, what about on x86_64?
Shouldn't sound very impressive imho. The central part of said
benchmark is the following loop:
21:
for (int k = 1; k <= n; ++k, pot = -pot) {
kd = double(k);
kd2 = kd * kd;
kd3 = kd * kd2;
sink = std::sin(kd);
cosk = std::cos(kd);
res1 += std::pow(dt, kd);
res2 += 1.0 / std::sqrt(kd);
res3 += 1.0 / (kd2 + kd);
res4 += 1.0 / (kd3 * sink * sink);
res5 += 1.0 / (kd3 * cosk * cosk);
res6 += 1.0 / kd;
res7 += 1.0 / kd2;
res8 += pot / kd;
res9 += pot / (2.0 * kd - 1.0);
}
39:
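For anyone who wants to play with the loop, here is a minimal
self-contained sketch of it. The struct Sums, the function name
run_loop, the starting value pot = 1.0 and whatever you pass as dt
are my assumptions; the setup code of the benchmark isn't shown
above, so don't take this as the original program.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical container for the nine partial sums (not in the original).
struct Sums {
    double res1 = 0.0, res2 = 0.0, res3 = 0.0;
    double res4 = 0.0, res5 = 0.0, res6 = 0.0;
    double res7 = 0.0, res8 = 0.0, res9 = 0.0;
};

// Runs the loop body shown above for k = 1..n.
// Assumptions: pot starts at +1.0 and flips sign every iteration;
// dt is supplied by the caller.
Sums run_loop(int n, double dt) {
    assert(n >= 0);
    Sums s;
    double pot = 1.0;
    for (int k = 1; k <= n; ++k, pot = -pot) {
        // These operands are shared by several of the nine terms.
        const double kd   = double(k);
        const double kd2  = kd * kd;
        const double kd3  = kd * kd2;
        const double sink = std::sin(kd);
        const double cosk = std::cos(kd);

        s.res1 += std::pow(dt, kd);
        s.res2 += 1.0 / std::sqrt(kd);
        s.res3 += 1.0 / (kd2 + kd);
        s.res4 += 1.0 / (kd3 * sink * sink);
        s.res5 += 1.0 / (kd3 * cosk * cosk);
        s.res6 += 1.0 / kd;
        s.res7 += 1.0 / kd2;
        s.res8 += pot / kd;
        s.res9 += pot / (2.0 * kd - 1.0);
    }
    return s;
}
```

Call it as, say, run_loop(1000, 0.5) and print the fields. The
point to notice is how often kd, kd2, kd3, sink and cosk show up on
the right-hand sides.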
What one can see is a handful of operands (kd, kd2, etc.)
that are reused throughout the computation of the nine
different terms. To me, it looks like the Intel compiler
counts the occurrences of these operands, puts the "best"
four or five into the upper fpu registers (x86)
(ST[3] ... ST[7]), and does the increments of the
res[1-9] terms entirely out of these fpu registers.
Example:
;;; res4 += 1.0 / (kd3 * sink * sink);
;;; res5 += 1.0 / (kd3 * cosk * cosk);
;;; res6 += 1.0 / kd;
;;; res7 += 1.0 / kd2;
gives:
fdiv %st, %st(2) #36.27
fdiv %st, %st(1) #32.34
fxch %st(1) #32.13
faddp %st, %st(6) #32.13
fldl 112(%esp) #33.34
fxch %st(6) #32.13
fstpl 80(%esp) #32.13
fld %st(4) #33.34
fmul %st(6), %st #33.34
fmulp %st, %st(6) #33.41
fdiv %st, %st(5) #33.41
fldl 96(%esp) #33.13
faddp %st, %st(6) #33.13
fxch %st(5) #33.13
fstpl 96(%esp) #33.13
fldl 104(%esp) #34.34
fmul %st, %st(4) #34.34
fmulp %st, %st(4) #34.41
fxch %st(3) #34.41
fdivr %st(4), %st #34.41
[snipped]
One can immediately see that the operations use and store stuff
across (almost) the full fpu register set %st(0) .. %st(6).
Even the last register, %st(7), is used (elsewhere). A lot of
'fxch' operations are used too; 'fxch' swaps %st with another
register, which newer x86 implements as '(fpu-) register renaming'
at a cost of 0 cycles. This is necessary to throw out operands
that are no longer used: they are 'renamed' from %st(7) to %st
(which is the 'top of stack'). To the application, the x86 fpu
is a stack and can only be used like a stack - except for 'renaming'.
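If the stack business is hard to picture, a toy model may help.
The following is only a sketch: the class X87Stack and its method
names are made up for illustration, it covers just a few
instructions (fld, fxch, faddp, fstpl), and it swaps the values
where real hardware would merely remap register names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Toy model of the x87 register stack: eight slots addressed relative
// to a moving top-of-stack index, so st(0) is always the current top.
class X87Stack {
public:
    // st(i): the i-th register counted from the top of the stack.
    double& st(std::size_t i) {
        assert(i < depth_ && "register is empty");
        return regs_[(top_ + i) % 8];
    }

    // fld: push a value; the old st(0) becomes st(1), and so on.
    void fld(double v) {
        assert(depth_ < 8 && "fpu stack overflow");
        top_ = (top_ + 7) % 8;  // move top down by one, modulo 8
        regs_[top_] = v;
        ++depth_;
    }

    // fxch %st(i): exchange st(0) with st(i). The model swaps the
    // values; hardware achieves the same effect by renaming.
    void fxch(std::size_t i) { std::swap(st(0), st(i)); }

    // faddp %st, %st(i): st(i) += st(0), then pop the stack.
    void faddp(std::size_t i) {
        st(i) += st(0);
        pop();
    }

    // fstpl: hand back st(0) (the "store to memory") and pop.
    double fstp() {
        const double v = st(0);
        pop();
        return v;
    }

    std::size_t depth() const { return depth_; }

private:
    void pop() {
        top_ = (top_ + 1) % 8;
        --depth_;
    }

    std::array<double, 8> regs_{};
    std::size_t top_ = 0;    // index of st(0) within regs_
    std::size_t depth_ = 0;  // number of occupied registers
};
```

Push 10.0, 2.0 and 3.0, and st(0) is 3.0. After fxch(1) the 2.0 is
on top, and faddp(2) adds it to the 10.0 further down and pops it
off. That is the pattern in the listing above: bring the value you
want to consume to the top with fxch, then let a popping
instruction get rid of it.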
x86_64 doesn't make a difference here. Only SSE would, which
isn't involved.
regards
M.