Intel vs Gnu compiler output quality

Lionel B

My experience has generally been that, for CPU-intensive tasks, the
Intel compiler produces code that is about as fast as that produced by
the Gnu compiler.

However, on this simple Shootout entry, Intel seems to be 4.5 times
faster:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3

Any idea why?

Have you profiled the code? My guess would be that the bulk of the CPU
time is spent in the trig functions.

There are a host of possible explanations... maybe the Intel trig
functions are faster (but do they compute to the same level of accuracy?).
Are the optimization levels really comparable? Does it make a difference
whether IEEE floating-point compliance is enforced (e.g. the GCC
-ffast-math flag can make quite a difference)? Short of analysing the
generated assembler it's probably impossible to say.

In the past I've noticed that ICC seemed to be more aggressive about
vectorization, although recent versions of GCC do a better job (the
benchmark pages don't specify which version of GCC was used) - and I'm
not sure whether that is relevant here. You can test this yourself: I
think both compilers will tell you what they vectorise if you ask them
nicely.
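For example (the exact flag spellings vary across compiler versions, so
treat these as illustrative):

g++  -O3 -ftree-vectorizer-verbose=2 -c partialsums.cpp   # GCC 4.x
g++  -O3 -fopt-info-vec -c partialsums.cpp                # later GCC
icpc -O3 -vec-report2 -c partialsums.cpp                  # ICC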

In any case, this sort of benchmark is highly artificial and probably
quite irrelevant to real-life program performance. FWIW I too have found
very little to choose between ICC and GCC over a fair variety of
real-world numerically-intensive tasks (although I've also found later
versions of ICC on Linux to be unusably buggy).

Regards,
 
Mirco Wahab

My experience has generally been that, for CPU-intensive tasks, the
Intel compiler produces code that is about as fast as that produced by
the Gnu compiler.

However, on this simple Shootout entry, Intel seems to be 4.5 times
faster:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3

Any idea why?

because the Intel icc/icpc does magical optimizations on this code and
loads the fpu stack (on x86) from ST(0) up to ST(6) in the process,
whereas g++ (4.3) doesn't have the vigor to go further up than ST(2).

It follows that the gcc code has to do much more fldl/fildl and
fst/fstp traffic to the L1 cache, which isn't bad but not even close to
pure FPU register fiddling.
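
One can check this directly by comparing the generated assembler from
both compilers (file name just for illustration):

g++  -O3 -S partialsums.cpp -o gcc.s
icpc -O3 -S partialsums.cpp -o icc.s
grep -cE 'fldl|fstpl' gcc.s icc.s    # rough count of FPU loads/stores to memory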

That's it, basically.

Regards

M.
 
Lionel B

because the Intel icc/icpc does magical optimizations on this code and
loads the fpu stack (on x86) from ST(0) up to ST(6) in the process,
whereas g++ (4.3) doesn't have the vigor to go further up than ST(2).

It follows that the gcc code has to do much more fldl/fildl and
fst/fstp traffic to the L1 cache, which isn't bad but not even close to
pure FPU register fiddling.

That all sounds very impressive... could you possibly explain what it
means, roughly, to a non-assembler/microprocessor architecture expert?
Also, what about on x86_64?
That's it, basically.

I'll quibble with that "basically" ;-)
 
Mirco Wahab

Lionel said:
That all sounds very impressive... could you possibly explain what it
means, roughly, to a non-assembler/microprocessor architecture expert?
Also, what about on x86_64?

Shouldn't sound very impressive imho. The central part of said
benchmark is the following loop (source lines 21-39 of the entry; the
#line.column annotations in the assembler below refer back to these
line numbers):

for (int k = 1; k <= n; ++k, pot = -pot) {   // pot flips sign each iteration
    kd  = double(k);
    kd2 = kd * kd;
    kd3 = kd * kd2;

    sink = std::sin(kd);
    cosk = std::cos(kd);

    res1 += std::pow(dt, kd);                // dt is a loop-invariant double
    res2 += 1.0 / std::sqrt(kd);
    res3 += 1.0 / (kd2 + kd);
    res4 += 1.0 / (kd3 * sink * sink);
    res5 += 1.0 / (kd3 * cosk * cosk);
    res6 += 1.0 / kd;
    res7 += 1.0 / kd2;
    res8 += pot / kd;
    res9 += pot / (2.0 * kd - 1.0);
}

What one may see is a bunch of operands that are used all along the
computation of the 9 different terms (kd, kd2 etc.). To me it looks
like the Intel compiler counts the occurrences of these operands, puts
the "best" five into the upper four or five fpu registers (x86)
(ST[3] ... ST[7]), and does the increments of the res[1-9] terms
entirely out of these fpu registers.

Example:

;;; res4 += 1.0 / (kd3 * sink * sink);
;;; res5 += 1.0 / (kd3 * cosk * cosk);
;;; res6 += 1.0 / kd;
;;; res7 += 1.0 / kd2;

gives:

        fdiv    %st, %st(2)     #36.27
        fdiv    %st, %st(1)     #32.34
        fxch    %st(1)          #32.13
        faddp   %st, %st(6)     #32.13
        fldl    112(%esp)       #33.34
        fxch    %st(6)          #32.13
        fstpl   80(%esp)        #32.13
        fld     %st(4)          #33.34
        fmul    %st(6), %st     #33.34
        fmulp   %st, %st(6)     #33.41
        fdiv    %st, %st(5)     #33.41
        fldl    96(%esp)        #33.13
        faddp   %st, %st(6)     #33.13
        fxch    %st(5)          #33.13
        fstpl   96(%esp)        #33.13
        fldl    104(%esp)       #34.34
        fmul    %st, %st(4)     #34.34
        fmulp   %st, %st(4)     #34.41
        fxch    %st(3)          #34.41
        fdivr   %st(4), %st     #34.41
        [snipped]

One can immediately see that the operations use and store values
across the (almost) full fpu register set %st(0) .. %st(6); even the
last register, %st(7), is used (elsewhere). A lot of 'fxch' operations
are used too, which is '(fpu-)register renaming' and costs 0 cycles on
newer x86. This is needed to throw out operands that are no longer
used: they are 'renamed' from %st(7) to %st (which is the 'top of
stack'). To the application, the x86 fpu is a stack and can only be
used like a stack - except for this 'renaming'.

x86_64 doesn't make a difference here. Only SSE would, and SSE isn't
involved.
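
If you want to see an SSE version for comparison, you can force scalar
SSE math with gcc (flag names from memory, but both exist):

g++ -O3 -msse2 -mfpmath=sse -S partialsums.cpp -o sse.s
grep -c '%xmm' sse.s    # nonzero means SSE registers are in use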

regards

M.
 
