Intel vs Gnu compiler output quality

Lionel B

My experience has generally been that, for CPU-intensive tasks, the
Intel compiler produces code that is about as fast as that produced by
the Gnu compiler.

However, on this simple Shootout entry, Intel seems to be 4.5 times
faster:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3

Any idea why?

Have you profiled the code? My guess would be that the bulk of the CPU
time is spent in the trig functions.

There are a host of possible explanations... maybe the Intel trig
functions are faster (but do they compute to the same level of accuracy?).
Are the optimization levels really comparable? Does it make a difference
whether IEEE floating-point compliance is enforced (e.g. the GCC
-ffast-math flag can make quite a difference)? Short of analysing the
generated assembler it's probably impossible to say.

In the past I've noticed that ICC seemed to be more aggressive about
vectorization, although recent versions of GCC do a better job (the
benchmark pages don't specify which version of GCC was used) - and I'm
not sure whether that is relevant here. You can test this yourself: I
think both compilers will tell you what they vectorise if you ask them
nicely.
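For example (the exact flag spellings vary across compiler versions, so
treat these as illustrative):

g++  -O3 -ftree-vectorizer-verbose=2 -c partialsums.cpp   # GCC 4.x
g++  -O3 -fopt-info-vec -c partialsums.cpp                # later GCC
icpc -O3 -vec-report2 -c partialsums.cpp                  # ICC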

In any case, this sort of benchmark is highly artificial and probably
quite irrelevant to real-life program performance. FWIW I too have found
very little to choose between ICC and GCC over a fair variety of
real-world numerically-intensive tasks (although I've also found later
versions of ICC on Linux to be unusably buggy).

Regards,
 
Mirco Wahab

My experience has generally been that, for CPU-intensive tasks, the
Intel compiler produces code that is about as fast as that produced by
the Gnu compiler.

However, on this simple Shootout entry, Intel seems to be 4.5 times
faster:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3

Any idea why?

because the Intel icc/icpc does magical optimizations on this code and
loads the fpu stack (on x86) from ST(0) up to ST(6) in the process,
whereas g++ (4.3) doesn't have the vigor to go further up than ST(2).

It follows that the gcc code has to do much more fldl/fildl and
fst/fstp traffic to the L1 cache, which isn't bad but not even close to
pure FPU register fiddling.
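
One can check this directly by comparing the generated assembler from
both compilers (file name just for illustration):

g++  -O3 -S partialsums.cpp -o gcc.s
icpc -O3 -S partialsums.cpp -o icc.s
grep -cE 'fldl|fstpl' gcc.s icc.s    # rough count of FPU loads/stores to memory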

That's it, basically.

Regards

M.
 
Lionel B

because the Intel icc/icpc does magical optimizations on this code and
loads the fpu stack (on x86) from ST(0) up to ST(6) in the process,
whereas g++ (4.3) doesn't have the vigor to go further up than ST(2).

It follows that the gcc code has to do much more fldl/fildl and
fst/fstp traffic to the L1 cache, which isn't bad but not even close to
pure FPU register fiddling.

That all sounds very impressive... could you possibly explain what it
means, roughly, to a non-assembler/microprocessor architecture expert?
Also, what about on x86_64?
That's it, basically.

I'll quibble with that "basically" ;-)
 
Mirco Wahab

Lionel said:
That all sounds very impressive... could you possibly explain what it
means, roughly, to a non-assembler/microprocessor architecture expert?
Also, what about on x86_64?

Shouldn't sound very impressive imho. The central part of said
benchmark is the following loop (source lines 21-39 of the entry; the
#line.column annotations in the assembler below refer back to these
line numbers):

for (int k = 1; k <= n; ++k, pot = -pot) {   // pot flips sign each iteration
    kd  = double(k);
    kd2 = kd * kd;
    kd3 = kd * kd2;

    sink = std::sin(kd);
    cosk = std::cos(kd);

    res1 += std::pow(dt, kd);                // dt is a loop-invariant double
    res2 += 1.0 / std::sqrt(kd);
    res3 += 1.0 / (kd2 + kd);
    res4 += 1.0 / (kd3 * sink * sink);
    res5 += 1.0 / (kd3 * cosk * cosk);
    res6 += 1.0 / kd;
    res7 += 1.0 / kd2;
    res8 += pot / kd;
    res9 += pot / (2.0 * kd - 1.0);
}

What one may see is a bunch of operands that are used all along the
computation of the 9 different terms (kd, kd2 etc.). To me it looks
like the Intel compiler counts the occurrences of these operands, puts
the "best" five into the upper four or five fpu registers (x86)
(ST[3] ... ST[7]), and does the increments of the res[1-9] terms
entirely out of these fpu registers.

Example:

;;; res4 += 1.0 / (kd3 * sink * sink);
;;; res5 += 1.0 / (kd3 * cosk * cosk);
;;; res6 += 1.0 / kd;
;;; res7 += 1.0 / kd2;

gives:

        fdiv    %st, %st(2)     #36.27
        fdiv    %st, %st(1)     #32.34
        fxch    %st(1)          #32.13
        faddp   %st, %st(6)     #32.13
        fldl    112(%esp)       #33.34
        fxch    %st(6)          #32.13
        fstpl   80(%esp)        #32.13
        fld     %st(4)          #33.34
        fmul    %st(6), %st     #33.34
        fmulp   %st, %st(6)     #33.41
        fdiv    %st, %st(5)     #33.41
        fldl    96(%esp)        #33.13
        faddp   %st, %st(6)     #33.13
        fxch    %st(5)          #33.13
        fstpl   96(%esp)        #33.13
        fldl    104(%esp)       #34.34
        fmul    %st, %st(4)     #34.34
        fmulp   %st, %st(4)     #34.41
        fxch    %st(3)          #34.41
        fdivr   %st(4), %st     #34.41
        [snipped]

One can immediately see that the operations use and store values
across the (almost) full fpu register set %st(0) .. %st(6); even the
last register, %st(7), is used (elsewhere). A lot of 'fxch' operations
are used too, which is '(fpu-)register renaming' and costs 0 cycles on
newer x86. This is needed to throw out operands that are no longer
used: they are 'renamed' from %st(7) to %st (which is the 'top of
stack'). To the application, the x86 fpu is a stack and can only be
used like a stack - except for this 'renaming'.

x86_64 doesn't make a difference here. Only SSE would, and SSE isn't
involved.
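
If you want to see an SSE version for comparison, you can force scalar
SSE math with gcc (flag names from memory, but both exist):

g++ -O3 -msse2 -mfpmath=sse -S partialsums.cpp -o sse.s
grep -c '%xmm' sse.s    # nonzero means SSE registers are in use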

regards

M.
 
