Intel vs Gnu compiler output quality

Discussion in 'C++' started by jhc0033@gmail.com, Jun 23, 2008.

  1. jhc0033@gmail.com Guest

    My experience has generally been that, for CPU-intensive tasks, the
    Intel compiler produces code that is about as fast as that produced by
    the Gnu compiler.

    However, on this simple Shootout entry, Intel seems to be 4.5 times
    faster:

    http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3

    Any idea why?

    jhc0033@gmail.com, Jun 23, 2008
    #1

  2. Lionel B Guest

    On Mon, 23 Jun 2008 03:36:58 -0700, jhc0033@gmail.com wrote:

    > My experience has generally been that, for CPU-intensive tasks, the
    > Intel compiler produces code that is about as fast as that produced by
    > the Gnu compiler.
    >
    > However, on this simple Shootout entry, Intel seems to be 4.5 times
    > faster:
    >
    > http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3
    >
    > Any idea why?


    Have you profiled the code? My guess would be that the bulk of the CPU
    time is spent in the trig functions.

    There are a host of possible explanations... maybe the Intel trig
    functions are faster (but do they compute to the same level of accuracy?).
    Are the optimization levels really comparable? Does it make a difference
    whether IEEE floating-point compliance is enforced (e.g. the GCC
    -ffast-math flag can make quite a difference)? Short of analysing the
    generated assembler it's probably impossible to say.
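
    For instance, something along these lines would make the comparison
    fairer (my own guess at the invocations - the source file name here is
    made up, and I don't know what flags the Shootout actually used):

        g++  -O3 partialsums.cpp -o ps_gcc                 # strict IEEE semantics by default
        g++  -O3 -ffast-math partialsums.cpp -o ps_gcc_fm  # relaxed FP maths; often much faster trig/div
        icpc -O3 partialsums.cpp -o ps_icc                 # ICC relaxes FP by default, I believe
                                                           # (-fp-model precise restores strictness)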

    In the past I've noticed that ICC seemed to be more aggressive at
    vectorization, although recent versions of GCC do a better job (the
    benchmarks don't specify which version of GCC was used) - and I'm not
    sure if this is relevant here (you can test this: I think both compilers
    will tell you what they vectorise if you ask them nicely).
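
    From memory, the relevant reporting flags are roughly these (check your
    compiler version; the file name is again invented):

        g++  -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -c partialsums.cpp   # GCC 4.x vectoriser report
        icpc -O3 -vec-report2 -c partialsums.cpp                                   # ICC vectorisation report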

    In any case, this sort of benchmark is highly artificial and probably
    quite irrelevant to real-life program performance. FWIW I too have found
    very little to choose between ICC and GCC over a fair variety of real-
    world numerically-intensive tasks (although I've also found later
    versions of ICC on Linux to be unusably buggy).

    Regards,

    --
    Lionel B
     
    Lionel B, Jun 23, 2008
    #2

  3. Mirco Wahab Guest

    jhc0033@gmail.com wrote:
    > My experience has generally been that, for CPU-intensive tasks, the
    > Intel compiler produces code that is about as fast as that produced by
    > the Gnu compiler.
    >
    > However, on this simple Shootout entry, Intel seems to be 4.5 times
    > faster:
    >
    > http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3
    >
    > Any idea why?


    because the intel icc/icpc does magical
    optimizations on this code and loads the
    fpu stack (on x86) from ST(0) up to ST(6)
    in the process, whereas the g++ (4.3)
    doesn't have the vigor to go further up
    than ST(2).

    Out of this follows that the gcc code has
    to do much more fldl/fildl and fst/fstp
    to the L1, which isn't bad but not even
    close to FPU register fiddling.

    That's it, basically.
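
    You can check this yourself by dumping the assembler from both
    compilers and counting how far up the stack they go - roughly like
    this (file name made up):

        g++  -O3 -S partialsums.cpp -o ps_gcc.s
        icpc -O3 -S partialsums.cpp -o ps_icc.s
        grep -c '%st(6)' ps_icc.s    # icc reaches the upper stack slots
        grep -c '%st(6)' ps_gcc.s    # g++ 4.3 hardly touches them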

    Regards

    M.
     
    Mirco Wahab, Jun 23, 2008
    #3
  4. Lionel B Guest

    On Mon, 23 Jun 2008 18:04:49 +0200, Mirco Wahab wrote:

    > jhc0033@gmail.com wrote:
    >> My experience has generally been that, for CPU-intensive tasks, the
    >> Intel compiler produces code that is about as fast as that produced by
    >> the Gnu compiler.
    >>
    >> However, on this simple Shootout entry, Intel seems to be 4.5 times
    >> faster:
    >>
    >> http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=icpp&id=3
    >>
    >> Any idea why?

    >
    > because the intel icc/icpc does magical optimizations on this code and
    > loads the fpu stack (on x86) from ST(0) up to ST(6) in the process,
    > whereas the g++ (4.3) doesn't have the vigor to go further up than
    > ST(2).
    >
    > Out of this follows that the gcc code has to do much more fldl/fildl
    > and fst/fstp to the L1, which isn't bad but not even close to FPU
    > register fiddling.


    That all sounds very impressive... could you possibly explain what it
    means, roughly, to a non-assembler/microprocessor architecture expert?
    Also, what about on x86_64?

    > That's it, basically.


    I'll quibble with that "basically" ;-)

    --
    Lionel B
     
    Lionel B, Jun 23, 2008
    #4
  5. Mirco Wahab Guest

    Lionel B wrote:
    > On Mon, 23 Jun 2008 18:04:49 +0200, Mirco Wahab wrote:
    >> Out of this follows that the gcc code has to do much more fldl/fildl
    >> and fst/fstp to the L1, which isn't bad but not even close to FPU
    >> register fiddling.

    >
    > That all sounds very impressive... could you possibly explain what it
    > means, roughly, to a non-assembler/microprocessor architecture expert?
    > Also, what about on x86_64?


    Shouldn't sound very impressive imho. The central part of said
    benchmark is the following loop:
    21:
    for (int k = 1; k <= n; ++k, pot = -pot) {
        kd  = double(k);
        kd2 = kd * kd;
        kd3 = kd * kd2;

        sink = std::sin(kd);
        cosk = std::cos(kd);

        res1 += std::pow(dt, kd);
        res2 += 1.0 / std::sqrt(kd);
        res3 += 1.0 / (kd2 + kd);
        res4 += 1.0 / (kd3 * sink * sink);
        res5 += 1.0 / (kd3 * cosk * cosk);
        res6 += 1.0 / kd;
        res7 += 1.0 / kd2;
        res8 += pot / kd;
        res9 += pot / (2.0 * kd - 1.0);
    }
    39:

    What one may see is a bunch of operands that are
    used all along the computation of the 9 different
    terms (kd, kd2 etc). For me, it looks like the
    Intel compiler counts the occurrences of these
    operands and puts the "best" five into the upper
    fpu registers (x86) (ST(3) ... ST(7)) and does
    the increments of the res[1-9] terms entirely
    out of these fpu registers.

    Example:
    ;;; res4 += 1.0 / (kd3 * sink * sink);
    ;;; res5 += 1.0 / (kd3 * cosk * cosk);
    ;;; res6 += 1.0 / kd;
    ;;; res7 += 1.0 / kd2;
    gives:
    fdiv %st, %st(2) #36.27
    fdiv %st, %st(1) #32.34
    fxch %st(1) #32.13
    faddp %st, %st(6) #32.13
    fldl 112(%esp) #33.34
    fxch %st(6) #32.13
    fstpl 80(%esp) #32.13
    fld %st(4) #33.34
    fmul %st(6), %st #33.34
    fmulp %st, %st(6) #33.41
    fdiv %st, %st(5) #33.41
    fldl 96(%esp) #33.13
    faddp %st, %st(6) #33.13
    fxch %st(5) #33.13
    fstpl 96(%esp) #33.13
    fldl 104(%esp) #34.34
    fmul %st, %st(4) #34.34
    fmulp %st, %st(4) #34.41
    fxch %st(3) #34.41
    fdivr %st(4), %st #34.41
    [snipped]

    One can immediately see that the operations use and store stuff
    across the (almost) full fpu register set %st(0) .. %st(6).
    Even the last register, %st(7), is used (elsewhere). A lot of
    'fxch' operations are used too, which is '(fpu-) register renaming'
    and costs 0 cycles on newer x86. This is necessary to throw out
    operands that are no longer used; they are 'renamed' from %st(7) to
    %st (which is the 'top of stack'). To the application, the x86 fpu
    is a stack and can only be used like a stack - except for 'renaming'.
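
    To put the same idea into C++ terms, here is a minimal sketch (my own
    restructuring for illustration, not the Shootout source, and it only
    covers a few of the nine sums): if the shared operands are pulled into
    locals and reused, the compiler gets the chance to keep them in
    registers for the whole iteration instead of reloading them from memory.

    #include <cmath>

    // Illustration only: shared quantities computed once per iteration.
    // Keeping them in locals is, in effect, what icc does with the
    // fpu stack in the assembler above.
    double partial_sums_sketch(int n)
    {
        double res4 = 0, res5 = 0, res6 = 0, res7 = 0, res8 = 0;
        double pot = 1.0;
        for (int k = 1; k <= n; ++k, pot = -pot) {
            const double kd  = double(k);
            const double rk  = 1.0 / kd;        // one division, reused below
            const double rk2 = rk * rk;         // 1/kd^2
            const double kd3 = kd * kd * kd;
            const double s   = std::sin(kd);
            const double c   = std::cos(kd);
            res4 += 1.0 / (kd3 * s * s);
            res5 += 1.0 / (kd3 * c * c);
            res6 += rk;
            res7 += rk2;
            res8 += pot * rk;                   // pot / kd with one fdiv saved
                                                // (bit-identical only under relaxed FP)
        }
        return res4 + res5 + res6 + res7 + res8; // keep the sums live
    }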

    x86_64 doesn't make a difference here. Only SSE would, which
    isn't involved.
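
    If anyone wants to verify the SSE point: on 32-bit x86, gcc defaults to
    the x87 stack, and you can force scalar SSE maths instead and see whether
    the gap to icc changes (my own suggestion; file name made up):

        g++ -O3 -msse2 -mfpmath=sse partialsums.cpp -o ps_sse   # scalar SSE instead of x87
        g++ -O3 -mfpmath=387 partialsums.cpp -o ps_x87          # explicit x87 (the 32-bit default)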

    regards

    M.
     
    Mirco Wahab, Jun 23, 2008
    #5
  6. Lionel B Guest

    On Mon, 23 Jun 2008 19:03:03 +0200, Mirco Wahab wrote:

    > Lionel B wrote:
    >> On Mon, 23 Jun 2008 18:04:49 +0200, Mirco Wahab wrote:
    >>> Out of this follows that the gcc code has to do much more fldl/fildl
    >>> and fst/fstp to the L1, which isn't bad but not even close to FPU
    >>> register fiddling.

    >>
    >> That all sounds very impressive... could you possibly explain what it
    >> means, roughly, to a non-assembler/microprocessor architecture expert?
    >> Also, what about on x86_64?

    >
    > Shouldn't sound very impressive imho. The central part of said benchmark
    > is the following loop:


    [...]

    Thanks,

    --
    Lionel B
     
    Lionel B, Jun 24, 2008
    #6
