Cannot optimize 64bit Linux code

Discussion in 'C Programming' started by legrape@gmail.com, Feb 28, 2008.

  1. Guest

    I am porting a piece of C code to 64bit on Linux. I am using 64bit
    integers. It is a floating point intensive code and when I compile
    (gcc) on 64 bit machine, I don't see any runtime improvement when
    optimizing -O3. If I construct a small program I can get significant
    (>4x) speed improvement using -O3 versus -g. If I compile on a 32 bit
    machine, it runs 5x faster on the 64 bit machine than does the 64bit
    compiled code.

    It seems like something is inhibiting the optimization. Someone on
    comp.lang.fortran suggested it might be an alignment problem. I am
    trying to go through and eliminate all 32 bit integers right now (this
    is a pretty large hunk of code). But I thought I would survey this
    group, in case it is something naive I am missing.

    Any opinion is welcome. I really need this to run up to speed, and I
    need the big address space. Thanks in advance.

    Dick
    , Feb 28, 2008
    #1

  2. In article <>,
    <> wrote:
    >I am porting a piece of C code to 64bit on Linux. I am using 64bit
    >integers. It is a floating point intensive code and when I compile
    >(gcc) on 64 bit machine, I don't see any runtime improvement when
    >optimizing -O3.


    >It seems like something is inhibiting the optimization. Someone on
    >comp.lang.fortran suggested it might be an alignment problem.


    Possibly. It could also be a cache issue: you might have
    cache-line conflicts, or the larger size of your integers might
    mean your key data no longer fits in cache.
    --
    "The shallow murmur, but the deep are dumb." -- Sir Walter Raleigh
    Walter Roberson, Feb 28, 2008
    #2

  3. santosh Guest

    wrote:

    > I am porting a piece of C code to 64bit on Linux. I am using 64bit
    > integers. It is a floating point intensive code and when I compile
    > (gcc) on 64 bit machine, I don't see any runtime improvement when
    > optimizing -O3.
    [snip]


    This group may not be the best option. Maybe you should try a Linux or
    GCC group?

    If the same code and compilation commands produce such a runtime
    difference, then perhaps the 64 bit version of the compiler and its
    runtime libraries, as well as the system runtime libraries, are not
    yet exploiting all the optimisations possible. Did you try giving
    gcc permission to use intrinsics and SSE? Alignment could well be a
    problem, though gcc *should* have chosen the best alignment for the
    target, unless you specified otherwise. Are there any aspects of your
    code (like choice of data types, compiler specific pragmas, struct
    padding) that are perhaps selected for 32 bit systems and thus less
    than optimal under 64 bit?
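
    For example (just a guess at flags worth trying; the file name is
    invented, not from the original post):

        gcc -O3 -march=native -mfpmath=sse prog.c -o prog -lm
        gcc -O3 -march=native -ffast-math prog.c -o prog -lm

    -march=native tunes for the host CPU (gcc 4.2 and later), and
    -ffast-math relaxes IEEE conformance, so verify the results are
    still sane if you use it.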

    Did you try with the Intel compiler? If it produces better code, then
    that is evidence, perhaps, that gcc isn't emitting good code here.
    santosh, Feb 28, 2008
    #3
  4. cr88192 Guest

    <> wrote in message
    news:...
    >I am porting a piece of C code to 64bit on Linux. I am using 64bit
    > integers. It is a floating point intensive code and when I compile
    > (gcc) on 64 bit machine, I don't see any runtime improvement when
    > optimizing -O3.
    [snip]



    OT:

    this is actually an issue related to the mismatch between current processor
    performance behavior and the calling conventions used on Linux x86-64.

    they were like:
    let's base everything on a variant of the "register" calling convention, and
    use SSE for all the floating point math rather than crufty old x87.

    the problem is that current processors don't quite agree, and in practice
    this sort of thing actually goes *slower*...

    it seems, actually, that x87, lots of mem loads/stores, and complex
    addressing forms can be used to better effect wrt performance than SSE,
    register-heavy approaches, and "simple" addressing forms (in seeming
    opposition to current "optimization wisdom").

    I can't give much explanation as to why this is exactly, but it has been my
    observation (periodic performance testing during the ongoing
    compiler-writing task...).

    my guess is because these things are heavily optimized, given that much
    existing x86 code uses them heavily (this may change in the future though,
    as 64 bit code becomes more prevalent...).
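
    one can test this on a given chip easily enough (untested suggestion,
    file names invented): build the same code both ways and time it,

        gcc -O3 -mfpmath=387 prog.c -o prog_x87 -lm
        gcc -O3 -mfpmath=sse prog.c -o prog_sse -lm

    on x86-64 gcc defaults to SSE math, so only the first line actually
    changes anything.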


    my guess is that the calling convention was designed according to some
    misguided sense of "optimization wisdom", rather than good solid benchmarks.

    better performance could probably have been achieved at present just by
    pretending the x86-64 was just an x86 with more registers and guaranteed
    present SSE.

    not only this, but the convention is designed in such a way as to be awkward
    as well, and leaves open the question of how to effectively pull off
    varargs...



    or, at least, this is what happens on my processor (an Athlon 64 X2 4400+).

    I don't know if it is similar on Intel chips.


    > Dick
    cr88192, Feb 29, 2008
    #4
  5. Bartc Guest

    <> wrote in message
    news:...
    >I am porting a piece of C code to 64bit on Linux. I am using 64bit
    > integers. It is a floating point intensive code and when I compile
    > (gcc) on 64 bit machine, I don't see any runtime improvement when
    > optimizing -O3.
    [snip]


    Hesitant to attempt an answer as I know nothing about 64-bit or gcc,
    but...

    Does the program compiled in 32-bit mode run faster when compiled with
    optimisation than without (on a 32 or 64-bit machine)? In other words,
    what scale of improvement are you expecting? (This is on the main
    program.)

    Is the improvement really likely to be 5x or more? If not, that sounds
    like something wrong with the 64-bit-compiled version itself, never
    mind the optimisation, if the 32-bit version can run that much faster.

    Do you have the capability to look at a sample of code and see what
    exactly the 64-bit compiler is generating? I doubt it's going to be as
    silly as using (and emulating) 128-bit floats, but it does sound like
    there's something seriously wrong. It seems unlikely that using int64
    instead of int32 would slow things down 5 times or more.

    An alignment fault would be a compiler error; but you can print out a few
    data addresses and see whether they are on 8/16-byte boundaries or whatever
    is recommended.
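
    Something like this would do as a throwaway check (array names
    invented; substitute whatever your hot loop touches):

        #include <stdio.h>

        double a[1000], b[1000];   /* stand-ins for your real data */

        int main(void)
        {
            /* print addresses and their offsets within a 16-byte line */
            printf("a=%p (mod 16 = %d)\n", (void *)a,
                   (int)((unsigned long)a % 16));
            printf("b=%p (mod 16 = %d)\n", (void *)b,
                   (int)((unsigned long)b % 16));
            return 0;
        }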

    Is the small program doing anything similar to the big one? It may be
    benefiting from smaller instruction/data cache requirements.

    You might find that longs and pointers turn from 32 bits to 64 bits
    when compiled for 64-bit (int typically stays 32 bits on Linux), and
    therefore use twice the memory bandwidth if you have a lot of them,
    which might hit some of the performance. You might like to check the
    size of pointers, if you don't need 64-bit addressing.
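
    A quick sanity check (untested sketch):

        #include <stdio.h>

        int main(void)
        {
            printf("int=%d long=%d ptr=%d\n",
                   (int)sizeof(int), (int)sizeof(long),
                   (int)sizeof(void *));
            return 0;
        }

    On 64-bit Linux (LP64) that should print 4/8/8: int stays 32 bits,
    while longs and pointers double.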

    --
    Bart
    Bartc, Feb 29, 2008
    #5
  6. cr88192 Guest

    "Bartc" <> wrote in message
    news:6tIxj.15225$...
    >
    > <> wrote in message
    > news:...
    >> [original post snipped]
    >
    > Hesitant to attempt an answer as I know nothing about 64-bit or gcc,
    > but...
    >
    > Does the program compiled in 32-bit mode run faster when compiled with
    > optimisation than without (on a 32 or 64-bit machine)? In other words,
    > what scale of improvement are you expecting? (This is on the main
    > program.)
    >
    > Is the improvement really likely to be 5x or more? If not, that sounds
    > like something wrong with the 64-bit-compiled version itself, never
    > mind the optimisation, if the 32-bit version can run that much faster.
    >


    yes, that is a bit harsh...


    > Do you have the capability to look at a sample of code and see what
    > exactly the 64-bit compiler is generating? I doubt it's going to be as
    > silly as using (and emulating) 128-bit floats, but it does sound like
    > there's something seriously wrong. It seems unlikely that using int64
    > instead of int32 would slow things down 5 times or more.
    >


    as for int32 vs int64: int32 should actually be faster on x86-64
    (after all, 32-bit ints have less-complex instruction encodings, aka
    no REX prefix, ..., and the core of x86-64 is, after all, still x86...).

    as for emulating 128 bit floats, it is conceivably possible. I am aware,
    in any case, that on x86-64 gcc uses a 16-byte long double, but whether
    this is an 80-bit x87 float stuffed into a 128-bit slot (with magic
    shuffling between the SSE regs and the FPU), or a true emulated 128-bit
    float, I don't know (I have not investigated gcc's output in this case).

    note that SSE does not support 80 bit floats, and the conventions used on
    x86-64 generally don't use the FPU (it may be used for some calculations,
    but not much else), so if using long double, it is very possible something
    funky is going on.

    if this is the case, maybe try switching over to double and see if anything
    is different.


    > An alignment fault would be a compiler error; but you can print out a few
    > data addresses and see whether they are on 8/16-byte boundaries or
    > whatever
    > is recommended.
    >


    yes. unless one is using "__attribute__((packed))" everywhere, it should not
    be a problem...


    > Is the small program doing anything similar to the big one? It may be
    > benefiting from smaller instruction/data cache requirements.
    >
    > You might find that longs and pointers turn from 32 bits to 64 bits
    > when compiled for 64-bit (int typically stays 32 bits on Linux), and
    > therefore use twice the memory bandwidth if you have a lot of them,
    > which might hit some of the performance. You might like to check the
    > size of pointers, if you don't need 64-bit addressing.
    >



    yes, I will agree here...


    > --
    > Bart
    >
    >
    >
    cr88192, Feb 29, 2008
    #6
  7. Dick Dowell Guest

    Thanks for all the hints and thoughts.

    My small program is:

    #include <stdio.h>
    #include <math.h>
    #include <time.h>
    #include <unistd.h>   /* for _POSIX_THREAD_CPUTIME, _POSIX_CPUTIME */

    int main(void)
    {
        struct timespec ts;
        double x, y;
        int i;
        long long n;

        n = 15000000;
        n *= 10000;
        fprintf(stderr, "LONG %lld\n", n);
        printf(" _POSIX_THREAD_CPUTIME _POSIX_CPUTIME %d %d\n",
               (int)_POSIX_THREAD_CPUTIME, (int)_POSIX_CPUTIME);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        n = ts.tv_nsec;                      /* remember the start time */
        fprintf(stderr, "Before %ld sec %ld nsec\n",
                (long)ts.tv_sec, ts.tv_nsec);
        y = 3.3;
        for (i = 0; i < 111100000; i++) {
            x = sqrt(y);
            y += 1.0;
        }
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        fprintf(stderr, "After %ld sec %ld nsec (elapsed %lld nsec)\n",
                (long)ts.tv_sec, ts.tv_nsec,
                (long long)ts.tv_nsec - n);
        return 0;
    }
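
    (For anyone trying this: a build line along the lines of
    "gcc -O3 small.c -o small -lm -lrt" should work; -lrt is needed for
    clock_gettime() with the glibc of this era, and the file name here
    is invented.)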

    It shows considerable improvement with -O3.

    I think the problem is something less esoteric than the cache,
    word size, etc. One thing I didn't say: I have multithreading loaded,
    though no new threads are created by these runs. I have tried a newer
    Red Hat, but have not tried the Intel compilers.

    Dick
    Dick Dowell, Mar 1, 2008
    #7
  8. Dick Dowell Guest

    I think I misspoke about my timer program. That one was used to attempt
    to measure thread time. You can remove the references to the timers
    and run it. It only shows about a 2x improvement with optimization.

    The large difference I have actually seen is a 32-bit compile on
    another machine, run on the 64-bit machine (12 sec), versus 64-bit
    code compiled and run on the 64-bit machine (70 sec).

    Sorry for the confusion.

    Dick
    Dick Dowell, Mar 1, 2008
    #8
  9. In article <>,
    Dick Dowell <> wrote:
    >Thanks for all the hints and thoughts.


    >My small program is:
    [snip]
    > for (i = 0; i < 111100000; i++) {
    >     x = sqrt(y);
    >     y += 1.0;
    > }
    [snip]


    >It shows considerable improvement with -O3.


    You do not do anything with x after you compute it. Any good
    optimizer would optimize away the x=sqrt(y) statement. Once that
    is done, the optimizer could even eliminate the loop completely
    and replace it by y += 111100000. Compilers that did one or
    both of these optimizations would produce much faster code than
    compilers that did not. Your problem might have nothing to do
    with 64 bit integers and everything to do with which optimizations
    the compiler performs.
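
    If you want the loop timed honestly, use the result so it cannot be
    discarded; a minimal (untested) variant of the loop above:

        double sum = 0.0;
        for (i = 0; i < 111100000; i++) {
            sum += sqrt(y);    /* consume the result */
            y += 1.0;
        }
        fprintf(stderr, "sum = %f\n", sum);  /* printing sum forces the work */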
    --
    "The human mind is so strangely capricious, that, when freed from
    the pressure of real misery, it becomes open and sensitive to the
    ideal apprehension of ideal calamities." -- Sir Walter Scott
    Walter Roberson, Mar 2, 2008
    #9
  10. Dick Dowell Guest

    Thanks for all the suggestions. I've discovered that the
    ineffectiveness of the optimization is data dependent. I managed to
    profile the code, and 78% of the runtime is spent in something called

    __mul [1] (from the gprof output, the [1] just means #1 cpu user)

    Here's some more of the gprof report:

    granularity: each sample hit covers 4 byte(s) for 0.01% of 109.71 seconds

    index  % time    self   children    called    name
                                                  <spontaneous>
    [1]      78.0   85.55       0.00              __mul [1]
    -----------------------------------------------

    Dick
    Dick Dowell, Mar 10, 2008
    #10
  11. Dick Dowell <> writes:
    > Thanks for all the suggestions. I've discovered that the
    > ineffectiveness of the optimization is data dependent. I managed to
    > profile the code, and 78% of the runtime is spent in something called
    >
    > __mul [1] (from the gprof output, the [1] just means #1 cpu user)
    [snip]


    It's likely (or at least plausible) that __mul is a multiplication
    routine invoked by the compiler for what looks like an ordinary
    multiplication in your code. Perhaps there's some form of
    multiplication that your CPU doesn't directly support.

    In that case, you *might* get a significant performance improvement by
    re-working your algorithm to avoid multiplications. For example, a
    multiplication in a loop can often be replaced by an addition (though
    that's the kind of optimization a good compiler should be able to
    perform).
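
    For instance (schematic only, not from your code; the types of a, i,
    n, step, and v are whatever fits):

        /* before: one multiplication per iteration */
        for (i = 0; i < n; i++)
            a[i] = i * step;

        /* after: strength-reduced to an addition */
        v = 0;
        for (i = 0; i < n; i++) {
            a[i] = v;
            v += step;
        }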

    Before you consider this, find out exactly what __mul is used for, and
    *measure* the performance improvement you can get by avoiding it
    (assuming you can).

    It's even possible that the hardware supports the multiplication
    directly, but the compiler doesn't know it; the solution might be as
    simple as adding a command-line option to tell the compiler it can use
    a CPU instruction rather than a subroutine call.

    I'm assuming here that you're already using a high optimization level
    in your compiler. Worrying about the performance of unoptimized code
    would be a waste of time unless you seriously mistrust the optimizer.

    --
    Keith Thompson (The_Other_Keith) <>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Mar 10, 2008
    #11