Array optimizing problem in C++?

Discussion in 'C++' started by Razii, Mar 23, 2008.

  1. Razii

    Razii Guest

    From an old post by James Kanze

    On Apr 9 2003, 5:42 pm, (James Kanze) wrote:

    >When I pass an "array" to a function in C/C++, I actually pass
    >a pointer to the first element. And the compiler, when it compiles the
    >function, only sees pointers -- it has no way of determining any
    >relationship between them. Consider, for example, a smoothing function
    >(in C or earlier C++):
    >
    > void
    > smooth( double* dest,
    > double const* src,
    > size_t len )
    > {
    > for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
    > dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    > }
    > }
    >
    >
    >The compiler cannot arrange to use the src[ i + 1 ] value from the
    >preceding pass through the loop as the src[ i ] value in the current
    >pass, since the write to dest[ i - 1 ] might have changed it. In Java,
    >it can, since two arrays are either identical or disjoint.
    >
    >
    >This sort of code shows up frequently. In benchmarks from Java
    >suppliers comparing Java to C++, of course:). But also in any number
    >of applications dealing with physical measures: meteorology, geophysical
    >research (oil prospection), etc.


    Out of curiosity, I tried to test whether the above is true. It didn't
    make any difference. In fact, C++ was a bit faster (not by much, just
    6%). Probably due to array bounds checking in Java; if there is indeed
    an aliasing issue with C++ arrays, overall there is no difference.



    ==c++==
    #include <iostream>
    #include <cstdlib>
    #include <ctime>

    void fill (double* src, int len);
    void smooth (double* dest, double const* src, int len);

    int main()
    {
        const int len = 50000;

        double src_array [len] = {0};
        double dest_array [len] = {0};

        fill(src_array, len);

        clock_t start = clock();

        for (int i = 0; i < 10000; i++)
            smooth(dest_array, src_array, len);

        clock_t endt = clock();

        std::cout << "Time smooth(): "
                  << double(endt - start) / CLOCKS_PER_SEC * 1000 << " ms\n";

        // doesn't work without the following cout on vc++
        // (printing a result keeps the loop from being optimized away)
        std::cout << dest_array[0];

        return 0;
    }

    void smooth (double* dest, double const* src, int len)
    {
        for (int i = 1; i < len - 1; i++) {
            dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
        }
    }

    void fill (double* src, int len)
    {
        srand((unsigned)std::time(0));

        for (int i = 0; i < len; ++i) {
            src[i] = rand();
        }
    }




    ==== java ========

    public class Arrays {
        public static void main (String[] arg)
        {
            final int LEN = 50000;
            double[] src_array = new double [LEN];
            fill(src_array, LEN);
            double[] dest_array = new double [LEN];

            long start = System.currentTimeMillis();

            for (int i = 0; i < 10000; i++)
                smooth(dest_array, src_array, LEN);

            long end = System.currentTimeMillis();

            System.out.println("Time smooth(): " + (end - start) + " ms");
        }

        static void smooth (double[] dest, double[] src, int len)
        {
            for (int i = 1; i < len - 1; i++) {
                dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
            }
        }

        static void fill (double[] src, int len)
        {
            for (int i = 0; i < len; i++)
                src[i] = Math.random();
        }
    }
    Razii, Mar 23, 2008
    #1

  2. Razii

    Guest

    On Mar 23, 4:08 am, Razii <> wrote:
    > From an old post by James Kanze
    >
    > On Apr 9 2003, 5:42 pm, (James Kanze) wrote:
    >
    >
    >
    > >When I pass an "array" to a function in C/C++, I actually pass
    > >a pointer to the first element. And the compiler, when it compiles the
    > >function, only sees pointers -- it has no way of determining any
    > >relationship between them. Consider, for example, a smoothing function
    > >(in C or earlier C++):

    >
    > > void
    > > smooth( double* dest,
    > > double const* src,
    > > size_t len )
    > > {
    > > for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
    > > dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    > > }
    > > }

    >
    > >The compiler cannot arrange to use the src[ i + 1 ] value from the
    > >preceding pass through the loop as the src[ i ] value in the current
    > >pass, since the write to dest[ i - 1 ] might have changed it. In Java,
    > >it can, since two arrays are either identical or disjoint.

    >
    > >This sort of code shows up frequently. In benchmarks from Java
    > >suppliers comparing Java to C++, of course:). But also in any number
    > >of applications dealing with physical measures: meteorology, geophysical
    > >research (oil prospection), etc.

    >
    > Out of curiosity, I tried to test whether the above is true. It didn't
    > make any difference. In fact, C++ was a bit faster (not by much, just
    > 6%). Probably due to array bounds checking in Java; if there is indeed
    > an aliasing issue with C++ arrays, overall there is no difference.


    If you are going for blinding speed be sure to use proper optimization
    flags; otherwise you are only calculating benchmarks for poorly
    optimized code, which, like most of your benchmarks, is meaningless:

    $ g++ -O0 smooth.cpp
    9884.867 ms

    $ g++ -O3 smooth.cpp
    8791.123 ms

    $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
    1207.944 ms

    $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-loops smooth.cpp
    1084.385 ms

    This is your code slightly modified to use QueryPerformanceCounter for
    timing (using Windows). Most of the speed up is a result of -ffast-math.

    Based on the quality and rigor of the other tests I've seen you do,
    I'd guess that you missed a few optimization flags on the Java side as
    well.

    Jason
    , Mar 23, 2008
    #2

  3. Razii

    Guest

    On Mar 23, 4:35 am, ""
    <> wrote:
    > If you are going for blinding speed be sure to use proper optimization
    > flags; otherwise you are only calculating benchmarks for poorly
    > optimized code, which, like most of your benchmarks, is meaningless:
    >
    > $ g++ -O0 smooth.cpp
    > 9884.867 ms
    >
    > $ g++ -O3 smooth.cpp
    > 8791.123 ms
    >
    > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
    > 1207.944 ms
    >
    > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-loops smooth.cpp
    > 1084.385 ms


    I left one out:

    $ g++ -O3 -mfpmath=sse -msse3 smooth.cpp
    7379.883 ms

    Compare to -O3 without having GCC generate SSE instructions for you.

    Jason
    , Mar 23, 2008
    #3
  4. Razii

    Guest

    Also, wrt James' original post:

    > > void
    > > smooth( double* dest,
    > > double const* src,
    > > size_t len )
    > > {
    > > for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
    > > dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    > > }
    > > }

    >
    > >The compiler cannot arrange to use the src[ i + 1 ] value from the
    > >preceding pass through the loop as the src[ i ] value in the current
    > >pass, since the write to dest[ i - 1 ] might have changed it. In Java,
    > >it can, since two arrays are either identical or disjoint.


    I am not sure what you would expect in either language. I believe that
    James had been expecting it to cache src[i + 1] in an FPU register, and
    use that instead of accessing the value the next time through.

    As it stands right now, VS at least (I didn't check GCC) generates
    assembler instructions that are more equivalent to this:

    fpu_register = src[i - 1]
    fpu_register += src[i]
    fpu_register += src[i + 1]
    fpu_register /= 3
    dest[i - 1] = fpu_register

    Using fld, fadd, fdiv, and fstp on Intel machines. It never carries
    src[i + 1] over to the next iteration anyway. I have not tested this or
    done any research, but I suspect this is still a bit faster than:

    fpu_register = src[i - 1]
    fpu_register += other_fpu_register
    other_fpu_register = src[i + 1]
    fpu_register += other_fpu_register
    fpu_register /= 3
    dest[i - 1] = fpu_register

    Which I believe is what the original expected "optimization" was.
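    As a concrete sketch of that expected optimization (my own illustration,
    not from James' post; the name smooth_hoisted is made up), hoisting the
    loads by hand amounts to carrying src[i] and src[i + 1] in locals across
    iterations. Note this is only legal when the caller guarantees dest and
    src do not overlap, which is exactly the promise the compiler cannot
    infer from bare pointers:

    ```cpp
    #include <cstddef>

    // Hand-hoisted variant of smooth(): each src element is loaded from
    // memory only once, then reused on the next two iterations.
    // Valid only if dest and src never overlap.
    void smooth_hoisted(double* dest, double const* src, std::size_t len)
    {
        if (len < 3) return;
        double prev = src[0];   // plays the role of src[i - 1]
        double cur  = src[1];   // plays the role of src[i]
        for (std::size_t i = 1; i < len - 1; ++i) {
            double next = src[i + 1];             // the only load per pass
            dest[i - 1] = (prev + cur + next) / 3;
            prev = cur;                           // shift instead of reloading
            cur  = next;
        }
    }
    ```

    If dest aliased src, the shifted values would go stale after the store,
    which is precisely why the compiler refuses to do this on its own.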

    I have removed comp.lang.java.programmer from this one, since it is
    unrelated to Java.

    Jason
    , Mar 23, 2008
    #4
  5. Razii

    Razii Guest

    On Sun, 23 Mar 2008 01:35:06 -0700 (PDT), ""
    <> wrote:

    >If you are going for blinding speed be sure to use proper optimization
    >flags; otherwise you are only calculating benchmarks for poorly
    >optimized code, which, like most of your benchmarks, is meaningless:


    Where did you get that I am not using proper optimizing flag? I said
    nothing about flags in the last post.

    >$ g++ -O0 smooth.cpp
    >9884.867 ms


    I am using VC++.

    >I'd guess that you missed a few optimization flags on the Java side as
    >well.


    There are no optimization flags for the Java compiler or VM. Post proof
    and links.
    Razii, Mar 23, 2008
    #5
  6. Razii

    Jon Harrop Guest

    wrote:
    > If you are going for blinding speed be sure to use proper optimization
    > flags; otherwise you are only calculating benchmarks for poorly
    > optimized code, which, like most of your benchmarks, is meaningless:
    >
    > $ g++ -O0 smooth.cpp
    > 9884.867 ms
    >
    > $ g++ -O3 smooth.cpp
    > 8791.123 ms
    >
    > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
    > 1207.944 ms


    GCC's -ffast-math option breaks semantics, so it is not a valid
    optimization. I would not use it in a comparison against Java unless you
    know that accuracy is not required for the given task.

    > $ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-loops smooth.cpp
    > 1084.385 ms
    >
    > This is your code slightly modified to use QueryPerformanceCounter for
    > timing (using Windows). Most of the speed up is a result of -ffast-math.
    >
    > Based on the quality and rigor of the other tests I've seen you do,
    > I'd guess that you missed a few optimization flags on the Java side as
    > well.


    I have altered all implementations of this benchmark to perform progressive
    blurring from one array to the other and back rather than overwriting the
    destination array repeatedly. This slightly improved performance in all
    cases.

    On my Athlon 64 X2 4400+ (2.2Ghz) running AMD64 Debian Linux, I get
    best-of-three results:

    g++: 4.080 g++ -O3 bench.cpp -o bench; time ./bench
    g++2: 4.080 g++ -O3 bench2.cpp -o bench2; time ./bench2
    ocaml: 4.048 ocamlopt bench.ml -o bench; time ./bench
    java: 4.016 javac Arrays.java; time java Arrays

    On the same hardware running 32-bit Windows XP Pro, I get:

    F#: 3.970 fsc -O3

    This shows an insignificant difference in performance which, if nothing
    else, is interesting because the second C++ program uses bounds checking. I
    was also surprised to see OCaml do so well on this benchmark because it
    makes no attempt to hoist bounds checks. F# on .NET typically beats all
    open source compilers on Linux on this kind of benchmark (float array
    intensive). C# presumably would too.

    The bench2 program is the following, more idiomatic, C++ version that uses
    bounds checking:

    #include <vector>
    #include <iostream>
    #include <cstdlib>
    #include <ctime>

    void smooth (std::vector<double> &dest, std::vector<double> &src)
    {
        for (int i = 1; i < (int)src.size() - 1; i++)
            dest.at(i - 1) = (src.at(i - 1) + src.at(i) + src.at(i + 1)) / 3;
    }

    void fill (std::vector<double> &src)
    {
        srand((unsigned)std::time(0));

        for (int i = 0; i < (int)src.size(); ++i)
            src.at(i) = rand();
    }

    int main()
    {
        const int len = 50000;

        std::vector<double> src_array(len);
        std::vector<double> dest_array(len);

        fill(src_array);

        clock_t start = clock();

        for (int i = 0; i < 5000; i++)
        {
            smooth(dest_array, src_array);
            smooth(src_array, dest_array);
        }

        clock_t endt = clock();

        std::cout << "Time smooth(): "
                  << double(endt - start) / CLOCKS_PER_SEC * 1000 << " ms\n";

        return 0;
    }

    Here is my OCaml equivalent:

    let smooth dst src =
      let n = Array.length dst in
      for i = 1 to n - 2 do
        dst.(i-1) <- (src.(i-1) +. src.(i) +. src.(i+1)) /. 3.
      done

    let () =
      let n = 50000 in
      let src = Array.init n (fun _ -> Random.float 1.) in
      let dst = Array.create n 0. in
      let t = Sys.time() in
      for i = 1 to 5000 do
        smooth dst src;
        smooth src dst;
      done;
      Printf.printf "Time smooth(): %fs\n" (Sys.time() -. t)

    --
    Dr Jon D Harrop, Flying Frog Consultancy Ltd.
    http://www.ffconsultancy.com/products/?u
    Jon Harrop, Mar 23, 2008
    #6
  8. Razii

    red floyd Guest

    Razii wrote:
    > On Sun, 23 Mar 2008 01:35:06 -0700 (PDT), ""
    > <> wrote:
    >
    >> If you are going for blinding speed be sure to use proper optimization
    >> flags; otherwise you are only calculating benchmarks for poorly
    >> optimized code, which, like most of your benchmarks, is meaningless:

    >
    > Where did you get that I am not using proper optimizing flag? I said
    > nothing about flags in the last post.
    >
    >> $ g++ -O0 smooth.cpp
    >> 9884.867 ms

    >
    > I am using vc++
    >


    Well, then, you can't make sweeping statements about C++ vs. Java. You
    can only make statements about *a particular implementation* of C++ vs.
    Java.
    red floyd, Mar 23, 2008
    #8
  9. Razii

    Lew Guest

    red floyd wrote:
    > Well, then, you can't make sweeping statements about C++ vs. Java. You
    > can only make statements about *a particular implementation* of C++ vs.
    > Java.


    This is one of the procedural flaws common to all benchmarks. Generally
    speaking, one can only hope for full disclosure of the conditions for a
    benchmark, and that those conditions approximate some meaningful aspect of
    one's actual business need.

    This does not mean that benchmarks are useless, only that one cannot decry
    them for limitations inherent to the benchmark process itself.

    --
    Lew
    Lew, Mar 23, 2008
    #9
  10. Razii

    Razii Guest

    On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <> wrote:

    >Well, then, you can't make sweeping statements about C++ vs. Java. You
    >can only make statements about *a particular implementation* of C++ vs.
    >Java.


    I used the proper flag /O2 in VC++. Also, when you are deploying
    commercial software, you will have to use flags that target the
    least-common-denominator processor. That's a disadvantage of C++ vs. a
    JIT language. The JIT compiler knows what processor it is running on,
    and can generate code specifically for that processor. Thus, I won't
    use anything other than /O2 for C++, because, as I said, when you are
    deploying commercial software, you will have to use flags that target
    the least-common-denominator processor anyway.
    Razii, Mar 23, 2008
    #10
  11. Razii

    Guest

    On Mar 23, 9:08 am, Razii <> wrote:
    > From an old post by James Kanze
    >
    > On Apr 9 2003, 5:42 pm, (James Kanze) wrote:
    >
    >
    >
    >
    >
    > >When I pass an "array" to a function in C/C++, I actually pass
    > >a pointer to the first element.  And the compiler, when it compiles the
    > >function, only sees pointers -- it has no way of determining any
    > >relationship between them. Consider, for example, a smoothing function
    > >(in C or earlier C++):

    >
    > >    void
    > >    smooth( double* dest,
    > >            double const* src,
    > >            size_t len )
    > >    {
    > >        for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
    > >            dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    > >        }
    > >    }

    >
    > >The compiler cannot arrange to use the src[ i + 1 ] value from the
    > >preceding pass through the loop as the src[ i ] value in the current
    > >pass, since the write to dest[ i - 1 ] might have changed it.  In Java,
    > >it can, since two arrays are either identical or disjoint.

    >
    > >This sort of code shows up frequently.  In benchmarks from Java
    > >suppliers comparing Java to C++, of course:).  But also in any number
    > >of applications dealing with physical measures: meteorology, geophysical
    > >research (oil prospection), etc.

    >
    > Out of curiosity, I tried to test if the above is true. It didn't make
    > any difference. In fact, C++ was a bit faster (not by much, just 6%).
    > Probably due to array bound check in Java, if there is in indeed an
    > issue with C++ arrays, overall there is no difference.


    The issue C++ has with arrays is known as pointer aliasing. C has the
    keyword "restrict" to deal with this problem, but C++ does not. The
    "restrict" keyword should be added in the next C++ standard. Most C++
    compilers currently provide their own "restrict" keywords, though.
    Anyway, C++ has std::valarray. Arrays constructed with std::valarray
    are free from aliasing. So, in the absolute, the Java language has no
    performance edge over the current version of the C++ language in this
    domain. However, even if, theoretically, std::valarray is free from
    pointer aliasing, in practice this isn't always the case, depending
    on the compiler.
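    As an illustration of those vendor "restrict" keywords, here is a
    sketch using the __restrict spelling (accepted by both g++ and VC++;
    it is a non-standard extension, so check your compiler's
    documentation, and the name smooth_restrict is my own). The no-alias
    promise lets the compiler reuse src[i + 1] across iterations, as in
    the Java case:

    ```cpp
    #include <cstddef>

    // smooth() with a no-alias promise: __restrict tells the compiler
    // that dest and src never refer to overlapping memory, so it may
    // keep src[i] and src[i + 1] in registers across loop iterations.
    // (__restrict is non-standard; the spelling varies by compiler.)
    void smooth_restrict(double* __restrict dest,
                         double const* __restrict src,
                         std::size_t len)
    {
        if (len < 3) return;   // also avoids len - 1 underflow for len == 0
        for (std::size_t i = 1; i < len - 1; ++i)
            dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
    }
    ```

    Calling this with overlapping dest and src is undefined behavior: the
    promise, once given, is the caller's to keep.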

    > if there is in indeed an issue with C++ arrays, overall there is no difference.


    With the right optimization flags, the smooth function would probably
    get inlined, and static analysis of pointer aliasing should happen,
    allowing the C++ compiler to perform no-aliasing optimizations.

    > Probably due to array bound check in Java


    A good Java compiler should have applied bounds-checking elimination
    in this case, especially since the smooth function should be inlined.

    As a side note, your program should handle ints instead of doubles, to
    focus more on the performance of array access.

    Alexandre Courpron.
    , Mar 23, 2008
    #11
  12. Razii

    Guest

    On Mar 23, 12:00 pm, Razii <> wrote:
    > On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <> wrote:
    > >Well, then, you can't make sweeping statements about C++ vs. Java. You
    > >can only make statements about *a particular implementation* of C++ vs.
    > >Java.

    >
    > I used the proper flag /O2 in vc++. Also, when you are deploying a
    > commercial software, you will have to use flags that target the
    > least-common-denominator processor. That's a disadvantage of C++ vs
    > JIT language.


    There are three things that are not correct about this:

    1) The -ffast-math flag to GCC does not use any special processor
    instructions. It simply removes certain assumptions, which is safe
    when you know they aren't true. This is the flag with the major
    performance increase. Do not confuse "targeting the
    least-common-denominator processor" with "not using an ideal compiler
    for the application". Whether you compile your commercial code with
    GCC or VC++ is irrelevant to what machine you are targeting. If your
    goal is to target the least-common-denominator, then you would leave
    off, for example, the SSE optimizations. However, in my example, it
    was -ffast-math that was responsible for the largest speedup. Target
    platform is not relevant.

    2) It's more of a consequence of only distributing to the
    least-common-denominator than of using C++. That's a choice you can
    make. It would be completely reasonable, for example, for a vendor of
    high-performance research applications to distribute SSE-optimized
    builds. E.g. Folding@Home (non-commercial, http://folding.stanford.edu)
    maintains a GPU-optimized version of their software as well as a
    "normal" one; and their non-GPU client checks for various processor
    extensions at run-time (such as SSE) and uses optimized code
    accordingly. This is the same with 32- and 64-bit applications. This
    is also common in OpenGL application development, where applications
    can look for specific OpenGL features and take advantage of them,
    rather than always only targeting the least-common-denominator of
    graphics hardware. The same thing applies to Java wrt OpenGL.
    Depending on how you *want* to look at it, an Intel machine with SSE
    support is just as different from an Intel machine without SSE as it
    is from a PowerPC with AltiVec (same with multicore machines). It just
    so happens that you can easily target the least-common-denominator for
    those Intel machines, but nothing stops you from releasing
    platform-optimized builds. So it is not a disadvantage of C++ vs.
    anything; it is a disadvantage of sticking to the
    least-common-denominator.

    3) A JIT language has an extra layer of abstraction before the
    hardware. This makes it difficult to perform these particular kinds of
    optimizations anyway. It's apples and oranges: the Java JIT compiler
    is a completely different kind of beast than a native code compiler.
    The optimizations available for a JIT compiler to make, and the
    approaches it can take to making them, are too different from those of
    a native compiler to make a valid comparison. They're just different
    tools.

    What you are talking about is a difference between targeting the
    least-common-denominator of platforms and targeting specific
    platforms, which is an issue present no matter what language you are
    using. You are not talking about a difference between any two specific
    languages. Furthermore, what *I* was talking about is a difference
    between different compilers targeting the same platform (I am sorry my
    point wasn't clear; I had meant to show that the compiler can generate
    very fast code for you, in particular -ffast-math [which does not
    target specific hardware features] on GCC, more so than the
    architecture-specific options).

    > The JIT compiler knows what processor it is running on,
    > and can generate code specifically for that processor.


    True, and a valid point. However, depending on the nature of what you
    are compiling, you easily could lose a lot of information about the
    original intention of the code in the original optimizing pass that
    limits how effective processor-specific optimizations made by the JIT
    compiler can be.

    For example (this is a very specific example but please extend it to
    general cases): let's say you have some Java code that, I don't know,
    multiplies the elements of two large arrays of floating-point values
    together in a loop. You compile it to bytecode, and in the process
    javac unrolls that loop to optimize a bit. Then you run it on a
    platform that, say, provides a specific instruction for multiplying
    large arrays of floating-point values together very quickly. If the
    original compiler had targeted this platform to begin with, it could
    have seen your access pattern and used that optimized instruction.
    Instead it unrolled the loop, a least-common-denominator optimization.
    Now the JIT compiler sees the unrolled loop, and because it does not
    see the sequential access pattern that was very apparent in the
    original code, it fails to analyze the unrolled loop and determine
    that it should be "re-rolled" and use the optimized instruction
    instead. Now this particular example might not be that great, I know,
    because you could conceivably construct a JIT compiler that can
    recognize and re-roll unrolled loops; but the general point is what I
    said: you can potentially lose a lot of information by compiling code
    that you can not recover, and this information could be critical to
    performing the best optimizations off the bat.

    Consequently, the JIT compiler must turn to other techniques for
    optimization, such as caching compiled code for faster startup times,
    etc. Again, it's a whole different beast.

    > Thus, I won't
    > use anything other than /O2 for c++... because, as I said, when you
    > are deploying a commercial software, you will have to use flags that
    > target the least-common-denominator processor anyway.


    I addressed that above: Again, 1) some optimizations, like -ffast-math
    or -funroll-loops (and be careful, of course, you can't blindly use
    these, you have to make sure they make sense for your code), do not
    target a specific processor, and 2) there is no rule that says you
    have to target the least-common-denominator processor; and a heavy
    duty research application, such as what James Kanze *originally*
    cited, would most certainly take advantage of special features on the
    target hardware.

    Jason
    , Mar 23, 2008
    #12
  13. Razii

    Guest

    On Mar 23, 11:12 am, Razii <> wrote:
    > I said
    > nothing about flags in the last post.


    As was pointed out in your other thread, these are important details
    that you should not leave out; especially in this particular benchmark
    (which is not I/O bound like the last one, and is heavily affected by
    compiler optimizations).

    Jason
    , Mar 23, 2008
    #13
  14. Razii

    Guest

    <> wrote in message
    news:0d80b694-8609-4f80-9aeb-
    ...
    > (I am sorry my point
    > wasn't clear, I had meant to show that the compiler can generate very
    > fast code for you, in particular -ffast-math [which does not target
    > specific hardware features] on GCC moreso than the
    > architecture-specific options).


    Here are some key missing timings with GCC and VS, Razii:

    $ g++ -O2 smooth.cpp
    8677.623 ms

    $ g++ -O2 -ffast-math smooth.cpp
    1077.232 ms

    $ g++ -O2 -ffast-math -funroll-loops smooth.cpp
    919.622 ms

    With CL 14 (VS2005):
    7275.611 ms

    No platform-specific optimizations were used. Note the order of
    magnitude speed up using -ffast-math and -funroll-loops with GCC,
    which generates code that can still be run on the
    least-common-denominator of Intel-compatible platforms.

    Also note that -O2 provides slightly better performance than -O3. Just
    goes to show why I shouldn't be using -O3, I guess :)

    Jason
    , Mar 23, 2008
    #14
  15. Razii

    Guest

    On Mar 23, 1:18 pm, ""
    <> wrote:
    > Also note that -O2 provides slightly better performance than -O3. Just
    > goes to show why I shouldn't be using -O3, I guess :)


    I take that back, I got it down to 903ms with -O3 over -O2. Perhaps my
    machine was just in a bad mood yesterday.

    I guess I shouldn't post in this thread for a while, I'm getting a
    little uncomfortable with my post ratio here. :-(

    Jason
    , Mar 23, 2008
    #15
  16. Razii

    Guest

    On Mar 23, 5:00 pm, Razii <> wrote:
    > On Sun, 23 Mar 2008 15:45:21 GMT, red floyd <> wrote:
    > >Well, then, you can't make sweeping statements about C++ vs. Java.  You
    > >can only make statements about *a particular implementation* of C++ vs.
    > >Java.

    >
    > I used the proper flag /O2 in vc++. Also, when you are deploying a
    > commercial software, you will have to use flags that target the
    > least-common-denominator processor. That's a disadvantage of C++ vs
    > JIT language. The JIT compiler knows what processor it is running on,
    > and can generate code specifically for that processor. Thus, I won't
    > use anything other than /O2 for c++... because, as I said, when you
    > are deploying a commercial software, you will have to use flags that
    > target the least-common-denominator processor anyway.


    You are using Visual C++. You should use at least the following flags,
    in order to enable whole program optimization and link-time code
    generation :

    cl /O2 /GL prog.cpp /link /ltcg

    You should run all your benchmarks again with at least those options
    enabled. By the way, as I said earlier, you should use ints instead of
    doubles. It will save you a lot of trouble.

    Alexandre Courpron.
    , Mar 23, 2008
    #16
  17. Razii

    Guest

    On Mar 23, 1:18 pm, ""
    <> wrote:
    > With CL 14 (VS2005):
    > 7275.611 ms


    With /O2 and LTCG enabled!
    , Mar 23, 2008
    #17
  18. James Kanze

    James Kanze Guest

    On 23 mar, 09:58, ""
    <> wrote:
    > Also, wrt James' original post:


    > > > void
    > > > smooth( double* dest,
    > > > double const* src,
    > > > size_t len )
    > > > {
    > > > for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
    > > > dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    > > > }
    > > > }


    > > >The compiler cannot arrange to use the src[ i + 1 ] value
    > > >from the preceding pass through the loop as the src[ i ]
    > > >value in the current pass, since the write to dest[ i - 1 ]
    > > >might have changed it. In Java, it can, since two arrays
    > > >are either identical or disjoint.


    > I am not sure what you would expect in either language. I
    > believe that James had been expecting it to cache src[i + 1] in an
    > fpu register, and use that instead of accessing the value the
    > next time through.


    Exactly. It's a very common optimization.

    > As it stands right now, VS at least (I didn't check GCC)
    > generates assembler instructions that are more equivalent to
    > this:


    > fpu_register = src[i - 1]
    > fpu_register += src[i]
    > fpu_register += src[i + 1]
    > fpu_register /= 3
    > dest[i - 1] = fpu_register


    > Using fld, fadd, fdiv, and fstp on Intel machines. It never
    > loads src[i + 1] anyways. I have not tested this or done any
    > research, but I suspect this is still a bit faster than:


    > fpu_register = src[i - 1]
    > fpu_register += other_fpu_register
    > other_fpu_register = src[i + 1]
    > fpu_register += other_fpu_register
    > fpu_register /= 3
    > dest[i - 1] = fpu_register


    The fastest solution would pre-charge the first two values, and
    only read one new value each time through the loop. (Don't
    forget that memory bandwidth will be the limiting factor for
    this type of loop.) Intel's stack architecture may make
    optimizing this somewhat more difficult, but it should still be
    possible. You'll end up with more instructions, but fewer memory
    accesses and a better run time.

    But of course, you can't do this in C++, because the write to
    dest[ i - 1 ] might modify one of the values you're holding in a
    register. (Of course, some compilers do generate two versions of
    the loop, with code which checks for aliasing first, and uses
    one or the other, depending on whether aliasing is present. But
    this isn't the usual case.)

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Mar 23, 2008
    #18
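    [Editorial note: the hand-optimized loop described above can be
    sketched as follows. This is a sketch, not anyone's posted code;
    `smooth_rolling` is a hypothetical name, and `__restrict` is a
    compiler extension accepted by GCC, Clang, and MSVC, not standard
    C++.]

    ```cpp
    #include <cstddef>

    // Keep the last two source values in locals ("registers") and load
    // only one new value per pass. Writing this by hand is safe only if
    // the caller guarantees dest and src don't overlap; the __restrict
    // qualifiers let the optimizer assume the same disjointness that
    // Java gets for free.
    void smooth_rolling(double* __restrict dest,
                        double const* __restrict src,
                        std::size_t len)
    {
        if (len < 3) return;
        double a = src[0];          // pre-charge the first two values
        double b = src[1];
        for (std::size_t i = 1; i < len - 1; ++i) {
            double c = src[i + 1];  // the only load per iteration
            dest[i - 1] = (a + b + c) / 3;
            a = b;                  // slide the window
            b = c;
        }
    }
    ```

    With the window held in locals, each iteration does one load and one
    store instead of three loads and a store.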
  19. Razii

    Razii Guest

    On Sun, 23 Mar 2008 10:29:44 -0700 (PDT), James Kanze
    <> wrote:

    >> I am not sure what you would expect in either language. I
    >> believe that James had been expecting it to cache src[i + 1] in an
    >> fpu register, and use that instead of accessing the value the
    >> next time through.

    >
    >Exactly. It's a very common optimization.


    Post an example that I can test where aliasing is a problem. In the
    example that I posted, C++ was faster.
    Razii, Mar 23, 2008
    #19
  20. Razii

    Razii Guest

    On Sun, 23 Mar 2008 10:23:07 -0700 (PDT), wrote:

    >You should rerun all your benchmarks with at least those options
    >enabled


    What other benchmarks? :) I only posted one IO example. This
    post was not a benchmark but a question asking Kanze to prove his
    claim by posting an example (that I can test) where aliasing in C++
    makes optimizing arrays hard (or impossible). With the example that
    he gave (which I used),

    for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
        dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    }

    C++ was in fact faster. Obviously aliasing was not the issue he
    claimed it would be, at least in this case.
    Razii, Mar 23, 2008
    #20
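    [Editorial note: a minimal sketch of the kind of example being asked
    for here, constructed for illustration rather than taken from the
    thread: call `smooth` with overlapping `dest` and `src`, so that a
    write on one pass feeds a read on the next, and any compiler that
    cached src values across iterations would compute a different
    answer.]

    ```cpp
    #include <cstddef>
    #include <iostream>

    // Kanze's smoothing loop, as quoted above.
    void smooth(double* dest, double const* src, std::size_t len) {
        for (std::size_t i = 1; i < len - 1; ++i)
            dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
    }

    int main() {
        double a[6]   = {0, 0, 9, 0, 0, 0};
        double b[6]   = {0, 0, 9, 0, 0, 0};
        double out[6] = {0};

        smooth(out, a, 5);    // disjoint arrays: out[1] = (0+9+0)/3 = 3
        smooth(b + 1, b, 5);  // dest overlaps src shifted by one element:
                              // the write to b[1] on the first pass is
                              // read back as src[1] on the second pass,
                              // so b[2] = (3+9+0)/3 = 4
        std::cout << out[1] << " vs " << b[2] << '\n';  // prints "3 vs 4"
    }
    ```

    Because a call like the second one is legal, a compiler translating
    `smooth` in isolation cannot reuse src values across iterations
    unless it proves (or checks at run time) that the pointers don't
    alias.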