Array optimization problem in C++?


Razii

From an old post by James Kanze

When I pass an "array" to a function in C/C++, I actually pass
a pointer to the first element. And the compiler, when it compiles the
function, only sees pointers -- it has no way of determining any
relationship between them. Consider, for example, a smoothing function
(in C or earlier C++):

void smooth( double* dest,
             double const* src,
             size_t len )
{
    for ( size_t i = 1 ; i < len - 1 ; ++i ) {
        dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    }
}


The compiler cannot arrange to use the src[ i + 1 ] value from the
preceding pass through the loop as the src[ i ] value in the current
pass, since the write to dest[ i - 1 ] might have changed it. In Java,
it can, since two arrays are either identical or disjoint.
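To make the missed optimization concrete, here is a minimal sketch (mine, not from Kanze's post) of the value rotation a compiler could perform if it knew the two arrays never overlap. The hypothetical smooth_noalias is only legal when the caller guarantees dest and src are disjoint; with plain pointers the compiler must assume they might alias and re-load src on every pass:

#include <cstddef>

// Hypothetical hand-optimized variant: each iteration loads only
// src[i + 1] and carries the other two values over in registers.
// Correct only if dest and src do not overlap.
void smooth_noalias( double* dest, double const* src, std::size_t len )
{
    if ( len < 3 ) return;
    double a = src[ 0 ], b = src[ 1 ];
    for ( std::size_t i = 1 ; i < len - 1 ; ++i ) {
        double c = src[ i + 1 ];
        dest[ i - 1 ] = (a + b + c) / 3 ;
        a = b;   // rotate values instead of re-reading src
        b = c;
    }
}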


This sort of code shows up frequently. In benchmarks from Java
suppliers comparing Java to C++, of course:). But also in any number
of applications dealing with physical measures: meteorology, geophysical
research (oil prospection), etc.

Out of curiosity, I tried to test whether the above is true. It made no
difference. In fact, C++ was a bit faster (not by much, just 6%). That is
probably due to array bounds checking in Java; if there is indeed an
aliasing issue with C++ arrays, overall it makes no difference.



==== C++ ====
#include <iostream>
#include <cstdlib>
#include <ctime>

void fill (double* src, int len);
void smooth (double* dest, double const* src, int len);

int main()
{
    const int len = 50000;

    double src_array [len] = {0};
    double dest_array [len] = {0};

    fill(src_array, len);

    clock_t start = clock();

    for (int i = 0; i < 10000; i++)
        smooth (dest_array, src_array, len);

    clock_t endt = clock();

    std::cout << "Time smooth(): "
              << double(endt - start) / CLOCKS_PER_SEC * 1000 << " ms\n";

    // doesn't work without the following cout on vc++
    // (printing a result keeps the optimizer from discarding the loop)
    std::cout << dest_array[0];

    return 0;
}

void smooth (double* dest, double const* src, int len)
{
    for (int i = 1; i < len - 1; i++) {
        dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
    }
}

void fill (double* src, int len)
{
    srand((unsigned)std::time(0));

    for (int i = 0; i < len; ++i) {
        src[i] = rand();   // fill with pseudo-random values
    }
}




==== Java ====

import java.util.*;

public class Arrays {
    public static void main (String[] arg)
    {
        final int LEN = 50000;
        double[] src_array = new double [LEN];
        fill(src_array, LEN);
        double[] dest_array = new double [LEN];

        long start = System.currentTimeMillis();

        for (int i = 0; i < 10000; i++)
            smooth(dest_array, src_array, LEN);

        long end = System.currentTimeMillis();

        System.out.println("Time smooth(): " + (end - start) + " ms");
    }

    static void smooth (double[] dest, double[] src, int len)
    {
        for (int i = 1; i < len - 1; i++) {
            dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
        }
    }

    static void fill (double[] src, int len)
    {
        for (int i = 0; i < len; i++)
            src[i] = Math.random();   // fill with pseudo-random values
    }
}
 

jason.cipriani

From an old post by James Kanze

When I pass an "array" to a function in C/C++, I actually pass
a pointer to the first element. And the compiler, when it compiles the
function, only sees pointers -- it has no way of determining any
relationship between them. Consider, for example, a smoothing function
(in C or earlier C++):
void smooth( double* dest,
             double const* src,
             size_t len )
{
    for ( size_t i = 1 ; i < len - 1 ; ++i ) {
        dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    }
}
The compiler cannot arrange to use the src[ i + 1 ] value from the
preceding pass through the loop as the src[ i ] value in the current
pass, since the write to dest[ i - 1 ] might have changed it. In Java,
it can, since two arrays are either identical or disjoint.
This sort of code shows up frequently. In benchmarks from Java
suppliers comparing Java to C++, of course:). But also in any number
of applications dealing with physical measures: meteorology, geophysical
research (oil prospection), etc.

Out of curiosity, I tried to test whether the above is true. It made no
difference. In fact, C++ was a bit faster (not by much, just 6%). That is
probably due to array bounds checking in Java; if there is indeed an
aliasing issue with C++ arrays, overall it makes no difference.

If you are going for blinding speed be sure to use proper optimization
flags; otherwise you are only calculating benchmarks for poorly
optimized code, which, like most of your benchmarks, is meaningless:

$ g++ -O0 smooth.cpp
9884.867 ms

$ g++ -O3 smooth.cpp
8791.123 ms

$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
1207.944 ms

$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-loops smooth.cpp
1084.385 ms

This is your code slightly modified to use QueryPerformanceCounter for
timing (using Windows). Most of the speed up is a result of -ffast-math.

Based on the quality and rigor of the other tests I've seen you do,
I'd guess that you missed a few optimization flags on the Java side as
well.

Jason
 

jason.cipriani

If you are going for blinding speed be sure to use proper optimization
flags; otherwise you are only calculating benchmarks for poorly
optimized code, which, like most of your benchmarks, is meaningless:

$ g++ -O0 smooth.cpp
9884.867 ms

$ g++ -O3 smooth.cpp
8791.123 ms

$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
1207.944 ms

$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-loops smooth.cpp
1084.385 ms

I left one out:

$ g++ -O3 -mfpmath=sse -msse3 smooth.cpp
7379.883 ms

Compare to -O3 without having GCC generate SSE instructions for you.

Jason
 

Razii

If you are going for blinding speed be sure to use proper optimization
flags; otherwise you are only calculating benchmarks for poorly
optimized code, which, like most of your benchmarks, is meaningless:

Where did you get the idea that I am not using proper optimization flags?
I said nothing about flags in my last post.
$ g++ -O0 smooth.cpp
9884.867 ms

I am using VC++.
I'd guess that you missed a few optimization flags on the Java side as
well.

There are no optimization flags for the Java compiler or VM. Post proof
and links.
 

Jon Harrop

If you are going for blinding speed be sure to use proper optimization
flags; otherwise you are only calculating benchmarks for poorly
optimized code, which, like most of your benchmarks, is meaningless:

$ g++ -O0 smooth.cpp
9884.867 ms

$ g++ -O3 smooth.cpp
8791.123 ms

$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math smooth.cpp
1207.944 ms

GCC's -ffast-math option breaks semantics, so it is not a valid
optimization. I would not use it in a comparison against Java unless you
know that accuracy is not required for the given task.
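(As a minimal illustration of that point, my example rather than anything from the thread: -ffast-math licenses reassociation, and IEEE 754 addition is not associative, so reassociated code can produce a different answer.)

#include <cstdio>

int main()
{
    // Under -ffast-math, GCC may rewrite (a + b) + c as a + (b + c);
    // in IEEE 754 double arithmetic these differ:
    double a = 1e16, b = -1e16, c = 1.0;
    std::printf("(a + b) + c = %g\n", (a + b) + c); // prints 1
    std::printf("a + (b + c) = %g\n", a + (b + c)); // prints 0
    return 0;
}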
$ g++ -O3 -mfpmath=sse -msse3 -march=prescott -ffast-math -funroll-loops smooth.cpp
1084.385 ms

This is your code slightly modified to use QueryPerformanceCounter for
timing (using Windows). Most of the speed up is a result of -ffast-math.

Based on the quality and rigor of the other tests I've seen you do,
I'd guess that you missed a few optimization flags on the Java side as
well.

I have altered all implementations of this benchmark to perform progressive
blurring from one array to the other and back rather than overwriting the
destination array repeatedly. This slightly improved performance in all
cases.

On my Athlon 64 X2 4400+ (2.2GHz) running AMD64 Debian Linux, I get
best-of-three results:

g++: 4.080 g++ -O3 bench.cpp -o bench; time ./bench
g++2: 4.080 g++ -O3 bench2.cpp -o bench2; time ./bench2
ocaml: 4.048 ocamlopt bench.ml -o bench; time ./bench
java: 4.016 javac Arrays.java; time java Arrays

On the same hardware running 32-bit Windows XP Pro, I get:

F#: 3.970 fsc -O3

This shows an insignificant difference in performance which, if nothing
else, is interesting because the second C++ program uses bounds checking. I
was also surprised to see OCaml do so well on this benchmark because it
makes no attempt to hoist bounds checks. F# on .NET typically beats all
open source compilers on Linux on this kind of benchmark (float array
intensive). C# presumably would too.

The bench2 program is the following, more idiomatic, C++ version that uses
bounds checking:

#include <vector>
#include <iostream>
#include <cstdlib>
#include <ctime>

void smooth (std::vector<double> &dest, std::vector<double> &src)
{
    for (int i = 1; i < (int)src.size() - 1; i++)
        dest.at(i - 1) = (src.at(i - 1) + src.at(i) + src.at(i + 1)) / 3;
}

void fill (std::vector<double> &src)
{
    srand((unsigned)std::time(0));

    for (int i = 0; i < (int)src.size(); ++i)
        src.at(i) = rand();
}

int main()
{
    const int len = 50000;

    std::vector<double> src_array(len);
    std::vector<double> dest_array(len);

    fill(src_array);

    clock_t start = clock();

    for (int i = 0; i < 5000; i++)
    {
        smooth(dest_array, src_array);
        smooth(src_array, dest_array);
    }

    clock_t endt = clock();

    std::cout << "Time smooth(): "
              << double(endt - start) / CLOCKS_PER_SEC * 1000 << " ms\n";

    return 0;
}

Here is my OCaml equivalent:

let smooth dst src =
  let n = Array.length dst in
  for i = 1 to n - 2 do
    dst.(i-1) <- (src.(i-1) +. src.(i) +. src.(i+1)) /. 3.
  done

let () =
  let n = 50000 in
  let src = Array.init n (fun _ -> Random.float 1.) in
  let dst = Array.make n 0. in
  let t = Sys.time () in
  for i = 1 to 5000 do
    smooth dst src;
    smooth src dst
  done;
  Printf.printf "Time smooth(): %fs\n" (Sys.time () -. t)
 

red floyd

Razii said:
Where did you get the idea that I am not using proper optimization flags?
I said nothing about flags in my last post.


I am using VC++.

Well, then, you can't make sweeping statements about C++ vs. Java. You
can only make statements about *a particular implementation* of C++ vs.
Java.
 

Lew

red floyd said:
Well, then, you can't make sweeping statements about C++ vs. Java. You
can only make statements about *a particular implementation* of C++ vs.
Java.

This is one of the procedural flaws common to all benchmarks. Generally
speaking, one can only hope for full disclosure of the conditions for a
benchmark, and that those conditions approximate some meaningful aspect of
one's actual business need.

This does not mean that benchmarks are useless, only that one cannot decry
them for limitations inherent to the benchmark process itself.
 

Razii

Well, then, you can't make sweeping statements about C++ vs. Java. You
can only make statements about *a particular implementation* of C++ vs.
Java.

I used the proper flag /O2 in VC++. Also, when you are deploying
commercial software, you have to use flags that target the
least-common-denominator processor. That's a disadvantage of C++ versus a
JIT language: the JIT compiler knows what processor it is running on,
and can generate code specifically for that processor. Thus, I won't
use anything other than /O2 for C++, because, as I said, commercial
software has to target the least-common-denominator processor anyway.
 

courpron

From an old post by James Kanze

When I pass an "array" to a function in C/C++, I actually pass
a pointer to the first element.  And the compiler, when it compiles the
function, only sees pointers -- it has no way of determining any
relationship between them. Consider, for example, a smoothing function
(in C or earlier C++):
   void
   smooth( double* dest,
           double const* src,
           size_t len )
   {
       for ( size_t i = 1 ; i < len - 1 ; ++ i ) {
           dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
       }
   }
The compiler cannot arrange to use the src[ i + 1 ] value from the
preceding pass through the loop as the src[ i ] value in the current
pass, since the write to dest[ i - 1 ] might have changed it.  In Java,
it can, since two arrays are either identical or disjoint.
This sort of code shows up frequently.  In benchmarks from Java
suppliers comparing Java to C++, of course:).  But also in any number
of applications dealing with physical measures: meteorology, geophysical
research (oil prospection), etc.

Out of curiosity, I tried to test whether the above is true. It made no
difference. In fact, C++ was a bit faster (not by much, just 6%). That is
probably due to array bounds checking in Java; if there is indeed an
aliasing issue with C++ arrays, overall it makes no difference.

The issue C++ has with arrays is known as pointer aliasing. C has the
keyword "restrict" to deal with this problem, but C++ does not. The
"restrict" keyword should be added in the next C++ standard. Most C++
compilers currently provide their own "restrict" keyword, though.
Anyway, C++ has std::valarray. Arrays constructed with std::valarray
are free from aliasing, so, in the absolute, the Java language has no
performance edge over the current version of the C++ language in this
domain. However, even if std::valarray is theoretically free from
pointer aliasing, in practice this isn't always the case, depending
on the compiler.
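For illustration, here is a minimal sketch (mine, not from the post) of the smoothing kernel using such a compiler-specific qualifier; __restrict is the spelling accepted by GCC, Clang, and Visual C++, and it promises the compiler that the two arrays never overlap:

#include <cstddef>

// Non-standard __restrict qualifier (GCC/Clang/MSVC): it promises the
// compiler that dest and src never alias, so src values can stay in
// registers across the store to dest. The caller must keep that promise.
void smooth_restrict( double* __restrict dest,
                      double const* __restrict src,
                      std::size_t len )
{
    for ( std::size_t i = 1 ; i < len - 1 ; ++i ) {
        dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
    }
}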
if there is indeed an aliasing issue with C++ arrays, overall it makes no difference.

With the right optimization flags, the smooth function would probably
get inlined, and static analysis of pointer aliasing should then happen,
allowing the C++ compiler to apply no-alias optimizations.
That is probably due to array bounds checking in Java

A good Java compiler should have applied bounds-checking elimination
in this case, especially since the smooth function should be inlined.

As a side note, your program should handle ints instead of doubles, to
focus more on the performance of array access.

Alexandre Courpron.
 

jason.cipriani

I used the proper flag /O2 in VC++. Also, when you are deploying
commercial software, you have to use flags that target the
least-common-denominator processor. That's a disadvantage of C++ versus a
JIT language.

There are three things that are not correct about this:

1) The -ffast-math flag to GCC does not use any special processor
instructions. It simply relaxes certain assumptions, which is safe
when you know your code does not depend on them. This is the flag with
the major performance increase. Do not confuse "targeting the
least-common-denominator processor" with "not using an ideal compiler
for the application". Whether you compile your commercial code with GCC
or VC++ is irrelevant to what machine you are targeting. If your goal is
to target the least common denominator, then you would leave off, for
example, the SSE optimizations. However, in my example, it was
-ffast-math that was responsible for the largest speedup. Target
platform is not relevant.

2) It's more a consequence of distributing only least-common-denominator
builds than of using C++. That's a choice you can make. It would
be completely reasonable, for example, for a vendor of high-performance
research applications to distribute SSE-optimized builds.
E.g. Folding@Home (non-commercial, http://folding.stanford.edu)
maintains a GPU-optimized version of their software as well as a
"normal" one; and their non-GPU client checks for various processor
extensions at run-time (such as SSE) and uses optimized code
accordingly, as sketched below. This is the same with 32- and 64-bit
applications. It is also common in OpenGL application development,
where applications can look for specific OpenGL features and take
advantage of them, rather than always targeting only the least common
denominator of graphics hardware. The same thing applies to Java wrt
OpenGL. Depending on how you *want* to look at it, an Intel machine with
SSE support is just as different from an Intel machine without SSE as it
is from a PowerPC with AltiVec (same with multicore machines). It just
so happens that you can easily target the least common denominator for
those Intel machines, but nothing stops you from releasing
platform-optimized builds. So it is not a disadvantage of C++ vs.
anything; it is a disadvantage of sticking to the
least-common-denominator build.
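A minimal sketch of that run-time dispatch idea (my illustration, assuming GCC or Clang, whose __builtin_cpu_supports builtin queries the CPU actually running the program; smooth_sse3 is a stand-in that in a real build would live in its own translation unit compiled with -msse3):

#include <cstdio>

// Two builds of the same kernel; here smooth_sse3 is a placeholder,
// but the dispatch pattern below is the point.
static void smooth_generic(double* dest, const double* src, int len)
{
    for (int i = 1; i < len - 1; i++)
        dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
}

static void smooth_sse3(double* dest, const double* src, int len)
{
    smooth_generic(dest, src, len);   // stand-in for an SSE3-tuned build
}

void smooth(double* dest, const double* src, int len)
{
    // GCC/Clang builtin: ask the running CPU what it supports
    if (__builtin_cpu_supports("sse3"))
        smooth_sse3(dest, src, len);      // optimized path
    else
        smooth_generic(dest, src, len);   // least-common-denominator path
}

int main()
{
    double src[8] = {1, 2, 3, 4, 5, 6, 7, 8}, dest[8] = {0};
    smooth(dest, src, 8);
    std::printf("%g\n", dest[0]);  // prints 2
    return 0;
}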

3) A JIT language has an extra layer of abstraction before the
hardware, which makes these particular kinds of optimizations difficult
to perform anyway. It's apples and oranges: the Java JIT compiler is a
completely different kind of beast from a native-code compiler, and the
optimizations available to a JIT compiler, and the approaches it can
take to making them, are too different from those of a native compiler
to make a valid comparison. They're just different tools.

What you are talking about is the difference between targeting the least
common denominator of platforms and targeting specific platforms, which
is an issue present no matter what language you are using. You are not
talking about a difference between any two specific languages.
Furthermore, what *I* was talking about is a difference between
different compilers targeting the same platform (I am sorry my point
wasn't clear; I had meant to show that the compiler can generate very
fast code for you, in particular -ffast-math [which does not target
specific hardware features] on GCC, more so than the
architecture-specific options).
The JIT compiler knows what processor it is running on,
and can generate code specifically for that processor.

True, and a valid point. However, depending on the nature of what you
are compiling, you could easily lose a lot of information about the
original intention of the code in the original optimizing pass, which
limits how effective processor-specific optimizations made by the JIT
compiler can be.

For example (this is a very specific example but please extend it to
general cases): let's say you have some Java code that, I don't know,
multiplies the elements of two large arrays of floating-point values
together in a loop. You compile it to bytecode, and in the process
javac unrolls that loop to optimize a bit. Then you run it on a
platform that, say, provides a specific instruction for multiplying
large arrays of floating-point values together very quickly. If the
original compiler had targeted this platform to begin with, it could
have seen your access pattern and used that optimized instruction.
Instead it unrolled the loop, a least-common-denominator optimization.
Now the JIT compiler sees the unrolled loop, and because it does not
see the sequential access pattern that was very apparent in the
original code, it fails to analyze the unrolled loop and determine
that it should be "re-rolled" and use the optimized instruction
instead. Now this particular example might not be that great, I know,
because you could conceivably construct a JIT compiler that can
recognize and re-roll unrolled loops; but the general point is what I
said: you can potentially lose a lot of information by compiling code
that you can not recover, and this information could be critical to
performing the best optimizations off the bat.
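To make the hypothetical concrete, here is a sketch (mine, with made-up function names) of the same kind of multiply loop before and after manual unrolling; the sequential pattern a vectorizer wants to see is explicit in the first form and buried in the second:

#include <cstddef>

// Rolled form: the sequential access pattern is explicit, so a
// vectorizing compiler (or JIT) can map it onto SIMD hardware.
void mul_rolled(double* out, const double* a, const double* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}

// Manually unrolled form (assumes n is a multiple of 4): the same
// work, but the pattern is now spread over four separate statements,
// which a later compiler pass must recognize and "re-roll" first.
void mul_unrolled(double* out, const double* a, const double* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        out[i]     = a[i]     * b[i];
        out[i + 1] = a[i + 1] * b[i + 1];
        out[i + 2] = a[i + 2] * b[i + 2];
        out[i + 3] = a[i + 3] * b[i + 3];
    }
}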

Consequently, the JIT compiler must turn to other techniques for
optimization, such as caching compiled code for faster startup times,
etc. Again, it's a whole different beast.
Thus, I won't use anything other than /O2 for C++, because, as I said,
commercial software has to target the least-common-denominator processor
anyway.

I addressed that above. Again: 1) some optimizations, like -ffast-math
or -funroll-loops (and be careful, of course; you can't blindly use
these, you have to make sure they make sense for your code), do not
target a specific processor, and 2) there is no rule that says you
have to target the least-common-denominator processor; a heavy-duty
research application, such as what James Kanze *originally* cited,
would most certainly take advantage of special features on the
target hardware.

Jason
 

jason.cipriani

I said
nothing about flags in the last post.

As was pointed out in your other thread, these are important details
that you should not leave out; especially in this particular benchmark
(which is not I/O bound like the last one, and is heavily affected by
compiler optimizations).

Jason
 

jason.cipriani

(I am sorry my point wasn't clear; I had meant to show that the compiler
can generate very fast code for you, in particular -ffast-math [which
does not target specific hardware features] on GCC, more so than the
architecture-specific options).

Here are some key missing timings with GCC and VS, Razii:

$ g++ -O2 smooth.cpp
8677.623 ms

$ g++ -O2 -ffast-math smooth.cpp
1077.232 ms

$ g++ -O2 -ffast-math -funroll-loops smooth.cpp
919.622 ms

With CL 14 (VS2005):
7275.611 ms

No platform-specific optimizations were used. Note the order-of-magnitude
speedup using -ffast-math and -funroll-loops with GCC, which generates
code that can still be run on the least common denominator of
Intel-compatible platforms.

Also note that -O2 provides slightly better performance than -O3. Just
goes to show why I shouldn't be using -O3, I guess :)

Jason
 

jason.cipriani

Also note that -O2 provides slightly better performance than -O3. Just
goes to show why I shouldn't be using -O3, I guess :)

I take that back, I got it down to 903ms with -O3 over -O2. Perhaps my
machine was just in a bad mood yesterday.

I guess I shouldn't post in this thread for a while, I'm getting a
little uncomfortable with my post ratio here. :-(

Jason
 

courpron

I used the proper flag /O2 in VC++. Also, when you are deploying
commercial software, you have to use flags that target the
least-common-denominator processor. That's a disadvantage of C++ versus a
JIT language: the JIT compiler knows what processor it is running on,
and can generate code specifically for that processor. Thus, I won't
use anything other than /O2 for C++, because, as I said, commercial
software has to target the least-common-denominator processor anyway.

You are using Visual C++. You should use at least the following flags,
in order to enable whole-program optimization and link-time code
generation:

cl /O2 /GL prog.cpp /link /ltcg

You should run all your benchmarks again with at least those options
enabled. By the way, as I said earlier, you should use ints instead of
doubles. It will save you a lot of trouble.

Alexandre Courpron.
 

Razii

You should run all your benchmarks again with at least those options
enabled

What other benchmarks? :) I only posted one I/O example. This
post was not a benchmark but a challenge to Kanze to prove his claim by
posting an example (that I can test) where aliasing in C++ makes
optimizing arrays hard (or impossible). With the example that he gave
(which I used),

for ( size_t i = 1 ; i < len - 1 ; ++i ) {
    dest[ i - 1 ] = (src[ i - 1 ] + src[ i ] + src[ i + 1 ]) / 3 ;
}

C++ was in fact faster. Evidently, aliasing was not the issue in this
case that he claimed it to be.
 

Razii

And there was no speed difference after switching to:

cl /O2 /GL prog.cpp /link /ltcg

The times I am getting for VC++ are:

Time smooth(): 5391 ms
Time smooth(): 5375 ms
Time smooth(): 5453 ms
Time smooth(): 5421 ms

For Java

C:\>java -server Arrays
Time smooth(): 5610 ms

C:\>java -server Arrays
Time smooth(): 5625 ms

C:\>java -server Arrays
Time smooth(): 5562 ms

Note that I am using the server VM (-server). The client VM is a bit
slower in this case.

After changing double to int,

(VC++ with cl /O2 /GL prog.cpp /link /ltcg)

Time smooth(): 1000 ms
Time smooth(): 1000 ms
Time smooth(): 1000 ms

and

C:\>java -server Arrays
Time smooth(): 2541 ms
 

jason.cipriani

GCC's -ffast-math option breaks semantics, so it is not a valid optimization.

Only sometimes; and it's a valid optimization. Specifically, in this
case, the results are identical. Mostly, in my experience, you start
to lose precision with -ffast-math when you start doing things beyond
simple arithmetic, such as sqrt() and cos(), or when you get into the
realm of overflows and NaNs.

In case anybody is curious, the Intel compiler yields similar results
to VS, and to GCC with SSE3 enabled (but no -ffast-math), which is the
expected result:

icl /Ox /QxP /Qipo /Qunroll-aggressive smooth.cpp

Was about 7400 ms for me. With:

icl /Ox /QxP /Qipo /Qprec-div- /Qunroll-aggressive smooth.cpp

Dropping it down to 1100 ms (ICC's /Qprec-div- is similar in spirit to
GCC's -ffast-math).

Following are 3 source files and a Makefile; I used MinGW GCC 3.4.5.
You will want to implement your own tick()/tock() functions; the
windows.h #include is only for those. The output, for me, is:

$ ./smooth.exe
no -ffast-math: 8796.27
-ffast-math: 923.052
1e-014
delta: 0
they are precisely equal.


==== Makefile ====

CFLAGS = -O2 -funroll-loops

.PHONY: clean

smooth.exe: smooth_main.cpp smooth_nofm.cpp smooth_fm.o
	g++ $(CFLAGS) smooth_main.cpp smooth_nofm.cpp smooth_fm.o -o $@

smooth_fm.o: smooth_fm.cpp
	g++ $(CFLAGS) -ffast-math -c $<

clean:
	rm -f smooth_fm.o smooth.exe


==== smooth_fm.cpp ====

// Same kernel as smooth_nofm; this file is compiled with -ffast-math
// (see the Makefile) so the two can be timed side by side.
void smooth_fm (double *dest, double const *src, int len) {
    for (int i = 1; i < len - 1; i++)
        dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
}


==== smooth_nofm.cpp ====

// Reference kernel, compiled without -ffast-math.
void smooth_nofm (double *dest, double const *src, int len) {
    for (int i = 1; i < len - 1; i++)
        dest[i - 1] = (src[i - 1] + src[i] + src[i + 1]) / 3;
}


==== smooth_main.cpp ====

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <windows.h>

using namespace std;

LARGE_INTEGER s_tick;

void smooth_nofm (double *, double const *, int);
void smooth_fm (double *, double const *, int);

void tick (void) {
    QueryPerformanceCounter(&s_tick);
}

void tock (const string &msg) {
    LARGE_INTEGER now, freq;
    QueryPerformanceCounter(&now);
    QueryPerformanceFrequency(&freq);
    cout << msg << ": " <<
        ((double)(now.QuadPart - s_tick.QuadPart) /
         (double)(freq.QuadPart / 1000LL)) << endl;
}

void fill (double *src, int len) {
    srand(time(NULL));
    for (int i = 0; i < len; ++i)
        src[i] = rand();   // fill with pseudo-random values
}

int main () {

    const int len = 50000;
    double src_array1[len];
    double src_array2[len];
    double dest_array[len];
    double fm, nofm;

    fill(src_array1, len);
    copy(src_array1, src_array1 + len, src_array2);

    tick();
    for (int i = 0; i < 10000; i++)
        smooth_nofm(dest_array, src_array1, len);
    tock("no -ffast-math");
    nofm = dest_array[0];

    tick();
    for (int i = 0; i < 10000; i++)
        smooth_fm(dest_array, src_array2, len);
    tock("-ffast-math");
    fm = dest_array[0];

    cout << 0.00000000000001 << endl;
    cout << "delta: " << (fm - nofm) << endl;

    if (fm == nofm)
        cout << "they are precisely equal." << endl;

    return 0;
}

==== END ====
 
