jason.cipriani
At the end of this message is a program that benchmarks 3 different
methods of calculating the mean of large arrays of data. 20 arrays
containing 200,000 floats each are summed and averaged, per-element,
producing an array of 200,000 floats as output.
The difference between the 3 methods is the order in which operations
are applied.
Method 1: Add all of array 0 to output, then all of array 1, then all
of array 2, etc. (once for each of the 20 arrays). Then, divide output
by 20.
Method 2: Compute the sum of the first element of each array, then the
second element of each array, and so on, one element at a time (this
is done 200,000 times). Then, divide output by 20.
Method 3: Similar to method 1, but add array/20.0 each time instead of
doing a separate division at the end.
The results of the benchmarks are confusing to me, and I am wondering
if somebody could explain why they come out this way. The reason I am
testing this is that I have an application where I must compute the
means of many large arrays, and I'm trying to squeeze a few extra CPU
cycles out.
The code below leaves as many optimizations up to the compiler as
possible. It does not explicitly use any special instruction sets
(e.g. it is not SSE-optimized); all arrays are accessed with the []
operator, etc.
Here are the results on a 2.16 GHz Intel T2600 with lots of RAM. The
values are "iterations / second" (number of complete means per
second). The 3 values correspond to the compiler commands "gcc -O0",
"gcc -O2", and "gcc -O2 -funroll-loops", respectively (method 3 is, of
course, the slowest, performing far more divisions than any of the
other methods):
Method 1: 35.1, 96.2, 101.7
Method 2: 35.1, 54.0, 98.2
Method 3: 14.3, 14.4, 14.3
Now, here are the results using arrays of doubles instead of floats:
Method 1: 32.2, 43.7, 44.6
Method 2: 29.5, 42.7, 70.4
Method 3: 14.1, 14.3, 14.3
Here are my questions, if anybody has any insights:
1) Looking at floats: why does method 2 perform so much worse than
method 1? Both perform the same number of operations (I think); the
biggest difference is the order in which the input data is accessed
(method 1 accesses inputs[0][0], inputs[0][1], inputs[0][2], ...,
while method 2 accesses inputs[0][0], inputs[1][0], inputs[2][0],
...). I suspect it has something to do with loop overhead, since
-funroll-loops eliminated any difference between the two methods.
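To make the two access patterns concrete, here are just the summation
loops from the program below, side by side:

    /* method 1: the inner loop walks one input row contiguously */
    for (b = 1; b < BUFS; ++ b)
        for (n = 0; n < ELEMS; ++ n)
            output[n] += inputs[b][n];

    /* method 2: the inner loop strides across rows, so consecutive
       accesses are ELEMS * sizeof(elem_t) bytes apart (800 KB for
       floats) */
    for (n = 0; n < ELEMS; ++ n)
        for (b = 1; b < BUFS; ++ b)
            output[n] += inputs[b][n];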
2) Why are the relative results for doubles so different? I expected
the timings to be slower, of course. But unlike with floats, -O2 alone
did not create a significant difference between methods 1 and 2. With
both floats and doubles, only method 2 was significantly affected by
loop unrolling; but with doubles, the unrolled method 2 is about 60%
faster than method 1, whereas with floats the two perform roughly the
same. What is going on?
3) Why is method 3 unaffected by data type or compiler optimizations?
I suspect it is because the overhead of the extra divisions is so
large that it makes other speedups negligible, but I am not sure.
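One way I could test that hypothesis (just a sketch, not something I
have benchmarked) would be a variant of method 3 that multiplies by a
precomputed reciprocal instead of dividing; the results might differ
from x / 20.0f in the last bit, since the two roundings are not
guaranteed to match:

    /* hypothetical method 3 variant: multiply by the reciprocal of
       BUFS instead of dividing by it */
    const elem_t scale = (elem_t)1.0 / (elem_t)BUFS;
    for (n = 0; n < ELEMS; ++ n)
        output[n] = inputs[0][n] * scale;
    for (b = 1; b < BUFS; ++ b)
        for (n = 0; n < ELEMS; ++ n)
            output[n] += inputs[b][n] * scale;

If the divisions really are the bottleneck, this variant should
perform close to method 1.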
4) How much of this might be unique to my machine? I.e., if I use
method 1 with floats and method 2 with doubles, am I probably going to
get the best performance on any machine (again, without explicitly
using any special instruction sets)?
I know very little about the intricacies of optimizing code as far as
taking advantage of hardware goes, so a lot of this doesn't make much
sense to me.
The program is below.
Thanks a lot,
Jason
===== BEGIN =====
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef float elem_t;
#define ELEMS 200000
#define BUFS 20
#define ITERS 100
elem_t inputs[BUFS][ELEMS];
elem_t output[ELEMS];
elem_t dummy = 0.0f; /* so compiler doesn't optimize unused stuff away */
/*----- timer code appropriate for windows --------------*/
#include <windows.h>
LARGE_INTEGER tfreq, ttick;
void tick (void) {
    QueryPerformanceFrequency(&tfreq);
    QueryPerformanceCounter(&ttick);
}
void tock (const char *name) {
    double dt;
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    dt = (double)(now.QuadPart - ttick.QuadPart) / (double)tfreq.QuadPart;
    printf("%s: %f sec, %f iters/sec, %f sec/iters\n",
           name, dt, (double)ITERS/dt, dt/(double)ITERS);
}
/*-------------------------------------------------------*/
void average1 (void) {
    int b, n;
    /* calculate sum, one input at a time */
    memcpy(output, inputs[0], sizeof(output));
    for (b = 1; b < BUFS; ++ b)
        for (n = 0; n < ELEMS; ++ n)
            output[n] += inputs[b][n];
    /* then calculate mean */
    for (n = 0; n < ELEMS; ++ n)
        output[n] /= (elem_t)BUFS;
    dummy += output[0];
}
void average2 (void) {
    int b, n;
    /* calculate sum, one element at a time */
    memcpy(output, inputs[0], sizeof(output));
    for (n = 0; n < ELEMS; ++ n)
        for (b = 1; b < BUFS; ++ b)
            output[n] += inputs[b][n];
    /* then calculate mean */
    for (n = 0; n < ELEMS; ++ n)
        output[n] /= (elem_t)BUFS;
    dummy += output[0];
}
void average3 (void) {
    int b, n;
    /* calculate mean directly, dividing as we go */
    for (n = 0; n < ELEMS; ++ n)
        output[n] = (inputs[0][n] / (elem_t)BUFS);
    for (b = 1; b < BUFS; ++ b)
        for (n = 0; n < ELEMS; ++ n)
            output[n] += (inputs[b][n] / (elem_t)BUFS);
    dummy += output[0];
}
/*-----------------------------------*/
void timeit (const char *name, void (*fn)(void)) {
    int n;
    tick();
    for (n = 0; n < ITERS; ++ n)
        fn();
    tock(name);
}
int main (void) {
    int b, e, r;
    /* fill inputs with random values in [-10, 10] */
    for (b = 0; b < BUFS; ++ b)
        for (e = 0; e < ELEMS; ++ e)
            inputs[b][e] = 20.0f * (elem_t)rand() / (elem_t)RAND_MAX - 10.0f;
    for (r = 0; r < 3; ++ r) {
        timeit("average1", average1);
        timeit("average2", average2);
        timeit("average3", average3);
    }
    printf("%f\n", dummy); /* print dummy so the averages aren't optimized away */
    return 0;
}