Confused about benchmarks calculating averages of large arrays.


jason.cipriani

At the end of this message is a program that calculates benchmarks for
3 different methods of calculating the mean of large arrays of data.
20 arrays containing 200,000 floats each are summed and averaged, per-
element, producing an array of 200,000 floats as output.

The difference between the 3 methods is the order in which operations
are applied.

Method 1: Add all of array 0 to output, then all of array 1, then all
of array 2, etc. (this is done for all 20 arrays). Then, divide output
by 20.

Method 2: Compute the sum of the first element of each array, then of
the second element of each array, and so on, one element at a time
(this is done 200,000 times). Then, divide output by 20.

Method 3: Similar to method 1, but add array/20.0 each time, and don't
do a separate division at the end.

The results of the benchmarks are confusing to me and I am wondering
if somebody could clear up why. The reason I am testing this is
because I have an application where I must compute the means of many
large arrays, and I'm trying to squeeze a few extra CPU cycles out.

The code below leaves as many optimizations up to the compiler as
possible. It does not explicitly use any special instruction sets
(e.g. not SSE-optimized), all arrays are accessed with [] operator,
etc.

Here are the results on a 2.16 GHz Intel T2600 with lots of RAM. The
values are "iterations / second" (number of complete means per
second). The 3 values correspond to the compiler commands "gcc -O0",
"gcc -O2", and "gcc -O2 -funroll-loops", respectively (method 3 is, of
course, the slowest, performing far more divisions than any of the
other methods):

             -O0     -O2     -O2 -funroll-loops
Method 1:   35.1    96.2    101.7
Method 2:   35.1    54.0     98.2
Method 3:   14.3    14.4     14.3

Now, here are the results using arrays of doubles instead of floats:

             -O0     -O2     -O2 -funroll-loops
Method 1:   32.2    43.7     44.6
Method 2:   29.5    42.7     70.4
Method 3:   14.1    14.3     14.3

Here are my questions, if anybody has any insights:

1) Looking at floats: why does method 2 perform so much worse than
method 1? Both are performing the same number of operations (I think);
the biggest difference is the order in which the input data is
accessed (method 1 accesses inputs[0][0], inputs[0][1], inputs[0][2],
..., while method 2 accesses inputs[0][0], inputs[1][0], inputs[2][0],
...). I suspect it has something to do with loop overhead, since
-funroll-loops eliminated any difference between the two methods.

2) Why are the relative results for doubles so different? I expected
the timings to be slower, of course. But unlike floats, -O2 alone did
not create a significant difference between method 1 and 2. With both
floats and doubles only method 2 was significantly affected by loop
unrolling, but with doubles method 2 is nearly twice as fast as method
1 -- whereas with floats they perform roughly the same. What is going
on?

3) Why is method 3 unaffected by data type or compiler optimizations?
I suspect it is because the overhead of the extra divisions is so
large that it makes other speedups negligible, but I am not sure.

4) How much of this might be unique to my machine? That is, if I use
method 1 with floats and method 2 with doubles, am I likely to get the
best performance on any machine (again without explicitly using any
special instruction sets)?

I know very little about the intricacies of optimizing code as far as
taking advantage of hardware goes, so a lot of this doesn't make much
sense to me.

The program is below.

Thanks a lot,
Jason

===== BEGIN =====

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef float elem_t;

#define ELEMS 200000
#define BUFS 20
#define ITERS 100

elem_t inputs[BUFS][ELEMS];
elem_t output[ELEMS];
elem_t dummy = 0.0f; /* so the compiler doesn't optimize unused stuff away */

/*----- timer code appropriate for windows --------------*/

#include <windows.h>

LARGE_INTEGER tfreq, ttick;

void tick (void) {
    QueryPerformanceFrequency(&tfreq);
    QueryPerformanceCounter(&ttick);
}

void tock (const char *name) {
    double dt;
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    dt = (double)(now.QuadPart - ttick.QuadPart) / (double)tfreq.QuadPart;
    printf("%s: %f sec, %f iters/sec, %f sec/iter\n",
           name, dt, (double)ITERS/dt, dt/(double)ITERS);
}

/*-------------------------------------------------------*/

void average1 (void) {
    int b, n;

    /* calculate sum, one input array at a time */
    memcpy(output, inputs[0], sizeof(output));
    for (b = 1; b < BUFS; ++b)
        for (n = 0; n < ELEMS; ++n)
            output[n] += inputs[b][n];

    /* then calculate mean */
    for (n = 0; n < ELEMS; ++n)
        output[n] /= (elem_t)BUFS;

    dummy += output[0];
}

void average2 (void) {
    int b, n;

    /* calculate sum, one element at a time */
    memcpy(output, inputs[0], sizeof(output));
    for (n = 0; n < ELEMS; ++n)
        for (b = 1; b < BUFS; ++b)
            output[n] += inputs[b][n];

    /* then calculate mean */
    for (n = 0; n < ELEMS; ++n)
        output[n] /= (elem_t)BUFS;

    dummy += output[0];
}

void average3 (void) {
    int b, n;

    /* calculate mean, dividing each element as it is added */
    for (n = 0; n < ELEMS; ++n)
        output[n] = inputs[0][n] / (elem_t)BUFS;
    for (b = 1; b < BUFS; ++b)
        for (n = 0; n < ELEMS; ++n)
            output[n] += inputs[b][n] / (elem_t)BUFS;

    dummy += output[0];
}

/*-----------------------------------*/

void timeit (const char *name, void (*fn)(void)) {
    int n;

    tick();
    for (n = 0; n < ITERS; ++n)
        fn();
    tock(name);
}

int main (void) {
    int b, e, r;

    for (b = 0; b < BUFS; ++b)
        for (e = 0; e < ELEMS; ++e)
            inputs[b][e] = 20.0f * (elem_t)rand() / (elem_t)RAND_MAX - 10.0f;

    for (r = 0; r < 3; ++r) {
        timeit("average1", average1);
        timeit("average2", average2);
        timeit("average3", average3);
    }

    printf("%f\n", dummy);
    return 0;
}
 

Jim Langston

1) Looking at floats; why does method 2 perform so much worse than
method 1? Both are performing the same number of operations (I think),
the biggest difference is the order the input data is accessed in
(method 1 accesses inputs[0][0], inputs[0][1], inputs[0][2], ...,
method 2 does it like inputs[0][0], inputs[1][0], inputs[2][0], ...).
I suspect it has something to do with loop overhead, since -funroll-
loops eliminated any difference between the two methods.

Method one uses the arrays in a more natural order. That is, it
accesses the array elements in the same way they're laid out in
memory: in memory, [0][0] is followed directly by [0][1], then by
[0][2], etc. Accessing the elements this way allows for better caching
and compiler optimization.

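To make that concrete, here is a stripped-down sketch of the two
traversal orders (same names and layout as the posted program;
illustration only, not a drop-in replacement):

#define BUFS  20
#define ELEMS 200000

/* Same row-major layout as the posted program: inputs[b][n] and
   inputs[b][n+1] are adjacent in memory, while inputs[b][n] and
   inputs[b+1][n] are ELEMS elements apart. */
static float inputs[BUFS][ELEMS];
static float output[ELEMS];

/* Method 1 order: the inner loop walks consecutive addresses (stride of
   one float), so every byte of each cache line fetched from inputs[b]
   is used before moving on. */
static void sum_array_at_a_time(void) {
    int b, n;
    for (b = 1; b < BUFS; ++b)
        for (n = 0; n < ELEMS; ++n)
            output[n] += inputs[b][n];
}

/* Method 2 order: consecutive inner-loop accesses are ELEMS floats
   (800 KB) apart, so each one touches a different cache line. */
static void sum_element_at_a_time(void) {
    int b, n;
    for (n = 0; n < ELEMS; ++n)
        for (b = 1; b < BUFS; ++b)
            output[n] += inputs[b][n];
}

With -funroll-loops the short inner loop of the second version gets
unrolled over b, which may be why the gap closes there.
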
2) Why are the relative results for doubles so different? I expected
the timings to be slower, of course. But unlike floats, -O2 alone did
not create a significant difference between method 1 and 2. With both
floats and doubles only method 2 was significantly affected by loop
unrolling, but with doubles method 2 is nearly twice as fast as method
1 -- whereas with floats they perform roughly the same. What is going
on?

It depends on the OS and the system, I think. This may be different
for different machines, depending on the natural word size, the CPU
and floating-point hardware, etc.

3) Why is method 3 unaffected by data type or compiler optimizations?
I suspect it is because the overhead of the extra divisions is so
large that it makes other speedups negligible, but I am not sure.

That's probably the case. Floating point division is just so slow.
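
If those divisions are the bottleneck, one common trick is to divide
once up front and multiply by the reciprocal inside the loops. The
compiler usually won't do that for you at plain -O2, because
x * (1.0/20) isn't guaranteed to round exactly like x / 20, so the
results can differ in the last bit or so. A sketch using the posted
program's declarations (illustrative, untested):

/* Variation on average3: one division in total, multiplications in the
   loops.  Assumes the globals (inputs, output, dummy), the macros
   (BUFS, ELEMS) and elem_t from the posted program.  Results may
   differ from average3 in the least significant bits. */
void average3_recip (void) {
    int b, n;
    const elem_t scale = (elem_t)1.0 / (elem_t)BUFS;

    for (n = 0; n < ELEMS; ++n)
        output[n] = inputs[0][n] * scale;
    for (b = 1; b < BUFS; ++b)
        for (n = 0; n < ELEMS; ++n)
            output[n] += inputs[b][n] * scale;

    dummy += output[0];
}
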
4) How much of this might be unique to my machine? I.e. if I use
method 1 with floats and method 2 with doubles, am I probably going to
get the best performance on any machine (again without explicitly
using any special instruction sets).

The relative speed of floats vs. doubles is probably very machine
dependent. On some machines doubles will be faster than floats, on
others it's the other way around.

Barry Schwarz

I also think caching is the big difference here. Most caching systems
seem to really benefit from nice, straight-forward, sequential
references to memory.

Another possibility is paging. Odds are that your system is doing
other things while your code is executing (firewall rejecting probes,
clock updating umpteen times per second, etc). Even though you have
"lots of RAM", the OS may still choose to steal pages from your arrays
that have not been referenced "recently". In method 1, you stay on
the same page until you are done with it. In method 2, you hop around
all over the place. If you have some way of monitoring the paging
rate, you may notice it is a lot higher in method 2.
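
One way to check that on Windows is GetProcessMemoryInfo() from
psapi.h, which reports a cumulative page-fault count for the process
(the count includes soft faults, so treat it as a rough indicator
only). A sketch, assuming you link against psapi (e.g. -lpsapi with
MinGW):

#include <stdio.h>
#include <windows.h>
#include <psapi.h>

/* Print the process's cumulative page-fault count; calling this before
   and after a benchmark run shows roughly how many faults it incurred. */
void print_page_faults (const char *label) {
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, (DWORD)sizeof(pmc)))
        printf("%s: %lu page faults so far\n",
               label, (unsigned long)pmc.PageFaultCount);
}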

Your stated objective was to save CPU cycles but you appear to be
measuring wall clock (elapsed) time. The two are not necessarily
proportional. Do you have a tool that can report on CPU usage for
your task? (By the way, unless you are being charged for CPU time,
wall clock time may be the more useful metric anyway. Most users want
the results as quickly as possible and don't care whether the time is
spent in the CPU or waiting for I/O.)
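
For a quick CPU-time number to put next to the wall-clock figures, the
standard clock() function is the easiest thing to try: the C standard
says it measures processor time, although some Windows C runtimes
implement it as elapsed time instead, in which case GetProcessTimes()
is the thing to look at. A sketch of a timeit-style wrapper (the names
here are mine, not from the posted program):

#include <stdio.h>
#include <time.h>

/* Rough CPU-time timer around a benchmark function.  fn stands in for
   average1/average2/average3 from the posted program. */
void time_cpu (const char *name, void (*fn)(void), int iters) {
    clock_t start, end;
    double cpu_sec;
    int i;

    start = clock();
    for (i = 0; i < iters; ++i)
        fn();
    end = clock();

    cpu_sec = (double)(end - start) / (double)CLOCKS_PER_SEC;
    printf("%s: %f CPU sec, %f iters/sec\n",
           name, cpu_sec, (double)iters / cpu_sec);
}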


blmblm

I also think caching is the big difference here. Most caching systems
seem to really benefit from nice, straight-forward, sequential
references to memory.

That's normally what I'd say too -- but if the difference is caching,
why wouldn't it show up even at optimization level -O0? And why would
method 2 actually perform better than method 1 (unless I'm misreading
the results) at one optimization level?
