Ben Bacarisse
I'm starting a new thread, but it is prompted by this remark from 88888
Dihedral <[email protected]> about a short binary search
function I posted:
| Do you consider to convert 5 to 10 lines of C to assembly?
| I did that 20 years ago on X86 cpu.
| It was a piece of cake to gain the 20-40% speed required if paid
| well.
I've felt (it's only a feeling) that hand coding for speed (rather than
to access features not available in C, such as extra-wide multiplies and
so on) was a thing of the past. But maybe I'm wrong. Is it still possible
to get a significant speed-up over C by hand coding? How much depends
on the processor? How good are the modern optimising compilers that
would have to be beaten?
Since I don't know enough about modern CPUs to stand much chance of
improving on gcc, I thought I'd post this for others to try. Maybe
you'll find you have some time on your hands over the next few days!
Here is a test program:
#include <stdio.h>
#include <stdlib.h>

size_t binsearch(int a[], size_t length, int key)
{
    size_t low = 0, high = length;
    while (high > low) {
        size_t middle = low + (high - low)/2;
        if (key == a[middle])
            return middle;
        else if (key > a[middle])
            low = middle + 1;
        else high = middle;
    }
    return -1;  /* i.e. (size_t)-1, the "not found" value */
}

#ifdef TIME_LOOP
#define binsearch(a, length, key) 1
#endif

int main(int argc, char **argv)
{
    if (argc == 2) {
        int n = atoi(argv[1]);
        /* cast so the same code can be timed when compiled as C++ */
        int *data = (int *)malloc(n * sizeof *data);
        if (data) {
            int i, length, step, key;
            long int sum;
            for (i = 0; i < n; i++)
                data[i] = 2 * i + 1;
            /* Include one zero-length search for correctness testing */
            sum = binsearch(data, 0, data[0]);
            for (length = 1; length <= n; length++)
                for (step = 1; step <= length; step++)
                    for (key = 0; key <= 2 * length; key += step)
                        sum += binsearch(data, length, key);
            printf("%ld\n", sum);
            return EXIT_SUCCESS;
        }
    }
    return EXIT_FAILURE;
}
The pattern of calls it makes is not intended to be representative. If
cache effects turn out to dominate, so be it -- this is just for fun,
after all.
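As a starting point for anyone taking this up: one technique a hand coder
(or a good compiler) might reach for is replacing the three-way branch
with a conditional move. Here is a sketch of that idea in C -- my own
illustration, not measured code. The ?: below often compiles to a cmov on
x86, and the function keeps the same interface and "not found" value as
binsearch above:

```c
#include <stddef.h>

/* Branch-reduced binary search: narrow the window [low, low + n) by
   halves, using a conditional move instead of a data-dependent branch.
   Hypothetical sketch; correctness matters more than the name. */
size_t binsearch_cmov(const int a[], size_t length, int key)
{
    size_t low = 0;
    size_t n = length;
    while (n > 1) {
        size_t half = n / 2;
        /* keep the half whose first element is still <= key */
        low = (a[low + half] <= key) ? low + half : low;
        n -= half;
    }
    if (n == 1 && a[low] == key)
        return low;
    return (size_t)-1;  /* same "not found" value as the original */
}
```

Whether this actually beats the plain version is exactly the sort of
question the challenge is about -- it trades fewer mispredictions for a
fixed log2(length) number of iterations.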
On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
times. clang's code is a little better at -O0, but the optimised code
is a little worse:
           -O0      -O1      -O2      -O3
gcc     0.1128   0.0633   0.0571   0.0562
g++     0.1167   0.0633   0.0572   0.0564
clang   0.1033   0.0669   0.0672   0.0672
These times are microseconds per call. The loop overhead has been
subtracted[2]. On a low-power chip[3] I get:
         -O0      -O1      -O2      -O3
gcc   0.7872   0.4296   0.4166   0.3688
g++   0.7842   0.4290   0.4160   0.3688
which says nothing new (except how slow the CPU is).
What about other compilers and processors? Can anyone produce code that
is significantly better than gcc -O3 by hand? If you try, how long did
it take to get it both correct and faster? Does your version handle the
same range of values as the C version?
[1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache
[2] Crude timing script available on request.
[3] Geode(TM) Integrated Processor by AMD PCS, 500MHz, 128KB cache.
Dihedral <[email protected]> about a short binary search
function I posted:
| Do you consider to convert 5 to 10 lines of C to assembly?
| I did that 20 years ago on X86 cpu.
| It was a piece of cake to gain the 20-40% speed required if paid
| well.
I've felt (it's only a feeling) that hand coding for speed (rather than
to access features not available in C like extra wide multiplies and so
on) was a thing of the past. But maybe I'm wrong. Is it still possible
to get a significant speed-up over C by hand coding? How much depends
on the processor? How good are the modern optimising compilers that
would have to be beaten?
Since I don't know enough about modern CPUs to stand much chance of
improving on gcc, I thought I'd post this for others to try. Maybe
you'll find you have some time on your hands over the next few days!
Here is a test program:
#include <stdio.h>
#include <stdlib.h>
size_t binsearch(int a[], size_t length, int key)
{
size_t low = 0, high = length;
while (high > low) {
size_t middle = low + (high - low)/2;
if (key == a[middle])
return middle;
else if (key > a[middle])
low = middle + 1;
else high = middle;
}
return -1;
}
#ifdef TIME_LOOP
#define binsearch(a, length, key) 1
#endif
int main(int argc, char **argv)
{
if (argc == 2) {
int n = atoi(argv[1]);
/* cast so C++ compiled code can timed */
int *data = (int *)malloc(n * sizeof *data);
if (data) {
int i, length, step, key;
long int sum;
for (i = 0; i < n; i++)
data = 2 * i + 1;
/* Include one zero-length search for correctness testing */
sum = binsearch(data, 0, data[0]);
for (length = 1; length <= n; length++)
for (step = 1; step <= length; step++)
for (key = 0; key <= 2 * length; key += step)
sum += binsearch(data, length, key);
printf("%ld\n", sum);
return EXIT_SUCCESS;
}
}
return EXIT_FAILURE;
}
The pattern of calls it makes is not intended to be representative. If
cache effects turn out to dominate, so be it -- this is a just for fun
after all.
On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
times. clang's code is a little better at -O0, but the optimised code
is a little worse:
-O0 -O1 -O2 -O3
gcc 0.1128 0.0633 0.0571 0.0562
g++ 0.1167 0.0633 0.0572 0.0564
clang 0.1033 0.0669 0.0672 0.0672
These times are microseconds per call. The loop overhead has been
subtracted[2]. On a low-power chip[3] I get:
-O0 -O1 -O2 -O3
gcc 0.7872 0.4296 0.4166 0.3688
g++ 0.7842 0.4290 0.4160 0.3688
which says nothing new (except how slow the CPU is).
What about other compilers and processors? Can anyone produce code that
is significantly better than gcc -O3 by hand? If you try, how long did
it take to get it both correct and faster? Does your version handle the
same range of values as the C version?
[1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache
[2] Crude timing script available on request.
[3] Geode(TM) Integrated Processor by AMD PCS, 500MHz, 128KB cache.