Performance of hand-optimised assembly


Ben Bacarisse

I'm starting a new thread, but it is prompted by this remark from 88888
Dihedral <[email protected]> about a short binary search
function I posted:

| Do you consider to convert 5 to 10 lines of C to assembly?
| I did that 20 years ago on X86 cpu.
| It was a piece of cake to gain the 20-40% speed required if paid
| well.

I've felt (it's only a feeling) that hand coding for speed (rather than
to access features not available in C like extra wide multiplies and so
on) was a thing of the past. But maybe I'm wrong. Is it still possible
to get a significant speed-up over C by hand coding? How much depends
on the processor? How good are the modern optimising compilers that
would have to be beaten?

Since I don't know enough about modern CPUs to stand much chance of
improving on gcc, I thought I'd post this for others to try. Maybe
you'll find you have some time on your hands over the next few days!

Here is a test program:

#include <stdio.h>
#include <stdlib.h>

size_t binsearch(int a[], size_t length, int key)
{
    size_t low = 0, high = length;
    while (high > low) {
        size_t middle = low + (high - low)/2;
        if (key == a[middle])
            return middle;
        else if (key > a[middle])
            low = middle + 1;
        else high = middle;
    }
    return -1;  /* converts to SIZE_MAX: "not found" */
}

#ifdef TIME_LOOP
#define binsearch(a, length, key) 1
#endif

int main(int argc, char **argv)
{
    if (argc == 2) {
        int n = atoi(argv[1]);
        /* cast so C++-compiled code can be timed */
        int *data = (int *)malloc(n * sizeof *data);
        if (data) {
            int i, length, step, key;
            long int sum;
            for (i = 0; i < n; i++)
                data[i] = 2 * i + 1;

            /* Include one zero-length search for correctness testing */
            sum = binsearch(data, 0, data[0]);
            for (length = 1; length <= n; length++)
                for (step = 1; step <= length; step++)
                    for (key = 0; key <= 2 * length; key += step)
                        sum += binsearch(data, length, key);
            printf("%ld\n", sum);
            return EXIT_SUCCESS;
        }
    }
    return EXIT_FAILURE;
}

The pattern of calls it makes is not intended to be representative. If
cache effects turn out to dominate, so be it -- this is just for fun,
after all.

On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
times. clang's code is a little better at -O0, but the optimised code
is a little worse:
           -O0      -O1      -O2      -O3
gcc      0.1128   0.0633   0.0571   0.0562
g++      0.1167   0.0633   0.0572   0.0564
clang    0.1033   0.0669   0.0672   0.0672

These times are microseconds per call. The loop overhead has been
subtracted[2]. On a low-power chip[3] I get:

           -O0      -O1      -O2      -O3
gcc      0.7872   0.4296   0.4166   0.3688
g++      0.7842   0.4290   0.4160   0.3688

which says nothing new (except how slow the CPU is).

What about other compilers and processors? Can anyone produce code that
is significantly better than gcc -O3 by hand? If you try, how long did
it take to get it both correct and faster? Does your version handle the
same range of values as the C version?

[1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache
[2] Crude timing script available on request.
[3] Geode(TM) Integrated Processor by AMD PCS, 500MHz, 128KB cache.
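
(Roughly, the script in [2] does something like this -- a sketch, not the
actual script:

    $ gcc -O2 btime.c -o full
    $ gcc -O2 -DTIME_LOOP btime.c -o loop
    $ time ./full 1500      # loops plus real binsearch calls
    $ time ./loop 1500      # loop overhead only

and then subtracts the second time from the first.)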
 

Ben Bacarisse

BartC said:
What value of n (from argv) did you use for your test, and what was
the output?

1500 on the laptop and 700 on the slower machine. The output was:

$ gcc -O2 btime.c
$ ./a.out 1500
2297638170

When compiled with -DTIME_LOOP, the output is the number of calls:

$ gcc -O2 -DTIME_LOOP btime.c
$ ./a.out 1500
17258078

which helps with calculating the time per call.
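
In other words, with T_full the time for the normal build and T_loop the
time for the -DTIME_LOOP build:

    time per call = (T_full - T_loop) / 17258078

For example, a difference of about 0.99s over those 17258078 calls is what
gives the 0.0571 microseconds per call shown for gcc -O2 above.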
 

BartC

Ben Bacarisse said:
1500 on the laptop and 700 on the slower machine. The output was:

$ gcc -O2 btime.c
$ ./a.out 1500
2297638170

That's odd; the output I get across 3 compilers (under Windows)
was -1997329126 (and using 'long long int sum' results in a totally
different result again).

Anyway, the results I got were as follows, with fastest/slowest times in
seconds, all for x86-32 under Windows (and all fairly old):

lcc-win32 1.4/1.8 (3.8 June 2010)
DMC 1.2/1.7 (Digital Mars 8.42n)
gcc 0.8/2.1 (Mingw 4.5.0, no special processor options used)

My first assembly version of the body of binsearch() took about 1.0 seconds;
pretty much anything else I tried, even taking out several instructions,
made it slower!

But then I'm not an expert on the internal workings of modern x86
processors. Someone who is could, I'm sure, at least have matched gcc.

Your example though is probably perfect material for a compiler optimiser to
work with: half-a-dozen variables which fit neatly into the half-dozen spare
registers on the x86! Not much scope left to improve on.

I do remember a few years ago implementing an interpreter in C which had
previously been written in a mix of high-level code plus lots of inline
assembly; the C code, even compiled under gcc, was half the speed (and my
assembly code then was probably just as poor as it is now). However,
implementing languages is probably a special case...
 

Joe keane

I've felt (it's only a feeling) that hand coding for speed (rather than
to access features not available in C like extra wide multiplies and so
on) was a thing of the past.

It's not.

If you can get the 0.01% of your code (and you're *sure* it is really
that) to run 40% faster, then it may be a useful thing to do.

[although when <chip manufacturer> comes out with a new chip you may
have actually made it *worse*]
 

Ben Bacarisse

BartC said:
That's odd; the output I get across 3 compilers (under Windows)
was -1997329126 (and using 'long long int sum' results in a totally
different result again).

Yes, sadly the result depends on the value of SIZE_MAX and the behaviour
of adding that to a long. It won't affect timing but it does mean the
output is not portable. Making binsearch return int rather than size_t
will, I think, fix that but it may alter the timing. (It has only a tiny
effect on my system but I'd leave it alone none the less).
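
(The two outputs are consistent with that explanation: 2297638170 - 2^32 =
-1997329126, which is exactly what a 32-bit long would print.)
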
Anyway, the results I got were as follows, with fastest/slowest times in
seconds, all for x86-32 under Windows (and all fairly old):

lcc-win32 1.4/1.8 (3.8 June 2010)
DMC 1.2/1.7 (Digital Mars 8.42n)
gcc 0.8/2.1 (Mingw 4.5.0, no special processor options used)

Is this variation due to optimised and unoptimised code?
My first assembly version of the body of binsearch() took about 1.0 seconds;
pretty much anything else I tried, even taking out several instructions,
made it slower!

Well, it's not bad that you can beat 2 of the three compilers.
But then I'm not an expert on the internal workings of modern x86
processors. Someone who is, I'm sure could at least have matched gcc.

Your example though is probably perfect material for a compiler optimiser to
work with: half-a-dozen variables which fit neatly into the half-dozen spare
registers on the x86! Not much scope left to improve on.

Yup. That was what rather surprised me about the suggestion of
converting to assembler for speed. It looked an unlikely candidate.
I do remember a few years ago implementing an interpreter in C which had
previously been written in a mix of high-level code plus lots of inline
assembly; the C code, even compiled under gcc, was half the speed (and my
assembly code then was probably just as poor as it is now). However,
implementing languages is probably a special case...

In some sense everything is a special case. Language implementation is a
perfectly reasonable role for C, so there's no reason to discard such
data. The trouble is that without code to review, it's impossible to
know where the speed was lost.
 

BartC

Ben Bacarisse said:
Is this variation due to optimised and unoptimised code?

Yes. Either using -O/-o for the first two, or not using it; and using -O3
or -O0 for gcc.
Well, it's not bad that you can beat 2 of the three compilers.

lcc-win32 is not known for its fast code, so that wasn't so surprising.
Looking at the fast gcc result in more detail, it eliminates the first call
to binsearch() completely (but doesn't save so much because it doesn't enter
the loop), and inlines the second call.

If I did the same in assembly, and partially unrolled the loop (fiddly in
this case), I got to within 50ms of the gcc timing. (My assembly code was
inside a version in my own language and my own non-optimising compiler, so
probably the rest of the code makes the difference. FWIW the timing in that
high-level code would have been about 2.1 seconds).

So for this example, I don't think gcc was demonstrating any special
knowledge of my processor; maybe there are processor-specific switches that
could have been used. But it would be interesting to know the results for a
compiler like Intel's (or rather, AMD's, if they do a compiler, since that's
what my machine uses).
Yup. That was what rather surprised me about the suggestion of
converting to assembler for speed. It looked an unlikely candidate.

It all depends. If this weren't just an array of standard machine words, or
there were some other unusual context, then there would be more coding
possibilities and perhaps more opportunities to improve on the compiler.

In some sense everything is a special case. Language implementation is a
perfectly reasonable role for C, so there's no reason to discard such
data. The trouble is that without code to review, it's impossible to
know where the speed was lost.

In this example, I was making global decisions not easily available to a C
compiler, such as choosing to use some dedicated registers across a specific
set of 'conforming' functions, which comprised the main dispatch loop of
this interpreter (namely, program counter, frame pointer, and stack
pointer). I needed to save and restore those registers when stepping out of,
and back into, those conforming functions.

I was also using threaded code for these functions (jumping from the end of
one function to the start of the next, with no call/return mechanism). Now
gcc has a mechanism for threaded code via indirect labels, but it seems
restricted to labels within a single function, and I didn't really want one
10,000-line function.
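
For anyone who hasn't met it, gcc's indirect-label ("labels as values")
mechanism looks roughly like this -- a toy dispatch loop, nothing like the
real interpreter:

    #include <stdio.h>

    static int run(const int *pc)
    {
        static void *op[] = { &&push, &&add, &&halt };  /* gcc extension */
        int stack[64], *sp = stack;

        goto *op[*pc++];            /* dispatch the first opcode */
    push: *sp++ = *pc++;            /* push the next program word */
          goto *op[*pc++];
    add:  --sp; sp[-1] += *sp;      /* pop two, push their sum */
          goto *op[*pc++];
    halt: return sp[-1];
    }

    int main(void)
    {
        const int prog[] = { 0, 2, 0, 3, 1, 2 };  /* push 2; push 3; add; halt */
        printf("%d\n", run(prog));                /* prints 5 */
        return 0;
    }

The label table is indeed confined to the one function, which is the
restriction I mean.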

And in a more recent version, I'm using one dedicated register, but making
use of the *hardware* stack and frame pointers, by using the system stack
for the interpreter's data stack. Now I'm sure you can't just tell C to push
or pop some arbitrary data to the stack, then carry on as normal! (And since
this is 50% faster, the straight-C version may well have been 1/3 the
speed.)

So I think language applications *are* somewhat more special.
 

Willem

Ben Bacarisse wrote:
) I'm starting a new thread, but it is prompted by this remark from 88888
) Dihedral <[email protected]> about a short binary search
) function I posted:
)
) <snip>
) On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
) times. clang's code is a little better at -O0, but the optimised code
) is a little worse:
)            -O0      -O1      -O2      -O3
) gcc      0.1128   0.0633   0.0571   0.0562
) g++      0.1167   0.0633   0.0572   0.0564
) clang    0.1033   0.0669   0.0672   0.0672
)
) These times are microseconds per call. The loop overhead has been
) subtracted[2].

Have you tried enabling the processor-specific switches on gcc?

) [1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache
) [2] Crude timing script available on request.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 

88888 Dihedral

Willem wrote:
) Ben Bacarisse wrote:
) <snip>
) Have you tried enabling the processor-specific switches on gcc?

I think pass-by-register is faster. But if just inserting 10 to 20 lines of
assembly can make the code faster, I'll go for assembly to eliminate the
function-call overheads.

By the way, swapping two registers in 3 instructions, with or without a
temp register, is really hardware dependent.
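
For what it's worth, the two swaps being compared look like this in C (a
sketch; which wins really does depend on the hardware, as he says):

    int a = 1, b = 2, t;

    t = a;  a = b;  b = t;      /* with a temp: three independent moves,
                                   easy for an out-of-order core to overlap */

    a ^= b;  b ^= a;  a ^= b;   /* without a temp: the xor trick, but each
                                   step depends on the one before, so the
                                   three instructions must run in series */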
 

James Harris

....
Having to use that ghastly AT&T syntax is an argument *against* using
assembly, even if it is faster...

Agreed!

Can you give me any pointers on how to link gcc and nasm or another
separate assembler? I've found a few web references with useful tips
but IIRC one of them was your first attempt at this a few years ago.
Maybe you found some more info since then. I know I can do it but
maybe there are good ways and bad ways.

James
 

BartC

Ben Bacarisse said:
Well, it's not bad that you can beat 2 of the three compilers.

I've tried one more compiler, and the results, now normalised and for the
fastest code (and using N=2500 for more accuracy), are:

gcc -O3        1.00   (using inlining)
My asm         1.12   (original without inlining or unrolling)
Pelles C -Ot   1.26
DMC -o         1.35
lcc-win32 -O   1.64

(Adjusted for loop overheads)

So while modern compilers might be ultra-sophisticated, most of these
aren't! Admittedly they are all free downloads..
 

James Harris

....

I don't understand some of your code. Queries below.
Here is a test program:

#include <stdio.h>
#include <stdlib.h>

size_t binsearch(int a[], size_t length, int key)

Why use size_t instead of unsigned int or unsigned long int,
especially as you add it to sum in the test harness which is long int?
Why not use unsigned for sum in the test harness?
{
     size_t low = 0, high = length;

Why not high = length - 1?
     while (high > low) {
          size_t middle = low + (high - low)/2;

Why not middle = (high + low) / 2?
          if (key == a[middle])
               return middle;
          else if (key > a[middle])
               low = middle + 1;
          else high = middle;

Why not high = middle - 1?
     }
     return -1;

}

The reason for the queries is that I wrote an asm binsearch but, when
comparing it with your code, realised that you had some differences in
the approach.
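
(For what it's worth, the usual answer to the middle one: with signed
index types, low + high can overflow where the subtraction form cannot --
an illustration with int rather than Ben's size_t:

    int low = INT_MAX - 2, high = INT_MAX;    /* needs <limits.h> */
    /* int mid = (low + high) / 2;     sum overflows: undefined behaviour */
    int mid = low + (high - low) / 2;  /* fine: the difference is only 2 */

With size_t the sum merely wraps, but the defensive habit carries over.)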

James
 

BartC

Can you give me any pointers on how to link gcc and nasm or another
separate assembler? I've found a few web references with useful tips
but IIRC one of them was your first attempt at this a few years ago.
Maybe you found some more info since then. I know I can do it but
maybe there are good ways and bad ways.

I haven't made any more attempts since then. You still have the problem
that the assembly code with the nice syntax has to be in a separate module,
not inline.

To combine, you probably know how; it means running Nasm with the right
object format (nasm -fwin32 under Windows), and presenting the resulting
.obj file to gcc in its list of files.

Oh, and you have to put "_" in front of any exported names in the asm
module, for some reason that I still don't know.

And you have to sort out calling conventions between the two languages (C
calling Asm, and Asm calling C).

So it's still a bit fiddly.
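
To make it concrete, here's a minimal sketch (invented names; x86-32
Windows, cdecl, hence the leading underscore on the symbol):

    ; maxi.asm -- int maxi(int a, int b)
            global  _maxi
            section .text
    _maxi:  mov     eax, [esp+4]      ; first argument
            cmp     eax, [esp+8]      ; compare with the second
            jge     .done
            mov     eax, [esp+8]
    .done:  ret                       ; cdecl returns in eax

    /* main.c */
    #include <stdio.h>
    extern int maxi(int a, int b);    /* no underscore on the C side */
    int main(void) { printf("%d\n", maxi(2, 3)); return 0; }

    $ nasm -fwin32 maxi.asm
    $ gcc main.c maxi.obj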

(I normally use assembly as inline code within my own language. The
advantages are the use of Intel/Nasm syntax, with no need to write each line
as a string literal either, and being able to write substantial amounts of
code within a high-level language framework: functions, declarations, local
namespaces in each function, etc.

Also interaction with the high-level code is simple (optimisation doesn't
get in the way, for a start!). A lot of other issues are taken care of too.

I would call that a _good_ and easy way of using assembly. If anyone knows
of a C compiler that offers the same, then I might start using C a lot
more.)
 

Jens Gustedt

Hello,

On 12/24/2011 01:42 PM, BartC wrote:
Yes. Either using -O/-o for the first two, or not using it; and using -O3
or -O0 for gcc.
...
So for this example, I don't think gcc was demonstrating any special
knowledge of my processor; maybe there are processor-specific switches that
could have been used.

With modern gcc you can always use -march=native to optimize for the
processor on which the compiler is running.
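
For the test program above that would be, for example:

    $ gcc -O3 -march=native btime.c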

Jens
 

James Harris

....
To combine, you probably know how; it means running Nasm with the right
object format (nasm -fwin32 under Windows), and presenting the resulting
.obj file to gcc in its list of files.

Oh, and you have to put "_" in front of any exported names in the asm
module, for some reason that I still don't know.

And you have to sort out calling conventions between the two languages (C
calling Asm, and Asm calling C).

So it's still a bit fiddly.

I'm aware there are various options for name mangling and parameter
passing. I'm just not sure which of the many are best, if there is
such a thing. I asked in response to Ben's initial post as you had got
a test working but I'm thinking this is too big an issue for this
subthread and needs a separate post, and a crosspost at that!
Something to come back to.

James
 

James Dow Allen

I'm starting a new thread, but it is prompted by this remark...
| Do you consider to convert 5 to 10 lines of C to assembly?
| I did that 20 years ago on X86 cpu.
| It was a piece of cake to gain the 20-40% speed required if paid
| well.

I've felt (it's only a feeling) that hand coding for speed (rather
than to access features not available in C like extra wide multiplies
and so on) was a thing of the past. But maybe I'm wrong. Is it still
possible to get a significant speed-up over C by hand coding? How
much depends on the processor?

Don't know, but 20 years ago IIRC, good speedups should have been
common.

I did an experiment just yesterday on a snippet posted in
comp.programming. I'll report my results here. The key line of code
seems "wonderful" in its own right: a bit-twiddler to replace a word with
N bits set by the next such word in lexicographical order.

Try something like this:

i = (1 << n) - 1;
while ( !(i >> m) )
{
    do_something_with(i);
    i = (i+(i&(-i)))|(((i^(i+(i&(-i))))>>2)/(i&(-i)));
}

The first "i =" statement sets i to the smallest number with n 1-bits.

The second "i =" statement is some bit manipulation "magic" that sets
i to the next sequential number that has the same number of 1-bits.
For an explanation, see
http://groups.google.com/group/programming-puzzles/msg/3a05b3c8b4042d5b
, which points to a PowerPoint presentation that gives a
multi-statement expression of the one-liner that I constructed from
it.

The "while" statement loops as long as i remains less than 2^m.

Dave

With gcc -O{2,4} on a Pentium I got a speedup over

    i = (i+(i&(-i)))|(((i^(i+(i&(-i))))>>2)/(i&(-i)));

just by factoring out common subexpressions and assigning them to temp
variables. (One might think gcc could do that; perhaps this line is
just too complicated.)
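
A factored version along those lines (my reconstruction, not the exact
code I timed):

    unsigned lo = i & -i;              /* lowest set bit of i         */
    unsigned hi = i + lo;              /* ripple it up to the next 0  */
    i = hi | (((i ^ hi) >> 2) / lo);   /* pack the rest at the bottom */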

Buried in the expression is a '/' operator which, of course, is time
consuming. I got a big speedup replacing that divide:

    while (!(lb & 15)) i >>= 4, lb >>= 4;
    while (!(lb & 1))  i >>= 1, lb >>= 1;
Pentium has an opcode, I think, accessible from gcc with `ffs()', which
seems like it *might* allow better speedup, but what I tried was slower
than the whiles.

Note that neither of these speedup ideas (the whiles or ffs()) required
using assembler instead of C. IIRC it's not unusual to be able to speed up
C by catering to the C compiler's code generator, often in non-obvious
ways.
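
On gcc specifically there is one more C-level option (untried by me, so
treat it as a guess): since lo (= i&(-i)) is a power of two, the division
is a right shift by its bit position, and gcc's __builtin_ctz compiles to
that bit-scan opcode:

    i = hi | ((i ^ hi) >> (2 + __builtin_ctz(lo)));   /* lo, hi as above */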

James
 

BartC

James Harris said:
I'm aware there are various options for name mangling and parameter
passing. I'm just not sure which of the many are best, if there is
such a thing.

If you're interfacing with C, the possibilities for parameter passing are
limited; usually you have to go along with what the specific C compiler
does, or whatever foreign interfaces are supported.

Apparently for x86-64, parameters would be passed in registers, which is
best for this example, but on x86-32, the stack seems to be favoured. In any
case the 'winning entry' (gcc -O3) bypassed the whole issue by inlining the
code (which I considered cheating).
I asked in response to Ben's initial post as you had got
a test working

The test proposed actually would have worked well written in a separate
module, but I couldn't remember all the fiddly details I needed to know, and
used inline assembly in a language I was familiar with.
but I'm thinking this is too big an issue for this
subthread and needs a separate post, and a crosspost at that!
Something to come back to.

The bigger issue is not so much getting these things to work, as whether it
is still worthwhile using assembly at all, in these days of
'super-intelligent' compilers (which I haven't really come across myself). I
think that was the point of the thread.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,059
Latest member
cryptoseoagencies

Latest Threads

Top