Performance of hand-optimised assembly

James Harris

....
That form can cause an overflow when the numbers are high. Otherwise
(high+low)/2 is marginally faster.

Understood, thanks. There's an interesting point here. The
straightforward code to implement that expression might be

mov rmid, rhigh   ; rmid = high
sub rmid, rlow    ; rmid = high - low
shr rmid, 1       ; rmid = (high - low)/2
add rmid, rlow    ; rmid = low + (high - low)/2

but that could be improved to

mov rmid, rlow    ; rmid = low
add rmid, rhigh   ; rmid = low + high; the carry flag keeps the top bit
rcr rmid, 1       ; rotate the carry back in: rmid = (low + high)/2

because rcr brings that 'lost' carry bit back in, and IIRC rcr by one
bit position is generally as fast as the shifts and the rotates which
don't include the carry. (Rotates through the carry are not so fast
when multiple bit positions are moved at once.)

In the case in point, since the top bits will not be set, it can be
improved to

lea rmid, [rlow + rhigh]   ; rmid = low + high (flags untouched)
shr rmid, 1                ; rmid = (low + high)/2

on most CPUs. The Intel Atom is the only one I am aware of for which
this would be slowed down by a stall if either rlow or rhigh were
generated just before the lea instruction. For the Atom, I think the
code above including rcr might be best (or rcr could be replaced by
shr).
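
For reference, here is a sketch of the same three variants in C (the
function names are invented for illustration), with explicit 32-bit
indices so the lost carry is visible:

#include <stdint.h>

/* (high + low)/2 computed at register width: the carry out of the
   top bit is lost, so the result is wrong when high + low wraps. */
uint32_t mid_plain(uint32_t low, uint32_t high)
{
    return (high + low) / 2;
}

/* What rcr achieves: keep the carry by doing the sum one bit wider,
   then shift it back in. */
uint32_t mid_carry(uint32_t low, uint32_t high)
{
    return (uint32_t)(((uint64_t)high + low) >> 1);
}

/* The overflow-free form discussed up-thread (assumes high >= low). */
uint32_t mid_safe(uint32_t low, uint32_t high)
{
    return low + (high - low) / 2;
}

(On x86-64, a compiler will typically turn mid_carry into exactly the
lea/shr pair above, since the widened sum fits a 64-bit register.)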

James
 
88888 Dihedral

On Wednesday, 28 December 2011 at 19:46:21 UTC+8, James Harris wrote:
...


Understood, thanks. There's an interesting point here. The
straightforward code to implement that expression might be

mov rmid, rhigh
sub rmid, rlow
shr rmid, 1
add rmid, rlow

EDX is still free, so one could still experiment with using six registers.



but that could be improved to

mov rmid, rlow
add rmid, rhigh
rcr rmid, 1

because rcr will bring back in that 'lost' carry bit and IIRC rcr by
one bit is generally as fast as shifts and the rotates which don't
include the carry. (Rotates including the carry are not so fast if
multiple bit positions are moved at the same time.)

In the case in point since the top bits will not be set it can be
improved to

lea rmid, [rlow + rhigh]
shr rmid, 1

on most CPUs. The Intel Atom is the only one I am aware of for which
this would be slowed down by a stall if either rlow or rhigh were
generated just before the lea instruction. For the Atom, I think the
code above including rcr might be best (or rcr could be replaced by
shr).

James
 
BartC

James Harris said:
Understood. When I came to code this in 32-bit assembler I realised
that, since high and low can never have their top bits set, the simple
form would do. In general it seems that as long as they index multi-
byte quantities the sum cannot overflow.

It's a non-issue, though. In case anyone is interested, gcc -O3 seems
to have realised this and generated the shorter code.

When I looked at my gcc -O3 code, it seemed to use the longer form, but then
it also inlined the function and didn't have an issue with saving/restoring
registers.

But this is a good point: realising these signed values will only ever be
positive, so they can therefore be treated as unsigned values but with an
unused top bit. That's the kind of extra knowledge that can give assembly
programmers an edge over most compilers (with some exceptions like your
gcc).

However, even using this short form, which made my code 6% faster, it
was still slower than my gcc's output (based on my original simple
code; with inlining and some unrolling it might be marginally faster).
 
James Kuyper

You mean -1 may have a strange value on a 1s-complement machine? On a
2s-complement machine wouldn't it always be all bits set even if
assigned to an unsigned variable?

Yes (except possibly the padding bits, if any). That remains true even
if it's not a 2's complement machine. However, since the number of value
bits in size_t will differ from one implementation to another, "all bits
set" will represent a different value on different implementations. This
doesn't strike me as an important problem, but it is a correct statement.
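
A tiny C99 sketch (added here for illustration) makes the conversion
rule visible; the printed SIZE_MAX value is where implementations
differ:

#include <stdint.h>   /* SIZE_MAX */
#include <stdio.h>

int main(void)
{
    /* Converting -1 to an unsigned type reduces it modulo
       SIZE_MAX + 1, so (size_t)-1 is always SIZE_MAX, whatever
       the machine's representation of -1. */
    printf("(size_t)-1 == SIZE_MAX? %d\n", (size_t)-1 == SIZE_MAX);
    printf("SIZE_MAX = %zu\n", (size_t)SIZE_MAX);
    return 0;
}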
 
Ben Bacarisse

James Harris said:
James Harris said:
On Dec 23, 6:43 pm, Ben Bacarisse <[email protected]> wrote:
...
size_t binsearch(int a[], size_t length, int key)
Why use size_t instead of unsigned int or unsigned long int,
especially as you add it to sum in the test harness which is long int?
Why not use unsigned for sum in the test harness?

int is better yet because the output will then be the same across machines.
Using *any* unsigned type means the "not found" -1 return becomes a
value that can vary from implementation to implementation.

You mean -1 may have a strange value on a 1s-complement machine? On a
2s-complement machine wouldn't it always be all bits set even if
assigned to an unsigned variable?

No, the machine representation of -1 has no bearing on the result of the
C code[1]. What I mean is just that a return of -1 has a value of
SIZE_MAX and that varies from implementation to implementation. It's a
good value to return in a real binary search because a valid index can't
ever be (quite) that large -- the largest C-supported object of SIZE_MAX
bytes can have indexes no higher than SIZE_MAX - 1.

However, it's not a good return for a portable test program like this
because the sum of the returns will vary from machine to machine.
Alternatively, changing the simple sum to something like this:

size_t idx = binsearch(data, length, key);
sum += idx == (size_t)-1 ? -1 : (long)idx;

could have made the output more portable.
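
A slightly fuller sketch of that accumulation, reconstructed here for
illustration (binsearch and the data/keys arrays are assumed from the
test program under discussion; sum_results is an invented name):

#include <stddef.h>

size_t binsearch(int a[], size_t length, int key);

/* Sum the results portably: map the (size_t)-1 "not found"
   sentinel back to -1 instead of adding SIZE_MAX, whose value
   differs between implementations. */
long sum_results(int data[], size_t length, const int keys[], size_t nkeys)
{
    long sum = 0;
    for (size_t i = 0; i < nkeys; i++) {
        size_t idx = binsearch(data, length, keys[i]);
        sum += idx == (size_t)-1 ? -1 : (long)idx;
    }
    return sum;
}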
The thing is, when you add it to sum in the main routine, if sum
(defined as long int) were the same width as size_t, the calculation
would, AFAICS, be correct. But say size_t were unsigned 32-bit and
long int signed 64-bit. Wouldn't that add 4-billion-odd rather than
subtract one?

Yes, that's an example of the problem, but in the first case (where the
sizes coincide) you won't necessarily end up subtracting one. C does
not define the effect of integer overflow -- it's up to the
implementation -- so while "wrapping round" (thereby effectively
subtracting one) is common, it can't be relied on.
...


Understood. When I came to code this in 32-bit assembler I realised
that, since high and low can never have their top bits set, the simple
form would do. In general it seems that as long as they index multi-
byte quantities the sum cannot overflow.

Yes. Viewed from the C end, though, you can't be sure that sizeof(int)
is greater than 1.
It's a non-issue, though. In case anyone is interested, gcc -O3 seems
to have realised this and generated the shorter code.

Clever of it. That's yet another reason to rely on the optimiser. You
can write code that doesn't assume that sizeof(int) > 1 and the
compiler can re-write it when it knows that's true.

I wanted to spend some time on this in part to get familiar with
calling between asm and gcc (for which, if anyone else is interested
I've copied below a useful link that I found) but the rest will have
to remain a mystery for now. The bottom line, for me, is that, at
least for this test case, the code generated by gcc -O3 is very good
indeed. Looking at its asm output I can see both a number of standard
compiler optimisations and that it has recognised a number of
potential issues and dealt with them very well.

At least I now have two-way calling working between Nasm and C - and a
new respect for gcc!

http://www.cs.lmu.edu/~ray/notes/nasmexamples/

I am glad it was not wasted effort. I had assumed that most people
would look at what gcc does, scratch their head a bit and conclude that
there's really not much room for improvement. It's the kind of function
optimisers do very well at (as Jacob pointed out). I rather hoped 88888
Dihedral, in particular, would have a go, but I think I was wrong to
assume that his replies carry any real meaning.

[1] Except that it might have some bearing on what implementation-
defined behaviour is chosen for integer overflow.
 
Ben Bacarisse

BartC said:
When I looked at my gcc -O3 code, it seemed to use the longer form,
but then it also inlined the function and didn't have an issue with
saving/restoring registers.

But this is a good point: realising these signed values will only ever
be positive, so they can therefore be treated as unsigned values but
with an unused top bit. That's the kind of extra knowledge that can
give assembly programmers an edge over most compilers (with some
exceptions like your gcc).

Not quite. Those are all *unsigned* values. James's point is that
these are indexes into an object whose elements are more than one byte
in size. The maximum array index is then SIZE_MAX/sizeof(int).

Two things strike me now having read your post. I am surprised that gcc
-O3 can do this, and it seems my gcc does not (as you report). Also,
the assumption may be wrong. There was a thread here a while back about
whether an object can be bigger than SIZE_MAX. It probably can, so
the compiler would not be able to make the assumption.

<snip>
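
To spell the reasoning out as code (a sketch; midpoint is an invented
name): if no object exceeds SIZE_MAX bytes and sizeof(int) >= 2, both
indexes fit in SIZE_MAX/2 and their sum cannot wrap.

#include <stddef.h>
#include <stdint.h>   /* SIZE_MAX */
#include <assert.h>

size_t midpoint(size_t low, size_t high)
{
    /* Largest valid index into an int array, *if* no object can be
       bigger than SIZE_MAX bytes -- the disputed assumption. */
    assert(high <= SIZE_MAX / sizeof(int));
    assert(sizeof(int) >= 2);   /* true on mainstream targets */
    return (high + low) / 2;    /* cannot wrap under those asserts */
}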
 
BartC

Not quite. Those are all *unsigned* values. James's point is that
these are indexes into an object whose elements are more than one byte
in size. The maximum array index is then SIZE_MAX/sizeof(int).

I was using signed int myself, I forgot your original size_t types are
unsigned. In that case it would need to infer that the maximum index values
will be limited, if the address calculation (on my machine 4*index +
base-address, the latter limiting the index range even further) is not to
overflow.

(In any case that didn't occur to me; I kept to the longer calculation,
partly because that's what your original code used, and also because I
trusted that my gcc had kept it for a good reason!)
Two things strike me now having read your post. I am surprised that gcc
-O3 can do this, and it seems my gcc does not (as you report). Also,
the assumption may be wrong. There was a thread here a while back about
whether an object can be bigger than SIZE_MAX. It probably can, so
the compiler would not be able to make the assumption.

Is it possible this other (James') gcc uses a somewhat smaller SIZE_MAX? Or
maybe it is just smarter.

Anyway, whether any object can be bigger than SIZE_MAX would be known to the
compiler I would imagine.
 
James Harris

On Dec 28, 11:21 am, James Harris <[email protected]>
wrote:

....
Understood. When I came to code this in 32-bit assembler I realised
that, since high and low can never have their top bits set, the simple
form would do. In general it seems that as long as they index multi-
byte quantities the sum cannot overflow.

It's a non-issue, though. In case anyone is interested, gcc -O3 seems
to have realised this and generated the shorter code.

Seeing that Ben and Bart both got different generated code prompted me
to check why I got lea. Aargh! It turns out I had altered the source
at some point, replacing

middle = low + (high - low)/2;

with

middle = (high + low) / 2;

Sorry to mislead you guys. I may have made the change as a test right
at the beginning. At any rate I forgot it was there. All this time
I've been comparing against enhanced code.

For the record, simply changing the source code back to the original
has removed that particular lea from the gcc -O3 code and has made it
slower than the assembly code (by about the same amount, 5% to 10%).

Bart, you were right that this may be an area where a compiler fears
to tread.

James
 
Phil Carmody

BartC said:
I've tried one more compiler, and the results, now normalised and for
the fastest code (and using N=2500 for more accuracy), are:

gcc -O3       1.00   (using inlining)
My asm        1.12   (original without inlining or unrolling)
Pelles C -Ot  1.26
DMC -o        1.35
lcc-win32 -O  1.64

(Adjusted for loop overheads)

So while modern compilers might be ultra-sophisticated, most of these
aren't! Admittedly they are all free downloads...

If you're running on an Intel arch, you should at least try Intel's
icc too. I've practically never known gcc to beat icc.

Phil
 
Phil Carmody

James Harris said:
James Harris said:
On Dec 23, 6:43 pm, Ben Bacarisse <[email protected]> wrote:
...
size_t binsearch(int a[], size_t length, int key)
....
while (high > low) {
    size_t middle = low + (high - low)/2;
Why not middle = (high + low) / 2?

To avoid "overflow".  (Quotes because it's wrap-around with unsigned
types but it's still wrong.)  You'd need peculiar input to see this
in practice but it seems nicer to avoid the issue altogether.

Understood. When I came to code this in 32-bit assembler I realised
that, since high and low can never have their top bits set,

Stop right there. Why can they not have their top bits set?
Are you presuming I can't have an array of over 2^31 ints?
Can you justify that presumption by citing a relevant C standard?
Have you failed to notice that there are x86-based boxes with
256GB RAM on the market nowadays, and that 8GiB arrays shouldn't
be considered unexpected any more?

Phil
 
Phil Carmody

Ben Bacarisse said:
Not quite. Those are all *unsigned* values. James's point is that
these are indexes into an object whose elements are more than one byte
in size. The maximum array index is then SIZE_MAX/sizeof(int).

Two things strike me now having read your post. I am surprised that gcc
-O3 can do this, and it seems my gcc does not (as you report). Also,
the assumption may be wrong. There was a thread here a while back about
whether an object can be bigger than SIZE_MAX. It probably can, so
the compiler would not be able to make the assumption.

I think there have been several such threads. "calloc" was a common
component to the ones that I remember, as you have two explicit numbers
that need to be multiplied together to get the size in bytes.

Phil
 
88888 Dihedral

On Thursday, 29 December 2011 at 01:45:14 UTC+8, Bart wrote:
I was using signed int myself, I forgot your original size_t types are
unsigned. In that case it would need to infer that the maximum index values
will be limited, if the address calculation (on my machine 4*index +
base-address, the latter limiting the index range even further) is not to
overflow.

(In any case that didn't occur to me; I kept to the longer calculation,
partly because that's what your original code used, and also because I
trusted that my gcc had kept it for a good reason!)


Is it possible this other (James') gcc uses a somewhat smaller SIZE_MAX? Or
maybe it is just smarter.

Anyway, whether any object can be bigger than SIZE_MAX would be known to the
compiler I would imagine.

The assembly part is the final resort.

For an array int a[] of known size, a search for a 32-bit key can be
sped up by first using the top 12 to 16 bits as a hash, to limit the
range of the search in the array (a sketch of one such scheme follows
below).

Do I have to explicitly say the details here?

That would be a very, very lousy way to train young programmers.
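
For what it's worth, here is one hedged reading of that idea (my
sketch, not Dihedral's code; range_index, build_index and
indexed_search are invented names, and a 32-bit int with non-negative
keys is assumed): precompute, for each value of the key's top bits,
the slice of the sorted array holding keys with that prefix, then
binary-search only that slice.

#include <stddef.h>

#define PREFIX_BITS 12
#define NBUCKETS (1u << PREFIX_BITS)

/* start[b]..start[b+1] delimits the slice of the sorted array whose
   keys have their top PREFIX_BITS equal to b. */
struct range_index { size_t start[NBUCKETS + 1]; };

static unsigned prefix(int key)
{
    return (unsigned)key >> (32 - PREFIX_BITS);
}

/* One linear pass over the sorted array records where each prefix
   bucket begins. */
void build_index(const int a[], size_t n, struct range_index *ix)
{
    size_t i = 0;
    for (unsigned b = 0; b < NBUCKETS; b++) {
        ix->start[b] = i;
        while (i < n && prefix(a[i]) == b)
            i++;
    }
    ix->start[NBUCKETS] = n;
}

/* Binary search confined to the key's bucket. */
size_t indexed_search(const int a[], const struct range_index *ix, int key)
{
    size_t low = ix->start[prefix(key)];
    size_t high = ix->start[prefix(key) + 1];
    while (high > low) {
        size_t middle = low + (high - low) / 2;
        if (a[middle] < key)
            low = middle + 1;
        else
            high = middle;
    }
    return (low < ix->start[prefix(key) + 1] && a[low] == key)
        ? low : (size_t)-1;
}

The index costs one pass to build; each lookup then searches a slice
that is, for uniformly spread keys, roughly NBUCKETS times smaller.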
 
jacob navia

On 29/12/11 02:12, Phil Carmody wrote:
Stop right there. Why can they not have their top bits set?
Are you presuming I can't have an array of over 2^31 ints?

The original program uses "int" to index, and "int" under Windows,
Linux and Mac is at most 32 bits. This is not a problem of the
assembly but of the original C program.
Can you justify that presumption by citing a relevant C standard?

Yes. In the paragraph "Translation limits" the standard says that
an implementation is not required to support objects with
more than 65535 bytes.
Have you failed to notice that there are x86-based boxes with
256GB RAM on the market nowadays, and that 8GiB arrays shouldn't
be considered unexpected any more?

Those arrays need to be indexed with a long long.
 
Ben Bacarisse

jacob navia said:
On 29/12/11 02:12, Phil Carmody wrote:

The original program uses "int" to index, and "int" under Windows,
Linux and Mac is at most 32 bits. This is not a problem of the
assembly but of the original C program.

The program that sparked this comment does not use ints to index the
array; it uses size_t.
Yes. In the paragraph "Translation limits" the standard says that
an implementation is not required to support objects with
more than 65535 bytes.


Those arrays need to be indexed with a long long.

It depends what element type they have:

double big[1024][1024][1024];

needs no index larger than 1024 because the elements of big are 8 MiB in
size. Even so, I think size_t is a better choice for an index variable.
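
A quick check of that arithmetic (a sketch; sizeof is applied to
types, so no 8 GiB object is ever created, and a 64-bit size_t is
assumed for the second line):

#include <stdio.h>

int main(void)
{
    printf("element: %zu bytes\n", sizeof(double[1024][1024]));       /* 8 MiB */
    printf("whole:   %zu bytes\n", sizeof(double[1024][1024][1024])); /* 8 GiB */
    return 0;
}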
 
BartC

Phil Carmody said:
Stop right there. Why can they not have their top bits set?
Are you presuming I can't have an array of over 2^31 ints?
Can you justify that presumption by citing a relevant C standard?
Have you failed to notice that there are x86-based boxes with
256GB RAM on the market nowadays, and that 8GiB arrays shouldn't
be considered unexpected any more?

The address calculation for int-arrays usually involves multiplying by 2, 4
or 8. With an index having the top bit set, that will overflow the size_t
range. Whatever size_t happens to be.
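
A concrete instance of that wrap-around, assuming a 32-bit size_t
(the numbers here are illustrative):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t index  = UINT32_C(0x80000001);  /* top bit set */
    uint32_t offset = index * UINT32_C(4);   /* mod 2^32: wraps to 4 */
    printf("0x%08" PRIx32 " * 4 = 0x%08" PRIx32 "\n", index, offset);
    return 0;
}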
 
James Harris

On Dec 29, 1:12 am, Phil Carmody <[email protected]>
wrote:

....
....


Stop right there.

Have you been reading too many historical novels lately?
Why can they not have their top bits set?

Because, in the context of use, they are indices to multibyte values
so do not use all bits of a machine word. They are not pointers.
Understood. When I came to code this in 32-bit assembler I realised
that, since high and low can never have their top bits set, the simple
form would do. In general it seems that as long as they index multi-
byte quantities the sum cannot overflow.
Are you presuming I can't have an array of over 2^31 ints?
Can you justify that presumption by citing a relevant C standard?
Have you failed to notice that there are x86-based boxes with
256GB RAM on the market nowadays, and that 8GiB arrays shouldn't
be considered unexpected any more?

I don't understand. Are you thinking of 64-bit architectures? Or are
you suggesting that x86-32 can address 8 GiB in a single address
space?

James
 
Keith Thompson

BartC said:
The address calculation for int-arrays usually involves multiplying by 2, 4
or 8. With an index having the top bit set, that will overflow the size_t
range. Whatever size_t happens to be.

*Only* if you have an object bigger than SIZE_MAX bytes. Most
implementations don't support such objects.
 
Joe keane

Ben Bacarisse said:
I think it would be enough to show that
"beating" gcc -O2 is both possible and worth-while, but inlining is a
tool available to compilers that is often not available to hand coders.

It is 'available'; it may just not be the best use of time. A C coder
can type i-n-l-i-n-e-space; a compiler can say 'this function is a part
of that function', 'apply the usual algorithms' and be done in
milliseconds. For a human coder it may be 'inline, no prob'; then
'suppose that we switch esi and edi in the called code, we can use --
trick, so it gains one cycle'; 'suppose you do loop control by this
trick'; then a few days later 'suppose that we switch eax and ecx in
the called code, we can use -- trick, so it gains one cycle', and let
me go to sleep.
 
