Performance of hand-optimised assembly


Seebs

So while modern compilers might be ultra-sophisticated, most of these
aren't! Admittedly they are all free downloads...

The commercial compiler vendors usually regard gcc as a poor performer, I'm
told.

That said... One of the things compilers tend to be better at than humans
is *large* optimizations. Consider sqlite; while it's distributed with a
reasonably normal structure of source files and such, actual compilation is
done by merging all the sources into a single large file. Why? Because
the compiled output runs noticeably faster.

Compilers can afford to try experiments and optimizations that human users
usually can't.

-s
 

Geoff

The commercial compiler vendors usually regard gcc as a poor performer, I'm
told.

They naturally would, wouldn't they? I would expect them to come up
with benchmarks and lots of nice slide presentations touting their
products against each other and against the freeware stuff. I would be
more impressed with a commercial compiler vendor stating that xyz
freeware compiler is a good performer.

The trouble with GCC is that there are so many switches it's probably
impossible to test all the modes and actually come up with a truly
optimum configuration for a given target and application.

Back in the day, I was programming an embedded Z80/Z180 RTOS using
Microtec Research's MCC80. One day, I found an unusual bottleneck in a
function and nothing in the source could account for it. Examining the
generated assembly I found the thing was thrashing around with a
string of mov hl,hl; mov bc,hl; mov hl,bc; mov hl,hl; opcodes. Thirty
minutes on the phone with the maintenance programmer confirmed a bug
in the compiler. It was fixed in another hour, and a new release of the
compiler was emailed out. It's hard to do that these days.
 

Ben Bacarisse

BartC said:
In any case the 'winning entry' (gcc -O3) bypassed the whole
issue by inlining the code (which I considered cheating).

That's an interesting point. I think it would be enough to show that
"beating" gcc -O2 is both possible and worth-while, but inlining is a
tool available to compilers that is often not available to hand coders.
When it can be done, it's another point in favour of sticking to C and
relying on the compiler.

<snip>
 

Philip Lantz

BartC said:
Having to use that ghastly AT&T syntax is an argument *against* using
assembly, even if it is faster...

You can use the Intel syntax with gas. Use the pseudo-op
.intel_syntax noprefix
You can even use it in gcc inline assembly--open each inline group with
that and close with
.att_syntax prefix
I only bother to do it for code groups with more than a few
instructions.
 

Ben Bacarisse

Willem said:
Ben Bacarisse wrote:
) I'm starting a new thread, but it is prompted by this remark from 88888
) Dihedral <[email protected]> about a short binary search
) function I posted:
)
) <snip>
) On my laptop[1], gcc's optimiser (version 4.6.1) almost exactly halves the
) times. clang's code is a little better at -O0, but the optimised code
) is a little worse:
) -O0 -O1 -O2 -O3
) gcc 0.1128 0.0633 0.0571 0.0562
) g++ 0.1167 0.0633 0.0572 0.0564
) clang 0.1033 0.0669 0.0672 0.0672
)
) These times are microseconds per call. The loop overhead has been
) subtracted[2].

Have you tried enabling the processor-specific switches on gcc?

As per Jens's post, I just tried -march=native and got:

-O0 -O1 -O2 -O3
gcc 0.1127 0.0633 0.0571 0.0563
gcc -march=native 0.1124 0.0633 0.0567 0.0561

The tiny improvement in the -O2 and -O3 times is repeatable, but the -O0
time varies by two or three in the last digits anyway. It's not a
significant improvement on this processor.
) [1] Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz, 3MB cache

I can't at the moment test the AMD chip.
 

Ben Bacarisse

James Harris said:
...

I don't understand some of your code. Queries below.
Here is a test program:

#include <stdio.h>
#include <stdlib.h>

size_t binsearch(int a[], size_t length, int key)

Why use size_t instead of unsigned int or unsigned long int,
especially as you add it to sum in the test harness which is long int?
Why not use unsigned for sum in the test harness?

int is better yet because the output will then be the same across machines.
Using *any* unsigned type means the "not found" -1 return becomes a
value that can vary from implementation to implementation. For this
test code I should have used int, but size_t is the "right" return type
for a binary search function like this and I did not think to change it.
Why not high = length - 1?

Explained below.
Why not middle = (high + low) / 2?

To avoid "overflow". (Quotes because it's wrap-around with unsigned
types but it's still wrong.) You'd need peculiar input to see this
in practice but it seems nicer to avoid the issue altogether.
          if (key == a[middle])
               return middle;
          else if (key > a[middle])
               low = middle + 1;
          else high = middle;

Why not high = middle - 1?

Unsigned types, basically. The symmetric version that adds or subtracts
one from middle always keeps the target index in [low..high]. That
means high must start at length - 1 which is annoying for unsigned types
like size_t (because length == 0 becomes a special case). I keep the
target index in [low..high) (i.e. a half open interval) so high is set
to length at the start and to middle when the range is being cut to
lower portion.

There are lots of ways round this, but I think they all complicate an
otherwise simple algorithm. Perhaps the simplest is just to test length first
and return -1 if it's 0 before doing any real work. Maybe you have a
favourite method that can use size_t bounds?
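
For reference, the "test length first" variant might look like this sketch (my own, not from the thread; the middle == 0 guard is my addition to keep the unsigned high from wrapping):

```c
#include <stddef.h>

/* Closed-interval [low..high] variant with unsigned bounds: the
   length == 0 special case is handled before any real work, and
   high = middle - 1 is guarded so an unsigned high never wraps. */
size_t binsearch_closed(const int a[], size_t length, int key)
{
    if (length == 0)
        return (size_t)-1;                /* "not found" */
    size_t low = 0, high = length - 1;
    while (low <= high) {
        size_t middle = low + (high - low) / 2;
        if (key == a[middle])
            return middle;
        else if (key > a[middle])
            low = middle + 1;
        else {
            if (middle == 0)              /* would wrap below zero */
                return (size_t)-1;
            high = middle - 1;
        }
    }
    return (size_t)-1;
}
```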
The reason for the queries is I wrote an asm binsearch but when
comparing it with your code realised that you had some differences in
the approach.

Obviously there's no need to copy the approach in the body of the
function, but I'd say you should copy the prototype (at least at first).
If it turns out that you can get 20% better performance with asm that
uses different argument and/or return types, then that's an interesting
fact and should not be ignored just because I used size_t.
 

Weland

and so on) was a thing of the past. But maybe I'm wrong. Is it still
possible to get a significant speed-up over C by hand coding? How
much depends on the processor?

It is still possible, but it does greatly depend on the CPU and compiler.
For most common platforms, beating GCC is fairly difficult (or possible,
but not worthwhile IMO). However, as you go deeper in their line of ports,
it becomes more and more possible.

On platforms that have only fairly recently become supported, like TI's
MSP430 (mind you, that is a microcontroller), it can be done and it is
worthwhile. I use msp430-gcc at work; most of the time it generates
reasonable code, but it doesn't always make the best decisions. I don't
remember it generating incorrect code, but I do remember it generating
slow code more than once.

IMO, most of the time there's just no reason to do it on modern, high
performance architectures -- the difficulties in porting it later and the
additional time you spent just aren't worth it. But on memory-constrained
systems it might still be an option, as long as you have a sane separation
of portable and non-portable code.
 

jacob navia

Well, after 3 days, there are some things that look
fairly obvious:


1) Nobody here knows assembly, besides io_x, who is the only one
who really answered the question and presented an assembly
language program.
(Sorry if I missed one assembly language statement or discussion
in the text of all the messages posted into this strange thread)

2) Since the problem chosen by Ben was geared to problems where a
compiler is better than an assembly language programmer, the
C programmers must win. Ben tweaked the rules like this:

<quote>
I've felt (it's only a feeling) that hand coding for speed (rather
than to access features not available in C like extra wide
multiplies and so on) was a thing of the past.
</quote>

But exactly using features not available in "high level" languages
like C is what makes assembly endure today as a very good way of
programming (I would not say that assembly language is a "language"
in the sense of a "computer language"). So, if you eliminate one
of the main reasons, the other reasons left are surely not so
worthwhile.

3) Anyway, I passed the favorite compiler of this group through the
program, and after 30 minutes I had improved by 8% even with -O2.

I wanted to post something but I am not so convinced. Is it worth it?
I can already hear the screams of this crowd praying to their "portability"
god when an old computer hacker tells them:

ASSEMBLY LIVES!

jacob

:)
 

88888 Dihedral

jacob navia wrote on Monday, 26 December 2011 at 03:12:54 UTC+8:
<snip>

Computing pi, or RSA-2048 encoding and decoding, still pays well
nowadays if you use the 64-bit hardware multipliers directly.
 

BartC

1) Nobody here knows assembly, besides io_x, who is the only one
who really answered the question and presented an assembly
language program.

He did, but it's not clear how closely it corresponds to the C algorithm
presented. And anyway it seemed the results were of more interest than the
details, which are obviously going to be very specific to processor,
assembler, language and compiler.

(But here anyway is the asm code I used, in Nasm format, for x86-32, and
used as inline code for the body of a function taking parameters a, length
and key:

%define rlow ebx
%define rhigh ecx
%define rmiddle eax
%define rkey edi
%define rdata esi

mov rdata,[a]
mov rhigh,[length]
mov rkey,[key]
mov rlow,0

lab1:
cmp rhigh,rlow
jle labx

mov rmiddle,rhigh
sub rmiddle,rlow
shr rmiddle,1
add rmiddle,rlow

cmp [rmiddle*4+rdata],rkey
jz labret
jge lab3

lea rlow,[rmiddle+1]
jmp lab1

lab3:
mov rhigh,rmiddle
jmp lab1

labx: mov eax,-1
labret:
pop ebp
retn 12

)
 

jacob navia

On 25/12/11 20:36, 88888 Dihedral wrote:
Computing pi, or RSA-2048 encoding and decoding, still pays well
nowadays if you use the 64-bit hardware multipliers directly.


Look at this:

void doADD(double *L, double *R, double *result, size_t n)
{
for(size_t i=0; i<n; i++) result[i] = L[i] + R[i];
}

Assembly will always do better than standard C in the above specific
code. You can then use ADDPD and add 2 doubles at the same time
(under specific loop conditions).

Now, some compilers support OpenMP but not standard C. With
operator overloading it could be:

struct doubledouble { double hi; double lo; };
typedef struct doubledouble dd;

struct doubledouble operator+(dd left, dd right) {
asm("addpd %xmm1,%xmm0");
}

(Using windows calling conventions, with left in xmm0, right in
xmm1, and result in xmm0).

It would fit into the language much better. But that is
another discussion.
 

jacob navia

Interesting but I see some parts missing:

Where is the entry point and interface of the routine?

You do a pop ebp at the end but I do not see the
corresponding push.

Anyway, one of the optimizations that a programmer
can do and a compiler can't is the program-specific
hack. For instance, if you call a routine once or
twice and the routine's code is available, you can
extend the calling convention to *anything*, which is
basically what you assume when you suppose rdata is in esi.

This can be extended to squeeze some performance
by minimizing moves. That is difficult for a
specific case.

The programmer must be right for *THIS* program
and this data; a compiler must be correct for ALL
possible programs, which is quite another pair of shoes...

QUITE another.

But even at this scale assembly language manipulations are
very error prone, as every asm programmer knows very
well. The more you hack into calling conventions, the
less manageable and modifiable is your code.


GREAT! You do not move the data at all. But... what do you
do when you need one more argument or to use that
routine from the outside? Every usage needs a rewriting
in assembly of the calling routine.

Or a "sas", a routine that translates into the ad-hoc
routine with just push/pop. But that is going to be
MUCH slower... etc. You must know the limits.
 

88888 Dihedral

Bart wrote on Monday, 26 December 2011 at 04:23:21 UTC+8:
1) Nobody here knows assembly, besides io_x, who is the only one
who really answered the question and presented an assembly
language program.

He did, but it's not clear how closely it corresponds to the C algorithm
presented. And anyway it seemed the results were of more interest than the
details, which are obviously going to be very specific to processor,
assembler, language and compiler.

(But here anyway is the asm code I used, in Nasm format, for x86-32, and
used as inline code for the body of a function taking parameters a, length
and key:

%define rlow ebx
%define rhigh ecx
%define rmiddle eax
%define rkey edi
%define rdata esi

mov rdata,[a]
mov rhigh,[length]
mov rkey,[key]
mov rlow,0

lab1:

;while gap
cmp rhigh,rlow
jle labx

; zero gap version
; no need for gap>1
mov rmiddle,rhigh
sub rmiddle,rlow
shr rmiddle,1 ; divide by 2
add rmiddle,rlow ; get the middle index

cmp [rmiddle*4+rdata],rkey

SHL ?
jz labret
jge lab3

lea rlow,[rmiddle+1]
jmp lab1

lab3:
mov rhigh,rmiddle
jmp lab1

labx: mov eax,-1
labret:
pop ebp
retn 12

)
 

BartC

88888 Dihedral said:
Bart wrote on Monday, 26 December 2011 at 04:23:21 UTC+8:
;while gap
; zero gap version
; no need for gap>1

Sorry I didn't see the need for comments (in fact I took them out).

The code corresponds accurately to the original C (see OP) and with comments
is:

mov rdata,[a]
mov rhigh,[length] ;high=length
mov rkey,[key]
mov rlow,0 ;low=0

lab1:
cmp rhigh,rlow ;while high>low
jle labx

mov rmiddle,rhigh ;middle=(high-low)/2+low
sub rmiddle,rlow
shr rmiddle,1
add rmiddle,rlow

cmp [rmiddle*4+rdata],rkey ;if a[middle]==key
jz labret ; return middle (rmiddle already is eax)
jge lab3 ;else if a[middle]>key

lea rlow,[rmiddle+1] ;low=middle+1
jmp lab1

lab3:
mov rhigh,rmiddle ;high=middle (in case it's not obvious)
jmp lab1

labx:
mov eax,-1 ;return -1
labret:
pop ebp
retn 12
 

BartC

jacob navia said:
Interesting but I see some parts missing:

Where is the entry point and interface of the routine?

You do a pop ebp at the end but I do not see the
corresponging push.

The rest of the code is generated by the compiler. Immediately prior to the
start of the asm code, it will execute:

mov esi,[ebp-40] ;<data>
push dword [esi]
push dword 0
push dword [ebp-40] ;<data>
call binsearch
....
binsearch:
push ebp
mov ebp,esp ;

And on returning from the function:
add [ebp-20],eax ;<sum>

Nothing unexpected. (Note in the asm I posted, the host compiler will
convert [a], etc to [ebp+offset] before passing to the external assembler)
hacks. For instance if you call a routine once or
twice and the routine's code is available you can
extend the calling convention to *anything* what
you basically assume when you suppose rdata in esi.

rdata (which is just register esi) is loaded from the parameter 'a' inside
the function; there are no dependencies outside the function. (And this host
language - of the inline assembly - does not require register-saving.)
 

88888 Dihedral

io_x wrote on Monday, 26 December 2011 at 15:05:16 UTC+8:
io_x said:
Ben Bacarisse said:
I'm starting a new thread, but it is prompted by this remark from 88888
Dihedral <[email protected]> about a short binary search
function I posted:

| Do you consider to convert 5 to 10 lines of C to assembly?
| I did that 20 years ago on X86 cpu.
| It was a piece of cake to gain the 20-40% speed required if paid
| well.

I've felt (it's only a feeling) that hand coding for speed (rather than
to access features not available in C like extra wide multiplies and so
on) was a thing of the past. But maybe I'm wrong. Is it still possible
to get a significant speed-up over C by hand coding? How much depends
on the processor? How good are the modern optimising compilers that
would have to be beaten?

i modified your code to handle the C library bsearch and compare it with
my asm binsearch;
my asm one is slower [it seems 20% slower], but i like my 'slow' asm one
and don't like the C library one

the .asm one is 2x slower than the C one,
but i like the .asm one;
i don't know if this is right...
------------------
; return (u8*)-1 error, 0 not found, address of the found element in array
; if return 0 [not found]
; in ecx return the address element of the array < than key if exist
; else ecx=0
; in edx return the address element of the array > than key if exist
; else edx=0
; would be like the C library one, with the only
; difference return (u8*)-1 for error
;u8* __stdcall
;binSearchR(u8* InPKey, u8* InBase,
; u32 nElmBase, u32 SizeElmBase, i32 cmp(u8* elm1, u8* elm2));
; 0k,4j,8i,12b,16ra, 20InKey,24InBase,28P_nElmBase,32P_SizeElmBase,36P_cmp + 64
; 84 88 92 96 100
align 4
binSearchR:
push ebx
push esi
push edi
push ebp
sub esp, 64
;int3

Don't you use fastcall or direct embedding to save the push and pop
mess in the calling overheads ?



mov dword[esp+ 4], 0
mov dword[esp+ 8], 0 ; in ^4 and ^8 there is the last element > and <
cmp dword[esp+ 84], 0
je .e
cmp dword[esp+ 88], 0
je .e
mov edi, dword[esp+ 92]
mov ebx, dword[esp+ 100]
cmp dword[esp+ 96], 0
jle .e
cmp edi, 0
je .nf
mov esi, 0
jl .e
cmp ebx, 0
je .e
dec edi
jmp short .2
.nf: xor eax, eax
clc
jmp short .z
.e: xor eax, eax
dec eax
stc
jmp short .z
.2: cmp esi, edi
jg .nf
mov ebp, esi
add ebp, edi
shr ebp, 1 ; k=(i+j)/2
mov eax, dword[esp+ 96]
mul ebp
mov ecx, dword[esp+ 84]
cmp edx, 0
jne .e
add eax, dword[esp+ 88]
jc .e
mov dword[esp+ 0], eax
push ecx
push eax
call ebx
add esp, 8
cmp eax, 0
jle .3
dec ebp
mov ecx, dword[esp+ 0]
mov edi, ebp
mov dword[esp+ 4], ecx
jmp short .2
.3: jz .4
inc ebp
mov ecx, dword[esp+ 0]
mov esi, ebp
mov dword[esp+ 8], ecx
jmp short .2
.4: mov eax, dword[esp+ 0]
clc
.z: mov edx, dword[esp+ 4]
mov ecx, dword[esp+ 8]
lea esp, [esp+64]
pop ebp
pop edi
pop esi
pop ebx
ret 20

;0ra, 4P1, 8P2
align 4
cmpi32:
mov eax, dword[esp+ 4]
mov edx, dword[esp+ 8]
mov ecx, [eax]
cmp ecx, [edx]
jge .1
mov eax, -1
jmp short .z
.1: jz .2
mov eax, 1
jmp short .z
.2: mov eax, 0
.z:
ret

;u32 binsearch(i32 arr, u32 len, i32 key)
;0k,4j,8i,12b,16ra, 20P_arr,24P_len,28P_key
align 4
binsearch:
push ebx
push esi
push edi
push ebp
mov eax, dword[esp+ 20]
mov edx, dword[esp+ 24]
lea ecx, dword[esp+ 28]
push cmpi32
push 4
push edx
push eax
push ecx
call binSearchR
jnc .2
.e: mov eax, 0
jmp short .z ; error
.1: mov eax, -1
clc
jmp short .z ; not found
.2: mov ecx, dword[esp+ 20]
cmp eax, 0
je .1
sub eax, ecx
shr eax, 2
clc ; indx=(res-base)/4
.z:
pop ebp
pop edi
pop esi
pop ebx
ret 12
------------------
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define u32 unsigned
#define i32 int
#define u8 unsigned char
#define R return
#define P printf
#define F for
// const, how is smart...
#define cn const

/*
;return (u8*)-1 error, 0 not found, address of the found element in array
;if return 0 [not found]
; in ecx return the address element of the array < than key if exist
; else ecx=0
; in edx return the address element of the array > than key if exist
; else edx=0
would be like the C library one, with the only difference return (u8*)-1 for
error
*/
u8* __stdcall
binSearchR(u8* InPKey, u8* InBase, u32 nElmBase, u32 SizeElmBase,
i32 cmp(cn u8* elm1, cn u8* elm2) );
i32 __stdcall binsearch(int a[], size_t length, int key);

i32 Cbinsearch(int a[], size_t length, int key)
{
size_t low = 0, high = length;
while (high > low) {
size_t middle = low + (high - low)/2;
if (key == a[middle])
return middle;
else if (key > a[middle])
low = middle + 1;
else high = middle;
}
return -1;
}

i32 cmpu32s(const void* elm1, const void* elm2)
{u32 m1, m2;
m1=*(u32*)elm1; m2=*(u32*)elm2;
if(m1<m2) R -1;
else if(m1>m2) R 1;
else R 0;
}

int main(int argc, char** argv)
{time_t in, fin;
double d;
u32 v[]={1, 2, 5, 7, 8, 8, 9, 13, 20}, val=4, *w, *ww, i;
i32 ix, ixx;
u8 *r;


if(argc==2)
{int n=atoi(argv[1]);
int *data = (int *)malloc(n * sizeof *data);

if(data)
{int i, length, step, key;
long int sum;

for(i=0; i<n; i++) data[i]=2*i+1;
// Include one zero-length search for correctness testing
r = binSearchR(data, data, 0, sizeof(data[0]), cmpu32s);
if(r) printf("Error r=%p\n", (void*) r);
sum=0; time(&in);
for(length=1; length<=n; length++)
for(step=1; step<=length; step++)
for(key=0; key<=2*length; key+=step)
{r=binSearchR(&key, data, length, sizeof(data[0]), cmpu32s);
if(r) sum+=(r-(u8*)data)/sizeof(data[0]);
}
time(&fin);
P("myBsearch=%.3f ", d = difftime(fin, in));
printf("sum=%ld\n", sum);
sum=0; time(&in);
for(length=1; length<=n; length++)
for(step=1; step<=length; step++)
for(key=0; key<=2*length; key+=step)
{r=bsearch(&key, data, length, sizeof(data[0]), cmpu32s);
if(r) sum+=(r-(u8*)data)/sizeof(data[0]);
}
time(&fin);
P("libsearch=%.3f ", d = difftime(fin, in));
printf("sum=%ld\n", sum);
free(data);
}
}

if(argc==2)
{int n = atoi(argv[1]);
/* cast so C++ compiled code can be timed */
int *data = (int *)malloc(n * sizeof *data);
if (data)
{int i, length, step, key;
long int sum;

for (i = 0; i < n; i++) data[i] = 2*i+1;
/* Include one zero-length search for correctness testing */
time(&in);
sum = binsearch(data, 0, data[0]);
for(length = 1; length <= n; length++)
for(step = 1; step <= length; step++)
for (key = 0; key <= 2 * length; key += step)
sum += binsearch(data, length, key);
time(&fin);
P("AsmBsearc=%.3f sum=%ld\n", d = difftime(fin, in), sum);
time(&in);
sum = binsearch(data, 0, data[0]);
for(length = 1; length <= n; length++)
for(step = 1; step <= length; step++)
for (key = 0; key <= 2 * length; key += step)
sum += Cbinsearch(data, length, key);
time(&fin);
P("C Bsearc=%.3f sum=%ld\n", d = difftime(fin, in), sum);
free(data);
}
}
R 0;
}
--------------
C:>bins 3000
myBsearch=15.000 sum=-1433105604
libsearch=15.000 sum=-1433105604
AsmBsearc=16.000 sum=-1488664059
C Bsearc=8.000 sum=-1488664059
--------------
 

Stephen Sprunk

On 25/12/11 20:36, 88888 Dihedral wrote:
Computing pi, or RSA-2048 encoding and decoding, still pays well
nowadays if you use the 64-bit hardware multipliers directly.

Look at this:

void doADD(double *L, double *R, double *result, size_t n)
{
for(size_t i=0; i<n; i++) result[i] = L[i] + R[i];
}

Assembly will always do better than standard C in the above specific
code. You can then use ADDPD and add 2 doubles at the same time
(under specific loop conditions).


.... unless you have a vectorizing compiler, which will do the exact same
thing for you without the need for inline assembly. Also, when AVX
support comes out, that vectorizing compiler will start using AVX
instructions while your assembly is stuck using SSE. Oh, and that
compiler could generate vector code for _other_ architectures as well
(e.g. VIS), where your assembly code would break compilation.

Furthermore, there is no guaranteed win for using SSE; early SSE chips
simply decoded that into two FADDs, so there was no performance benefit
to all that work you invested in writing SSE assembly. (The same is
likely to be true with early AVX chips.)

S
 

jacob navia

On 26/12/11 21:24, Stephen Sprunk wrote:
On 25/12/11 20:36, 88888 Dihedral wrote:
Computing pi, or RSA-2048 encoding and decoding, still pays well
nowadays if you use the 64-bit hardware multipliers directly.

Look at this:

void doADD(double *L, double *R, double *result, size_t n)
{
for(size_t i=0; i<n; i++) result[i] = L[i] + R[i];
}

Assembly will always do better than standard C in the above specific
code. You can then use ADDPD and add 2 doubles at the same time
(under specific loop conditions).


... unless you have a vectorizing compiler, which will do the exact same
thing for you without the need for inline assembly.


Sure. That was a simple example, and no, gcc doesn't do it, at least
in the version I have.

Also, when AVX
support comes out, that vectorizing compiler will start using AVX
instructions while your assembly is stuck using SSE.

Sure, you can recompile your program, but I can't change a few loop
conditions since I am "stuck" with my old program, poor me.
Oh, and that
compiler could generate vector code for _other_ architectures as well
(e.g. VIS), where your assembly code would break compilation.

Sure, that super duper compiler could make my coffee too. Why not?
And it will be coffee-machine-independent coffee, portable from the
coffee machine to anywhere I would take it.
Furthermore, there is no guaranteed win for using SSE; early SSE chips
simply decoded that into two FADDs, so there was no performance benefit
to all that work you invested in writing SSE assembly.

So what? Those machines aren't interesting anymore.

(The same is
likely to be true with early AVX chips.)

S

Look, assembly is not for everyone. And while waiting for that super duper
compiler that will run anywhere, we could do it much more simply if we
would just subclass the operator +...

But that is surely not for you.
 

James Harris

James Harris said:
On Dec 23, 6:43 pm, Ben Bacarisse <[email protected]> wrote:
....
size_t binsearch(int a[], size_t length, int key)
Why use size_t instead of unsigned int or unsigned long int,
especially as you add it to sum in the test harness which is long int?
Why not use unsigned for sum in the test harness?

int is better yet because the output will then be the same across machines.
Using *any* unsigned type means the "not found" -1 return becomes a
value that can vary from implementation to implementation.

You mean -1 may have a strange value on a 1s-complement machine? On a
2s-complement machine wouldn't it always be all bits set even if
assigned to an unsigned variable?
 For this
test code I should have used int, but size_t is the "right" return type
for a binary search function like this and I did not think to change it.

The thing is, when you add it to sum in the main routine, if sum
(defined as long int) were the same width as size_t the calculation
would, AFAICS, be correct. But say size_t was unsigned 32-bit and long
int was signed 64-bit. Wouldn't that add 4 billion and odds rather
than subtract one?

....
To avoid "overflow".  (Quotes because it's wrap-around with unsigned
types but it's still wrong.)  You'd need peculiar input to see this
in practice but it seems nicer to avoid the issue altogether.

Understood. When I came to code this in 32-bit assembler I realised
that, since high and low can never have their top bits set, the simple
form would do. In general it seems that as long as they index multi-
byte quantities the sum cannot overflow.

It's a non-issue, though. In case anyone is interested, gcc -O3 seems
to have realised this and generated the shorter code.

....
Why not high = middle - 1?

Unsigned types, basically.  The symmetric version that adds or subtracts
one from middle always keeps the target index in [low..high].  That
means high must start at length - 1 which is annoying for unsigned types
like size_t (because length == 0 becomes a special case).  I keep the
target index in [low..high) (i.e. a half open interval) so high is set
to length at the start and to middle when the range is being cut to
lower portion.

There are lots of ways round this, but I think they all complicate an
otherwise simple algorithm.  Perhaps the simplest is just to test length first
and return -1 if it's 0 before doing any real work.  Maybe you have a
favourite method that can use size_t bounds?

No, no favourite method. Now I understand your approach I can see why
you do it that way.

....
Obviously there's no need to copy the approach in the body of the
function, but I'd say you should copy the prototype (at least at first).
If it turns out that you can get 20% better performance with asm that
uses different argument and/or return types, then that's an interesting
fact and should not be ignored just because I used size_t.

Timings vary on the machine I used possibly because it has other tasks
but taking an average by eye my best asm version was *slower* than
that produced by gcc -O3. How much? It's hard to be accurate but
I would say there's an average of about 5% to 10% difference.

I puzzled over this for a while. My initial asm code had been almost
the same as the code that Bart posted earlier. (BTW, to make the test
fair I compiled binsearch_test and binsearch.c or binsearch.nasm
separately.) The thing is, it was all what I would call "good" code:
efficiently laid out, using the right type of instructions. So why was
gcc's code quicker? In the end I even compared the asm output from the
compiler but couldn't see why it was quicker.

Finally I took gcc's asm output and assembled it. It ran as fast as
the version produced directly by the compiler which verified the test
was valid. Then I realised that the gcc code does not save ecx and
edx. (Part of the calling convention, I believe.) I had been
concentrating on the body of the code. To my surprise, adding a save
of both or either register brought the gcc code to about the same
speed as the asm.

So is it possible that this is heavily influenced by calling overhead
or is this a code alignment issue or is there some other cause? I
don't know.

I wanted to spend some time on this in part to get familiar with
calling between asm and gcc (for which, if anyone else is interested
I've copied below a useful link that I found) but the rest will have
to remain a mystery for now. The bottom line, for me, is that, at
least for this test case, the code generated by gcc -O3 is very good
indeed. Looking at its asm output I can see both a number of standard
compiler optimisations and that it has recognised a number of
potential issues and dealt with them very well.

At least I now have two-way calling working between Nasm and C - and a
new respect for gcc!

http://www.cs.lmu.edu/~ray/notes/nasmexamples/

James
 
