Optimisation questions

Eric Sosman · Feb 5, 2012

Efficiency for what? Did you actually read what we wrote about the
futility of such silly optimisations? Did you measure the performance of
your code?

He didn't even read with enough attention to get the mask right.

Raj Pashwar · Feb 5, 2012

Chances are, your table will be slower, especially if it is decent size.
A table look up means an address calculation and a memory fetch, which
even with superscalar architectures and large caches is likely to cost
you more than a single instruction register-register AND or shift.

By the way, why would you need to build the table in a different thread?
To waste memory and cycles?

You'd be better off with listening to what the others replied to your
question. Unless x must be signed, change it to unsigned and the
compiler will merrily optimise your division away.

Unfortunately x must be signed because -1 is used to denote "invalid
value". So I cannot take full advantage of people's suggestions.

I have tried these out. I find that the program speed does not seem to
change much, if I use x%8 or x&7 or x=-(x>>3)<<3 : disappointing! I did
not try the table yet, but it stands to reason that it will be quicker
just to look up a result, than to do a computation every time.

I have two other optimisation questions I am looking at.

In one loop called many times, I need to swap 2 variables. Currently I do
this with a temporary variable:
tmp=a; a=b; b=tmp;
I think it will save overhead to avoid this extra variable. I found on
the internet an "XOR trick", b^=a; a^=b; b^a=a; but the article said it
can be dangerous in some cases i.e. overflow.

I was thinking of this and I invented an alternative version which I
believe will work correctly in 100% of cases (uses SUM not XOR):
b+=a; a=b-a; b-=a;
Has anyone thought of this idea before? Would you say it will be cheaper
than the temporary variable?

Also program start up is very slow. It must read in 10 million data
points from a text file - access is slow (network file). There are 2
unsigned long integers (x,y) on each line (represented by strings) that I
read into a 10000000x2 array.

Right now I am using fgets() and atol. I don't have any good ideas to
optimise this except maybe setting up 100 threads to read different
sections of the file. Are there any faster ways to convert integers than
atol?

while(fgets(strg,999,fp))
{ ary[ln][0]=atol(strtok(strg," "));
ary[ln][1]=atol(strtok((char*)NULL,"\n"));
ln++; }

Thank You to everyone who has helped so far.

Kaz Kylheku · Feb 5, 2012

In one loop called many times, I need to swap 2 variables. Currently I do
this with a temporary variable:
tmp=a; a=b; b=tmp;
I think it will save overhead to avoid this extra variable. I found on
the internet an "XOR trick", b^=a; a^=b; b^a=a; but the article said it
can be dangerous in some cases i.e. overflow.

The astute reader must now come to this conclusion: troll.

Also program start up is very slow. It must read in 10 million data
points from a text file - access is slow (network file). There are 2
unsigned long integers (x,y) on each line (represented by strings) that I
read into a 10000000x2 array.

Right now I am using fgets() and atol. I don't have any good ideas to
optimise this except maybe setting up 100 threads to read different
sections of the file. Are there any faster ways to convert integers than
atol?

Oh yeah 100 threads: that will make bytes come down faster from the network.

What a hoot.

Eric Sosman · Feb 5, 2012

[...]it stands to reason that it will be quicker
just to look up a result, than to do a computation every time.

Two questions for the astute reasoner:

1) At full throttle, how many instructions can your computer's
CPU execute in one second?

2) At full throttle, how many fetches can your computer's RAM
satisfy in one second?

Since you've already mentioned "cache" you are presumably aware
of it. Does the mere existence of cache memory suggest anything
about the answers to (1) and (2)?

Keith Thompson · Feb 5, 2012

Raj Pashwar said:
Unfortunately x must be signed because -1 is used to denote "invalid
value". So I cannot take full advantage of people's suggestions.

Most people were suggesting dividing by 8 and letting the compiler
generate the best possible code.

I have tried these out. I find that the program speed does not seem to
change much, if I use x%8 or x&7 or x=-(x>>3)<<3 : disappointing! I did
not try the table yet, but it stands to reason that it will be quicker
just to look up a result, than to do a computation every time.

No, it does not stand to reason. A table lookup requires performing
computations to determine the location of the desired entry, as well as
a memory access to retrieve it. Do not depend on your intuition to tell
you what micro-optimizations will make your code faster.

I have two other optimisation questions I am looking at.

In one loop called many times, I need to swap 2 variables. Currently I do
this with a temporary variable:
tmp=a; a=b; b=tmp;
I think it will save overhead to avoid this extra variable. I found on
the internet an "XOR trick", b^=a; a^=b; b^a=a; but the article said it
can be dangerous in some cases i.e. overflow.

I was thinking of this and I invented an alternative version which I
believe will work correctly in 100% of cases (uses SUM not XOR):
b+=a; a=b-a; b-=a;
Has anyone thought of this idea before? Would you say it will be cheaper
than the temporary variable?

Yes, it's been thought of. It's probably even less reliable than the
xor trick; any of the computations can overflow or underflow, producing
undefined behavior.

The way to swap two variables is to use a temporary

const int old_a = a;
a = b;
b = old_a;

What makes you think that copying values *and* performing arithmetic on
them is going to be faster than just copying them?

Also program start up is very slow. It must read in 10 million data
points from a text file - access is slow (network file). There are 2
unsigned long integers (x,y) on each line (represented by strings) that I
read into a 10000000x2 array.

Right now I am using fgets() and atol. I don't have any good ideas to
optimise this except maybe setting up 100 threads to read different
sections of the file. Are there any faster ways to convert integers than
atol?

while(fgets(strg,999,fp))
{ ary[ln][0]=atol(strtok(strg," "));
ary[ln][1]=atol(strtok((char*)NULL,"\n"));
ln++; }

The time spent reading data from the file is likely to greatly exceed
the time spent converting it to integers. 10 million data points isn't
all that many. Can you cache the file locally?

Incidentally, formatting your code for legibility has no performance cost:

while (fgets(strg, 999, fp)) {
ary[ln][0] = atol(strtok(strg, " "));
ary[ln][1] = atol(strtok((char*)NULL, "\n"));
ln++;
}

There are probably better approaches than using strtok(); a solution
using strtol() is likely to be cleaner if not faster.

Michael Angelo Ravera · Feb 6, 2012

Hello :

My current program runs too slow. It uses x/8 and x%8 many times, where x
is an int in both cases.

I have replaced x/8 with x>>3, as this is faster (pretty sure).
However, I am not sure whether this is faster than x%8 :
x-=(x>>3)<<3;

"x & 7" and "x % 8" are logically equivalent for unsigned integers. The former has a chance of being faster. I need to check the definition for the residue of a negative number of a positive one, but, a reasonable definition makes "x & 7" work for one's complement implementations of negative numbersas well. It would be very reasonable to define -1 % 8 as === 7.

Dann Corbit · Feb 6, 2012

int div8_0(int x)
{
return x >> 3;
}

int div8_1(int x)
{
return x / 8;
}

Using gcc 4.7 with -O3 on this program:

#include <stdio.h>
#include <stdlib.h>

int div8_0(int x)
{
return x >> 3;
}

int div8_1(int x)
{
return x / 8;
}

int main(void)
{
int a = rand();
int result0 = div8_0(a);
int result1 = div8_1(a);
printf("%d / 8 = %d, %d\n", a, result0, result1);
return 0;
}

gave this output, which adds additional instructions for the division as
opposed to the shift:

C:\tmp>type test.s
.file "test.c"
.text
.p2align 4,,15
.globl div8_0
.def div8_0; .scl 2; .type 32; .endef
.seh_proc div8_0
div8_0:
.seh_endprologue
movl %ecx, %eax
sarl $3, %eax
ret
.seh_endproc
.p2align 4,,15
.globl div8_1
.def div8_1; .scl 2; .type 32; .endef
.seh_proc div8_1
div8_1:
.seh_endprologue
leal 7(%rcx), %eax
testl %ecx, %ecx
cmovns %ecx, %eax
sarl $3, %eax
ret
.seh_endproc
.def __main; .scl 2; .type 32; .endef
.section .rdata,"dr"
..LC0:
.ascii "%d / 8 = %d, %d\12\0"
.section .text.startup,"x"
.p2align 4,,15
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
subq $40, %rsp
.seh_stackalloc 40
.seh_endprologue
call __main
call rand
leal 7(%rax), %r9d
testl %eax, %eax
movl %eax, %r8d
leaq .LC0(%rip), %rcx
movl %eax, %edx
cmovns %eax, %r9d
sarl $3, %r8d
sarl $3, %r9d
call printf
xorl %eax, %eax
addq $40, %rsp
ret
.seh_endproc
.def rand; .scl 2; .type 32; .endef
.def printf; .scl 2; .type 32; .endef

With similar output for the MS compiler:
; Listing generated by Microsoft (R) Optimizing Compiler Version
16.00.40219.01

include listing.inc

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

_DATA SEGMENT
$SG4006 DB '%d / 8 = %d, %d', 0aH, 00H
_DATA ENDS
PUBLIC div8_0
; Function compile flags: /Ogtpy
_TEXT SEGMENT
x$ = 8
div8_0 PROC
; File c:\tmp\test.c
; Line 6
sar ecx, 3
mov eax, ecx
; Line 7
ret 0
div8_0 ENDP
_TEXT ENDS
PUBLIC div8_1
; Function compile flags: /Ogtpy
_TEXT SEGMENT
x$ = 8
div8_1 PROC
; Line 11
mov eax, ecx
cdq
and edx, 7
add eax, edx
sar eax, 3
; Line 12
ret 0
div8_1 ENDP
_TEXT ENDS
PUBLIC main
EXTRN printf

ROC
EXTRN rand

ROC
pdata SEGMENT
$pdata$main DD imagerel $LN7
DD imagerel $LN7+53
DD imagerel $unwind$main
pdata ENDS
xdata SEGMENT
$unwind$main DD 010401H
DD 04204H
; Function compile flags: /Ogtpy
xdata ENDS
_TEXT SEGMENT
main PROC
; Line 15
$LN7:
sub rsp, 40 ; 00000028H
; Line 16
call rand
; Line 19
lea rcx, OFFSET FLAT:$SG4006
mov r11d, eax
cdq
and edx, 7
mov r8d, r11d
lea r9d, DWORD PTR [rdx+rax]
sar r8d, 3
mov edx, r11d
sar r9d, 3
call printf
; Line 20
xor eax, eax
; Line 21
add rsp, 40 ; 00000028H
ret 0
main ENDP
_TEXT ENDS
END

I didn't bother to time it, but it seems that state of the art compilers
create faster code for the shift than for the division.

I guess that the bottom line is that if you need the last little
smackerel of performance then you have better try all the alternatives
and benchmark them.

Willem · Feb 6, 2012

Dann Corbit wrote:
) Using gcc 4.7 with -O3 on this program:
)
) #include <stdio.h>
) #include <stdlib.h>
)
) int div8_0(int x)
) {
) return x >> 3;
) }
)
) int div8_1(int x)
) {
) return x / 8;
) }
)
) int main(void)
) {
) int a = rand();
) int result0 = div8_0(a);
) int result1 = div8_1(a);
) printf("%d / 8 = %d, %d\n", a, result0, result1);
) return 0;
) }
)
) gave this output, which adds additional instructions for the division as
) opposed to the shift:

That's probably because the compiler has a bit more freedom to deal with
shifts of negative integers. The extra instructions look like they fix
some issue with signed shifting on that architecture.

IOW: Another case of 'undefined behaviour enables faster code'.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Stephen Sprunk · Feb 6, 2012

Yes and the point is that this is what helps to optimize it better.

You missed the point. If x is negative, they are not guaranteed to have
the same behavior, so the compiler is not allowed to make such an
optimization. The human wins because (and _only_ because) he knows
something the compiler doesn't know.

Solution: tell the compiler what you know.

Anyway, let's see how GCC (4.5.2, x86) weighs in on this:

int div8_0(int x)
{
return x >> 3;
}

int div8_1(int x)
{
return x / 8;
}

[ asm elided ]

Your code appears to be amd64. Unfortunately, all I have access to at
the moment is an i686 box with an older GCC, so I'll give the (slightly
different) code I get:

div8_0:
movl 4(%esp), %eax
sarl $3, %eax
ret

div8_1:
movl 4(%esp), %eax
testl %eax, %eax
leal 7(%eax), %edx
cmovs %edx, %eax
sarl $3, %eax
ret

Note that the important bit, eliminating the division, still happens;
the performance loss due to the extra instructions will be minimal in
comparison and, most importantly, the result _will never be incorrect_,
unlike the human-optimized version that doesn't actually do what it says
it does.

Two more instructions in the div case!

Three more for me.

So the advice "use / 8; the compiler will generate the same code" is quite
obsolete. It was that way on some compilers for the C90 language.

Those extra instructions are to deal with the possibility that x is
negative.

If I make x (and the return value) unsigned, I get this:

div8_0:
movl 4(%esp), %eax
shrl $3, %eax
ret

div8_1:
movl 4(%esp), %eax
shrl $3, %eax
ret

Presto, the code is identical.

(For instance, if division has truncation toward negative infinity semantics,
and the arithmetic bit shifter shifts through the sign and copies the sign,
then such a reduction can be made.)

Isn't that what the extra instructions in the compiler's partially
strength-reduced version do?

The reduction can also be made if you somehow make make it obvious to the
compiler that the variable cannot be negative.

Indeed. When I changed the functions like this:

int div8_0(int x)
{
if (x<0) return 0;
return x >> 3;
}

int div8_1(int x)
{
if (x<0) return 0;
return x / 8;
}

I got identical asm code, indicating the compiler was smart enough to
figure the full strength reduction was safe even for a "signed" operand:

div8_0:
movl 4(%esp), %edx
xorl %eax, %eax
movl %edx, %ecx
sarl $3, %ecx
testl %edx, %edx
cmovns %ecx, %eax
ret

div8_1:
movl 4(%esp), %edx
xorl %eax, %eax
movl %edx, %ecx
sarl $3, %ecx
testl %edx, %edx
cmovns %ecx, %eax
ret

(Doing that by making it unsigned, however, has its downsides: like
introducing bugs due to the discontinuity in unsigned in the neighborhood
of zero, and surprising value changes on conversion.)

If you're worried about that, it means you're not really sure that your
x is as non-negative as your micro-optimization assumes it is. IOW, the
optimization is invalid.

S

Kaz Kylheku · Feb 6, 2012

You missed the point. If x is negative, they are not guaranteed to have
the same behavior, so the compiler is not allowed to make such an
optimization. The human wins because (and _only_ because) he knows
something the compiler doesn't know.

Solution: tell the compiler what you know.

The human usually knows something that the compiler does not know. Languages
that require full disclosure of what the human knows are probably going to
be too impractical to be useful.

Often there are things that neither the compiler nor the human knows.

We write a computer program because we want to compute something we don't know,
and in many case the program can't be statically evaluated to give that to us;
it has to actually be run.

It /would/ be nice if assert (x >= 0) could be understood as as an optimization
hint by the compiler (and even with NDEBUG in effect).

Note that the important bit, eliminating the division, still happens;

That depends on what you regard as the important bit. What if the programmer
takes it for granted that x / 8 will not actually emit a division, but still
wants to shave off these extra cycles from not worrying about a negative input.

the performance loss due to the extra instructions will be minimal in
comparison and, most importantly, the result _will never be incorrect_,
unlike the human-optimized version that doesn't actually do what it says
it does.

Three more for me.

Those extra instructions are to deal with the possibility that x is
negative.

Exactly, so if we know that it isn't, how can we express that to the compiler?

The least risky way is to let the compiler deduce it by itself. Explicit
promises are risky because they could turn out to be lies.

If you're worried about that, it means you're not really sure that your
x is as non-negative as your micro-optimization assumes it is. IOW, the
optimization is invalid.

As you know, the way the language works is that if you make some things
unsigned, there is a "collateral damage": some other operands silently become
unsigned also.

What if x is never negative, but is involved in calculations with other
operands that may be negative? Change x to unsigned, and now certain
unwanted conversions are happening which will render those calculations
incorrect.

Unsigned will cause a "contagion" that spreads in the assignment of types
to expressions in the parse tree.

So that would probably be the most risky way of telling the compiler "this
quantity is nonnegative".

You can confine the contagion with some casts. Convert the value to unsigned,
do the division and then convert back to signed. (That relies on the conversion
being a two's complement noop).

Markus Wichmann · Feb 7, 2012

Using gcc 4.7 with -O3 on this program:

#include <stdio.h>
#include <stdlib.h>

int div8_0(int x)
{
return x >> 3;
}

int div8_1(int x)
{
return x / 8;
}

int main(void)
{
int a = rand();
int result0 = div8_0(a);
int result1 = div8_1(a);
printf("%d / 8 = %d, %d\n", a, result0, result1);
return 0;
}

gave this output, which adds additional instructions for the division as
opposed to the shift:

C:\tmp>type test.s
.file "test.c"
.text
.p2align 4,,15
.globl div8_0
.def div8_0; .scl 2; .type 32; .endef
.seh_proc div8_0
div8_0:
.seh_endprologue
movl %ecx, %eax
sarl $3, %eax
ret
.seh_endproc

So, this is really just eax = ecx >> 3.

.p2align 4,,15
.globl div8_1
.def div8_1; .scl 2; .type 32; .endef
.seh_proc div8_1
div8_1:
.seh_endprologue
leal 7(%rcx), %eax
testl %ecx, %ecx
cmovns %ecx, %eax
sarl $3, %eax
ret
.seh_endproc

While this is:

eax = ecx + 7
if (ecx >= 0) eax = ecx
eax >>= 3

Minor difference! The added 7 is there to round towards zero in case ecx
is negative.

[...]

With similar output for the MS compiler:
; Listing generated by Microsoft (R) Optimizing Compiler Version
16.00.40219.01

include listing.inc

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

_DATA SEGMENT
$SG4006 DB '%d / 8 = %d, %d', 0aH, 00H
_DATA ENDS
PUBLIC div8_0
; Function compile flags: /Ogtpy
_TEXT SEGMENT
x$ = 8
div8_0 PROC
; File c:\tmp\test.c
; Line 6
sar ecx, 3
mov eax, ecx
; Line 7
ret 0
div8_0 ENDP

Again, this is eax = ecx >> 3.

_TEXT ENDS
PUBLIC div8_1
; Function compile flags: /Ogtpy
_TEXT SEGMENT
x$ = 8
div8_1 PROC
; Line 11
mov eax, ecx
cdq
and edx, 7
add eax, edx
sar eax, 3
; Line 12
ret 0
div8_1 ENDP

Okay, that's some cool code... this is:

eax = ecx
if (eax < 0) eax += 7
eax >>= 3

Same meaning, different code.
[...]

I didn't bother to time it, but it seems that state of the art compilers
create faster code for the shift than for the division.

Ya think? Well then, let's roll out the optimization guide, shall we? My
optimization guide states that up to three decoder units can be used in
parallel, so we have:

div8_0:
Opcode DUu L
mov eax, ecx 1 1
shr eax, 3 1 1

Total Latency: 1 cycle

DUu = Decoder Units used
L = Latency

div8_1 (by gcc):

lea eax, [rcx+7] 1 1
test ecx, ecx 1 1
cmovns eax, ecx 1 1
shr eax, 3 1 1

Total Latency: 2 cycles

div8_1 (by MSVC):

mov eax, ecx 1 1
cdq 1 1
and edx, 7 1 1
add eax, edx 1 1
shr eax, 3 1 1

Total Latency: 2 cycles

So yeah, div8_0 _is_ faster than div8_1.

I guess that the bottom line is that if you need the last little
smackerel of performance then you have better try all the alternatives
and benchmark them.

You could simply tell the compiler that that number won't be negative,
in which case adding code for rounding towards zero won't be necessary.
But no, you have to do it the hard way.

HTH,
Markus

lawrence.jones · Feb 7, 2012

Nick Keighley said:
You /can/ improve the performance of a program without measuring all
the different bits. I know I've done it (tweaking a hash calculation
resulted in a massive performance improvement). Sometimes educated
guess-work /is/ good enough.

Sometimes, but not very often. I've done a lot of performance
improvement over the years and have found that the bottlenecks
are very rarely where you expect them to be.

Dann Corbit · Feb 7, 2012

You could simply tell the compiler that that number won't be negative,
in which case adding code for rounding towards zero won't be necessary.
But no, you have to do it the hard way.

The original poster had signed integers.
I have no idea if it matter if they are signed or not.
I do understand the opimizations available for unsigned integers, since
I write bitmap based chess programs for a hobby.

Phil Carmody · Feb 21, 2012

Keith Thompson said:
"hit" --> "hint"

Nah, it's a hit - like a drug. You think it's doing you good, but
you're just deluding yourself, and need help!

Phil
--

I'd argue that there is much evidence for the existence of a God.

Pics or it didn't happen.
-- Tom (/. uid 822)

More optimisation	24	Jun 16, 2009
Hello from beginner with some questions!	3	Jul 30, 2021
C Programming functions	2	Dec 3, 2021
-2147483648 and gcc optimisation, all sorts of different results	6	Mar 11, 2010
Weird Behavior with Rays in C and OpenGL	4	Feb 13, 2024
Problem with codewars.	5	Dec 4, 2023
Range / empty list issues??	1	Dec 11, 2023
Desmos is insanely fast	8	Jul 12, 2023

Optimisation questions

Eric Sosman

Raj Pashwar

Kaz Kylheku

Eric Sosman

Keith Thompson

Michael Angelo Ravera

Dann Corbit

Willem

Stephen Sprunk

Kaz Kylheku

Markus Wichmann

lawrence.jones

Dann Corbit

Phil Carmody

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads