More optimisation

kid joe

Hi all,

I know this is fairly system specific but I was hoping some people here
with a wide experience of C implementations on different platforms could
pass on their knowledge.

If I want to perform actions on several large arrays, what's the best way
to do that? Let's say for a silly example I have arrays a, b, c all of size n
and I want to multiply everything in a by 2, everything in b by 3 and
everything in c by 4.

Method 1:
for(long int i = 0; i < n; i++) {
  a[i] *= 2;
  b[i] *= 3;
  c[i] *= 4;
}

Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;

Now you could argue that Method 1 should be faster because there are fewer
branching instructions, and branching is expensive. But then you could
argue that Method 2 should be better because there's better data locality,
so better use of cache.

I know people will say to profile it on my system! :) But I'd be interested
in people's thoughts on this on a variety of different hardware.

Cheers,
Joe
 
user923005

Hi all,

I know this is fairly system specific but I was hoping some people here
with a wide experience of C implementations on different platforms could
pass on their knowledge.

If I want to perform actions on several large arrays, what's the best way
to do that? Let's say for a silly example I have arrays a, b, c all of size n
and I want to multiply everything in a by 2, everything in b by 3 and
everything in c by 4.

Method 1:
for(long int i = 0; i < n; i++) {
  a[i] *= 2;
  b[i] *= 3;
  c[i] *= 4;
}

Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;

Now you could argue that Method 1 should be faster because there are fewer
branching instructions, and branching is expensive. But then you could
argue that Method 2 should be better because there's better data locality,
so better use of cache.

I know people will say to profile it on my system! :) But I'd be interested
in people's thoughts on this on a variety of different hardware.


My opinion is that these optimizations are very, very silly.

You should not perform these optimizations unless:
1. The routine does not meet performance specifications
2. The profile reveals that initialization of arrays is the bottleneck

Array initialization is linear for a 1-D array, and so it is probably
the least of your worries.

It's like buying Pirelli P3 tires for an old Volkswagen beetle.
It's costly, it's silly, and it won't make the beetle go any faster.

P.S.
You are also right about the profile. I guess that different machines
and different compilers will give different answers. And if you
change the data types you may get different answers yet again.

P.P.S.
The first rule of optimization is "DON'T DO IT."
The second rule of optimization (for experts only) is "Don't do it
*yet*."

These are serious rules. A practice of premature optimization because
you think something might go faster due to tweaky little hacks is a
very, very, very, very bad coding practice. Not only will it make the
code more expensive, it is also going to be a complete waste of time
in almost all cases.

If you want to make code go faster, the very last place you should
ever imagine going is hopping on board the good ship "tweaky hacks."
Instead, improve the algorithm. Chances are very good you can find an
algorithm in the literature that will improve the code. If you cannot
find one, consider thinking about how to improve the algorithm
yourself.

I guess you will ignore this advice, since it did not stick last
time. That's too bad.
 
Kenny McCormack

Hi all,

I know this is fairly system specific but I was hoping some people here
with a wide experience of C implementations on different platforms could
pass on their knowledge.

If I want to perform actions on several large arrays, what's the best way
to do that? Let's say for a silly example I have arrays a, b, c all of size n
and I want to multiply everything in a by 2, everything in b by 3 and
everything in c by 4.

Method 1:
for(long int i = 0; i < n; i++) {

Of course you know, all you will get in the way of responses is people
pointing out that the above is a syntax error in C (C89, of course,
since that's all there is).

Or you may, if you are lucky, get responses telling you, in patronizing
detail, how you need to flag the above as "C99" (or gcc, or whatever) only.
 
Tim Prince

kid said:
I know this is fairly system specific but I was hoping some people here
with a wide experience of C implementations on different platforms could
pass on their knowledge.

If I want to perform actions on several large arrays, what's the best way
to do that? Let's say for a silly example I have arrays a, b, c all of size n
and I want to multiply everything in a by 2, everything in b by 3 and
everything in c by 4.

Method 1:
for(long int i = 0; i < n; i++) {
  a[i] *= 2;
  b[i] *= 3;
  c[i] *= 4;
}

Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;

Now you could argue that Method 1 should be faster because there are fewer
branching instructions, and branching is expensive. But then you could
argue that Method 2 should be better because there's better data locality,
so better use of cache.

For several years, all desktop CPUs (and laptops of similar families)
have supported at least 4 write combine buffers per logical or physical
processor, thus efficiently supporting 4 or more parallel paths to
memory. There are some still on the market which don't efficiently
support vectorized store without 16-byte alignment, so then you might
have an argument for splitting it up.
Without restrict or other indications to the compiler that the arrays
don't alias, you will kill performance with multiple arrays per loop.

Certain compilers, with appropriate options and syntax, will choose
these optimizations for you.
The loop length threshold where OpenMP parallelism pays off will be
lower with more work in the loop, if you remove the other possible
obstacles to optimization.
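
For concreteness, a minimal sketch of the aliasing hint mentioned above;
restrict is C99, and the function name is purely illustrative:

void scale3(double * restrict a, double * restrict b,
            double * restrict c, long n)
{
    /* restrict promises the compiler that a, b and c never overlap,
       so it is free to vectorize the combined loop */
    for (long i = 0; i < n; i++) {
        a[i] *= 2;
        b[i] *= 3;
        c[i] *= 4;
    }
    /* with OpenMP one could additionally put
       #pragma omp parallel for   in front of the loop */
}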
 
Stephen Sprunk

kid said:
Method 1:
for(long int i = 0; i < n; i++) {
  a[i] *= 2;
  b[i] *= 3;
  c[i] *= 4;
}

Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;

Now you could argue that Method 1 should be faster because there are fewer
branching instructions, and branching is expensive. But then you could
argue that Method 2 should be better because there's better data locality,
so better use of cache.


I expect the better data locality in method 2 would give better
performance unless n is very small; your loops aren't complex enough to
keep a modern desktop/server CPU busy in between waiting for cache line
fills.

Branching is _not_ expensive on such CPUs (though it may be on old or
embedded CPUs); what is really expensive is a _mispredicted_ branch, and
in simple for loops like the above, the branch should be correctly
predicted for every iteration except the last (and perhaps the first).
The rest of the branches are effectively free. The increments and
comparisons before the branches will just fill in pipeline bubbles and
are also effectively free. However, those costs may reassert themselves
if your actual loops are more complicated than shown above and there
aren't any bubbles to fill.

S
 
Kaz Kylheku

Hi all,

I know this is fairly system specific but I was hoping some people here
with a wide experience of C implementations on different platforms could
pass on their knowledge.

If I want to perform actions on several large arrays, what's the best way
to do that?

The best way is to look at the memory hierarchy of the processor you are using.
Try to avoid needlessly re-loading the same parts of an array from one level of
caching to another; e.g. from main memory to the L2 cache, or from L2 to L1.
Let's say for a silly example I have arrays a, b, c all of size n
and I want to multiply everything in a by 2, everything in b by 3 and
everything in c by 4.

Method 1:
for(long int i = 0; i < n; i++) {
  a[i] *= 2;
  b[i] *= 3;
  c[i] *= 4;
}

Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;

Now you could argue that Method 1 should be faster because there are fewer
branching instructions, and branching is expensive. But then you could
argue that Method 2 should be better because there's better data locality,
so better use of cache.


This caching analysis is false. Neither method uses caching particularly well.
This program simply doesn't benefit from caching very much, because it
processes large (or presumably large) arrays sequentially. From a caching point
of view, it makes little difference whether we process each of three large
arrays separately, or march through them together.

The problem is that our program never looks at a value once it has
updated it. Caching provides the most benefit when you access the same thing
you have recently accessed. If you keep accessing new material, caching
provides less of a benefit. The organization of the cache system can still
speed up the accesses when the pattern is linear, because memory access is
parallelized: entire cache lines are being sucked into the cache in a burst
mode memory access, and in parallel through a wide bus, rather than individual
words. So the fact that you are processing words in order is a kind of locality
which helps to take advantage of the parallelism inherent in the cache's
organization into lines.

The one caching consideration may be that the arrays are subject to worst-case
cache aliasing. Suppose that a[], b[] and c[] are situated at such addresses
that their corresponding parts hash to the same cache line, and suppose that
your processor's cache is of the simplest possible type: direct mapped. So
a[0], b[0] and c[0] land on the same cache line and so on. This means that the
first example will thrash horribly. The very first access a[0] *= 2 will load an
entire cache line. So you have a[1], a[2] through, say, a[7] ready for fast
access. But you don't take advantage of it: the next access after a[0] is not
a[1] but b[0]. And in our worst-case situation, this maps to the same cache
line as a[0], with a direct mapping. Oops! The cache line is dirty since we
just stored a[0], so the whole line has to be written out first. After being
written out, the cache line is loaded with b[0] through b[7]. The b[0] is
updated in the cache line, and now c[0] is accessed. Oops! The entire cache
line is flushed and re-loaded again with c[0] through c[7]. Then the thrashing
repeats with a[1], over the same three cache lines, and so on.

This potential problem does not threaten the second example, since you are
marching through large areas of memory in order; you are not accessing three
different large areas of memory at some fixed displacement from each other.

The first example will run about as fast as the second if the cache lines do
not conflict. That is to say, if the parallel elements of a[], b[] and c[] map
to three separate cache lines you are okay. It doesn't matter whether your
cache has 32 lines in total or 256. You just need three unique ones at any time
so that the loop can suck data without interference.
Aliasing problems can occur at multiple levels in the memory hierarchy.

And also in other kinds of caches, like TLBs for virtual address translation.

Suppose you have a direct mapped TLB, and the code alternates its accesses
between two virtual pages that clash to the same TLB entry. Poof; you are
playing a game of TLB miss-and-reload ping pong.
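
To make the direct-mapped arithmetic concrete, here is a toy model of the
lookup described above, with an assumed geometry (64-byte lines, 32 KB cache):

#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 64
#define NUM_LINES  512            /* 32 KB / 64 B, assumed geometry */

static unsigned line_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
}

int main(void)
{
    /* if the arrays sit exactly one cache-size apart, a[i], b[i] and c[i]
       all hash to the same line and evict each other on every access */
    uintptr_t a = 0x100000;
    uintptr_t b = a + NUM_LINES * LINE_BYTES;
    uintptr_t c = b + NUM_LINES * LINE_BYTES;
    printf("a[0] -> line %u, b[0] -> line %u, c[0] -> line %u\n",
           line_index(a), line_index(b), line_index(c));
    return 0;
}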
 
Antoninus Twink

These are serious rules. A practice of premature optimization because
you think something might go faster due to tweaky little hacks is a
very, very, very, very bad coding practice.

This is true up to a point.

However, *all other things being equal* it's obviously better to choose
code that's likely to run faster on most platforms (as opposed to the
Heathfield method of choosing code likely to run slower on most
platforms).

I think the OP was saying: "OK, I've got this situation where I could do
this in two different ways. Both are equally clear and readable, both
are correct and maintainable, so why not let probable speed on probable
architectures be the determining factor as to which one I pick?"

And why not indeed? Even if you see efficiency (or in Heathfield's case,
inefficiency) as the very last entry in your lexicographic order of the
desirable properties of code, it still beats tossing a coin.
 
Hamiral

kid joe wrote:
Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;


I'd say Method 2 is faster, but if you like micro-optimization, I'd
do it like this:

long int i;
int *tmp_ptr; /* assuming a, b and c are int arrays */

tmp_ptr = a;
for(i=0; i<n; i++) {
  *tmp_ptr <<= 1; // equivalent to *=2
  tmp_ptr++;
}
tmp_ptr = b;
for(i=0; i<n; i++) {
  *tmp_ptr *= 3;
  tmp_ptr++;
}
tmp_ptr = c;
for(i=0; i<n; i++) {
  *tmp_ptr <<= 2; // equivalent to *=4
  tmp_ptr++;
}

But maybe assigning tmp_ptr each time makes you lose the speed you won
by using bitshift and pointer incrementation...

As a final note, I'd say you should rely on your compiler. Unless you're
on very low-end hardware, I really doubt you can optimize better than
your compiler does...

Ham
 
Edwin van den Oetelaar

kid said:
Hi all,

I know this is fairly system specific but I was hoping some people here
with a wide experience of C implementations on different platforms could
pass on their knowledge.

If I want to perform actions on several large arrays, what's the best way
to do that? Let's say for a silly example I have arrays a, b, c all of size n
and I want to multiply everything in a by 2, everything in b by 3 and
everything in c by 4.

Method 1:
for(long int i = 0; i < n; i++) {
  a[i] *= 2;
  b[i] *= 3;
  c[i] *= 4;
}

Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;



If your calculations are more expensive than *2, *3, *4, it could be worth
running multiple threads if you have multiple cores/CPUs.
Method 2 can be run in parallel; Method 1 cannot.
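
A minimal sketch with POSIX threads, one thread per array (error handling
omitted, and the struct and function names are made up):

#include <pthread.h>

struct job { double *arr; long n; double factor; };

static void *scale_job(void *p)
{
    struct job *j = p;
    for (long i = 0; i < j->n; i++)
        j->arr[i] *= j->factor;
    return NULL;
}

void run_parallel(double *a, double *b, double *c, long n)
{
    struct job jobs[3] = { { a, n, 2 }, { b, n, 3 }, { c, n, 4 } };
    pthread_t tid[3];
    for (int k = 0; k < 3; k++)
        pthread_create(&tid[k], NULL, scale_job, &jobs[k]);
    for (int k = 0; k < 3; k++)
        pthread_join(tid[k], NULL);   /* wait for all three to finish */
}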

Just my $0.02, benchmark if in doubt.

Gr,
Edwin
 
Stephen Sprunk

Hamiral said:
kid joe wrote:
Method 2:
for(long int i = 0; i < n; i++)
  a[i] *= 2;
for(long int i = 0; i < n; i++)
  b[i] *= 3;
for(long int i = 0; i < n; i++)
  c[i] *= 4;


I'd say Method 2 is faster, but if you like micro-optimization, I'd
do it like this:

long int i;
int *tmp_ptr; /* assuming a, b and c are int arrays */

tmp_ptr = a;
for(i=0; i<n; i++) {
  *tmp_ptr <<= 1; // equivalent to *=2
  tmp_ptr++;
}


A left-shift is _not_ always the same thing as multiplying by two,
particularly when it comes to signed values. Besides, if that
transformation were safe and indeed faster, the compiler would already
be doing it; they're very good at micro-optimizations like that, so the
only real wins come from algorithmic changes.

Also, if you're going to switch from an array-index loop to a pointer
loop, you might as well use the pointer as the loop variable/condition
and get rid of the index entirely.
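
A two-line illustration of why the shift rewrite is unsafe for signed types:

#include <stdio.h>

int main(void)
{
    int x = -3;
    printf("%d\n", x * 2);    /* well-defined: prints -6 */
    /* x << 1 would be undefined behavior here: C99 6.5.7 leaves
       left shifts of negative signed values undefined */
    return 0;
}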

S
 
Hamiral

Stephen Sprunk wrote:
A left-shift is _not_ always the same thing as multiplying by two,
particularly when it comes to signed values.

Oh yes, forgot that. I'm so used to dealing with unsigned values...
Besides, if that
transformation were safe and indeed faster, the compiler would already
be doing it; they're very good at micro-optimizations like that, so the
only real wins come from algorithmic changes.

Yes, but you say that just as if you didn't read the last paragraph of
my post...
Also, if you're going to switch from an array-index loop to a pointer
loop, you might as well use the pointer as the loop variable/condition
and get rid of the index entirely.

Yes, you could do it like this then :

int *tmp_ptr, *end_ptr;
tmp_ptr = a;
end_ptr = tmp_ptr + n*sizeof(*tmp_ptr);
for(;tmp_ptr<=end_ptr; tmp_ptr++) {
*tmp_ptr = whatever;
}

I'm not sure with the end condition though... <= or < ?

Anyway, as I said earlier, the compiler most probably does that kind of
optimizations much better than any human being...

Ham
 
Ben Pfaff

Eric Sosman said:
Kelsey said:
[...]

Frankly, I'd be surprised if there were any discernible difference
other than possibly on an implementation designed to be
pathological.

In my one-off test, the second method ran 2.5 times faster than
the first. I guess that makes gcc and Opteron "pathological."

I'd be surprised if the performance differences didn't depend
significantly on the value of n. What n did you use?
 
Wolfgang Draxinger

Lie said:
The fastest would be to use SSE, I think...

Not if it's used naively. Mixing SSE assembler instructions into C code
makes it impossible for the optimizer to rearrange or substitute those
instructions with code that's better in this particular situation. Both the
GNU and the Intel compiler collections provide proprietary extensions in the
form of builtin functions (intrinsics) to access a CPU's vector unit (also
architecture dependent), which, when used, allow the optimizer to create a
code path that reaches that optimizer's best metric. Unless you are one of
the top notch experts on the given CPU, it will be hard to reach that level
with hand-written assembler.
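
As an illustration, here is the multiply-by-2 loop written with SSE2
intrinsics (a sketch, assuming x86 with SSE2 and double arrays; compiles
with gcc or icc):

#include <emmintrin.h>

void scale_by_2(double *a, long n)
{
    const __m128d two = _mm_set1_pd(2.0);
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d v = _mm_loadu_pd(&a[i]);  /* load a[i], a[i+1] */
        v = _mm_mul_pd(v, two);           /* multiply both lanes by 2 */
        _mm_storeu_pd(&a[i], v);          /* store them back */
    }
    for (; i < n; i++)                    /* scalar tail for odd n */
        a[i] *= 2.0;
}

Written this way, the optimizer can still schedule, unroll and interleave
these operations, which it cannot do across an inline-assembler block.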


Wolfgang
 
Wolfgang Draxinger

kid said:
Now you could argue that Method 1 should be faster because there are fewer
branching instructions, and branching is expensive. But then you could
argue that Method 2 should be better because there's better data locality,
so better use of cache.

And one might argue that the out-of-order execution dispatcher recognizes the 3
consecutive loops and dispatches the code in three parallel pipelines, thus
executing it in a fashion that resembles Method 1.

The only way to know for sure is: profile the code with as many architectures
and compiler versions as you can get. Maybe even put #ifdef blocks around it,
creating different code paths, which are enabled at runtime depending on
the actual architecture the code is executed on (link into
separate .so/.dll files and load the best-matching variant at program
start).

In my 3D engine, about 2% of the core algorithm code is conditionally
selected that way, depending on whether it's executed on an Intel or AMD CPU; it
would work with other architectures as well, but those are rather rare in
this application field.
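
The mechanics of such runtime selection can be as simple as a function
pointer chosen once at startup (a sketch; the feature probe is a stand-in
for a real cpuid check):

#include <stdio.h>

static void scale_generic(double *a, long n)
{
    for (long i = 0; i < n; i++) a[i] *= 2;
}

static void scale_tuned(double *a, long n)  /* imagine a vectorized variant */
{
    for (long i = 0; i < n; i++) a[i] += a[i];
}

static void (*scale)(double *, long);

int main(void)
{
    int have_fast_path = 1;               /* stand-in for a cpuid probe */
    scale = have_fast_path ? scale_tuned : scale_generic;

    double a[4] = { 1, 2, 3, 4 };
    scale(a, 4);
    printf("%g %g %g %g\n", a[0], a[1], a[2], a[3]);
    return 0;
}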


Wolfgang
 
user923005

This is true up to a point.

However, *all other things being equal* it's obviously better to choose
code that's likely to run faster on most platforms (as opposed to the
Heathfield method of choosing code likely to run slower on most
platforms).

I think the OP was saying: "OK, I've got this situation where I could do
this in two different ways. Both are equally clear and readable, both
are correct and maintainable, so why not let probable speed on probable
architectures be the determining factor as to which one I pick?"

And why not indeed? Even if you see efficiency (or in Heathfield's case,
inefficiency) as the very last entry in your lexicographic order of the
desirable properties of code, it still beats tossing a coin.

I agree with you that we should start with the code that runs the
fastest.
However, I think that the right way to choose code that runs faster is
to choose a better algorithm.
Tweaky hacks are a bad idea.
 
Hamiral

Eric Sosman wrote:
incorrect program is no gain. (Can you spot *both* errors?)

No, I can't ;)
But I usually don't micro-optimize. And my code always has basic bugs
before I compile, test and correct it. So maybe I shouldn't post at all?
I just wanted to show some ways of optimizing, and I never intended
to provide code that he could just paste into his own program without
understanding or testing it.
Usually, yes. But the O.P.'s question was about two different
ways to organize the same calculation, and my exhaustive survey
(sample size N=1) showed that one of them ran 2.5 times faster than
the other *and* that none of the compilers tested (sample size N=1)
managed to optimize the slower version to match the faster. We
humans still have a wee bit of time left before the machines triumph
altogether.

I never pretended my proposition was better than yours... Just wanted to
follow up on the "optimization topic"...
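
For reference, one bug-free form of that pointer loop, wrapped in an
illustrative function (my reading of the two errors: end_ptr was scaled
by sizeof even though pointer arithmetic already counts in elements, and
<= walks one element past the end):

void fill_all(int *a, long n, int whatever)
{
    int *tmp_ptr = a;
    int *end_ptr = a + n;                 /* one past the last element */
    for (; tmp_ptr < end_ptr; tmp_ptr++)  /* strictly <, stays in bounds */
        *tmp_ptr = whatever;
}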

hhhhhermm, seems to me I should think a bit more before posting here,
some people are taking it too seriously...

Ham
 
Kaz Kylheku

Ben said:
Eric Sosman said:
Kelsey Bjarnason wrote:
[...]

Frankly, I'd be surprised if there were any discernible difference
other than possibly on an implementation designed to be
pathological.
In my one-off test, the second method ran 2.5 times faster than
the first. I guess that makes gcc and Opteron "pathological."

Is this on Linux with glibc?
Three mallocated arrays of 1024*1024 doubles apiece (I
reported on this yesterday). There's no point in optimizing
when n is small; speeding up a 10ns operation by a factor of
2.5 (huzzah!) saves a whopping 6ns ...

If this is glibc, then the malloc will use mmap, since these allocations
are larger than the mmap threshold. Your requested size is a multiple of
the page size, so malloc will add an extra page to the request.

So even if the operating system's memory mapper places these maps adjacently,
there won't be an 8 megabyte step size between them, but rather an 8 megabyte
plus 1 page step size. That, and given that the Opteron's Level 2 TLB is
four-way set associative, as far as I can google up, the performance hit is
probably not attributable to TLB cache aliasing.

Furthermore, there are two L1 TLBs integrated with the L1 cache. These
hold 32 entries (for regular 4K pages) and are /fully/ associative, whereas our
program never works with more than 3 data pages at a time.

The L2 is 16-way associative, so you can work with 16 different pieces of
memory that clash to the same set.

There shouldn't be L1 cache line aliasing either, because the virtual addresses
of the parallel elements in the two arrays (e.g. a[0] and b[0]) should differ
in bits 12, 13 and 14, bits which are part of the cache line index.

What could be affecting performance is bank aliasing. The Opteron's L1 cache is
divided into 8 banks, which split each 64 byte cache line 8 ways into 8 units
of 8 bytes. The least significant 3 bits of the address select a byte. The
next bits after that select the bank.

Thus for a given i, a[i], b[i] and c[i] are almost certainly colliding into the
same L1 cache bank, whereas a[i], a[i+1], a[i+2] map to different banks. This
collision reduces opportunities for parallelism. The Opteron has two ports into
the cache, and simultaneous accesses are possible through these ports---if the
accesses go to different banks.

Could that make the entire 2.5 factor difference? Probably not, but it
probably makes a contribution.
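
The bank arithmetic is easy to check by hand; a toy version with the
assumed geometry described above (8 banks of 8 bytes within a 64-byte line):

#include <stdio.h>
#include <stdint.h>

static unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr >> 3) & 0x7);   /* skip 3 byte-select bits,
                                               take the next 3 bank bits */
}

int main(void)
{
    double a[8], b[8];
    /* a[i] and b[i] collide whenever their addresses agree in bits 3..5,
       while a[i] and a[i+1] always land in neighbouring banks */
    printf("a[0]: bank %u, b[0]: bank %u, a[1]: bank %u\n",
           bank_of((uintptr_t)&a[0]),
           bank_of((uintptr_t)&b[0]),
           bank_of((uintptr_t)&a[1]));
    return 0;
}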
 
BartC

Eric Sosman said:
Kelsey said:
[...]

Frankly, I'd be surprised if there were any discernible difference other
than possibly on an implementation designed to be pathological.

In my one-off test, the second method ran 2.5 times faster than
the first. I guess that makes gcc and Opteron "pathological."

In my tests method 2 ran 0 to 100% slower than method 1, in accordance with
/my/ theory that method 2 has 3 times as much loop control as method 1.
 
Hamiral

Eric Sosman wrote:
Kaz said:
Ben Pfaff wrote:

Kelsey Bjarnason wrote:
[...]

Frankly, I'd be surprised if there were any discernible difference
other than possibly on an implementation designed to be
pathological.
In my one-off test, the second method ran 2.5 times faster than
the first. I guess that makes gcc and Opteron "pathological."

Is this on Linux with glibc?

Alas, no: Windows Vaster. (My excuse: I'm working with another
company's product, one that requires a Windows-based management
station even when running on a different platform entirely. So a
Windows desktop is pretty much forced upon me.) The C implementation
is the gcc-based DJGPP environment, originally developed for MS-DOS.

I just compiled and profiled it as well, with gcc 4.3.3 on GNU/Linux
Ubuntu, on a regular Intel machine.
Method 2 won:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
62.50      0.05      0.05        1    50.00    50.00  method1
37.50      0.08      0.03        1    30.00    30.00  method2

BUT

This first benchmark was a bit simplistic.

So I made a real one, where method1 and method2 are successively called
50 times, and then their calls are interlaced 50 times.

Here is the gprof result (initarrays is just what its name says):

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
51.46      2.99      2.99      100    29.90    29.90  method2
47.50      5.75      2.76      100    27.60    27.60  method1
 1.03      5.81      0.06        1    60.00    60.00  initarrays

So we can say that the difference becomes negligible in the real world.

If you're interested, here is my benchmark, full of bugs and
inaccuracies, of course, and no doubt someone will point out that
this is *not* real world... But was the original problem real world? ;)

==========================
#include <stdlib.h> /* malloc.h is non-standard; stdlib.h declares malloc/free */
#include <stdio.h>
#include <time.h>

#define N (1024*1024)

void initarrays(double *a, double *b, double *c)
{
  long int i;
  for(i=0; i<N; i++) {
    a[i] = 2;
    b[i] = 2;
    c[i] = 2;
  }
}

void method1(double *a, double *b, double *c)
{
  long int i;

  for(i=0; i<N; i++) {
    a[i] *= 1;
    b[i] *= 2;
    c[i] *= 3;
  }
}

void method2(double *a, double *b, double *c)
{
  long int i;

  for(i=0; i<N; i++) {
    a[i] *= 1;
  }
  for(i=0; i<N; i++) {
    b[i] *= 2;
  }
  for(i=0; i<N; i++) {
    c[i] *= 3;
  }
}

int main(int argc, char *argv[])
{
  time_t start, end;
  int i;
  double *a = malloc(N * sizeof *a);
  double *b = malloc(N * sizeof *b);
  double *c = malloc(N * sizeof *c);

  initarrays(a, b, c);

  printf("Running method 1 50 times...\n");
  start = time(NULL);
  for(i=0; i<50; i++) {
    method1(a, b, c);
  }
  end = time(NULL);
  printf("%lu seconds\n", (unsigned long int)(end - start));
  printf("\n");

  printf("Running method 2 50 times...\n");
  start = time(NULL);
  for(i=0; i<50; i++) {
    method2(a, b, c);
  }
  end = time(NULL);
  printf("%lu seconds\n", (unsigned long int)(end - start));
  printf("\n");

  printf("Interlacing method 1 & 2 50 times...\n");
  start = time(NULL);
  for(i=0; i<50; i++) {
    method1(a, b, c);
    method2(a, b, c);
  }
  end = time(NULL);
  printf("%lu seconds\n", (unsigned long int)(end - start));

  free(a);
  free(b);
  free(c);

  return 0;
}
==========================


Ham
 
