More optimisation


user923005

Eric Sosman wrote:
Kaz said:
Ben Pfaff wrote:
Kelsey Bjarnason wrote:
[...]
Frankly, I'd be surprised if there were any discernible difference
other than possibly on an implementation designed to be
pathological.
    In my one-off test, the second method ran 2.5 times faster than
the first.  I guess that makes gcc and Opteron "pathological."
Is this on Linux with glibc?
    Alas, no: Windows Vaster.  (My excuse: I'm working with another
company's product, one that requires a Windows-based management
station even when running on a different platform entirely.  So a
Windows desktop is pretty much forced upon me.)  The C implementation
is the gcc-based DJGPP environment, originally developed for MS-DOS.

I just compiled and profiled it as well, with gcc 4.3.3 on GNU/Linux
Ubuntu, on a regular Intel machine.
Method 2 won:

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  62.50      0.05     0.05        1    50.00    50.00  method1
  37.50      0.08     0.03        1    30.00    30.00  method2

BUT

This first benchmark was a bit simplistic.

So I made a real one, where method1 and method2 are successively called
50 times, and then their calls are interlaced 50 times.

Here is the gprof result (initarrays is just what its name says):

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  51.46      2.99     2.99      100    29.90    29.90  method2
  47.50      5.75     2.76      100    27.60    27.60  method1
   1.03      5.81     0.06        1    60.00    60.00  initarrays

So we can say that the difference becomes negligible in the real world.

If you're interested, here is my benchmark, full of bugs and
inaccuracies, of course, and no doubt someone will point out that
this is *not* real world... But was the original problem real world? ;)

==========================
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define N (1024*1024)

void initarrays(double *a, double *b, double *c)
{
        long int i;
        for(i=0; i<N; i++) {
                a[i] = 2;
                b[i] = 2;
                c[i] = 2;
        }

}

void method1(double *a, double *b, double *c)
{
        long int i;

        for(i=0; i<N; i++) {
                a[i] *= 1;
                b[i] *= 2;
                c[i] *= 3;
        }

}

void method2(double *a, double *b, double *c)
{
        long int i;

        for(i=0; i<N; i++) {
                a[i] *= 1;
        }
        for(i=0; i<N; i++) {
                b[i] *= 2;
        }
        for(i=0; i<N; i++) {
                c[i] *= 3;
        }

}

int main(int argc, char *argv[])
{
        time_t start, end;
        int i;
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);

        initarrays(a, b, c);

        printf("Running method 1 50 times...\n");
        start = time(NULL);    
        for(i=0; i<50; i++) {
                method1(a, b, c);
        }
        end = time(NULL);
        printf("%ld seconds\n", (unsigned long int)(end - start));
        printf("\n");

        printf("Running method 2 50 times...\n");
        start = time(NULL);    
        for(i=0; i<50; i++) {
                method2(a, b, c);
        }
        end = time(NULL);
        printf("%ld seconds\n", (unsigned long int)(end - start));

        printf("\n");

        printf("Interlacing method 1 & 2 50 times...\n");
        start = time(NULL);
        for(i=0; i<50; i++) {
                method1(a, b, c);
                method2(a, b, c);
        }
        end = time(NULL);
        printf("%ld seconds\n", (unsigned long int)(end - start));

        free(a);
        free(b);
        free(c);

        return 0;
}

==========================


Here is my adaptation and the default values that I got:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

static size_t N = 1024 * 1024;
static unsigned Iterations = 500;

void initarrays(double *a, double *b, double *c)
{
        long int i;
        for (i = 0; i < N; i++) {
                a[i] = 2;
                b[i] = 2;
                c[i] = 2;
        }
}

void method1(double *a, double *b, double *c)
{
        long int i;
        for (i = 0; i < N; i++) {
                a[i] *= 1;
                b[i] *= 2;
                c[i] *= 3;
        }
}

void method2(double *a, double *b, double *c)
{
        long int i;
        for (i = 0; i < N; i++) {
                a[i] *= 1;
        }
        for (i = 0; i < N; i++) {
                b[i] *= 2;
        }
        for (i = 0; i < N; i++) {
                c[i] *= 3;
        }
}

int main(int argc, char *argv[])
{
        time_t start, end;
        int i;
        double *a;
        double *b;
        double *c;
        size_t v = 0;

        if (argc > 1)
                v = atof(argv[1]);

        if (v > 0)
                N = v;

        a = malloc(N * sizeof *a);
        b = malloc(N * sizeof *b);
        c = malloc(N * sizeof *c);
        initarrays(a, b, c);
        printf("Running method 1 %u times...\n", Iterations);
        start = time(NULL);
        for (i = 0; i < Iterations; i++) {
                method1(a, b, c);
        }
        end = time(NULL);
        printf("%lu seconds\n", (unsigned long int)(end - start));
        printf("\n");
        printf("Running method 2 %u times...\n", Iterations);
        start = time(NULL);
        for (i = 0; i < Iterations; i++) {
                method2(a, b, c);
        }
        end = time(NULL);
        printf("%lu seconds\n", (unsigned long int)(end - start));
        printf("\n");
        printf("Interlacing method 1 & 2 %u times...\n", Iterations);
        start = time(NULL);
        for (i = 0; i < Iterations; i++) {
                method1(a, b, c);
                method2(a, b, c);
        }
        end = time(NULL);
        printf("%lu seconds\n", (unsigned long int)(end - start));
        free(a);
        free(b);
        free(c);
        return 0;
}
/*
Running method 1 500 times...
3 seconds

Running method 2 500 times...
4 seconds

Interlacing method 1 & 2 500 times...
8 seconds
*/
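
A caveat on these numbers: time(NULL) has one-second resolution, so short
runs round heavily. A minimal sketch of a finer-grained timer using the
standard clock() function (CPU time, in CLOCKS_PER_SEC ticks); the helper
name is hypothetical:

#include <time.h>

/* Hypothetical helper: time one call of a method with clock(). */
double time_one_run(void (*method)(double *, double *, double *),
                    double *a, double *b, double *c)
{
        clock_t t0 = clock();
        method(a, b, c);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

Called as time_one_run(method1, a, b, c), it returns fractional seconds
instead of whole ones.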
 

Tim Prince

user923005 said:
[ snip benchmark code ]



After restrict qualifiers are added, icc turns method 2 into the same as
method 1. gcc and Sun cc don't optimize either method.
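
For reference, the change Tim describes is just adding C99 restrict to the
parameter declarations, promising the compiler that the three arrays never
overlap; a minimal sketch (bodies unchanged from the benchmark above):

/* With restrict, the compiler may assume a, b and c never alias, so it is
   free to fuse method2's three loops into method1's single loop (or split
   method1's), whichever it judges faster. */
void method1(double * restrict a, double * restrict b, double * restrict c);
void method2(double * restrict a, double * restrict b, double * restrict c);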
 

BartC

Richard Harter said:
Eric Sosman said:
Kelsey Bjarnason wrote:
[...]

Frankly, I'd be surprised if there were any discernible difference
other than possibly on an implementation designed to be pathological.

In my one-off test, the second method ran 2.5 times faster than
the first. I guess that makes gcc and Opteron "pathological."

In my tests method 2 ran 0 to 100% slower than method 1, in accordance
with /my/ theory that method 2 has 3 times as much loop control as
method 1.

What optimization level did you use? At higher levels the
optimizer should do loop unrolling for which the loop control cost
is insignificant.

Using gcc under Windows, with arrays of a million elements and the loop
repeated 100 times, the timings were something like:

Int arrays:
No optimisation options: 1200ms and 2000ms
With -O3: 900ms and 1100ms (I've also seen both 900ms)

Double arrays, set to 0.0 or 1.0:
No optimisation options: 1800ms and 3200ms
With -O3: 2000ms and 2800ms

I never saw method 2 faster.
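
For anyone wondering what the unrolling Richard mentions amounts to, here
is a hand-unrolled sketch of method1's loop (illustrative only, assuming N
is a multiple of 4; a compiler at high optimization levels may do something
similar automatically):

void method1_unrolled(double *a, double *b, double *c)
{
        long int i;

        /* Four elements per iteration, so the loop test and increment
           cost a quarter of what they did per element. */
        for (i = 0; i < N; i += 4) {
                a[i]   *= 1;  b[i]   *= 2;  c[i]   *= 3;
                a[i+1] *= 1;  b[i+1] *= 2;  c[i+1] *= 3;
                a[i+2] *= 1;  b[i+2] *= 2;  c[i+2] *= 3;
                a[i+3] *= 1;  b[i+3] *= 2;  c[i+3] *= 3;
        }
}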
 

Edwin van den Oetelaar

Hamiral said:
So I made a real one, where method1 and method2 are successively called
50 times, and then their calls are interlaced 50 times.

Here is the gprof result (initarrays is just what its name says):

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  51.46      2.99     2.99      100    29.90    29.90  method2
  47.50      5.75     2.76      100    27.60    27.60  method1
   1.03      5.81     0.06        1    60.00    60.00  initarrays

So we can say that the difference becomes negligible in the real world.

I can confirm, using your benchmark with 5000 iterations on my PC
(dual-CPU Athlon). It does not make a difference.

oetelaar@debian:~$ gcc -Wall -O3 benchmark.c -o benchmark1
oetelaar@debian:~$ ./benchmark1
Running method 1 5000 times...
69 seconds

Running method 2 5000 times...
71 seconds

Interlacing method 1 & 2 5000 times...
142 seconds

[ snip benchmark code ]
 

James Dow Allen

It might be fun to start a thread on "Unusual Optimizations";
less-obvious techniques some of us have encountered to enhance
speed.

kid said:
Method 1:
for (long int i = 0; i < n; i++) {
        a[i] *= 2;
        b[i] *= 3;
        c[i] *= 4;
}


... I went
ahead and measured it on *my* system for n=1024*1024 and three arrays
of mallocated `double': Method 2 took 40% as much time as Method 1.


Others suggested this anomaly was caused by some sort of cache
collision. I suppose the mallocated addresses differed by
1024*1024+4 or so?
Did you experiment with a substitution like 1111*1111 in the malloc()
to avert cache collision? If the cache collision hypothesis
is confirmed, an "unusual optimization" is suggested, e.g.

int a[1024*1024];
/* We stick the foos between a/b hoping to avoid a/b cache collision.
... there are better ways to achieve this.
*/
struct foo bar[NUM_FOOS];
int b[1024*1024];
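
For the dynamically allocated case, the 1111*1111 substitution above
amounts to making the three allocation sizes differ, so the blocks are no
longer spaced an exact power of two apart. A minimal sketch with an
arbitrary padding constant (illustrative, not tuned; whether it actually
shifts the relative alignment depends on the allocator):

#define PAD 1111                /* extra doubles per array, arbitrary */

double *a = malloc(N * sizeof *a);
double *b = malloc((N + PAD) * sizeof *b);
double *c = malloc((N + 2 * PAD) * sizeof *c);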

Note that the problem can arise in a less contrived example, e.g.
a[i] = b[i] * c[i];

I did something like this in World's Fastest Jpeg Compressor(tm),
making inner-loop functions contiguous and arranging that the
primary data work area had different low-order address bits.
Although I *do* try to err on the side of Pointless Perfectionism,
that wasn't even a Premature Optimization: I'd observed a serious
timing anomaly, as I reported on Usenet:
http://groups.google.com/group/alt.sys.sun/msg/0bfdcb0c000f3e97

It will be fun to compare notes on other non-obvious optimizations.
Ideally they should have sped up a real application significantly
(though "pointless" micro-optimizations can be fun to talk about too).

James Dow Allen
 
