Odd benchmark behavior

Nils M Holm

I ran across some odd (as far as I can tell) behavior recently, when
performing some fun benchmarks. (Yeah, but please bear with me!)

The goal was to find out how bad exactly the code generated by my
toy compiler is, so I passed the following program to GNU C (GCC),
Portable C (PCC), and my toy compiler (let's call it SCC):

-----[ p.c ] ---------------------------------------------------
#include <stdio.h>

#define MAX 20000

main() {
    int i, j, p[MAX], k;

    for (k=0, i=3; k<MAX; i+=2) {
        for (j=0; j<k; j++)
            if (i % p[j] == 0)
                break;
        if (j < k) continue;
        p[k++] = i;
    }
    printf("%d\n", p[k-1]);
}
----------------------------------------------------------------

Just a sieve algorithm to calculate the 20,000th prime number,
nothing fancy, just some tight loops. Please note that I do NOT
intend to boast about the code quality of SCC here. The code emitted
by it is in fact abysmally bad. So I was quite surprised to get the
following run times of the generated programs (median of three
runs, almost no deviation):

gcc -O2 p.c: 12.39s
pcc -O p.c: 12.42s
scc p.c: 12.27s

I can explain the difference between GCC/PCC and SCC by lower
startup overhead, but shouldn't the better code quality of PCC
and GCC compensate for this over a run time of 12 seconds?
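
(If startup overhead really were the explanation, it could be checked
directly by timing a program that does nothing; a minimal sketch, with
the file and binary names made up purely for illustration:)

-----[ empty.c (illustrative) ]---------------------------------
/* Compile with each compiler and time the result, e.g.:
 *     gcc -O2 -o empty empty.c && time ./empty
 * The elapsed time approximates pure startup/teardown overhead.
 */
int main(void) { return 0; }
----------------------------------------------------------------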

I am rather puzzled now. Any ideas?

The output of my toy compiler is attached below. You can
download the complete compiler source code here in case you want
to run some experiments of your own:

http://www.t3x.org/subc/

Any explanations or additional data (run times) would be greatly
appreciated.

-----[ compiler output (SubC) ]---------------------------------
.text
.globl Cmain
Cmain: pushl %ebp # main() {
movl %esp,%ebp
addl $-80012,%esp # int i, j, p[MAX], k;
movl $0,%eax # for (k=0, i=3;
movl %eax,-80012(%ebp)
movl $3,%eax
movl %eax,-4(%ebp)
L2: movl -80012(%ebp),%eax # k<MAX;
pushl %eax
movl $20000,%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jge L6
incl %edx
L6: movl %edx,%eax
orl %eax,%eax
jnz L7
jmp L4
L7: jmp L3
L5: movl $2,%eax # i+=2) {
pushl %eax
movl -4(%ebp),%eax
popl %ecx
addl %ecx,%eax
movl %eax,-4(%ebp)
jmp L2
L3: movl $0,%eax # for (j=0;
movl %eax,-8(%ebp)
L8: movl -8(%ebp),%eax # j<k;
pushl %eax
movl -80012(%ebp),%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jge L12
incl %edx
L12: movl %edx,%eax
orl %eax,%eax
jnz L13
jmp L10
L13: jmp L9
L11: movl -8(%ebp),%eax # j++) {
incl -8(%ebp)
jmp L8
L9: movl -4(%ebp),%eax # if (i % p[j] == 0)
pushl %eax
leal -80008(%ebp),%eax
pushl %eax
movl -8(%ebp),%eax
shll $2,%eax
popl %ecx
addl %ecx,%eax
movl (%eax),%eax
popl %ecx
xchgl %eax,%ecx
cdq
idivl %ecx
movl %edx,%eax
pushl %eax
movl $0,%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jne L14
incl %edx
L14: movl %edx,%eax
orl %eax,%eax
jnz L16
jmp L15
L16: jmp L10 # break;
L15: jmp L11
L10: movl -8(%ebp),%eax # if (j < k)
pushl %eax
movl -80012(%ebp),%eax
xorl %edx,%edx
popl %ecx
cmpl %eax,%ecx
jge L17
incl %edx
L17: movl %edx,%eax
orl %eax,%eax
jnz L19
jmp L18
L19: jmp L5 # continue;
L18: leal -80008(%ebp),%eax # p[k++] = i;
pushl %eax
movl -80012(%ebp),%eax
incl -80012(%ebp)
shll $2,%eax
popl %ecx
addl %ecx,%eax
pushl %eax
movl -4(%ebp),%eax
popl %edx
movl %eax,(%edx)
jmp L5 # }
L4: .data
L20: .byte 37
.byte 'd'
.byte 10
.byte 0
.text # printf("%d\n", p[k-1]);
movl $L20,%eax
pushl %eax
leal -80008(%ebp),%eax
pushl %eax
movl -80012(%ebp),%eax
pushl %eax
movl $1,%eax
popl %ecx
xchgl %eax,%ecx
subl %ecx,%eax
shll $2,%eax
popl %ecx
addl %ecx,%eax
movl (%eax),%eax
pushl %eax
pushl $2
call Cprintf
addl $12,%esp
L1: addl $80012,%esp # }
popl %ebp
ret
 
BartC

Nils M Holm said:
The goal was to find out how bad exactly the code generated by my
toy compiler is, so I passed the following program to GNU C (GCC),
Portable C (PCC), and my toy compiler (let's call it SCC):

#include <stdio.h>

#define MAX 20000

main() {
    int i, j, p[MAX], k;

    for (k=0, i=3; k<MAX; i+=2) {
        for (j=0; j<k; j++)
            if (i % p[j] == 0)
                break;
        if (j < k) continue;
        p[k++] = i;
    }
    printf("%d\n", p[k-1]);
}

gcc -O2 p.c: 12.39s
pcc -O p.c: 12.42s
scc p.c: 12.27s

I can explain the difference between GCC/PCC and SCC by lower
startup overhead, but shouldn't the better code quality of PCC
and GCC compensate for this over a run time of 12 seconds?

I am rather puzzled now. Any ideas?

I tried it with my compiler (not of C, but similar), which while not a toy,
does also generate some very bad code. And while it was slower than the Cs,
all the results were in quite a narrow range: 3.1 to 3.4 seconds for various
C compilers, and 3.5 seconds for my compiler (all x86-32).

The 3.1 seconds was for gcc -O3.

However, with the conventional sieve benchmark, gcc -O3 was over 3 times as
fast as my compiler. So perhaps there's something peculiar with your
benchmark.

(The ratio between gcc -O3 and gcc -O0 was 1.13:1 with your benchmark,
but 4:1 with the regular sieve; however, the latter repeats the main
loop thousands of times, which I suspect -O3 takes advantage of. You
need a different benchmark which isn't repetitive and doesn't use too
much memory either; otherwise the timings will be dominated by memory
accesses.)
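
(The conventional sieve benchmark referred to here is presumably
something like the classic BYTE sieve, which repeats the whole sieve
many times; a minimal sketch, with SIZE and NITER chosen arbitrarily
for illustration:)

-----[ sieve.c (illustrative sketch) ]--------------------------
#include <stdio.h>

#define SIZE  8190          /* array size, as in the classic benchmark */
#define NITER 10000         /* repetition count; arbitrary here */

static char flags[SIZE + 1];

int main(void) {
    int i, k, count, iter;

    for (iter = 0; iter < NITER; iter++) {  /* repeat the whole sieve */
        count = 0;
        for (i = 0; i <= SIZE; i++)
            flags[i] = 1;
        for (i = 2; i <= SIZE; i++) {
            if (flags[i]) {
                for (k = i + i; k <= SIZE; k += i) /* strike multiples */
                    flags[k] = 0;
                count++;
            }
        }
    }
    printf("%d primes\n", count);
    return 0;
}
----------------------------------------------------------------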
 
BartC

io_x said:
"Nils M Holm" <[email protected]> wrote in message

i not understand why you build one algo that take 12s if there is
one other can take 0s...

Because we're investigating code generation, not algorithms. A benchmark
that runs in 0 seconds on any compiler is not very useful!
 
Ian Collins

Nils M Holm said:
I ran across some odd (as far as I can tell) behavior recently, when
performing some fun benchmarks. (Yeah, but please bear with me!)

The goal was to find out how bad exactly the code generated by my
toy compiler is, so I passed the following program to GNU C (GCC),
Portable C (PCC), and my toy compiler (let's call it SCC):

-----[ p.c ]----------------------------------------------------
#include <stdio.h>

#define MAX 20000

main() {
    int i, j, p[MAX], k;

    for (k=0, i=3; k<MAX; i+=2) {
        for (j=0; j<k; j++)
            if (i % p[j] == 0)
                break;
        if (j < k) continue;
        p[k++] = i;
    }
    printf("%d\n", p[k-1]);
}
----------------------------------------------------------------

Just a sieve algorithm to calculate the 20,000th prime number,
nothing fancy, just some tight loops. Please note that I do NOT
intend to boast about the code quality of SCC here. The code emitted
by it is in fact abysmally bad. So I was quite surprised to get the
following run times of the generated programs (median of three
runs, almost no deviation):

gcc -O2 p.c: 12.39s
pcc -O p.c: 12.42s
scc p.c: 12.27s

I can explain the difference between GCC/PCC and SCC by lower
startup overhead, but shouldn't the better code quality of PCC
and GCC compensate for this over a run time of 12 seconds?

I am rather puzzled now. Any ideas?

There probably isn't a great deal an optimiser can do with your code.
I tried it (doubling MAX to 40000 to get a meaningful run time) and got
between 3.3 and 3.4 seconds with everything from -g to -O3 (-fast with
Sun cc).
 
Nils M Holm

BartC said:
(The ratio between gcc -O3 and gcc -O0 was 1.13:1 with your benchmark,
but 4:1 with the regular sieve; however, the latter repeats the main
loop thousands of times, which I suspect -O3 takes advantage of. You
need a different benchmark which isn't repetitive and doesn't use too
much memory either; otherwise the timings will be dominated by memory
accesses.)

Very good point, I did not think of that! I already suspected that
the results were closely tied to my benchmark program, because I did
some other tests where SCC performed as poorly as expected.

Alright, now off to implement the optimizer...
 
gwowen

BartC said:
Because we're investigating code generation, not algorithms. A benchmark
that runs in 0 seconds on any compiler is not very useful!

<CODE type="io_x" idiotic_obfuscation="off">

printf("224737\n");

</CODE>

I WIN!
 
BartC

Nils M Holm said:
    if (i % p[j] == 0)
[...]
it is in fact abysmally bad. So I was quite surprised to get the
following run times of the generated programs (median of three
runs, almost no deviation):

gcc -O2 p.c: 12.39s
pcc -O p.c: 12.42s
scc p.c: 12.27s
[...]
I am rather puzzled now. Any ideas?
[...]
    idivl %ecx

I suspect the % operation (which needs a divide) dominates the proceedings.
I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to 6.6
seconds).
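
(A sketch of what such a "dummy % op" experiment might look like in C;
the volatile sink and the divisor p[j]+1 are my own choices, made up
so the extra division can neither be optimized away nor divide by
zero:)

-----[ p-dummy.c (illustrative sketch) ]------------------------
#include <stdio.h>

#define MAX 20000

volatile int sink;          /* keeps the dummy division alive */

int main(void) {
    int i, j, k;
    static int p[MAX];

    for (k = 0, i = 3; k < MAX; i += 2) {
        for (j = 0; j < k; j++) {
            sink += i % (p[j] + 1);   /* the extra, useless % op */
            if (i % p[j] == 0)
                break;
        }
        if (j < k) continue;
        p[k++] = i;
    }
    printf("%d\n", p[k-1]);
    return 0;
}
----------------------------------------------------------------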
 
Nils M Holm

BartC said:
I suspect the % operation (which needs a divide) dominates the proceedings.
I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to 6.6
seconds).

Yes, division is certainly an expensive operation, but I like your
initial guess better (memory access dominates the results). So I
rewrote the program slightly to give the optimizer a chance to do
better register allocation, and voila:

scc p2.c: 30s
gcc -O2 p2.c: 15s

Here is the new program (using an even more brute-force algorithm):

-----[ p2.c ]---------------------------------------------------
#include <stdio.h>

#define MAX 10000

main() {
    int i, j, k, n;

    for (k=0, i=3; k<MAX; i+=2) {
        for (j=2; j<i/2; j++)
            if (i % j == 0)
                break;
        if (j < i/2) continue;
        n = i;
        k++;
    }
    printf("%d\n", n);
}
----------------------------------------------------------------
 
BartC

Nils M Holm said:
Yes, division is certainly an expensive operation, but I like your
initial guess better (memory access dominates the results).

I don't think your program actually used that much memory (only 80KB or so).
Have a look at the NSIEVE benchmark:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=gcc

which allocates memory for some 5 million integers. (The overheads of
memory accesses were such that I was able to get within a factor of
2.5:1 of gcc -O3 using interpreted bytecode, compared with 6:1 when
using only a few thousand integers.)
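
(For reference, an nsieve-style benchmark is essentially a plain
Sieve of Eratosthenes over a range of millions of integers; a rough
sketch of the idea, not the exact shootout source, with N chosen
arbitrarily:)

-----[ nsieve-sketch.c (illustrative) ]-------------------------
#include <stdio.h>
#include <stdlib.h>

#define N 5000000   /* sieve range; large enough that memory
                       traffic, not arithmetic, dominates */

int main(void) {
    char *flags = malloc(N);
    int i, k, count = 0;

    if (flags == NULL) return 1;
    for (i = 0; i < N; i++)
        flags[i] = 1;
    for (i = 2; i < N; i++) {
        if (flags[i]) {
            count++;
            for (k = i + i; k < N; k += i)   /* strike out multiples */
                flags[k] = 0;
        }
    }
    printf("%d primes below %d\n", count, N);
    free(flags);
    return 0;
}
----------------------------------------------------------------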
 
Nils M Holm

BartC said:
I don't think your program actually used that much memory (only 80KB or so).
Have a look at the NSIEVE benchmark:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=gcc

I think it is not a matter of the amount of memory being allocated,
but a matter of the amount of memory actually being accessed. The
algorithm used in the shootout accesses memory about N + p(N) times,
where N is the upper limit of the range to check and p(N) is the
number of primes in that range. My algorithm accesses memory at least
N^2/2 times, so the difference is about the same as O(N) vs O(N^2),
i.e. my program does a lot more memory accesses in the same amount of
memory. (With the 20,000th prime near 224,737, that estimate gives on
the order of 2.5 * 10^10 accesses, versus a few hundred thousand for
the sieve.)
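
(A small sketch to make the asymptotics concrete: instrument both
loops with access counters over the same range; LIMIT and the counter
placement are my own choices for illustration:)

-----[ count-accesses.c (illustrative) ]------------------------
#include <stdio.h>

#define LIMIT 10000

static char flags[LIMIT];
static int primes[LIMIT];

int main(void) {
    long trial = 0, sieve = 0;
    int i, j, k;

    /* trial division against previously found primes, as in p.c */
    for (k = 0, i = 3; i < LIMIT; i += 2) {
        for (j = 0; j < k; j++) {
            trial++;                    /* one read of primes[j] */
            if (i % primes[j] == 0)
                break;
        }
        if (j < k) continue;
        primes[k++] = i;
    }

    /* sieve of Eratosthenes over the same range */
    for (i = 2; i < LIMIT; i++) {
        sieve++;                        /* write flags[i] */
        flags[i] = 1;
    }
    for (i = 2; i < LIMIT; i++) {
        sieve++;                        /* read flags[i] */
        if (flags[i])
            for (j = i + i; j < LIMIT; j += i) {
                sieve++;                /* write flags[j] */
                flags[j] = 0;
            }
    }

    printf("trial division: %ld accesses, sieve: %ld accesses\n",
           trial, sieve);
    return 0;
}
----------------------------------------------------------------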
 
