Odd benchmark behavior

Discussion in 'C Programming' started by Nils M Holm, Mar 26, 2012.

  1. Nils M Holm

    Nils M Holm Guest

    I ran across some odd (as far as I can tell) behavior recently, when
    performing some fun benchmarks. (Yeah, but please bear with me!)

    The goal was to find out how bad exactly the code generated by my
    toy compiler is, so I passed the following program to GNU C (GCC),
    Portable C (PCC), and my toy compiler (let's call it SCC):

    -----[ p.c ] ---------------------------------------------------
    #include <stdio.h>

    #define MAX 20000

    main() {
    int i, j, p[MAX], k;

    for (k=0, i=3; k<MAX; i+=2) {
    for (j=0; j<k; j++)
    if (i % p[j] == 0)
    break;
    if (j < k) continue;
    p[k++] = i;
    }
    printf("%d\n", p[k-1]);
    }
    ----------------------------------------------------------------

    Just a sieve algorithm to calculate the 20,000'th prime number,
    nothing fancy, just some tight loops. Please note that I do NOT
    intend to boast the code quality of SCC here. The code emitted by
    it is in fact abysmally bad. So I was quite surprised to get the
    following run times of the generated programs (median of three
    runs, almost no deviation):

    gcc -O2 p.c: 12.39s
    pcc -O p.c: 12.42s
    scc p.c: 12.27s

    I can explain the difference between GCC/PCC and SCC by lower
    startup overhead, but shouldn't the better code quality of PCC
    and GCC compensate for this over a run time of 12 seconds?

    I am rather puzzled now. Any ideas?

    The output of my toy compiler compiler is attached below. You can
    download the complete compiler source code here in case you want
    to do some own experiments.

    http://www.t3x.org/subc/

    Any explanations or additional data (run times) would be greatly
    appreciated.

    -----[ compiler output (SubC) ]---------------------------------
    .text
    .globl Cmain
    Cmain: pushl %ebp # main() {
    movl %esp,%ebp
    addl $-80012,%esp # int i, j, p[MAX], k;
    movl $0,%eax # for (k=0, i=3;
    movl %eax,-80012(%ebp)
    movl $3,%eax
    movl %eax,-4(%ebp)
    L2: movl -80012(%ebp),%eax # k<MAX;
    pushl %eax
    movl $20000,%eax
    xorl %edx,%edx
    popl %ecx
    cmpl %eax,%ecx
    jge L6
    incl %edx
    L6: movl %edx,%eax
    orl %eax,%eax
    jnz L7
    jmp L4
    L7: jmp L3
    L5: movl $2,%eax # i+=2) {
    pushl %eax
    movl -4(%ebp),%eax
    popl %ecx
    addl %ecx,%eax
    movl %eax,-4(%ebp)
    jmp L2
    L3: movl $0,%eax # for (j=0;
    movl %eax,-8(%ebp)
    L8: movl -8(%ebp),%eax # j<k;
    pushl %eax
    movl -80012(%ebp),%eax
    xorl %edx,%edx
    popl %ecx
    cmpl %eax,%ecx
    jge L12
    incl %edx
    L12: movl %edx,%eax
    orl %eax,%eax
    jnz L13
    jmp L10
    L13: jmp L9
    L11: movl -8(%ebp),%eax # j++) {
    incl -8(%ebp)
    jmp L8
    L9: movl -4(%ebp),%eax # if (i % p[j] == 0)
    pushl %eax
    leal -80008(%ebp),%eax
    pushl %eax
    movl -8(%ebp),%eax
    shll $2,%eax
    popl %ecx
    addl %ecx,%eax
    movl (%eax),%eax
    popl %ecx
    xchgl %eax,%ecx
    cdq
    idivl %ecx
    movl %edx,%eax
    pushl %eax
    movl $0,%eax
    xorl %edx,%edx
    popl %ecx
    cmpl %eax,%ecx
    jne L14
    incl %edx
    L14: movl %edx,%eax
    orl %eax,%eax
    jnz L16
    jmp L15
    L16: jmp L10 # break;
    L15: jmp L11
    L10: movl -8(%ebp),%eax # if (j < k)
    pushl %eax
    movl -80012(%ebp),%eax
    xorl %edx,%edx
    popl %ecx
    cmpl %eax,%ecx
    jge L17
    incl %edx
    L17: movl %edx,%eax
    orl %eax,%eax
    jnz L19
    jmp L18
    L19: jmp L5 # continue;
    L18: leal -80008(%ebp),%eax # p[k++] = i;
    pushl %eax
    movl -80012(%ebp),%eax
    incl -80012(%ebp)
    shll $2,%eax
    popl %ecx
    addl %ecx,%eax
    pushl %eax
    movl -4(%ebp),%eax
    popl %edx
    movl %eax,(%edx)
    jmp L5 # }
    L4: .data
    L20: .byte 37
    .byte 'd'
    .byte 10
    .byte 0
    .text # printf("%d\n", p[k-1]);
    movl $L20,%eax
    pushl %eax
    leal -80008(%ebp),%eax
    pushl %eax
    movl -80012(%ebp),%eax
    pushl %eax
    movl $1,%eax
    popl %ecx
    xchgl %eax,%ecx
    subl %ecx,%eax
    shll $2,%eax
    popl %ecx
    addl %ecx,%eax
    movl (%eax),%eax
    pushl %eax
    pushl $2
    call Cprintf
    addl $12,%esp
    L1: addl $80012,%esp # }
    popl %ebp
    ret
    ----------------------------------------------------------------

    --
    Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
     
    Nils M Holm, Mar 26, 2012
    #1
    1. Advertising

  2. Nils M Holm

    BartC Guest

    "Nils M Holm" <> wrote in message
    news:...

    > The goal was to find out how bad exactly the code generated by my
    > toy compiler is, so I passed the following program to GNU C (GCC),
    > Portable C (PCC), and my toy compiler (let's call it SCC):


    > #include <stdio.h>
    >
    > #define MAX 20000
    >
    > main() {
    > int i, j, p[MAX], k;
    >
    > for (k=0, i=3; k<MAX; i+=2) {
    > for (j=0; j<k; j++)
    > if (i % p[j] == 0)
    > break;
    > if (j < k) continue;
    > p[k++] = i;
    > }
    > printf("%d\n", p[k-1]);
    > }


    > gcc -O2 p.c: 12.39s
    > pcc -O p.c: 12.42s
    > scc p.c: 12.27s
    >
    > I can explain the difference between GCC/PCC and SCC by lower
    > startup overhead, but shouldn't the better code quality of PCC
    > and GCC compensate for this over a run time of 12 seconds?
    >
    > I am rather puzzled now. Any ideas?


    I tried it with my compiler (not of C, but similar), which while not a toy,
    does also generate some very bad code. And while it was slower than the Cs,
    all the results were in quite a narrow range: 3.1 to 3.4 seconds for various
    C compilers, and 3.5 seconds for my compiler (all x86-32).

    The 3.1 seconds was for gcc -O3.

    However, with the conventional sieve benchmark, gcc -O3 was over 3 times as
    fast as my compiler. So perhaps there's something peculiar with your
    benchmark.

    (The ratio between gcc -O3 and gcc -O0 was 1.13:1 with your benchmark, but
    4:1 with the regular sieve, however the latter does repeat the main loop
    thousands of times, which I suspect -O3 takes advantage of. You need a
    different benchmark which isn't repetitive, and doesn't use too much memory
    either, otherwise the timings will be dominated by memory accesses.)

    --
    Bartc
     
    BartC, Mar 26, 2012
    #2
    1. Advertising

  3. Nils M Holm

    BartC Guest

    "io_x" <> wrote in message
    news:4f70351c$0$1385$...
    >
    > "Nils M Holm" <> ha scritto nel messaggio


    >> gcc -O2 p.c: 12.39s
    >> pcc -O p.c: 12.42s
    >> scc p.c: 12.27s

    >
    > i not understand why you build one algo that take 12s if there is
    > one other can take 0s...


    Because we're investigating code generation not algorithms. A benchmark that
    runs in 0 seconds on any compiler is not very useful!

    --
    Bartc
     
    BartC, Mar 26, 2012
    #3
  4. Nils M Holm

    Ian Collins Guest

    On 03/26/12 08:51 PM, Nils M Holm wrote:
    > I ran across some odd (as far as I can tell) behavior recently, when
    > performing some fun benchmarks. (Yeah, but please bear with me!)
    >
    > The goal was to find out how bad exactly the code generated by my
    > toy compiler is, so I passed the following program to GNU C (GCC),
    > Portable C (PCC), and my toy compiler (let's call it SCC):
    >
    > -----[ p.c ] ---------------------------------------------------
    > #include<stdio.h>
    >
    > #define MAX 20000
    >
    > main() {
    > int i, j, p[MAX], k;
    >
    > for (k=0, i=3; k<MAX; i+=2) {
    > for (j=0; j<k; j++)
    > if (i % p[j] == 0)
    > break;
    > if (j< k) continue;
    > p[k++] = i;
    > }
    > printf("%d\n", p[k-1]);
    > }
    > ----------------------------------------------------------------
    >
    > Just a sieve algorithm to calculate the 20,000'th prime number,
    > nothing fancy, just some tight loops. Please note that I do NOT
    > intend to boast the code quality of SCC here. The code emitted by
    > it is in fact abysmally bad. So I was quite surprised to get the
    > following run times of the generated programs (median of three
    > runs, almost no deviation):
    >
    > gcc -O2 p.c: 12.39s
    > pcc -O p.c: 12.42s
    > scc p.c: 12.27s
    >
    > I can explain the difference between GCC/PCC and SCC by lower
    > startup overhead, but shouldn't the better code quality of PCC
    > and GCC compensate for this over a run time of 12 seconds?
    >
    > I am rather puzzled now. Any ideas?


    The probably isn't a great deal an optimiser can do with your code. I
    tried your code (doubling to 40000 to get a meaningful time) and got a
    run time between 3.3 and 3.4 seconds with everything from -g to -O3
    (-fast with Sun cc),

    --
    Ian Collins
     
    Ian Collins, Mar 26, 2012
    #4
  5. Nils M Holm

    Nils M Holm Guest

    BartC <> wrote:
    > (The ratio between gcc -O3 and gcc -O0 was 1.13:1 with your benchmark, but
    > 4:1 with the regular sieve, however the latter does repeat the main loop
    > thousands of times, which I suspect -O3 takes advantage of. You need a
    > different benchmark which isn't repetitive, and doesn't use too much memory
    > either, otherwise the timings will be dominated by memory accesses.)


    Very good point, did not think of that! I already suspected that the
    results were tightly knit to my benchmark program, because I did some
    other tests where SCC performed as lousy as expected.

    Alright, now going to implement the optimizer...

    --
    Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
     
    Nils M Holm, Mar 26, 2012
    #5
  6. Nils M Holm

    gwowen Guest

    On Mar 26, 10:40 am, "BartC" <> wrote:
    > "io_x" <> wrote in message
    >
    > news:4f70351c$0$1385$...
    >
    >
    >
    > > "Nils M Holm" <> ha scritto nel messaggio
    > >> gcc -O2 p.c: 12.39s
    > >> pcc -O p.c: 12.42s
    > >> scc p.c: 12.27s

    >
    > > i not understand why you build one algo that take 12s if there is
    > > one other can take 0s...

    >
    > Because we're investigating code generation not algorithms. A benchmark that
    > runs in 0 seconds on any compiler is not very useful!
    >
    > --
    > Bartc


    <CODE type="io_x" idiotic_obfuscation="off">

    printf("224737\n");

    </CODE>

    I WIN!
     
    gwowen, Mar 26, 2012
    #6
  7. Nils M Holm

    BartC Guest

    "Nils M Holm" <> wrote in message
    news:...

    > if (i % p[j] == 0)


    > it is in fact abysmally bad. So I was quite surprised to get the
    > following run times of the generated programs (median of three
    > runs, almost no deviation):
    >
    > gcc -O2 p.c: 12.39s
    > pcc -O p.c: 12.42s
    > scc p.c: 12.27s


    > I am rather puzzled now. Any ideas?


    > idivl %ecx


    I suspect the % operation (which needs a divide) dominates the proceedings.
    I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to 6.6
    seconds).

    --
    Bartc
     
    BartC, Mar 26, 2012
    #7
  8. Nils M Holm

    Nils M Holm Guest

    BartC <> wrote:
    > I suspect the % operation (which needs a divide) dominates the proceedings.
    > I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to 6.6
    > seconds).


    Yes, division is certainly an expensive operation, but I like your
    initial guess better (memory access dominates the results). So I
    rewrote the program slightly to give the optimizer a chance to do
    better register allocation and voila:

    scc p2.c: 30s
    gcc -O2 p2.c: 15s

    Here is the new program (using an even more brute-force algorithm):

    ----------------------------------------------------------------
    #include <stdio.h>

    #define MAX 10000

    main() {
    int i, j, k, n;

    for (k=0, i=3; k<MAX; i+=2) {
    for (j=2; j<i/2; j++)
    if (i % j == 0)
    break;
    if (j < k) continue;
    n = i;
    k++;
    }
    printf("%d\n", n);
    }
    ----------------------------------------------------------------

    I also tried a dummy remainder and it indeed doubled the run time
    of the SCC-generated executable while the GCC-generated executable
    was not affected. I guess that GCC simply removes the dead code.

    --
    Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
     
    Nils M Holm, Mar 27, 2012
    #8
  9. Nils M Holm

    BartC Guest

    "Nils M Holm" <> wrote in message
    news:...
    > BartC <> wrote:
    >> I suspect the % operation (which needs a divide) dominates the
    >> proceedings.
    >> I put in an extra dummy % op, and it nearly doubled the runtime (3.5 to
    >> 6.6
    >> seconds).

    >
    > Yes, division is certainly an expensive operation, but I like your
    > initial guess better (memory access dominates the results).


    I don't think your program actually used that much memory (only 80KB or so).
    Have a look at the NSIEVE benchmark:

    http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=gcc

    which allocates memory for some 5 million integers. (The overheads of memory
    accesses were such that I was able to get to a factor of 2.5:1 of gcc -O3,
    using interpreted bytecode! Compared with 6:1 when only a few thousand
    integers.)

    --
    Bartc
     
    BartC, Mar 27, 2012
    #9
  10. Nils M Holm

    Nils M Holm Guest

    BartC <> wrote:
    > I don't think your program actually used that much memory (only 80KB or so).
    > Have a look at the NSIEVE benchmark:
    >
    > http://shootout.alioth.debian.org/gp4/benchmark.php?test=nsieve&lang=gcc


    I think it is not a matter of the amount of memory being allocated,
    but a matter of the amount of memory actually being accessed. The
    algorithm used in the shootout accesses memory N+p(N) times, where
    N is the upper limit of the range to check and N is the number of
    primes in that range. My algorithm accesses memory at least N^2/2
    times, so the difference is about the same as O(n) vs O(n^2), i.e.
    my program does a lot more memory access in the same amount of
    memory.

    --
    Nils M Holm < n m h @ t 3 x . o r g > www.t3x.org
     
    Nils M Holm, Mar 27, 2012
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Elliot M. Rodriguez

    PLEASE HELP = odd TextChanged behavior

    Elliot M. Rodriguez, Oct 21, 2003, in forum: ASP .Net
    Replies:
    2
    Views:
    338
    Elliot M. Rodriguez
    Oct 22, 2003
  2. Guest

    Step-thru code - odd behavior

    Guest, May 28, 2004, in forum: ASP .Net
    Replies:
    2
    Views:
    398
    Guest
    Jun 1, 2004
  3. =?Utf-8?B?Q2hyaXM=?=
    Replies:
    1
    Views:
    350
    Karl Seguin
    Jan 7, 2005
  4. news.microsoft.com

    EXTREMELY odd cookie behavior

    news.microsoft.com, Mar 15, 2005, in forum: ASP .Net
    Replies:
    2
    Views:
    323
    Joerg Jooss
    Mar 15, 2005
  5. Michael Speer

    Odd behavior with odd code

    Michael Speer, Feb 16, 2007, in forum: C Programming
    Replies:
    33
    Views:
    1,159
    Richard Heathfield
    Feb 18, 2007
Loading...

Share This Page