Curious benchmark results with Inline::C

Discussion in 'Perl Misc' started by jl_post@hotmail.com, Dec 16, 2009.

  1. Guest

    Hi,

    I've recently been toying around with Inline::C, benchmarking
    certain variations of C and Perl and examining the results.

    I was curious about the performance penalty of calling an extra
    function, so I created three different (but similar) C functions. All
    of them use the Pythagorean theorem to calculate the average distance
    from the origin for all points with integer coordinates from 1 to
    100. However, one function calls distance() to compute the distance,
    another is the same except that it calls an inlined function, and the
    third doesn't use a distance() function at all -- it just uses the
    "unrolled" code, computing the distance in place instead of calling
    a function.

    I theorized that the "unrolled" code would be the fastest, followed
    by the code that called the inlined function, followed by the code
    that called the non-inlined function.

    However, to my surprise, when I benchmarked the code, I saw that
    the code with the non-inlined function ran consistently faster, while
    the other two sets of code ran at about the same speed, with the code
    that doesn't call the function being a little faster than the code
    that called the inline function.

    If you're curious, here is the code I used:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Inline 'C' => <<'END_OF_C_CODE';

    /* Given a set of integer coordinates, this function
     * calculates the distance from the origin. */
    double distance(int x, int y)
    {
        return sqrt(x*x + y*y);
    }

    /* Same as distance(), but declared inline. */
    inline double inline_distance(int x, int y)
    {
        return sqrt(x*x + y*y);
    }

    /* This function loops through all integer coordinates
     * from (1,1) to (100,100) and returns the average
     * distance from the origin. The distance is computed
     * without calling distance() or inline_distance(). */
    double c_unrolled()
    {
        int x, y;
        int numEntries = 0;
        double total = 0;

        for (x = 1; x <= 100; x++)
        {
            for (y = 1; y <= 100; y++)
            {
                total += sqrt(x*x + y*y);
                numEntries++;
            }
        }

        return total/numEntries;
    }

    /* Same as c_unrolled(), except that the distance
     * from the origin is computed with distance(). */
    double c_with_function()
    {
        int x, y;
        int numEntries = 0;
        double total = 0;

        for (x = 1; x <= 100; x++)
        {
            for (y = 1; y <= 100; y++)
            {
                total += distance(x, y);
                numEntries++;
            }
        }

        return total/numEntries;
    }

    /* Same as c_with_function(), except that the distance
     * from the origin is computed with inline_distance(). */
    double c_with_inline()
    {
        int x, y;
        int numEntries = 0;
        double total = 0;

        for (x = 1; x <= 100; x++)
        {
            for (y = 1; y <= 100; y++)
            {
                total += inline_distance(x, y);
                numEntries++;
            }
        }

        return total/numEntries;
    }

    END_OF_C_CODE


    die "Usage: perl $0 <NUM_TIMES_TO_TEST>\n",
        "Sample usage: perl $0 10_000\n"
        unless @ARGV == 1;

    my ($count) = @ARGV;
    $count =~ tr/_//d; # remove all '_' characters

    use Benchmark ':all', ':hireswallclock';
    my $results = timethese($count, {
        'C unrolled'      => 'c_unrolled()',
        'C with function' => 'c_with_function()',
        'C with inline'   => 'c_with_inline()',
    });
    cmpthese($results);

    __END__


    When I ran this code with the following command:

    perl extra_function_c.pl 100_000

    I got the following as output:

                       Rate C with inline C unrolled C with function
    C with inline   11090/s            --        -0%            -11%
    C unrolled      11130/s            0%         --            -10%
    C with function 12422/s           12%        12%              --

    So calling the C code that made use of an extra function call was
    actually faster! But why is this so?

    In case anyone wants to know, my "gcc -v" output is:

    Reading specs from C:/strawberry/c/bin/../lib/gcc/mingw32/3.4.5/specs
    Configured with: ../gcc-3.4.5-20060117-3/configure --with-gcc --with-gnu-ld
      --with-gnu-as --host=mingw32 --target=mingw32 --prefix=/mingw
      --enable-threads --disable-nls --enable-languages=c,c++,f77,ada,objc,java
      --disable-win32-registry --disable-shared --enable-sjlj-exceptions
      --enable-libgcj --disable-java-awt --without-x --enable-java-gc=boehm
      --disable-libgcj-debug --enable-interpreter --enable-hash-synchronization
      --enable-libstdcxx-debug
    Thread model: win32
    gcc version 3.4.5 (mingw-vista special r3)


    Still curious, I created an all-C file that contained the C code in
    my Perl script (on the same platform). When I compiled, ran, and
    timed it, I saw that the C code without the function call was the
    fastest (while the C code that used the inline function was the
    slowest). This is in contrast to the Perl Benchmark findings, which
    say that the function with the non-inlined function call was the
    fastest.

    (Incidentally, I searched on the web as to why the code that called
    the inlined function might be slowest, and I discovered that inlining
    functions doesn't necessarily make the code any faster. This might
    explain why it's always the slowest when benchmarking.)

    So I'm curious if other people also see similar results as mine
    when running the above Perl script. And if so, why would the C code
    with the extra function call be consistently faster in Perl (while not
    in straight C)? The fact that the code with the extra function call
    is faster seems counter-intuitive to me, no matter what platform I'm
    using.

    Thanks in advance for any advice, tips, or general wisdom.

    -- Jean-Luc
    , Dec 16, 2009
    #1

  2. sisyphus Guest

    On Dec 16, 11:35 am, "" <>
    wrote:

    >
    >    So calling the C code that made use of an extra function call was
    > actually faster!  But why is this so?
    >


    I don't have an answer, but I have experienced similar things myself.

    With this particular code, however, I'm getting comparative results
    that are more in line with what's expected:

                       Rate C with function C with inline C unrolled
    C with function  1773/s              --          -38%       -39%
    C with inline    2870/s             62%            --        -0%
    C unrolled       2883/s             63%            0%         --

    (I'm on Windows Vista - hence the ridiculously low number of
    iterations per second.)

    Cheers,
    Rob
    sisyphus, Dec 16, 2009
    #2

  3. wrote:
    > Hi,
    >
    > I've recently been toying around with Inline::C, benchmarking
    > certain variations of C and Perl and examining the results.
    >
    > I was curious about the performance penalty of calling an extra
    > function, so I created three different (but similar) C functions. All
    > of them use the Pythagorean theorem to calculate the average distance
    > from the origin for all points with integer coordinates from 1 to
    > 100. However, one function calls distance() to compute the distance,
    > another is the same except that it calls an inlined function, and the
    > other doesn't use a distance() function -- it just uses the "unrolled"
    > code, calling the distance logic in place of the function.
    >

    ....
    >
    > When I ran this code with the following command:
    >
    > perl extra_function_c.pl 100_000
    >
    > I got the following as output:
    >
    >                    Rate C with inline C unrolled C with function
    > C with inline   11090/s            --        -0%            -11%
    > C unrolled      11130/s            0%         --            -10%
    > C with function 12422/s           12%        12%              --


    I get very similar results, only a factor of 10 slower (Asus EeePC
    Netbook with Linux).

    ....
    >
    > Still curious, I created an all-C file that contained the C code in
    > my Perl script (on the same platform). When I compiled, ran, and
    > timed it, I saw that the C code without the function call was the
    > fastest (while the C code that used the inline function was the
    > slowest). This is in contrast to the Perl Benchmark findings, which
    > say that the function with the non-inlined function call was the
    > fastest.


    What were the actual timings? If they all got faster when going to pure
    C, just at different rates, that would probably mean something rather
    different than if just one of them got faster upon conversion.

    Anyway, if I were really serious about figuring this out, I'd compile
    the Inline::C code with NoClean, then go look in the build directory log
    to see what options were passed to cc or gcc, and then try your pure C
    code with those same options.

    Also, I would compile with the -S flag so that it saves the assembly
    code, to see if that reveals anything. Doing that, I notice that
    with_inline and unrolled yield identical assembly. Unfortunately I
    don't know assembly well enough to figure anything else out. It looks
    like the with_function code might have fewer "align" operations.

    Xho
    Xho Jingleheimerschmidt, Dec 17, 2009
    #3
  4. Guest

    On Tue, 15 Dec 2009 16:35:43 -0800 (PST), "" <> wrote:

    > I theorized that the "unrolled" code would be the fastest, followed
    >by the code that called the inlined function, followed by the code
    >that called the non-inlined function.
    >

    Actually, you are correct: in theory, that is what happens.
    However, no compiler will cooperate if left to its own devices.

    The inline keyword is rarely used, given all the other optimizations
    the compiler does on your behalf. The only way to get the results you
    expect is to disable optimizations and force inlining, if you can even
    do that. Inlining won't be done in a number of cases (e.g. recursion);
    it is only a *suggestion* to the compiler (as the docs will tell you).

    Usually, given specific code (and compiler), getting the compiler
    to do with it what you want is time consuming, often trial and error.
    Linking is a bigger hassle.

    Here are some benchmarks ...

    -sln

    c:\temp\_INLTEST>cl -?
    Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 12.00.8804 for 80x86
    Copyright (C) Microsoft Corp 1984-1998. All rights reserved.

    C/C++ COMPILER OPTIONS

    -OPTIMIZATION-

    /O1 minimize space                       /Op[-] improve floating-pt consistency
    /O2 maximize speed                       /Os favor code space
    /Oa assume no aliasing                   /Ot favor code speed
    /Ob<n> inline expansion (default n=0)    /Ow assume cross-function aliasing
    /Od disable optimizations (default)      /Ox maximum opts. (/Ogityb1 /Gs)
    /Og enable global optimization           /Oy[-] enable frame pointer omission
    /Oi enable intrinsic functions

    -CODE GENERATION-
    ..., etc ...

    Benchmarks
    ==================================
    Optimization /Od - disable (none)

    use Inline (C => Config =>
        BUILD_NOISY => 1,
        FORCE_BUILD => 1,
        OPTIMIZE => "-Od",
    );
    use Inline 'C' => <<'END_OF_C_CODE';
    ....

    cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
      -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
      -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
      -DUSE_PERLIO -DPERL_MSVCRT_READFIX -Od -DVERSION=\"0.00\"
      -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE" inline_test_pl_95db.c
    Command line warning D4025 : overriding '/O1' with '/Od'
    inline_test_pl_95db.c

                       Rate C with inline C with function C unrolled
    C with inline    4129/s            --             -0%       -23%
    C with function  4131/s            0%              --       -23%
    C unrolled       5333/s           29%             29%         --

    ** With optimizations disabled, inline and function are the same;
    unrolled gives the best performance

    ==================================
    Optimization /O1 - favor code size over speed (config.pm defaults?)

    use Inline (C => Config =>
        BUILD_NOISY => 1,
        FORCE_BUILD => 1,
        OPTIMIZE => "",
    );
    use Inline 'C' => <<'END_OF_C_CODE';
    ....

    cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
      -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
      -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
      -DUSE_PERLIO -DPERL_MSVCRT_READFIX -DVERSION=\"0.00\"
      -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE" inline_test_pl_95db.c
    inline_test_pl_95db.c

                       Rate C with function C with inline C unrolled
    C with function  4572/s              --          -25%       -25%
    C with inline    6094/s             33%            --        -0%
    C unrolled       6098/s             33%            0%         --

    ** With code size favored, function is slowest;
    inline and unrolled give the best performance, about the same

    ==================================
    Optimizations /Od /Ob1 - disable optimization, force inline (__inline/__forceinline behave the same)

    use Inline (C => Config =>
        BUILD_NOISY => 1,
        FORCE_BUILD => 1,
        OPTIMIZE => "-Od -Ob1",
    );
    use Inline 'C' => <<'END_OF_C_CODE';
    ....
    cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
      -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
      -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
      -DUSE_PERLIO -DPERL_MSVCRT_READFIX -Od -Ob1 -DVERSION=\"0.00\"
      -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE" inline_test_pl_95db.c
    Command line warning D4025 : overriding '/O1' with '/Od'
    inline_test_pl_95db.c

                       Rate C with function C with inline C unrolled
    C with function  4129/s              --          -22%       -23%
    C with inline    5288/s             28%            --        -1%
    C unrolled       5333/s             29%            1%         --

    ** Optimizations disabled, inline is forced. * Pure results *
    function is slowest
    inline is a lot faster
    unrolled is a little faster than inline.
    These results make sense!

    ==================================
    Optimization /O2 - favor code speed over size

    use Inline (C => Config =>
        BUILD_NOISY => 1,
        FORCE_BUILD => 1,
        OPTIMIZE => "-O2",
    );
    use Inline 'C' => <<'END_OF_C_CODE';
    ....
    cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
      -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE
      -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS
      -DUSE_PERLIO -DPERL_MSVCRT_READFIX -O2 -DVERSION=\"0.00\"
      -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE" inline_test_pl_95db.c
    Command line warning D4025 : overriding '/O1' with '/O2'
    inline_test_pl_95db.c

                       Rate C with inline C with function C unrolled
    C with inline    9699/s            --             -0%        -0%
    C with function  9699/s            0%              --         0%
    C unrolled       9699/s            0%              0%         --

    ** Code speed favored; inline is irrelevant in this case.
    Inlining has no effect here.
    Apparently, where speed is concerned,
        return sqrt(x*x + y*y);
    is so generic and easily optimized that the function
    wrapper is stripped off entirely.
    , Dec 19, 2009
    #4