# Curious benchmark results with Inline::C

Discussion in 'Perl Misc' started by jl_post@hotmail.com, Dec 16, 2009.

1. ### Guest

Hi,

I've recently been toying around with Inline::C, benchmarking
certain variations of C and Perl and examining the results.

I was curious about the performance penalty of calling an extra
function, so I created three different (but similar) C functions. All
of them use the Pythagorean theorem to calculate the average distance
from the origin for all points with integer coordinates from 1 to
100. However, one function calls distance() to compute the distance,
another is the same except that it calls an inlined function, and the
third doesn't use a distance() function at all -- it just uses the
"unrolled" code, computing the distance in place instead of calling a
function.

I theorized that the "unrolled" code would be the fastest, followed
by the code that called the inlined function, followed by the code
that called the non-inlined function.

However, to my surprise, when I benchmarked the code, I saw that
the code with the non-inlined function ran consistently faster, while
the other two sets of code ran at about the same speed, with the code
that doesn't call the function being a little faster than the code
that called the inline function.

If you're curious, here is the code I used:

#!/usr/bin/perl

use strict;
use warnings;

use Inline 'C' => <<'END_OF_C_CODE';

/* Given a set of integer coordinates, this function
 * calculates the distance from the origin. */
double distance(int x, int y)
{
    return sqrt(x*x + y*y);
}

/* Same as distance(), but declared inline. */
inline double inline_distance(int x, int y)
{
    return sqrt(x*x + y*y);
}

/* This function loops through all integer coordinates
 * from (1,1) to (100,100) and returns the average
 * distance from the origin. The distance is computed
 * without calling distance() or inline_distance(). */
double c_unrolled()
{
    int x, y;
    int numEntries = 0;
    double total = 0;

    for (x = 1; x <= 100; x++)
    {
        for (y = 1; y <= 100; y++)
        {
            total += sqrt(x*x + y*y);
            numEntries++;
        }
    }

    return total / numEntries;
}

/* Same as c_unrolled(), except that the distance
 * from the origin is computed with distance(). */
double c_with_function()
{
    int x, y;
    int numEntries = 0;
    double total = 0;

    for (x = 1; x <= 100; x++)
    {
        for (y = 1; y <= 100; y++)
        {
            total += distance(x, y);
            numEntries++;
        }
    }

    return total / numEntries;
}

/* Same as c_with_function(), except that the distance
 * from the origin is computed with inline_distance(). */
double c_with_inline()
{
    int x, y;
    int numEntries = 0;
    double total = 0;

    for (x = 1; x <= 100; x++)
    {
        for (y = 1; y <= 100; y++)
        {
            total += inline_distance(x, y);
            numEntries++;
        }
    }

    return total / numEntries;
}

END_OF_C_CODE

die "Usage: perl $0 <NUM_TIMES_TO_TEST>\n",
    "Sample usage: perl $0 10_000\n"
    unless @ARGV == 1;

my ($count) = @ARGV;
$count =~ tr/_//d;  # remove all '_' characters

use Benchmark ':all', ':hireswallclock';
my $results = timethese($count, {
    'C unrolled'      => 'c_unrolled()',
    'C with function' => 'c_with_function()',
    'C with inline'   => 'c_with_inline()',
});
cmpthese($results);

__END__

When I ran this code with the following command:

perl extra_function_c.pl 100_000

I got the following as output:

                   Rate C with inline C unrolled C with function
C with inline   11090/s            --        -0%            -11%
C unrolled      11130/s            0%         --            -10%
C with function 12422/s           12%        12%              --

So calling the C code that made use of an extra function call was
actually faster! But why is this so?

In case anyone wants to know, my "gcc -v" output is:

Configured with: ../gcc-3.4.5-20060117-3/configure --with-gcc
--with-gnu-ld --with-gnu-as --host=mingw32 --target=mingw32
--prefix=/mingw --sjlj-exceptions --enable-libgcj --disable-java-awt
--without-x --enable-java-gc=boehm --disable-libgcj-debug
--enable-interpreter --enable-hash-synchronization
--enable-libstdcxx-debug
gcc version 3.4.5 (mingw-vista special r3)

Still curious, I created an all-C file that contained the C code in
my Perl script (on the same platform). When I compiled, ran, and
timed it, I saw that the C code without the function call was the
fastest (while the C code that used the inline function was the
slowest). This is in contrast to the Perl Benchmark findings, which
say that the function with the non-inlined function call was the
fastest.

(Incidentally, I searched on the web as to why the code that called
the inlined function might be slowest, and I discovered that inlining
functions doesn't necessarily make the code any faster. This might
explain why it's always the slowest when benchmarking.)

So I'm curious whether other people see results similar to mine
when running the above Perl script. And if so, why would the C code
with the extra function call be consistently faster in Perl (while not
in straight C)? The fact that the code with the extra function call
is faster seems counter-intuitive to me, no matter what platform I'm
using.

-- Jean-Luc
Dec 16, 2009

2. ### sisyphus Guest

On Dec 16, 11:35 am, "" <>
wrote:

>
>    So calling the C code that made use of an extra function call was
> actually faster!  But why is this so?
>

I don't have an answer, but I have experienced similar things myself.

With this particular code, however, I'm getting comparative results
that are more in line with what's expected:

                  Rate C with function C with inline C unrolled
C with function 1773/s              --          -38%       -39%
C with inline   2870/s             62%            --        -0%
C unrolled      2883/s             63%            0%         --

(I'm on Windows Vista - hence the ridiculously low number of
iterations per second.)

Cheers,
Rob
sisyphus, Dec 16, 2009

3. ### Xho Jingleheimerschmidt Guest

wrote:
> Hi,
>
> I've recently been toying around with Inline::C, benchmarking
> certain variations of C and Perl and examining the results.
>
> I was curious about the performance penalty of calling an extra
> function, so I created three different (but similar C functions). All
> of them use the Pythagorean theorem to calculate the average distance
> from the origin for all points with integer coordinates from 1 to
> 100. However, one function calls distance() to compute the distance,
> another is the same except that it calls an inlined function, and the
> other doesn't use a distance() function -- it just uses the "unrolled"
> code, calling the distance logic in place of the function.
>

....
>
> When I ran this code with the following command:
>
> perl extra_function_c.pl 100_000
>
> I got the following as output:
>
>                    Rate C with inline C unrolled C with function
> C with inline   11090/s            --        -0%            -11%
> C unrolled      11130/s            0%         --            -10%
> C with function 12422/s           12%        12%              --

I get very similar results, only a factor of 10 slower (Asus EeePC
Netbook with Linux).

....
>
> Still curious, I created an all-C file that contained the C code in
> my Perl script (on the same platform). When I compiled, ran, and
> timed it, I saw that the C code without the function call was the
> fastest (while the C code that used the inline function was the
> slowest). This is in contrast to the Perl Benchmark findings, which
> say that the function with the non-inlined function call was the
> fastest.

What were the actual timings? If they all got faster when going to pure
C, just at different rates, that would probably mean something rather
different than if just one of them got faster upon conversion.

Anyway, if I were really serious about figuring this out, I'd compile
the Inline::C code with NoClean, then go look in the build directory log
to see what options were passed to cc or gcc, and then try your pure C
code with those same options.

Also, I would compile with the -S flag so that it saves the assembly
code to see if that reveals anything. Doing that, I notice that the
with_inline and the unrolled yield identical assembly. Unfortunately I
don't know assembly well enough to figure anything else out. It looks
like the with_function code might have fewer "align" operations.

Xho
Xho Jingleheimerschmidt, Dec 17, 2009

4. ### Guest

On Tue, 15 Dec 2009 16:35:43 -0800 (PST), "" <> wrote:

> I theorized that the "unrolled" code would be the fastest, followed
>by the code that called the inlined function, followed by the code
>that called the non-inlined function.
>

Actually you are correct. In theory, this is what happens.
However, no compiler will cooperate if left to its own devices.

The inline keyword is rarely needed, given all the other optimizations
the compiler does on your behalf. The only way to get the results you
expect is to disable optimizations and force inlining, if you can even
do that. Inlining can be refused for a number of reasons (e.g. recursion);
it is only a *suggestion* to the compiler (as the docs will tell you).

Usually, given specific code (and a specific compiler), getting the
compiler to do what you want with it is time-consuming, often trial and error.
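
To illustrate the "suggestion" point: most compilers offer non-standard ways to demand or forbid inlining, which is handy when you want to measure the call overhead itself. The sketch below assumes GCC-style attributes or a newer MSVC. FORCE_INLINE and NEVER_INLINE are made-up macro names, and it sums squared distances (no sqrt) purely to stay independent of the math library.

```c
#include <assert.h>

/* FORCE_INLINE / NEVER_INLINE are hypothetical macro names wrapping
 * compiler-specific extensions; they are not part of standard C. */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#  define NEVER_INLINE static __declspec(noinline)
#elif defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#  define NEVER_INLINE static __attribute__((noinline))
#else
#  define FORCE_INLINE static inline   /* falls back to a plain hint */
#  define NEVER_INLINE static
#endif

/* Squared distance from the origin, expanded at every call site. */
FORCE_INLINE long forced_sq(int x, int y)
{
    return (long)x * x + (long)y * y;
}

/* Same computation, but always reached through a real call. */
NEVER_INLINE long called_sq(int x, int y)
{
    return (long)x * x + (long)y * y;
}

/* Sum over all integer points from (1,1) to (100,100), inlined helper. */
long sum_forced(void)
{
    int x, y;
    long total = 0;

    for (x = 1; x <= 100; x++)
        for (y = 1; y <= 100; y++)
            total += forced_sq(x, y);
    return total;
}

/* Same sum, with a genuine function call per point. */
long sum_called(void)
{
    int x, y;
    long total = 0;

    for (x = 1; x <= 100; x++)
        for (y = 1; y <= 100; y++)
            total += called_sq(x, y);
    return total;
}

/* Inlining changes how the code is emitted, not what it computes. */
void check_same(void)
{
    assert(sum_forced() == sum_called());
}
```

Timing sum_forced() against sum_called() at a fixed optimization level isolates the cost of the call, since the compiler no longer gets to choose.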

Here are some benchmarks ...

-sln

c:\temp\_INLTEST>cl -?
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 12.00.8804 for 80x86

C/C++ COMPILER OPTIONS

-OPTIMIZATION-

/O1 minimize space                       /Op[-] improve floating-pt consistency
/O2 maximize speed                       /Os favor code space
/Oa assume no aliasing                   /Ot favor code speed
/Ob<n> inline expansion (default n=0)    /Ow assume cross-function aliasing
/Od disable optimizations (default)      /Ox maximum opts. (/Ogityb1 /Gs)
/Og enable global optimization           /Oy[-] enable frame pointer omission
/Oi enable intrinsic functions

-CODE GENERATION-
..., etc ...

Benchmarks
==================================
Optimization /Od - disable (none)

use Inline (C => Config =>
BUILD_NOISY => 1,
FORCE_BUILD => 1,
OPTIMIZE => "-Od",
);
use Inline 'C' => <<'END_OF_C_CODE';
....

cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_I
X -Od -DVERSION=\"0.00\" -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE"
inline_test_pl_95db.c
Command line warning D4025 : overriding '/O1' with '/Od'
inline_test_pl_95db.c

                  Rate C with inline C with function C unrolled
C with inline   4129/s            --             -0%       -23%
C with function 4131/s            0%              --       -23%
C unrolled      5333/s           29%             29%         --

** With optimizations disabled, inline and function perform the same;
unrolled performs best.

==================================
Optimization /O1 - favor code size over speed (config.pm defaults?)

use Inline (C => Config =>
BUILD_NOISY => 1,
FORCE_BUILD => 1,
OPTIMIZE => "",
);
use Inline 'C' => <<'END_OF_C_CODE';
....

cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_I
X -DVERSION=\"0.00\" -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE"
inline_test_pl_95db.c
inline_test_pl_95db.c

                  Rate C with function C with inline C unrolled
C with function 4572/s              --          -25%       -25%
C with inline   6094/s             33%            --        -0%
C unrolled      6098/s             33%            0%         --

** With code size favored, function is slowest;
inline and unrolled perform best, at about the same speed.

==================================
Optimizations /Od /Ob1 - disable optimization, force inline (__inline and __forceinline give the same result)

use Inline (C => Config =>
BUILD_NOISY => 1,
FORCE_BUILD => 1,
OPTIMIZE => "-Od -Ob1",
);
use Inline 'C' => <<'END_OF_C_CODE';
....
cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_I
X -Od -Ob1 -DVERSION=\"0.00\" -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE"
inline_test_pl_95db.c
Command line warning D4025 : overriding '/O1' with '/Od'
inline_test_pl_95db.c

                  Rate C with function C with inline C unrolled
C with function 4129/s              --          -22%       -23%
C with inline   5288/s             28%            --        -1%
C unrolled      5333/s             29%            1%         --

** Optimizations disabled, inline is forced. * Pure results *
function is slowest
inline is a lot faster
unrolled is a little faster than inline.
These results make sense!

==================================
Optimization /O2 - favor code speed over size

use Inline (C => Config =>
BUILD_NOISY => 1,
FORCE_BUILD => 1,
OPTIMIZE => "-O2",
);
use Inline 'C' => <<'END_OF_C_CODE';
....
cl -c -IC:/temp/_INLTEST -nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32
-D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_I
X -O2 -DVERSION=\"0.00\" -DXS_VERSION=\"0.00\" "-IC:\Perl\lib\CORE"
inline_test_pl_95db.c
Command line warning D4025 : overriding '/O1' with '/O2'
inline_test_pl_95db.c

                  Rate C with inline C with function C unrolled
C with inline   9699/s            --             -0%        -0%
C with function 9699/s            0%              --         0%
C unrolled      9699/s            0%              0%         --

** Code speed favored; inline is irrelevant in this case.
Inlining has no effect here.
Apparently, where speed is concerned,
return sqrt(x*x + y*y);
is generic enough, and easily enough optimized, that the function
wrapper is stripped off entirely.
Dec 19, 2009