What is the gain of "inline"

Stefan Ram

janus said:
Could someone explain "inline" for me?

Making a function an inline function suggests that
calls to the function be as fast as possible.
 
Tom St Denis

  Making a function an inline function suggests that
  calls to the function be as fast as possible.

Not quite. "inline" hints to the compiler to place the function
*in-line* with the caller, thus REMOVING the call altogether. It's a
redundant keyword nowadays, as most compilers will automatically inline
things when it's useful.

Tom
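
To make the "removing the call" idea concrete, here is a minimal sketch. The names add_one, caller and caller_as_if_inlined are made up for illustration, and whether a given compiler actually performs the substitution is up to that compiler:

```c
/* Hypothetical helper; "inline" is only a hint to the compiler. */
static inline int add_one(int x)
{
    return x + 1;
}

/* Written with a call to the helper. */
int caller(int v)
{
    return add_one(v) * 2;
}

/* What the compiler may generate instead if it inlines add_one:
   the body is substituted at the call site and no call remains. */
int caller_as_if_inlined(int v)
{
    return (v + 1) * 2;
}
```

Both functions return the same value for any input that does not overflow; only the generated code may differ.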
 
Ben Bacarisse

Tom St Denis said:
Not quite. "inline" hints to the compiler to place the function
*in-line* with the caller, thus REMOVING the call altogether.

There's nothing wrong with quoting the standard which is what Stefan
Ram did. One can imagine an architecture with executable "code
registers" where there might be a faster method than conventional
in-lining so the standard is wise to avoid saying what inline must
actually do.

It's a redundant keyword nowadays, as most compilers will automatically
inline things when it's useful.

Even then it is not quite redundant in that the compiler can diagnose
constraints when the function is marked as inline.
 
jacob navia

janus said:
Hello All,

Could someone explain "inline" for me?

Janus

The "inline" directive slows down a program by bloating its code. With
today's CPUs programs go so fast that you can't see the bugs.

To avoid that, you slow down your program using the "inline" directive.

This directive tells the compiler to replace a call to a function
with an insertion of the function's body at each call site.

This bloats the code in most cases. The bloated code slows down the
program that has to load more code into the cache from main memory.

Since most system busses go at 333 MHz or 666 MHz at most, a CPU at
3 GHz must insert wait states to wait until the RAM gives it the
new instructions to execute.

If you want to make your program faster, avoid inline. Remember that
a call instruction is very small, and can be predicted in MOST cases
(unless it is a call through a function pointer). This means that the
pipeline will keep full, and more code will fit in the code caches,
making the program go faster.
 
Tom St Denis

Incorrect.

It hints to replace the call. Not inline *with* it.

I have no idea what your distinction is. When you say "inline" it
will attempt to put the called function inside the caller, removing a
function call [thus no stack frames, etc. to deal with].

Tom
 
Tom St Denis

There's nothing wrong with quoting the standard which is what Stefan
Ram did.  One can imagine an architecture with executable "code
registers" where there might be a faster method than conventional
in-lining so the standard is wise to avoid saying what inline must
actually do.

"inline" can't really suggest anything else but to inline the called
function. Otherwise you still have to set up a stack frame and pay the
cost of calling a function.

Even then it is not quite redundant in that the compiler can diagnose
constraints when the function is marked as inline.

I'm saying that in most modern compilers you will get function
inlining whether you use the keyword or not. So it's largely
academic.

It's like saying "auto" on stack variables...

Tom
 
Ian Collins

jacob said:
janus said:

The "inline" directive slows down a program by bloating its code. With
today's CPUs programs go so fast that you can't see the bugs.

To avoid that, you slow down your program using the "inline" directive.

This directive tells the compiler to replace a call to a function
with an insertion of the function's body at each call site.

It's a hint, not an instruction.

This bloats the code in most cases. The bloated code slows down the
program that has to load more code into the cache from main memory.

Define "most cases". If the instruction sequence for the function body
is shorter than that required to call it, inlining shrinks the code.

Since most system busses go at 333 MHz or 666 MHz at most, a CPU at
3 GHz must insert wait states to wait until the RAM gives it the
new instructions to execute.

I don't have a figure for the ratio of embedded devices to desktops and
servers, but I'm sure it's high enough to invalidate that statement.
 
Stephen Sprunk

Tom said:
I'm saying that in most modern compilers you will get function
inlining whether you use the keyword or not. So it's largely
academic.

It's like saying "auto" on stack variables...

IIRC, GCC will automatically inline "small" functions, but "inline"
tells it to inline the function regardless of size. The size threshold
can be changed via command-line flag, but IMHO that's not as clean.

Also, I think GCC does not output a standalone function body if the
function is declared "static inline" but does if the function is merely
declared "static", even if every call to the latter gets inlined due to
heuristics.

So, "inline" is not (yet?) completely useless like "auto" or "register".

S
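
For readers who have not seen the "static inline" form discussed above, a minimal sketch follows. The function names clamp_to_byte and sum_clamped are hypothetical, and nothing here guarantees that any particular compiler inlines the helper or omits its standalone body:

```c
#include <stddef.h>

/* Hypothetical small helper. "static" gives it internal linkage, so the
   compiler knows every caller is in this translation unit. */
static inline int clamp_to_byte(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return v;
}

/* Sums the clamped values; each clamp_to_byte call is an inlining candidate. */
int sum_clamped(const int *values, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += clamp_to_byte(values[i]);
    return total;
}
```

Comparing the generated assembly with and without the "inline" keyword (for example with gcc -S) is the way to check what a particular compiler version actually does.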
 
Stephen Sprunk

jacob said:
janus said:

The "inline" directive slows down a program by bloating its code. With
today's CPUs programs go so fast that you can't see the bugs.

To avoid that, you slow down your program using the "inline" directive.

This directive tells the compiler to replace a call to a function
with an insertion of the function's body at each call site.

This bloats the code in most cases. The bloated code slows down the
program that has to load more code into the cache from main memory.

Since most system busses go at 333 MHz or 666 MHz at most, a CPU at
3 GHz must insert wait states to wait until the RAM gives it the
new instructions to execute.

If you want to make your program faster, avoid inline. Remember that
a call instruction is very small, and can be predicted in MOST cases
(unless it is a call through a function pointer). This means that the
pipeline will keep full, and more code will fit in the code caches,
making the program go faster.

There are so many errors in this I don't even know where to start.

S
 
Ben Bacarisse

jacob navia said:
janus said:
The "inline" directive slows down a program by bloating its code. With
today's CPUs programs go so fast that you can't see the bugs.

What bugs? This might have been a joke, in which case I think you
need a little work on your delivery.

To avoid that, you slow down your program using the "inline" directive.

This directive tells the compiler to replace a call to a function
with an insertion of the function's body at each call site.

This bloats the code in most cases. The bloated code slows down the
program that has to load more code into the cache from main memory.

Since most system busses go at 333 MHz or 666 MHz at most, a CPU at
3 GHz must insert wait states to wait until the RAM gives it the
new instructions to execute.

If you want to make your program faster, avoid inline. Remember that
a call instruction is very small, and can be predicted in MOST cases
(unless it is a call through a function pointer). This means that the
pipeline will keep full, and more code will fit in the code caches,
making the program go faster.

This runs counter to my experience. Maybe I used inline (and compilers
that inline automatically) when it is wise to do so, but in all the
cases I can remember, inlined code was faster (and occasionally smaller).

Maybe you can give an example where inlining slows down the code. I'd
like to see what sort of code leads you to this conclusion.
 
Antoninus Twink

This runs counter to my experience. Maybe you can give an example
where inlining slows down the code. I'd like to see what sort of code
leads you to this conclusion.

I'd also be interested to hear this - my experience accords with yours,
Ben. (Obviously inlining code will make the executable bigger, but it's
been a long time since disk space for binaries, or RAM to load them, was
an issue.)

On the other hand, I am a firm believer that loop unrolling can cause
the problems Jacob mentions, and for the reasons he describes. I suspect
that this is why (empirically) gcc's -O3 is almost always slower than
-O2.
 
Stefan Ram

Antoninus Twink said:
I'd also be interested to hear this - my experience accords with yours,
Ben. (Obviously inlining code will make the executable bigger, but it's
been a long time since disk space for binaries, or RAM to load them, was
an issue.)

Well, it /is/ an issue (regarding speed) whether the code
will fit into the processor cache or not.

(Recently someone wrote that fast RAM has always been 32 KB:
it was when he bought his PET 2001 back in 1977, and it is
today, when it is called »L1 cache«.)

((Where »fast RAM« is defined as RAM the processor can
access at its native speed.))
 
Eric Sosman

Making a function an inline function suggests that
calls to the function be as fast as possible.

Not quite. [...]

Ah, this is obviously some strange usage of the phrase "not
quite" that I wasn't previously aware of. See ISO/IEC 9899:1999,
Section 6.7.4, paragraph 5, third sentence.
 
Kaz Kylheku

janus said:

The "inline" directive slows down a program by bloating its code. With
today's CPUs programs go so fast that you can't see the bugs.

Remember, boys and girls, this is from someone who thinks that the
stack-blowing idiocy known as variable length arrays is a good idea!

To avoid that, you slow down your program using the "inline" directive.

Smart use of inline speeds up programs considerably. Some small
functions can be replaced by an instruction sequence which is as short
as the function call.

I work on GNU/Linux running on MIPS. In userland, function calls are
gross. They have to ensure that the $gp register has the correct value,
load some offsets from the global offset table and then do an indirect
branch through the $t9 register.

For instance, the puts call in this:

#include <stdio.h>

int main(void)
{
    puts("hello");
    return 0;
}

turns into this:


Fetch the global pointer:

lui $28,%hi(%neg(%gp_rel(main)))
addu $28,$28,$25
addiu $28,$28,%lo(%neg(%gp_rel(main)))

Now go into the global offset table to figure
out where puts is, and begin the calculation
of where the string literal "hello" is:

lw $4,%got_page(.LC0)($28)
lw $25,%call16(puts)($28)

Save our caller's return address.

sd $31,8($sp)

Finally do the call.

jal $25

But not quite; in the branch delay slot, complete calculating the address of
the string literal:

addiu $4,$4,%got_ofst(.LC0)

Phew! It's definitely worth inlining a function that can be done in a few
instructions!

This directive tells the compiler to replace a call to a function
with an insertion of the function's body at each call site.

So does a function-like macro. Only the inline function is type safe.
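
A small sketch of the macro-versus-inline difference, under the assumption of a made-up squaring helper (SQUARE_MACRO, square_int and next_value are illustrative names, not from the thread). The inline function has a declared parameter and return type that the compiler checks arguments against, and it evaluates its argument exactly once; the macro is plain text substitution:

```c
/* Function-like macro: pure text substitution, no parameter types. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Inline function: the argument is converted to int and evaluated once. */
static inline int square_int(int x)
{
    return x * x;
}

/* Hypothetical helper with a side effect, used to expose double evaluation. */
int next_value(int *counter)
{
    *counter += 1;
    return *counter;
}
```

With a counter starting at 0, square_int(next_value(&c)) calls next_value once and yields 1, whereas SQUARE_MACRO(next_value(&c)) expands to two calls and yields 2.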
This bloats the code in most cases.

Can you put a number on this, like 75.3% of the cases?

What sampling method is used, over what kind of data, to arrive at the
statistic?

The bloated code slows down the program that has to load more code into the
cache from main memory.

What if the inline code is bloated, but it's in a tight loop that fits
nicely into the cache?

If you want to make your program faster, avoid inline. Remember that
a call instruction is very small

No it isn't; see MIPS code above.

and can be predicted in MOST cases
(unless it is a call through a function pointer).

Calls into shared libraries typically use indirect jumps.

This means that the
pipeline will keep full

See use of branch delay slot in MIPS code; something can be put into the
pipeline even though a branch is happening. (Though this is now part
of the instruction set architecture and behaves the same way regardless
of whether there actually is a branch delay slot, or how large it is;
if the hardware implementation has a two cycle stall in the pipeline for a
branch, you still get just one slot to fill).
 
BGB / cr88192

jacob navia said:
janus said:

The "inline" directive slows down a program by bloating its code. With
today's CPUs programs go so fast that you can't see the bugs.

To avoid that, you slow down your program using the "inline" directive.

This directive tells the compiler to replace a call to a function
with an insertion of the function's body at each call site.

This bloats the code in most cases. The bloated code slows down the
program that has to load more code into the cache from main memory.

Since most system busses go at 333 MHz or 666 MHz at most, a CPU at
3 GHz must insert wait states to wait until the RAM gives it the
new instructions to execute.

If you want to make your program faster, avoid inline. Remember that
a call instruction is very small, and can be predicted in MOST cases
(unless it is a call through a function pointer). This means that the
pipeline will keep full, and more code will fit in the code caches,
making the program go faster.

it depends mostly on the size of the function...

inline is mostly useful for "one-liners", where the overhead of the call
will be larger than that of the code in the function...

but, for code much larger than this (such as code with loops or conditionals
and stuff), the use of inline is ill-advised... (since IME, the cost of a
function call is not usually THAT much higher than that of a loop iteration,
for example...).


except maybe on 64-bit Linux systems, where it remains my running
hypothesis that their attempt at over-engineering the calling convention
will in fact reduce call performance in the average case...

sadly, there is no real good way to test this hypothesis...
 
Antoninus Twink

Well, it /is/ an issue (regarding speed) whether the code
will fit into the processor cache or not.

Yes, but the same set of instructions are being executed whether a
function is inlined or not: the only difference is whether or not
there's a function call thrown into the mix. So as much of the program
will fit into the instruction cache either way. There might be tiny
gains if the inlined code can map into the same cache line while the
non-inlined code has to be fetched fresh, but this difference will be
eliminated in subsequent calls to the function -- and we're mostly
talking about the gain being in tight loops here where the same function
gets called many times.

Apart from saving function call overhead, inlining might also give the
optimizer more chance to reuse common subexpressions or the like, so
that's another possible benefit. But better use of cache? I don't see
it.
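
As an illustration of the "more chance for the optimizer" point, here is a hedged sketch with hypothetical names (index_of, sum_row). It only shows the kind of opportunity inlining can expose; whether a compiler takes it depends on the compiler and its settings:

```c
#include <stddef.h>

/* Hypothetical helper: flat index of (row, col) in a row-major matrix. */
static inline size_t index_of(size_t row, size_t col, size_t width)
{
    return row * width + col;
}

/* Sums one row. If index_of is inlined, the compiler can see that
   row * width does not change inside the loop and may compute it once. */
long sum_row(const int *matrix, size_t row, size_t width)
{
    long total = 0;
    for (size_t col = 0; col < width; col++)
        total += matrix[index_of(row, col, width)];
    return total;
}
```

If index_of were an opaque call into another translation unit, the multiplication would have to be repeated on every iteration.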
 
BGB / cr88192

Antoninus Twink said:
I'd also be interested to hear this - my experience accords with yours,
Ben. (Obviously inlining code will make the executable bigger, but it's
been a long time since disk space for binaries, or RAM to load them, was
an issue.)

On the other hand, I am a firm believer that loop unrolling can cause
the problems Jacob mentions, and for the reasons he describes. I suspect
that this is why (empirically) gcc's -O3 is almost always slower than
-O2.

agreed...

granted, WRT loop unrolling, it may depend in large part on the size of the
loop body and number of iterations, where a large number of iterations, or a
large loop body, are likely to hurt things much more than a small loop body
with a small number of iterations.

granted, in general I am not in favor of loop unrolling, since in the cases
it is actually likely to help, I can just as easily unroll by hand, and not
have the compiler unroll the bigger loops (which are likely to just make
things slower...).
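
For reference, unrolling a small loop by hand might look like the sketch below. The function name sum_unrolled4 and the factor of four are arbitrary choices for illustration; whether this beats the plain loop has to be measured on the target:

```c
#include <stddef.h>

/* Sums an array with the loop body unrolled four times by hand. */
long sum_unrolled4(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    /* Main loop: four elements per iteration. */
    for (; i + 4 <= n; i += 4)
        total += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];

    /* Remainder loop: the last n % 4 elements. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

The remainder loop is what keeps the result correct when the element count is not a multiple of four.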


granted, micro-optimization is a touchy issue, and cache-hits/misses more
so, and so I generally don't bother, and try to instead just let the
profiler tell me where performance is needed...
 
Kaz Kylheku

I'd also be interested to hear this - my experience accords with yours,
Ben. (Obviously inlining code will make the executable bigger, but it's
been a long time since disk space for binaries, or RAM to load them, was
an issue.)

No, not obviously. Sometimes the function call sequence is actually
longer than the inline code.

Suppose I had a function like this in a shared library on MIPS:

int get_foo_count(struct bar *bar)
{
    return bar->foo_count;
}

Calling this will almost certainly take more instructions than just doing
the access directly with inline code (macro or inline function).
Computing the address of a function in a shared library and calling
it is more complicated than accessing a structure member.

Even if the function is not in a shared library, it may still be
more instructions to do the call. Suppose I add the call into a leaf
function. Now that function becomes a caller to a callee, and has
new responsibilities: namely saving all of the caller-saved registers!
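
A sketch of the inline alternative described above, assuming the accessor is moved into a header as "static inline". The struct layout and the has_foo function are hypothetical; the thread only names the foo_count member:

```c
/* Hypothetical layout; the thread only names the foo_count member. */
struct bar {
    int foo_count;
};

/* Accessor defined "static inline", as it might appear in a header, so
   callers can get a plain member load instead of a library call. */
static inline int get_foo_count(struct bar *bar)
{
    return bar->foo_count;
}

/* A leaf function stays a leaf if the accessor is inlined: it makes no
   calls, so it has no caller-saved registers to preserve around one. */
int has_foo(struct bar *bar)
{
    return get_foo_count(bar) > 0;
}
```

The trade-off is that the struct layout becomes visible to callers, so changing it means recompiling them rather than just replacing the shared library.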
 
Phil Carmody

Ian Collins said:
I don't have a figure for the ratio of embedded devices to desktops
and servers, but I'm sure it's high enough to invalidate that
statement.

Whatever it is, multiply it by at least 4. I know the portable device
I have in my hands as we speak has at least _6_ ARM cores on it, each
programmed in C.

In addition:
For most PCs with bluetooth, add one to the embedded device tally.
For most PCs with WLAN, add another one to the embedded device tally.
For most PCs with ethernet, add another one to the embedded device tally.

In fact, it's not even going to be close, is it?

Phil
 
