inline functions

Malcolm McLean · Nov 26, 2012

On 26/11/2012 14:01, BartC wrote:

gcc /can/ inline calls through function pointers (which is pretty
impressive, IMHO), but only when it can tell what the function will be.

That's handy. You very frequently need something like this

int trivialcompfunc(const void *e1, const void *e2)
{
    const int *i1 = e1;
    const int *i2 = e2;

    return *i1 - *i2;
}

void complexalgorithm(int *array, int N)
{
    qsort(array, N, sizeof(int), trivialcompfunc);
}

Clearly, if we can inline trivialcompfunc, it's likely to have a substantial
effect on performance.

Les Cargill · Nov 26, 2012

Malcolm said:
That's handy. You very frequently need something like this

int trivialcompfunc(const void *e1, const void *e2)
{
    const int *i1 = e1;
    const int *i2 = e2;

    return *i1 - *i2;
}

void complexalgorithm(int *array, int N)
{
    qsort(array, N, sizeof(int), trivialcompfunc);
}

Clearly, if we can inline trivialcompfunc, it's likely to have a
substantial effect on performance.

I don't see how that can inline trivialcompfunc. qsort is
opaque for that case. If qsort() were in source code and
local to trivialcompfunc then perhaps it could be done.
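
To make the "local" case concrete, here is a minimal sketch, assuming a hand-written insertion sort defined in the same translation unit as the comparator. The name local_sort and the 64-byte element limit are hypothetical choices for illustration, not anything from a library. Because the compiler can see both bodies, it is at least permitted to inline the comparison at an optimising level; whether it actually does depends on the compiler and its flags. The comparator uses a comparison-based return value rather than subtraction, which avoids signed overflow for extreme inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Comparator with the same shape qsort() expects. */
static int trivialcompfunc(const void *e1, const void *e2)
{
    const int *i1 = e1;
    const int *i2 = e2;

    return (*i1 > *i2) - (*i1 < *i2);   /* avoids overflow of *i1 - *i2 */
}

/* Hypothetical local stand-in for qsort(): a generic insertion sort.
   Its body is visible in this file, so a call with a known comparator
   is a candidate for inlining. Elements are limited to 64 bytes. */
static void local_sort(void *base, size_t n, size_t size,
                       int (*cmp)(const void *, const void *))
{
    unsigned char *a = base;
    unsigned char tmp[64];
    size_t i, j;

    assert(size <= sizeof tmp);
    for (i = 1; i < n; i++) {
        memcpy(tmp, a + i * size, size);
        /* Shift larger elements one slot to the right. */
        for (j = i; j > 0 && cmp(a + (j - 1) * size, tmp) > 0; j--)
            memcpy(a + j * size, a + (j - 1) * size, size);
        memcpy(a + j * size, tmp, size);
    }
}

void complexalgorithm(int *array, int N)
{
    assert(N >= 0);
    local_sort(array, (size_t)N, sizeof(int), trivialcompfunc);
}
```

Calling complexalgorithm(v, n) sorts v in ascending order. Inspecting the generated assembly (for example with gcc -O2 -S) is the only reliable way to confirm whether the indirect call was removed.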

BartC · Nov 26, 2012

David Brown said:
On 26/11/2012 14:01, BartC wrote:

Cache sizes depend on the processor, but an Intel i7 (as an example) has
32 KB level 1 instruction cache as far as I know. I am not saying that
this /is/ a problem in your code, but it certainly could be (and the risk
gets higher as you fill out the stub functions).

OK, I thought they were of the order of 1MB or something.

One advantage of a simple interpreter is that the code *is* localised,
because usually a few dozen handlers are constantly being executed, no
matter how big the target program is. So maybe I'm already benefitting!

Yes, gcc re-orders your code (assuming, as always, you use appropriate
optimisation levels).

Depending on the version of gcc that you are using, it might also split
your code automatically into "hot" and "cold" code. "hot" code is things
like inner loops, while "cold" code is things like initialisation code.
gcc will re-arrange the order of the code to make "hot" code stick
together - this improves cache locality when running. You can also mark
functions as "hot" or "cold" using attributes, or automatically by using
profiling.
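
As a rough illustration of the attribute route, here is a minimal sketch assuming a GCC-compatible compiler. The function names init_table and sum_table are hypothetical; the attributes are only hints about placement and optimisation effort, and their effect varies between compiler versions.

```c
#include <assert.h>
#include <stddef.h>

/* GCC-style attributes; on other compilers the macros expand to nothing. */
#if defined(__GNUC__)
#define HOT  __attribute__((hot))
#define COLD __attribute__((cold))
#else
#define HOT
#define COLD
#endif

/* Hypothetical start-up code: runs once, so it is marked cold. */
COLD int init_table(int *table, int n)
{
    int i;

    assert(table != NULL && n >= 0);
    for (i = 0; i < n; i++)
        table[i] = i * i;
    return n;
}

/* Hypothetical inner loop: marked hot so the compiler may group it
   with other frequently executed code. */
HOT long sum_table(const int *table, int n)
{
    long sum = 0;
    int i;

    assert(table != NULL && n >= 0);
    for (i = 0; i < n; i++)
        sum += table[i];
    return sum;
}
```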

The problem with an instruction dispatcher like I have is that I can't see
how the compiler can guess which are the more heavily used functions; they
all have equal 'weight'.

The information needed for that isn't in the structure of the C source code,
but in the structure of the bytecode program, which is data that is read
into the C program.

So I will try and do something with those inlining attributes when I've
finished development. (At the moment it's just a distraction, so those
functions are kept external. Then I will know if any increase/reduction in
speed is to do with my code, rather than a different inlining strategy coming
into play.)

gcc /can/ inline calls through function pointers (which is pretty
impressive, IMHO), but only when it can tell what the function will be.

It'll have its work cut out here:

typedef void (*(*fnptr))(void);

do {
(**(fnptr)(pcptr))();
} while (1);

I wouldn't even know where to start! (pcptr[] is an array of mixed-use
integers, some of which will be function pointers, which is set up at
runtime.)
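
For readers who want something runnable, here is a minimal sketch of this style of dispatch. It is not the interpreter discussed above: the union cell type, the two handlers and run_demo are assumptions made for illustration, and a union is used in place of mixed-use integers to avoid pointer/integer casts. What it does share with the loop above is the key property: the call target is read from data built at run time, so the compiler has no fixed function to inline.

```c
#include <assert.h>
#include <stddef.h>

/* One program cell: either a handler address or an inline operand. */
typedef union cell {
    void (*fn)(void);
    long operand;
} cell;

static const cell *pc;   /* current position in the program */
static long acc;         /* accumulator */
static int running;

/* Handlers advance pc themselves, past their own operands. */
static void op_add(void)  { acc += pc[1].operand; pc += 2; }
static void op_halt(void) { running = 0; }

/* The dispatch loop: every call goes through a pointer read at run time. */
long run(const cell *program)
{
    assert(program != NULL);
    pc = program;
    acc = 0;
    running = 1;
    do {
        pc->fn();
    } while (running);
    return acc;
}

/* Builds and runs the program "add 5; add 7; halt". */
long run_demo(void)
{
    cell program[5];

    program[0].fn = op_add;  program[1].operand = 5;
    program[2].fn = op_add;  program[3].operand = 7;
    program[4].fn = op_halt;
    return run(program);
}
```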

glen herrmannsfeldt · Nov 26, 2012

(snip)

I know little about instruction caches. I wouldn't have thought all these
functions together were that big (the whole program is only 120K and might
be double that when finished). Would it help if the most common functions
(or perhaps the smallest), were together? (But I don't even know if gcc
reorders my functions anyway.)

I don't know that it is a problem in this case, but many machines have
separate instruction and data (I and D) caches, and get very slow if a
block has to switch from one to the other.

In the olden days, (say 40 to 50 years ago) it was usual for
instructions and data to be close together in a program.
(Especially for languages that didn't allow recursion, with all
data static.)

It might be that on some, the use of function pointers gets some
block to switch from one to the other.

-- glen

Malcolm McLean · Nov 26, 2012

Les Cargill said:

I don't see how that can inline trivialcompfunc. qsort is
opaque for that case. If qsort() were in source code and
local to trivialcompfunc then perhaps it could be done.

That's the problem, of course.
Generally when you pass a constant function pointer

void hashtable( int (*fptr)(char *) );

int myhash(char *key)
{
    return (key[0] << 8) | key[1];
}

hashtable( myhash );

it's because hashtable is a generic function that has had an essential bit
of functionality removed to make it more flexible. In this case we can use
a simple hash because the keys have a random distribution in the first two
letters.
But if hashtable was local, then normally you'd hardcode the hash. There's
not much point in inlining only local indirect calls.

BartC · Nov 26, 2012

So I will try and do something with those inlining attributes when I've
finished development. (At the moment it's just a distraction, so those
functions are kept external. Then I will know if any increase/reduction in
speed is to do with my code, rather than a different inlining strategy coming
into play.)

I decided to try something with inlining now. With all my functions in the
same file as the dispatch code, but all with the 'noinline' attribute, the
timings are the same as when the functions are external:

Switch: 75 seconds
Label ptrs: 66 seconds

With all functions using 'always_inline', I get:

Switch: 60 seconds
Label ptrs: 48 seconds

About 37% increase in performance (27% decrease in execution time), using
label pointers (but a switch still needs to be considered, as label pointers
are not portable).
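
A minimal sketch of how such an experiment might be set up, assuming GCC or Clang attribute syntax. The HANDLER macro, the FORCE_INLINE switch, the three opcodes and the tiny stack machine are all hypothetical and much smaller than a real interpreter; there are no stack bounds checks. Building once with -DFORCE_INLINE=1 and once with -DFORCE_INLINE=0 gives the two variants to time against each other.

```c
#include <assert.h>
#include <stddef.h>

/* Set FORCE_INLINE to 0 on the command line to compare strategies. */
#ifndef FORCE_INLINE
#define FORCE_INLINE 1
#endif

#if defined(__GNUC__) && FORCE_INLINE
#define HANDLER static inline __attribute__((always_inline))
#elif defined(__GNUC__)
#define HANDLER static __attribute__((noinline))
#else
#define HANDLER static   /* no attribute support assumed elsewhere */
#endif

enum { OP_PUSH, OP_ADD, OP_HALT };

typedef struct vm {
    long stack[16];
    int sp;              /* index of the next free slot */
} vm;

HANDLER void do_push(vm *m, long v)
{
    m->stack[m->sp++] = v;
}

HANDLER void do_add(vm *m)
{
    m->sp--;
    m->stack[m->sp - 1] += m->stack[m->sp];
}

/* Switch-based dispatch: the portable alternative to label pointers. */
long run_switch(const long *code)
{
    vm m;

    assert(code != NULL);
    m.sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH: do_push(&m, *code++); break;
        case OP_ADD:  do_add(&m);           break;
        case OP_HALT: return m.sp > 0 ? m.stack[m.sp - 1] : 0;
        default:      return -1;   /* unknown opcode */
        }
    }
}
```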

I was a bit worried because many functions are now called in several places
(some handlers are wrappers or stubs for others), but it seemed to work out.

(I'm quite impressed. When I first tried a C version of this project, it was
about 1/3 the speed of my last (experimental) interpreter, and 1/2 the speed
of my current production interpreter. That 48 seconds makes it faster than
that latter one!

There is also a way of getting better performance: to generate C source code
instead, a series of function calls, to a modified set of handlers. Then it
will be faster than either, but it will no longer be quite so dynamic once
it's hard-coded into the C program...)

Rosario1903 · Nov 27, 2012

That's handy. You very frequently need something like this

int trivialcompfunc(const void *e1, const void *e2)
{
    const int *i1 = e1;
    const int *i2 = e2;

    return *i1 - *i2;
}

void complexalgorithm(int *array, int N)
{
    qsort(array, N, sizeof(int), trivialcompfunc);
}

Clearly, if we can inline trivialcompfunc, it's likely to have a substantial
effect on performance.

Easy: one can "inline" it. It is enough to have the qsort algorithm and
rewrite qsort for that compare function...
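
A minimal sketch of that idea, assuming a Shell sort is an acceptable stand-in for whatever algorithm the library's qsort uses (the C standard does not specify one). The name sort_ints is hypothetical. The comparison is written directly into the loop, so there is no function pointer left to inline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical specialised sort for int arrays: Shell sort with the
   comparison hard-coded instead of passed in as a callback. */
void sort_ints(int *a, size_t n)
{
    size_t gap, i, j;

    assert(a != NULL || n == 0);
    for (gap = n / 2; gap > 0; gap /= 2) {
        /* Gapped insertion sort for this gap size. */
        for (i = gap; i < n; i++) {
            int tmp = a[i];

            for (j = i; j >= gap && a[j - gap] > tmp; j -= gap)
                a[j] = a[j - gap];
            a[j] = tmp;
        }
    }
}
```

The cost is the one raised elsewhere in the thread: the routine is no longer generic, so each element type needs its own copy.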

Rosario1903 · Nov 27, 2012

typedef void (*(*fnptr))(void);

do {
(**(fnptr)(pcptr))();
} while (1);

is it for a 64 bit program?

BartC · Nov 27, 2012

David Brown said:
On 26/11/2012 15:14, BartC wrote:

That is why you need to use profiling. That means you run the program
with real-world data (in this case, typical bytecode programs), and let
the profiler count the number of times functions really are used. This
profile data gets fed back to the compiler, which then knows how to weight
different functions.

How does it 'get fed back to the compiler'?

BartC · Nov 27, 2012

Rosario1903 said:
is it for a 64 bit program?

This interpreter is designed for a 32-bit processor. The C compiler I use I
guess generates code for 32-bits too. (I think it must do because int and
pointer sizes are 32-bits and I don't remember seeing 64-bit instructions in
the output.)

64-bits wouldn't help in this case unless everything was redesigned, and
then I might be lucky to achieve the same speed as a 32-bit program.

Alain Ketterlin · Nov 27, 2012

BartC said:
How does it 'get fed back to the compiler'?

Depends on your compiler. With gcc, see -fprofile-generate/-fprofile-use
and friends.

-- Alain.

James Kuyper · Nov 27, 2012

How does it 'get fed back to the compiler'?

A compiler option tells it to retrieve that information from the file
that it's stored in. With gcc, for instance, the profile data has the
extension ".gcda". You should look at the documentation for the
following gcc profile-related options:

-p
-pg
-fprofile-arcs
-fprofile-values
-fprofile-generate
-fbranch-probabilities
-freorder-functions
-fprofile-use (which enables "feedback directed optimizations, and
optimizations generally profitable only with profile feedback enabled",
including -fvpt -funroll-loops -fpeel-loops -ftracer -fno-loop-optimize)
--param min-inline-recursive-probability
--param tracer-dynamic-coverage
--param tracer-dynamic-coverage-feedback
--param tracer-min-branch-ratio
--param tracer-min-branch-ratio-feedback
--param reorder-blocks-duplicate
--param reorder-blocks-duplicate-feedback
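
A minimal sketch of the basic workflow with the two main options, assuming gcc. The file name pgo_demo.c, the PGO_DEMO_MAIN macro and the classify function are hypothetical; the exact name and location of the generated .gcda file depend on the gcc version, so check the manual for yours.

```c
/* Hypothetical file name: pgo_demo.c
 *
 * Step 1, instrumented build:
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-generate pgo_demo.c -o pgo_demo
 * Step 2, run on representative input (writes a .gcda file):
 *   ./pgo_demo
 * Step 3, rebuild using the recorded profile:
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-use pgo_demo.c -o pgo_demo
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* A branch whose bias cannot be seen in the source; only a profile
   run shows how rarely it is taken for typical data. */
long classify(const int *v, size_t n)
{
    long rare = 0;
    size_t i;

    assert(v != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (v[i] % 100 == 0)
            rare++;
    }
    return rare;
}

#ifdef PGO_DEMO_MAIN
int main(void)
{
    static int data[100000];
    size_t i;

    for (i = 0; i < 100000; i++)
        data[i] = (int)i;
    printf("%ld\n", classify(data, 100000));
    return 0;
}
#endif
```

The profile is only as good as the training run: for an interpreter, step 2 would mean running typical bytecode programs, as suggested earlier in the thread.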

Jorgen Grahn · Nov 27, 2012

[...]

Why not?

You're right, my phrasing was misleading. What I meant was: static will
not *by itself* determine whether a function is always inlined or not.

Ah, good.

By the way, static should always be used when it applies, because it
also lets the compiler emit non strictly ABI-compliant code, which may
be faster.

I think I see gcc do this even for non-static functions -- they may
get one ABI-compliant entry point and one tuned for local callers. The
latter presumably skips some register saving or something. (But I
don't know x86 assembly well enough to be sure about this!)

/Jorgen

Jorgen Grahn · Nov 27, 2012

On 26/11/2012 15:14, BartC wrote: ....

That is why you need to use profiling. That means you run the program
with real-world data (in this case, typical bytecode programs), and let
the profiler count the number of times functions really are used. This
profile data gets fed back to the compiler, which then knows how to
weight different functions.

Nitpick: what you're describing is more widely known as
"profile-guided optimization" or "profiler-guided optimization".

When you say "profiling" I think of the traditional (and orthogonal!)
approach: running a profiler and reading and thinking about the
results, possibly doing manual optimization work.

/Jorgen

Jorgen Grahn · Nov 28, 2012

To get good code, it is
important to give the compiler freedom to generate its best code - that
means [...] and either turning off debug information generation,
or at least picking the most expressive possible debug format (so that
the compiler does not restrict code generation to suit debugger
capabilities).

Do any compilers still work that way? Which ones? I remember fighting
that idea fifteen years ago. People kept saying

"We can't enable debug symbols, because that would disable
optimization."

or the dreaded

"We'll enable optimization later, when we ship the code. We can't
do it now, because it would disable debug information."

even though our particular compiler's documentation clearly showed
that these were unrelated settings.

Come to think of it, the fight is still ongoing. Only a month ago I
enabled optimization for a piece of performance-critical code.

/Jorgen

James Kuyper · Nov 28, 2012

To get good code, it is
important to give the compiler freedom to generate its best code - that
means [...] and either turning off debug information generation,
or at least picking the most expressive possible debug format (so that
the compiler does not restrict code generation to suit debugger
capabilities).

Do any compilers still work that way? Which ones? I remember fighting
that idea fifteen years ago. People kept saying

"We can't enable debug symbols, because that would disable
optimization."

or the dreaded

"We'll enable optimization later, when we ship the code. We can't
do it now, because it would disable debug information."

even though our particular compiler's documentation clearly showed
that these were unrelated settings.

Come to think of it, the fight is still ongoing. Only a month ago I
enabled optimization for a piece of performance-critical code.

It's not just a matter of ancient compilers that can't combine debugging
with optimization. In my experience, debugging code that has been
significantly optimized is both quite possible, and exceedingly
difficult, because the code that I'm watching behaves quite differently
from the code that I wrote. One of the first things I do, when debugging
appears to be necessary, is to determine whether the bug disappears when
optimizations are turned off. If it doesn't, I normally leave them
turned off during debugging, for the sake of my own sanity.

Ben Pfaff · Nov 28, 2012

James Kuyper said:
It's not just a matter of ancient compilers that can't combine debugging
with optimization. In my experience, debugging code that has been
significantly optimized is both quite possible, and exceedingly
difficult, because the code that I'm watching behaves quite differently
from the code that I wrote. One of the first things I do, when debugging
appears to be necessary, is to determine whether the bug disappears when
optimizations are turned off. If it doesn't, I normally leave them
turned off during debugging, for the sake of my own sanity.

Yesterday I noticed this in the GCC 4.8 changes:

A new general optimization level, -Og, has been introduced.
It addresses the need for fast compilation and a superior
debugging experience while providing a reasonable level of
runtime performance. Overall experience for development
should be better than the default optimization level -O0.

If it works as promised, then this should be very helpful.

Sergio Durigan Junior · Nov 29, 2012

Yesterday I noticed this in the GCC 4.8 changes:

A new general optimization level, -Og, has been introduced.
It addresses the need for fast compilation and a superior
debugging experience while providing a reasonable level of
runtime performance. Overall experience for development
should be better than the default optimization level -O0.

Yeah, the -Og flag has been discussed for quite some time, but it is
still pretty recent and may need some improvements, of course.

Anyway, this subject (debugging optimized programs) is interesting,
difficult, and there are some good approaches that try to address it.
One of the most recent projects is what we call "Variable-tracking at
assignments", a project from Alexandre Oliva at Red Hat, which aims to
improve the debugging experience for local variables that have been
optimized out. It's a pretty good attempt to tackle some subset of the
whole problem.

More about it: http://gcc.gnu.org/wiki/Var_Tracking_Assignments

I am a GDB developer, and when I talk to people about whether they
should disable optimizations or not, I often say that they should.
"Compile your program with -O0 -g3", I say. But fortunately, given the
efforts being made on the compiler front, even -O2 programs have better
debuginfo today than they used to some years ago.

Malcolm McLean · Nov 29, 2012

On 11/28/2012 04:27 PM, Jorgen Grahn wrote:

It's not just a matter of ancient compilers that can't combine debugging
with optimization. In my experience, debugging code that has been
significantly optimized is both quite possible, and exceedingly
difficult, because the code that I'm watching behaves quite differently
from the code that I wrote. One of the first things I do, when debugging
appears to be necessary, is to determine whether the bug disappears when
optimizations are turned off. If it doesn't, I normally leave them
turned off during debugging, for the sake of my own sanity.

Yes, you want to unit-test everything that can be unit tested.

Fortunately, normally the algorithmically-difficult parts of a program
fall quite naturally into leaf functions, which are easy to unit test.
The glue code is normally simple loops, calls, and a bit of routine
condition-testing logic.

But I usually find diagnostic printfs() easier to work with than debuggers.
On MS Windows, printf() is vandalised. So I have my own little console
with a Con_Printf() routine I can call from Windows procedures.
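
As an illustration of the general idea only, here is a minimal sketch of a printf-style diagnostic routine that appends to an in-memory buffer. It is not the Con_Printf() mentioned above, whose implementation is not shown in this thread; the names diag_printf and diag_text and the 4096-byte buffer are assumptions. A real console would display the buffer in a window instead of just storing it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static char diag_log[4096];   /* hypothetical in-memory console buffer */
static size_t diag_len;       /* bytes used, excluding the terminator */

/* printf-style logger. Returns the number of characters appended,
   or a negative value if formatting failed. Output that does not
   fit in the remaining space is truncated. */
int diag_printf(const char *fmt, ...)
{
    va_list ap;
    size_t room = sizeof diag_log - diag_len;
    int n;

    assert(fmt != NULL);
    va_start(ap, fmt);
    n = vsnprintf(diag_log + diag_len, room, fmt, ap);
    va_end(ap);

    if (n < 0)
        return n;
    if ((size_t)n >= room)
        n = (int)(room - 1);   /* vsnprintf truncated the output */
    diag_len += (size_t)n;
    return n;
}

/* Returns everything logged so far as one string. */
const char *diag_text(void)
{
    return diag_log;
}
```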

James Kuyper · Nov 29, 2012

On 11/29/2012 10:24 AM, Malcolm McLean wrote:
....

But I usually find diagnostic printfs() easier to work with than debuggers.

In my experience, that depends upon the debugger, the compiler, and the
type of problem being debugged. We used to use the Irix MIPS-Pro C
compiler, and I used dbx to debug problems when necessary; I only rarely
felt the need to use debugging printf()s. We're now using exclusively
gcc, and I have only gdb to debug with, and even when I compile with
-O0, the behavior of the code as displayed by gdb quite frequently
deviates significantly from what I wrote. I can't be sure whether this
is due to the compiler or the debugger, or both.
As a result, I've often found it necessary to use debugging printf()s to
extract information about what's going on that's actually consistent
with the way I wrote the code.
