Reducing build-times for large projects.


Jason Heyes

I have read item 26 of "Exceptional C++" that explains a way of minimising
compile-time dependencies and, therefore, a way to improve compile speeds.
The procedure involves replacing #include directives with forward
declarations from within header files.

However, a potentially large number of source files may have previously
relied on the removed #include directives being present. Therefore it is
likely that a large number of source files will not compile as a result of
the modification and the "missing" #include directives must be inserted into
each of them. This is a pain.

Here is an example. The source file Z.cpp includes Y.h. The function defined
in Y.h is inlined.

// X.h
#include <iostream>
struct X { void f() { std::cout << "X::f" << std::endl; } };

// Y.h
#include "X.h"
inline void y(X x) { x.f(); }

// Z.cpp
#include "Y.h"
int main()
{
    y(X());
    return 0;
}

When we use a forward declaration to remove the dependency of Y.h on X.h, we
see that Z.cpp must include X.h because main calls X::X(). So Z.cpp compiles
no faster than before. We might as well have left the dependency in
place and saved ourselves the extra work.

// X.h (as before)

// Y.h
class X;
void y(X x);

// Y.cpp
#include "X.h"
void y(X x) { x.f(); }

// Z.cpp
#include "Y.h"
#include "X.h"
// (rest as before)

The example shows that it doesn't always pay to minimise compile-time
dependencies. Here is the question. How do you make a judgement on what
dependencies should (or should not) be removed? Any thoughts are
appreciated.
 

rajkumar

I think you have the concept of forward declaration wrong. The idea is
that this:

____________
// X.h
class X{};

//Y.h
#include "X.h"

void foo( X* t);
______________

can be replaced with this:

____________
// X.h
class X{};

//Y.h
class X;

void foo( X* t);
______________

This is because Y doesn't need to know the layout of class X. But in the
example below you cannot forward declare X, because foo takes X by
value. If you do what you have done, you will have to include X.h
before Y.h in every cpp file. That is a bad idea on a large project,
as people would have to remember what order the headers need to be
included in.

_________________
// X.h
class X{};

//Y.h
#include "X.h"

void foo( X t);
______________


If you want to speed up compilation, see if your compiler supports some
sort of precompiled headers and use that. Don't use too many inline
functions. With regular functions, only a prototype change triggers a
rebuild of client code, but with inline functions even changing the
function body will trigger rebuilds.
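
For instance (a hypothetical Counter class, just to illustrate which
files get rebuilt):

____________
// counter.h -- inline version: the body lives in the header, so
// editing the body of next() forces a recompile of every .cpp
// that includes counter.h
class Counter
{
    int n;
public:
    Counter() : n(0) {}
    int next() { return ++n; }  // implicitly inline
};

// counter.h -- out-of-line version: only the prototype is here
class Counter
{
    int n;
public:
    Counter();
    int next();
};

// counter.cpp -- editing the body now recompiles only this file
#include "counter.h"
Counter::Counter() : n(0) {}
int Counter::next() { return ++n; }
______________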

Raj
 

Phil Staite

Jason said:
However, a potentially large number of source files may have previously
relied on the removed #include directives being present. Therefore it is
likely that a large number of source files will not compile as a result of
the modification and the "missing" #include directives must be inserted into
each of them. This is a pain.

A pain, maybe, but good software IMHO.

In big projects you absolutely must, must reduce coupling or chaining of
header files whenever and wherever possible. Sure, the source files
that need to see the full declarations will have to include the header
files that they need. This is a good thing - it lets you know what
they're grabbing right there - without having to go grovelling through a
zillion chained header files.

Also, it *will* save you time on compiles of source files that do not
need the full declarations. Since the headers they include will be
simpler, not chained to a bunch of extraneous headers, they will compile
faster.

Precompiled headers and small to modest sized projects lull far too many
programmers into lazy practices. Then you hit a project with a few
million SLOC with a few thousand source files and all of a sudden it's
taking hours to build... :-(

Jason said:
The example shows that it doesn't always pay to minimise compile-time
dependencies. Here is the question. How do you make a judgement on what
dependencies should (or should not) be removed? Any thoughts are
appreciated.

On the contrary, I would say that it does always pay to minimize
compile-time dependencies. Even if you don't see an immediate or
impressive reduction in compile times. It makes good sense, and it
reduces coupling between files and compilation units.
 

Jason Heyes

Phil Staite said:
On the contrary, I would say that it does always pay to minimize
compile-time dependencies. Even if you don't see an immediate or
impressive reduction in compile times. It makes good sense, and it
reduces coupling between files and compilation units.

Thanks for your excellent reply.

Inline functions go away as compile-time dependencies are minimised. Do you
see this as an issue worth worrying about? Will program performance be
compromised in favour of compilation speed?
 

rajkumar

You can use this neat inline trick.

A.h contains the class declaration:
______________________________

// A.h
class A
{
public:
    void test();
};

#ifdef RELEASE
#define MYINLINE inline
#include "a.inl"
#endif

______________________________

A.inl contains all the inline function definitions:
_________________________________

// a.inl
MYINLINE void A::test()
{
}

_____________________________

A.cpp contains the method implementations. a.inl is included from both
A.h and A.cpp, with MYINLINE defined appropriately in each:
_____________________________

// A.cpp
#include "A.h"

#ifndef RELEASE
#define MYINLINE  // expands to nothing
#include "a.inl"
#endif

_____________________________

In debug builds a.inl actually expands in A.cpp, but in release builds
(compiled with RELEASE defined) the functions can actually be inlined.
You get smaller compile times in debug builds and better performance in
release builds.
 

Phil Staite

Jason said:
Inline functions go away as compile-time dependencies are minimised. Do you
see this as an issue worth worrying about? Will program performance be
compromised in favour of compilation speed?

Actually, there's some evidence that inline functions can *hurt* program
performance. Consider, any reasonably fast CPU will have a clock speed
far in excess of main memory's ability to keep up. Hence instruction
cache is king in terms of performance. Cache "misses" mean wait states
while awaiting a fetch from main memory.

Now consider a reasonably heavily used, short function (e.g. accessor
functions for protected/private data come to mind). Sounds like a great
candidate for inlining, right? But inlining puts a copy of that code
(admittedly small) at N different addresses in the executable code.
Whereas a non-inlined function has one copy, at one address. So if the
cache is big enough, and the program stays local enough, the non-inlined
copy may stay in cache most or all of the time. Meanwhile, the inlined
code, being at unique addresses, must be fetched each time it is needed
from main memory.

Depending on a lot of other factors, it is not beyond the realm of
possibility that the non-inlined code is faster, since a cache hit for
the code would outweigh the penalty of a call/return. While the penalty
for a cache miss might outweigh the benefit of avoiding a call/return.

The upshot of this little thought experiment is the same old song:
Don't ever assume X is better than Y in terms of performance. Always,
always, always test to see the real story on your CPU, OS, compiler,
program, and data set... The corollary is: make it work first, then
make it work fast. Of course that's not a license to choose obviously
stupid algorithms, data structures, and program structure. ;-)
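
For what it's worth, here is the rough shape of the test I mean (a
made-up Point accessor; a real measurement needs more care about
optimizer elision, warm caches, and repeated runs):

// bench.cpp - time the hot call with the accessor defined inline in
// the header, then again with the body moved out to a .cpp, and compare
#include <cstdio>
#include <ctime>

struct Point {
    int x_;
    int x() const { return x_; }  // candidate for inlining
};

int main()
{
    Point p = { 42 };
    long sum = 0;
    std::clock_t start = std::clock();
    for (int i = 0; i < 10000000; ++i)
        sum += p.x();             // the call being measured
    std::clock_t stop = std::clock();
    std::printf("sum=%ld time=%ldms\n", sum,
                (long)((stop - start) * 1000 / CLOCKS_PER_SEC));
    return 0;
}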
 

Jason Heyes

Phil Staite said:
Actually, there's some evidence that inline functions can *hurt* program
performance. [snip]

Ok. Thanks dude.
 

DHOLLINGSWORTH2

I'm sorry, but inline functions are loaded right along with the rest of the
code. If the line before it executed, then the very next inline code has
already been loaded.

You are correct that not loading code is faster than loading it and waiting
for it to execute. But the argument is like saying don't push the BP because
the actual code to push the BP has to be loaded - SO DOES THE F'N CODE TO
CALL THE FUNCTION THAT IS ALREADY IN CACHE!

And it doesn't have to load as much code inline as it does going through the
stack.

Loop unrolling has been going on for years, and it's a lot faster loading
that same chunk of code over and over than staying in the same spot doing
all of the compares.

If, however, you don't like inline, you can also use register calling, which
the compiler can streamline to gain even more speed. The difference between
that and inline is a CALL [addr];
 

Phil Staite

DHOLLINGSWORTH2 said:
I'm sorry, but inline functions are loaded right along with the rest of the
code. If the line before it executed, then the very next inline code has
already been loaded.

Maybe, if your architecture does speculative execution and fetch. It
would also need a deep enough pipeline to see that it needs to do that
fetch far enough in advance to get it done in time. It would also
depend on the nature of the instructions in front of it, and how long it
takes the CPU to chew through them. Otherwise you're looking at a stall.

You are correct that not loading code is faster than loading it and waiting
for it to execute. But the argument is like saying don't push the BP because
the actual code to push the BP has to be loaded - SO DOES THE F'N CODE TO
CALL THE FUNCTION THAT IS ALREADY IN CACHE!

We're talking about the code for the function, not the code that leads
up to the function. You either have multiple calls to one set of
instructions, or just that set of instructions plonked down in the
instruction stream in multiple places. (i.e. you've replaced your calls
with the code)

In the non-inlined case you may get lucky and only fetch the call
instruction from main memory, then hit in the cache for the actual
function's instructions.

In the case of inlined code, you may get lucky and hit the cache for the
inlined code too - just as in the non-inlined case.

However, I would contend that it is simple probability that if there is
only one copy of the code used by all the places that reference it, that
is far more likely to be in cache than one particular instance out of
many inlined copies. Also that one copy of the function's code is less
likely to push other things out of the cache (that you may later want
back) than N copies of the same code, from different addresses.

DHOLLINGSWORTH2 said:
Loop unrolling has been going on for years, and it's a lot faster loading
that same chunk of code over and over than staying in the same spot doing
all of the compares.

Loop unrolling "works" because you get to do N iterations worth of the
loop code without hitting a conditional/branch. This reduces the per
iteration "overhead" of the loop. It also helps in that speculative
execution past a conditional/branch can be problematic. So the more
work you can get done between those points the better. At some point
though you hit diminishing returns on loop unrolling. You still have to
load N times as much code the first time through the loop. Generally it
all fits in cache so subsequent iterations are free in that respect.
However, you can have too much of a good thing. At some point you're
pushing enough instructions out of cache to make room for your unrolled
loop that it comes back to bite you. IIRC most compilers limit loop
unrolling to the low single digits as far as the number of copies they make.
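
To make that concrete, here is a hand-unrolled sketch of what the
compiler is doing (factor of four, made-up array sum):

// Rolled: one compare-and-branch per element.
long sum1(const int* a, int n)
{
    long s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Unrolled by 4: one compare-and-branch per four elements, at the
// cost of four copies of the body in the instruction stream.
long sum4(const int* a, int n)
{
    long s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i+1] + a[i+2] + a[i+3];
    for (; i < n; ++i)  // leftover elements
        s += a[i];
    return s;
}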


But this whole discussion is just barely anchored in reality. An awful
lot depends on real-world parameters that vary widely and wildly from
system to system, program to program. I was just trying to point out to
the OP that when testing reveals something that makes you say "Oh
{expletive}! How'd that happen?" you've got to be ready to challenge
your assumptions. One common assumption is that inlining improves
execution speed. It may, but then again, it may not.
 

Johan Nilsson

Jason said:
I have read item 26 of "Exceptional C++" that explains a way of minimising
compile-time dependencies and, therefore, a way to improve compile speeds.
The procedure involves replacing #include directives with forward
declarations from within header files.

I've not read "Exceptional C++", only "More Exceptional C++", so can't
comment on the specifics.

[snip]
// X.h (as before)

// Y.h
class X;
void y(X x);

// Y.cpp
#include "X.h"
void y(X x) { x.f(); }

// Z.cpp
#include "Y.h"
#include "X.h"
// (rest as before)

The example shows that it doesn't always pay to minimise compile-time
dependencies. Here is the question. How do you make a judgement on what
dependencies should (or should not) be removed? Any thoughts are
appreciated.

I assume that there are typos in the code above, as you can't call a
function that takes a forward-declared class by value - the caller needs
the full definition of the class - i.e.:

// Y.h
class X;
void y(X x);

=>

class X;
void y(X* x); // or X& or X const&

That aside, I'm a bit surprised (if that really is the case) that
classes used in _public_ interfaces (including protected) are only
forward declared. All header files should be self-contained, and not
rely on users first including e.g. the declaration of X.

IMHO, you should limit yourself to forward declaring classes used in the
implementation only, e.g. private members and parameters/return values
to/from private methods. If you do that, you will only have to include
the declaration from the implementation file of the containing class
(y.cpp in the above case).
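
A sketch of that rule, with made-up names (Gadget appears in the public
interface, Helper only in the implementation):

// widget.h -- self-contained: includes what its interface needs
#include "gadget.h"  // Gadget is used by value in the public interface

class Helper;        // implementation detail: forward declaration suffices

class Widget {
public:
    void attach(Gadget g);  // users need Gadget's definition anyway
private:
    Helper* helper_;        // pointer member: an incomplete type is fine
};

// widget.cpp -- only the implementation needs Helper's full definition
#include "widget.h"
#include "helper.h"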

Regards // Johan
 

DHOLLINGSWORTH2

I would have to agree with you when we are talking about functions of
considerable size. I immediately thought of inline as being for functions
whose size is comparable to that of a function call - two to three times
the size of the overhead.

But yes, the bigger the inline function, the greater the chance of a
problem occurring.

Do you know how the multi-core tech works?
Do they each have a dual pipeline, or something better?
If it stalls, does each pipeline need to re-up, or do they run strictly
independently?
 

Andre Kostur

Phil Staite said:
Actually, there's some evidence that inline functions can *hurt*
program performance. Consider, any reasonably fast CPU will have a
clock speed far in excess of main memory's ability to keep up. Hence
instruction cache is king in terms of performance. Cache "misses"
mean wait states while awaiting a fetch from main memory.

Now consider a reasonably heavily used, short function (e.g. accessor
functions for protected/private data come to mind). Sounds like a
great candidate for inlining, right? But inlining puts a copy of that
code (admittedly small) at N different addresses in the executable
code. Whereas a non-inlined function has one copy, at one address. So
if the cache is big enough, and the program stays local enough, the
non-inlined copy may stay in cache most or all of the time.
Meanwhile, the inlined code, being at unique addresses, must be
fetched each time it is needed from main memory.

You'll need a bigger function than simple accessors to trigger that
behaviour. On a simple enough accessor, the entire function call setup
and teardown code may be larger than the actual accessor code. So even
with your argument, inlining would make the code _smaller_ and not
bigger. This is the danger of making generalizations.......
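
For instance (a made-up accessor; actual code generation varies by
compiler and calling convention):

class Account {
    int balance_;
public:
    int balance() const { return balance_; }  // body is one load
};

// Inlined, the call below boils down to roughly a single load
// instruction; out-of-line it needs the call/return pair plus any
// argument shuffling - quite possibly more bytes than the body itself.
int probe(const Account& a) { return a.balance(); }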

Phil Staite said:
Depending on a lot of other factors, it is not beyond the realm of
possibility that the non-inlined code is faster, since a cache hit for
the code would outweigh the penalty of a call/return. While the
penalty for a cache miss might outweigh the benefit of avoiding a
call/return.

The upshot of this little thought experiment is the same old song:
Don't ever assume X is better than Y in terms of performance. Always,
always, always test to see the real story on your CPU, OS, compiler,
program, and data set... The corollary is: make it work first, then
make it work fast. Of course that's not a license to choose
obviously stupid algorithms, data structures, and program structure.

You betcha! Measure first, then optimize.....
 

Andre Kostur

Phil Staite wrote:

We're talking about the code for the function, not the code that leads
up to the function. You either have multiple calls to one set of
instructions, or just that set of instructions plonked down in the
instruction stream in multiple places. (ie you've replaced your calls
with the code)

However, by "plonking down" the code, you give the optimizer more to work
with....
In the non-inlined case you may get lucky and only fetch the call
instruction from main memory, then hit in the cache for the actual
function's instructions.

In the case of inlined code, you may get luck and hit the cache for
the inlined code too - just as in the non-inlined case.

However, I would contend that it is simple probability that if there
is only one copy of the code used by all the places that reference it,
that is far more likely to be in cache than one particular instance
out of many inlined copies. Also that one copy of the function's code
is less likely to push other things out of the cache (that you may
later want back) than N copies of the same code, from different
addresses.

Except (as I mentioned in another post) if your function setup and
teardown code is larger than the actual code within the inlined function,
then you are increasing the code size at the call point and thus
increasing the probability of your other function being pushed out of
cache (hey... if we're talking probabilities....)

Or... if your inlined function is executed within a loop, let's assume
something along the lines of:

for (int i = 0; i < 1000000; ++i) someFn(i);

If someFn() isn't inlined, then you'd have to pay for the 1000000
call/returns to someFn. If someFn is inlined, then perhaps the first 90%
of someFn works out to be a loop invariant and the optimizer pulls all of
that code out in front of the loop, and only executes the remaining 10%
of the code 1000000 times. (Of course this depends on your optimizer.)
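
A contrived someFn along those lines (computeScale is made up, and
whether the hoist happens is entirely up to the optimizer):

int computeScale()  // expensive, but returns the same value every call
{
    int s = 1;
    for (int k = 0; k < 1000; ++k)
        s = (s * 31 + 7) % 97;
    return s;
}

inline int someFn(int i)
{
    int scale = computeScale();  // doesn't depend on i: loop-invariant
    return scale * i;            // only this part varies per iteration
}

// Once someFn is inlined here, a good optimizer can hoist the
// computeScale() call in front of the loop and run it once:
//     long total = 0;
//     for (int i = 0; i < 1000000; ++i)
//         total += someFn(i);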

[snip]
But this whole discussion is just barely anchored in reality. An
awful lot depends on real-world parameters that vary widely and wildly
from system to system, program to program. I was just trying to point
out to the OP that when testing reveals something that makes you say
"Oh {expletive}! How'd that happen?" you've got to be ready to
challenge your assumptions. One common assumption is that inlining
improves execution speed. It may, but then again, it may not.

Next common assumption... marking a function as inline doesn't
necessarily mean that it will be inlined.

As usual, beware of making sweeping generalizations.
 

Phil Staite

DHOLLINGSWORTH2 said:
I would have to agree with you when we are talking about functions of
considerable size. I immediately thought of inline as being for functions
whose size is comparable to that of a function call - two to three times
the size of the overhead.

But yes, the bigger the inline function, the greater the chance of a
problem occurring.

Yes, the smaller the code for the function(s), the less likely you are
to run into performance problems, inlined or not. ;-) But yeah, for
simple accessors of say integral types, you may be looking at simple
register loads where the inlined code is the same size as a call/return
would be to even get at the outlined code. In those cases it makes
absolutely no sense *not* to inline.

DHOLLINGSWORTH2 said:
Do you know how the multi-core tech works?
Do they each have a dual pipeline, or something better?
If it stalls, does each pipeline need to re-up, or do they run strictly
independently?

I've only read a little bit about the dual cores. The one guy's take on
it was that they still had some bottlenecks - IIRC a single cache, and
some other shared on-chip resources. :-( I like the idea though, since
we're probably stuck with about the same clock speeds we have now for
the next several years. So we're going to have to get more done with
the same GHz. Fine by me, I love multithreaded programming. And at my
day job I get to play with a 128 CPU machine. :) Who needs dual core
when you have lots of CPUs?
 

DHOLLINGSWORTH2

Phil Staite said:
I've only read a little bit about the dual cores. The one guy's take on
it was that they still had some bottlenecks - IIRC a single cache, and
some other shared on-chip resources. :-( I like the idea though, since
we're probably stuck with about the same clock speeds we have now for the
next several years. So we're going to have to get more done with the same
GHz. Fine by me, I love multithreaded programming. And at my day job I
get to play with a 128 CPU machine. :) Who needs dual core when you have
lots of CPUs?

My concern is this: we already have a memory bus bottleneck, and with
multiple cores accessing the same RAM, that would tend to make the
bottleneck worse. They are going to need loads of internal cache to keep
them all running smoothly.

And I'm willing to bet you'll see a lot more problems like the inline one.
 
