You are correct that current build systems do not handle all incremental
changes correctly; I've been programming long enough to have run into
that problem myself. Without more thought on the topic I'm not entirely
sure how to ensure incremental changes are handled correctly, but it
sounds like you'd need to generate some sort of relationship graph. As
long as the overhead of doing so is not too time consuming, it would be
worthwhile. Then again, if the relationships are becoming that complex,
perhaps the code itself is poorly designed and needs to be redesigned.
I'm leaning towards the following design, and I've implemented
something based on it. It's a bastard mix of concepts from Ant, Maven,
and GNU Make.
First, my goal, using the terms defined above, is to guarantee
incremental correctness over all possible deltas which can be checked
into source control.
This specifically places a number of things outside the scope of
incremental correctness: a correct and bug-free OS, correct versions of
gcc and other compilers, and correctly installed "third party
libraries" which are installed apart from the project in question and
its source control system. It also excludes the build system
implementation itself. That is, most makefiles (or their equivalents)
will be checked into source control, so they must be covered by the
incremental correctness guarantee, but the make implementation itself
is not checked into the same source control as the project in question,
so it falls outside that guarantee; the incremental correctness
guarantee is contingent on the correctness of the build system
implementation.
With that out of the way, it follows pretty quickly that a free-form
language like GNU Make's makefiles is unsuitable for this purpose.
Apart from the design considerations of a file-level dependency graph,
the Turing-complete nature of GNU Make means that we cannot guarantee
incremental correctness when developers can arbitrarily modify
makefiles.
What is needed is a very strict and narrow build script language a la
idiomatic Maven, and to a lesser degree idiomatic Ant. The goal is a
build script language where the developer "instantiates" a template or
a macro from a prebuilt list of macros. A macro would correspond to a
"build type", like "cpp dll" or "java jar". The macro definition
itself would not be in the source control repository of the project in
question, and thus the incremental correctness guarantee would be
contingent on the correctness of the macro implementations. Great care
must be taken when changing macros or adding new ones, as that could
break incremental correctness, but this would not be a normal developer
activity.
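To make that concrete, here is a rough sketch of what a build script
could reduce to under this scheme. The syntax and all names
(JavaJarMacro and so on) are invented for illustration; my actual
format differs, but the principle is that the script carries only
parameters, never logic:

    import java.util.List;

    // The macro definition itself lives outside the project's
    // repository; a build script can only instantiate it.
    record JavaJarMacro(String jarName,
                        List<String> sourceDirs,
                        List<String> jarDependencies,
                        List<String> javacOptions) {}

    final class UtilJarBuildScript {
        // The entire "build script": one macro instantiation, no
        // free-form logic, so there is nothing here that can break
        // incremental correctness.
        static JavaJarMacro describe() {
            return new JavaJarMacro(
                "util.jar",              // output artifact
                List.of("src/util"),     // source directories
                List.of("core.jar"),     // jar dependencies
                List.of("-Xlint:all"));  // a delta from the defaults
        }
    }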
This system also has the desirable property that it removes a lot of
clutter from the build scripts. In my company's current build scripts,
especially the vcproj files, the "information content" is very low.
Settings are heavily duplicated, and sometimes the duplicated settings
differ. Finding the variations is difficult, if not nearly impossible,
and determining why one project has a particular build variation that
another lacks is harder still.
With these macros, all of the common configuration would move to a
common place. This allows for easier auditing of current build options,
as only the deltas from the defaults would appear in the build script
files (delta here in a different sense than the deltas of incremental
builds). It also allows for useful and easy extensions and rewrites. A
simple example: suppose you want to publish Javadocs. In an Ant system,
you would have to add a new Javadoc task to each build script, whereas
with Ant macros, or my macros, you could just add a new piece to the
one macro which all the build scripts use. I go one step further than
Ant and require all developers to use the prebuilt, presumably correct,
macros. (Ant's prebuilt tasks are overwhelmingly not incrementally
correct, including its depend task.)
Each build script would give arguments to the macros: include paths,
library dependencies, preprocessor defines, and so on. The build system
would parse it all, gather all of the instantiated macros, and then
create tasks for each macro. For example, a macro might be "java files
to jar". (Sorry I didn't use C++ as the example, but the C++ case is
actually more complex.) Each such macro invocation would create two
tasks: one task to call javac, and one task to call jar on the produced
class files. In the future, this macro might be extended with another
task, a Javadoc task. The macro implementation contains the information
to associate execution-time dependencies between these tasks. Once all
the tasks are created, the build engine can execute them in DAG order,
parallelizing where possible. It's up to each individual task to
guarantee incremental correctness.
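Here is a minimal sketch of that expansion. Every name here (Task,
JavacTask, JarTask) is invented for illustration; the point is that one
macro instantiation becomes concrete tasks plus execution-time ordering
edges, and the Javadoc extension mentioned above would be one more task
created inside the macro, picked up by every build script that uses it:

    import java.util.List;

    interface Task {
        List<Task> dependencies();  // execution-time ordering edges
        void execute();             // each task owns its incremental correctness
    }

    final class JavacTask implements Task {
        public List<Task> dependencies() { return List.of(); }
        public void execute() { /* invoke javac over the unit's sources */ }
    }

    final class JarTask implements Task {
        private final Task compile;
        JarTask(Task compile) { this.compile = compile; }
        public List<Task> dependencies() { return List.of(compile); }
        public void execute() { /* invoke jar on the produced class files */ }
    }

    final class JavaFilesToJarMacro {
        // Expanding one "java files to jar" instantiation yields two
        // tasks. Adding Javadoc later means one more task created here.
        static List<Task> expand() {
            Task compile = new JavacTask();
            Task jar = new JarTask(compile);
            return List.of(compile, jar);
        }
    }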
GNU Make keeps all of the incremental logic hidden away in its
implementation, in its dependency graph logic. However, this is
insufficient. In my system, this logic is encoded in the task
implementations. There will be a very small number of very stable task
implementations, and modifying them would not be normal developer
practice, so there is some hope that they can be kept correct.
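To sketch what "incremental logic encoded in the task implementation"
means, here is the sort of staleness check a task might own. This
assumes a simple timestamp policy, which is only one possible choice:

    import java.io.File;
    import java.util.List;

    final class UpToDateCheck {
        // A task may be skipped when its output exists and is at least
        // as new as every input; anything else forces the task to run.
        static boolean upToDate(List<File> inputs, File output) {
            if (!output.exists()) return false;
            long outputTime = output.lastModified();
            for (File in : inputs) {
                if (!in.exists() || in.lastModified() > outputTime) return false;
            }
            return true;
        }
    }

A real task needs more than this (my system keeps a state file between
runs, for example), but the shape is the point: the check lives in the
task, not in one central graph engine.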
Let me also note that I don't see any particular reason that this
should be much slower than GNU Make's approach of keeping all of the
incremental logic isolated to one place. Sure, the implementation code
is a little more complex, but the actual actions, such as filesystem
accesses, will be on the same order. As an example: for a real portion
of my company's product, roughly 4300 Java files, on a standard 4-core
desktop, this dependency analysis finishes in less than 5 seconds. An
optimization already in place, which skips the detailed analysis of a
javac unit when none of its dependencies have changed, reduces that to
about 2 seconds. For the same codebase, my build system does a full
clean build in about 200 to 340 seconds, depending on the state of the
hard disk caches (i.e., hot vs cold start), and Maven does it in about
700 seconds. (However, this
comparison is somewhat cheating, as my build system parallelizes the
build whereas Maven does not.) When I picked files which I expected to
have the worst case incremental build time, the worst I managed was
about 60 seconds for a single file modification. (No time given for
Maven, as Maven isn't incrementally correct, so the number is
meaningless.) As noted else-thread, I expect this time to remain
relatively independent of the size of the code base, though
particulars do matter. This is also Java dependency analysis, which,
when done correctly, is considerably more complex than the dependency
analysis required for C++ compilation, so I would expect it to be even
faster for an equivalent amount of C++ source files.
A slightly longer explanation of my build system for Java: the Java
code is broken up into different directories which end up in different
jars; there are about 100 jars for those Java files. The dependency
cascade proceeds without termination to the boundary of the enclosing
javac task, taking the conservative approach, and then javac executes.
The analysis then begins anew for the next javac task, in effect
allowing the dependency cascade to terminate. This is required because
of possible circular references in the Java code, because developers
frequently do not specify java-file-to-java-file dependency
information, and because javac is quite expensive to invoke, so its
cost is amortized over lots of Java files. This seems to result in a
good degree of parallelization and a good degree of incrementality
while still keeping the build fast.
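A much-simplified sketch of that boundary rule follows; the real
analysis is considerably more involved, and all names are illustrative.
Within a unit the cascade never terminates, so any changed input
dirties the whole unit; across units only the produced jars
participate, which is what stops the cascade:

    import java.io.File;
    import java.util.List;

    final class JavacUnit {
        List<File> sources;       // every java file in this unit's directories
        List<File> upstreamJars;  // jars produced by units this one depends on
        File producedJar;

        // Conservative rule: rebuild the whole unit if anything inside
        // the boundary, or any upstream jar, is newer than its jar.
        boolean needsRebuild() {
            if (!producedJar.exists()) return true;
            long jarTime = producedJar.lastModified();
            for (File f : sources)
                if (f.lastModified() > jarTime) return true;
            for (File f : upstreamJars)
                if (f.lastModified() > jarTime) return true;
            return false;
        }
    }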
C++ has an entirely different model. Each cpp dir has its own macro
invocation, each of which creates a single task. When that task
executes during dependency graph traversal, it analyzes which cpp
files are out of date, then creates new tasks for the out of date cpp
files, and adds those tasks to the dependency graph. To be clear, I am
modifying the graph during graph traversal. This allows cool things
like creating cpp files from a black box process (such as unzipping)
and compiling those in parallel (something which would be quite
difficult in GNU Make while still preserving a global cap on the
number of active build jobs), and aggregating multiple cpp files into
a single compiler invocation. I've read that some compilers, IIRC
notably Microsoft's Visual Studio compiler, have a large startup cost,
so giving the compiler more than one cpp file at a time, say in batches
of 5, greatly speeds up the build.
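Here is a rough sketch of that traversal-time graph modification, using
a hypothetical engine interface; none of these names come from my
actual implementation. The directory-level task discovers the
out-of-date files only when it runs, then feeds per-batch compile tasks
back into the same graph, so the global cap on active build jobs still
applies:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    interface BuildEngine {
        void addTask(Runnable task);  // schedules into the live dependency graph
    }

    final class CppDirTask implements Runnable {
        private final BuildEngine engine;
        private final List<File> cppFiles;

        CppDirTask(BuildEngine engine, List<File> cppFiles) {
            this.engine = engine;
            this.cppFiles = cppFiles;
        }

        public void run() {
            // Decide what is out of date only now, at traversal time.
            List<File> stale = findOutOfDate(cppFiles);
            // Aggregate cpp files in batches of 5 to amortize compiler
            // startup cost; the engine parallelizes the batches.
            for (int i = 0; i < stale.size(); i += 5) {
                List<File> batch = stale.subList(i, Math.min(i + 5, stale.size()));
                engine.addTask(() -> compileBatch(batch));
            }
        }

        // Placeholder: a real implementation compares against prior state.
        private List<File> findOutOfDate(List<File> files) {
            return new ArrayList<>(files);
        }

        private void compileBatch(List<File> batch) {
            /* one compiler invocation for the whole batch */
        }
    }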
While implementing some common tasks (such as compile cpp, link obj
files, compile java files, make jar, make zip, unzip, download a file,
etc.), I have been noticing commonalities between tasks. I still don't
see a good way to factor some of them out, but I'm trying. Still, the
logic, the hard implementation stuff, is centralized in a small set of
tasks instead of spread out across all the build scripts of a project.
Let me emphasize that with this system, when the macros are correctly
implemented, it will be quite difficult for a developer to break
incremental correctness short of outright maliciousness. Examples of
such maliciousness include: modifying the state file saved between runs
of the build system; adding malicious code to automated tests; setting
the timestamps of source files into the past (a rather notable one, as
this can happen via an unzip tool or a sync from a source control
system); and modifying the timestamps of output files at all (though
these are hidden away in a parallel folder structure, so they are less
prone to accidents).