Meta-C question about header order


Tim Rentsch

Jens Schweikhardt said:
consider a small project of 100 C source and 100 header files.
The coding rules require that only C files include headers,
headers are not allowed to include other headers (not sure if
this is 100% the "IWYU - Include What You Use" paradigm).

The headers contain only what headers should contain:
prototypes, typedefs, declarations, macro definitions.

The problem: given a set of headers, determine a sequence of
#include directives that avoids syntax errors due to undeclared
identifiers. I.e. if "foo.h" declares type foo_t and "bar.h" uses
foo_t in a prototype, "foo.h" must be included before "bar.h".

I have been reading the many comments in this thread with some
interest. Reading through your responses, I have come up with
this summary of motivations for using this approach (these are
my paraphrasings, often not quotes of the originals):

+ no lint warnings about repeated headers
+ no need for include guards
+ doxygen dependency graph much simpler
+ no cycles in include graph
+ removing unneeded includes is easier
+ simpler compiler diagnostics
+ easier to generate dependency makefile
+ improved identifiability of refactoring opportunities
+ ... and of interface accumulation [not sure what this means]
+ ... and of code collecting fat
+ constant reminders of all dependencies of each .c file

Some questions:

1. Is this an accurate summary?

2. Has anything been left out (ie, is there any other
positive you would add to the list)?

3. Would you mind listing these from most important
to least important, and giving some indication of
relative weight for each item?
I'm not a computer scientist, but it sounds as if this requires a
topological sort of all the '"foo.h" needs "bar.h"' relations.
Now the interesting part is: how to automate this, i.e. how to
determine "this header declares identifiers A, B, C and requires
X, Y, Z"? [IWYU by google was looked at but seems not a good
solution]

Yes, a topological sort. The topological sort is the easy
part - the harder part is identifying what the first-level
dependencies are.
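The easy half really is a one-liner in the Unix toolbox. A minimal sketch, assuming the "needed-by" pairs have already been extracted somehow (the file name pairs.txt and the header names are illustrative):

```shell
# pairs.txt holds one "prerequisite dependent" pair per line: the first
# header declares something the second one uses, so it must come first.
cat > pairs.txt <<'EOF'
foo.h bar.h
bar.h baz.h
EOF

# tsort(1) emits one valid include order, or complains if there is a cycle.
tsort pairs.txt
```

For this chain the output is foo.h, bar.h, baz.h; a cycle in the pairs shows up as a tsort diagnostic naming the headers involved.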
Can you think of a lightweight way to solve this? Maybe using
perl, the unix tool box, make, gmake, gcc, a C lexer?

I may have some suggestions here, but first I would like to read
through responses to the questions asked above, to make sure I'm
going in a good direction.

Jens Schweikhardt

in <[email protected]>:
#
#> consider a small project of 100 C source and 100 header files.
#> The coding rules require that only C files include headers,
#> headers are not allowed to include other headers (not sure if
#> this is 100% the "IWYU - Include What You Use" paradigm).
#>
#> The headers contain only what headers should contain:
#> prototypes, typedefs, declarations, macro definitions.
#>
#> The problem: given a set of headers, determine a sequence of
#> #include directives that avoids syntax errors due to undeclared
#> identifiers. I.e. if "foo.h" declares type foo_t and "bar.h" uses
#> foo_t in a prototype, "foo.h" must be included before "bar.h".
#
# I have been reading the many comments in this thread with some
# interest. Reading through your responses, I have come up with
# this summary of motivations for using this approach (these are
# my paraphrasings, often not quotes of the originals):
#
# + no lint warnings about repeated headers
# + no need for include guards
# + doxygen dependency graph much simpler
# + no cycles in include graph
# + removing unneeded includes is easier
# + simpler compiler diagnostics
# + easier to generate dependency makefile
# + improved identifiability of refactoring opportunities
# + ... and of interface accumulation [not sure what this means]
# + ... and of code collecting fat
# + constant reminders of all dependencies of each .c file

Thanks Tim, for taking the time. To expand on interface accumulation:
it's the process where interface A needs another, which then grows the
need for yet another, until eventually it pulls in half the total number
of interfaces in the project.

Let's face it: programmers are lazy, and it's too easy in C to blow up an
initially small interface design by writing another #include in the
first header that looks like it's included by "most" of the files where
it is needed, and including it directly where it's not. How many projects
have you seen with project_types.h, misc.h, macros.h, and similar headers
invented on the spot?


# Some questions:
#
# 1. Is this an accurate summary?
#
# 2. Has anything been left out (ie, is there any other
# positive you would add to the list)?

+ Reduced processing time for all the tools that operate on
C source. That's the compiler of course, but also lint,
auto dependency generators, static checkers, doxygen, ...
For each translation unit, the headers are tokenized and
parsed at most once (not at all when inside a disabled #ifdef).
I observe about a 20% reduction in our project.

+ Giving developers a hard and fast, unambiguous rule about which file the
include directives go in. There is only one choice. If foo.c needs the
bar_t declaration from bar.h, foo.c includes it. Contrast this with
"traditional wisdom", where possibly a large number of headers would be
candidates for the new #include directive. A good design would make this
choice obvious, and a bright developer would know the state of the art,
but that's a rare trait. "Indented six feet down and covered with dirt" is
the reality out there. Yes, this requires selecting the proper *line*
among the includes. But *any* compiler will tell you in no uncertain terms
if that was the wrong line or you're missing another header. It's
foolproof.

# 3. Would you mind listing these from most important
# to least important, and giving some indication of
# relative weight for each item?

+ improved identifiability of refactoring opportunities
$ grep -c '#include "foo.h"' */*.c
Whoa! foo.h is included by 95% of files, why?
Whoa! foo.h is included by one file only. Maybe incorporate it.
Hmm. All files that include foo.h also require bar.h, baz.h and blurb.h.
Could I encapsulate this better? Maybe merge some headers?
(50 points)
+ ... and of interface accumulation
$ grep -c '#include' */*.c
Whoa! big.c includes everything and the kitchen sink. What's up?
(30 points)
+ ... and of code collecting fat - optional debug code in #ifdef maze.
Should be moved out to separate object files, linked in when needed.
(20 points)
+ doxygen dependency graph much simpler. It's a document for
the customer.
(20 points)
+ removing unneeded includes is easier
(20 points)
+ constant reminders of all dependencies of each .c file
(10 points)
+ no cycles in include graph
(10 points)
+ giving developers a hard and fast rule about which file the include goes in.
(10 points)
+ reduced processing time by all the tools that operate on source
(10 points)
+ no lint warnings about repeated headers
(10 points)
+ easier to generate dependency makefile
(7 points)
+ no need for include guards
(5 points)
+ simpler compiler diagnostics
(5 points)

The overall goal is to make emerging complexity stand out the moment it
emerges, opening developers' eyes. The reality in any random project is
that not all developers are stellar C programmers (the set of participants
in this newsgroup, then and now, looks like an accurate statistical
sample: from Tanmoy to Bill...).

Unfortunately, the C preprocessor is a deceptive tool (apologies to dmr,
may his soul rest in peace; I know why it was needed back then) and it
gets frequently abused. Taming it is probably what I'm after. The only
reason cpp has survived is the include guard kluge.
Making interfaces stand out, both in number and circumference, should
help, I hope.

[...]
#> Can you think of a lightweight way to solve this? Maybe using
#> perl, the unix tool box, make, gmake, gcc, a C lexer?
#
# I may have some suggestions here, but first I would like to read
# through responses to the questions asked above, to make sure I'm
# going in a good direction.

This is certainly incomplete:
One would need to find the identifiers of macro definitions (easy)
and typedefs (harder). In prototypes one must distinguish between
types and optional parameter names.
In other declarations one needs to determine the declared identifier.
This is a little more involved for enums and aggregates.
Build the "needed by" pairs and pipe to tsort(1). Voilà!
Version 7 came with all the goodies built in, didn't it?
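A rough cut of that pipeline can be sketched with sed, grep and tsort. This is a heuristic, not a C lexer: it only catches object-like macros and one-line typedefs (none of the harder enum/aggregate cases above), and the two demo headers are invented for illustration:

```shell
# Create two toy headers to run the extractor against.
mkdir -p hdrdemo && cd hdrdemo
printf 'typedef int foo_t;\n' > foo.h
printf 'foo_t bar(void);\n'   > bar.h

# For every header: collect the identifiers it defines (macro names and
# one-line typedef names), then emit a "provider user" pair for every
# other header that mentions one of those identifiers.
for h in *.h; do
    defs=$( { sed -n 's/^#define[[:space:]][[:space:]]*\([A-Za-z_][A-Za-z0-9_]*\).*/\1/p' "$h"
              sed -n 's/^typedef.*[^A-Za-z0-9_]\([A-Za-z_][A-Za-z0-9_]*\);$/\1/p' "$h"; } )
    for id in $defs; do
        grep -lw "$id" *.h | while read -r user; do
            [ "$user" != "$h" ] && printf '%s %s\n' "$h" "$user"
        done
    done
done | sort -u > pairs.txt

tsort pairs.txt    # the include order: foo.h before bar.h
```

A real version would need a proper tokenizer to cope with multi-line declarations, function-like macros, and identifiers appearing in comments or strings; grep -w is a stand-in for that.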


Regards,

Jens

Jens Schweikhardt

in <[email protected]>:
# On 4/19/14, 4:25 AM, Tim Rentsch wrote:
#> I have been reading the many comments in this thread with some
#> interest. Reading through your responses, I have come up with
#> this summary of motivations for using this approach (these are
#> my paraphrasings, often not quotes of the originals):
#>
#> + no lint warnings about repeated headers
#
#> + no need for include guards
#
#> + doxygen dependency graph much simpler
# I am not sure I would call a "flat" graph simpler. It obscures the
# difference between what you actually reference and what has been pulled
# in due to implementation detail of something you reference.

You can't see this in a graph of 20 nodes with 40 edges either.
A graph with a root and 20 leaves, on the other hand, I can understand.

#> + no cycles in include graph
# Neither method will generate cycles. What becomes a "cycle" in nested
# includes becomes a broken topological order (a.h must be before b.h,
# but b.h also must be before a.h).

By cycles I mean closed loops of edges through one or more nodes.

#> + removing unneeded includes is easier
# I disagree here. With headers including what they need, you need to
# look at only one file to see what is needed, so if you make a change, you
# have less to look at to see what is no longer needed. With all
# dependencies, both direct and indirect, expressed in the .c file, to
# identify that a header is no longer needed you need to look at every
# header file included after it to confirm. Also, if you make a change in
# a header, you need to know every client for that header, to know where
# you need to make the changes.

Well, I use a tool (Gimpel FlexeLint) to tell me which headers are
not needed. That is at least simple for me. However, FlexeLint can
tell this only for a C file, not for headers.

#> + simpler compiler diagnostics
# Yes, you get simpler diagnostics, but a lot more of them (or a lot more
# work to avoid them). If headers include what they need, you don't need
# to look at the include trace; if foo.h has an error because it is missing
# a dependency, it doesn't matter how it got included, it needs to resolve
# it, once. In the .c includes everything case, every file that included
# that header will need to have the needed dependency fixed.
#
#> + easier to generate dependency makefile
# Yes, if you are manually generating makefiles.
#
#> + improved identifiability of refactoring opportunities
# I disagree. Since you have lost all the real dependency information, you
# have lost the hints that help you see the refactoring.

See my answer to Tim Rentsch for details of what I expect to find and how.

....
# Ultimately, it is the programmer putting effort in to make the computer
# jobs easier. The biggest advantage it gains is that for a naive
# compiler, that doesn't recognize include guards, it will reparse
# multiply included files (looking for the #endif). This can be fixed in
# the most used headers by making the include itself conditional testing
# the include guard.

Which nobody does. Having the include guard *in* the included header
instead of some intelligence in the preprocessor is the ugly kluge.
Have you ever seen

#ifndef PROJECT_TYPES_H
#include "project_types.h"
#endif
...repeat for N other headers...

out in the wild? I haven't. So the common wisdom accepts endless
rereading and retokenization of the same headers in the header jungle.
I question the status quo in search of a paradigm shift. Big words :)

# The main purpose we use computers is that they can do work much faster
# than us, and their job is to take away the mechanical operations so we
# can focus on the creative. This "rule" puts back on the programmer a lot
# of mechanical operations that belong really to the computer.

I believe this burden is quite lightweight. You get the header sequence
right once, and that's basically it. I provide a reference in a dummy C
file that includes all headers and has an empty main(). Look up where your
header appears in the reference sequence--no more guessing or compiler
errors.
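Such a reference file can be generated mechanically from the sorted header list; a small sketch, where order.txt and the header names are illustrative stand-ins for the tsort output:

```shell
# Stand-in for the tsort(1) output: the blessed header order, one per line.
printf 'foo.h\nbar.h\n' > order.txt

# Emit the reference translation unit: every header in order, then an
# empty main(). If this file compiles, the sequence is known good, and
# developers copy the subset they need, keeping the relative order.
printf '#include "%s"\n' $(cat order.txt) > all_headers.c
printf 'int main(void) { return 0; }\n' >> all_headers.c

cat all_headers.c
```

Regenerating this file whenever a header changes (and compiling it in CI) keeps the reference sequence honest.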

Regards,

Jens

James Kuyper

On 04/23/2014 03:47 PM, Jens Schweikhardt wrote:
....
Well, I use a tool (Gimpel FlexeLint) to tell me which headers are
not needed. That is at least simple for me. However, FlexeLint can
tell this only for a C file, not for headers.

test_header.c:
#include "header.h"
int dummy = 0; /* silences some compiler warnings */
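Generating such a per-header driver for every header is easily scripted; a sketch, with a stand-in header.h created purely for illustration:

```shell
# Stand-in header so the loop has something to chew on.
printf 'typedef int header_t;\n' > header.h

# For each header, emit a one-header translation unit that the compiler
# or a linter can check in isolation.
for h in *.h; do
    stub="test_${h%.h}.c"
    {
        printf '#include "%s"\n' "$h"
        printf 'int dummy = 0; /* silences empty-translation-unit warnings */\n'
    } > "$stub"
done
ls test_*.c
```

Feeding each generated stub to the lint tool then reports, per header, which includes it would need or no longer needs.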

Tim Rentsch

Richard Damon said:
Jens Schweikhardt <[email protected]> writes:
I have been reading the many comments in this thread with some
interest. Reading through your responses, I have come up with
this summary of motivations for using this approach (these are
my paraphrasings, often not quotes of the originals):

+ no lint warnings about repeated headers

+ no need for include guards

+ doxygen dependency graph much simpler

[..several point by point responses..]

I think you may have misunderstood my intentions there. I wasn't
trying to agree with his points, just restating them to make sure
I understood his position and didn't leave anything out. It was
useful to see his reply, much more so, I think, than if I had started
by arguing against the proposed scheme.

Tim Rentsch

Jens Schweikhardt said:
in <[email protected]>:
#
#> consider a small project of 100 C source and 100 header files.
#> The coding rules require that only C files include headers,
#> headers are not allowed to include other headers (not sure if
#> this is 100% the "IWYU - Include What You Use" paradigm).
#>
#> The headers contain only what headers should contain:
#> prototypes, typedefs, declarations, macro definitions.
#>
#> The problem: given a set of headers, determine a sequence of
#> #include directives that avoids syntax errors due to undeclared
#> identifiers. I.e. if "foo.h" declares type foo_t and "bar.h" uses
#> foo_t in a prototype, "foo.h" must be included before "bar.h".
#
# I have been reading the many comments in this thread with some
# interest. Reading through your responses, I have come up with
# this summary of motivations for using this approach (these are
# my paraphrasings, often not quotes of the originals):
#
# + no lint warnings about repeated headers
# + no need for include guards
# + doxygen dependency graph much simpler
# + no cycles in include graph
# + removing unneeded includes is easier
# + simpler compiler diagnostics
# + easier to generate dependency makefile
# + improved identifiability of refactoring opportunities
# + ... and of interface accumulation [not sure what this means]
# + ... and of code collecting fat
# + constant reminders of all dependencies of each .c file

Thanks Tim, for taking the time. [snip]

Thank you for the extended reply. Reading through it, I
don't find any of your arguments convincing. In almost all
cases they either mischaracterize one of the two positions
or make use of a non-logical inference. It isn't necessary
to use the scheme you propose to get the benefits you say
are important, and it's significantly more work for developers,
starting with having to build a tool that will produce the
necessary include ordering. By contrast, following the more usual
rule that include files will #include any other header directly
necessary for themselves, I did a little scripting in my regular
development environment to produce a list of include files used
by each .c file. It took 10 or 15 minutes. The arguments you
give just don't make your case.
 
