(My post is pretty much "an aside" to the highly technical nature of the
thread. How should I actually post in that way? Putting an "Aside:" in front
of the "Re:"?).
On Sun, 27 May 2007 08:11:04 -0700, Sam of California wrote:
Is it accurate to say that "the preprocessor is just a pass in the
parsing of the source file"?
I responded to that comment by saying that the preprocessor is not
just a pass.
How can a processor be a pass? It is something which performs a pass,
at most.
Informally, I use the terms "preprocessing" or "preprocessing phase"
to identify --roughly-- the sequence of what the ISO standard defines
as phase 3 and phase 4 of translation -- but only when there isn't any
need to be more precise (it is also the name of one of the directories
in the Breeze source tree, for instance). In any case, the standard
doesn't use the term "preprocessor", nor "preprocessing" as a
standalone noun. It does use expressions such as "preprocessing
directive", though, which may make the personal terminology choice
explained above somewhat reasonable; it arose exactly because I didn't
like to use the term "preprocessor".
To quote the standard (§2.1/7): "[In phase 7] Each preprocessing
token is converted into a token." I've always understood
everything which preceded this (i.e. phases 1-6) to be
"preprocessing", and the "preprocessor" whatever does the
"preprocessing".
Yes, that's another possibility. As I said, neither term is defined
by the standard, and everything is quite vague. People use the terms
quite informally.
"There's also the historical context to be considered. In
Johnson's pcc, the preprocessor was a separate pass, before the
compiler front-end pass. Roughly speaking, the preprocessor
read your code, broke it up into preprocessing tokens, did what
it did, and then spit out text. The front end then read this
text, and broke it up into language tokens, and parsed it. Line
breaks had significance in the pre-processor, but not in the
front-end."
I'm all for "the good ol' days" if it makes compiler system construction
easier. Do you high-end engineers feel that only the current "state of the
art" machinery is worthy of building upon? (If you know me yet, you know
what I think: that if it's really complex, then it's not foundational).
"Based on this, it seems logical to make the break at the point
where preprocessor tokens are converted into language tokens,
and all white space (including new-lines) ceases to have any
significance."
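To make that break concrete, here is a small C++ sketch (mine, not from the thread; the names LIMIT, PASTE and add_limit are made up for illustration). The directive lines end at their new-line, while the function definition is deliberately spread over many lines to show that, once preprocessing tokens have become tokens, line breaks only separate tokens.

```cpp
#include <cassert>

// A preprocessing directive is terminated by the new-line that follows it,
// so here the line break is significant.
#define LIMIT 10

// ## pastes two preprocessing tokens into one; this happens during
// preprocessing, before the result is converted into a language token.
#define PASTE(a, b) a##b

// From here on the line breaks are ordinary white space: this is one
// function definition, named add_limit after the paste.
int
PASTE(add_, limit)
(
    int x
)
{
    return x
        +
        LIMIT
        ;
}

// Small self-check so the sketch can be exercised from any caller.
void check_add_limit() {
    assert(add_limit(5) == 15);
}
```

Calling check_add_limit() from a main of your own should pass silently on any conforming compiler.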
See, now that's something I could grok if I wanted to learn about compiler
construction and wanted to develop a compiler. (I hope people still want to
build "simple" compilers, because I surely don't want to do it!)
All "remote references" aside, aren't things like "optimizing compilers" for
scientific computing and the like only now? I mean, I want to build my
program with multiple threads (!) (yes, and with C++!). Pretty risky once
you turn on the optimizations huh?
<Thoughts about "saving the preprocessor's life".... no wait, "giving it its
life back!", omitted>.
Speaking of terminology and personal preferences, I've always felt
that the standard could have given a name to the phases, rather than
just numbering them.
"The sole role of the phases in the standard is to define the
order in which the different actions take place. Numbers are
very good for defining order. Everyone knows that 1 comes
before 2, but it must be explicitly stated that character
mapping comes before line splicing."
Isn't that a programmer's dream (!): a sequential list of things to program.
(I admit, a bit boring, but fine work when the brain is only at half
capacity).
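As a rough illustration of why the order has to be stated, here is a toy C++ model of just the first two phases. It is a sketch under loud assumptions: map_characters handles only one trigraph (question-mark, question-mark, slash, which the standard of the time mapped to a backslash in phase 1; trigraphs were later removed in C++17), and the function names are invented here, not taken from any compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace every occurrence of `from` in `text` with `to`.
std::string replace_all(std::string text, const std::string& from,
                        const std::string& to) {
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}

// Toy phase 1: map a single trigraph to a backslash. The trigraph is
// spelled with escaped question marks so that a compiler which still
// honours trigraphs does not rewrite this literal itself.
std::string map_characters(const std::string& src) {
    return replace_all(src, "\?\?/", "\\");
}

// Toy phase 2: delete each backslash that is immediately followed by a
// new-line, joining the two physical lines into one logical line.
std::string splice_lines(const std::string& src) {
    return replace_all(src, "\\\n", "");
}

// Self-check: the two orders give different results on the same input.
void check_phase_order() {
    const std::string src = "in\?\?/\nt x;";
    assert(splice_lines(map_characters(src)) == "int x;");
    assert(map_characters(splice_lines(src)) == "in\\\nt x;");
}
```

Run in the stated order (1 then 2), the input collapses to "int x;". Run the other way round, the splice finds nothing to do and a backslash followed by a new-line is left in the text, which is exactly the kind of ambiguity the numbering rules out.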
In that case, I'd see something like (off the top
of my head: don't focus too much on the names):
Character Mapping (1)
Line Splicing (2)
Pre-tokenization (3)
Preprocessing (4)
Execution Character Set Mapping (5)
Literal Concatenation (6)
Tokenization (7a)
Syntactical and Semantic Analysis (7b)
Translation (7c)
Instantiation (8)
Linking (9)
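A few of the phases in that list can be observed directly from ordinary code. The following is only a sketch, with made-up names (TWO_PART_SUM, STRINGIZE and the three small functions), showing line splicing (2), macro handling during preprocessing (4) and literal concatenation (6) on a real compiler.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Phase 2 (line splicing): the backslash and the new-line after it are
// deleted, so this definition is a single logical line by phase 4.
#define TWO_PART_SUM (1 + \
                      2)

// Phase 4 (preprocessing): # turns the argument's preprocessing tokens
// into a string literal; runs of white space between them become one space.
#define STRINGIZE(x) #x

// Uses the spliced macro; the expansion is (1 + 2).
int spliced_sum() { return TWO_PART_SUM; }

// Yields "a + b", whatever the spacing in the macro argument was.
std::string stringized() { return STRINGIZE(a   +   b); }

// Phase 6 (literal concatenation): adjacent string literals are joined
// into one before tokens reach syntactic analysis.
const char* joined_literal() { return "pre" "processing"; }

// Self-check for the three observations above.
void check_phases() {
    assert(spliced_sum() == 3);
    assert(stringized() == "a + b");
    assert(std::strcmp(joined_literal(), "preprocessing") == 0);
}
```

None of this says where a given compiler draws its internal pass boundaries; it only shows effects that the numbered phases guarantee.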
Damn, I'm learning too much about this stuff now and feel like I'm "going
backwards" again! (No worries though, I'm never going to write a
compiler!)
To sum it up, the original question is simply ill-posed. The
translation of a C++ program conceptually happens in phases, as
described in the standard. One may decide to call preprocessing some
specific sub-sequence, and compilation some other, but there's no such
official terminology. To some, a compiler is what performs phases
(7a) through (8), inclusive, and a linker is what performs (9). Others
mean by "compiler", or "translator", the executor of the whole translation.
"The traditional break (in C) has been: preprocessor: phases 1
through 6, compiler: phase 7, linker: phase 9. (Phase 8 is
concerned with instantiating templates, and doesn't have a place
in traditional C.) If you're talking about passes, however,
most compilers today will use a single pass for everything
through your 7b, above, then up to three passes for 7c, and
linking remains separate. Where phase 8 fits in varies, but I
suspect that a lot of modern compilers cram it into the first
pass as well."
And if one wanted to get that kind of info formally, where would someone get
that? Certainly not the dragon compiler book (?). (Or would "one" just hire
you or your company to use that knowledge?)
John