preprocessor tokenization whitespace?


Walter Roberson

I have run into a peculiarity with SGI's C compiler (7.3.1.2m). I have been
reading carefully over the ANSI X3.159-1989 specification, but I cannot
seem to find a justification for the behaviour. Could someone point
me to the appropriate section, or else confirm the behaviour as a bug?

For a particular project, I am using the C preprocessor phase only.
I am not using the standalone program 'cpp' because proper functioning
of my project depends upon being able to splice preprocessor tokens,
which is not supported in the standalone 'cpp'.

I am having the compiler stop after preprocessing by using SGI's C -P
option:

-P Runs only the preprocessor and puts the result for each source
file in a corresponding .i file. The .i file has no inline
directives in it.

It should be noted that my source is *not* C code -- I am using the
preprocessor to generate data files based upon templates.

The point I am having trouble with can be illustrated fairly simply,
by running these lines through the preprocessing phase:

#define eye L@@K
I eye

$ cpp -P look.c
I L@@K

That's with the standalone cpp program, and is the output I expect. But,

$ cc -P look.c
$ cat look.i
I L@ @K

And

$ cat look2.c
I L@@K
$ cc -P look2.c
$ cat look2.i
I L@@K

In short, certain combinations of symbols, when macro-replaced into
source, get separated by single space characters. Not every combination
is so treated: -~ and ~$ are left alone, for example. It is not
operator based, as it happens especially for ` and @ and $ .

The work around I have found is:

$ cat look3.c
#define eye L@##@K
I eye
$ cc -P look3.c
$ cat look3.i
I L@@K


The closest I have found to the whitespace-introducing behaviour is
the ANSI description of translation phases, 2.1.1.2, for phase 3:

3. The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or comment.
Each comment is replaced by one space character. New-line characters
are retained. Whether each nonempty sequence of white-space characters
other than new-line is retained or replaced by one space character
is implementation-defined.

Okay, so there's implementation-defined behaviour for a *nonempty* sequence
of white-space characters, but L@@K has only the -empty- sequence
between the two @.

I see nothing in the discussion of macro replacement that would
lead to spaces being introduced {other than the behaviour of # in
function-like macro replacements.}

The only excuse I can think of is that as ` and @ and $ are not
C operators, that outside of character strings and character literals
they are perhaps not considered to be valid preprocessor tokens,
in which case the behaviour would become undefined ?
 

Alex Fraser

[snip: using a C preprocessor on non-C files gives unexpected results]
I see nothing in the discussion of macro replacement that would
lead to spaces being introduced {other than the behaviour of # in
function-like macro replacements.}

The only excuse I can think of is that as ` and @ and $ are not
C operators, that outside of character strings and character literals
they are perhaps not considered to be valid preprocessor tokens,
in which case the behaviour would become undefined ?

Sounds extremely likely; I'm not going to check. The solution is to use
something other than a C preprocessor, e.g. m4.

Alex
 

CBFalconer

Walter said:
.... snip ...

It should be noted that my source is *not* C code -- I am using
the preprocessor to generate data files based upon templates.

The point I am having trouble with can be illustrated fairly
simply, by running these lines through the preprocessing phase:
.... snip ...

And thus is off-topic for c.l.c. You need to find a group that
deals with your particular compiler. F'ups set.
 

Eric Sosman

Walter said:
I have run into a peculiarity with SGI's C compiler (7.3.1.2m). I have been
reading carefully over the ANSI X3.159-1989 specification, but I cannot
seem to find a justification for the behaviour. Could someone point
me to the appropriate section, or else confirm the behaviour as a bug?
[preprocessor output isn't as expected]

No bug, as far as the C Standard is concerned. Check
with SGI to see whether it's a bug from their perspective.

First problem: The Standard doesn't promise that the
preprocessor will produce any kind of output at all; as far
as the Standard is concerned the preprocessor is merely
"translation phase 4." (You don't expect access to the
output of phase 2 or phase 5; what's special about 4?) If
phase 4 produces any incidental output, the Standard doesn't
specify what it should look like.

Second problem: The Standard describes what a translator
(of which the preprocessor is a part) must do with C source
code, but the only requirement on what it does with non-C is
that some kinds of aberrations require a diagnostic. You're
trying to (ab?)use the preprocessor as a general-purpose
macro machine, which is a bit like driving nails with a
crescent wrench: You may be able to do it, sort of, but if
things don't work out it's not the wrench's fault.

Third problem: By the time phase 4 operates most of the
source text of the program has disappeared. Phases 1 through 3
transform the source into "preprocessing tokens" and white
space which phase 4 then shuffles around; phase 4 manipulates
tokens, not text. (The distinction is usually blurred, but its
effects can be seen here and there: consider the non-recursive
nature of macro expansion, for example.) The consequence is that
if phase 4 produces output what it must actually do is generate
a textual approximation of the internal token sequence. There
was a thread some time ago involving C source that meant one
thing if fed to a translator but something entirely different
if preprocessed first and then fed into the translator (alas,
I can't recall the details; perhaps you can find the thread on
Google). Sometimes the preprocessor cannot turn hamburger back
into cow.
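
To make the parenthetical about non-recursive macro expansion concrete, here is a minimal sketch in C. The function name value_plus_one is made up for illustration. A macro whose replacement list mentions its own name is not expanded again during rescanning; that falls out naturally if phase 4 works on tokens, whereas a naive textual substitution that rescanned its own output would never terminate.

```c
/* value_plus_one is a hypothetical name used only for this sketch. */
int value_plus_one(void)
{
    int value = 41;

/* The replacement list mentions the macro's own name. While the
   replacement is rescanned, the inner `value` is not replaced again,
   so it remains an ordinary identifier naming the variable above. */
#define value (value + 1)

    int result = value;   /* token sequence: ( value + 1 ) */

#undef value
    return result;
}
```

Calling value_plus_one() from any test program should return 42, since the macro is expanded exactly once.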

It seems to me you're (mostly) running afoul of the first
two issues, with the third a looming but distant threat. What
to do? Well, it seems that your C implementation (like many)
allows you to run phases 1-4 separately from the rest of the
translator, and when you do so you get the output you want;
it's only when you run the entire translator (with a special
switch) that the output is unsatisfactory. Well then, why
don't you just use the variant that happens to give what you
want? Alternatively, use a full-fledged macro processor (m4
is often mentioned; I've never used it myself) instead of
trying to get the C translator to do something it wasn't really
designed for.
 

Lawrence Kirby

I have run into a peculiarity with SGI's C compiler (7.3.1.2m). I have been
reading carefully over the ANSI X3.159-1989 specification, but I cannot
seem to find a justification for the behaviour. Could someone point
me to the appropriate section, or else confirm the behaviour as a bug?

For a particular project, I am using the C preprocessor phase only.

The C standard does not define the output of the preprocessor as a text
stream. It is not possible to validate such a text stream for correctness
against the standard. Such a text output is a *representation* of a
sequence of tokens and white-space. Since there is no formatting
specification for this representation different compilers can and do
produce different output.

I am not using the standalone program 'cpp' because proper functioning
of my project depends upon being able to splice preprocessor tokens,
which is not supported in the standalone 'cpp'.

I am having the compiler stop after preprocessing by using SGI's C -P
option:

-P Runs only the preprocessor and puts the result for each source
file in a corresponding .i file. The .i file has no inline
directives in it.

It should be noted that my source is *not* C code -- I am using the
preprocessor to generate data files based upon templates.

That's the basic problem, the C preprocessor isn't a general macro
language, it is specifically for C and can make assumptions based on
knowledge of the language.

The point I am having trouble with can be illustrated fairly simply, by
running these lines through the preprocessing phase:

#define eye L@@K
I eye

$ cpp -P look.c
I L@@K

That's with the standalone cpp program, and is the output I expect. But,

$ cc -P look.c
$ cat look.i
I L@ @K

And

$ cat look2.c
I L@@K
$ cc -P look2.c
$ cat look2.i
I L@@K

In short, certain combinations of symbols, when macro-replaced into
source, get separated by single space characters. Not every combination
is so treated: -~ and ~$ are left alone, for example. It is not operator
based, as it happens especially for ` and @ and $ .

The work around I have found is:

$ cat look3.c
#define eye L@##@K
I eye
$ cc -P look3.c
$ cat look3.i
I L@@K

It looks like in some cases the preprocessor inserts spaces where it
considers it would otherwise be unclear where the token boundaries are. It
is quite reasonable for it to do this, indeed it may have to if the
compiler is capable of taking this text output and completing the
compilation process on it.
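
A small sketch of the kind of token-boundary ambiguity described above, using only valid C. NEGATE and double_negate are hypothetical names, and the missing parentheses around v are deliberate. After macro replacement the token sequence is two separate unary minus operators; if a preprocessor wrote those out as text with no space between them, a fresh tokenizer would read a single decrement operator instead.

```c
/* NEGATE and double_negate are hypothetical names for illustration. */
#define NEGATE(v) -v

int double_negate(int x)
{
    /* Token sequence after replacement: '-' '-' 'x', that is, two
       unary minus operators applied in turn. Printed as text without
       a separating space this would read "--x", which retokenizes as
       a decrement of x, so a textual rendering needs "- -x". */
    return -NEGATE(x);
}
```

With this definition double_negate(5) yields 5 and leaves its argument unmodified, which is only true because the two minus signs stay separate tokens.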
The closest I have found to the whitespace-introducing behaviour is the
ANSI description of translation phases, 2.1.1.2, for phase 3:

3. The source file is decomposed into preprocessing tokens and
sequences of white-space characters (including comments). A source
file shall not end in a partial preprocessing token or comment. Each
comment is replaced by one space character. New-line characters are
retained. Whether each nonempty sequence of white-space characters
other than new-line is retained or replaced by one space character is
implementation-defined.

The output of the "preprocessor" is the input to translation phase 7 which
says:

"White-space characters separating tokens are no longer significant"

So adding white-space between tokens is not a problem for the translation
process.

Okay, so there's implementation-defined behaviour for a *nonempty* sequence of
white-space characters, but L@@K has only the -empty- sequence between
the two @.

I see nothing in the discussion of macro replacement that would lead to
spaces being introduced {other than the behaviour of # in function-like
macro replacements.}

I grant you that the behaviour difference between L@@K being part of a
macro replacement and not is odd, but there's nothing that you can say is
wrong with the output in either case.

The only excuse I can think of is that as ` and @ and $ are not C
operators, that outside of character strings and character literals they
are perhaps not considered to be valid preprocessor tokens, in which
case the behaviour would become undefined ?

They aren't even required to exist in the character set, which makes their
use, even in character constants and string literals, non-portable.

In the grammar for a preprocessing-token there is

preprocessing-token:
...
each non-white-space character that cannot be one of the above

They would cause a constraint violation when the pp-token is converted to
a token in translation phase 7.

Lawrence
 

Walter Roberson

:> It should be noted that my source is *not* C code -- I am using
:> the preprocessor to generate data files based upon templates.

:> The point I am having trouble with can be illustrated fairly
:> simply, by running these lines through the preprocessing phase:

:And thus is off-topic for c.l.c. You need to find a group that
:deals with your particular compiler. F'ups set.

Was the question not one pertaining to the details of translation
phases in ANSI C? I pointed to particular clauses in the standard,
acknowledged that I did not know them thoroughly, and asked for
assistance from those who understand them better; I even included
simple ways to reproduce the behaviour. Why then was c.l.c
not a suitable place to have asked?


If, hypothetically, this were comp.dcom.ethernet and I were to ask a
question that involved the detailed specifications of Cat5e wiring,
which I was [e.g.] interested in using to transmit a digital signal
that did not happen to meet the ethernet frame format, then would you
have said "Wrong newsgroup, you will have to find one that deals with
the manufacturer of your particular brand of cable!", even though the
question was squarely one about what digital signal propagation
characteristics that one could expect with -any- cable rated as Cat5e?
 

Walter Roberson

: First problem: The Standard doesn't promise that the
:preprocessor will produce any kind of output at all;

That's a good point.

: Second problem: The Standard describes what a translator
:(of which the preprocessor is a part) must do with C source
:code, but the only requirement on what it does with non-C is
:that some kinds of aberrations require a diagnostic.

Hmmm, I think I would have to disagree with that point. The
standard describes very particular steps about what is required,
legal or invalid when the preprocessor is used. The standard makes
it clear that semantic analysis does not occur until phase 7,
so by the end of phase 4, the internal representation of the
source must not have undergone any changes that are dependent
upon the semantics of C, other than the precisely defined changes
about splicing lines together, replacement of comments with a single
blank, detection of character literals and string boundaries, and
so on as set out in phases 1-4.

: You're
:trying to (ab?)use the preprocessor as a general-purpose
:macro machine, which is a bit like driving nails with a
:crescent wrench: You may be able to do it, sort of, but if
:things don't work out it's not the wrench's fault.

A closer analogy, I would say, would be trying to use a
Robertson screwdriver with a Phillips screw in a situation that
depended upon the details of the physics of Phillips screws.


: Third problem: By the time phase 4 operates most of the
:source text of the program has disappeared. Phases 1 through 3
:transform the source into "preprocessing tokens" and white
:space which phase 4 then shuffles around; phase 4 manipulates
:tokens, not text.

True in one respect, but not true in another: ANSI goes to
a lot of trouble to detail that certain preprocessor operations
involve not the token itself but the "spelling" of the token,
so the preprocessor must carry around the original [whitespace-
squished] text even if (as is likely) it creates an internal
data structure that ascribes some kind of meaning to the text
sequences it is carrying around.
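
As a rough illustration of the "spelling" point, the # operator in a function-like macro turns the spelling of its argument's tokens into a string literal, with each run of white space between tokens reduced to one space and leading and trailing white space dropped. The names STR and spelling_checks_passed below are hypothetical, chosen only for this sketch.

```c
#include <string.h>

/* STR and spelling_checks_passed are hypothetical names. */
#define STR(x) #x

int spelling_checks_passed(void)
{
    int ok = 0;

    /* Runs of white space between tokens collapse to one space. */
    ok += strcmp(STR(a   +   b), "a + b") == 0;

    /* No white space between tokens means none in the string. */
    ok += strcmp(STR(a+b), "a+b") == 0;

    /* The spelling is kept, not the value: 0x10 stays "0x10". */
    ok += strcmp(STR(0x10), "0x10") == 0;

    return ok;   /* number of comparisons that matched */
}
```

On a conforming implementation all three comparisons should match. Note that this only covers tokens and spacing written directly in the macro argument, not the layout of any -P style text output.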


: It seems to me you're (mostly) running afoul of the first
:two issues, with the third a looming but distant threat. What
:to do? Well, it seems that your C implementation (like many)
:allows you to run phases 1-4 separately from the rest of the
:translator, and when you do so you get the output you want;
:it's only when you run the entire translator (with a special
:switch) that the output is unsatisfactory.

Unfortunately not; 'cpp' is the K&R preprocessor, a distinct
standalone program that cannot do the transformations
I need [I actively use the ANSI ## preprocessor token-splicing operator.]
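
For readers unfamiliar with ##, a minimal sketch of token pasting in valid C follows. PASTE and paste_demo are hypothetical names. One caveat, as far as I can tell from the standard's wording: the result of a paste has to be a single valid preprocessing token, otherwise the behaviour is undefined. Two @ characters glued together do not appear to form one, so a definition such as L@##@K would seem to rely on what a particular implementation happens to do rather than on anything the standard guarantees.

```c
/* PASTE and paste_demo are hypothetical names for illustration. */
#define PASTE(a, b) a##b

int paste_demo(void)
{
    int counter = 7;

    /* PASTE(coun, ter) forms the single identifier `counter`;
       PASTE(1, 0) forms the single pp-number `10`. */
    return PASTE(coun, ter) + PASTE(1, 0);   /* counter + 10 */
}
```

Here paste_demo() returns 17. Both pastes produce a valid preprocessing token (an identifier and a pp-number), which is the case the standard defines.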

:Alternatively, use a full-fledged macro processor (m4
:is often mentioned; I've never used it myself) instead of
:trying to get the C translator to do something it wasn't really
:designed for.

The details of the ANSI C preprocessor are incorporated by
reference into the standards for some other languages, so it
is fair and meaningful to ask about the details even if one is
not compiling C code.


I appreciate your comments; they are good points to think about
even if I happen to split hairs a slightly different way than you.
 

Keith Thompson

: First problem: The Standard doesn't promise that the
:preprocessor will produce any kind of output at all;

That's a good point.

: Second problem: The Standard describes what a translator
:(of which the preprocessor is a part) must do with C source
:code, but the only requirement on what it does with non-C is
:that some kinds of aberrations require a diagnostic.

Hmmm, I think I would have to disagree with that point. The
standard describes very particular steps about what is required,
legal or invalid when the preprocessor is used. The standard makes
it clear that semantic analysis does not occur until phase 7,
so by the end of phase 4, the internal representation of the
source must not have undergone any changes that are dependent
upon the semantics of C, other than the precisely defined changes
about splicing lines together, replacement of comments with a single
blank, detection of character literals and string boundaries, and
so on as set out in phases 1-4.

Your input was something like:

#define eye L@@K
I eye

you expected:

I L@@K

but your preprocessor produced:

I L@ @K

Since L@@K is not a valid C token or a sequence of valid C tokens, the
preprocessor's behavior isn't going to affect any valid C program.
Later phases of the C compiler are going to produce a syntax error
message whether the preprocessor produces "I L@@K" or "I L@ @K".

I've also seen problems using a C preprocessor on input containing
apostrophes. If there's a single apostrophe on a line, the
preprocessor is going to treat it as an incomplete character constant.
(The same thing applies to quotation marks, but standalone apostrophes
are more common.)

For anything that's going to be flagged as an error by later phases,
different C preprocessors are likely to behave differently -- and if
you manage to get your project working with the quirks of whatever
preprocessor you're currently using, it's likely to break with a later
version.

A C preprocessor, even if it happens to have the (non-required)
ability to produce text output, is really designed to work on C source
code.

You may find that m4 is more suitable for your purposes (there's a GNU
implementation).
 

Walter Roberson

:> The only excuse I can think of is that as ` and @ and $ are not C
:> operators, that outside of character strings and character literals they

:They aren't even required to exist in the character set which makes their
:use, even in character constants and string literals, non-portable.

Checking around, I see that you are correct that those 3 characters
are not part of the minimal environment. It seems odd to think
that even the most elementary financial program using the North
American currency symbol would be technically non-portable, but
that does appear to be the case.
 

Michael Wojcik

[Followups set to comp.lang.c.]

:> The only excuse I can think of is that as ` and @ and $ are not C
:> operators, that outside of character strings and character literals they

:They aren't even required to exist in the character set which makes their
:use, even in character constants and string literals, non-portable.

Checking around, I see that you are correct that those 3 characters
are not part of the minimal environment. It seems odd to think
that even the most elementary financial program using the North
American currency symbol would be technically non-portable, but
that does appear to be the case.

The standard aims to accommodate implementations on platforms where
that symbol may not be conveniently available. I don't think that's
odd at all. Chances are, if you're writing a program that requires
that symbol, it will be conveniently available to you as an
implementation extension to the standard, and you ought to use that
extension just as you might use any other. Very, very few C programs
do not depend on any implementation extensions whatsoever.

--
Michael Wojcik (e-mail address removed)

The antics which have been drawn together in this book are huddled here
for mutual protection like sheep. If they had half a wit apiece each
would bound off in many directions, to unsimplify the target. -- Walt Kelly
 
