Re: Stripping multiline C comments without using Lex

Discussion in 'C Programming' started by Stephane CHAZELAS, Feb 4, 2004.

  1. 2004-02-03, 14:28(-06), Ed Morton:
    [...discussing the best way to strip comments from a C file...]
    > Try the above on this code:
    >
    > #include "stdio.h"
    >
    > #define GOOGLE(txt) printf("Google web page = " #txt "\n")
    >
    > int main(void) {
    > GOOGLE(http://www.google.com);
    > }

    [...]
    > Using "gcc -E -ansi" handles it OK.

    [... while // could be taken as a comment otherwise]

    I didn't expect that to work. Are you sure it is valid ANSI C
    code? To me, stringizing only makes sense for valid C
    expressions (or at least parts of valid C expressions), for
    logging/debugging purposes or the like. When a macro argument
    is intended to be used only as a string, it's more sensible
    to write it as

    #define GOOGLE(txt) printf("Google web page = " txt "\n")
    ....
    GOOGLE("http://www.google.com");

    I'd use stringizing for example for:

    ~$ cpp -P << EOF
    heredoc> #define check(cond) { if (!(cond)) { fprintf(stderr, \
    heredoc> "condition \"" #cond "\" not met.\n"); exit(2); } }
    heredoc> ...
    heredoc> check(length < sizeof(buffer))
    heredoc> EOF
    ....
    { if (!(length < sizeof(buffer))) { fprintf(stderr, "condition \"" "length < sizeof(buffer)" "\" not met.\n"); exit(2); } }

    (i.e. where "cond" is a syntactically valid C expression).

    [x-post, no fu2 (feel free to add one)]

    --
    Stéphane ["Stephane.Chazelas" at "free.fr"]
    Stephane CHAZELAS, Feb 4, 2004
    #1

  2. Chris Torek:

    In article <>
    Stephane CHAZELAS <> writes:
    >2004-02-03, 14:28(-06), Ed Morton:
    >[...discussing about the best way to strip comments from a C file...]
    >> Try the above on this code:
    >>
    >> #include "stdio.h"
    >>
    >> #define GOOGLE(txt) printf("Google web page = " #txt "\n")
    >>
    >> int main(void) {
    >> GOOGLE(http://www.google.com);
    >> }

    >[...]
    >> Using "gcc -E -ansi" handles it OK.

    >[... while // could be taken as a comment otherwise]
    >
    >I didn't expect that to work. Are you sure it is valid ANSI C
    >code?


    The "stringize" operator, and indeed the entire preprocessor, works
    on tokens, or more precisely, a sequence of "preprocessing-token"s.

    Preprocessing tokens are defined as:

    preprocessing-token:
        header-name
        identifier
        pp-number
        character-constant
        string-literal
        operator
        punctuator
        each non-white-space character that cannot be one of the above

    (from a C99 draft, but should be close enough).

    The C89 and C99 standards differ in an important way here: in C99,
    // begins a comment that runs to the end of the line. In C89, // is
    simply two slashes. Translation proceeds in "phases": comments are
    replaced with a single space character in phase 3, while
    preprocessing directives and macro invocations are handled in phase 4.

    Thus, in C99, before any macro processing (including stringizing)
    can occur, the sequence "GOOGLE(http://www.google.com);" turns into
    "GOOGLE(http: ". The closing parenthesis is missing and you must
    get a diagnostic. (Double quotes here are simply to allow for
    whitespace.)

    In C89, on the other hand, the text survives phase 3, and the
    pp-token sequence is:

    GOOGLE
    (
    http
    :
    /
    /
    www
    .
    google
    .
    com
    )
    ;

    The stringizing operator "#" allows a complete token sequence
    and should produce the string-literal "http://www.google.com"
    in this case.

    Thus, whether this works depends on whether your compiler
    implements the new 1999 standard ("doesn't work") or the
    old 1989 one ("does work"), perhaps with the 1995 updates
    (no change to whether this works).
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: forget about it http://web.torek.net/torek/index.html
    Reading email is like searching for food in the garbage, thanks to spammers.
    Chris Torek, Feb 5, 2004
    #2

  3. 2004-02-05, 01:01(+00), Chris Torek:
    [...]
    > The "stringize" operator, and indeed the entire preprocessor, works
    > on tokens, or more precisely, a sequence of "preprocessing-token"s.
    >
    > Preprocessing tokens are defined as:
    >
    > preprocessing-token:
    > header-name
    > identifier
    > pp-number
    > character-constant
    > string-literal
    > operator
    > punctuator
    > each non-white-space character that cannot be one of the above
    >
    > (from a C99 draft, but should be close enough).

    [...]
    > In C89, on the other hand, the text [GOOGLE(http://www.google.com)]
    > survives phase 3, and the pp-token sequence is:
    >
    > GOOGLE
    > (
    > http
    > :
    > /
    > /
    > www
    > .
    > google
    > .
    > com
    > )
    > ;
    >
    > The stringizing operator "#" allows a complete token sequence
    > and should produce the string-literal "http://www.google.com"
    > in this case.

    [...]

    Thanks for that very detailed answer. But there are still
    points that are unclear to me. Blanks are not tokens, so I guess
    they are just ignored. But how does the stringizing operator join
    the tokens from a pp-tokens list? From what you say, it seems that
    they are stuck together, but in

    #define s(t) #t
    s(//)
    s(1 + 1)
    s(1    +    1)
    s(1+1)

    I get, with GNU cpp -P -ansi
    "//"
    "1 + 1"
    "1 + 1"
    "1+1"

    (spaces seem to have an influence somehow).

    And, I guess that when calling a macro, there are things you
    can't do that restrict the range of possible strings that can be
    stringized.

    For instance, it seems impossible to stringize "foo)", or
    "foo," (or "/*", or 'a, or "aer...), that's why I thought in
    the first place that there had to be rules on what is allowed
    for either a macro argument or for the stringizing operator, and
    that http://www.google.com might break those rules (but I can
    see now that it's very likely that it breaks no rule [except in
    C99]).

    --
    Stéphane ["Stephane.Chazelas" at "free.fr"]
    Stephane CHAZELAS, Feb 5, 2004
    #3
  4. In comp.unix.shell Stephane CHAZELAS <> wrote:
    ....
    # Thanks for that very detailed answer. But there are still
    # points that are unclear to me. Blanks are not tokens, so I guess
    # they are just ignored. But how does the stringizing operator join
    # the tokens from a pp-tokens list? From what you say, it seems that
    # they are stuck together, but in
    #
    # #define s(t) #t
    # s(//)
    # s(1 + 1)
    # s(1    +    1)
    # s(1+1)
    #
    # I get, with GNU cpp -P -ansi
    # "//"
    # "1 + 1"
    # "1 + 1"
    # "1+1"
    #
    # (spaces seem to have an influence somehow).

    The C (99) Standard requires in 6.10.3.2#2 that "... Each occurrence of
    white space between the argument's preprocessing tokens becomes a single
    space character in the string literal. White space before the first pp
    token and after the last pp token composing the argument is deleted."

    Regards,

    Jens
    --
    Jens Schweikhardt http://www.schweikhardt.net/
    SIGSIG -- signature too long (core dumped)
    Jens Schweikhardt, Feb 5, 2004
    #4
