C, lexical

Discussion in 'C Programming' started by Lucas Zimmerman, Sep 9, 2005.

  1. Is there any Lex code available that describes how to scan C programs?
    I'd like to read something related to this. One of my doubts is how C
    deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
    (considering C99's `//').

    thanks in advance,

    n.
    Lucas Zimmerman, Sep 9, 2005
    #1

  2. "Lucas Zimmerman" <> wrote:
    >Is there any Lex code available that describes how to scan C programs?
    >I'd like to
    >read someting related to this. One of my doubs is how C deals with
    >ambiguities,
    >for example, `a = x/*p;' or `a = x//*...*/-3;' (considering c99's
    >`//').


    Well, it's not C99, but maybe a good starting point:

    http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

    Best Regards
    --
    Irrwahn Grausewitz ()
    welcome to clc : http://www.ungerhu.com/jxh/clc.welcome.txt
    clc faq-list : http://www.faqs.org/faqs/C-faq/faq/
    clc frequent answers: http://benpfaff.org/writings/clc.
    Irrwahn Grausewitz, Sep 9, 2005
    #2

  3. Irrwahn Grausewitz wrote:
    > "Lucas Zimmerman" <> wrote:
    > >Is there any Lex code available that describes how to scan C programs?
    > >I'd like to
    > >read someting related to this. One of my doubs is how C deals with
    > >ambiguities,
    > >for example, `a = x/*p;' or `a = x//*...*/-3;' (considering c99's
    > >`//').

    >
    > Well, it's not C99, but maybe a good starting point:
    >
    > http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
    >
    > Best Regards


    Amazing document! thanks a lot Irrwahn.
    Interesting how `char x<:N:>;' is valid in C. Is this C99 too?
    I'm still learning C after 3 years of studying it! There is always
    something new to know about this language.

    thanks once again,

    n.
    Lucas Zimmerman, Sep 10, 2005
    #3
  4. Lucas Zimmerman wrote:

    > Is there any Lex code available that describes how to scan C programs?
    > I'd like to
    > read someting related to this. One of my doubs is how C deals with
    > ambiguities,
    > for example, `a = x/*p;' or `a = x//*...*/-3;' (considering c99's
    > `//').


    Those are not ambiguous because C specifies the processing order. The
    first example contains the start of a comment. The second example
    performs a division in C90 and is the fragment "a = x" in C99.
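
    As a rough illustration of that rule, here is a minimal sketch of the
    decision a hand-written scanner might make when it sees a '/'. The
    enum names and the classify_slash function are hypothetical, invented
    for this example; a real scanner would also have to consume the rest
    of the comment or token.

```c
/* Hypothetical token kinds, used only in this sketch. */
enum slash_kind {
    SLASH_DIVIDE,         /* the operator /            */
    SLASH_DIV_ASSIGN,     /* the operator /=           */
    SLASH_BLOCK_COMMENT,  /* start of a block comment  */
    SLASH_LINE_COMMENT    /* start of a C99 // comment */
};

/* Classify the text at s, where s[0] is assumed to be '/'.
   The longest candidate wins, so a following '*' always opens a comment.
   Pass c99 != 0 to recognise // comments as well. */
enum slash_kind classify_slash(const char *s, int c99)
{
    if (s[1] == '*')
        return SLASH_BLOCK_COMMENT;
    if (s[1] == '/' && c99)
        return SLASH_LINE_COMMENT;
    if (s[1] == '=')
        return SLASH_DIV_ASSIGN;
    return SLASH_DIVIDE;
}
```

    Under these assumptions, the text after x in the first example is
    classified as a block comment in either mode. For the second example,
    the first '/' is a division in C90 mode (the next '/' then opens the
    block comment), while in C99 mode it starts a line comment.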

    Thad
    Thad Smith, Sep 10, 2005
    #4
  5. Irrwahn Grausewitz wrote:
    > "Lucas Zimmerman" <> wrote:
    > >Is there any Lex code available that describes how to scan C programs?
    > >I'd like to
    > >read someting related to this. One of my doubs is how C deals with
    > >ambiguities,
    > >for example, `a = x/*p;' or `a = x//*...*/-3;' (considering c99's
    > >`//').

    >
    > Well, it's not C99, but maybe a good starting point:
    >
    > http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
    >
    > Best Regards


    I'm not sure but I think I found a bug in this code.
    ....
    L?\"(\\.|[^\\"])*\" { count(); return(STRING_LITERAL); }
    ....

    If I'm right, there is one backslash missing, so we would have this:

    L?\"(\\.|[^\\\"])*\" { count(); return(STRING_LITERAL); /* right? */ }

    instead of the original. It makes sense to me, since '\' is a lex regex
    operator.
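
    To make the rule easier to reason about, here is a hand-written C
    sketch of what the pattern is meant to accept: an optional L, an
    opening quote, then any mix of backslash-plus-one-character pairs and
    characters that are neither backslash nor quote, then a closing quote.
    The function name match_string_literal is hypothetical. As far as I
    know, flex treats a double quote inside a bracket expression as a
    literal character, so both spellings of the rule above should accept
    the same strings, but check the documentation of your lex.

```c
#include <stddef.h>

/* Return the length of the string literal at the start of s,
   or 0 if s does not start with a complete string literal. */
size_t match_string_literal(const char *s)
{
    size_t i = 0;

    if (s[i] == 'L')            /* optional wide-string prefix */
        ++i;
    if (s[i] != '"')
        return 0;
    ++i;

    for (;;) {
        if (s[i] == '\0')
            return 0;           /* input ended before the closing quote */
        if (s[i] == '"')
            return i + 1;       /* closing quote: include it in the length */
        if (s[i] == '\\') {
            /* Backslash plus any one character; like the lex '.',
               this does not accept a newline after the backslash. */
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;
            i += 2;
        } else {
            ++i;                /* any character except backslash or quote */
        }
    }
}
```

    The escaped quote in "a\"b" is consumed by the backslash branch, so
    it does not end the literal.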

    n.
    Lucas Zimmerman, Sep 10, 2005
    #5
  6. "Lucas Zimmerman" <> wrote:
    <snip>
    >Interesting how `char x<:N:>;' is valid in C. Is this c99 too?


    Yup, digraphs are still mentioned in the standard, and I do not
    expect them to be dropped any time soon.

    ISO/IEC 9899:1999 (E) 6.4.6p3:

    In all aspects of the language, the six tokens (*)
    <: :> <% %> %: %:%:
    behave, respectively, the same as the six tokens
    [ ] { } # ##
    except for their spelling.

    (*) These tokens are sometimes called ‘‘digraphs’’.
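
    A small sketch of what that means in practice: the two functions
    below should compile to the same thing on a compiler that supports
    digraphs. The names N, sum_digraph and sum_plain are made up for
    this example.

```c
/* %: is the digraph spelling of #. */
%:define N 3

/* Written with digraphs: <: :> for [ ] and <% %> for { }. */
int sum_digraph(void)
<%
    int a<:N:> = <% 1, 2, 3 %>;
    int total = 0;
    int i;
    for (i = 0; i < N; ++i)
        total += a<:i:>;
    return total;
%>

/* The same function in the usual spelling. */
int sum_plain(void)
{
    int a[N] = { 1, 2, 3 };
    int total = 0;
    int i;
    for (i = 0; i < N; ++i)
        total += a[i];
    return total;
}
```

    Because digraphs are tokens, they are not recognised inside string
    literals or character constants; that is one of the ways they differ
    from trigraphs.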

    Addition: note that in the document I mentioned upthread the
    *trigraphs* are missing.

    ISO/IEC 9899:1999 (E) 5.2.1.1p1

    All occurrences in a source file of the following sequences of three
    characters (called trigraph sequences) are replaced with the
    corresponding single character.
        ??=  #      ??)  ]      ??!  |
        ??(  [      ??'  ^      ??>  }
        ??/  \      ??<  {      ??-  ~
    No other trigraph sequences exist. Each ? that does not begin one of
    the trigraphs listed above is not changed.

    Should you ever notice that printf("Huh???/n"); prints Huh? followed
    by a new-line, you now know why. :)
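
    For anyone writing their own scanner, the replacement step can be
    sketched as a plain string transformation. This is only an
    illustration under my own assumptions: the function name
    replace_trigraphs and its buffer-based interface are hypothetical,
    and a real implementation would work on the input stream instead.

```c
#include <stddef.h>
#include <string.h>

/* Copy src to dst, replacing each trigraph with its single character.
   At most dstsize - 1 characters are written, followed by a NUL.
   Returns dst. */
char *replace_trigraphs(char *dst, size_t dstsize, const char *src)
{
    /* Third character of each trigraph and the character it maps to. */
    static const char from[] = "=)!('>/<-";
    static const char to[]   = "#]|[^}\\{~";
    size_t n = 0;

    if (dstsize == 0)
        return dst;

    while (*src != '\0' && n + 1 < dstsize) {
        const char *p = NULL;

        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            p = strchr(from, src[2]);

        if (p != NULL) {
            dst[n++] = to[p - from];   /* three characters become one */
            src += 3;
        } else {
            dst[n++] = *src++;         /* a ? that starts no trigraph is kept */
        }
    }
    dst[n] = '\0';
    return dst;
}
```

    Fed the eight characters Huh???/n, this yields Huh? followed by a
    backslash and n: the first ? starts no trigraph and is kept, and the
    next three characters become the backslash. Note that gcc ignores
    trigraphs in its default GNU mode unless -trigraphs or a strict
    -std= option is given.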

    Best regards
    Irrwahn Grausewitz, Sep 10, 2005
    #6
  7. Lucas Zimmerman wrote:
    > Is there any Lex code available that describes how to scan C programs?
    > I'd like to
    > read someting related to this. One of my doubs is how C deals with
    > ambiguities,
    > for example, `a = x/*p;' or `a = x//*...*/-3;' (considering c99's
    > `//').


    C uses a "greedy" tokenizer (the "maximal munch" rule), i.e. it tries
    to make the largest token possible at each point. So, x/*p is always
    the start of a comment, not x divided by whatever p points to.

    Your second example is equivalent to a = x/ -3; in C89, but equivalent
    to a = x (with no semicolon) in C99. One of the stranger ways to tell
    the difference at run time is:

    [sbiber@eagle c]$ cat version.c
    #include <stdio.h>

    int main(void)
    {
        /* C89 sees if(1 / 2); C99 sees if(1, with the rest of the line a comment */
        if(1//**/2
        ) printf("C99\n");
        else printf("C89\n");

        return 0;
    }
    [sbiber@eagle c]$ c89 version.c && ./a.out
    C89
    [sbiber@eagle c]$ c99 version.c && ./a.out
    C99

    Note how the closing parenthesis of the if statement must be on the next
    line, so that it is not part of the C99 comment.

    --
    Simon.
    Simon Biber, Sep 10, 2005
    #7
  8. Irrwahn Grausewitz wrote:
    >
    > All occurrences in a source file of the following sequences of three
    > characters (called trigraph sequences) are replaced with the
    > corresponding single character.
    > ??= # ??) ] ??! |
    > ??( [ ??' ^ ??> }
    > ??/ \ ??< { ??- ~
    > No other trigraph sequences exist. Each ? that does not begin one
    > of the trigraphs listed above is not changed.
    >
    > Should you ever notice, that printf("Huh???/n"); prints Huh?
    > followed by a new-line, you now know why. :)


    A more insidious example (plagiarized from www.gotw.ca article 86):

    #include <stdio.h>

    int main(void)
    {
        int x = 1;
        int i;
        for( i = 0; i < 100; ++i )
            // What will the next line do? Increment???????????/
            ++x;
        printf("%d\n", x);
    }
    Old Wolf, Sep 11, 2005
    #8
  9. "Lucas Zimmerman" <> wrote in message
    news:...
    > Irrwahn Grausewitz wrote:
    > > "Lucas Zimmerman" <> wrote:
    > > >Is there any Lex code available that describes how to scan C programs?
    > > >I'd like to
    > > >read someting related to this. One of my doubs is how C deals with
    > > >ambiguities,
    > > >for example, `a = x/*p;' or `a = x//*...*/-3;' (considering c99's
    > > >`//').

    > >
    > > Well, it's not C99, but maybe a good starting point:
    > >
    > > http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
    > >
    > > Best Regards

    >
    > Amazing document! thanks a lot Irrwahn.
    > Interesting how `char x<:N:>;' is valid in C. Is this c99 too?
    > I'm still learning C after 3 years studying it!! There is always
    > something
    > new to know about this language.


    It's been almost 25 years, and I'm still learning as well ;-)

    Enjoy!

    Chqrlie.
    Charlie Gordon, Sep 12, 2005
    #9
  10. Another question...

    I tried to compile the following code with gcc:
    ------
    #include <stdio.h>
    @

    int main(void) {
        return 0;
    }
    -------

    the output was:
    t.c:2: error: syntax error at '@' token

    My question then is: why does gcc say `syntax error'? I'm not
    sure what is happening here, but I think the lexical analyzer
    is passing '@' as a valid token to the parser, and then the parser
    says `ok, I'm not expecting a @, so: syntax error'.

    Am I missing something? I thought lex would be responsible
    for giving this error message, since '@' is (AFAIK) not a valid
    C token.

    thanks a lot in advance once again,

    n.
    Lucas Zimmerman, Sep 13, 2005
    #10
  11. In article <>,
    Lucas Zimmerman <> wrote:
    >I tried to compile the following code with gcc:
    >------
    >#include <stdio.h>
    >@
    >
    >int main(void) {
    > return 0;
    >}
    >-------


    >the output was:
    >t.c:2: error: syntax error at '@' token


    >My question then is: why gcc says `syntax error'?


    Why not?

    >I'm not
    >sure what is happening here but I think the lexical analyzer
    >is passing '@' as a valid token to the parser and then parser
    >says `ok, I'm not expecting a @ so, syntax error'.


    >am I missing something? I thought lex would be responsible
    >for giving this error message since '@' is (AFAIC) not a valid
    >C token.


    It appears to me that you are assuming that the program 'lex' is
    being used to do lexical analysis, and that the result is passed
    to gcc. gcc does not, however, use 'lex': it has its own built-in
    lexical analyzer as -part- of its processing. gcc doesn't even
    have a separate preprocessing program (e.g., "cpp"): it does
    everything up to an intermediate code representation in a single
    unified program. There might be a bunch of different routines
    that that unified program calls upon, but that part is all one
    program, so all the error messages are going to appear to be
    from the same program.
    --
    I was very young in those days, but I was also rather dim.
    -- Christopher Priest
    Walter Roberson, Sep 13, 2005
    #11
  12. Walter Roberson writes:
    > In article <>,
    > Lucas Zimmerman <> wrote:
    >>I tried to compile the following code with gcc:
    >>------
    >>#include <stdio.h>
    >>@
    >>
    >>int main(void) {
    >> return 0;
    >>}
    >>-------

    >
    >>the output was:
    >>t.c:2: error: syntax error at '@' token

    >
    >>My question then is: why gcc says `syntax error'?

    >
    > Why not?
    >
    >>I'm not
    >>sure what is happening here but I think the lexical analyzer
    >>is passing '@' as a valid token to the parser and then parser
    >>says `ok, I'm not expecting a @ so, syntax error'.

    >
    >>am I missing something? I thought lex would be responsible
    >>for giving this error message since '@' is (AFAIC) not a valid
    >>C token.

    >
    > It appears to me that you are assuming that the program 'lex' is
    > being used to do lexical analysis, and that the result is passed
    > to gcc. gcc does not, however, use 'lex': it has its own built-in
    > lexical analyzer as -part- of its processing. gcc doesn't even
    > have a seperate preprocessing program (e.g., "cpp"): it does
    > everything up to an intermediate code representation in a single
    > unified program. There might be a bunch of different routines
    > that that unified program calls upon, but that part is all one
    > program, so all the error messages are going to appear to be
    > from the same program.


    Or perhaps he was using "lex" as an abbreviation of "lexical
    analyzer". (In any case, the "lex" program *generates* a lexical
    analyzer.)

    Some versions of gcc do use a separate preprocessor. For example,
    "gcc -v" with version 2.95.2 shows that it invokes "cpp" followed by
    "cc1". Later versions just invoke "cc1". (Later phases aren't
    invoked if there's a failure in an earlier phase.)

    This is off-topic, except that it illustrates that a compiler has a
    lot of freedom in how it implements the translation phases described
    in section 5.1.1.2 of the standard.

    With gcc versions 3.4.4 and 4.0.0, the error message I get is
    "error: stray '@' in program".

    Also, note that a lone @ character *is* a valid preprocessor token,
    though it isn't a valid token. This means that this:

    #if 0
    @
    #endif
    int main(void){}

    is a legal program, but this:

    #if 0
    "
    #endif
    int main(void){}

    isn't (it invokes undefined behavior).

    The point of all this is that, although the standard defines 8
    distinct translation phases, an implementation is not required to
    implement them as separate sequential phases. As long as it processes
    legal programs correctly and issues diagnostics where required, it can
    do whatever it likes.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
    Keith Thompson, Sep 13, 2005
    #12
  13. In article <>,
    Keith Thompson <> wrote:
    >Also, note that a lone @ character *is* a valid preprocessor token,
    >though it isn't a valid token. This means that this:


    >#if 0
    >@
    >#endif
    >int main(void){}


    >is a legal program,


    Keith, I'm not quite sure how you get that? @ is not part of
    the basic C character set, so how can its behaviour be well defined?

    As the validity of the presence of @ would appear to be an
    implementation extension, then that implementation extension could
    treat @ as an alias for " for example.
    --
    Any sufficiently old bug becomes a feature.
    Walter Roberson, Sep 13, 2005
    #13
  14. Walter Roberson writes:
    > In article <>,
    > Keith Thompson <> wrote:
    >>Also, note that a lone @ character *is* a valid preprocessor token,
    >>though it isn't a valid token. This means that this:

    >
    >>#if 0
    >>@
    >>#endif
    >>int main(void){}

    >
    >>is a legal program,

    >
    > Keith, I'm not quite sure how you get that? @ is not part of
    > the basic C character set, so how can its behaviour be well defined?
    >
    > As the validity of the presence of @ would appear to be an
    > implementation extension, then that implementation extension could
    > treat @ as an alias for " for example.


    You're right (at least partly); I didn't think of that.

    C99 5.2.1 says that the source character set includes *at least* a
    specified set of characters (upper and lower case letters, digits,
    space, horizontal tab, vertical tab, form feed, and 29 punctuation
    characters, *not* including '@'). But '@' can be, and often is, an
    "extended character".

    For an implementation that doesn't define '@' as part of the source
    character set, any occurrence of @ in a source file invokes undefined
    behavior (which, as you say, can include treating it as an alias for ").
    But if '@' *is* part of the source character set, then it's a legal
    preprocessor token (but not a legal token).
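
    For a scanner that wants to flag such characters, the required set
    described above can be written down directly. This is a minimal
    sketch; the function name is_basic_source_char is hypothetical, and
    the end-of-line indicator is deliberately left out because the
    standard treats it separately from the listed characters.

```c
#include <string.h>

/* Return 1 if c is one of the characters listed above as required in
   the source character set, 0 otherwise. */
int is_basic_source_char(int c)
{
    static const char required[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789"
        " \t\v\f"                             /* space, HT, VT, FF */
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";    /* the 29 punctuation characters */

    if (c == '\0')
        return 0;   /* strchr would otherwise match the terminator */
    return strchr(required, c) != NULL;
}
```

    With this table, '@', '$' and '`' are all reported as outside the
    required set, which matches their status as possible extended
    characters.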

    Keith Thompson, Sep 13, 2005
    #14