C, lexical

Lucas Zimmerman · Sep 9, 2005

Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

thanks in advance,

n.

Irrwahn Grausewitz · Sep 9, 2005

Lucas Zimmerman said:
Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

Well, it's not C99, but maybe a good starting point:

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

Best Regards

Lucas Zimmerman · Sep 10, 2005

Irrwahn said:
Well, it's not C99, but maybe a good starting point:

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

Best Regards

Amazing document! Thanks a lot, Irrwahn.
Interesting how `char x<:N:>;' is valid in C. Is this C99 too?
I'm still learning C after 3 years studying it!! There is always
something new to know about this language.

thanks once again,

n.

Thad Smith · Sep 10, 2005

Lucas said:
Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

Those are not ambiguous because C specifies the processing order. The
first example contains the start of a comment. The second example
performs a division in C90; in C99 everything from the `//' onward is a
comment, which leaves only the fragment "a = x".

Thad

Lucas Zimmerman · Sep 10, 2005

Irrwahn said:
Well, it's not C99, but maybe a good starting point:

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

Best Regards

I'm not sure but I think I found a bug in this code.
....
L?\"(\\.|[^\\"])*\" { count(); return(STRING_LITERAL); }
....

If I'm right, there is one backslash missing, so we would have this:

L?\"(\\.|[^\\\"])*\" { count(); return(STRING_LITERAL); /* right? */ }

instead of the original. It makes sense to me, since '\' is a lex regex
operator.

n.

Irrwahn Grausewitz · Sep 10, 2005

Lucas Zimmerman said:
Interesting how `char x<:N:>;' is valid in C. Is this C99 too?

Yup, digraphs are still mentioned in the standard, and I do not
expect them to be dropped any time soon.

ISO/IEC 9899:1999 (E) 6.4.6p3:

In all aspects of the language, the six tokens (*)
<: :> <% %> %: %:%:
behave, respectively, the same as the six tokens
[ ] { } # ##
except for their spelling.

(*) These tokens are sometimes called "digraphs".
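
To make the quoted paragraph concrete, here is a minimal sketch of a function spelled with digraphs instead of [ ] { }. The name sum_digraph is made up for this illustration; nothing here comes from the linked grammar. It assumes a compiler that accepts digraphs, which any C99 implementation should.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n ints. The brackets and braces are written as digraphs:
   <: :> stand for [ ] and <% %> stand for { }.
   sum_digraph is a hypothetical name used only for this example. */
int sum_digraph(const int values<: :>, size_t n)
<%
    int total = 0;
    size_t i;

    assert(values != NULL);
    for (i = 0; i < n; i++)
        total += values<:i:>;
    return total;
%>
```

Called as sum_digraph(v, 3) on an ordinary array int v[3] = {1, 2, 3}, this behaves exactly like the same function written with the usual punctuation, since only the spelling of the tokens differs.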

Addition: note that the *trigraphs* are missing from the document I
mentioned upthread.

ISO/IEC 9899:1999 (E) 5.2.1.1p1

All occurrences in a source file of the following sequences of three
characters (called trigraph sequences) are replaced with the
corresponding single character.
??=  #      ??)  ]      ??!  |
??(  [      ??'  ^      ??>  }
??/  \      ??<  {      ??-  ~
No other trigraph sequences exist. Each ? that does not begin one of
the trigraphs listed above is not changed.

Should you ever notice that printf("Huh???/n"); prints Huh? followed
by a new-line, you now know why.
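
The replacement rule quoted above can be sketched as a small function. This is only an illustration under stated assumptions: replace_trigraphs is a hypothetical name, the caller is assumed to supply a destination buffer at least as large as the source, and the lookup strings are laid out so that the sketch itself never contains a trigraph.

```c
#include <assert.h>
#include <string.h>

/* Sketch of trigraph replacement (translation phase 1).
   replace_trigraphs is a hypothetical helper for this example.
   dst must have room for strlen(src) + 1 bytes. */
void replace_trigraphs(const char *src, char *dst)
{
    /* Third character of each trigraph and its replacement,
       kept index-aligned with each other. */
    static const char third[] = "=)!('>/<-";
    static const char repl[]  = "#]|[^}\\{~";
    size_t o = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        const char *hit = NULL;

        /* Only two question marks followed by a listed character count. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(third, src[2]);

        if (hit != NULL) {
            dst[o++] = repl[hit - third];
            src += 3;
        } else {
            /* A ? that does not begin a trigraph is copied unchanged. */
            dst[o++] = *src++;
        }
    }
    dst[o] = '\0';
}
```

Fed the seven characters Huh???/n, the sketch copies the first question mark unchanged and turns the remaining ??/ into a backslash, giving Huh?\n. A later translation phase then reads that backslash-n inside the string literal as the new-line escape.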

Best regards

Simon Biber · Sep 10, 2005

Lucas said:
Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

C uses a "greedy" tokenizer (often called the "maximal munch" rule),
i.e. it tries to make the largest token possible at each point. So,
x/*p is always the start of a comment, not x divided by whatever p
points to.

Your second example is equivalent to a = x/ -3; on C89, but equivalent
to a = x (with no semicolon) on C99. One of the stranger ways to tell
the difference at run time is:

[sbiber@eagle c]$ cat version.c
#include <stdio.h>

int main(void)
{
    if(1//**/2
    ) printf("C99\n");
    else printf("C89\n");

    return 0;
}
[sbiber@eagle c]$ c89 version.c && ./a.out
C89
[sbiber@eagle c]$ c99 version.c && ./a.out
C99

Note how the closing parenthesis of the if statement must be on the next
line, so that it is not part of the C99 comment.
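
The left-to-right, longest-match behaviour described above can be sketched as a tiny comment stripper. Everything here is an assumption made for illustration: strip_comments and its line_comments flag are hypothetical names, each comment is replaced by a single space, and string and character literals are deliberately not handled.

```c
#include <assert.h>
#include <string.h>

/* Copies src to dst, replacing every comment with one space.
   strip_comments is a hypothetical helper for this example.
   line_comments != 0 enables C99-style line comments.
   dst must have room for strlen(src) + 1 bytes.
   Simplification: string and character literals are NOT recognised. */
void strip_comments(const char *src, char *dst, int line_comments)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (src[i] != '\0') {
        if (strncmp(src + i, "/*", 2) == 0) {
            /* Block comment: skip to the closing delimiter, or to the
               end of input if the comment is never closed. */
            i += 2;
            while (src[i] != '\0' && strncmp(src + i, "*/", 2) != 0)
                i++;
            if (src[i] != '\0')
                i += 2;
            dst[o++] = ' ';
        } else if (line_comments && strncmp(src + i, "//", 2) == 0) {
            /* Line comment: skip up to, but not including, the new-line. */
            while (src[i] != '\0' && src[i] != '\n')
                i++;
            dst[o++] = ' ';
        } else {
            dst[o++] = src[i++];
        }
    }
    dst[o] = '\0';
}
```

With line_comments set to 0, the input a = x//*...*/-3; comes out as a = x/ -3; because the first slash is copied and the second one opens a block comment. With line_comments set to 1, the same input is cut down to a = x followed by a single space, because scanning left to right finds the two slashes first.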

Old Wolf · Sep 11, 2005

Irrwahn said:
All occurrences in a source file of the following sequences of three
characters (called trigraph sequences) are replaced with the
corresponding single character.
??=  #      ??)  ]      ??!  |
??(  [      ??'  ^      ??>  }
??/  \      ??<  {      ??-  ~
No other trigraph sequences exist. Each ? that does not begin one
of the trigraphs listed above is not changed.

Should you ever notice, that printf("Huh???/n"); prints Huh?
followed by a new-line, you now know why.

A more insidious example (plagiarized from www.gotw.ca article 86):

#include <stdio.h>

int main(void)
{
    int x = 1;
    int i;
    for( i = 0; i < 100; ++i )
        // What will the next line do? Increment???????????/
        ++x;
    printf("%d\n", x);
}

Charlie Gordon · Sep 12, 2005

Lucas Zimmerman said:
Amazing document! Thanks a lot, Irrwahn.
Interesting how `char x<:N:>;' is valid in C. Is this C99 too?
I'm still learning C after 3 years studying it!! There is always
something new to know about this language.

It's been almost 25 years, and I'm still learning as well ;-)

Enjoy!

Chqrlie.

Lucas Zimmerman · Sep 13, 2005

another question...

I tried to compile the following code with gcc:
------
#include <stdio.h>
@

int main(void) {
    return 0;
}
-------

the output was:
t.c:2: error: syntax error at '@' token

My question then is: why does gcc say `syntax error'? I'm not
sure what is happening here, but I think the lexical analyzer
is passing '@' as a valid token to the parser and then the parser
says `ok, I'm not expecting a @, so syntax error'.

Am I missing something? I thought lex would be responsible
for giving this error message since '@' is (AFAIC) not a valid
C token.

thanks a lot in advance once again,

n.

Walter Roberson · Sep 13, 2005

I tried to compile the following code with gcc:
------
#include <stdio.h>
@

int main(void) {
    return 0;
}
-------

the output was:
t.c:2: error: syntax error at '@' token

My question then is: why does gcc say `syntax error'?

Why not?

I'm not sure what is happening here, but I think the lexical analyzer
is passing '@' as a valid token to the parser and then the parser
says `ok, I'm not expecting a @, so syntax error'.

Am I missing something? I thought lex would be responsible
for giving this error message since '@' is (AFAIC) not a valid
C token.

It appears to me that you are assuming that the program 'lex' is
being used to do lexical analysis, and that the result is passed
to gcc. gcc does not, however, use 'lex': it has its own built-in
lexical analyzer as -part- of its processing. gcc doesn't even
have a separate preprocessing program (e.g., "cpp"): it does
everything up to an intermediate code representation in a single
unified program. There might be a bunch of different routines
that that unified program calls upon, but that part is all one
program, so all the error messages are going to appear to be
from the same program.

Keith Thompson · Sep 13, 2005

Why not?

It appears to me that you are assuming that the program 'lex' is
being used to do lexical analysis, and that the result is passed
to gcc. gcc does not, however, use 'lex': it has its own built-in
lexical analyzer as -part- of its processing. gcc doesn't even
have a separate preprocessing program (e.g., "cpp"): it does
everything up to an intermediate code representation in a single
unified program. There might be a bunch of different routines
that that unified program calls upon, but that part is all one
program, so all the error messages are going to appear to be
from the same program.

Or perhaps he was using "lex" as an abbreviation of "lexical
analyzer". (In any case, the "lex" program *generates* a lexical
analyzer.)

Some versions of gcc do use a separate preprocessor. For example,
"gcc -v" with version 2.95.2 shows that it invokes "cpp" followed by
"cc1". Later versions just invoke "cc1". (Later phases aren't
invoked if there's a failure in an earlier phase.)

This is off-topic, except that it illustrates that a compiler has a
lot of freedom in how it implements the translation phases described
in section 5.1.1.2 of the standard.

With gcc versions 3.4.4 and 4.0.0, the error message I get is
"error: stray '@' in program".

Also, note that a lone @ character *is* a valid preprocessor token,
though it isn't a valid token. This means that this:

#if 0
@
#endif
int main(void){}

is a legal program, but this:

#if 0
"
#endif
int main(void){}

isn't (it invokes undefined behavior).

The point of all this is that, although the standard defines 8
distinct translation phases, an implementation is not required to
implement them as separate sequential phases. As long as it processes
legal programs correctly and issues diagnostics where required, it can
do whatever it likes.

Walter Roberson · Sep 13, 2005

Also, note that a lone @ character *is* a valid preprocessor token,
though it isn't a valid token. This means that this:

#if 0
@
#endif
int main(void){}

is a legal program,

Keith, I'm not quite sure how you get that? @ is not part of
the basic C character set, so how can its behaviour be well defined?

As the validity of the presence of @ would appear to be an
implementation extension, then that implementation extension could
treat @ as an alias for " for example.

Keith Thompson · Sep 13, 2005

Keith, I'm not quite sure how you get that? @ is not part of
the basic C character set, so how can its behaviour be well defined?

As the validity of the presence of @ would appear to be an
implementation extension, then that implementation extension could
treat @ as an alias for " for example.

You're right (at least partly); I didn't think of that.

C99 5.2.1 says that the source character set includes *at least* a
specified set of characters (upper and lower case letters, digits,
space, horizontal tab, vertical tab, form feed, and 29 punctuation
characters, *not* including '@'). But '@' can be, and often is, an
"extended character".

For an implementation that doesn't define '@' as part of the source
character set, any occurrence of @ in a source file invokes undefined
behavior (which, as you say, can include treating it as an alias for ").
But if '@' *is* part of the source character set, then it's a legal
preprocessor token (but not a legal token).
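
One way to see the distinction, assuming an implementation that accepts '@' as a member of its source character set, is to let the preprocessor consume the character before it would ever have to become a token. The following is only a sketch: STRINGIZE and at_sign_text are names made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical macro: the # operator turns its argument's
   preprocessing tokens into a string literal. */
#define STRINGIZE(x) #x

#if 0
@
#endif

/* Returns the text produced by stringizing a lone @.
   Assumption: '@' is in the implementation's source character set. */
const char *at_sign_text(void)
{
    const char *text = STRINGIZE(@);

    assert(strlen(text) == 1);
    return text;
}
```

Under that assumption, the @ inside the skipped #if 0 group is never looked at again, and the @ passed to STRINGIZE is turned into the string "@" during preprocessing. In neither case does a lone @ survive to the point where preprocessing tokens are converted into tokens.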
