C, lexical

  • Thread starter Lucas Zimmerman

Lucas Zimmerman

Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

thanks in advance,

n.
 
Thad Smith

Lucas said:
Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

Those are not ambiguous because C specifies the processing order. The
first example contains the start of a comment. The second example
performs a division in C90, but in C99 it is the fragment "a = x"
followed by a comment.

Thad
 
Lucas Zimmerman

Irrwahn said:
Well, it's not C99, but maybe a good starting point:

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html

Best Regards

I'm not sure but I think I found a bug in this code.
....
L?\"(\\.|[^\\"])*\" { count(); return(STRING_LITERAL); }
....

If I'm right, there is one backslash missing, so we would have this:

L?\"(\\.|[^\\\"])*\" { count(); return(STRING_LITERAL); /* right? */ }

instead of the original. It makes sense to me, since '\' is a lex regex
operator.
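
For readers unfamiliar with lex syntax, here is a hand-written sketch in C of what the rule is meant to accept: an optional `L` prefix, an opening quote, then any mix of backslash-escaped characters and characters other than backslash or double quote, then a closing quote. The function name `match_string_literal` is made up for illustration and is not part of the linked grammar. The handling of newlines assumes flex-like behaviour, where `.` does not match a newline but a negated character class does.

```c
#include <stddef.h>

/* Returns the length of the string literal that starts at s, or 0 if
 * s does not start with one. Hypothetical helper, written to mirror
 *     L?\"(\\.|[^\\"])*\"
 * Assumption (flex-like): '.' never matches a newline, while the
 * negated class [^\\"] does. */
size_t match_string_literal(const char *s)
{
    size_t i = 0;

    if (s[i] == 'L')            /* optional wide-string prefix */
        i++;
    if (s[i] != '"')            /* opening quote is mandatory */
        return 0;
    i++;

    while (s[i] != '\0') {
        if (s[i] == '\\') {
            /* \\. : a backslash plus one character that is not a newline */
            if (s[i + 1] == '\0' || s[i + 1] == '\n')
                return 0;
            i += 2;
        } else if (s[i] == '"') {
            return i + 1;       /* closing quote, literal complete */
        } else {
            i++;                /* [^\\"] : any other character */
        }
    }
    return 0;                   /* ran out of input before the closing quote */
}
```

Because a backslash can only be consumed by the `\\.` branch, an escaped quote such as `\"` never ends the literal early, which is the property the rule needs whichever way the character class is spelled.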

n.
 
Irrwahn Grausewitz

Lucas Zimmerman said:
Interesting how `char x<:N:>;' is valid in C. Is this c99 too?

Yup, digraphs are still mentioned in the standard, and I do not
expect them to be dropped any time soon.

ISO/IEC 9899:1999 (E) 6.4.6p3:

In all aspects of the language, the six tokens (*)
    <:  :>  <%  %>  %:  %:%:
behave, respectively, the same as the six tokens
    [   ]   {   }   #   ##
except for their spelling.

(*) These tokens are sometimes called ‘‘digraphs’’.
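
A small illustration of the paragraph quoted above, assuming a compiler that accepts digraphs (they were added by the 1995 amendment to C90 and kept in C99). The two functions below, whose names are invented for this example, differ only in spelling and should return the same value.

```c
/* Conventional spelling. */
int sum_plain(void)
{
    int a[3] = { 1, 2, 3 };
    return a[0] + a[1] + a[2];
}

/* The same function with brackets and braces spelled as digraphs. */
int sum_digraph(void)
<%
    int a<:3:> = <% 1, 2, 3 %>;
    return a<:0:> + a<:1:> + a<:2:>;
%>
```

Unlike trigraphs, digraphs are ordinary tokens, so they are not recognized inside string literals or character constants.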

Addition: note that the *trigraphs* are missing from the document I
mentioned upthread.

ISO/IEC 9899:1999 (E) 5.2.1.1p1

All occurrences in a source file of the following sequences of three
characters (called trigraph sequences) are replaced with the
corresponding single character.
    ??=  #     ??)  ]     ??!  |
    ??(  [     ??'  ^     ??>  }
    ??/  \     ??<  {     ??-  ~
No other trigraph sequences exist. Each ? that does not begin one of
the trigraphs listed above is not changed.

Should you ever notice that printf("Huh???/n"); prints Huh? followed
by a new-line, you now know why. :)
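
The replacement happens in translation phase 1, before string literals are even recognized. As a rough sketch (not how any particular compiler does it), the hypothetical function `replace_trigraphs` below applies the nine substitutions from the table to a buffer. The function name and the buffer-based interface are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Copies in to out, replacing each trigraph sequence with the single
 * character it stands for. out must have room for strlen(in) + 1 bytes.
 * Illustrative sketch of translation phase 1 only. */
void replace_trigraphs(const char *in, char *out)
{
    /* third[i] is the last character of a trigraph, mapped[i] its result. */
    static const char third[]  = "=)!('>/<-";
    static const char mapped[] = "#]|[^}\\{~";

    while (*in != '\0') {
        const char *hit;

        if (in[0] == '?' && in[1] == '?' && in[2] != '\0'
            && (hit = strchr(third, in[2])) != NULL) {
            *out++ = mapped[hit - third];   /* three characters become one */
            in += 3;
        } else {
            *out++ = *in++;                 /* any other character is kept */
        }
    }
    *out = '\0';
}
```

Feeding it the eight characters `Huh???/n` yields `Huh?\n`: the first `?` is left alone because `???` is not a trigraph, and the following `??/` becomes a backslash. When writing such an input in C source, it is safer to spell each question mark as `\?` so the test itself is not rewritten by a compiler that has trigraphs enabled.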

Best regards
 
Simon Biber

Lucas said:
Is there any Lex code available that describes how to scan C programs?
I'd like to read something related to this. One of my doubts is how C
deals with ambiguities, for example, `a = x/*p;' or `a = x//*...*/-3;'
(considering C99's `//').

C uses a "greedy" tokenizer (often called "maximal munch"), i.e. it
tries to make the largest token possible at each point. So, x/*p is
always the start of a comment, not x divided by whatever p points to.
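
The decision a scanner makes when it meets a `/` can be sketched in a few lines of C. This is only an illustration of the greedy rule, not code from any real compiler; the enum, the function name `classify_slash`, and the `c99` flag are all invented for the example.

```c
/* What a '/' turns out to be once the following character is examined. */
enum slash_kind {
    SLASH_DIVIDE,          /* the operator /            */
    SLASH_DIV_ASSIGN,      /* the operator /=           */
    SLASH_BLOCK_COMMENT,   /* start of a C-style comment */
    SLASH_LINE_COMMENT     /* start of a // comment (C99 only) */
};

/* Classifies the input starting at s, where s[0] must be '/'.
 * c99 is nonzero when // comments are recognized. The longer
 * interpretation is always preferred over a bare '/'. */
enum slash_kind classify_slash(const char *s, int c99)
{
    if (s[1] == '*')
        return SLASH_BLOCK_COMMENT;
    if (s[1] == '/' && c99)
        return SLASH_LINE_COMMENT;
    if (s[1] == '=')
        return SLASH_DIV_ASSIGN;
    return SLASH_DIVIDE;
}
```

With `c99` set to 0, the text `//*...*/-3;` is classified as a division, and the scanner then finds a block comment at the second `/`. With `c99` set to 1, the same text starts a line comment straight away.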

Your second example is equivalent to a = x/ -3; on C89, but equivalent
to a = x (with no semicolon) on C99. One of the stranger ways to tell
the difference at run time is:

[sbiber@eagle c]$ cat version.c
#include <stdio.h>

int main(void)
{
    if(1//**/2
    ) printf("C99\n");
    else printf("C89\n");

    return 0;
}
[sbiber@eagle c]$ c89 version.c && ./a.out
C89
[sbiber@eagle c]$ c99 version.c && ./a.out
C99

Note how the closing parenthesis of the if statement must be on the next
line, so that it is not part of the C99 comment.
 
Old Wolf

Irrwahn said:
All occurrences in a source file of the following sequences of three
characters (called trigraph sequences) are replaced with the
corresponding single character.
    ??=  #     ??)  ]     ??!  |
    ??(  [     ??'  ^     ??>  }
    ??/  \     ??<  {     ??-  ~
No other trigraph sequences exist. Each ? that does not begin one
of the trigraphs listed above is not changed.

Should you ever notice that printf("Huh???/n"); prints Huh?
followed by a new-line, you now know why. :)

A more insidious example (plagiarized from www.gotw.ca article 86):

#include <stdio.h>

int main(void)
{
    int x = 1;
    int i;
    for( i = 0; i < 100; ++i )
        // What will the next line do? Increment???????????/
        ++x;
    printf("%d\n", x);
}
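
What makes this example work is the combination of two translation phases: phase 1 turns the trailing `??/` into a backslash, and phase 2 then deletes that backslash together with the newline after it, so `++x;` is joined onto the `//` comment line and the loop body becomes the `printf` call. Phase 2 on its own can be sketched as follows. `splice_lines` is a hypothetical helper, and the sketch assumes trigraph replacement has already been done; some compilers, GCC in its default mode among them, leave trigraphs alone unless asked, so results may vary.

```c
/* Copies in to out, deleting every backslash that is immediately
 * followed by a newline (translation phase 2, line splicing).
 * out must have room for strlen(in) + 1 bytes. Illustrative only. */
void splice_lines(const char *in, char *out)
{
    while (*in != '\0') {
        if (in[0] == '\\' && in[1] == '\n')
            in += 2;            /* drop the pair, joining the two lines */
        else
            *out++ = *in++;     /* everything else passes through */
    }
    *out = '\0';
}
```

Applied to a comment line that ends in a backslash, the function glues the following line onto the comment, which is how the increment disappears from the program.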
 
Charlie Gordon

Lucas Zimmerman said:
Amazing document! thanks a lot Irrwahn.
Interesting how `char x<:N:>;' is valid in C. Is this c99 too?
I'm still learning C after 3 years studying it!! There is always
something
new to know about this language.

It's been almost 25 years, and I'm still learning as well ;-)

Enjoy!

Chqrlie.
 
Lucas Zimmerman

another question...

I tried to compile the following code with gcc:
------
#include <stdio.h>
@

int main(void) {
    return 0;
}
-------

the output was:
t.c:2: error: syntax error at '@' token

My question then is: why does gcc say `syntax error'? I'm not
sure what is happening here, but I think the lexical analyzer
is passing '@' as a valid token to the parser, and then the parser
says `ok, I'm not expecting a @, so syntax error'.

Am I missing something? I thought lex would be responsible
for giving this error message, since '@' is (AFAIC) not a valid
C token.

thanks a lot in advance once again,

n.
 
Walter Roberson

I tried to compile the following code with gcc:
------
#include <stdio.h>
@

int main(void) {
    return 0;
}
-------
the output was:
t.c:2: error: syntax error at '@' token
My question then is: why does gcc say `syntax error'?

Why not?

I'm not sure what is happening here, but I think the lexical analyzer
is passing '@' as a valid token to the parser, and then the parser
says `ok, I'm not expecting a @, so syntax error'.
Am I missing something? I thought lex would be responsible
for giving this error message, since '@' is (AFAIC) not a valid
C token.

It appears to me that you are assuming that the program 'lex' is
being used to do lexical analysis, and that the result is passed
to gcc. gcc does not, however, use 'lex': it has its own built-in
lexical analyzer as -part- of its processing. gcc doesn't even
have a separate preprocessing program (e.g., "cpp"): it does
everything up to an intermediate code representation in a single
unified program. There might be a bunch of different routines
that that unified program calls upon, but that part is all one
program, so all the error messages are going to appear to be
from the same program.
 
Keith Thompson

Why not?



It appears to me that you are assuming that the program 'lex' is
being used to do lexical analysis, and that the result is passed
to gcc. gcc does not, however, use 'lex': it has its own built-in
lexical analyzer as -part- of its processing. gcc doesn't even
have a separate preprocessing program (e.g., "cpp"): it does
everything up to an intermediate code representation in a single
unified program. There might be a bunch of different routines
that that unified program calls upon, but that part is all one
program, so all the error messages are going to appear to be
from the same program.

Or perhaps he was using "lex" as an abbreviation of "lexical
analyzer". (In any case, the "lex" program *generates* a lexical
analyzer.)

Some versions of gcc do use a separate preprocessor. For example,
"gcc -v" with version 2.95.2 shows that it invokes "cpp" followed by
"cc1". Later versions just invoke "cc1". (Later phases aren't
invoked if there's a failure in an earlier phase.)

This is off-topic, except that it illustrates that a compiler has a
lot of freedom in how it implements the translation phases described
in section 5.1.1.2 of the standard.

With gcc versions 3.4.4 and 4.0.0, the error message I get is
"error: stray '@' in program".

Also, note that a lone @ character *is* a valid preprocessor token,
though it isn't a valid token. This means that this:

#if 0
@
#endif
int main(void){}

is a legal program, but this:

#if 0
"
#endif
int main(void){}

isn't (it invokes undefined behavior).

The point of all this is that, although the standard defines 8
distinct translation phases, an implementation is not required to
implement them as separate sequential phases. As long as it processes
legal programs correctly and issues diagnostics where required, it can
do whatever it likes.
 
Walter Roberson

Also, note that a lone @ character *is* a valid preprocessor token,
though it isn't a valid token. This means that this:
#if 0
@
#endif
int main(void){}
is a legal program,

Keith, I'm not quite sure how you get that? @ is not part of
the basic C character set, so how can its behaviour be well defined?

As the validity of the presence of @ would appear to be an
implementation extension, then that implementation extension could
treat @ as an alias for " for example.
 
Keith Thompson

Keith, I'm not quite sure how you get that? @ is not part of
the basic C character set, so how can its behaviour be well defined?

As the validity of the presence of @ would appear to be an
implementation extension, then that implementation extension could
treat @ as an alias for " for example.

You're right (at least partly); I didn't think of that.

C99 5.2.1 says that the source character set includes *at least* a
specified set of characters (upper and lower case letters, digits,
space, horizontal tab, vertical tab, form feed, and 29 punctuation
characters, *not* including '@'). But '@' can be, and often is, an
"extended character".

For an implementation that doesn't define '@' as part of the source
character set, any occurrence of @ in a source file invokes undefined
behavior (which, as you say, can include treating it as an alias for ").
But if '@' *is* part of the source character set, then it's a legal
preprocessor token (but not a legal token).
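
The character list above can be turned into a small membership test. This is only a sketch of the required minimum described in C99 5.2.1; an actual implementation is free to accept more characters, which is why '@' can behave differently from one compiler to another. The function name `in_basic_source_set` is invented here.

```c
#include <string.h>

/* Returns nonzero if c is one of the characters every implementation
 * must accept in source files: 52 letters, 10 digits, 29 punctuation
 * characters, space, horizontal tab, vertical tab and form feed.
 * Hypothetical helper; extended characters are deliberately excluded. */
int in_basic_source_set(int c)
{
    static const char members[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789"
        "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
        " \t\v\f";

    /* strchr would also match the terminating null, so rule that out. */
    return c != '\0' && strchr(members, c) != NULL;
}
```

Under this test '@', '$' and '`' all fall outside the guaranteed set, so whether they may appear in a source file at all is up to the implementation.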
 
