A simple parser

jacob navia · Oct 16, 2006

Summary:

I have changed (as proposed by Chuck) the code to use isalpha()
instead of (c>='a' && c <= 'z') etc.

I agree that EBCDIC exists

I eliminated the goto statement, obviously it is better in a tutorial
to stick to structured programming whenever possible...

Now the code is around 1000 bytes long. Not bad for what the code
is doing. But I was somehow disappointed that nobody questioned
the algorithm for finding all functions in a C file without
having a full blown C parser. Somehow, it is an important utility,
and it is very small and simple.

Thanks to all people that participated in the discussion.

jacob

Keith Thompson · Oct 16, 2006

jacob navia said:
Summary:

I have changed (as proposed by Chuck) the code to use isalpha()
instead of (c>='a' && c <= 'z') etc.

I agree that EBCDIC exists

I eliminated the goto statement, obviously it is better in a tutorial
to stick to structured programming whenever possible...

Now the code is around 1000 bytes long. Not bad for what the code
is doing. But I was somehow disappointed that nobody questioned
the algorithm for finding all functions in a C file without
having a full blown C parser. Somehow, it is an important utility,
and it is very small and simple.

Thanks to all people that participated in the discussion.

You're welcome.

Perhaps if you posted the revised code, you'd get more substantial
comments.

I haven't yet looked at your code in any detail, but it occurs to me
that you can't *reliably* find all functions in a C file without
having a full C parser *and* preprocessor. I worked on something
related some years ago (an application that searched for struct and
union declarations in header files) and I had to (a) use a full C
parser, with special treatment for typedefs, and (b) feed my tool the
output of the C preprocessor (the alternative would have been to
re-implement the preprocessor).

A more heuristic approach that catches *most* function declarations in
real code might very well be good enough for most purposes, but it's
important to note the limitations. (I don't remember whether you did
so in your original post.)

One thing to watch out for is whether your tool works correctly on
machine-generated C source code. I suspect you may be making some
assumptions about the code layout, aspects that are ignored by the
compiler and that are likely to be ignored by anything that generates
C source code (such as lex, yacc, or a frontend for another language).

If your tool isn't intended to work on such code, that's fine, but you
should explicitly note that fact.
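
To make the limitation concrete, here is a small hypothetical input (the names DEFINE_GETTER, get_answer, and pick_getter are invented for illustration) that is valid C but awkward for a scanner that only looks for a `)` followed by a `{`. The first function is generated by a macro, so no `{` follows the `)` where the macro is used. The second returns a pointer to a function, so its name sits inside parentheses, where the scanner posted later in this thread does not record identifiers.

```c
/* Hypothetical test input for a function-finding heuristic. */

/* The function body lives in the macro, not at the point of use. */
#define DEFINE_GETTER(name, value) \
    static int get_##name(void) { return (value); }

DEFINE_GETTER(answer, 42)   /* defines get_answer(); nothing but a newline follows the ')' */

/* A function returning a pointer to function: the name "pick_getter"
   appears inside the outer parentheses of the declarator. */
static int (*pick_getter(void))(void)
{
    return get_answer;
}
```

A heuristic tool would be expected to miss or misname both definitions. Exactly how it fails depends on the scanner, which is why feeding it preprocessed output is the more reliable route.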

--
Keith Thompson (The_Other_Keith) (e-mail address removed) <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

jacob navia · Oct 16, 2006

Keith said:
Perhaps if you posted the revised code, you'd get more substantial
comments.

OK. Here it is. Blanks instead of tabs.
---------------------------------------------------------------cut here
/* A simple scanner that will take a file of C source code and
print the names of all functions therein, in the following format:
"Function XXXX found line dddd .... ddddd"
Algorithm. It scans for a terminating parenthesis and an immediately
following opening brace. Comments can appear between the closing paren
and the opening brace, but no other characters besides white space.
Functions must have the correct prototype, K & R syntax is not supported.
*/
#include <stdio.h>
#include <ctype.h>
// Longest Identifier we support. Sorry Java guys..
#define MAXID 1024
// Buffer for remembering the function name
static char IdBuffer[MAXID];
// Line number counter. We start at line 1
static int line = 1;

// This function reads a character and if it is \n it bumps
// the line counter.
static int Fgetc(FILE *f)
{
int c = fgetc(f);
if (c == '\n')
line++;
return c;
}

// Return 1 if the character is a legal C identifier
// character, zero if not. The parameter "Start"
// means if an identifier START character (excluding
// numbers) is desired.
static int IsIdentifier(int c,int start)
{
if (c == '_' || isalpha(c))
return 1;
if (start == 0 && isdigit(c))
return 1;
return 0;
}
// Just prints the function name
static int PrintFunction(FILE *f)
{
printf("Function %s: line %d ...",IdBuffer,line);
return Fgetc(f);
}

// Reads a global identifier into our name buffer
static int ReadId(char c,FILE *f)
{
int i = 1;
IdBuffer[0] = c;
while (i < MAXID-1) {
c = Fgetc(f);
if (c != EOF) {
if (IsIdentifier(c,0))
IdBuffer[i++] = c;
else break;
}
else break;
}
IdBuffer[i] = 0;
return c;
}

// Skips strings
static int ParseString(FILE *f)
{
int c = Fgetc(f);
while (c != EOF && c != '"') {
if (c == '\\')
c = Fgetc(f);
if (c != EOF)
c = Fgetc(f);
}
if (c == '"')
c = Fgetc(f);
return c;
}
// Skips comments
static int ParseComment(FILE *f)
{
int c = Fgetc(f);

while (1) {
while (c != '*') {
c = Fgetc(f);
if (c == EOF)
return EOF;
}
c = Fgetc(f);
if (c == '/')
break;
}
return Fgetc(f);
}

// Skips // comments
static int ParseCppComment(FILE *f)
{
int c = Fgetc(f);
while (c != EOF && c != '\n') {
if (c == '\\')
c = Fgetc(f);
if (c != EOF)
c = Fgetc(f);
}
if (c == '\n')
c = Fgetc(f);
return c;
}

// Checks if a comment is followed after a '/' char
static int CheckComment(int c,FILE *f)
{
if (c == '/') {
c = Fgetc(f);
if (c == '*')
c = ParseComment(f);
else if (c == '/')
c = ParseCppComment(f);
}
return c;
}

// Skips white space and comments
static int SkipWhiteSpace(int c,FILE *f)
{
c = CheckComment(c,f);
do {
if (c <= ' ')
c = Fgetc(f);
c = CheckComment(c,f);
}
while (c <= ' ');
return c;
}

// Skips chars between simple quotes
static int ParseQuotedChar(FILE *f)
{
int c = Fgetc(f);
while (c != EOF && c != '\'') {
if (c == '\\')
c = Fgetc(f);
if (c != EOF)
c = Fgetc(f);
}
if (c == '\'')
c = Fgetc(f);
return c;
}

int main(int argc,char *argv[]){
if (argc == 1) {
printf("Usage: %s <file.c>\n",argv[0]);
return 1;
}
FILE *f = fopen(argv[1],"r");
if (f == NULL) {
printf("Can't find %s\n",argv[1]);
return 2;
}
int c = Fgetc(f);
int level = 0;
int parenlevel = 0;
int inFunction = 0;
while (c != EOF) {
// Note that each of the switches must advance the character
// read so that we avoid an infinite loop.
switch (c) {
case '"':
c = ParseString(f);
break;
case '/':
c = CheckComment(c,f);
break;
case '\'':
c = ParseQuotedChar(f);
break;
case '{':
level++;
c = Fgetc(f);
break;
case '}':
if (level == 1 && inFunction) {
printf(" %d\n",line);
inFunction = 0;
}
if (level > 0)
level--;
c = Fgetc(f);
break;
case '(':
parenlevel++;
c = Fgetc(f);
break;
case ')':
if (parenlevel > 0)
parenlevel--;
c = Fgetc(f);
if ((parenlevel|level) == 0) {
c = SkipWhiteSpace(c,f);
if (c == '{') {
level++;
inFunction = 1;
c = PrintFunction(f);
}
}
break;
default:
if ((level | parenlevel) == 0 && IsIdentifier(c,1))
c = ReadId(c,f);
else c = Fgetc(f);
}
}
fclose(f);
return 0;
}
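
One detail in ReadId as posted is worth a second look: its first parameter is declared `char`, yet it stores the `int` result of Fgetc in it and then compares against EOF. Whether that comparison can ever succeed depends on the signedness of plain `char`. The following is a minimal sketch of an `int`-based variant, not the author's code. The names is_ident_char and read_id are hypothetical, it calls plain fgetc to stay self-contained (so it does not update the line counter), and unlike the original it keeps consuming an overlong identifier while truncating what it stores.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if c may appear in a C identifier; digits are rejected
   when start is nonzero. EOF is never an identifier character. */
static int is_ident_char(int c, int start)
{
    if (c == EOF)
        return 0;
    if (c == '_' || isalpha((unsigned char)c))
        return 1;
    return !start && isdigit((unsigned char)c);
}

/* Stores the identifier beginning with `first` (already read and already
   known to be a valid start character) into buf, truncated to size - 1
   characters. Returns the first character after the identifier, or EOF. */
static int read_id(int first, FILE *f, char *buf, size_t size)
{
    size_t i = 0;
    int c = first;              /* int, so EOF stays distinguishable */

    while (is_ident_char(c, 0)) {
        if (i + 1 < size)
            buf[i++] = (char)c;
        c = fgetc(f);
    }
    if (size > 0)
        buf[i] = '\0';
    return c;
}
```

Casting to `unsigned char` before calling isalpha or isdigit avoids undefined behavior for negative character values.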

Keith Thompson · Oct 16, 2006

jacob navia said:
OK. Here it is. Blanks instead of tabs.
---------------------------------------------------------------cut here

[snip]

You're still using "//" comments, and mixing declarations and
statements.

I can see the use for the latter, but the reasons for avoiding "//"
comments on Usenet, even assuming they're 100% legal and portable,
have been explained here many times.

CBFalconer · Oct 17, 2006

Keith said:
jacob navia said:

OK. Here it is. Blanks instead of tabs.
-------------------------------------------------cut here

[snip]

You're still using "//" comments, and mixing declarations and
statements.

I can see the use for the latter, but the reasons for avoiding "//"
comments on Usenet, even assuming they're 100% legal and portable,
have been explained here many times.

For those interested, here is Jacob's code revised for portability to
C90. It will even compile under C99. I also passed it through
indent. Now you should be able to compile and thrash.

Aside to Jacob - see how easy it is to be semi-portable. There are still
portability problems, such as return 1 in main and the use of
"c <= ' '". I hope this isn't the lexer in lcc.

/*
Subject: Re: A simple parser
Date: Tue, 17 Oct 2006 00:08:19 +0200
From: jacob navia (e-mail address removed)
Newsgroups: comp.lang.c

Keith said:
Perhaps if you posted the revised code, you'd get more
substantial comments.

OK. Here it is. Blanks instead of tabs.
------------------------cut here */

/* A simple scanner that will take a file of C source code and
print the names of all functions therein, in the following
format:
"Function XXXX found line dddd .... ddddd"
Algorithm. It scans for a terminating parentheses and an
immediately following opening brace. Comments can appear
between the closing paren and the opening braces, but no
other characters besides white space. Functions must have
the correct prototype, K & R syntax is not supported.
*/
#include <stdio.h>
#include <ctype.h>

/* Longest Identifier we support. Sorry Java guys. */
#define MAXID 1024

/* Buffer for remembering the function name */
static char IdBuffer[MAXID];

/* Line number counter. We start at line 1 */
static int line = 1;

/* This function reads a character and
if it is \n it bumps the line counter. */
static int Fgetc(FILE * f)
{
int c;

c = fgetc(f);
if (c == '\n')
line++;
return c;
}

/* Return 1 if the character is a legal C identifier
character, zero if not. The parameter "Start"
means if an identifier START character (excluding
numbers) is desired */
static int IsIdentifier(int c, int start)
{
if (c == '_' || isalpha(c))
return 1;
if (start == 0 && isdigit(c))
return 1;
return 0;
}

/* Just prints the function name */
static int PrintFunction(FILE * f)
{
printf("Function %s: line %d ...", IdBuffer, line);
return Fgetc(f);
}

/* Reads a global identifier into our name buffer */
static int ReadId(char c, FILE * f)
{
int i = 1;

IdBuffer[0] = c;
while (i < MAXID - 1) {
c = Fgetc(f);
if (c != EOF) {
if (IsIdentifier(c, 0))
IdBuffer[i++] = c;
else
break;
}
else
break;
}
IdBuffer[i] = 0;
return c;
}

/* Skips strings */
static int ParseString(FILE * f)
{
int c;

c = Fgetc(f);
while (c != EOF && c != '"') {
if (c == '\\')
c = Fgetc(f);
if (c != EOF)
c = Fgetc(f);
}
if (c == '"')
c = Fgetc(f);
return c;
}

/* Skips comments */
static int ParseComment(FILE * f)
{
int c;

c = Fgetc(f);
while (1) {
while (c != '*') {
c = Fgetc(f);
if (c == EOF)
return EOF;
}
c = Fgetc(f);
if (c == '/')
break;
}
return Fgetc(f);
}

/* Skips // comments */
static int ParseCppComment(FILE * f)
{
int c;

c = Fgetc(f);
while (c != EOF && c != '\n') {
if (c == '\\')
c = Fgetc(f);
if (c != EOF)
c = Fgetc(f);
}
if (c == '\n')
c = Fgetc(f);
return c;
}

/* Checks if a comment is followed after a '/' char */
static int CheckComment(int c, FILE * f)
{
if (c == '/') {
c = Fgetc(f);
if (c == '*')
c = ParseComment(f);
else if (c == '/')
c = ParseCppComment(f);
}
return c;
}

/* Skips white space and comments */
static int SkipWhiteSpace(int c, FILE * f)
{
c = CheckComment(c, f);
do {
if (c <= ' ')
c = Fgetc(f);
c = CheckComment(c, f);
}
while (c <= ' ');
return c;
}

/* Skips chars between simple quotes */
static int ParseQuotedChar(FILE * f)
{
int c;

c = Fgetc(f);
while (c != EOF && c != '\'') {
if (c == '\\')
c = Fgetc(f);
if (c != EOF)
c = Fgetc(f);
}
if (c == '\'')
c = Fgetc(f);
return c;
}

int main(int argc, char *argv[])
{
FILE *f;
int c;
int level = 0;
int parenlevel = 0;
int inFunction = 0;

if (argc == 1) {
printf("Usage: %s <file.c>\n", argv[0]);
return 1;
}

f = fopen(argv[1], "r");
if (f == NULL) {
printf("Can't find %s\n", argv[1]);
return 2;
}

c = Fgetc(f);
while (c != EOF) {
/* Note that each of the switches must advance the
character read so that we avoid an infinite loop. */
switch (c) {
case '"':
c = ParseString(f);
break;
case '/':
c = CheckComment(c, f);
break;
case '\'':
c = ParseQuotedChar(f);
break;
case '{':
level++;
c = Fgetc(f);
break;
case '}':
if (level == 1 && inFunction) {
printf(" %d\n", line);
inFunction = 0;
}
if (level > 0)
level--;
c = Fgetc(f);
break;
case '(':
parenlevel++;
c = Fgetc(f);
break;
case ')':
if (parenlevel > 0)
parenlevel--;
c = Fgetc(f);
if ((parenlevel | level) == 0) {
c = SkipWhiteSpace(c, f);
if (c == '{') {
level++;
inFunction = 1;
c = PrintFunction(f);
}
}
break;
default:
if ((level | parenlevel) == 0 && IsIdentifier(c, 1))
c = ReadId(c, f);
else
c = Fgetc(f);
}
}
fclose(f);
return 0;
}
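
On the `c <= ' '` test flagged above: it relies on every white-space character sorting at or below the space character, which the C standard does not promise, and it is also true for EOF (a negative value), so a loop conditioned on it has no way to stop at end of file. A minimal sketch of an alternative built on isspace follows; the names is_blank_char and skip_blanks are hypothetical, and comment skipping is left out to keep the example short.

```c
#include <ctype.h>
#include <stdio.h>

/* Returns nonzero for white space in the current locale, zero for
   everything else, including EOF. */
static int is_blank_char(int c)
{
    return c != EOF && isspace((unsigned char)c);
}

/* Skips white space starting at c and returns the first character
   that is not white space, or EOF at end of input. */
static int skip_blanks(int c, FILE *f)
{
    while (is_blank_char(c))
        c = fgetc(f);
    return c;
}
```

Because EOF is rejected explicitly, the loop ends when the input ends, even if the file stops right after a closing parenthesis.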

Bill Pursell · Oct 17, 2006

CBFalconer wrote (modified code by Jacob Navia):

static int line = 1;

/* This function reads a character and
if it is \n it bumps the line counter. */
static int Fgetc(FILE * f)
{
int c;

c = fgetc(f);
if (c == '\n')
line++;
return c;
}

c = Fgetc(f);
while (c != EOF) {
/* Each case must advance the read of f */
....
}

I'm curious to hear opinions on the following adaptation:

#ifndef NDEBUG
static size_t count;
#endif

static int
Fgetc(FILE *f)
{
int c;

c = fgetc(f);
#ifndef NDEBUG
count++;
#endif
if (c == '\n')
line++;
return c;
}

while (c != EOF) {
#ifndef NDEBUG
size_t prev_count = count;
#endif
...
assert( count == prev_count +1);
}

To paraphrase, I'm replacing the comment that
each case must read a character from the input
stream with an assertion. Doing so requires
a little bit of extra overhead in the code. Is that
overhead (and potential bugs that it might introduce)
worth it? It strikes me that the potential for having
a bug is now lessened, and the resulting bug will
be easier to detect. Other opinions appreciated.

Also, does the assertion accurately capture the
overflow of count? I started to write an explicit
check for count ==0, but realized that the assertion
will still hold when count overflows.
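
On the overflow question: size_t is an unsigned type, and unsigned arithmetic in C wraps around instead of overflowing, so a check of the form `count == prev_count + 1` still holds when count wraps from its maximum to 0. The sketch below adds a cast so that this holds even on an implementation where size_t is narrower than int. A separate concern is that several cases in the posted scanner (string and comment skipping, for instance) consume more than one character per iteration, so an exact "+1" check would probably fire on correct code; checking that the counter changed is closer to the original comment. The helper names are hypothetical.

```c
#include <stddef.h>

/* True if the counter advanced by exactly one, wraparound included. */
static int advanced_by_one(size_t before, size_t after)
{
    return after == (size_t)(before + 1);
}

/* True if at least one character was consumed. This is the weaker
   property the original comment asks for. */
static int advanced(size_t before, size_t after)
{
    return after != before;
}
```

Inside the loop the check would then read something like `assert(advanced(prev_count, count));`.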

jacob navia · Oct 17, 2006

Keith said:
jacob navia said:

OK. Here it is. Blanks instead of tabs.
---------------------------------------------------------------cut here

[snip]

You're still using "//" comments, and mixing declarations and
statements.

I can see the use for the latter, but the reasons for avoiding "//"
comments on Usenet, even assuming they're 100% legal and portable,
have been explained here many times.

Look, I have spent all my time in the last 5-8 years writing
a C99 compliant version of my compiler system. To tell me
that now I should please heathfield and co and come back to C89
because he doesn't want to change some compiler options is too
much really.

Richard Heathfield · Oct 17, 2006

jacob navia said:

Look, I have spent all my time in the last 5-8 years writing
a C99 compliant version of my compiler system.

I was not aware that lcc-win32 was C99-"compliant". Last time this came up,
you denied it. Has the situation changed? Or do you just mean that the
source code of the compiler is C99-conforming? That should not have taken
5-8 years to achieve. I've been writing C99-conforming code (without ever
having /heard/ of C99), since long before 1999. It's hardly difficult, is
it?

To tell me
that now I should please heathfield and co and come back to C89
because he doesn't want to change some compiler options is too
much really.

You misunderstand the point entirely. If the point were to please
"heathfield", then of course you'd be foolish to take any notice of some
random hack on Usenet with what you perceive as an axe to grind. This is
far more about the "and co" than it is about "heathfield".

Why do some people - especially comp.lang.c subscribers - choose to invoke
their compilers in conforming mode? Why, it's so that they can get as much
compiler assurance as possible that their source code will be portable to
other compilers, other platforms, other operating systems.

But surely everywhere supports // comments, doesn't it? Well, no, everywhere
doesn't support // comments. And even if everywhere did support such
comments, to get *gcc* to support them we have to invoke it in a mode that
conforms neither to C90 nor to C99, and thus we don't get as much compiler
assurance as possible that the rest of our source code will be portable to
other compilers, other platforms, other operating systems. What is true for
me is true for others, too. This isn't about "heathfield". This is about
those who write C portably because they need their code to work even on
computers that Jacob Navia hasn't heard of and perhaps cannot imagine.

What's more, even if my version of gcc had a C99 mode (which it doesn't),
would there be any point in using it? Well, yes IF my intent were to use
C99 features in my own code. But, as someone whose code must remain as
portable as it reasonably can be, I dare not use C99 features that are not
also C90 features until C99 becomes as widely implemented as C90 currently
is. And, practically speaking, that means there's no point even /looking/
until Microsoft, gcc, and at least one mainframe implementor provide
C99-conforming implementations (compiler *and* library) off the shelf.
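
A well-known illustration of why "//" is more than a harmless extension: the same source text can be valid under both C90 and C99 and still mean different things. The function below is a sketch with a hypothetical name. Under C99 rules the "//" starts a comment and the function returns 10; under strict C90 rules the "/" is a division operator followed by a block comment, and it returns 5.

```c
/* Returns 10 when "//" is treated as a comment (C99 and later, and
   many compilers in their default modes), 5 under strict C90 rules. */
static int slash_slash_probe(void)
{
    return 10 //* under C90 rules this starts a block comment */ 2
        ;
}
```

Which value comes back is a quick way to see which comment rules a given compiler invocation is applying.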

Richard · Oct 17, 2006

Richard Heathfield said:
jacob navia said:

I was not aware that lcc-win32 was C99-"compliant". Last time this came up,
you denied it. Has the situation changed? Or do you just mean that the
source code of the compiler is C99-conforming? That should not have taken
5-8 years to achieve. I've been writing C99-conforming code (without ever
having /heard/ of C99), since long before 1999. It's hardly difficult, is
it?

I would love to know how many target platforms your code is "ported to"
that don't support a form of GCC and the C99 subsets Jacob uses.

While it is laudable that you think all your code should compile on
every C compiler out there, I don't think that makes it of paramount
importance for Jacob, or for those of us who do make use of compiler and
platform specifics because of a need to harness some feature or
optimization.

I have spent a lot of time programming GUI messaging systems in C - they
sure as hell won't port, and there was no reason whatsoever to consider
that the C99 features used would be detrimental to the code's continued
use.

Keith Thompson · Oct 17, 2006

jacob navia said:
Keith said:

jacob navia said:

Keith Thompson wrote:

Perhaps if you posted the revised code, you'd get more substantial
comments.

OK. Here it is. Blanks instead of tabs.
---------------------------------------------------------------cut here

[snip]
You're still using "//" comments, and mixing declarations and
statements.
I can see the use for the latter, but the reasons for avoiding "//"
comments on Usenet, even assuming they're 100% legal and portable,
have been explained here many times.

Look, I have spent all my time in the last 5-8 years writing
a C99 compliant version of my compiler system. To tell me
that now I should please heathfield and co and come back to C89
because he doesn't want to change some compiler options is too
much really.

lcc-win32 is not C99 compliant, and I am not Richard Heathfield.

Shall I explain again why "//" comments on Usenet are a bad idea, or
do you remember the reasons?

jacob navia · Oct 17, 2006

Keith said:
jacob navia said:

Keith said:

Keith Thompson wrote:

Perhaps if you posted the revised code, you'd get more substantial
comments.

OK. Here it is. Blanks instead of tabs.
---------------------------------------------------------------cut here

[snip]
You're still using "//" comments, and mixing declarations and
statements.
I can see the use for the latter, but the reasons for avoiding "//"
comments on Usenet, even assuming they're 100% legal and portable,
have been explained here many times.

Look, I have spent all my time in the last 5-8 years writing
a C99 compliant version of my compiler system. To tell me
that now I should please heathfield and co and come back to C89
because he doesn't want to change some compiler options is too
much really.

lcc-win32 is not C99 compliant, and I am not Richard Heathfield.

Shall I explain again why "//" comments on Usenet are a bad idea, or
do you remember the reasons?

I cut the lines so that they pass everywhere. I verified that. The
message looked perfectly fine in my news reader. But you are
right, line breaking *could* be a problem.

CBFalconer · Oct 18, 2006

CBFalconer said:
.... snip ...

For those interested, here is Jacob's code revised for portability to
C90. It will even compile under C99. I also passed it through
indent. Now you should be able to compile and thrash.

Aside to Jacob - see how easy it is to be semi-portable. There are still
portability problems, such as return 1 in main and the use of
"c <= ' '". I hope this isn't the lexer in lcc.

.... snip ...

I did some further simple reformatting and removal of hideous
coding. Now I have a peculiar anomaly. The code compiles with gcc
and runs on itself quite nicely when the output file is "a.exe" (by
default). Operation with no params gives help. However, once I
rename that to "cfunct.exe" no parameter operation goes wild, and
execution on its own code gives nothing. I suspect some of the
non-standard coding is the reason, and will look at it further
later. This was done with gcc 3.2.1 under DJGPP and W98. The
actual version compiled is below, for those interested.

My instinct tells me that the use of argv[0] is to blame. I doubt
very much that other systems will have the same malfunction. At
any rate, Jacob's code has been shown to be non-portable.

This all showed up when I started to build a simple filter to
exterminate // comments.

An aside: I always format do while loops as:

do {
...
} while (condition);

indent, as I have it set up now, makes that

do {
...
}
while (condition);

which casual reading parses as a while loop with an empty statement
phase. I initially revised that to "while (condition) continue;",
which produced parse errors and located the problem. This also
shows up the advantage of using continue in empty loops.

/* A simple scanner that will take a file of C source code and
print the names of all functions therein, in the following
format:
"Function XXXX found line dddd .... ddddd"
Algorithm. It scans for a terminating parentheses and an
immediately following opening brace. Comments can appear
between the closing paren and the opening braces, but no
other characters besides white space. Functions must have
the correct prototype, K & R syntax is not supported.
*/
#include <stdio.h>
#include <ctype.h>

/* Longest Identifier we support. Sorry Java guys. */
#define MAXID 1024

/* Buffer for remembering the function name */
static char IdBuffer[MAXID];

/* Line number counter. We start at line 1 */
static int line = 1;

/* ----------------- */

/* This function reads a character and
if it is \n it bumps the line counter. */
static int Fgetc(FILE * f)
{
int c;

if ('\n' == (c = fgetc(f))) line++;
return c;
} /* Fgetc */

/* ----------------- */

/* Return 1 if the character is a legal C identifier
character, zero if not. The parameter "Start"
means if an identifier START character (excluding
numbers) is desired */
static int IsIdentifier(int c, int start)
{
if (c == '_' || isalpha(c)) return 1;
if (start == 0 && isdigit(c)) return 1;
return 0;
} /* IsIdentifier */

/* ----------------- */

/* Just prints the function name */
static int PrintFunction(FILE * f)
{
printf("Function %s: line %d ...", IdBuffer, line);
return Fgetc(f);
} /* PrintFunction */

/* ----------------- */

/* Reads a global identifier into our name buffer */
static int ReadId(char c, FILE * f)
{
int i = 1;

IdBuffer[0] = c;
while (i < MAXID - 1) {
c = Fgetc(f);
if (EOF == c) break;
else {
if (IsIdentifier(c, 0)) IdBuffer[i++] = c;
else break;
}
}
IdBuffer[i] = 0;
return c;
} /* ReadId */

/* ----------------- */

/* Skips strings */
static int ParseString(FILE * f)
{
int c;

c = Fgetc(f);
while (c != EOF && c != '"') {
if (c == '\\') c = Fgetc(f);
if (c != EOF) c = Fgetc(f);
}
if (c == '"') c = Fgetc(f);
return c;
} /* ParseString */

/* ----------------- */

/* Skips comments */
static int ParseComment(FILE * f)
{
int c;

c = Fgetc(f);
while (1) {
while (c != '*') {
c = Fgetc(f);
if (c == EOF) return EOF;
}
c = Fgetc(f);
if (c == '/') break;
}
return Fgetc(f);
} /* ParseComment */

/* ----------------- */

/* Skips // comments */
static int ParseCppComment(FILE * f)
{
int c;

c = Fgetc(f);
while (c != EOF && c != '\n') {
if (c == '\\') c = Fgetc(f);
if (c != EOF) c = Fgetc(f);
}
if (c == '\n') c = Fgetc(f);
return c;
} /* ParseCppComment */

/* ----------------- */

/* Checks if a comment is followed after a '/' char */
static int CheckComment(int c, FILE * f)
{
if (c == '/') {
c = Fgetc(f);
if (c == '*') c = ParseComment(f);
else if (c == '/') c = ParseCppComment(f);
}
return c;
} /* CheckComment */

/* ----------------- */

/* Skips white space and comments */
static int SkipWhiteSpace(int c, FILE * f)
{
c = CheckComment(c, f);
do {
if (c <= ' ') c = Fgetc(f);
c = CheckComment(c, f);
} while (c <= ' ');
return c;
} /* SkipWhiteSpace */

/* ----------------- */

/* Skips chars between simple quotes */
static int ParseQuotedChar(FILE * f)
{
int c;

c = Fgetc(f);
while (c != EOF && c != '\'') {
if (c == '\\') c = Fgetc(f);
if (c != EOF) c = Fgetc(f);
}
if (c == '\'') c = Fgetc(f);
return c;
} /* ParseQuotedChar */

/* ----------------- */

int main(int argc, char *argv[])
{
FILE *f;
int c;
int level = 0;
int parenlevel = 0;
int inFunction = 0;

if (argc == 1) {
printf("Usage: %s <file.c>\n", argv[0]);
return 1;
}

f = fopen(argv[1], "r");
if (f == NULL) {
printf("Can't find %s\n", argv[1]);
return 2;
}

c = Fgetc(f);
while (c != EOF) {
/* Note that each of the switches must advance the
character read so that we avoid an infinite loop. */
switch (c) {
case '"':
c = ParseString(f);
break;
case '/':
c = CheckComment(c, f);
break;
case '\'':
c = ParseQuotedChar(f);
break;
case '{':
level++;
c = Fgetc(f);
break;
case '}':
if (level == 1 && inFunction) {
printf(" %d\n", line);
inFunction = 0;
}
if (level > 0) level--;
c = Fgetc(f);
break;
case '(':
parenlevel++;
c = Fgetc(f);
break;
case ')':
if (parenlevel > 0) parenlevel--;
c = Fgetc(f);
if ((parenlevel | level) == 0) {
c = SkipWhiteSpace(c, f);
if (c == '{') {
level++;
inFunction = 1;
c = PrintFunction(f);
}
}
break;
default:
if ((level | parenlevel) == 0 && IsIdentifier(c, 1))
c = ReadId(c, f);
else
c = Fgetc(f);
}
}
fclose(f);
return 0;
} /* main, cfunct.c */

Flash Gordon · Oct 18, 2006

CBFalconer said:
CBFalconer wrote:
... snip ...

For those interested, here is Jacob's code revised for portability to
C90. It will even compile under C99. I also passed it through
indent. Now you should be able to compile and thrash.

Aside to Jacob - see how easy it is to be semi-portable. There are still
portability problems, such as return 1 in main and the use of
"c <= ' '". I hope this isn't the lexer in lcc.

... snip ...

I did some further simple reformatting and removal of hideous
coding. Now I have a peculiar anomaly. The code compiles with gcc
and runs on itself quite nicely when the output file is "a.exe" (by
default). Operation with no params gives help. However, once I
rename that to "cfunct.exe" no parameter operation goes wild, and
execution on its own code gives nothing. I suspect some of the
non-standard coding is the reason, and will look at it further
later. This was done with gcc 3.2.1 under DJGPP and W98. The
actual version compiled is below, for those interested.

My instinct tells me that the use of argv[0] is to blame.

Possibly since you certainly don't allow for it being NULL.

int main(int argc, char *argv[])
{
FILE *f;
int c;
int level = 0;
int parenlevel = 0;
int inFunction = 0;

if (argc == 0) {
printf("Usage: <file.c>\n");
return 1;
}

/* was: if (argc == 1) { */
if (argc == 1 || argc > 2) {
printf("Usage: %s <file.c>\n", argv[0]);
return 1;
}

<snip>

The return value is still non-standard.
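
The argv[0] concern can also be handled in one place. The standard allows argc to be 0, in which case argv[0] is a null pointer, and allows argv[0] to be an empty string when the program name is unavailable. A minimal sketch, assuming a hypothetical helper named program_name and a caller-supplied fallback:

```c
#include <stddef.h>

/* Returns a printable program name even when argc is 0 or argv[0]
   is a null pointer or an empty string. */
static const char *program_name(int argc, char *argv[], const char *fallback)
{
    if (argc > 0 && argv != NULL && argv[0] != NULL && argv[0][0] != '\0')
        return argv[0];
    return fallback;
}
```

The usage message then becomes something like `printf("Usage: %s <file.c>\n", program_name(argc, argv, "cfunct"));`.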

jacob navia · Oct 18, 2006

Flash said:
<snip>

The return value is still non-standard.

???
Since when is there a standard return value?

This returns zero for no error, 1 for argument error
and 2 if the file could not be opened...

Keith Thompson · Oct 18, 2006

jacob navia said:
???
Since when is there a standard return value?

This returns zero for no error, 1 for argument error
and 2 if the file could not be opened...

The standard return values for main() are EXIT_SUCCESS or 0 for
success, EXIT_FAILURE for failure. Any other values are non-portable.
In particular, there are real-world systems where "exit(1)" or
"return 1;" from main() will cause the program to terminate and
indicate *success* to the calling environment.

It's often possible to define return values other than the standard
ones, but they're likely to be system-specific, and they should be
clearly documented.

You didn't know that?

jacob navia · Oct 18, 2006

Keith said:
The standard return values for main() are EXIT_SUCCESS or 0 for
success, EXIT_FAILURE for failure. Any other values are non-portable.
In particular, there are real-world systems where "exit(1)" or
"return 1;" from main() will cause the program to terminate and
indicate *success* to the calling environment.

It's often possible to define return values other than the standard
ones, but they're likely to be system-specific, and they should be
clearly documented.

You didn't know that?

Excuse me but it is
int main(int argc,char *argv[])

"int" means at least 16 bits return value. I can choose more
than 30 000 values, and I used 3:
zero for no error, one for argument error, and two for open failure
error. Other error codes (that I do not use) could be syntax error in
the source file, etc.

EXIT_SUCCESS or EXIT_FAILURE are just too few values to use.
Or you mean that all error codes are unnecessary and that
only "failure" should be returned instead of more detailed
error reports???

I can't understand the argument here, which is not based on
any standard whatsoever. "main" returns an "int", not a boolean
value of just success or failure. And there is a reason for that.

Error codes are a habit for me. I always use them to convey
more information to the calling program than just "failure"...

WHAT FAILED?

The file couldn't be opened? Syntax error in the file?

Error codes allow you to differentiate the different possibilities.

Mark McIntyre · Oct 19, 2006

Excuse me but it is
int main(int argc,char *argv[])

"int" means at least 16 bits return value. I can choose more
than 30 000 values, and I used 3:

Keith's point is that only the three listed above are guaranteed by
the standard to be meaningful.

For your information, many OSen treat return values as specific error
codes. Defining your own is fraught with peril. I recall that
returning a large positive multiple of two or any negative number on
any VMS based system, could provide hours of amusement with the scary
messages you get from the OS. I recall once being told that I had
initiated a cluster-wide shutdown due to a fire alert, or being
requested to place the tape into DRA0 or somesuch....

Even MSDOS does this - there's a prescribed set of error codes in one
of the system headers that lists what you should return for no memory,
invalid file handle etc. If you return one of these values, some DOS
tools will assume you have encountered such an error, and take
unnecessary remedial action.

Error codes are a habit for me. I always use them to convey
more information to the calling program than just "failure"...

This is a good plan. Keith's point is that you can't do this portably
with the return from main, without unexpected side-effects. You need
to find a different way to signal the precise error to the user, or
drop into system-specific return codes and consider portability
issues.
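
One way to keep the detail without relying on system-specific exit values is to report the precise error on stderr and return only the portable status. A minimal sketch, assuming hypothetical names (scan_error, scan_error_text, report) that do not appear in the posted program:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical detailed error codes, used only inside the program. */
enum scan_error { SCAN_OK, SCAN_BAD_ARGS, SCAN_CANNOT_OPEN };

static const char *scan_error_text(enum scan_error e)
{
    switch (e) {
    case SCAN_OK:          return "no error";
    case SCAN_BAD_ARGS:    return "bad arguments";
    case SCAN_CANNOT_OPEN: return "cannot open file";
    }
    return "unknown error";
}

/* Prints the detail for the user and maps the code to one of the
   two statuses the C standard defines for failure and success. */
static int report(enum scan_error e, const char *detail)
{
    if (e == SCAN_OK)
        return EXIT_SUCCESS;
    fprintf(stderr, "error: %s%s%s\n", scan_error_text(e),
            detail != NULL ? ": " : "", detail != NULL ? detail : "");
    return EXIT_FAILURE;
}
```

main would then end with something like `return report(SCAN_CANNOT_OPEN, argv[1]);`. A calling program that needs the distinction has to read the message, or the tool has to document system-specific exit values, as discussed above.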

Error codes allow you to differentiate the different possibilities.

Agreed.
--
Mark McIntyre

"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it."
--Brian Kernighan

Richard Tobin · Oct 19, 2006

EXIT_SUCCESS or EXIT_FAILURE are just too few values to use.
Or you mean that all error codes are unnecessary and that
only "failure" should be returned instead of more detailed
error reports???

It depends on the context your program will be used in.

Many operating systems interpret the return value of a C program as
meaning success or failure. For example, in unix, a non-zero return
value (usually) indicates failure and may cause a script to terminate.
If you want to fit in with the operating system's conventions, and
don't know for sure what the operating system is going to be, then the
standard C way to do it is EXIT_FAILURE and EXIT_SUCCESS.

(As has been pointed out before, the standard could instead have
mapped 0 and 1, but for some reason that wasn't done - presumably
for compatibility with existing programs on odd platforms.)

If you don't care about the operating system's conventions, or you
only care about portability to, say, Posix systems, you can use
any values that seem appropriate.

-- Richard

Keith Thompson · Oct 19, 2006

jacob navia said:
Keith said:

The standard return values for main() are EXIT_SUCCESS or 0 for
success, EXIT_FAILURE for failure. Any other values are non-portable.
In particular, there are real-world systems where "exit(1)" or
"return 1;" from main() will cause the program to terminate and
indicate *success* to the calling environment.
It's often possible to define return values other than the standard
ones, but they're likely to be system-specific, and they should be
clearly documented.
You didn't know that?

Excuse me but it is
int main(int argc,char *argv[])

"int" means at least 16 bits return value. I can choose more
than 30 000 values, and I used 3:
zero for no error, one for argument error, and two for open failure
error. Other error codes (that I do not use) could be syntax error in
the source file, etc.

EXIT_SUCCESS or EXIT_FAILURE are just too few values to use.
Or you mean that all error codes are unnecessary and that
only "failure" should be returned instead of more detailed
error reports???

I can't understand the argument here, which is not based on
any standard whatsoever. "main" returns an "int", not a boolean
value of just success or failure. And that has a reason.

Not based on any standard whatsoever?? It's based on the C standard,
ISO/IEC 9899:1999.

C99 5.1.2.2.3p1:

... a return from the initial call to the main function is
equivalent to calling the exit function with the value returned by
the main function as its argument ...

C99 7.20.4.3p5:

Finally, control is returned to the host environment. If the value
of status is zero or EXIT_SUCCESS, an implementation-defined form
of the status _successful termination_ is returned. If the value
of status is EXIT_FAILURE, an implementation-defined form of the
status _unsuccessful termination_ is returned. Otherwise the
status returned is implementation-defined.

Note carefully that last sentence; for "exit(1);" or "return 1;", the
status returned is defined by the *implementation*, not by your own
program.

Error codes are a habit for me. I always use them to convey
more information to the calling program than just "failure"...

WHAT FAILED?

The file couldn't be opened? Syntax error in the file?

Error codes allow you to differentiate the different possibilities.

That's great if your implementation allows for it, but exit codes
cannot *portably* distinguish results other than success vs. failure.

Some concrete examples:

In Unix-like systems, all but the low-order 8 bits of the status are
silently ignored, so exit(256) has exactly the same effect as exit(0).
It's common for applications to use multiple return codes for
different failure modes (grep specifies 2 different failure codes;
curl currently specifies 76), but there's no universal standard other
than zero for success, non-zero for failure.
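
The truncation can be modelled in a few lines of C. This is only a sketch of the behaviour described above, assuming a host that keeps the low-order 8 bits; observed_status() and shell_sees_success() are made-up names, not part of any system interface.

```c
/* Assumed model of a typical Unix-like host: the parent process sees
   only the low-order 8 bits of the value passed to exit(). The
   conversion to unsigned keeps the masking well defined for negative
   values too. */
int observed_status(int status)
{
    return (int)((unsigned)status & 0xFFu);
}

/* On such a host, zero means success and anything else means failure. */
int shell_sees_success(int status)
{
    return observed_status(status) == 0;
}
```

Under this model observed_status(256) is 0, so exit(256) cannot be told apart from exit(0), and exit(257) cannot be told apart from exit(1).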

In VMS, the convention is that an odd-numbered status indicates
success and an even-numbered status indicates failure (with a lot more
rules for interpreting specific values). As a special case, in a C
program, a status of 0 is translated to 1, so that exit(0) will work
as expected; this translation is *not* done for exit(1). So "exit(0)"
and "exit(1)" both have exactly the same effect; both indicate that
the program terminated successfully.
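
The same kind of toy model works for the VMS convention as described in the paragraph above. It encodes only those two rules (0 is translated to 1, odd means success) and ignores the "lot more rules" for specific values; the function names are hypothetical and this is not a real VMS interface.

```c
/* Model of the convention described above, NOT a real VMS interface.
   Rule 1: a C program's status of 0 is translated to 1. */
int vms_style_translate(int status)
{
    return status == 0 ? 1 : status;
}

/* Rule 2: an odd status indicates success, an even one failure. */
int vms_style_is_success(int status)
{
    return vms_style_translate(status) % 2 != 0;
}
```

In this model both vms_style_is_success(0) and vms_style_is_success(1) are true, which is the surprise: "return 1;" from main() reports success, while a status of 2 reports failure.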

It would be perfectly valid for an implementation to treat 0 as
success, 1 as failure, and map *all* non-zero status values to 1. (I
don't know of any systems that actually do this.)

Now if you want to define a set of status codes for your program, and
the program is intended to run only on some particular platform,
that's just fine. If you want to define the same set of status codes
for *all* platforms, even if it violates the established conventions
on some systems, I personally think it's a bad idea but you can go
ahead and do it if you insist.

If you think that success vs. failure just isn't enough information, I
don't necessarily disagree; your argument is with the C standard, not
with me.

Furthermore, if you want your program to convey extra information via
the exit status, you really should document it. I can't use the
information if I don't know it's there.

jacob, did you really not know this?

CBFalconer · Oct 19, 2006

jacob said:
.... snip ...

WHAT FAILED?

The file couldn't be opened? Syntax error in the file?

Look at the first paragraph of my article. If you wish, send me an
email and I will return a package with the problems. I have no
idea what fails or why; even printfs placed at various points,
including as the first statement in main, all fail. The problem
might be some failure in my system for all I know, and may not be
reproducible elsewhere.
