Parsing C declarations [savestruct]

  • Thread starter Arthur J. O'Dwyer
  • Start date
A

Arthur J. O'Dwyer

Well, I'm trying to write that program that was requested
a few weeks back, the one that could take struct definitions
and create portable functions to read and write those
structs. Hence the 'savestruct' in the subject line.

I cannot for the life of me figure out how to parse C
declarations correctly! I've come up with an intermediate
format that I'd really like to keep, if at all possible,
which involves making a linked list (or tree, in the case
of functions) that looks something like this:

int (*p)[5]; becomes

is_a: pointer to
down: --.
|
`--> is_a: array of
down: --.
|
`--> is_a: signed int
down: (NULL)

The problem I'm having is parsing declarations using
as little look-ahead as possible; with one token of
look-ahead, I have managed to parse *either*

int (*p)[5]; correctly, *or*

int p[3][4]; correctly,

but whatever I try ends up messing up one of those
two cases.

Does anyone here already have a nice parser that
generates this kind of tree format, or think it's
trivial enough to kindly explain to me either here
or offline?

My current effort is posted complete at
http://www.contrib.andrew.cmu.edu/~ajo/struct.html

TIA,
-Arthur
 
M

M Kumar

I cannot for the life of me figure out how to parse C
declarations correctly! I've come up with an intermediate

See K&R (The C Programming Language by Kernighan & Ritchie, 2nd Edition),
section 5.12

HTH,
K
 
A

Arthur J. O'Dwyer

See K&R (The C Programming Language by Kernighan & Ritchie,
2nd Edition), section 5.12

That program uses string functions to create a text message
corresponding to a simple declaration. I want to create a
tree data structure corresponding to a declaration, preferably
without having to traverse the tree each time I want to add
something to it (i.e., add new elements only at the root).

For example, currently I can parse


int *i[5];

as

(read int) 'signed int'
(read *) ++ptr_count
(read i) 'i is a -> signed int'
--ptr_count 'i is a -> pointer to -> signed int'
(read [5]) 'i is a -> array[5] of -> pointer to -> signed int'

However, this method breaks down when reading

int j[4][2];

as

(read int) 'signed int'
(read j) 'j is a -> signed int'
(read [4]) 'j is a -> array[4] of -> signed int'
(read [2]) 'j is a -> array[2] of -> array[4] of -> signed int'

You see, at the moment arrays come out "backwards."
I *could* implement a stack, and do crazy manipulations
to force everything to come out right, but surely there's
a better way?

I'd like to keep the code as clean as possible.

-Arthur
 
D

Daniel Haude

On Wed, 1 Oct 2003 18:23:44 -0400 (EDT),
in Msg. said:
Well, I'm trying to write that program that was requested
a few weeks back, the one that could take struct definitions
and create portable functions to read and write those
structs. Hence the 'savestruct' in the subject line.

Have you taken a look at lex and yacc (called flex and bison in the GNU
world)? The learning curve is extremely steep, but at the end writing such
a parser becomes a piece of cake -- when fed a couple of small description
files (one for the parser, and one for the grammar) these tools
essentially spit out the finished C code.

May I ask what the finished program is supposed to be good for?

--Daniel
 
M

Martijn

Arthur said:
For example, currently I can parse


int *i[5];

as

(read int) 'signed int'
(read *) ++ptr_count
(read i) 'i is a -> signed int'
--ptr_count 'i is a -> pointer to -> signed int'
(read [5]) 'i is a -> array[5] of -> pointer to -> signed
int'

This is possibly looking ahead more than 1 (e.g. int **i[5])
However, this method breaks down when reading

int j[4][2];

as

(read int) 'signed int'
(read j) 'j is a -> signed int'
(read [4]) 'j is a -> array[4] of -> signed int'
(read [2]) 'j is a -> array[2] of -> array[4] of ->
signed int'

You see, at the moment arrays come out "backwards."
I *could* implement a stack, and do crazy manipulations
to force everything to come out right, but surely there's
a better way?

You said you had a way to parse this one, but which broke the first one -
what did you do? Does that also work on e.g. j[4][3][2]? I guess you have
to do a similar thing as with the pointer (where you use a sort of stack),
but this time storing the fact that you have an array is not enough.

But as you are trying to write out structures, it doesn't really matter how
the dimensions are, as long as you know that they total 8 elements (in this
example).

Another thing is use of the preprocessor - did you take that into account
(i.e. j[MAX_X][MAX_Y])? Of course you could run it through the preprocessor
first.

Finally, you are only trying to parse declarations, right? Because how
about things like *&i (although I honestly see no point in using it, people
may have a macro which is something like &i and take the value of that).

Maybe you may be better of using lex/yacc, altough I assume that is off
topic here ;)
 
E

Eric Sosman

Arthur J. O'Dwyer said:
Well, I'm trying to write that program that was requested
a few weeks back, the one that could take struct definitions
and create portable functions to read and write those
structs. Hence the 'savestruct' in the subject line.

I cannot for the life of me figure out how to parse C
declarations correctly! [...]

It's going to be difficult in full generality -- not
impossible, obviously, since compilers are able to do it,
but difficult. You'll need to handle macro substitution
and conditional compilation (easiest approach: find a C
implementation that'll preprocess the code and spit it
out again, but remember not to generate I/O routines for,
say, `struct lconv'.). You'll need to keep track of typedef,
and possibly of block scopes. You'll need to figure out what
to do with unions: which member is "current" and should
be written or read? And so on ...

The usual approach isn't to parse the struct declarations
and generate I/O code, but to generate both the code and the
declarations from sources written in a "little language"
invented for the purpose. This gives a number of advantages:
the L.L. can be significantly easier to process than full-
blown C, you can define the L.L. in such a way that it will
only produce "well-behaved" struct declarations, you can even
generate sanity-checking code that'll detect attempts to modify
the generated struct declarations instead of modifying their
L.L. sources.

<off-topic excuse="But it's helpful!">
An example of this approach is the rpcgen utility found
on many Unix systems. It may merit your study.
</off-topic>
 
M

Mike Wahler

Daniel Haude said:
On Wed, 1 Oct 2003 18:23:44 -0400 (EDT),


Have you taken a look at lex and yacc (called flex and bison in the GNU
world)? The learning curve is extremely steep, but at the end writing such
a parser becomes a piece of cake -- when fed a couple of small description
files (one for the parser, and one for the grammar) these tools
essentially spit out the finished C code.

May I ask what the finished program is supposed to be good for?

From Arthur's original post:

"take struct definitions and create portable functions
to read and write those structs."

So it's a 'code generator', a productivity tool.

I hope he succeeds with this, and either sells
or donates it. I'd find it quite useful.

-Mike
 
A

Arthur J. O'Dwyer

For example, currently I can parse

int *i[5];

as

(read int) 'signed int'
(read *) ++ptr_count
(read i) 'i is a -> signed int'
--ptr_count 'i is a -> pointer to -> signed int'
(read [5]) 'i is a -> array[5] of -> pointer to -> signed
int'

This is possibly looking ahead more than 1 (e.g. int **i[5])

Yes, but it's only keeping one *token* of look-ahead at a time.
We need an integer state variable 'ptr_count' to count the number
of asterisks in the current 'dcl()' instantiation, but we don't
need to save a whole stackful of information. This is basically
the approach taken by K&R 5.12, if I recall correctly.

However, this method breaks down when reading

int j[4][2];

as

(read int) 'signed int'
(read j) 'j is a -> signed int'
(read [4]) 'j is a -> array[4] of -> signed int'
(read [2]) 'j is a -> array[2] of -> array[4] of ->
signed int'

You see, at the moment arrays come out "backwards."
I *could* implement a stack, and do crazy manipulations
to force everything to come out right, but surely there's
a better way?

You said you had a way to parse this one, but which broke the first one -
what did you do?

Change the grammar from

dcl :: {'*'}* dir_dcl
dir_dcl :: atom suffixes
atom :: identifier
| '(' dcl ')'
suffixes:: '(' ')'
| '[' number ']'

to

dcl :: {'*'}* dir_dcl suffixes
dir_dcl :: atom
atom :: identifier
| '(' dcl ')'
suffixes:: '(' ')'
| '[' number ']'

I think does the trick. That should make j[4][2] parse
correctly, but break more complicated things like
int *(*k[5])[3]. I forget exactly.
Does that also work on e.g. j[4][3][2]?

Yeah, assuming "this" means the proposed stackful method.
Parse as:

read 'int' 'int'
read 'j' 'j is a' -> 'int'
read '[4]' push '[4]' onto software stack
read '[3]' push '[3]' onto software stack
read '[2]' push '[2]' onto software stack
read ';' [now clear the stack...]
pop '[2]' 'j is a' -> 'array[2] of' -> 'int'
pop '[3]' 'j is a' -> 'array[3] of' -> 'array[2] of' -> 'int'
pop '[4]' 'j is a' -> 'array[4] of' -> 'array[3] of' -> ...

But that's really ugly in practice, and will require a potentially
*really large* amount of state information. You see?

I guess you have
to do a similar thing as with the pointer (where you use a sort of stack),
but this time storing the fact that you have an array is not enough.

Right. (I guess you already figured out the above, but I hope the
example confirms your thoughts, anyway.)
But as you are trying to write out structures, it doesn't really matter how
the dimensions are, as long as you know that they total 8 elements (in this
example).

No, it does. I want to keep this thing as portable as possible.
(I assume you're suggesting I output code along the lines of

int i;
for (i=0; i < 24; ++i) /* (24 = 4*3*2) */
output(((int *)p->arr));

in place of

int i, j, k;
for (i=0; i < 4; ++i)
for (j=0; i < 3; ++j)
for (k=0; i < 2; ++k)
output(p->arr[j][k]);

right? Well, I could do that, but again, I'd prefer not to.
Actually having code that could parse declarations would be
of more general use than just this 'savestruct' program,
anyway...)
Another thing is use of the preprocessor - did you take that into account
(i.e. j[MAX_X][MAX_Y])? Of course you could run it through the preprocessor
first.

Yes, in general there are going to be lots of things that will break
'savestruct', or require special cases or additional user input,
possibly in the form of /** */ comments, like 'lint' and 'javadoc'
do it. I'm not worrying about that yet.
Finally, you are only trying to parse declarations, right? Because how
about things like *&i (although I honestly see no point in using it, people
may have a macro which is something like &i and take the value of that).

*&i is not a declaration. So I don't need to parse it.
Maybe you may be better of using lex/yacc, altough I assume that is off
topic here ;)

I'd prefer not to. And yes, in general it would be.

Thanks.

-Arthur,
scrivening
 
A

Arthur J. O'Dwyer

[double response to both Mike and Eric's posts]


I have tried simple things with lex/yacc before, but I'm not good
with them. Also, I'd *like* to keep the project small, and not
require extra tools (and yes, lex/yacc *are* extra tools, on Windows
machines).
From Arthur's original post:

"take struct definitions and create portable functions
to read and write those structs."

So it's a 'code generator', a productivity tool.

I hope he succeeds with this, and either sells
or donates it. I'd find it quite useful.

If it ever gets to a workable stage, I will post it on
my web page and make it open-source. I honestly doubt
that you'll find it very useful, though -- consider
how many structures you really *need* to be doing I/O
on, and how many times you have weird dependencies
among the members of a structure -- like an 'int' field
specifying the right interpretation of a 'union' field,
or the length of a dynamically allocated array.
At the moment, I'm not worrying about these cases, but
neither do I have any idea how to handle them!
(It might be a helpful tool for games with a "save"
feature, though.)


And Eric Sosman wrote, cross-thread:
It's going to be difficult in full generality -- not
impossible, obviously, since compilers are able to do it,
but difficult.

I think the really hardest part will be the wide variety of
special cases, as I detailed above -- how to specify that
X is a length field, or if Y is zero then Z is an invalid
pointer and must not be followed(!). Compilers don't need
to handle that sort of thing...
You'll need to handle macro substitution
and conditional compilation

I figure there are standalone preprocessors out there
already.
(easiest approach: find a C
implementation that'll preprocess the code and spit it
out again, but remember not to generate I/O routines for,
say, `struct lconv'.).

My current plan is to use a consistent naming convention
for the routines, and simply call similarly-named routines
for other structures. E.g.,

typedef struct {
FILE *fp;
} foo;

becomes something like

int ReadBufferFoo(FILE *fp, foo *p)
{
int rc;
p->fp = malloc(sizeof p->fp);
if (p->fp == NULL) return -3;
rc = ReadBufferFile(fp, p->fp);
if (rc) return rc;
return 0;
}

In this particular case, of course, the "generated" code
doesn't look very helpful. But it's probably okay in
general.
You'll need to keep track of typedef,

Have done.
and possibly of block scopes.

Huh? (I'm only dealing with one 'struct' definition at a
time, at the moment, so if you're thinking of

struct a { int i; };
void foo(void) {
struct a { long i; };
}

that isn't ever supposed to happen.)
You'll need to figure out what
to do with unions: which member is "current" and should
be written or read? And so on ...

Yeah, that's going to be a problem. :)

Thanks for all your input,
-Arthur
 
M

Mike Wahler

Arthur J. O'Dwyer said:
[double response to both Mike and Eric's posts]


I have tried simple things with lex/yacc before, but I'm not good
with them. Also, I'd *like* to keep the project small, and not
require extra tools (and yes, lex/yacc *are* extra tools, on Windows
machines).
From Arthur's original post:

"take struct definitions and create portable functions
to read and write those structs."

So it's a 'code generator', a productivity tool.

I hope he succeeds with this, and either sells
or donates it. I'd find it quite useful.

If it ever gets to a workable stage, I will post it on
my web page and make it open-source.
Great.

I honestly doubt
that you'll find it very useful, though -- consider
how many structures you really *need* to be doing I/O
on, and how many times you have weird dependencies
among the members of a structure -- like an 'int' field
specifying the right interpretation of a 'union' field,

Yes, it seems it will have more than what I want,
but still will include what I do want.
or the length of a dynamically allocated array.
At the moment, I'm not worrying about these cases, but
neither do I have any idea how to handle them!
(It might be a helpful tool for games with a "save"
feature, though.)

A 'quick-n-easy' way to serialize an arbitrary struct
type without hand rolling code every time I would find useful,
for logging, simple debugging, simple 'config files' etc.

I suppose if I Keep It Simple, I could just write it myself. :)
Admittedly I'm also curious to see how you eventually solve this
issue.

-Mike
 
A

Arthur J. O'Dwyer

...

Yes, it seems it will have more than what I want,
but still will include what I do want.

:)
Rather say: it seems that to be at all useful, it must
do more than I want, but still include what I do want.

(I had a false epiphany this morning about how to handle
complicated declarations properly -- I thought that just
making the bit that processed '()' and '[]' recursive
would help. It didn't. The code on the site is not
updated.)

A 'quick-n-easy' way to serialize an arbitrary struct
type without hand rolling code every time I would find useful,
for logging, simple debugging, simple 'config files' etc.

I suppose if I Keep It Simple, I could just write it myself. :)
Admittedly I'm also curious to see how you eventually solve this
issue.

At the moment, I'm leaning towards special comments in some
format similar to 'javadoc' or (I think) 'doxygen'. So far,
the special cases I've thought of specifically include:


int what_type_is_x_really;
union foo x;

int what_length_is_x;
foo *x;

int x_dimension_1;
int x_dimension_2;
foo **x; /* and so on... yuck! */

int which_of_these_fields_apply;
/** case 1: */
foo bob;
bar alice;
/** case 1 or 2: */
baz luhrman;
/** case 3: */ /* and so on... yuck! */


But ATM I'm really still trying to get 'typedecl' alone
to work... it's taking much longer than it should, really!
Bottom line: Don't expect results. But if there are any,
I'll tell you here.

-Arthur
 
I

Ira Baxter

Eric Sosman said:
Arthur J. O'Dwyer said:
Well, I'm trying to write that program that was requested
a few weeks back, the one that could take struct definitions
and create portable functions to read and write those
structs. Hence the 'savestruct' in the subject line.

I cannot for the life of me figure out how to parse C
declarations correctly! [...]

It's going to be difficult in full generality -- not
impossible, obviously, since compilers are able to do it,
but difficult. You'll need to handle macro substitution
and conditional compilation (easiest approach: find a C
implementation that'll preprocess the code and spit it
out again, but remember not to generate I/O routines for,
say, `struct lconv'.). You'll need to keep track of typedef,
and possibly of block scopes. You'll need to figure out what
to do with unions: which member is "current" and should
be written or read? And so on ...

The usual approach isn't to parse the struct declarations
and generate I/O code, but to generate both the code and the
declarations from sources written in a "little language"
invented for the purpose. This gives a number of advantages:
the L.L. can be significantly easier to process than full-
blown C, you can define the L.L. in such a way that it will
only produce "well-behaved" struct declarations, you can even
generate sanity-checking code that'll detect attempts to modify
the generated struct declarations instead of modifying their
L.L. sources.

If the goal is to generate persistification code from structs,
then that's the goal.

However, any attempt to hack one's way through it
is pretty likely doomed to failure. People invented parsing
technology to handle complicated parsing programs,
and C's declaration syntax is one of those.
(Yes, you could use recursive descent parsing technology;
that's at least an organized attack).

But, as Sosman says, you'll still have to handle macros, conditionals,
includes, and everything else if you really want to do this
using a struct definition from a source file.
So your tool will in essence include much of what a real
C compiler-parser does.

I'd recommend hacking the GNU front end as one option,
or looking at commercial tools that are designed to build
custom analsis/code generation, such as our DMS Software
Reengineering Toolkit. (It has a full C front end with
everything Sosman suggests is needed).
 
A

Arthur J. O'Dwyer

Well, I've taken a few minutes just now and done the ugly
hack that I was trying to avoid, re parsing C declarations.
The finished 'typedecl.c' ends up doing a lot of traversing
of the "parse tree" structure in order to get all the
"pointer to" and "array of" bits in the right order.
But it wasn't nearly as painful as I had been expecting it
to be.

Still true. The typedecl-main program should be completely
useful now. If anyone finds any bugs in it (such as incorrect
parsing of some declaration), could you please let me know?

Now, on to the interesting parts... :)

-Arthur
 
A

Arthur J. O'Dwyer

[hey, 'typedecl' works now!]

Now, on to the interesting parts... :)

Interesting Part #1 of 'savestruct', the output-code
generator, is now nearing completion. Mike, or other
interested parties, I'd be much obliged if you'd take
a look at any of:

1) the code so far. It's getting ugly, which worries me.
Feedback, especially on ways to split up the code in a
logical manner, or get rid of the global variables in the
"Icky" section, would be much welcomed.
2) the generated code. You can now see some sample output
at the site below, demonstrating pretty much everything
I've coded so far.
3) the user interface. All my programs try to use the same
command-line structure, so if there's something buggy or
weird about it, I'd *really* like to know. :)
4) the way I've chosen to handle dynamic arrays in the
generated code -- read the "help" text for this, since the
actual program doesn't implement everything yet (nor am I
really sure how to implement it..).

Thanks to everyone who's given me ideas so far, BTW.

-Arthur
 
M

Mike Wahler

Arthur J. O'Dwyer said:
[hey, 'typedecl' works now!]

Now, on to the interesting parts... :)

Interesting Part #1 of 'savestruct', the output-code
generator, is now nearing completion. Mike, or other
interested parties, I'd be much obliged if you'd take
a look at any of:

Arthur:

Since you addressed me directly in your post, I wanted
to let you know that I have taken a quick peek at your
code, but haven't had time to look in depth. However,
I must agree, it does look kind of 'messy' :)

I've made a few minor notes about it, but imo nothing
significant enough yet to start discussing it.

I do like the 'header' part of the output
(prototype list, list of struct members)

Actually, while I'm writing this, I might as well go
ahead an mention one of those 'minor' notes.

In your output header, you have items such as :

i is a(n) signed int

I think it would look 'neater' if instead of all those
(n)', you could make it e.g.

i has type signed int

or

i is of type signed int

Just a thought. ;-)

-Mike
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top