Henrik said:
Hi,
I would like to create a simple parser which goes through each .h file
and finds each function prototype (or inline implementation) along with
class names and member functions.
Examples:
test.h:
void f1();
inline int f2() {return 0;}
class A
{
    void f3();
};
How would I approach this from a simple viewpoint, without a steep
learning curve? I know a dozen parsers exist, but they are all pretty
advanced and require lots of background knowledge; for my simple needs I
think they might be overkill.
There are roughly two approaches I see. One is to use text pattern
matching, as jussij suggests. (Though remember to also search for A-Z
and, if you want to be pedantic, characters like $ that are also legal
in identifiers but that probably no one actually uses. Also, his
pattern won't spot things like constructors (no return value),
functions with newlines in their whitespace (grep can't handle those,
since it works line by line), operators, and probably some other
special cases.) There's a variant of this which uses something like
Flex to create a lexer, so that you only have to deal with whole
tokens. This might be easier if you know at least a little Flex (or the
ideas behind it) and can find the file that GCC uses for its lexing, or
something similar. Then again, it might not be.
The problem with that is that I'm not sure how hard it would be to pick
out just the lines in question. I know jussij probably didn't spend a
lot of time on that pattern and could get something more to the point
with more effort, but I suspect it would be very difficult to get
something that works in full generality. At the same time, if your
results don't have to be perfect, this solution can be very
lightweight, even to the point of running a slightly modified version
of jussij's regex over your code with grep.
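To illustrate the lightweight end of that idea, here's a rough sketch of the kind of grep invocation I mean, run over Henrik's test.h. The pattern is my own quick approximation, not jussij's actual regex, and it has exactly the weaknesses noted above: it misses constructors, operators, and declarations split across lines.

```shell
# Recreate Henrik's test.h, then grep for lines that look like
# "return-type name(" (optionally preceded by "inline").  This is a
# rough sketch: constructors, operators, and prototypes broken across
# lines will slip through.
cat > test.h <<'EOF'
void f1();
inline int f2() {return 0;}
class A
{
void f3();
}
EOF
grep -nE '^[[:space:]]*(inline[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*[[:space:]]+[A-Za-z_][A-Za-z0-9_]*[[:space:]]*\(' test.h
# prints:
# 1:void f1();
# 2:inline int f2() {return 0;}
# 5:void f3();
```

Note that it correctly skips "class A" (no parenthesis follows the name), which is about as much discrimination as a line-oriented pattern can manage.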
Now, if you want exact answers, you might have to go with one of those
parsers. I'll give a shoutout for one that I know personally, called
Elsa. It is complete and accurate enough to parse its own source and
then output that source again in a form that can be compiled, with the
rebuilt version used to run the regression suite. At least, I think it
is, though I'm not quite sure at the moment, because I'm currently
fixing a number of "pretty-printing" bugs that block correct
translation of the GCC 3.4 headers. (I'm working on a project that uses
it for source-to-source transformations.) There is one
semi-show-stopping bug on the parsing end, though, which is that code
containing endl or flush confuses it. However, replacing endl with "\n"
everywhere except in its definition (I use a regex to tell the uses
apart from the definition; it's not perfect either) will let things
work right. (I know that's not quite semantics-preserving.) Still, if
you can stand to make that change, it's quite easy to write an
extension that does what you want.
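To give a flavor of that workaround (this is a sketch of the sort of regex involved, not the one I actually use), the crude trick is to key on the `<<` in front of each use; the definition in the standard headers, roughly `ostream& endl(ostream&)`, has no `<<` before it, so it survives untouched:

```shell
# Rewrite uses of endl (and std::endl) to "\n", leaving the
# definition alone.  Rough sketch; e.g. an identifier like "endless"
# after "<<" would get mangled, since there is no word-boundary check.
printf '%s\n' 'cout << "hi" << std::endl;' 'ostream& endl(ostream& os);' |
  sed -E 's/<<[[:space:]]*(std::)?endl/<< "\\n"/g'
# prints:
# cout << "hi" << "\n";
# ostream& endl(ostream& os);
```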
http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elsa/semgrep.cc
contains a roughly two-and-a-half-page program that is a "semantic
grep": you give it a variable name, and it tells you all the places a
variable with that name is declared or used. On the other hand, if you
want to include it in another project... this is probably not the best
option. See
www.cubewano.org/oink.
So the pro of the parser approach is that it's very robust, modulo bugs
in the implementation (which, in the case of Elsa, will hopefully go
away in the fairly near future... Mozilla is eyeing the Oink project --
which now more or less includes Elsa -- to help them), but the con is
that it is pretty much by definition quite heavyweight. And there are
of course other options here. Another one that might be useful is
OpenC++, though I don't know much about that project. You could also
try to hack the GCC front end. Those are all the open-source C++
parsers I know of.
Evan Driscoll