On Mon, 23 Aug 2004 11:10:53 -0400, "Michael J. Fromberger"
[OT]
For simple, single-purpose languages, I am inclined to agree that
hand-written parsers are easy enough to write and maintain. However,
when you are dealing with a more complex general-purpose language, and
one whose grammar needs to evolve over time, then I think a
parser-generator is a much better solution by far.
Like I said before may be it's me... but I see the grammar definition
and the sparse collection of C code snippets required to use yacc &
friends harder to understand than a recursive descending parser.
One could go *forever* by abstracting and generalizing and
formalizing, but there is a ROI point that must be considered,
and, for me, shift-reduce parser are beyond the ROI point.
If the formal description (with the ugly pieces of real meat that
you're required to stick to it) becomes harder to follow and
maintain than another (that is still formal) imperative description
of how the parsing is done (i.e. a program doing the parsing)
then IMO there is simply no point in pursuing further abstraction.
Sometimes, for reasons I don't understand, there is a strong
temptation of abstracting just too much, and the result can
become quite ugly. If you do not understand what I mean then
The chief trouble in maintaining a hand-written recursive descent parser
is that the grammar for the input language is hidden inside the
implementation. The author of the code can usually pick it out, but
over time, even the author may find it difficult (and yes, I am speaking
from a certain degree of first-hand experience in this matter).
Normally the grammar is documented ALSO elsewhere in a very
human readable form. Also grammar sure evolves but (in my
experience) it's not the most common change.
In contrast, the grammars consumed by most parser-generators are
explicitly written out as grammars, in the spirit (of not the form) of
BNF notation. If you want to make a change to the grammar (and thereby,
the parser), you can easily see how the changes will affect the rest of
the language, and a new parser can be created quickly and easily, simply
by re-running the parser generator. The only lines of code you need to
touch are the places where your change affects the translation, and that
is often quite minimal.
I love BNF form, and I even wrote long ago a program that was
accepting a BNF description (and a very "clean" one, written
as you can find it in reference manuals of languages) and then
could check inputs for compliance with the grammar.
However when writing a compiler the grammar is just one part...
you do not have to just read the source. Spreading the meat of
the compiler is in my opinion worse than spreading the grammar...
it's like spreading the business logic of an application in
the buttons you press in a gui.
It's only a personal opinion, of course.
Furthermore, naive implementation of recursive descent is fraught with a
few subtle perils to trip the novice and the unwary. For instance, you
must carefully avoid left-recursive productions, or your parser may not
terminate.
That is something that never happened to me. Some trap I fell
in was for example writing tokenizers that recognized integers
with something like
/-?\d+/
and then later bumping in "x-1" being tokenized as <ident><number>.
Those are not serious issues, however.
Also, error-handling is tricky in recursive descent, because
much of the parser's state is implicit in stack frames that must be
correctly unwound when an error forces you to bail out. If you're
writing in a language (like Python) with good automatic memory
management, the latter is less of an issue, but recursive descent
parsers written in languages without automatic memory management, like C
and C++, must contend mightily with this.
Here is an excerpt of a parser I wrote about a dozen years ago
in C. The grammar is for a complete programming language (a
language designed for the implementation of the server side
of a three-thiers client/business logic/RDBMS architecture).
case -RW_IF :
SkipToken();
c=ParseExpr();
if (!ParseError && NextIs(-RW_THEN))
{ t=ParseStat();
if (!ParseError)
{ if (CurToken==-RW_ELSE)
{ SkipToken(); e=ParseStat(); }
else
e=NULL;
r=NewIf(c,t,e);
}
else
{ Deref(t); Deref(c); }
}
else
Deref(c);
break;
This is the case for parsing IF-THEN-ELSE statement.
Surely it's a bit annoying to remember to do all those
"Deref" call in case of failure, but I wouldn't call this
"contend mightly". The parser for the whole language
(a general purpose programming language with support
for embedded SQL) is about 1000 lines of C.
Of course, there is no silver bullet, but the availability of good LR(1)
and LALR(1) parser generators should not be discounted, even if the
theory behind them seems to be slightly complicated, on the surface.
Surely they're good for someone. I've to say I didn't really
invest a lot of time on shift-reduce parsers; but for the problems
I've been facing I found no serious reason for using those tools
instead of a recursive parser. Once a bit of low-level tricks
(like having current/skip instead of get/unget; or using
reference counters when the language is not providing them to
you) are in place things are really not complex at all.
Of course this is just my personal view, based on the experience
I had on the few languages I implemented. For example I never
dared to write a parser for C++: things like the rule that
says "if it can be interpreted as a declaration then it's a
declaration" really scare me (that is the rule that Scott Meyers
calls "the C++ most vexing parse" and implies that a C++ parser
has to swallow a potentially unlimited number of tokens before
deciding what is the semantic meaning for the very first of them).
In language I invented I never had the crazy idea of allowing
both:
X x(...);
X x = X(...);
for the price of introducing ambiguity with function declarations.
Note that if a grammar is that complex then even USERS of the
grammar will fall into the traps. This is very common for C++;
the following code snippet for example...
double pi = 3.141592654;
int x( int(pi) );
may trick many C++ programmers into thinking that an integer
variable named x is being initialized with value 3.
It's hard to get to this level of complexity with a recursive
descent parser ? Good ... I've another excuse to stay away
from pointless confusion ;-)
Andrea
PS: Ok ok ok... I'll give shift-reduce parser and their tools
another look. May be it's just my laziness that is showing up
and my brain that tries to find excuses for justifying it.