Please review small example code


Pavel

Rui said:
I've just added a small C++ implementation example for a table-based LL parser to wikipedia's
article on LL parsers. It would be great if the code was reviewed by more knowledgeable and
experienced people, as I've added a couple of crude hacks to it.

So, if anyone is willing to take a crack at it then please take a look at:

http://en.wikipedia.org/wiki/LL_parser#Parser_implementation_in_C.2B.2B


Thanks in advance,
Rui Maciel
FYI: I tried it with (a+a) and the output was:

Rule 2
Matched 1
Rule 1
Rule 3
Matched 3
Matched 4
Rule 1
Rule 3
Matched 3
Matched 2
Matched 5
finished parsing

-- looks like one extra rule '1' was applied compared to the
[ 2, 1, 3, 3 ]
rule sequence given in the text of the article.

-Pavel
 

Pavel

Rui said:
I've just added a small C++ implementation example for a table-based LL parser to wikipedia's
article on LL parsers. It would be great if the code was reviewed by more knowledgeable and
experienced people, as I've added a couple of crude hacks to it.

So, if anyone is willing to take a crack at it then please take a look at:

http://en.wikipedia.org/wiki/LL_parser#Parser_implementation_in_C.2B.2B


Thanks in advance,
Rui Maciel
Funny thing just happened: I tried to change your code to my taste and
the extra '1' rule in the result went away, so now the result is the
expected sequence of rules, 2133. I swear I did not even try to fix any
behavior -- don't even know what the issue was.

For the changes, my only goal was to make the code clearer to a
Wikipedia reader, even if I had to sacrifice some performance (which I
may or may not have done -- replacing std::map with a 2-dimensional
array should give some wins, and replacing the switch with std::find may
lose some). The summary of the changes:

1. Eliminated magic numbers (I would even feel comfortable removing some
comments now).
2. Replaced std::map-based table with a 2-dimensional array as a more
natural representation of a table
3. Got rid of the switch in lexer() (just to make it shorter); in a real
parser I would use direct indexation by character value, but I felt it
would be an unnecessary complication for illustrative code.

I tried to preserve the original indentation/braces style although it
was definitely not my favorite. Below is the code.

One last suggestion for the article: a reader may wonder what's so cool
about LL grammars as opposed to LR (which many consider more powerful),
so it may be worth mentioning the advantages. (If I understand it
correctly, LL(1) parsers can in principle be much faster than LR.)

Hope this will help,
-Pavel


#include <algorithm>
#include <cassert>
#include <iostream>
#include <stack>

enum GrammarElements {
    L_PAREN,            // '('
    R_PAREN,            // ')'
    A_CHAR,             // 'a'
    PLUS,               // '+'
    END_OF_STREAM,      // $, '\0' in our case
    UNRECOGNIZED_TOKEN,
    N_TERMINALS,
    SUM_EXPR = N_TERMINALS, // S non-terminal
    ADDEND,                 // F non-terminal
    N_GRAMMAR_ELEMENTS
};

enum RuleId {
    DEFAULTED,
    ON_AUGEND,          // a funny old name for a first summand
    ON_OPENING_PAREN,
    ON_ADDEND           // a second summand
};

// compiler initializes to DEFAULTED
static RuleId table[N_GRAMMAR_ELEMENTS][N_TERMINALS];

/**
    Translates ASCII characters into terminal symbol reference numbers
**/
int lexer(char c)
{
    static const char ABC[N_TERMINALS] = "()a+";
    const char *res = std::find(ABC, ABC + N_TERMINALS, c);
    return res == ABC + N_TERMINALS ? UNRECOGNIZED_TOKEN : res - ABC;
}

/**
    The following program parses the text passed through the first command
    line argument
**/
int main(int argc, char **argv)
{
    using namespace std;

    if (argc < 2)
    {
        cout << "usage: ll <symbols>" << endl;
        return 0;
    }

    stack<GrammarElements> ss; // symbol stack

    // initialize the stack
    ss.push(END_OF_STREAM); // terminal, $
    ss.push(SUM_EXPR);      // non-terminal, S

    // initialize the input stream
    char *p = &argv[1][0]; // input buffer

    // set up the parsing table for this grammar
    table[SUM_EXPR][L_PAREN] = ON_OPENING_PAREN;
    table[SUM_EXPR][A_CHAR]  = ON_AUGEND;
    table[ADDEND][A_CHAR]    = ON_ADDEND;

    while (!ss.empty())
    {
        if (lexer(*p) == ss.top())
        {
            // input symbol and top of the stack match, pop the stack
            cout << "Matched " << lexer(*p) << endl;
            p++;
            ss.pop();
        }
        else
        {
            // input symbol and top of the stack don't match, execute the
            // rule given by the parsing table
            RuleId ruleId = table[ss.top()][lexer(*p)];
            cout << "Rule " << ruleId << endl;
            switch (ruleId)
            {
            case ON_AUGEND: // 1. S -> F
                ss.pop();
                ss.push(ADDEND); // F
                break;

            case ON_OPENING_PAREN: // 2. S -> ( S + F )
                ss.pop();
                ss.push(R_PAREN);  // )
                ss.push(ADDEND);   // F
                ss.push(PLUS);     // +
                ss.push(SUM_EXPR); // S
                ss.push(L_PAREN);  // (
                break;

            case ON_ADDEND: // 3. F -> a
                ss.pop();
                ss.push(A_CHAR); // a
                break;

            default:
                assert(ruleId == DEFAULTED);
                cout << "parser table defaulted" << endl;
                return 0;
            }
        }
    }

    cout << "finished parsing" << endl;

    return 0;
}
 

Rui Maciel

Pavel said:
-- looks like one extra '1' rule was applied as compared to
[ 2, 1, 3, 3 ]
rule sequence given in the text of the article.

You are absolutely right. There's a small misconfiguration in rule 2. It reads:

<code>

case 2: // 2. S → ( S + F )
    ss.pop();
    ss.push(2); // )
    ss.push(6); // F
    ss.push(4); // +
    ss.push(6); // S
    ss.push(1); // (
    break;

</code>

While it should push the 'F' non-terminal symbol (magic number 7) onto the symbol stack, it
instead pushes the 'S' non-terminal symbol (magic number 6). That was the result of a minor
tweak I made to the parser while testing whether it could also support nesting in the right
term (i.e., consider '(a+(a+a))' valid as well).

Tweaking that number should be enough to "fix" the parser.


Rui Maciel
 

Rui Maciel

Pavel said:
Funny thing just happened: I tried to change your code to my taste and
the extra '1' rule in the result went away, so now the result is the
expected sequence of rules, 2133. I swear I did not even try to fix any
behavior -- don't even know what the issue was.

Magic numbers are confusing, and the problems they cause easily pass under the radar.
You fixed it by correcting the symbol stack while replacing the magic numbers with the
named integer constants from enum GrammarElements.

For the changes, my only goal was to make the code clearer to a
Wikipedia reader, even if I had to sacrifice some performance (which I
may or may not have done -- replacing std::map with 2-dimensional array
should give some wins and replacing switch with std::find may lose
some). The summary of the changes:

1. Eliminated magic numbers (I would even feel comfortable removing some
comments now).

Sounds good. Actually, I write all my parsers with named integer constants; I don't
remember why I haven't used them in this example. I would only preserve the magic numbers
when referring to the rules, because they are numbered to begin with, and therefore in this
case the numbers do make the code a bit easier to follow.

2. Replaced std::map-based table with a 2-dimensional array as a more
natural representation of a table

The "more natural representation" is a bit subjective, particularly in this case, where even
the syntax is the same.

3. Got rid of a switch in a lexer() (just to make it shorter); in real
parser I would use direct indexation by character value but I felt it
would be an unnecessary complication for illustrative code.

The reason behind the switch() statement was to make the code easier to read and
follow. I don't believe your suggestion succeeds in that, although I agree that the
switch() does make the code a bit too long.

I tried to preserve the original indentation/braces style although it
was definitely not my favorite. Below is the code.

One last suggestion for the article: a reader may wonder what's so cool
about LL grammars as opposed to LR (which many consider more powerful),
so it may be worth mentioning the advantages. (If I understand it
correctly, LL(1) parsers can in principle be much faster than LR.)

I believe that's a good idea. The article currently only mentions differences in taste
between what it calls the "european school" and "US school" of language design, which I
don't believe is a reasonable thing to say. Nonetheless, as I wasn't able to find any decent
references to back that statement, I refrained from adding it to the article. Could you
please do the honors?

I've updated the example code.


Rui Maciel
 

Michael Doubez

I've just added a small C++ implementation example for a table-based LL parser to wikipedia's
article on LL parsers.  It would be great if the code was reviewed by more knowledgeable and
experienced people, as I've added a couple of crude hacks to it.  

So, if anyone is willing to take a crack at it then please take a look at:

http://en.wikipedia.org/wiki/LL_parser#Parser_implementation_in_C.2B.2B

A few points:
- the output of lexer() is not tested against TS_INVALID, which is a
shame because one of the advantages of LL is the availability of the
context upon error. You could write a general error function:

void parse_error(std::ostream& os, const char* reason,
                 const char* p, const char* str)
{
    os << "Parse failed (" << reason << ") on '" << *p
       << "' at index " << size_t(p - str) << std::endl;
}

const Symbols next_symbol = lexer(*p);
if (next_symbol == TS_INVALID)
{
    parse_error(std::cerr, "invalid token", p, argv[1]);
    return -1;
}
// use next_symbol hereafter instead of lexer(*p)
- You don't need to repeat the enum keyword in declarations like in C
- You are using a logical shortcut in your test:

if (lexer(*p) == ss.top())
{
    // ...
}
else
{
    // ...
}

which in fact is

if (isTerminal(ss.top()))
{
    if (lexer(*p) == ss.top())
    {
        // ...
    }
    else
    {
        // does not match
        // ...
    }
}
else
{
    // generate rule
    // ...
}
- the loop while (ss.size() > 0) can be rewritten as
while (!ss.empty())
- the map of maps is not very intuitive; since this is an example
program, you could directly use an int[NB_SYMBOL][NB_SYMBOL], with
NB_SYMBOL being the last enum value. You could also make an enum for the
rules.
 
