Please review small example code


Pavel

Rui said:
I've just added a small C++ implementation example for a table-based LL parser to wikipedia's
article on LL parsers. It would be great if the code was reviewed by more knowledgeable and
experienced people, as I've added a couple of crude hacks to it.

So, if anyone is willing to take a crack at it then please take a look at:

http://en.wikipedia.org/wiki/LL_parser#Parser_implementation_in_C.2B.2B


Thanks in advance,
Rui Maciel
FYI: I tried it with (a+a) and the output was:

Rule 2
Matched 1
Rule 1
Rule 3
Matched 3
Matched 4
Rule 1
Rule 3
Matched 3
Matched 2
Matched 5
finished parsing

-- looks like one extra rule '1' was applied compared to the
[ 2, 1, 3, 3 ]
rule sequence given in the text of the article.

-Pavel
 

Pavel

Rui said:
I've just added a small C++ implementation example for a table-based LL parser to wikipedia's
article on LL parsers. It would be great if the code was reviewed by more knowledgeable and
experienced people, as I've added a couple of crude hacks to it.

So, if anyone is willing to take a crack at it then please take a look at:

http://en.wikipedia.org/wiki/LL_parser#Parser_implementation_in_C.2B.2B


Thanks in advance,
Rui Maciel
Funny thing just happened: I tried to change your code to my taste and
the extra '1' rule in the result went away, so now the result is the
expected sequence of rules, 2133. I swear I did not even try to fix any
behavior -- don't even know what the issue was.

For the changes, my only goal was to make the code clearer to a
Wikipedia reader, even if I had to sacrifice some performance (which I
may or may not have done -- replacing std::map with a 2-dimensional
array should give some wins, and replacing the switch with std::find may
lose some). The summary of the changes:

1. Eliminated magic numbers (I would even feel comfortable removing some
comments now).
2. Replaced std::map-based table with a 2-dimensional array as a more
natural representation of a table
3. Got rid of the switch in lexer() (just to make it shorter); in a real
parser I would use direct indexation by character value, but I felt it
would be an unnecessary complication for illustrative code.

I tried to preserve the original indentation/braces style although it
was definitely not my favorite. Below is the code.

One last suggestion for the article: a reader may wonder what's so cool
about LL grammars as opposed to LR (which many consider more powerful),
so it may be worth mentioning the advantages. (If I understand it
correctly, LL(1) parsers can in principle be much faster than LR.)

Hope this will help,
-Pavel


#include <algorithm>
#include <cassert>
#include <iostream>
#include <stack>

enum GrammarElements {
    L_PAREN,            // '('
    R_PAREN,            // ')'
    A_CHAR,             // 'a'
    PLUS,               // '+'
    END_OF_STREAM,      // $, '\0' in our case
    UNRECOGNIZED_TOKEN,
    N_TERMINALS,
    SUM_EXPR = N_TERMINALS, // S non-terminal
    ADDEND,                 // F non-terminal
    N_GRAMMAR_ELEMENTS
};

enum RuleId {
    DEFAULTED,
    ON_AUGEND,          // a funny old name for a first summand
    ON_OPENING_PAREN,
    ON_ADDEND           // a second summand
};

// compiler initializes to DEFAULTED
static RuleId table[N_GRAMMAR_ELEMENTS][N_TERMINALS];

/**
    Translates ASCII characters into terminal symbol reference numbers
**/
int lexer(char c)
{
    static const char ABC[N_TERMINALS] = "()a+";
    const char *res = std::find(ABC, ABC + N_TERMINALS, c);
    return res == ABC + N_TERMINALS ? UNRECOGNIZED_TOKEN : res - ABC;
}

/**
    The following program parses the text passed through the first command
    line argument
**/
int main(int argc, char **argv)
{
    using namespace std;

    if (argc < 2)
    {
        cout << "usage: ll <symbols>" << endl;
        return 0;
    }

    stack<GrammarElements> ss; // symbol stack

    // initialize the stack
    ss.push(END_OF_STREAM); // terminal, $
    ss.push(SUM_EXPR);      // non-terminal, S

    // initialize the input stream
    char *p = &argv[1][0]; // input buffer

    // set up the parsing table for this grammar
    table[SUM_EXPR][L_PAREN] = ON_OPENING_PAREN;
    table[SUM_EXPR][A_CHAR]  = ON_AUGEND;
    table[ADDEND][A_CHAR]    = ON_ADDEND;

    while (!ss.empty())
    {
        if (lexer(*p) == ss.top())
        {
            // input symbol and top of the stack match, pop the stack
            cout << "Matched " << lexer(*p) << endl;
            p++;
            ss.pop();
        }
        else
        {
            // input symbol and top of the stack don't match, execute the
            // rule given by the parsing table
            RuleId ruleId = table[ss.top()][lexer(*p)];
            cout << "Rule " << ruleId << endl;
            switch (ruleId)
            {
            case ON_AUGEND: // 1. S -> F
                ss.pop();
                ss.push(ADDEND); // F
                break;

            case ON_OPENING_PAREN: // 2. S -> ( S + F )
                ss.pop();
                ss.push(R_PAREN);  // )
                ss.push(ADDEND);   // F
                ss.push(PLUS);     // +
                ss.push(SUM_EXPR); // S
                ss.push(L_PAREN);  // (
                break;

            case ON_ADDEND: // 3. F -> a
                ss.pop();
                ss.push(A_CHAR); // a
                break;

            default:
                assert(ruleId == DEFAULTED);
                cout << "parser table defaulted" << endl;
                return 0;
            }
        }
    }

    cout << "finished parsing" << endl;

    return 0;
}
 

Rui Maciel

Pavel said:
-- looks like one extra '1' rule was applied as compared to
[ 2, 1, 3, 3 ]
rule sequence given in the text of the article.

You are absolutely right. There's a small misconfiguration in rule 2. It reads:

<code>

case 2: // 2. S → ( S + F )
    ss.pop();
    ss.push(2); // )
    ss.push(6); // F
    ss.push(4); // +
    ss.push(6); // S
    ss.push(1); // (
    break;

</code>

While it should push the 'F' non-terminal symbol (magic number 7) onto the symbol stack, it
instead pushes the 'S' non-terminal symbol (magic number 6). That was the result of a minor
tweak I made to the parser while testing whether it could also support nesting in the right
term (i.e., consider '(a+(a+a))' valid as well).

Tweaking that number should be enough to "fix" the parser.


Rui Maciel
 

Rui Maciel

Pavel said:
Funny thing just happened: I tried to change your code to my taste and
the extra '1' rule in the result went away, so now the result is the
expected sequence of rules, 2133. I swear I did not even try to fix any
behavior -- don't even know what the issue was.

Magic numbers are confusing, and the problems they cause easily pass under the radar.
You fixed it by correcting the symbol stack while replacing the magic numbers with the
named integer constants from enum GrammarElements.

For the changes, my only goal was to make the code clearer to a
Wikipedia reader, even if I had to sacrifice some performance (which I
may or may not have done -- replacing std::map with 2-dimensional array
should give some wins and replacing switch with std::find may lose
some). The summary of the changes:

1. Eliminated magic numbers (I would even feel comfortable removing some
comments now).

Sounds good. Actually, I write all my parsers with named integer constants; I don't
remember why I haven't used them in this example. I would only preserve the magic numbers
when referring to the rules, because they are numbered to begin with, and therefore in this
case the numbers do make the code a bit easier to follow.

2. Replaced std::map-based table with a 2-dimensional array as a more
natural representation of a table

The "more natural representation" is a bit subjective, particularly in this case, where even
the syntax is the same.

3. Got rid of a switch in a lexer() (just to make it shorter); in real
parser I would use direct indexation by character value but I felt it
would be an unnecessary complication for illustrative code.

The reason behind the switch() statement was to make the code easier to read and
follow. I don't believe your suggestion succeeds in that, although I agree that the
switch() does make the code a bit too long.

I tried to preserve the original indentation/braces style although it
was definitely not my favorite. Below is the code.

One last suggestion for the article: a reader may wonder what's so cool
about LL grammars as opposed to LR (which many consider more powerful),
so it may be worth mentioning the advantages. (If I understand it
correctly, LL(1) parsers can in principle be much faster than LR.)

I believe that's a good idea. The article currently only mentions differences in taste
between what it calls the "european school" and "US school" of language design, which I
don't believe is a reasonable thing to say. Nonetheless, as I wasn't able to find any decent
references to back that statement, I refrained from adding it to the article. Could you
please do the honors?

I've updated the example code.


Rui Maciel
 

Michael Doubez

I've just added a small C++ implementation example for a table-based LL parser to wikipedia's
article on LL parsers.  It would be great if the code was reviewed by more knowledgeable and
experienced people, as I've added a couple of crude hacks to it.  

So, if anyone is willing to take a crack at it then please take a look at:

http://en.wikipedia.org/wiki/LL_parser#Parser_implementation_in_C.2B.2B

A few points:
- the output of lexer() is not tested against TS_INVALID, which is a
shame because one of the advantages of LL is the availability of the
context upon error. You could write a general error function:

void parse_error(std::ostream& os, const char* reason,
                 const char* p, const char* str)
{
    os << "Parse failed (" << reason << ") on '" << *p
       << "' at index " << size_t(p - str) << std::endl;
}

const Symbols next_symbol = lexer(*p);
if (next_symbol == TS_INVALID)
{
    parse_error(std::cerr, "invalid token", p, argv[1]);
    return -1;
}
// use next_symbol hereafter instead of lexer(*p)
- You don't need to repeat the enum keyword in declarations like in C
- You are using a logical shortcut in your test:

if (lexer(*p) == ss.top())
{
    // ...
}
else
{
    // ...
}

which in fact is

if (isTerminal(ss.top()))
{
    if (lexer(*p) == ss.top())
    {
        // ...
    }
    else
    {
        // does not match
        // ...
    }
}
else
{
    // generate rule
    // ...
}
- the loop while (ss.size() > 0) can be rewritten as
while (!ss.empty())
- the map of maps is not very intuitive; since this is an example
program, you could directly use an int[NB_SYMBOL][NB_SYMBOL], with
NB_SYMBOL being the last enum value. You could also make an enum for the
rules.
 
