lexical analysis of python

R

robert.muller2

I am trying to implement a lexer and parser for a subset of python
using lexer and parser generators. (It doesn't matter, but I happen to
be using
ocamllex and ocamlyacc). I've run into the following annoying problem
and hoping someone can tell me what I'm missing. Lexers generated by
such tools return a tokens in a stream as they consume the input text.
But python's indentation appears to require interruption of that
stream. For example, in:
def f(x):
statement1;
statement2;
statement3;
statement4;
A

Between the '\n' at the end of statement4 and the A, a lexer for
Python should return 2 DEDENT tokens. But there is no way to interject
two DEDENT tokens within the token stream between the tokens for
NEWLINE and A. The generated lexer doesn't have anyway to freeze the
input text pointer.

Does this mean that python lexers are all written by hand? If not, how
do you do it using your favorite lexer generator?

Thanks!

Bob Muller
 
P

Paul McGuire

I am trying to implement a lexer and parser for a subset of python
using lexer and parser generators. (It doesn't matter, but I happen to
be using
ocamllex and ocamlyacc). I've run into the following annoying problem
and hoping someone can tell me what I'm missing. Lexers generated by
such tools return a tokens in a stream as they consume the input text.
But python's indentation appears to require interruption of that
stream. For example, in:
def f(x):
        statement1;
        statement2;
              statement3;
              statement4;
A

Between the '\n' at the end of statement4 and the A, a lexer for
Python should return 2 DEDENT tokens. But there is no way to interject
two DEDENT tokens within the token stream between the tokens for
NEWLINE and A.  The generated lexer doesn't have anyway to freeze the
input text pointer.

Does this mean that python lexers are all written by hand? If not, how
do you do it using your favorite lexer generator?

Thanks!

Bob Muller

In pyparsing's indentedBlock expression/helper, I keep a stack of
column numbers representing indent levels. When the indent level of a
line is less than the column number at the top of the stack, I count
one DEDENT for each level that I need to pop off the stack before I
get the new indent column. If I get a column number less than the
indent column, then I know that this is an illegal indent (doesn't
line up with previous indent). Also, when computing the column
number, be wary of tab handling.

-- Paul
 
R

robert.muller2

In pyparsing's indentedBlock expression/helper, I keep a stack of
column numbers representing indent levels.  When the indent level of a
line is less than the column number at the top of the stack, I count
one DEDENT for each level that I need to pop off the stack before I
get the new indent column.  If I get a column number less than the
indent column, then I know that this is an illegal indent (doesn't
line up with previous indent).  Also, when computing the column
number, be wary of tab handling.

-- Paul

Thank you Paul. I am also using the same stack as suggested in the
documentation:

http://docs.python.org/reference/lexical_analysis.html

I understand the method, but when you say you "count one DEDENT for
each level"
well lets say you counted 3 of them. Do you have a way to interject 3
consecutive
DEDENT tokens into the token stream so that the parser receives them
before it
receives the next real token?

Thanks much!

Bob Muller
 
P

Paul McGuire

I understand the method, but when you say you "count one DEDENT for
each level"
well lets say you counted 3 of them. Do you have a way to interject 3
consecutive
DEDENT tokens into the token stream so that the parser receives them
before it
receives the next real token?

Pyparsing makes *heavy* use of the program stack for keeping the
current parsing state in local variables. By the time I am 3 levels
deep in indentation, I am also nested a corresponding depth in the
program stack. The indent stack is kept separately, as a global var.
Each INDENT causes a recursive nested call. When I DEDENT, I unwind
only one level, and return from the corresponding INDENT - at this
time I can push a DEDENT token on the return stack (it so happens I
*don't* do this by default, I just close off the current statement
group - but a user could define a parse action/callback to push DEDENT
tokens). Then the next INDENT checks the indent stack, sees that it
too has dedented, and another DEDENT token gets pushed, and so on. So
the key, I guess, is that there is no iterative popping of indent
levels from the indent stack, each recursive INDENT/DEDENT handler
push/pops its own level, pushing INDENT and DEDENT values
appropriately.

Just FYI, as I said, pyparsing *doesn't* push explicit INDENT/DEDENT
tokens. Instead it returns a nested list of lists representing the
corresponding structure of program statements. Here is a similar
treatment for an expression of nested parentheses:

print pyparsing.nestedExpr().parseString("(a b (c d e)(f g)h (i(j)))")

Prints:
[['a', 'b', ['c', 'd', 'e'], ['f', 'g'], 'h', ['i', ['j']]]]

(Note - this string representation looks like a normal Python list,
but parseString returns a ParseResults object, a rich results
structure which supports list, dict, and object attribute access
methods.)

pyparsing does a sort of "mixed mode" of simultaneous lexing and
parsing, which deviates from traditional lex/yacc-like separation of
duties.

-- Paul
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,479
Members
44,900
Latest member
Nell636132

Latest Threads

Top