Using PLY

Maurice LING

Hi,

I know that PLY lex is able to do line counting. I am wondering if there
is a way to count the occurrences of each keyword (token) in a given file?
For example, how many IF tokens, etc.?

Thanks
Maurice
 
Bengt Richter

Hi,

I know that PLY lex is able to do line counting. I am wondering if there
is a way to count the occurrences of each keyword (token) in a given file?
For example, how many IF tokens, etc.?
>>> import tokenize, StringIO
>>> src = StringIO.StringIO("""
... if a: foo()
... elif b: bar()
... if c: baz()
... """)
>>> sum([1 for t in tokenize.generate_tokens(src.readline) if t[1]=='if'])
2

That generates an intermediate list with a 1 for each 'if', but it's not a big
price to pay IMO.
If you have a file in the current working directory, e.g., foo.py, substitute

src = file('foo.py')

or do it in one line, like (untested):

sum([1 for t in tokenize.generate_tokens(file('foo.py').readline) if t[1]=='if'])

generate_tokens returns a generator that yields tuples, e.g. for the above:

Rewind src:

>>> src.seek(0)

Get the generator:

>>> tg = tokenize.generate_tokens(src.readline)

Manually get a couple of examples:

>>> tg.next()
(53, '\n', (1, 0), (1, 1), '\n')
>>> tg.next()
(1, 'if', (2, 0), (2, 2), 'if a: foo()\n')

Rewind the StringIO object to start again:

>>> src.seek(0)

Show all the token tuples:

>>> for t in tokenize.generate_tokens(src.readline): print t
...
(53, '\n', (1, 0), (1, 1), '\n')
(1, 'if', (2, 0), (2, 2), 'if a: foo()\n')
(1, 'a', (2, 3), (2, 4), 'if a: foo()\n')
(50, ':', (2, 4), (2, 5), 'if a: foo()\n')
(1, 'foo', (2, 6), (2, 9), 'if a: foo()\n')
(50, '(', (2, 9), (2, 10), 'if a: foo()\n')
(50, ')', (2, 10), (2, 11), 'if a: foo()\n')
(4, '\n', (2, 11), (2, 12), 'if a: foo()\n')
(1, 'elif', (3, 0), (3, 4), 'elif b: bar()\n')
(1, 'b', (3, 5), (3, 6), 'elif b: bar()\n')
(50, ':', (3, 6), (3, 7), 'elif b: bar()\n')
(1, 'bar', (3, 8), (3, 11), 'elif b: bar()\n')
(50, '(', (3, 11), (3, 12), 'elif b: bar()\n')
(50, ')', (3, 12), (3, 13), 'elif b: bar()\n')
(4, '\n', (3, 13), (3, 14), 'elif b: bar()\n')
(1, 'if', (4, 0), (4, 2), 'if c: baz()\n')
(1, 'c', (4, 3), (4, 4), 'if c: baz()\n')
(50, ':', (4, 4), (4, 5), 'if c: baz()\n')
(1, 'baz', (4, 6), (4, 9), 'if c: baz()\n')
(50, '(', (4, 9), (4, 10), 'if c: baz()\n')
(50, ')', (4, 10), (4, 11), 'if c: baz()\n')
(4, '\n', (4, 11), (4, 12), 'if c: baz()\n')
(0, '', (5, 0), (5, 0), '')
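
Generalising that one-liner, here is a minimal sketch in the same spirit
(untested) that tallies every Python keyword in a file rather than just 'if',
using the stdlib keyword module; count_keywords is a made-up helper name:

import tokenize
import keyword

def count_keywords(filename):
    # Map each keyword appearing in the file to its number of occurrences.
    counts = {}
    f = open(filename)
    try:
        for tok in tokenize.generate_tokens(f.readline):
            text = tok[1]                       # the token's source text
            if keyword.iskeyword(text):
                counts[text] = counts.get(text, 0) + 1
    finally:
        f.close()
    return counts

For the StringIO source above, count_keywords would give {'if': 2, 'elif': 1}.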

HTH

Regards,
Bengt Richter
 
huy

Maurice said:
Hi,

I know that PLY lex is able to do line counting. I am wondering if there
is a way to count the occurrences of each keyword (token) in a given file?
For example, how many IF tokens, etc.?

Thanks
Maurice


PLY can do much more than line counting. Build an AST, then just count
your IF tokens. If you can use PLY to recognise your IF tokens, then you
can easily count them.

It's kind of vague what you are wanting to do. Is it source code you are
parsing, or text with keywords?

Huy
 
Lonnie Princehouse

import tokenize

The tokenize module would definitely be simpler if it's Python code
that he happens to be parsing. If it's not Python code, then there's
still a reason to use PLY.

------------------------------------------

Here's a kludgy but quick solution- modify the LexToken class in
lex.py to keep track of number of type occurences.

class LexToken(object):                  # change to new-style class
    type_count = {}                      # store the counts here
    def __setattr__(self, key, value):
        if key == 'type':
            # when the type attribute is assigned, increment its counter
            if value not in self.type_count:
                self.type_count[value] = 1
            else:
                self.type_count[value] += 1
        object.__setattr__(self, key, value)

    # ... and proceed with the original definition of LexToken

    def __str__(self):
        return "LexToken(%s,%r,%d)" % (self.type, self.value, self.lineno)
    def __repr__(self):
        return str(self)
    def skip(self, n):
        try:
            self._skipn += n
        except AttributeError:
            self._skipn = n
-----------------------------------------

After you've run the lexer, lex.LexToken.type_count will then contain the
number of occurrences of each token type.
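
Why this works: attribute assignment on a new-style class instance always
routes through __setattr__, so each time lex.py builds a token and assigns
tok.type = ..., the counter is bumped. A standalone demo of just that
mechanism (Demo is a made-up class, not PLY code):

class Demo(object):
    type_count = {}
    def __setattr__(self, key, value):
        if key == 'type':
            # tally every assignment to the 'type' attribute
            self.type_count[value] = self.type_count.get(value, 0) + 1
        object.__setattr__(self, key, value)

d = Demo()
d.type = 'IF'           # routed through __setattr__: {'IF': 1}
d.type = 'IF'           # {'IF': 2}
print(Demo.type_count)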

-----------------------------------------

(Caveats- 1. I haven't tested this code. 2. I've got PLY 1.3;
syntax may have changed in newer versions. In fact, I hope it's
changed; while PLY works very well, its usage could be way more
pythonic)
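
A less invasive variant, if patching lex.py feels too kludgy: drive the lexer
yourself and tally token types from outside. A sketch (untested; the toy token
rules are made up for illustration, and it assumes a PLY version whose
lex.lex() returns a lexer object with input()/token() methods):

import ply.lex as lex

reserved = {'if': 'IF'}                      # keyword text -> token type
tokens = ['NAME'] + list(reserved.values())
t_ignore = ' \t\n'

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'NAME')   # keywords override NAME
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("if a if b")
counts = {}
while True:
    tok = lexer.token()                      # returns None at end of input
    if tok is None:
        break
    counts[tok.type] = counts.get(tok.type, 0) + 1
print(counts)                                # {'IF': 2, 'NAME': 2}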
 
Maurice LING

The tokenize module would definitely be simpler if it's Python code
that he happens to be parsing. If it's not Python code, then there's
still a reason to use PLY.

Thanks. I'm definitely not parsing Python code, so that is a good
reason to use PLY.

Another thing that I am quite puzzled by is the yacc part of PLY. Most
of the examples show calculators, where the yacc part does the
calculations, such as:

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

this is a bad example, I know. But how do I get it to output some
intermediate representations, like an AST, or intermediate code
(bytecode-like)?

Is

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]
    print "byte_x" + p[0]

or something like this legal?

I hope that I am clear about what I am trying to say.

Thanks in advance
Maurice
 
Maurice LING

PLY can do much more than line counting. Build an AST, then just count
your IF tokens. If you can use PLY to recognise your IF tokens, then you
can easily count them.

How do I build an AST with PLY? I'm trying to find some examples of that
but have been unsuccessful.
It's kind of vague what you are wanting to do. Is it source code you are
parsing, or text with keywords?

I'm trying to parse what looks like 4GL source code.
 
Maurice LING

Here's a kludgy but quick solution- modify the LexToken class in
lex.py to keep track of number of type occurences.

class LexToken(object):                  # change to new-style class
    type_count = {}                      # store the counts here
    def __setattr__(self, key, value):
        if key == 'type':
            # when the type attribute is assigned, increment its counter
            if value not in self.type_count:
                self.type_count[value] = 1
            else:
                self.type_count[value] += 1
        object.__setattr__(self, key, value)

    # ... and proceed with the original definition of LexToken

    def __str__(self):
        return "LexToken(%s,%r,%d)" % (self.type, self.value, self.lineno)
    def __repr__(self):
        return str(self)
    def skip(self, n):
        try:
            self._skipn += n
        except AttributeError:
            self._skipn = n
-----------------------------------------

After you've run the lexer, lex.LexToken.type_count will then contain the
number of occurrences of each token type.

-----------------------------------------

(Caveats- 1. I haven't tested this code. 2. I've got PLY 1.3;
syntax may have changed in newer versions. In fact, I hope it's
changed; while PLY works very well, its usage could be way more
pythonic)

I may be an idiot here, but I don't quite see how LexToken.__setattr__ is
called. There seems to be a gap in my logic.

Please assist.

Thanks
Maurice
 
Michael Sparks

Maurice LING wrote:
....
Another thing that I am quite puzzled by is the yacc part of PLY. Most
of the examples show calculators, where the yacc part does the
calculations, such as:

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

this is a bad example, I know.

Simple examples of lex/yacc type things tend to have this though.
But how do I get it to output some
intermediate representations, like an AST, or intermediate code
(bytecode-like)?

Is

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]
    print "byte_x" + p[0]

or something like this legal?

It's legal, but probably not what you want.

Normally you have Lex --(token) --> Parse --(AST)--> Something Interesting.

If Something Interesting is simple, you can do it directly in the parse rules
instead of building an AST, which is what the examples do.

If you wanted to modify the example/calc/calc.py in the PLY distribution to
return an AST to play with, you would change its rules to store the parsed
structure rather than do the work. Taking the route of minimal change to
try and make it obvious what I've changed:

def p_statement_assign(p):
    'statement : NAME EQUALS expression'
    p[0] = [ "assignment", p[1], p[3] ]        # names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    p[0] = [ "expr_statement", p[1] ]          # print p[1]

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    p[0] = [ "binop_expr", p[2], p[1], p[3] ]  # long if/elif evaluation

def p_expression_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = [ "uminus_expr", p[2] ]             # p[0] = -p[2]

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = [ "expression", p[2] ]              # p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = [ "number", p[1] ]                  # p[0] = p[1]

def p_expression_name(p):
    'expression : NAME'
    p[0] = [ "name", p[1] ]                    # p[0] = names[p[1]], with error handling

A sample AST this could generate would be:

[ "assignment",
["name", "BOB" ],
["expression",
["binop_expr",
"*",
["number", 7],
["number", 9]
]
]
]

In example/calc/calc.py this value would be returned here:

while 1:
    try:
        s = raw_input('calc > ')
    except EOFError:
        break
    AST = yacc.parse(s)    #### <------ HERE!

(NB, slight change to the line marked ####)

This is a very boring, not very interesting, not that great AST, but it should
hopefully get you started. You should be able to see that by traversing
this tree you could get the same result as the original code, or could spit
out code that performs this functionality. Often it's nice to have some
simplification of the tree as well, since this sort of thing can be rather
unwieldy for realistic languages.
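
To make "traversing this tree" concrete, here is a sketch of a walker that
evaluates the list-based AST above (evaluate is a made-up name; the node tags
are the ones used in the rules):

def evaluate(node, names):
    tag = node[0]
    if tag == "assignment":
        names[node[1]] = evaluate(node[2], names)  # node[1] is the bare NAME
    elif tag == "expr_statement":
        print(evaluate(node[1], names))
    elif tag == "binop_expr":
        left = evaluate(node[2], names)
        right = evaluate(node[3], names)
        if node[1] == '+': return left + right
        if node[1] == '-': return left - right
        if node[1] == '*': return left * right
        if node[1] == '/': return left / right
    elif tag == "uminus_expr":
        return -evaluate(node[1], names)
    elif tag == "expression":
        return evaluate(node[1], names)
    elif tag == "number":
        return node[1]
    elif tag == "name":
        return names[node[1]]

names = {}
evaluate(["assignment", "BOB",
          ["expression",
           ["binop_expr", "*", ["number", 7], ["number", 9]]]], names)
print(names)                                       # {'BOB': 63}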

It's also worth noting that the calc.py example is also very toy in that it
matches single lines using the parser rather than collections of lines (i.e.
the parser has no conception of a piece of code containing more than one
statement).

I'm trying to parse what looks like 4GL source code.

FWIW, start small - start with matching the simplest expressions you can and
work forward from there (unless you're lucky enough to have an LALR(1) or
SLR(1) grammar for it suitable for PLY already). Test-first style coding
for grammars feels intuitively wrong, but seems to work really well in
practice - just make sure that after making every test work you check the
result in to CVS/your favourite version control system :)

One other tip you might find useful - rather than sending the lexer whole
files as PLY seems to expect, do line handling yourself and send it lines
instead - it works much more like Flex/lex that way.
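
A sketch of that line-at-a-time feeding (it assumes the token rules are
defined as usual before lex.lex() is called; process() and 'program.4gl'
are placeholders):

import ply.lex as lex

lexer = lex.lex()                    # token rules assumed to be defined
for line in open('program.4gl'):
    lexer.input(line)
    while True:
        tok = lexer.token()
        if tok is None:              # exhausted this line's tokens
            break
        process(tok)                 # placeholder: consume the token
    lexer.lineno += 1                # do the line counting yourself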

Regards,


Michael.
 
Maurice LING

Here's a kludgy but quick solution- modify the LexToken class in
lex.py to keep track of number of type occurences.

class LexToken(object):                  # change to new-style class
    type_count = {}                      # store the counts here
    def __setattr__(self, key, value):
        if key == 'type':
            # when the type attribute is assigned, increment its counter
            if value not in self.type_count:
                self.type_count[value] = 1
            else:
                self.type_count[value] += 1
        object.__setattr__(self, key, value)

    # ... and proceed with the original definition of LexToken

    def __str__(self):
        return "LexToken(%s,%r,%d)" % (self.type, self.value, self.lineno)
    def __repr__(self):
        return str(self)
    def skip(self, n):
        try:
            self._skipn += n
        except AttributeError:
            self._skipn = n
-----------------------------------------

After you've run the lexer, lex.LexToken.type_count will then contain the
number of occurrences of each token type.

-----------------------------------------

(Caveats- 1. I haven't tested this code. 2. I've got PLY 1.3;
syntax may have changed in newer versions. In fact, I hope it's
changed; while PLY works very well, its usage could be way more
pythonic)

Thank you, it works well. I think this should be included in the next
release.

I am able to do a "print lex.LexToken.type_count" after each token, and it
does show the incrementing count for each token type, except for input
matched by t_ignore.

Thanks again
maurice
 
Maurice LING

Michael said:
Maurice LING wrote:
...
Another thing that I am quite puzzled by is the yacc part of PLY. Most
of the examples show calculators, where the yacc part does the
calculations, such as:

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

this is a bad example, I know.


Simple examples of lex/yacc type things tend to have this though.

But how do I get it to output some
intermediate representations, like an AST, or intermediate code
(bytecode-like)?

Is

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]
    print "byte_x" + p[0]

or something like this legal?


It's legal, but probably not what you want.

Normally you have Lex --(token) --> Parse --(AST)--> Something Interesting.

If Something Interesting is simple, you can do it directly in the parse rules
instead of building an AST, which is what the examples do.

If you wanted to modify the example/calc/calc.py in the PLY distribution to
return an AST to play with, you would change its rules to store the parsed
structure rather than do the work. Taking the route of minimal change to
try and make it obvious what I've changed:

def p_statement_assign(p):
    'statement : NAME EQUALS expression'
    p[0] = [ "assignment", p[1], p[3] ]        # names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    p[0] = [ "expr_statement", p[1] ]          # print p[1]

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    p[0] = [ "binop_expr", p[2], p[1], p[3] ]  # long if/elif evaluation

def p_expression_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = [ "uminus_expr", p[2] ]             # p[0] = -p[2]

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = [ "expression", p[2] ]              # p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = [ "number", p[1] ]                  # p[0] = p[1]

def p_expression_name(p):
    'expression : NAME'
    p[0] = [ "name", p[1] ]                    # p[0] = names[p[1]], with error handling

A sample AST this could generate would be:

[ "assignment",
["name", "BOB" ],
["expression",
["binop_expr",
"*",
["number", 7],
["number", 9]
]
]
]

In example/calc/calc.py this value would be returned here:

while 1:
    try:
        s = raw_input('calc > ')
    except EOFError:
        break
    AST = yacc.parse(s)    #### <------ HERE!

(NB, slight change to the line marked ####)

This is a very boring, not very interesting, not that great AST, but it should
hopefully get you started. You should be able to see that by traversing
this tree you could get the same result as the original code, or could spit
out code that performs this functionality. Often it's nice to have some
simplification of the tree as well, since this sort of thing can be rather
unwieldy for realistic languages.

It's also worth noting that the calc.py example is also very toy in that it
matches single lines using the parser rather than collections of lines (i.e.
the parser has no conception of a piece of code containing more than one
statement).

I'm trying to parse what looks like 4GL source code.


FWIW, start small - start with matching the simplest expressions you can and
work forward from there (unless you're lucky enough to have an LALR(1) or
SLR(1) grammar for it suitable for PLY already). Test-first style coding
for grammars feels intuitively wrong, but seems to work really well in
practice - just make sure that after making every test work you check the
result in to CVS/your favourite version control system :)

I've worked out my grammar in BNF, so I hope it is context-free.
One other tip you might find useful - rather than sending the lexer whole
files as PLY seems to expect, do line handling yourself and send it lines
instead - it works much more like Flex/lex that way.

Regards,


Michael.


Thank you, this really helped my understanding.

maurice
 
