Using PLY

Maurice LING

Hi,

I know that PLY lex is able to do line counting. I am wondering if there
is a way to count the occurrences of each keyword (token) in a given file?
For example, how many IF tokens, etc.?

Thanks
Maurice
 
Bengt Richter

Hi,

I know that PLY lex is able to do line counting. I am wondering if there
is a way to count the occurrences of each keyword (token) in a given file?
For example, how many IF tokens, etc.?
>>> import tokenize, StringIO
>>> src = StringIO.StringIO("""
... if a: foo()
... elif b: bar()
... if c: baz()
... """)
>>> sum([1 for t in tokenize.generate_tokens(src.readline) if t[1]=='if'])
2

That generates an intermediate list with a 1 for each 'if', but it's not a big
price to pay IMO.
If you have a file in the current working directory, e.g., foo.py, substitute

src = file('foo.py')

or do it in one line, like (untested):

sum([1 for t in tokenize.generate_tokens(file('foo.py').readline) if t[1]=='if'])

generate_tokens returns a generator that yields tuples, e.g. for the above:

Rewind src:

>>> src.seek(0)

Get the generator:

>>> tg = tokenize.generate_tokens(src.readline)

Manually get a couple of examples:

>>> tg.next()
(53, '\n', (1, 0), (1, 1), '\n')
>>> tg.next()
(1, 'if', (2, 0), (2, 2), 'if a: foo()\n')

Rewind the StringIO object to start again:

>>> src.seek(0)

Show all the token tuples:

>>> for t in tokenize.generate_tokens(src.readline): print t
...
(53, '\n', (1, 0), (1, 1), '\n')
(1, 'if', (2, 0), (2, 2), 'if a: foo()\n')
(1, 'a', (2, 3), (2, 4), 'if a: foo()\n')
(50, ':', (2, 4), (2, 5), 'if a: foo()\n')
(1, 'foo', (2, 6), (2, 9), 'if a: foo()\n')
(50, '(', (2, 9), (2, 10), 'if a: foo()\n')
(50, ')', (2, 10), (2, 11), 'if a: foo()\n')
(4, '\n', (2, 11), (2, 12), 'if a: foo()\n')
(1, 'elif', (3, 0), (3, 4), 'elif b: bar()\n')
(1, 'b', (3, 5), (3, 6), 'elif b: bar()\n')
(50, ':', (3, 6), (3, 7), 'elif b: bar()\n')
(1, 'bar', (3, 8), (3, 11), 'elif b: bar()\n')
(50, '(', (3, 11), (3, 12), 'elif b: bar()\n')
(50, ')', (3, 12), (3, 13), 'elif b: bar()\n')
(4, '\n', (3, 13), (3, 14), 'elif b: bar()\n')
(1, 'if', (4, 0), (4, 2), 'if c: baz()\n')
(1, 'c', (4, 3), (4, 4), 'if c: baz()\n')
(50, ':', (4, 4), (4, 5), 'if c: baz()\n')
(1, 'baz', (4, 6), (4, 9), 'if c: baz()\n')
(50, '(', (4, 9), (4, 10), 'if c: baz()\n')
(50, ')', (4, 10), (4, 11), 'if c: baz()\n')
(4, '\n', (4, 11), (4, 12), 'if c: baz()\n')
(0, '', (5, 0), (5, 0), '')
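
Generalising that one-liner, here is a minimal sketch in the same spirit
(untested) that tallies every Python keyword in a file rather than just 'if',
using the stdlib keyword module; count_keywords is a made-up helper name:

import tokenize
import keyword

def count_keywords(filename):
    # Map each keyword appearing in the file to its number of occurrences.
    counts = {}
    f = open(filename)
    try:
        for tok in tokenize.generate_tokens(f.readline):
            text = tok[1]                       # the token's source text
            if keyword.iskeyword(text):
                counts[text] = counts.get(text, 0) + 1
    finally:
        f.close()
    return counts

For the StringIO source above, count_keywords would give {'if': 2, 'elif': 1}.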

HTH

Regards,
Bengt Richter
 
huy

Maurice said:
Hi,

I know that PLY lex is able to do line counting. I am wondering if there
is a way to count the occurrences of each keyword (token) in a given file?
For example, how many IF tokens, etc.?

Thanks
Maurice


PLY can do much more than line counting. Build an AST, then just count
your IF tokens. If you can use PLY to recognise your IF tokens, then you
can easily count them.

It's kind of vague what you are wanting to do. Is it source code you are
parsing, or text with keywords?

Huy
 
Lonnie Princehouse

import tokenize

The tokenize module would definitely be simpler if it's Python code
that he happens to be parsing. If it's not Python code, then there's
still a reason to use PLY.

------------------------------------------

Here's a kludgy but quick solution- modify the LexToken class in
lex.py to keep track of number of type occurences.

class LexToken(object):                  # change to new-style class
    type_count = {}                      # store the counts here
    def __setattr__(self, key, value):
        if key == 'type':
            # when the type attribute is assigned, increment its counter
            if value not in self.type_count:
                self.type_count[value] = 1
            else:
                self.type_count[value] += 1
        object.__setattr__(self, key, value)

    # ... and proceed with the original definition of LexToken

    def __str__(self):
        return "LexToken(%s,%r,%d)" % (self.type, self.value, self.lineno)
    def __repr__(self):
        return str(self)
    def skip(self, n):
        try:
            self._skipn += n
        except AttributeError:
            self._skipn = n
-----------------------------------------

After you've run the lexer, lex.LexToken.type_count will then contain the
number of occurrences of each token type.
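
Why this works: attribute assignment on a new-style class instance always
routes through __setattr__, so each time lex.py builds a token and assigns
tok.type = ..., the counter is bumped. A standalone demo of just that
mechanism (Demo is a made-up class, not PLY code):

class Demo(object):
    type_count = {}
    def __setattr__(self, key, value):
        if key == 'type':
            # tally every assignment to the 'type' attribute
            self.type_count[value] = self.type_count.get(value, 0) + 1
        object.__setattr__(self, key, value)

d = Demo()
d.type = 'IF'           # routed through __setattr__: {'IF': 1}
d.type = 'IF'           # {'IF': 2}
print(Demo.type_count)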

-----------------------------------------

(Caveats- 1. I haven't tested this code. 2. I've got PLY 1.3;
syntax may have changed in newer versions. In fact, I hope it's
changed; while PLY works very well, its usage could be way more
pythonic)
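
A less invasive variant, if patching lex.py feels too kludgy: drive the lexer
yourself and tally token types from outside. A sketch (untested; the toy token
rules are made up for illustration, and it assumes a PLY version whose
lex.lex() returns a lexer object with input()/token() methods):

import ply.lex as lex

reserved = {'if': 'IF'}                      # keyword text -> token type
tokens = ['NAME'] + list(reserved.values())
t_ignore = ' \t\n'

def t_NAME(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'NAME')   # keywords override NAME
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("if a if b")
counts = {}
while True:
    tok = lexer.token()                      # returns None at end of input
    if tok is None:
        break
    counts[tok.type] = counts.get(tok.type, 0) + 1
print(counts)                                # {'IF': 2, 'NAME': 2}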
 
Maurice LING

The tokenize module would definitely be simpler if it's Python code
that he happens to be parsing. If it's not Python code, then there's
still a reason to use PLY.

Thanks. I'm definitely not parsing Python code, so that is a good
reason to use PLY.

Another thing that I am quite puzzled by is the yacc part of PLY. Most
of the examples show calculators, where the yacc part does the
calculations, such as:

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

this is a bad example, I know. But how do I get it to output some
intermediate representations, like an AST, or intermediate code
(bytecode-like)?

Is

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]
    print "byte_x" + p[0]

or something like this legal?

I hope that I am clear about what I am trying to say.

Thanks in advance
Maurice
 
Maurice LING

PLY can do much more than line counting. Build an AST, then just count
your IF tokens. If you can use PLY to recognise your IF tokens, then you
can easily count them.

How do I build an AST with PLY? I'm trying to find some examples of that
but have been unsuccessful.
It's kind of vague what you are wanting to do. Is it source code you are
parsing, or text with keywords?

I'm trying to parse what looks like 4GL source code.
 
Maurice LING

Here's a kludgy but quick solution- modify the LexToken class in
lex.py to keep track of number of type occurences.

class LexToken(object):                  # change to new-style class
    type_count = {}                      # store the counts here
    def __setattr__(self, key, value):
        if key == 'type':
            # when the type attribute is assigned, increment its counter
            if value not in self.type_count:
                self.type_count[value] = 1
            else:
                self.type_count[value] += 1
        object.__setattr__(self, key, value)

    # ... and proceed with the original definition of LexToken

    def __str__(self):
        return "LexToken(%s,%r,%d)" % (self.type, self.value, self.lineno)
    def __repr__(self):
        return str(self)
    def skip(self, n):
        try:
            self._skipn += n
        except AttributeError:
            self._skipn = n
-----------------------------------------

After you've run the lexer, lex.LexToken.type_count will then contain the
number of occurrences of each token type.

-----------------------------------------

(Caveats- 1. I haven't tested this code. 2. I've got PLY 1.3;
syntax may have changed in newer versions. In fact, I hope it's
changed; while PLY works very well, its usage could be way more
pythonic)

I may be an idiot here, but I don't quite see how LexToken.__setattr__ is
called. There seems to be a gap in my logic.

Please assist.

Thanks
Maurice
 
Michael Sparks

Maurice LING wrote:
....
Another thing that I am quite puzzled by is the yacc part of PLY. Most
of the examples show calculators, where the yacc part does the
calculations, such as:

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

this is a bad example, I know.

Simple examples of lex/yacc type things tend to have this though.
But how do I get it to output some
intermediate representations, like an AST, or intermediate code
(bytecode-like)?

Is

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]
    print "byte_x" + p[0]

or something like this legal?

It's legal, but probably not what you want.

Normally you have Lex --(token) --> Parse --(AST)--> Something Interesting.

If Something Interesting is simple, you can do it directly in the parse rules
instead of building an AST, which is what the examples do.

If you wanted to modify the example/calc/calc.py in the PLY distribution to
return an AST to play with, you would change its rules to store the parsed
structure rather than do the work. Taking the route of minimal change to
try and make it obvious what I've changed:

def p_statement_assign(p):
    'statement : NAME EQUALS expression'
    p[0] = [ "assignment", p[1], p[3] ]        # names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    p[0] = [ "expr_statement", p[1] ]          # print p[1]

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    p[0] = [ "binop_expr", p[2], p[1], p[3] ]  # long if/elif evaluation

def p_expression_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = [ "uminus_expr", p[2] ]             # p[0] = -p[2]

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = [ "expression", p[2] ]              # p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = [ "number", p[1] ]                  # p[0] = p[1]

def p_expression_name(p):
    'expression : NAME'
    p[0] = [ "name", p[1] ]                    # p[0] = names[p[1]], with error handling

A sample AST this could generate would be:

[ "assignment",
["name", "BOB" ],
["expression",
["binop_expr",
"*",
["number", 7],
["number", 9]
]
]
]

In example/calc/calc.py this value would be returned here:

while 1:
    try:
        s = raw_input('calc > ')
    except EOFError:
        break
    AST = yacc.parse(s)    #### <------ HERE!

(NB, slight change to the line marked ####)

This is a very boring, not very interesting, not that great AST, but it should
hopefully get you started. You should be able to see that by traversing
this tree you could get the same result as the original code, or could spit
out code that performs this functionality. Often it's nice to have some
simplification of the tree as well, since this sort of thing can be rather
unwieldy for realistic languages.
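
To make "traversing this tree" concrete, here is a sketch of a walker that
evaluates the list-based AST above (evaluate is a made-up name; the node tags
are the ones used in the rules):

def evaluate(node, names):
    tag = node[0]
    if tag == "assignment":
        names[node[1]] = evaluate(node[2], names)  # node[1] is the bare NAME
    elif tag == "expr_statement":
        print(evaluate(node[1], names))
    elif tag == "binop_expr":
        left = evaluate(node[2], names)
        right = evaluate(node[3], names)
        if node[1] == '+': return left + right
        if node[1] == '-': return left - right
        if node[1] == '*': return left * right
        if node[1] == '/': return left / right
    elif tag == "uminus_expr":
        return -evaluate(node[1], names)
    elif tag == "expression":
        return evaluate(node[1], names)
    elif tag == "number":
        return node[1]
    elif tag == "name":
        return names[node[1]]

names = {}
evaluate(["assignment", "BOB",
          ["expression",
           ["binop_expr", "*", ["number", 7], ["number", 9]]]], names)
print(names)                                       # {'BOB': 63}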

It's also worth noting that the calc.py example is also very toy in that it
matches single lines using the parser rather than collections of lines (i.e.
the parser has no conception of a piece of code containing more than one
statement).

I'm trying to parse what looks like 4GL source code.

FWIW, start small - start with matching the simplest expressions you can and
work forward from there (unless you're lucky enough to have an LALR(1) or
SLR(1) grammar for it suitable for PLY already). Test-first style coding
for grammars feels intuitively wrong, but seems to work really well in
practice - just make sure that after making every test work you check the
result in to CVS/your favourite version control system :)

One other tip you might find useful - rather than sending the lexer whole
files as PLY seems to expect, do line handling yourself and send it lines
instead - it works much more like Flex/lex that way.
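
A sketch of that line-at-a-time feeding (it assumes the token rules are
defined as usual before lex.lex() is called; process() and 'program.4gl'
are placeholders):

import ply.lex as lex

lexer = lex.lex()                    # token rules assumed to be defined
for line in open('program.4gl'):
    lexer.input(line)
    while True:
        tok = lexer.token()
        if tok is None:              # exhausted this line's tokens
            break
        process(tok)                 # placeholder: consume the token
    lexer.lineno += 1                # do the line counting yourself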

Regards,


Michael.
 
Maurice LING

Here's a kludgy but quick solution- modify the LexToken class in
lex.py to keep track of number of type occurences.

class LexToken(object):                  # change to new-style class
    type_count = {}                      # store the counts here
    def __setattr__(self, key, value):
        if key == 'type':
            # when the type attribute is assigned, increment its counter
            if value not in self.type_count:
                self.type_count[value] = 1
            else:
                self.type_count[value] += 1
        object.__setattr__(self, key, value)

    # ... and proceed with the original definition of LexToken

    def __str__(self):
        return "LexToken(%s,%r,%d)" % (self.type, self.value, self.lineno)
    def __repr__(self):
        return str(self)
    def skip(self, n):
        try:
            self._skipn += n
        except AttributeError:
            self._skipn = n
-----------------------------------------

After you've run the lexer, lex.LexToken.type_count will then contain the
number of occurrences of each token type.

-----------------------------------------

(Caveats- 1. I haven't tested this code. 2. I've got PLY 1.3;
syntax may have changed in newer versions. In fact, I hope it's
changed; while PLY works very well, its usage could be way more
pythonic)

Thank you, it works well. I think this should be included in the next
release.

I am able to do a "print lex.LexToken.type_count" after each token, and it
does show the incrementing count for each token type, except for input
matched by t_ignore.

Thanks again
maurice
 
Maurice LING

Michael said:
Maurice LING wrote:
...
Another thing that I am quite puzzled by is the yacc part of PLY. Most
of the examples show calculators, where the yacc part does the
calculations, such as:

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]

this is a bad example, I know.


Simple examples of lex/yacc type things tend to have this though.

But how do I get it to output some
intermediate representations, like an AST, or intermediate code
(bytecode-like)?

Is

def p_expression_group(self, p):
    'expression : LPAREN expression RPAREN'
    p[0] = p[2]
    print "byte_x" + p[0]

or something like this legal?


It's legal, but probably not what you want.

Normally you have Lex --(token) --> Parse --(AST)--> Something Interesting.

If Something Interesting is simple, you can do it directly in the parse rules
instead of building an AST, which is what the examples do.

If you wanted to modify the example/calc/calc.py in the PLY distribution to
return an AST to play with, you would change its rules to store the parsed
structure rather than do the work. Taking the route of minimal change to
try and make it obvious what I've changed:

def p_statement_assign(p):
    'statement : NAME EQUALS expression'
    p[0] = [ "assignment", p[1], p[3] ]        # names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    p[0] = [ "expr_statement", p[1] ]          # print p[1]

def p_expression_binop(p):
    '''expression : expression PLUS expression
                  | expression MINUS expression
                  | expression TIMES expression
                  | expression DIVIDE expression'''
    p[0] = [ "binop_expr", p[2], p[1], p[3] ]  # long if/elif evaluation

def p_expression_uminus(p):
    'expression : MINUS expression %prec UMINUS'
    p[0] = [ "uminus_expr", p[2] ]             # p[0] = -p[2]

def p_expression_group(p):
    'expression : LPAREN expression RPAREN'
    p[0] = [ "expression", p[2] ]              # p[0] = p[2]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = [ "number", p[1] ]                  # p[0] = p[1]

def p_expression_name(p):
    'expression : NAME'
    p[0] = [ "name", p[1] ]                    # p[0] = names[p[1]], with error handling

A sample AST this could generate would be:

[ "assignment",
["name", "BOB" ],
["expression",
["binop_expr",
"*",
["number", 7],
["number", 9]
]
]
]

In example/calc/calc.py this value would be returned here:

while 1:
    try:
        s = raw_input('calc > ')
    except EOFError:
        break
    AST = yacc.parse(s)    #### <------ HERE!

(NB, slight change to the line marked ####)

This is a very boring, not very interesting, not that great AST, but it should
hopefully get you started. You should be able to see that by traversing
this tree you could get the same result as the original code, or could spit
out code that performs this functionality. Often it's nice to have some
simplification of the tree as well, since this sort of thing can be rather
unwieldy for realistic languages.

It's also worth noting that the calc.py example is also very toy in that it
matches single lines using the parser rather than collections of lines (i.e.
the parser has no conception of a piece of code containing more than one
statement).

I'm trying to parse what looks like 4GL source code.


FWIW, start small - start with matching the simplest expressions you can and
work forward from there (unless you're lucky enough to have an LALR(1) or
SLR(1) grammar for it suitable for PLY already). Test-first style coding
for grammars feels intuitively wrong, but seems to work really well in
practice - just make sure that after making every test work you check the
result in to CVS/your favourite version control system :)

I've worked out my grammar in BNF, so I hope it is context-free.
One other tip you might find useful - rather than sending the lexer whole
files as PLY seems to expect, do line handling yourself and send it lines
instead - it works much more like Flex/lex that way.

Regards,


Michael.


Thank you, this really helped my understanding.

maurice
 
