Could this expression parser be more 'Pythonic'?

Amr · May 6, 2009

Hello all,

I've been spending the last few weeks learning Python, and I've just
started to use it to write a simple BASIC compiler. I'm writing a
mathematical expression parser and wrote a function that would take a
string and split it into high level tokens.

The code can be found at http://pastebin.com/f757252ec (have a look at
the function tokenize).

I was interested to see if the tokenize function could be written in a
better way, so as to make better use of the Python language and
libraries. One thing that sprung to mind was whether it would be a
good idea to perhaps take advantage of regular expressions. I had a
look at a few compiler sites and regexps seem to come into that role
quite a bit. I've only done a little work on regexps and they seem
very powerful, but can easily get out of control!

So, any suggestions as to how to make better use of the language
writing something like this?

Thanks,

--Amr

John Machin · May 6, 2009

> Hello all,
>
> I've been spending the last few weeks learning Python, and I've just
> started to use it to write a simple BASIC compiler. I'm writing a
> mathematical expression parser and wrote a function that would take a
> string and split it into high level tokens.
>
> The code can be found at http://pastebin.com/f757252ec (have a look at
> the function tokenize).

It's always a good idea to describe to yourself what you are actually
trying to achieve before you get too embroiled in the implementation
details.

Adding a few more tests like this:

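if __name__ == '__main__':
    # One test expression per line. The indented closing quotes leave a
    # final whitespace-only line, which strips to the empty-string test.
    tests = """\
        1 + 2 * 3
        (1 + 2) * 3
        1 + (2 * 3)
        !@#$%^
        foo = bar + zot
        foo=bar+zot
        xyzzy(bar, zot + 2.34)
        """
    for test in tests.splitlines():
        test = test.strip()
        # Expression is the class from the pastebin code linked above.
        e = Expression(test)
        result = e.tokenize(e.expression)
        print("%r => %r" % (test, result))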

produces this:
'1 + 2 * 3' => ['1', '+', '2', '*', '3']
'(1 + 2) * 3' => ['1 + 2', '*', '3']
'1 + (2 * 3)' => ['1', '+', '2 * 3']
'!@#$%^' => ['!@#$%^']
'foo = bar + zot' => ['foo', '=', 'bar', '+', 'zot']
'foo=bar+zot' => ['foo=bar', '+', 'zot']
'xyzzy(bar, zot + 2.34)' => ['xyzzy', 'bar, zot + 2.34']
'' => []

which indicates that your notions of what a token is and what you
might use them for must be rather "interesting".

Most other folks' tokenisers would have no concept of "bracket depth",
would not throw the parentheses away, and would return not only the
tokens but a classification as well e.g.

'xyzzy' ID
'(' LPAREN
'bar' ID
',' LISTSEP
'zot' ID
'+' OPER
'2.34' FLOAT_CONST
')' RPAREN

> I was interested to see if the tokenize function could be written in a
> better way, so as to make better use of the Python language and
> libraries. One thing that sprung to mind was whether it would be a
> good idea to perhaps take advantage of regular expressions. I had a
> look at a few compiler sites and regexps seem to come into that role
> quite a bit. I've only done a little work on regexps and they seem
> very powerful, but can easily get out of control!
>
> So, any suggestions as to how to make better use of the language
> writing something like this?

There are various packages like pyparsing, plex, ply, ... which are
all worth looking at. However if you want to get stuck into the
details, you can lash up a quick and dirty lexer (tokeniser) using
regexps in no time flat, e.g.

an ID is matched by r"[A-Za-z_][A-Za-z0-9_]*"
an INT_CONST is matched by r"[0-9]+"
an OPER is matched by r"[-+/*]"
etc
so you set up a (longish) re containing all the alternatives
surrounded by capturing parentheses and separated by "|".

r"([A-Za-z_][A-Za-z0-9_]*)|([0-9]+)|([-+/*])etc"

Hint: don't do it manually, use ")|(".join()
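
A minimal sketch of that hint. The TOKEN_SPECS table and its names are
hypothetical, made up for illustration; only the three patterns come
from the list above.

```python
import re

# Hypothetical token table: (classification, pattern) pairs.
TOKEN_SPECS = [
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("INT_CONST", r"[0-9]+"),
    ("OPER", r"[-+/*]"),
]

# ")|(".join() puts the separators between the alternatives; the outer
# "(" and ")" close off the first and last capturing groups.
pattern = "(" + ")|(".join(p for _, p in TOKEN_SPECS) + ")"
token_re = re.compile(pattern)

print(pattern)
```

Keeping the patterns in a table means that adding a token type is a
one-line change, and the group numbering follows the table order
automatically.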

and you just step through your input string:
1. skip over any leading whitespace
2. look for a match of your re starting at current offset
3. matchobj.lastindex gives you 1 if it's an ID, 2 if it's an
INT_CONST, etc (group numbers start at 1, not 0)
4. save the slice that represents the token
5. adjust offset, loop around until you hit the end of the input
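
The steps above can be sketched roughly as follows. This is one
possible arrangement, not the only one: the TOKEN_SPECS table, the
tokenize function and the FLOAT_CONST, LPAREN, RPAREN and LISTSEP
patterns are assumptions added for illustration, chosen to reproduce
the classification listing shown earlier.

```python
import re

# Hypothetical token table, in priority order. Group numbers start at 1,
# so entry i corresponds to matchobj.lastindex == i + 1.
TOKEN_SPECS = [
    ("FLOAT_CONST", r"[0-9]+\.[0-9]+"),  # must come before INT_CONST
    ("INT_CONST", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPER", r"[-+/*]"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("LISTSEP", r","),
]
TOKEN_RE = re.compile("(" + ")|(".join(p for _, p in TOKEN_SPECS) + ")")
WHITESPACE_RE = re.compile(r"\s*")


def tokenize(text):
    """Return a list of (token_text, classification) pairs."""
    tokens = []
    pos = 0
    while True:
        # 1. skip over any leading whitespace
        pos = WHITESPACE_RE.match(text, pos).end()
        if pos >= len(text):
            break
        # 2. look for a match starting at the current offset
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at offset %d"
                             % (text[pos], pos))
        # 3. lastindex says which alternative matched
        kind = TOKEN_SPECS[m.lastindex - 1][0]
        # 4. save the slice that represents the token
        tokens.append((m.group(), kind))
        # 5. adjust the offset and loop
        pos = m.end()
    return tokens


for token, kind in tokenize("xyzzy(bar, zot + 2.34)"):
    print("%r %s" % (token, kind))
```

Input that matches none of the alternatives, such as '!@#$%^', raises
ValueError here instead of coming back as a single token.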

Hint: the order in which you list the alternatives can matter a lot
e.g. you better have "<=" before "<" otherwise it would regard "<=" as
comprising two tokens "<" and "="
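
A small demonstration of the ordering problem. The two patterns are
made up for this example; the point is only that the alternatives are
tried left to right and the first one that matches wins.

```python
import re

# Illustrative patterns only: same alternatives, different order.
wrong_order = re.compile(r"(<)|(<=)|(=)")
right_order = re.compile(r"(<=)|(<)|(=)")

print(wrong_order.match("<=").group())  # '<'  (the '=' is left for the next token)
print(right_order.match("<=").group())  # '<='
```

One way to avoid the problem, assuming the operators are plain literal
strings, is to sort them longest first before joining them.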

HTH,
John

Amr · May 7, 2009

Hi John,

Thanks for the tips, I will check them out.

--Amr
