Simple eval


Tor Erik Sønvisen

Hi,

A while ago I asked a question on the list about a simple eval
function, capable of eval'ing simple Python constructs (tuples, dicts,
lists, strings, numbers, etc.) in a secure manner:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/58a01273441d445f/
From the answers I got I chose to use simplejson... However, I was
also pointed to a simple eval function by Fredrik Lundh:
http://effbot.org/zone/simple-iterator-parser.htm. His solution, using
the tokenize module, was short and elegant. So I used his code as a
starting point for simple evaluation of dicts, tuples, lists, strings,
unicode strings, integers, floats, None, True, and False. I've
included the code below, together with some basic tests and
profiling... On my computer (Windows XP, Python 2.5), simple_eval is
about 5 times slower than the builtin eval...

Comments, speedups, improvements in general, etc are appreciated. As
this is a contribution to the community I suggest that any
improvements are posted in this thread...

-Tor Erik

Code (tested on 2.5, but should work for versions >= 2.3):

'''
Recursive evaluation of:
tuples, lists, dicts, strings, unicode strings, ints, floats,
True, False, and None
'''

import cStringIO, tokenize, itertools

KEYWORDS = {'None': None, 'False': False, 'True': True}

def atom(next, token):
    if token[1] == '(':
        out = []
        token = next()
        while token[1] != ')':
            out.append(atom(next, token))
            token = next()
            if token[1] == ',':
                token = next()
        return tuple(out)
    elif token[1] == '[':
        out = []
        token = next()
        while token[1] != ']':
            out.append(atom(next, token))
            token = next()
            if token[1] == ',':
                token = next()
        return out
    elif token[1] == '{':
        out = {}
        token = next()
        while token[1] != '}':
            key = atom(next, token)
            next()  # Skip key-value delimiter
            token = next()
            out[key] = atom(next, token)
            token = next()
            if token[1] == ',':
                token = next()
        return out
    elif token[1].startswith('u'):
        return token[1][2:-1].decode('unicode-escape')
    elif token[0] is tokenize.STRING:
        return token[1][1:-1].decode('string-escape')
    elif token[0] is tokenize.NUMBER:
        try:
            return int(token[1], 0)
        except ValueError:
            return float(token[1])
    elif token[1] in KEYWORDS:
        return KEYWORDS[token[1]]
    raise SyntaxError('malformed expression (%r)' % token[1])

def simple_eval(source):
    src = cStringIO.StringIO(source).readline
    src = tokenize.generate_tokens(src)
    src = itertools.ifilter(lambda x: x[0] is not tokenize.NL, src)
    res = atom(src.next, src.next())
    if src.next()[0] is not tokenize.ENDMARKER:
        raise SyntaxError("bogus data after expression")
    return res


if __name__ == '__main__':
    expr = (1, 2.3, u'h\xf8h\n', 'h\xc3\xa6', ['a', 1],
            {'list': [], 'tuple': (), 'dict': {}}, False, True, None)
    rexpr = repr(expr)

    a = simple_eval(rexpr)
    b = eval(rexpr)
    assert a == b

    import timeit
    print timeit.Timer('eval(rexpr)',
                       'from __main__ import rexpr').repeat(number=1000)
    print timeit.Timer('simple_eval(rexpr)',
                       'from __main__ import rexpr, simple_eval').repeat(number=1000)
 

greg

Tor said:
> Comments, speedups, improvements in general, etc are appreciated.

You're doing a lot of repeated indexing of token[0]
and token[1] in your elif branches. You might gain some
speed by fetching these into locals before entering the
elif chain.

Also you could try ordering the branches so that the
most frequent cases come first. Probably strings and
numbers first, then the various kinds of bracket.
This would also give you a chance to avoid pulling out
token[1] until you need it.

token[1].startswith('u'): It's probably faster to
use an index to get the first character, if you know
that the string is not empty.

Importing the names from tokenize that you use repeatedly
should save some time, too.

Putting all these together, it would look something
like

from tokenize import STRING, NUMBER

def atom(next, token):
    token0 = token[0]
    if token0 == STRING:
        # test token[1][0] == 'u' here to pick the unicode branch
        ...
    elif token0 == NUMBER:
        ...
    else:
        token1 = token[1]
        if token1 in KEYWORDS:
            ...
        elif token1 == '(':
            ...
        elif token1 == '[':
            ...
        elif token1 == '{':
            ...

If you were willing to indulge in some default-argument abuse, you
could also do

def atom(next, token, STRING = tokenize.STRING, NUMBER = tokenize.NUMBER):
...

A more disciplined way would be to wrap it in a closure:

def make_atom():
    from tokenize import STRING, NUMBER
    def atom(next, token):
        ...
    return atom

atom = make_atom()
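
Whether any of these binding styles actually wins is easy to measure
directly; below is a rough timeit harness (an illustrative sketch, not
from the thread) that isolates just the name-lookup cost:

import timeit

# The function bodies are deliberately trivial: all we are comparing is
# the cost of looking up STRING/NUMBER as globals, as default arguments,
# or as closure variables.
setup = '''
from tokenize import STRING, NUMBER

def atom_global(tok):
    return tok == STRING or tok == NUMBER

def atom_default(tok, STRING=STRING, NUMBER=NUMBER):
    return tok == STRING or tok == NUMBER

def make_atom():
    S, N = STRING, NUMBER
    def atom_closure(tok):
        return tok == S or tok == N
    return atom_closure
atom_closure = make_atom()
'''

for name in 'atom_global', 'atom_default', 'atom_closure':
    timer = timeit.Timer('%s(42)' % name, setup)
    print name, min(timer.repeat(number=100000))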
 

George Sakkis

Tor said:
> Comments, speedups, improvements in general, etc are appreciated.

greg said:
> You're doing a lot of repeated indexing of token[0]
> and token[1] in your elif branches. You might gain some
> speed by fetching these into locals before entering the
> elif chain.
>
> Also you could try ordering the branches so that the
> most frequent cases come first. Probably strings and
> numbers first, then the various kinds of bracket.
> This would also give you a chance to avoid pulling out
> token[1] until you need it.
>
> token[1].startswith('u'): It's probably faster to
> use an index to get the first character, if you know
> that the string is not empty.

I tried several of these micro-optimizations but there was very little
improvement; eval() remains practically 5 times faster. The major
bottleneck is generate_tokens(); replacing simple_eval() with the
following is still 3 times slower than eval():

def simple_eval(source):
    for _ in generate_tokens(StringIO(source).readline): pass

That's not very surprising, since generate_tokens() is quite general
and yields more information than necessary. Clearly, if performance is
critical, you should write your own simple_generate_tokens(), possibly
as a cut-down version of the generic one.
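
For a taste of what such a cut-down tokenizer could look like, here is
a regex-based sketch (illustrative only; simple_tokens() is a made-up
name, and the pattern only aims at what repr() of plain data can
produce):

import re

# One regex alternative per token kind we need: strings (optionally
# u-prefixed), numbers (with optional fraction, exponent, L suffix),
# names (None, True, False), and single-character punctuation.
_token_re = re.compile(r'''
    \s*(?:  (u?'(?:\\.|[^'\\])*'|u?"(?:\\.|[^"\\])*")   # string literal
          | (\d+(?:\.\d*)?(?:[eE][-+]?\d+)?L?)          # number
          | (\w+)                                       # name
          | ([(){}\[\],:-])                             # punctuation
    )''', re.VERBOSE)

def simple_tokens(source):
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        m = _token_re.match(source, pos)
        if not m:
            raise SyntaxError('bad token at position %d' % pos)
        yield m.group(m.lastindex)
        pos = m.end()
    yield ''    # empty string stands in for tokenize's ENDMARKER

It yields plain token strings (with a trailing '' as end marker), which
is exactly the shape of stream the more compact version below consumes.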

Leaving performance aside, below is a slightly more compact version.
The almost identical code for handling lists and tuples is factored
out into _iter_sequence(). The 'token' parameter here is the actual
token string, not the 5-tuple yielded by generate_tokens(). Finally,
this version handles negative and long numbers (which the original
didn't):


from string import digits
from cStringIO import StringIO
from tokenize import generate_tokens, NL

_consts = {'None': None, 'False': False, 'True': True}

def simple_eval(source):
    itertokens = generate_tokens(StringIO(source).readline)
    next = (token[1] for token in itertokens
            if token[0] is not NL).next
    res = atom(next, next())
    if next():
        raise SyntaxError("bogus data after expression")
    return res

def atom(next, token):
    def _iter_sequence(end):
        token = next()
        while token != end:
            yield atom(next, token)
            token = next()
            if token == ',':
                token = next()
    firstchar = token[0]
    if token in _consts:
        return _consts[token]
    elif token[-1] == 'L':
        return long(token)
    elif firstchar in digits:
        return float(token) if '.' in token else int(token)
    elif firstchar in '"\'':
        return token[1:-1].decode('string-escape')
    elif firstchar == 'u':
        return token[2:-1].decode('unicode-escape')
    elif token == '-':
        return -atom(next, next())
    elif token == '(':
        return tuple(_iter_sequence(')'))
    elif token == '[':
        return list(_iter_sequence(']'))
    elif token == '{':
        out = {}
        token = next()
        while token != '}':
            key = atom(next, token)
            next()  # Skip key-value delimiter (':')
            token = next()
            out[key] = atom(next, token)
            token = next()
            if token == ',':
                token = next()
        return out
    raise SyntaxError('malformed expression (%r)' % token)
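
A couple of quick sanity checks for the new cases (these asserts are
illustrative additions, not part of the original post):

# Covers what the original version missed: negative numbers and longs,
# plus a nested sample in the spirit of Tor's test expression.
assert simple_eval('[-1, 10L, 2.5]') == [-1, 10L, 2.5]
assert simple_eval("(u'h\\xf8h', {'a': (1,)}, None)") == \
       (u'h\xf8h', {'a': (1,)}, None)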


Regards,
George
 

MonkeeSage

As I see it, just as a matter of common sense, there will be no way to
match the performance of the backend eval() with any interpreted code.
At best, performance-wise, a preprocessor for the built-in eval()
would be in order, filtering out the "unsafe" cases and passing the
rest through. But what do I know, I could be dead wrong. :)

Regards,
Jordan
 

Gabriel Genellina

greg said:
> Importing the names from tokenize that you use repeatedly
> should save some time, too.
>
>     from tokenize import STRING, NUMBER
>
> If you were willing to indulge in some default-argument abuse, you
> could also do
>
>     def atom(next, token, STRING = tokenize.STRING, NUMBER = tokenize.NUMBER):
>         ...
>
> A more disciplined way would be to wrap it in a closure:
>
>     def make_atom():
>         from tokenize import STRING, NUMBER
>         def atom(next, token):
>             ...
>         return atom

...but unfortunately it's the slowest alternative, so it wouldn't count
as a speed optimization.

I would investigate a mixed approach: using a parser to ensure the
expression is "safe", then calling eval.
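
For instance (a minimal sketch of that mixed approach, not code from
any of the posts): whitelist the token stream first, and only call
eval() on input that passes.

import cStringIO, tokenize

# Whitelist approach: anything not explicitly permitted is rejected.
_SAFE_TYPES = frozenset([tokenize.STRING, tokenize.NUMBER, tokenize.OP,
                         tokenize.NAME, tokenize.NL, tokenize.NEWLINE,
                         tokenize.ENDMARKER])
_SAFE_NAMES = frozenset(['None', 'True', 'False'])
_SAFE_OPS = frozenset(['(', ')', '[', ']', '{', '}', ',', ':', '-'])

def safe_eval(source):
    for toktype, tokstr, _, _, _ in tokenize.generate_tokens(
            cStringIO.StringIO(source).readline):
        if toktype not in _SAFE_TYPES:
            raise ValueError('unsafe token: %r' % tokstr)
        if toktype == tokenize.NAME and tokstr not in _SAFE_NAMES:
            raise ValueError('unsafe name: %r' % tokstr)
        if toktype == tokenize.OP and tokstr not in _SAFE_OPS:
            raise ValueError('unsafe operator: %r' % tokstr)
    return eval(source, {'__builtins__': {}}, {})

With this, safe_eval("{'a': [1, -2.5, u'x']}") reaches eval(), while
safe_eval("__import__('os')") is rejected before eval() ever sees the
string.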
 

Steven D'Aprano

MonkeeSage said:
> As I see it, just as a matter of common sense, there will be no way to
> match the performance of the backend eval() with any interpreted code.
> At best, performance-wise, a preprocessor for the built-in eval() would
> be in order, filtering out the "unsafe" cases and passing the rest
> through. But what do I know, I could be dead wrong. :)


In general, there are two approaches to security.

The first is "Anything not explicitly permitted is forbidden".

The second is "Anything not explicitly forbidden is permitted".


The first is used when you actually want security. The second is when you
are only pretending to care about security.

Valuing convenience and ease of use over security is a reasonable thing
to do, but only so long as you know that you are putting security second
and are prepared to face the consequences. I for one don't surround
myself with a team of crack British ex-SAS officers as bodyguards, but
that's because I'm prepared to run the risk of Russian ex-KGB special
forces attacking.

On the other hand... I do lock my house. Rather than providing a list of
people not allowed into the house, only those with a key are permitted.
 
