My first Python program -- a lexer

  • Thread starter Thomas Mlynarczyk

Steve Holden

Thomas said:
It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?
The latter.
Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.
But it's a question of the properties of the objects. Because lists are
ordered it's appropriate to use them when you want to iterate over them
and no random access is required.

If you need random access by key and no particular ordering is required
for the iteration then you should use a dict.

Forget "simplicity" until you know how the different objects are
implemented under the hood.

A good first rule is that simple, readable Python will usually be
efficient. As far as execution efficiency goes I'd be very surprised if
the dict solution was faster, but even if it were I would still prefer
the list-based solution because it's more appropriate to the problem.

"Premature optimization is the root of all evil" ...
Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.

The only time your original code needs to use the token name to retrieve
the value is for your compile() method, but assuming the passed-in
tokens was of the form [(name, pattern), ...] you could just as easily
throw the method away and in __init__() write

self.tokens = [(name, re.compile(pat)) for name, pat in tokens]

Or simply pass in compiled token patterns in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize at all!

The advantage of the list-based approach is that it at least allows for
the possibility that you can observe the relative frequencies of the
tokens' appearance in real code, and then optimize the ordering to give
the best (mean) performance (by putting the commonest token first) should
tokenization become a program hotspot.

With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.

regards
Steve
 

Thomas Mlynarczyk

Steve said:
Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.

Well, I had been thinking about further reducing the number of regex
matchings needed. So I wanted to modify my lexer not to tokenize the
whole input at once, but only try to grab the next token from the input
"just in time" / on demand. For that I was thinking of having a next()
method like this:

def next(self, nameOfExpectedToken):
    regex = self.getRegexByTokenName(nameOfExpectedToken)
    match = regex.match(self.source, self.offset)
    if not match:
        return None
    line = self.line
    self.line += match.group(0).count("\n")
    self.offset += len(match.group(0))
    return (nameOfExpectedToken, match, line)

I'm not sure if this is a good idea, but it looks like one to me. The
problem is the first line of the method which retrieves the regex
associated with the given token name. Using a dict, I could simply write

regex = self.tokendict[nameOfExpectedToken]

But with a list I suppose I wouldn't get away without a loop, which I
assume is more expensive than the dict lookup.
Or simply pass compiled token patterns in in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize in the first place!

That would be an option. But shouldn't it be the lexer that takes care of
optimizing its own work as much as it can without the caller's
assistance? After all, the caller should not need to know about the
internal workings of the lexer.

[Optimizing performance by putting most frequent tokens first]
With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.

True. Still, I should be able to gain even better performance with my
above approach using a next() function, as this would completely
eliminate all "useless" matching (like trying to match FOO where no foo
is allowed).

Greetings,
Thomas
 

John Machin

It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?

That is definitely what I mean. Sorry for not replying before; I
thought that this was a rhetorical question ;-)
Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that.

If you need a predictable and easily specified order of testing a
string against multiple regexes, a list of tuples is indicated.

*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

If you have *both* requirements simultaneously, then *both* list and
dict are indicated.
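John's "both" option can be sketched like this: the dict is built once from the very same (name, regex) pairs, so the two views coexist cheaply (token names and patterns here are made up):

```python
import re

tokens = [("NUMBER", re.compile(r"\d+")), ("NAME", re.compile(r"\w+"))]

# Ordered testing: iterate over the list in its declared order.
for name, regex in tokens:
    pass  # try regex.match(...) against the input here

# O(1) lookup by name: a dict built from the same pairs.
tokendict = dict(tokens)
print(tokendict["NAME"].pattern)  # \w+
```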
 

Thomas Mlynarczyk

John said:
*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

O(1) - Does that mean `mydict[mykey]` takes the same amount of time, no
matter if mydict has 10 or 1000000000 entries? How does this magic work?
O(log n) I would understand, but constant time?
If you have *both* requirements simultaneously, then *both* list and
dict are indicated.

So I would have to duplicate my data and store it once in a list, once
in a dict? Or should I decide for one way and accept that one type of
access will not be optimal?

Greetings,
Thomas
 

Steve Holden

Thomas said:
John said:
*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

O(1) - Does that mean `mydict[mykey]` takes the same amount of time, no
matter if mydict has 10 or 1000000000 entries? How does this magic work?
O(log n) I would understand, but constant time?
Basically it hashes the key values to locate the corresponding value.

This is a drastic simplification, but if you do a Google search for
"hash tables" you will get the basic idea. Tim Peters, among others, has
spent a lot of time optimizing dict behavior.
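A toy illustration of the principle (real CPython dicts use open addressing, collision probing, and resizing; this stripped-down sketch ignores collisions entirely):

```python
table = [None] * 8  # fixed-size slot array

def put(key, value):
    # The hash of the key picks the slot directly: no searching.
    table[hash(key) % len(table)] = (key, value)

def get(key):
    # Recompute the same slot index: one step, regardless of table size.
    slot = table[hash(key) % len(table)]
    if slot is not None and slot[0] == key:
        return slot[1]
    raise KeyError(key)

put("tokenname", "some-regex")
print(get("tokenname"))  # some-regex
```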
So I would have to duplicate my data and store it once in a list, once
in a dict? Or should I decide for one way and accept that one type of
access will not be optimal?
Storing the data once in a dict and once in a list doesn't duplicate the
data, since both structures will (if you do it right) refer to the same
data objects.
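This is easy to verify: both containers hold references to one compiled regex object, not copies of it (the pattern below is invented for the demonstration):

```python
import re

pattern = re.compile(r"\d+")
aslist = [("NUMBER", pattern)]
asdict = {"NUMBER": pattern}

# Identity check: one object, two references.
print(aslist[0][1] is asdict["NUMBER"])  # True
```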

Generally speaking, do what's convenient to do as a programmer, and work
on it if there are serious inadequacies.

regards
Steve
 

Arnaud Delobelle

Thomas Mlynarczyk said:
John said:
*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

O(1) - Does that mean `mydict[mykey]` takes the same amount of time,
no matter if mydict has 10 or 1000000000 entries? How does this magic
work? O(log n) I would understand, but constant time?

As I understand it, it's not O(1) in theory - the worst case is O(n).
But for any non-cunningly crafted data it behaves as if it were O(1).

Dictionaries are implemented as hashtables, not as trees.
So I would have to duplicate my data and store it once in a list, once
in a dict? Or should I decide for one way and accept that one type of
access will not be optimal?

Depends what you want to do!