My first Python program -- a lexer

  • Thread starter Thomas Mlynarczyk

Steve Holden

Thomas said:
It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?
The latter.
Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.
But it's a question of the properties of the objects. Because lists are
ordered it's appropriate to use them when you want to iterate over them
and no random access is required.

If you need random access by key and no particular ordering is required
for the iteration then you should use a dict.

Forget "simplicity" until you know how the different objects are
implemented under the hood.

A good first rule is that simple, readable Python will usually be
efficient. As far as execution efficiency goes I'd be very surprised if
the dict solution was faster, but even if it were I would still prefer
the list-based solution because it's more appropriate to the problem.

"Premature optimization is the root of all evil" ...
Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.

The only time your original code needs to use the token name to retrieve
the value is for your compile() method, but assuming the passed-in
tokens was of the form [(name, pattern), ...] you could just as easily
throw the method away and in __init__() write

self.tokens = [(name, re.compile(pat)) for name, pat in tokens]

Or simply pass in compiled token patterns in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize at all!

The advantage of the list-based approach is that it at least allows for
the possibility that you can observe the relative frequencies of the
tokens' appearance in real code, and then optimize the ordering to give
the best (mean) performance (by putting the commonest token first) should
tokenization become a program hotspot.

With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.

regards
Steve
 

Thomas Mlynarczyk

Steve said:
Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
But when do you want to do that? There's no point inventing use cases -
they should be demonstrated needs.

Well, I had been thinking about further reducing the number of regex
matchings needed. So I wanted to modify my lexer not to tokenize the
whole input at once, but only try to grab the next token from the input
"just in time" / on demand. For that I was thinking of having a next()
method like this:

def next(self, nameOfExpectedToken):
    regex = self.getRegexByTokenName(nameOfExpectedToken)
    match = regex.match(self.source, self.offset)
    if not match:
        return None
    line = self.line
    self.line += match.group(0).count("\n")
    self.offset += len(match.group(0))
    return (nameOfExpectedToken, match, line)

I'm not sure if this is a good idea, but it looks like one to me. The
problem is the first line of the method which retrieves the regex
associated with the given token name. Using a dict, I could simply write

regex = self.tokendict[nameOfExpectedToken]

But with a list I suppose I wouldn't get away without a loop, which I
assume is more expensive than the dict lookup.
Or simply pass compiled token patterns in in the first place when they
are necessary ... then the caller has the option of not bothering to
optimize in the first place!

That would be an option. But shouldn't it be the lexer that takes care of
optimizing its own work as much as it can without the caller's
assistance? After all, the caller should not need to know about the
internal workings of the lexer.

[Optimizing performance by putting most frequent tokens first]
With a dict you have no such opportunity, because the ordering is
determined by the implementation and not by your data structure.

True. Still, I should be able to gain even better performance with my
above approach using a next() function, as this would completely
eliminate all "useless" matching (like trying to match FOO where no foo
is allowed).

Greetings,
Thomas
 

John Machin

It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?

That is definitely what I mean. Sorry for not replying before; I
thought that this was a rhetorical question ;-)
Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that.

If you need a predictable and easily specified order of testing a
string against multiple regexes, a list of tuples is indicated.

*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

If you have *both* requirements simultaneously, then *both* list and
dict are indicated.
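John's "both" option can be sketched like this: the dict is built once from the very same (name, regex) pairs, so the two views coexist cheaply (token names and patterns here are made up):

```python
import re

tokens = [("NUMBER", re.compile(r"\d+")), ("NAME", re.compile(r"\w+"))]

# Ordered testing: iterate over the list in its declared order.
for name, regex in tokens:
    pass  # try regex.match(...) against the input here

# O(1) lookup by name: a dict built from the same pairs.
tokendict = dict(tokens)
print(tokendict["NAME"].pattern)  # \w+
```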
 

Thomas Mlynarczyk

John said:
*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

O(1) - Does that mean `mydict[mykey]` takes the same amount of time, no
matter if mydict has 10 or 1000000000 entries? How does this magic work?
O(log n) I would understand, but constant time?
If you have *both* requirements simultaneously, then *both* list and
dict are indicated.

So I would have to duplicate my data and store it once in a list, once
in a dict? Or should I decide for one way and accept that one type of
access will not be optimal?

Greetings,
Thomas
 

Steve Holden

Thomas said:
John said:
*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

O(1) - Does that mean `mydict[mykey]` takes the same amount of time, no
matter if mydict has 10 or 1000000000 entries? How does this magic work?
O(log n) I would understand, but constant time?
Basically it hashes the key values to locate the corresponding value.

This is a drastic simplification, but if you do a Google search for
"hash tables" you will get the basic idea. Tim Peters, among others, has
spent a lot of time optimizing dict behavior.
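A toy illustration of the principle (real CPython dicts use open addressing, collision probing, and resizing; this stripped-down sketch ignores collisions entirely):

```python
table = [None] * 8  # fixed-size slot array

def put(key, value):
    # The hash of the key picks the slot directly: no searching.
    table[hash(key) % len(table)] = (key, value)

def get(key):
    # Recompute the same slot index: one step, regardless of table size.
    slot = table[hash(key) % len(table)]
    if slot is not None and slot[0] == key:
        return slot[1]
    raise KeyError(key)

put("tokenname", "some-regex")
print(get("tokenname"))  # some-regex
```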
So I would have to duplicate my data and store it once in a list, once
in a dict? Or should I decide for one way and accept that one type of
access will not be optimal?
Storing the data once in a dict and once in a list doesn't duplicate the
data, since both structures will (if you do it right) refer to the same
data objects.
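This is easy to verify: both containers hold references to one compiled regex object, not copies of it (the pattern below is invented for the demonstration):

```python
import re

pattern = re.compile(r"\d+")
aslist = [("NUMBER", pattern)]
asdict = {"NUMBER": pattern}

# Identity check: one object, two references.
print(aslist[0][1] is asdict["NUMBER"])  # True
```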

Generally speaking, do what's convenient to do as a programmer, and work
on it if there are serious inadequacies.

regards
Steve
 

Arnaud Delobelle

Thomas Mlynarczyk said:
John said:
*IF* you need to access the regex associated with a token in O(1)
time, a dict is indicated.

O(1) - Does that mean `mydict[mykey]` takes the same amount of time,
no matter if mydict has 10 or 1000000000 entries? How does this magic
work? O(log n) I would understand, but constant time?

As I understand it, it's not O(1) in theory - the worst case is O(n).
But for any non-cunningly crafted data it behaves as if it were O(1).

Dictionaries are implemented as hashtables, not as trees.
So I would have to duplicate my data and store it once in a list, once
in a dict? Or should I decide for one way and accept that one type of
access will not be optimal?

Depends what you want to do!