My first Python program -- a lexer

  • Thread starter Thomas Mlynarczyk

Thomas Mlynarczyk

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):
    def __init__( self, source, tokens ):
        self.source = re.sub( r"\r?\n|\r\n", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line   = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER'     : r'[+-]?\d+',
    'T_STRING'     : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR'   : r'[=:,;]',
    'T_NEWLINE'    : r'\n',
    'T_LWSP'       : r'[ \t]+',
    'T_COMMENT'    : r'(?:\#|//).*$' }

print Lexer( source, tokens )

### End ###


Greetings,
Thomas
 

John Machin

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):

So far, so good.
     def __init__( self, source, tokens ):

Be consistent with your punctuation style. I'd suggest *not* having a
space after ( and before ), as in the previous line. Read
http://www.python.org/dev/peps/pep-0008/
         self.source = re.sub( r"\r?\n|\r\n", "\n", source )

Firstly, would you not expect to be getting your text from a text file
(perhaps even one opened with the universal newlines option) i.e. by
the time it's arrived here, source has already had \r\n changed to \n?

Secondly, that's equivalent to
re.sub(r"\n|\r\n|\r\n", "\n", source)
What's wrong with
re.sub(r"\r\n", "\n", source)
?

Thirdly, if source does contain \r\n and there is an error, the
reported value of offset will be incorrect. Consider retaining the
offset of the last newline seen, so that your error reporting can
include the line number and (include or use) the column position in
the line.
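
A small sketch of what such a report might look like (`self.last_newline`
is an assumed attribute, updated to point just past the most recent "\n"
seen while tokenizing):

    # sketch only -- self.last_newline is an assumption, not in the posted code
    column = self.offset - self.last_newline + 1
    raise Exception( 'Syntax error at line %d, column %d'
                     % ( self.line, column ) )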
         self.tokens = tokens
         self.offset = 0
         self.result = []
         self.line   = 1
         self._compile()
         self._tokenize()

     def _compile( self ):
         for name, regex in self.tokens.iteritems():
             self.tokens[name] = re.compile( regex, re.M )

     def _tokenize( self ):

Unless you have other plans for it, offset could be local to this
method.
         while self.offset < len( self.source ):

You may like to avoid getting len(self.source) for each token.
             for name, regex in self.tokens.iteritems():

dict.iter<anything>() will return its results in essentially random
order. It doesn't matter with your example, but you will rapidly come
across real-world cases where the order matters. One such case is
distinguishing between real constants (1.23, .123, 123.) and integer
constants (123).
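
For instance, a list of (name, regex) pairs fixes the order in which the
patterns are tried (a sketch; the names are illustrative):

    # the float pattern must be tried before the integer one, otherwise
    # "1.23" could be split into T_INT "1" followed by a stray ".23"
    tokens = [
        ( 'T_FLOAT', r'[+-]?(?:\d+\.\d*|\.\d+)' ),
        ( 'T_INT',   r'[+-]?\d+' ),
    ]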
                 match = regex.match( self.source, self.offset )
                 if not match: continue
                 self.offset += len( match.group(0) )
                 self.result.append( ( name, match, self.line ) )
                 self.line += match.group(0).count( "\n" )
                 break
             else:
                 raise Exception(
                     'Syntax error in source at offset %s' %
                     str( self.offset ) )

Using str() here and below is redundant ... "%s" % obj is documented
to produce str(obj).
     def __str__( self ):
         return "\n".join(
             [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %

For avoidance of ambiguity, you may like to change that '%s' to %r
               ( str( line ), str( match.pos ), name, match.group(0) )
               for name, match, line in self.result ] )
HTH,
John
 

Arnaud Delobelle

Thomas Mlynarczyk said:
Hello,

I started to write a lexer in Python -- my first attempt to do
something useful with Python (rather than trying out snippets from
tutorials). It is not complete yet, but I would like some feedback --
I'm a Python newbie and it seems that, with Python, there is always a
simpler and better way to do it than you think.

Hi,

Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.

# Later:
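# (sketch of the suggested usage -- `mylexer` would be a Lexer built
#  earlier from the token definitions, and `source` the text to tokenise)
result = mylexer.tokenise( source )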
 

Thomas Mlynarczyk

John said:
Be consistent with your punctuation style. I'd suggest *not* having a
space after ( and before ), as in the previous line. Read
http://www.python.org/dev/peps/pep-0008/

What were the reasons for preferring (foo) over ( foo )? This PEP gives
recommendations for coding style, but (naturally) it does not mention
the reasons why the recommended way is preferable. I suppose these
matters have all been discussed -- is there a synopsis available?
Firstly, would you not expect to be getting your text from a text file
(perhaps even one opened with the universal newlines option) i.e. by
the time it's arrived here, source has already had \r\n changed to \n?

I was not aware of the universal newlines option. This would then indeed
make my newline conversion superfluous.
Secondly, that's equivalent to
re.sub(r"\n|\r\n|\r\n", "\n", source)

My mistake. I meant r"\r?\n|\r" ("\n", "\r\n" or "\r").
Thirdly, if source does contain \r\n and there is an error, the
reported value of offset will be incorrect. Consider retaining the
offset of the last newline seen, so that your error reporting can
include the line number and (include or use) the column position in
the line.

Indeed, I had not thought of that detail -- if I mess with the newlines,
the offset will be wrong with respect to the original source. But with
the universal newlines option mentioned above, the problem is already
solved :)
You may like to avoid getting len(self.source) for each token.

Yes, I should change that. Unless there is a more elegant way to detect
the end of the source?
dict.iter<anything>() will return its results in essentially random
order.

Ouch! I must do something about that. Thanks for pointing it out. So if
I want a certain order, I must use a list of tuples? Or is there a way
to have order with dicts?
return "\n".join(
[ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
For avoidance of ambiguity, you may like to change that '%s' to %r

In which way would there be ambiguity? The first two are integers, the
last two strings.

Thanks for your feedback.

Greetings,
Thomas
 

Thomas Mlynarczyk

Arnaud said:
Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
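
A rough sketch of such a method (assuming self.tokens still maps token
names to compiled regexes, and self.offset/self.line as in the posted
code):

    def next( self, name ):
        """Return (name, match, line) for the named token at the current
        position and advance the internal pointer, or return False."""
        match = self.tokens[name].match( self.source, self.offset )
        if not match:
            return False
        token = ( name, match, self.line )
        self.offset += len( match.group(0) )
        self.line   += match.group(0).count( "\n" )
        return token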

But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.

Thanks for your feedback.

Greetings,
Thomas
 

John Machin

What were the reasons for preferring (foo) over ( foo )? This PEP gives
recommendations for coding style, but (naturally) it does not mention
the reasons why the recommended way is preferable. I suppose these
matters have all been discussed -- is there a synopsis available?

Not AFAIK.
Indeed, I had not thought of that detail -- if I mess with the newlines,
the offset will be wrong with respect to the original source. But with
the universal newlines option mentioned above, the problem is already
solved :)

NOT solved. You have TWO problems: (1) Reporting the error location as
(offset from the start of the file) instead of (line number, column
position) would get you an express induction into the User Interface
Hall of Shame. (2) In the case of a file with lines terminated by
\r\n, the offset is ambiguous.
Yes, I should change that. Unless there is a more elegant way to detect
the end of the source?

I see no inelegance in a while statement being used in the manner for
which it was intended, nor any plausible reason for another construct.
Ouch! I must do something about that. Thanks for pointing it out. So if
I want a certain order, I must use a list of tuples?

A list of somethings does seem indicated.
Or is there a way
to have order with dicts?

A dict is a hashtable, intended to provide a mapping from keys to
values. It's not intended to have order. In any case your code doesn't
use the dict as a mapping.
return "\n".join(
    [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
For avoidance of ambiguity, you may like to change that '%s' to %r

In which way would there be ambiguity? The first two are integers, the
last two strings.

The first 3 are %s, the last one is '%s'

Cheers,
John
 

Thomas Mlynarczyk

John said:
[...] You have TWO problems: (1) Reporting the error location as
(offset from the start of the file) instead of (line number, column
position) would get you an express induction into the User Interface
Hall of Shame.

Of course. For the actual message I would use at least the line number.
Still, the offset could be used to compute line/column in case of an
error, so I wouldn't really need to store line/column with each token,
but only the offset. And provide a method to "convert" offset values
into line/column tuples.
(2) In the case of a file with lines terminated by \r\n, the offset is
ambiguous.

If I explicitly state that the offset counts newlines as one character?
But you're right: the offset would be for internal use only - what gets
reported is line/column.
A list of somethings does seem indicated.

On the other hand: If all my tokens are "mutually exclusive" then, in
theory, the order in which they are tried should not matter, as at most
one token could match at any given offset. Still, having the most
frequent tokens being tried first should improve performance.
A dict is a hashtable, intended to provide a mapping from keys to
values. It's not intended to have order. In any case your code doesn't
use the dict as a mapping.

I map token names to regular expressions. Isn't that a mapping?
return "\n".join(
[ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
The first 3 are %s, the last one is '%s'

I only put the single quotes so I could better "see" whitespace in the
output. Anyway, this method is just to be able to check if the lexer
does what it's supposed to do -- in the final version I will probably
get rid of it.

Thanks & greetings,
Thomas
 

John Machin

On the other hand: If all my tokens are "mutually exclusive" then,

But they won't *always* be mutually exclusive (another example is
relational operators (< vs <=, > vs >=)) and AFAICT there is nothing
useful that the lexer can do with an assumption/guess/input that they
are mutually exclusive or not.
in
theory, the order in which they are tried, should not matter, as at most
one token could match at any given offset. Still, having the most
frequent tokens being tried first should improve performance.

Your Lexer class should promise to check the regexes in the order
given. Then the users of your lexer can arrange the order to suit
themselves.
I map token names to regular expressions. Isn't that a mapping?

Your code uses dict methods; this forces your callers to *create* a
mapping. However (as I said) your code doesn't *use* that mapping --
there is no RHS usage of dict[key] or dict.get(key) etc. In fact I'm
having difficulty imagining what possible practical use there could be
for a mapping from token-name to regex.
return "\n".join(
    [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
The first 3 are %s, the last one is '%s'

I only put the single quotes so I could better "see" whitespace in the
output.

To *best* see whitespace (e.g. Is that a TAB or multiple spaces?), use
%r.

General advice: What you think you see is often not what you've
actually got. repr() is your friend; use it.

Cheers,
John
 

Paul McGuire

        Are you forcing the use of fixed length lines then?

        Otherwise, by what algorithm will you convert:


>>> data = """
... one
... two
... three
... four
... five
... supercalifragilisticexpialidocious
... seven
... eight
... nine"""
>>> ix = data.index("list")
>>> ix
39

loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

-- Paul
 

Robert Lehmann

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free while not picking your algorithm apart into tiny unrelated pieces.
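
A minimal sketch of that change (assuming the tokens are already kept as
a list of (name, compiled regex) pairs, as discussed earlier in the
thread):

    def tokenize( self, source ):
        offset, line = 0, 1
        while offset < len( source ):
            for name, regex in self.tokens:
                match = regex.match( source, offset )
                if not match:
                    continue
                yield ( name, match, line )   # instead of self.result.append(...)
                offset += len( match.group(0) )
                line   += match.group(0).count( "\n" )
                break
            else:
                raise Exception( 'Syntax error in source at offset %s' % offset )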
token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".

Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

See http://www.python.org/dev/peps/pep-0342/ for details.

HTH,
 

Thomas Mlynarczyk

Robert said:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free while not picking your algorithm apart into tiny unrelated pieces.
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

I will try this. Thank you for the suggestion.

Greetings,
Thomas
 

Thomas Mlynarczyk

But they won't *always* be mutually exclusive (another example is
relational operators (< vs <=, > vs >=)) and AFAICT there is nothing
useful that the lexer can do with an assumption/guess/input that they
are mutually exclusive or not.

"<" vs. "<=" can be handled with lookaheads (?=...) / (?!...) in regular
expressions. True, the lexer cannot do anything useful with the
assumption that all tokens are mutually exclusive. But if they are,
there will be no ambiguity and I am guaranteed to get always the same
sequence of tokens from the same input string.
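
For example (illustrative patterns only):

    tokens = [
        ( 'T_LE', r'<=' ),
        ( 'T_LT', r'<(?!=)' ),   # the lookahead makes the two patterns disjoint
    ]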
Your Lexer class should promise to check the regexes in the order
given. Then the users of your lexer can arrange the order to suit
themselves.

Yes. So there's no way around a list of tuples instead of dict().
Your code uses dict methods; this forces your callers to *create* a
mapping. However (as I said) your code doesn't *use* that mapping --
there is no RHS usage of dict[key] or dict.get(key) etc. In fact I'm
having difficulty imagining what possible practical use there could be
for a mapping from token-name to regex.

Sorry, but I still don't quite get it.

for name, regex in self.tokens.iteritems():
    # ...
    self.result.append( ( name, match, self.line ) )

What I do here is take a name and its associated regex and then store a
tuple (name, match, line). In a simpler version of the lexer, I might
store only `name` instead of the tuple. Is your point that the lexer
doesn't care what `name` actually is, but simply passes it through from
the tokenlist to the result?
To *best* see whitespace (e.g. Is that a TAB or multiple spaces?), use
%r.

(Just having modified my code accordingly:) Ah, yes, indeed, that is
much better!
General advice: What you think you see is often not what you've
actually got. repr() is your friend; use it.

Lesson learned :)

Greetings,
Thomas
 

Thomas Mlynarczyk

Paul said:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.
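
A sketch of such an on-demand conversion, along the lines of Paul's
snippet (1-based line and column):

    def offset_to_linecol( source, offset ):
        # 1-based line and column of `offset` in `source`
        line   = source.count( "\n", 0, offset ) + 1
        column = offset - ( source.rfind( "\n", 0, offset ) + 1 ) + 1
        return line, column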

Greetings,
Thomas
 

Paul McGuire

Paul said:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1
prints 5,14
I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.

-- Paul
 

Thomas Mlynarczyk

Paul said:
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.

Another thing I had not thought of -- thanks for the hint.

Greetings,
Thomas
 

John Machin

"<" vs. "<=" can be handled with lookaheads (?=...) / (?!...) in regular
expressions.

Single-character tokens like "<" may be more efficiently handled by
doing a dict lookup after failing to find a match in the list of
(name, regex) tuples.
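
A sketch of that combination (names and patterns are illustrative):

    import re

    TOKEN_LIST = [                        # tried in order, longest first
        ( 'T_LE',     re.compile( r'<=' ) ),
        ( 'T_NUMBER', re.compile( r'[+-]?\d+' ) ),
    ]
    SINGLE_CHAR = { '<': 'T_LT', '>': 'T_GT', ':': 'T_COLON' }

    def match_token( source, offset ):
        for name, regex in TOKEN_LIST:
            m = regex.match( source, offset )
            if m:
                return name, m.group(0)
        ch = source[offset]
        if ch in SINGLE_CHAR:             # cheap fallback for one-char tokens
            return SINGLE_CHAR[ch], ch
        raise Exception( 'Syntax error at offset %d' % offset )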
True, the lexer cannot do anything useful with the
assumption that all tokens are mutually exclusive. But if they are,
there will be no ambiguity and I am guaranteed to get always the same
sequence of tokens from the same input string.

So what? That is useless knowledge. It is the ambiguous cases that you
need to be concerned with.
Your Lexer class should promise to check the regexes in the order
given. Then the users of your lexer can arrange the order to suit
themselves.

Yes. So there's no way around a list of tuples instead of dict().
Correct.
Your code uses dict methods; this forces your callers to *create* a
mapping. However (as I said) your code doesn't *use* that mapping --
there is no RHS usage of dict[key] or dict.get(key) etc. In fact I'm
having difficulty imagining what possible practical use there could be
for a mapping from token-name to regex.

Sorry, but I still don't quite get it.

for name, regex in self.tokens.iteritems():
     # ...
     self.result.append( ( name, match, self.line ) )

What I do here is take a name and its associated regex and then store a
tuple (name, match, line). In a simpler version of the lexer, I might
store only `name` instead of the tuple. Is your point that the lexer
doesn't care what `name` actually is, but simply passes it through from
the tokenlist to the result?

No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.
 

Thomas Mlynarczyk

John said:
Single-character tokens like "<" may be more efficiently handled by
doing a dict lookup after failing to find a match in the list of
(name, regex) tuples.

Yes, I will keep that in mind. For the time being, I will use only
regexes to keep the code simpler. Later, or when the need for a speedup
arises, I can use your suggestion for optimizing my code. Depending on
the tokens, it might even be better to use the dict lookup right away
and the regex as a secondary means for more complex stuff.

[Mutually exclusive tokens = no ambiguities = same input -> same output]
So what? That is useless knowledge.

For the lexer, perhaps. Not for the user. An ambiguous lexer will be of
no use.
It is the ambiguous cases that you
need to be concerned with.

Exactly. In which way does that contradict what I said?

[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.

I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?

Greetings,
Thomas
 

John Machin

[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.

I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?

You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.
 

Thomas Mlynarczyk

John said:
You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.

It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?

Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
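
(For what it's worth, a list of pairs converts straight back into a dict
whenever keyed access is needed -- a sketch:)

    token_list = [ ( 'T_NUMBER', r'[+-]?\d+' ),
                   ( 'T_IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*' ) ]
    token_dict = dict( token_list )
    regex = token_dict['T_NUMBER']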

Greetings,
Thomas
 
