My first Python program -- a lexer

  • Thread starter Thomas Mlynarczyk

Thomas Mlynarczyk

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):
    def __init__( self, source, tokens ):
        self.source = re.sub( r"\r?\n|\r\n", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line   = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER'     : r'[+-]?\d+',
    'T_STRING'     : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR'   : r'[=:,;]',
    'T_NEWLINE'    : r'\n',
    'T_LWSP'       : r'[ \t]+',
    'T_COMMENT'    : r'(?:\#|//).*$' }

print Lexer( source, tokens )

### End ###


Greetings,
Thomas
 

John Machin

Hello,

I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.

### Begin ###

import re

class Lexer(object):

So far, so good.
     def __init__( self, source, tokens ):

Be consistent with your punctuation style. I'd suggest *not* having a
space after ( and before ), as in the previous line. Read
http://www.python.org/dev/peps/pep-0008/
         self.source = re.sub( r"\r?\n|\r\n", "\n", source )

Firstly, would you not expect to be getting your text from a text file
(perhaps even one opened with the universal newlines option) i.e. by
the time it's arrived here, source has already had \r\n changed to \n?

Secondly, that's equivalent to
re.sub(r"\n|\r\n|\r\n", "\n", source)
What's wrong with
re.sub(r"\r\n", "\n", source)
?

Thirdly, if source does contain \r\n and there is an error, the
reported value of offset will be incorrect. Consider retaining the
offset of the last newline seen, so that your error reporting can
include the line number and (include or use) the column position in
the line.
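
A small sketch of what such a report might look like (`self.last_newline`
is an assumed attribute, updated to point just past the most recent "\n"
seen while tokenizing):

    # sketch only -- self.last_newline is an assumption, not in the posted code
    column = self.offset - self.last_newline + 1
    raise Exception( 'Syntax error at line %d, column %d'
                     % ( self.line, column ) )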
         self.tokens = tokens
         self.offset = 0
         self.result = []
         self.line   = 1
         self._compile()
         self._tokenize()

     def _compile( self ):
         for name, regex in self.tokens.iteritems():
             self.tokens[name] = re.compile( regex, re.M )

     def _tokenize( self ):

Unless you have other plans for it, offset could be local to this
method.
         while self.offset < len( self.source ):

You may like to avoid getting len(self.source) for each token.
             for name, regex in self.tokens.iteritems():

dict.iter<anything>() will return its results in essentially random
order. It doesn't matter with your example, but you will rapidly come
across real-world cases where the order matters. One such case is
distinguishing between real constants (1.23, .123, 123.) and integer
constants (123).
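
For instance, a list of (name, regex) pairs fixes the order in which the
patterns are tried (a sketch; the names are illustrative):

    # the float pattern must be tried before the integer one, otherwise
    # "1.23" could be split into T_INT "1" followed by a stray ".23"
    tokens = [
        ( 'T_FLOAT', r'[+-]?(?:\d+\.\d*|\.\d+)' ),
        ( 'T_INT',   r'[+-]?\d+' ),
    ]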
                 match = regex.match( self.source, self.offset )
                 if not match: continue
                 self.offset += len( match.group(0) )
                 self.result.append( ( name, match, self.line ) )
                 self.line += match.group(0).count( "\n" )
                 break
             else:
                 raise Exception(
                     'Syntax error in source at offset %s' %
                     str( self.offset ) )

Using str() here and below is redundant ... "%s" % obj is documented
to produce str(obj).
     def __str__( self ):
         return "\n".join(
             [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %

For avoidance of ambiguity, you may like to change that '%s' to %r
               ( str( line ), str( match.pos ), name, match.group(0) )
               for name, match, line in self.result ] )
HTH,
John
 

Arnaud Delobelle

Thomas Mlynarczyk said:
Hello,

I started to write a lexer in Python -- my first attempt to do
something useful with Python (rather than trying out snippets from
tutorials). It is not complete yet, but I would like some feedback --
I'm a Python newbie and it seems that, with Python, there is always a
simpler and better way to do it than you think.

Hi,

Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.

# Later:
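# (sketch of the suggested usage -- `mylexer` would be a Lexer built
#  earlier from the token definitions, and `source` the text to tokenise)
result = mylexer.tokenise( source )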
 

Thomas Mlynarczyk

John said:
Be consistent with your punctuation style. I'd suggest *not* having a
space after ( and before ), as in the previous line. Read
http://www.python.org/dev/peps/pep-0008/

What were the reasons for preferring (foo) over ( foo )? This PEP gives
recommendations for coding style, but (naturally) it does not mention
the reasons why the recommended way is preferable. I suppose these
matters have all been discussed -- is there a synopsis available?
Firstly, would you not expect to be getting your text from a text file
(perhaps even one opened with the universal newlines option) i.e. by
the time it's arrived here, source has already had \r\n changed to \n?

I was not aware of the universal newlines option. This would then indeed
make my newline conversion superfluous.
Secondly, that's equivalent to
re.sub(r"\n|\r\n|\r\n", "\n", source)

My mistake. I meant r"\r?\n|\r" ("\n", "\r\n" or "\r").
Thirdly, if source does contain \r\n and there is an error, the
reported value of offset will be incorrect. Consider retaining the
offset of the last newline seen, so that your error reporting can
include the line number and (include or use) the column position in
the line.

Indeed, I had not thought of that detail -- if I mess with the newlines,
the offset will be wrong with respect to the original source. But with
the universal newlines option mentioned above, the problem is already
solved :)
You may like to avoid getting len(self.source) for each token.

Yes, I should change that. Unless there is a more elegant way to detect
the end of the source?
dict.iter<anything>() will return its results in essentially random
order.

Ouch! I must do something about that. Thanks for pointing it out. So if
I want a certain order, I must use a list of tuples? Or is there a way
to have order with dicts?
return "\n".join(
[ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
For avoidance of ambiguity, you may like to change that '%s' to %r

In which way would there be ambiguity? The first two are integers, the
last two strings.

Thanks for your feedback.

Greetings,
Thomas
 

Thomas Mlynarczyk

Arnaud said:
Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
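
A rough sketch of such a method (assuming self.tokens still maps token
names to compiled regexes, and self.offset/self.line as in the posted
code):

    def next( self, name ):
        """Return (name, match, line) for the named token at the current
        position and advance the internal pointer, or return False."""
        match = self.tokens[name].match( self.source, self.offset )
        if not match:
            return False
        token = ( name, match, self.line )
        self.offset += len( match.group(0) )
        self.line   += match.group(0).count( "\n" )
        return token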

But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.

Thanks for your feedback.

Greetings,
Thomas
 

John Machin

What were the reasons for preferring (foo) over ( foo )? This PEP gives
recommendations for coding style, but (naturally) it does not mention
the reasons why the recommended way is preferable. I suppose these
matters have all been discussed -- is there a synopsis available?

Not AFAIK.
Indeed, I had not thought of that detail -- if I mess with the newlines,
the offset will be wrong with respect to the original source. But with
the universal newlines option mentioned above, the problem is already
solved :)

NOT solved. You have TWO problems: (1) Reporting the error location as
(offset from the start of the file) instead of (line number, column
position) would get you an express induction into the User Interface
Hall of Shame. (2) In the case of a file with lines terminated by
\r\n, the offset is ambiguous.
Yes, I should change that. Unless there is a more elegant way to detect
the end of the source?

I see no inelegance in a while statement being used in the manner for
which it was intended, nor any plausible reason for another construct.
Ouch! I must do something about that. Thanks for pointing it out. So if
I want a certain order, I must use a list of tuples?

A list of somethings does seem indicated.
Or is there a way
to have order with dicts?

A dict is a hashtable, intended to provide a mapping from keys to
values. It's not intended to have order. In any case your code doesn't
use the dict as a mapping.
return "\n".join(
    [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
For avoidance of ambiguity, you may like to change that '%s' to %r

In which way would there be ambiguity? The first two are integers, the
last two strings.

The first 3 are %s, the last one is '%s'

Cheers,
John
 

Thomas Mlynarczyk

John said:
[...] You have TWO problems: (1) Reporting the error location as
(offset from the start of the file) instead of (line number, column
position) would get you an express induction into the User Interface
Hall of Shame.

Of course. For the actual message I would use at least the line number.
Still, the offset could be used to compute line/column in case of an
error, so I wouldn't really need to store line/column with each token,
but only the offset. And provide a method to "convert" offset values
into line/column tuples.
(2) In the case of a file with lines terminated by \r\n, the offset is
ambiguous.

If I explicitly state that the offset counts newlines as one character?
But you're right: the offset would be for internal use only - what gets
reported is line/column.
A list of somethings does seem indicated.

On the other hand: If all my tokens are "mutually exclusive" then, in
theory, the order in which they are tried should not matter, as at most
one token could match at any given offset. Still, having the most
frequent tokens being tried first should improve performance.
A dict is a hashtable, intended to provide a mapping from keys to
values. It's not intended to have order. In any case your code doesn't
use the dict as a mapping.

I map token names to regular expressions. Isn't that a mapping?
return "\n".join(
[ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
The first 3 are %s, the last one is '%s'

I only put the single quotes so I could better "see" whitespace in the
output. Anyway, this method is just to be able to check if the lexer
does what it's supposed to do -- in the final version I will probably
get rid of it.

Thanks & greetings,
Thomas
 

John Machin

On the other hand: If all my tokens are "mutually exclusive" then,

But they won't *always* be mutually exclusive (another example is
relational operators (< vs <=, > vs >=)) and AFAICT there is nothing
useful that the lexer can do with an assumption/guess/input that they
are mutually exclusive or not.
in
theory, the order in which they are tried, should not matter, as at most
one token could match at any given offset. Still, having the most
frequent tokens being tried first should improve performance.

Your Lexer class should promise to check the regexes in the order
given. Then the users of your lexer can arrange the order to suit
themselves.
I map token names to regular expressions. Isn't that a mapping?

Your code uses dict methods; this forces your callers to *create* a
mapping. However (as I said) your code doesn't *use* that mapping --
there is no RHS usage of dict[key] or dict.get(key) etc. In fact I'm
having difficulty imagining what possible practical use there could be
for a mapping from token-name to regex.
return "\n".join(
    [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
The first 3 are %s, the last one is '%s'

I only put the single quotes so I could better "see" whitespace in the
output.

To *best* see whitespace (e.g. Is that a TAB or multiple spaces?), use
%r.

General advice: What you think you see is often not what you've
actually got. repr() is your friend; use it.

Cheers,
John
 

Paul McGuire

        Are you forcing the use of fixed length lines then?

        Otherwise, by what algorithm will you convert:


>>> data = """
... one
... two
... three
... four
... five
... supercalifragilisticexpialidocious
... seven
... eight
... nine"""
>>> ix = data.index("list")
>>> ix
39

loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

-- Paul
 

Robert Lehmann

At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:

You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free while not picking your algorithm apart into tiny unrelated pieces.
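
A minimal sketch of that change (assuming the tokens are already kept as
a list of (name, compiled regex) pairs, as discussed earlier in the
thread):

    def tokenize( self, source ):
        offset, line = 0, 1
        while offset < len( source ):
            for name, regex in self.tokens:
                match = regex.match( source, offset )
                if not match:
                    continue
                yield ( name, match, line )   # instead of self.result.append(...)
                offset += len( match.group(0) )
                line   += match.group(0).count( "\n" )
                break
            else:
                raise Exception( 'Syntax error in source at offset %s' % offset )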
token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token

Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".

Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

See http://www.python.org/dev/peps/pep-0342/ for details.

HTH,
 

Thomas Mlynarczyk

Robert said:
You don't have to introduce a `next` method to your Lexer class. You
could just transform your `tokenize` method into a generator by replacing
``self.result.append`` with `yield`. It gives you the just-in-time part
for free while not picking your algorithm apart into tiny unrelated pieces.
Python generators recently (2.5) grew a `send` method. You could use
`next` for unconditional tokenization and ``mytokenizer.send("expected
token")`` whenever you expect a special token.

I will try this. Thank you for the suggestion.

Greetings,
Thomas
 

Thomas Mlynarczyk

But they won't *always* be mutually exclusive (another example is
relational operators (< vs <=, > vs >=)) and AFAICT there is nothing
useful that the lexer can do with an assumption/guess/input that they
are mutually exclusive or not.

"<" vs. "<=" can be handled with lookaheads (?=...) / (?!...) in regular
expressions. True, the lexer cannot do anything useful with the
assumption that all tokens are mutually exclusive. But if they are,
there will be no ambiguity and I am guaranteed to get always the same
sequence of tokens from the same input string.
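
For example (illustrative patterns only):

    tokens = [
        ( 'T_LE', r'<=' ),
        ( 'T_LT', r'<(?!=)' ),   # the lookahead makes the two patterns disjoint
    ]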
Your Lexer class should promise to check the regexes in the order
given. Then the users of your lexer can arrange the order to suit
themselves.

Yes. So there's no way around a list of tuples instead of dict().
Your code uses dict methods; this forces your callers to *create* a
mapping. However (as I said) your code doesn't *use* that mapping --
there is no RHS usage of dict[key] or dict.get(key) etc. In fact I'm
having difficulty imagining what possible practical use there could be
for a mapping from token-name to regex.

Sorry, but I still don't quite get it.

for name, regex in self.tokens.iteritems():
    # ...
    self.result.append( ( name, match, self.line ) )

What I do here is take a name and its associated regex and then store a
tuple (name, match, line). In a simpler version of the lexer, I might
store only `name` instead of the tuple. Is your point that the lexer
doesn't care what `name` actually is, but simply passes it through from
the tokenlist to the result?
To *best* see whitespace (e.g. Is that a TAB or multiple spaces?), use
%r.

(Just having modified my code accordingly:) Ah, yes, indeed, that is
much better!
General advice: What you think you see is often not what you've
actually got. repr() is your friend; use it.

Lesson learned :)

Greetings,
Thomas
 

Thomas Mlynarczyk

Paul said:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1

prints 5,14

I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.
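
A sketch of such an on-demand conversion, along the lines of Paul's
snippet (1-based line and column):

    def offset_to_linecol( source, offset ):
        # 1-based line and column of `offset` in `source`
        line   = source.count( "\n", 0, offset ) + 1
        column = offset - ( source.rfind( "\n", 0, offset ) + 1 ) + 1
        return line, column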

Greetings,
Thomas
 

Paul McGuire

Paul said:
loc = data.index("list")
print data[:loc].count("\n")-1
print loc-data[:loc].rindex("\n")-1
prints 5,14
I'm sure it's non-optimal, but it *is* an algorithm that does not
require keeping track of the start of every line...

Yes, I was thinking of something like this. As long as the line/column
are only needed in case of an error (i.e. at most once per script run),
I consider this more performant than keeping track of line/column for
every token.

Greetings,
Thomas

Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.

-- Paul
 

Thomas Mlynarczyk

Paul said:
Just be sure to account for tabs when computing the column, which this
simple-minded algorithm does not do.

Another thing I had not thought of -- thanks for the hint.

Greetings,
Thomas
 

John Machin

"<" vs. "<=" can be handled with lookaheads (?=...) / (?!...) in regular
expressions.

Single-character tokens like "<" may be more efficiently handled by
doing a dict lookup after failing to find a match in the list of
(name, regex) tuples.
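
A sketch of that combination (names and patterns are illustrative):

    import re

    TOKEN_LIST = [                        # tried in order, longest first
        ( 'T_LE',     re.compile( r'<=' ) ),
        ( 'T_NUMBER', re.compile( r'[+-]?\d+' ) ),
    ]
    SINGLE_CHAR = { '<': 'T_LT', '>': 'T_GT', ':': 'T_COLON' }

    def match_token( source, offset ):
        for name, regex in TOKEN_LIST:
            m = regex.match( source, offset )
            if m:
                return name, m.group(0)
        ch = source[offset]
        if ch in SINGLE_CHAR:             # cheap fallback for one-char tokens
            return SINGLE_CHAR[ch], ch
        raise Exception( 'Syntax error at offset %d' % offset )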
True, the lexer cannot do anything useful with the
assumption that all tokens are mutually exclusive. But if they are,
there will be no ambiguity and I am guaranteed to get always the same
sequence of tokens from the same input string.

So what? That is useless knowledge. It is the ambiguous cases that you
need to be concerned with.
Your Lexer class should promise to check the regexes in the order
given. Then the users of your lexer can arrange the order to suit
themselves.

Yes. So there's no way around a list of tuples instead of dict().
Correct.
Your code uses dict methods; this forces your callers to *create* a
mapping. However (as I said) your code doesn't *use* that mapping --
there is no RHS usage of dict[key] or dict.get(key) etc. In fact I'm
having difficulty imagining what possible practical use there could be
for a mapping from token-name to regex.

Sorry, but I still don't quite get it.

for name, regex in self.tokens.iteritems():
     # ...
     self.result.append( ( name, match, self.line ) )

What I do here is take a name and its associated regex and then store a
tuple (name, match, line). In a simpler version of the lexer, I might
store only `name` instead of the tuple. Is your point that the lexer
doesn't care what `name` actually is, but simply passes it through from
the tokenlist to the result?

No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.
 

Thomas Mlynarczyk

John said:
Single-character tokens like "<" may be more efficiently handled by
doing a dict lookup after failing to find a match in the list of
(name, regex) tuples.

Yes, I will keep that in mind. For the time being, I will use only
regexes to keep the code simpler. Later, or when the need for a speedup
arises, I can use your suggestion for optimizing my code. Depending on
the tokens, it might even be better to use the dict lookup right away
and the regex as a secondary means for more complex stuff.

[Mutually exclusive tokens = no ambiguities = same input -> same output]
So what? That is useless knowledge.

For the lexer, perhaps. Not for the user. An ambiguous lexer will be of
no use.
It is the ambiguous cases that you
need to be concerned with.

Exactly. In which way does that contradict what I said?

[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.

I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?

Greetings,
Thomas
 

John Machin

[Using dict]
No, not at all. The point is that you were not *using* any of the
mapping functionality of the dict object, only ancillary methods like
iteritems -- hence, you should not have been using a dict at all.

I /could/ have done it with a list of tuples. I use no functionality
that /only/ a dict can do. So using a dict here is like using a truck
for transporting a single sheet of paper?

You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.
 

Thomas Mlynarczyk

John said:
You are getting closer. A better analogy is that using a dict is like
transporting passengers along an autobahn in an aeroplane or
helicopter that never leaves the ground.

It is not a bad idea to transport passengers in an airplane, but then
the airplane should not follow an autobahn, but use the shortest way --
at an appropriate altitude. Translated back to Python dicts, this would
mean that using a dict for my purposes is a good idea, but that I do not
make use of its full capabilities -- in other words, I should rewrite my
code -- still using dict but in a better way. Or do you mean that for 10
kilometers of autobahn, an airplane would be overkill?

Maybe I am a bit biased by my PHP background, but { name: regex, ... }
looks simpler to me than [ ( name, regex ), ... ], because the former is
not a nested structure, while the latter would be a 2D-array in PHP.

Suppose I use the dict and I want to access the regex associated with
the token named "tokenname" (that is, no iteration, but a single
access). I could simply write tokendict["tokenname"]. But with the list
of tuples, I can't think of an equally easy way to do that. But then, as
a beginner, I might be underestimating Python.
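
(For what it's worth, a list of pairs converts straight back into a dict
whenever keyed access is needed -- a sketch:)

    token_list = [ ( 'T_NUMBER', r'[+-]?\d+' ),
                   ( 'T_IDENTIFIER', r'[A-Za-z_][A-Za-z0-9_]*' ) ]
    token_dict = dict( token_list )
    regex = token_dict['T_NUMBER']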

Greetings,
Thomas
 
