Neil Cerutti
I found some clues on lexing using the re module in Python in an
article by Martin Löwis.
http://www.python.org/community/sigs/retired/parser-sig/towards-standard/
He writes:
[...]
A scanner based on regular expressions is usually implemented
as an alternative of all token definitions. For XPath, a
fragment of this expressions looks like this:
(?P<Number>\\d+(\\.\\d*)?|\\.\\d+)|
(?P<VariableReference>\\$""" + QName + """)|
(?P<NCName>"""+NCName+""")|
(?P<QName>"""+QName+""")|
(?P<LPAREN>\\()|
Here, each alternative in the regular expression defines a
named group. Scanning proceeds in the following steps:
1. Given the complete input, match the regular expression
with the beginning of the input.
2. Find out which alternative matched.
[...]
Item 2 is where I get stuck. There doesn't seem to be an obvious
way to do it, which I understand is a bad thing in Python.
Whatever source code went with the article originally is not
linked from the above page, so I don't know what Martin did.
Here's what I came up with (with a trivial example regex):
    import re
    r = re.compile('(?P<x>x+)|(?P<a>a+)')
    m = r.match('aaxaxx')
    if m:
        for k in r.groupindex:
            if m.group(k):
                # Find the token type.
                token = (k, m.group())
I wish I could do something obvious instead, like m.name().
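
For what it's worth, match objects carry a lastgroup attribute that
seems to come close to the m.name() wished for above: it holds the name
of the last matched capturing group, or None. The sketch below assumes
every token alternative is a top-level named group, in which case
lastgroup should be the name of the alternative that matched, even when
that alternative contains nested unnamed groups. The token names
(NUMBER, NAME, LPAREN, WS) and the tokenize() helper are made up for
illustration and are not taken from Martin's article.

```python
import re

# The trivial regex from above: lastgroup names the alternative that matched.
r = re.compile('(?P<x>x+)|(?P<a>a+)')
m = r.match('aaxaxx')
token = (m.lastgroup, m.group())   # ('a', 'aa')

# A slightly larger sketch. The token names are hypothetical.
TOKEN_RE = re.compile(r'''
    (?P<NUMBER>\d+(\.\d*)?|\.\d+)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<LPAREN>\()
  | (?P<WS>\s+)
''', re.VERBOSE)

def tokenize(text):
    """Yield (token_type, lexeme) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        # Step 1: match the combined regex at the current position.
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('unexpected character %r at position %d'
                             % (text[pos], pos))
        # Step 2: lastgroup tells us which alternative matched.
        if m.lastgroup != 'WS':
            yield (m.lastgroup, m.group())
        pos = m.end()
```

Calling list(tokenize('f(3.14 x')) walks the input one token at a time.
The order of the alternatives still matters, since the first one that
matches at a position wins.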