Lalr(n) parsing with reg

V

vikkous

Yesterday, I introduced my the Ruby Extended Grammar, a pattern
matching library for ruby data. Astute readers may have noticed a
slight misnomer. Reg is not a grammar (parser), nor a tool for
grammars. It's really just a very fancy regular expression engine.
Regular expressions are equivalent to state machines. State machines
are not powerful enough by themselves to solve interesting parsing
problems -- that is, how to parse a language like ruby with infix
operators of different precedence and associativity. However, reg can
be easily augmented to be a real solve interesting parsing problems.

Handling precedence and associativity requires a lalr(1) parser. Let me
explain briefly the lalr algorithm:

The important lalr data structures are the stack and input. The input
is simply a stream of tokens fed into the parser, as it requests them.
The next token(s) waiting to be taken off the input is called the
lookahead.
The stack contains the results of partially parsed expressions. At each
step of the parse process, the parser decides (based on what's found at
the top of the stack and in the lookahead) whether to shift another
token off the input onto the stack or to reduce some of the tokens at
the top of the stack using the rules of the language's grammar. At the
end, we expect to see the input empty and on the stack a single token,
which represents the parse tree of the entire program.

Normal parsers (also called compiler compilers) use a big complicated
table to decide at runtime whether to shift or reduce and, if reducing,
which rule to reduce by. This table represents the compiled form of the
language grammar. That's why they're called compiler compilers. My
approach is rather different, and might best be described as an
interpreter interpreter. (Or, if it's to be used in a compiler, it
would be a compiler interpreter.)

Instead of shifting or choosing one rule to match at each step, each
rule is given a chance to match, and when none can, then the input is
shifted. Reg is used as the pattern matching engine, and a small
wrapper layer manages the parser data structures and invokes reg at
each step to do a match
attempt. I believe this approach is in general equivalent to the normal
lalr algorithm.

Yesterday's reg release contained a sketch of these ideas in the form
of a parser for a small, bc-like calculator language, in calc.reg.
I've also reproduced it below. Basically, it's a subset of ruby with
only local variables, numbers, a few operators (+, -, *, /, =, ;),
parentheses, and p as the sole function. Although small, parsing this
language is a representative problem because it requires solving
precedence and associativity.

The heart of the parser are its grammar rules, reproduced here:

#last element is always lookahead
Reduce=
-[ -[:p, '(', exp, ')'].sub {PrintExp.new BR[2]}, OB ] |
# p(exp)
-[ -['(', exp, ')'] .sub {BR[1]}, OB ] |
# (exp)
-[ -[exp, leftop, exp] .sub {OpExp.new *BR[0..2]},
regproc{lowerop(BR[1])} ] | # exp+exp
-[ exp, -[';'] .sub [], :EOI ] |
#elide final trailing ;
-[ -[name, '=', exp] .sub {AssignExp.new BR[0],BR[2]}, lowerop('=')
] #name=exp

Precedence is handled by the middle rule. This rule reduces infix
operator expressions (except =). It only matches if the lookahead does
not contain a higher precedence operator. This ensures that expressions
like '3+4*5' will parse correctly.

Associativity is handled by the last rule. = is the only
right-associative operator, so it's the only one that has to be handled
specially. Again, it allows a reduce only if the lookahead is not also
right-associative (and lower precedence...). This ensures that
expressions like 'a=b=c' will parse correctly.

The great advantage of the interpreter interpreter is flexibility. It
would be quite easy to extend this parser -- even at runtime -- by
adding things at the right place in Reduce. The disadvantage is
performance, which is likely to be very bad currently. The current
implementation of reg is not optimized to any great extent. Many
regexp-type optimizations could be applied to reg. Optimized regexp
engines can actually be quite fast, so, (aside from performance issues
with ruby itself) an optimized reg might actually be competitive with a
table-based parser in terms of performance. Keep in mind that
table-based parsers are not actually the fastest; the gold standard are
hand-coded or direct execution parsers.

Error detection is an area that might be troublesome. I haven't given
this a lot of thought yet, but I think it's approachable, without
causing too much pain. One way might be to wait until a synchronizing
token, then report errors.


calc.reg:

require 'reg'

#warning: this code is untested
#currently, it will not work because it depends on
#features of reg which do not exist (backreferences
and substitutions). in addition,
#it is likely to contain serious bugs, as it has
#not been thoroughly tested or assured in any way.
#nevertheless, it should give you a good idea of
#how this sort of thing works.


precedence={
:'('=>10, :p=>10,
:* =>9, :/ =>9,
:+ =>8, :- =>8,
:'='=>7,
:';'=>6
}
name=String.reg
exp=name|PrintExp|OpExp|AssignExp|Number #definitions of the
expression classes ommitted for brevity
leftop=/^[*\/;+-]$/
rightop=/^=$/
op=leftop|rightop
def lowerop opname
regproc{
leftop & proceq(Symbol) {|v| precedence[opname] >= precedence[v] }
}
end

#last element is always lookahead
Reduce=
-[ -[:p, '(', exp, ')'].sub {PrintExp.new BR[2]}, OB ] |
# p(exp)
-[ -['(', exp, ')'] .sub {BR[1]}, OB ] |
# (exp)
-[ -[exp, leftop, exp] .sub {OpExp.new *BR[0..2]},
regproc{lowerop(BR[1])} ] | # exp+exp
-[ exp, -[';'] .sub [], :EOI ] |
#elide final trailing ;
-[ -[name, '=', exp] .sub {AssignExp.new BR[0],BR[2]}, lowerop('=')
] #name=exp

#last element of stack is always lookahead
def reduceloop(stack)
old_stack=stack
while stack.match +[OBS, Reduce]
end
stack.equal? old_stack or raise 'error'
end

#last element of stack is always lookahead
def parse(input)
input<<:EOI
stack=[input.shift]
until input.empty? and +[OB,:EOI]===stack
stack.push input.shift #shift
reduceloop stack
end
return stack.first
end
 
F

Florian G. Pflug

--------------ms000601000202000408080504
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Yesterday, I introduced my the Ruby Extended Grammar, a pattern
matching library for ruby data. Astute readers may have noticed a
slight misnomer. Reg is not a grammar (parser), nor a tool for
grammars. It's really just a very fancy regular expression engine.
Regular expressions are equivalent to state machines. State machines
are not powerful enough by themselves to solve interesting parsing
problems -- that is, how to parse a language like ruby with infix
operators of different precedence and associativity. However, reg can
be easily augmented to be a real solve interesting parsing problems.

Handling precedence and associativity requires a lalr(1) parser. Let me
explain briefly the lalr algorithm:

The important lalr data structures are the stack and input. The input
is simply a stream of tokens fed into the parser, as it requests them.
The next token(s) waiting to be taken off the input is called the
lookahead.
The stack contains the results of partially parsed expressions. At each
step of the parse process, the parser decides (based on what's found at
the top of the stack and in the lookahead) whether to shift another
token off the input onto the stack or to reduce some of the tokens at
the top of the stack using the rules of the language's grammar. At the
end, we expect to see the input empty and on the stack a single token,
which represents the parse tree of the entire program.
Normal parsers (also called compiler compilers) use a big complicated
table to decide at runtime whether to shift or reduce and, if reducing,
which rule to reduce by. This table represents the compiled form of the
language grammar. That's why they're called compiler compilers. My
approach is rather different, and might best be described as an
interpreter interpreter. (Or, if it's to be used in a compiler, it
would be a compiler interpreter.)
Hm.. I belive it not that different. The tables of an LR(k) parser
specifiy for each input symbol, and each top-of-stack
a) An action (either shift, or "reduct p" where p is a rule
( a production) of your grammar
b) A "goto" - the new state the parser shall transition to.

Your represent the "action" table implicitly - you scan
the rules for every symbol you read, and decide to shift
or to reduce based on that, instead of looking into a predefined
table. Therefore, you just trade compiler-compile time for runtime -
but the mechanism is the same.

The goto table is entirely absent in your approach - but this
stems from the fact that you don't _need_ to remeber a state.
The state of a table-based LR(k) parser is just an "abbreviation"
for the current state of the stack. An table-based LR(k) parser
decided wether to shift or to reduce _soley_ based on the current
input symbol, and the top-of-the-stack. It therefore needs a state,
to "remeber" what it put on the stack previously. Each state
of a LR(k) parser represents a _single_ production (or rule) - but
a rule can be represented by more than one state.
Instead of shifting or choosing one rule to match at each step, each
rule is given a chance to match, and when none can, then the input is
shifted. Reg is used as the pattern matching engine, and a small
wrapper layer manages the parser data structures and invokes reg at
each step to do a match
attempt. I believe this approach is in general equivalent to the normal
lalr algorithm.
I believe this too.

I believe that you could improve the performance of your parser by
just-in-time compiling of the action and goto tables, or some
äquivalent thing.

You could, for example, calculate the FOLLOW set (The set of symbols
which can follow a valid right-hand side of a given rule). Then,
you just have to try those rules which have the current top-of-stack
in their FOLLOW set.

This would give a sort of an half-table-based LR(k) parser.

Anyway, thanks for your cool work, and for getting me interested in
parsers again ;-)

greetings, Florian Pflug

--------------ms000601000202000408080504
Content-Type: application/x-pkcs7-signature; name="smime.p7s"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="smime.p7s"
Content-Description: S/MIME Cryptographic Signature

MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIL4DCC
BewwggVVoAMCAQICAzAABDANBgkqhkiG9w0BAQQFADCByDELMAkGA1UEBhMCQVQxDzANBgNV
BAgTBlZpZW5uYTEPMA0GA1UEBxMGVmllbm5hMSEwHwYDVQQKExhzb2x1dGlvbi14IFNvZnR3
YXJlIEdtYkgxHjAcBgNVBAsTFUNlcnRpZmljYXRlIEF1dGhvcml0eTEqMCgGA1UEAxMhc29s
dXRpb24teDo6Y2VydGlmaWNhdGUtYXV0aG9yaXR5MSgwJgYJKoZIhvcNAQkBFhlob3N0bWFz
dGVyQHNvbHV0aW9uLXguY29tMB4XDTA0MTAwODE4MTkxN1oXDTA5MTAwODE4MTkxN1owgdUx
CzAJBgNVBAYTAkFUMQ8wDQYDVQQIEwZWaWVubmExDzANBgNVBAcTBlZpZW5uYTEhMB8GA1UE
ChMYc29sdXRpb24teCBTb2Z0d2FyZSBHbWJIMSkwJwYDVQQKEyBjMjFmOTY5YjVmMDNkMzNk
NDNlMDRmOGYxMzZlNzY4MjEeMBwGA1UECxMVQ2VydGlmaWNhdGUgQXV0aG9yaXR5MRkwFwYD
VQQDExBGbG9yaWFuIEcuIFBmbHVnMRswGQYJKoZIhvcNAQkBFgxmZ3BAcGhsby5vcmcwgZ8w
DQYJKoZIhvcNAQEBBQADgY0AMIGJAoGBAPaQmTO2jrGX3ZKnGs4+VWaQhLkqUYfeL2jzI4Xa
2amxM7T/mdAzEsyXY5yQd2itdgfX2QFeSdIpB7/kF8NdxEK2jCeYSAwbqw+YXulAg3E4XKR5
qOROdEZIy50/K4X0yKeU/xy5LmIfLoqfJQ2Ao5XQarTvBRPbAnJmITSqkBHZAgMBAAGjggLT
MIICzzAJBgNVHRMEAjAAMBEGCWCGSAGG+EIBAQQEAwIFoDAOBgNVHQ8BAf8EBAMCBeAwHQYD
VR0lBBYwFAYIKwYBBQUHAwQGCCsGAQUFBwMCMB0GA1UdDgQWBBSn0xyNYO5HD0vtMeCcUCTz
J13zozCB9QYDVR0jBIHtMIHqgBTaa1BCkjUW3vQWeh5IMqml0irzFKGBzqSByzCByDELMAkG
A1UEBhMCQVQxDzANBgNVBAgTBlZpZW5uYTEPMA0GA1UEBxMGVmllbm5hMSEwHwYDVQQKExhz
b2x1dGlvbi14IFNvZnR3YXJlIEdtYkgxHjAcBgNVBAsTFUNlcnRpZmljYXRlIEF1dGhvcml0
eTEqMCgGA1UEAxMhc29sdXRpb24teDo6Y2VydGlmaWNhdGUtYXV0aG9yaXR5MSgwJgYJKoZI
hvcNAQkBFhlob3N0bWFzdGVyQHNvbHV0aW9uLXguY29tggEAMBcGA1UdEQQQMA6BDGZncEBw
aGxvLm9yZzAkBgNVHRIEHTAbgRlob3N0bWFzdGVyQHNvbHV0aW9uLXguY29tMEkGA1UdHwRC
MEAwPqA8oDqGOGh0dHBzOi8vYWRtaW4uaHEuc29sdXRpb24teC5jb20vcGhwa2kvY2Fjcmwv
c3hfY2FjcmwuY3JsMEIGCWCGSAGG+EIBDQQ1FjNQSFBraS9PcGVuU1NMIEdlbmVyYXRlZCBF
LW1haWwgKFMvTUlNRSkgQ2VydGlmaWNhdGUwNAYJYIZIAYb4QgECBCcWJWh0dHBzOi8vYWRt
aW4uaHEuc29sdXRpb24teC5jb20vcGhwa2kwIwYJYIZIAYb4QgEDBBYWFG5zX3Jldm9rZV9x
dWVyeS5waHA/MEAGCWCGSAGG+EIBCAQzFjFodHRwczovL2FkbWluLmhxLnNvbHV0aW9uLXgu
Y29tL3BocGtpL3BvbGljeS5odG1sMA0GCSqGSIb3DQEBBAUAA4GBAIpKvrQrG2shBRMGfyGw
4068QBerNs9JMYvkEuaJqZ3Xsdz8Ju+sA4090r9v6BtdZWJ29opyj34+l9Q/BM8mKoIHUB8U
Vy0iNpL0rsB9wphSCva4vnyO+Y9okcC+kgCcY8QAqq+z929pMURLtWcJeQS1L4Ae0qzL+3Ry
cp+WdH3GMIIF7DCCBVWgAwIBAgIDMAAEMA0GCSqGSIb3DQEBBAUAMIHIMQswCQYDVQQGEwJB
VDEPMA0GA1UECBMGVmllbm5hMQ8wDQYDVQQHEwZWaWVubmExITAfBgNVBAoTGHNvbHV0aW9u
LXggU29mdHdhcmUgR21iSDEeMBwGA1UECxMVQ2VydGlmaWNhdGUgQXV0aG9yaXR5MSowKAYD
VQQDEyFzb2x1dGlvbi14OjpjZXJ0aWZpY2F0ZS1hdXRob3JpdHkxKDAmBgkqhkiG9w0BCQEW
GWhvc3RtYXN0ZXJAc29sdXRpb24teC5jb20wHhcNMDQxMDA4MTgxOTE3WhcNMDkxMDA4MTgx
OTE3WjCB1TELMAkGA1UEBhMCQVQxDzANBgNVBAgTBlZpZW5uYTEPMA0GA1UEBxMGVmllbm5h
MSEwHwYDVQQKExhzb2x1dGlvbi14IFNvZnR3YXJlIEdtYkgxKTAnBgNVBAoTIGMyMWY5Njli
NWYwM2QzM2Q0M2UwNGY4ZjEzNmU3NjgyMR4wHAYDVQQLExVDZXJ0aWZpY2F0ZSBBdXRob3Jp
dHkxGTAXBgNVBAMTEEZsb3JpYW4gRy4gUGZsdWcxGzAZBgkqhkiG9w0BCQEWDGZncEBwaGxv
Lm9yZzCBnzANBgkqhkiG9w0BAQEFAAOBjQAwgYkCgYEA9pCZM7aOsZfdkqcazj5VZpCEuSpR
h94vaPMjhdrZqbEztP+Z0DMSzJdjnJB3aK12B9fZAV5J0ikHv+QXw13EQraMJ5hIDBurD5he
6UCDcThcpHmo5E50RkjLnT8rhfTIp5T/HLkuYh8uip8lDYCjldBqtO8FE9sCcmYhNKqQEdkC
AwEAAaOCAtMwggLPMAkGA1UdEwQCMAAwEQYJYIZIAYb4QgEBBAQDAgWgMA4GA1UdDwEB/wQE
AwIF4DAdBgNVHSUEFjAUBggrBgEFBQcDBAYIKwYBBQUHAwIwHQYDVR0OBBYEFKfTHI1g7kcP
S+0x4JxQJPMnXfOjMIH1BgNVHSMEge0wgeqAFNprUEKSNRbe9BZ6HkgyqaXSKvMUoYHOpIHL
MIHIMQswCQYDVQQGEwJBVDEPMA0GA1UECBMGVmllbm5hMQ8wDQYDVQQHEwZWaWVubmExITAf
BgNVBAoTGHNvbHV0aW9uLXggU29mdHdhcmUgR21iSDEeMBwGA1UECxMVQ2VydGlmaWNhdGUg
QXV0aG9yaXR5MSowKAYDVQQDEyFzb2x1dGlvbi14OjpjZXJ0aWZpY2F0ZS1hdXRob3JpdHkx
KDAmBgkqhkiG9w0BCQEWGWhvc3RtYXN0ZXJAc29sdXRpb24teC5jb22CAQAwFwYDVR0RBBAw
DoEMZmdwQHBobG8ub3JnMCQGA1UdEgQdMBuBGWhvc3RtYXN0ZXJAc29sdXRpb24teC5jb20w
SQYDVR0fBEIwQDA+oDygOoY4aHR0cHM6Ly9hZG1pbi5ocS5zb2x1dGlvbi14LmNvbS9waHBr
aS9jYWNybC9zeF9jYWNybC5jcmwwQgYJYIZIAYb4QgENBDUWM1BIUGtpL09wZW5TU0wgR2Vu
ZXJhdGVkIEUtbWFpbCAoUy9NSU1FKSBDZXJ0aWZpY2F0ZTA0BglghkgBhvhCAQIEJxYlaHR0
cHM6Ly9hZG1pbi5ocS5zb2x1dGlvbi14LmNvbS9waHBraTAjBglghkgBhvhCAQMEFhYUbnNf
cmV2b2tlX3F1ZXJ5LnBocD8wQAYJYIZIAYb4QgEIBDMWMWh0dHBzOi8vYWRtaW4uaHEuc29s
dXRpb24teC5jb20vcGhwa2kvcG9saWN5Lmh0bWwwDQYJKoZIhvcNAQEEBQADgYEAikq+tCsb
ayEFEwZ/IbDjTrxAF6s2z0kxi+QS5ompndex3Pwm76wDjT3Sv2/oG11lYnb2inKPfj6X1D8E
zyYqggdQHxRXLSI2kvSuwH3CmFIK9ri+fI75j2iRwL6SAJxjxACqr7P3b2kxREu1Zwl5BLUv
gB7SrMv7dHJyn5Z0fcYxggP2MIID8gIBATCB0DCByDELMAkGA1UEBhMCQVQxDzANBgNVBAgT
BlZpZW5uYTEPMA0GA1UEBxMGVmllbm5hMSEwHwYDVQQKExhzb2x1dGlvbi14IFNvZnR3YXJl
IEdtYkgxHjAcBgNVBAsTFUNlcnRpZmljYXRlIEF1dGhvcml0eTEqMCgGA1UEAxMhc29sdXRp
b24teDo6Y2VydGlmaWNhdGUtYXV0aG9yaXR5MSgwJgYJKoZIhvcNAQkBFhlob3N0bWFzdGVy
QHNvbHV0aW9uLXguY29tAgMwAAQwCQYFKw4DAhoFAKCCAnswGAYJKoZIhvcNAQkDMQsGCSqG
SIb3DQEHATAcBgkqhkiG9w0BCQUxDxcNMDUwNDI2MTUyMDIyWjAjBgkqhkiG9w0BCQQxFgQU
Q1Ab176WBuxzIDWZLS++qlIZBekwUgYJKoZIhvcNAQkPMUUwQzAKBggqhkiG9w0DBzAOBggq
hkiG9w0DAgICAIAwDQYIKoZIhvcNAwICAUAwBwYFKw4DAgcwDQYIKoZIhvcNAwICASgwgeEG
CSsGAQQBgjcQBDGB0zCB0DCByDELMAkGA1UEBhMCQVQxDzANBgNVBAgTBlZpZW5uYTEPMA0G
A1UEBxMGVmllbm5hMSEwHwYDVQQKExhzb2x1dGlvbi14IFNvZnR3YXJlIEdtYkgxHjAcBgNV
BAsTFUNlcnRpZmljYXRlIEF1dGhvcml0eTEqMCgGA1UEAxMhc29sdXRpb24teDo6Y2VydGlm
aWNhdGUtYXV0aG9yaXR5MSgwJgYJKoZIhvcNAQkBFhlob3N0bWFzdGVyQHNvbHV0aW9uLXgu
Y29tAgMwAAQwgeMGCyqGSIb3DQEJEAILMYHToIHQMIHIMQswCQYDVQQGEwJBVDEPMA0GA1UE
CBMGVmllbm5hMQ8wDQYDVQQHEwZWaWVubmExITAfBgNVBAoTGHNvbHV0aW9uLXggU29mdHdh
cmUgR21iSDEeMBwGA1UECxMVQ2VydGlmaWNhdGUgQXV0aG9yaXR5MSowKAYDVQQDEyFzb2x1
dGlvbi14OjpjZXJ0aWZpY2F0ZS1hdXRob3JpdHkxKDAmBgkqhkiG9w0BCQEWGWhvc3RtYXN0
ZXJAc29sdXRpb24teC5jb20CAzAABDANBgkqhkiG9w0BAQEFAASBgCm9ROUyr3rCTNwR/grp
DfOR67V+E1nCEWOwXSbNA+LJdEZOubn6vhrOhek3H+HRe8s5D9wdR55f2QqMHXXjyb1pT8z5
dpB/2maQooAFxUFACmFaq7BEuJmF/EyPd2FY8Q28DB0L3eHhNfNH9s97GKtZrYZLwreE3OFl
gm0pi/E+AAAAAAAA
--------------ms000601000202000408080504--
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,073
Latest member
DarinCeden

Latest Threads

Top