Parsing, BNF, TreeTop, GhostWheel, ...

Philipp Kempgen · Aug 12, 2010

Hi,

I'm looking for a parser which can be fed with some (A)BNF-style rules.

Gave TreeTop a try
---cut---------------------------------------------------------
grammar MathGrammar
rule additive
multitive '+' additive / multitive
end
rule multitive
primary '*' multitive / primary
end
rule primary
'(' additive ')' / number
end
rule number
[1-9] [0-9]*
end
end
---cut---------------------------------------------------------

as well as GhostWheel
---cut---------------------------------------------------------
MathParser = GhostWheel.build_parser {

rule( :additive , alt( :multitive,
seq( :multitive, '+', :additive )
))
rule( :multitive , alt( seq(

rimary, '*', :multitive ),

rimary
))
rule(

rimary , alt( seq( '(', :additive, ')' ),
:number
))
rule( :number , seq( /[1-9]/ , zplus( /0-9/ ) ))

parser( :main, seq

additive, eof()) )
}
---cut---------------------------------------------------------

'1+1' parses fine.
However when I change the definition of "additive" from
multitive '+' additive / multitive
to
multitive / multitive '+' additive
it fails to parse.

Is this a problem with packrat/PEG parsers in general?
If so, which parser is more appropriate?
It should hand back a parse tree.

Memory consumption is not an issue.

Regards,
Philipp

Philipp Kempgen · Aug 12, 2010

Philipp said:
I'm looking for a parser which can be fed with some (A)BNF-style rules. ...
If so, which parser is more appropriate?
It should hand back a parse tree.

Alternatively I'd appreciate if Regexps could return *all*
occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?

Regards,
Philipp

Iain Barnett · Aug 12, 2010

rules.
=20
Alternatively I'd appreciate if Regexps could return *all*
occurrences of named capture groups inside repetitions etc.
instead of just the last match for each name. Feasible?
=20
Regards,
Philipp
=20

I was trying to do the same thing and asked about repeated named =
captures.
http://osdir.com/ml/ruby-talk/2010-07/msg00361.html

It seems that using String#scan is the closest anything Ruby has, as the =
Oniguruma regex engine doesn't support it. I think it's a real shame as =
named capture groups are really useful.

Regards,
Iain

Robert Klemme · Aug 13, 2010

As I understand you want /f(o)+/ =~ "foo" to return ["o", "o"] as match
for the group (used normal capturing groups for simplicity).

I was trying to do the same thing and asked about repeated named captures.
http://osdir.com/ml/ruby-talk/2010-07/msg00361.html

It seems that using String#scan is the closest anything Ruby has, as the Oniguruma regex engine doesn't support it. I think it's a real shame as named capture groups are really useful.

Regular expression engines generally do only return one match per group
- regardless of naming or not naming groups. I'm not up to date with
current Perl's regular expressions which are the only ones I can imagine
to be crazy enough to provide such a feature.

Otherwise String#scan
is indeed the proper tool to get multiple matches. The example above
could be done like this

if /f(o+)/ =~ s
$1.scan /o/ do |match|
p match
end
end

or this

if /f(o+)/ =~ s
matches = $1.scan /o/
p matches
end

Kind regards

robert

Iain Barnett · Aug 13, 2010

=20
Regular expression engines generally do only return one match per =

group - regardless of naming or not naming groups. I'm not up to date =
with current Perl's regular expressions which are the only ones I can =
imagine to be crazy enough to provide such a feature.

=20

Net definitely offers it, which is ironic as none of the .Net devs I =
used to work with ever used regex.=20

Scan is a really poor alternative (for this kind of task), I think. =
Instead of just writing one regex which expresses what I want to do much =
better, I end up having to write a lot of extra code to get the same =
effect. Scan's cool for more simple things.

Regards
Iain

Philipp Kempgen · Aug 13, 2010

Robert said:
As I understand you want /f(o)+/ =~ "foo" to return ["o", "o"] as match
for the group (used normal capturing groups for simplicity).

FRUIT = '(?<FRUIT> Apple|Banana|Pear|Orange)'
FRUIT_COLLECTION = '(?<FRUIT_COLLECTION> ' << FRUIT << '*)'

re = Regexp.new( '^' << FRUIT_COLLECTION << '$', Regexp::EXTENDED )
re.match( 'PearBananaApple' )[:FRUIT]
=> "Apple"

I would want e.g. matchdata[:FRUIT] to be an Array
[ 'Pear', 'Banana', 'Apple' ]
and not just 'Apple' (the last FRUIT).

Actually something similar to a concrete syntax tree / parse tree
would be even better:

ROOT "PearBananaApple"
FRUIT_COLLECTION "PearBananaApple"
FRUIT "Pear"
FRUIT "Banana"
FRUIT "Apple"

Exactly.
Anyway thanks for your input.

Regards,
Philipp

Clifford Heath · Aug 17, 2010

(Sent previously, but it didn't appear. Sorry if you get this twice).

Philipp said:
Is this a problem with packrat/PEG parsers in general?

Yes. PEG parsers are greedy (iterate as far as possible and never
backtrack on that), and also take the first alternative that succeeds,
without trying subsequent alternatives. It's actually quite useful
and natural, once you get the hang of it.

Pure PEGs have no lookahead, as Treetop does. It's often necessary
though, in the light of the above.

Also, your approach will yield the wrong results when you add '-' into
the mix, because of associativity. "1-2-3" will calculate as 1-(2-3).
A better way is to use iteration, see

<http://github.com/nathansobo/treetop/blob/master/examples/lambda_calculus/arithmetic.treetop>

Clifford Heath.

Clifford Heath · Aug 18, 2010

Christian said:
You may want to look at: http://kanocc.rubyforge.org/

Actually, I think Citrus does a better job than Treetop (of which I'm primary maintainer and a serious user). It provides a BNF built on ruby functions and operators, and also a Treetop/PEG-like parser for grammar files as an alternative. A brief look at Kanocc indicates Citrus may be preferable over it also.

Citrus doesn't memoize by default however, you have to enable that. It also doesn't have support for semantic predicates, an important feature (for me) which I added to Treetop. Otherwise it is preferable in most ways, and I think Michael Jackson has done a sterling job. Kudos!

<http://github.com/mjijackson/citrus/>

Clifford Heath.

Philipp Kempgen · Aug 19, 2010

Clifford said:
Yes. PEG parsers are greedy (iterate as far as possible and never
backtrack on that), and also take the first alternative that succeeds,
without trying subsequent alternatives. It's actually quite useful
and natural, once you get the hang of it.

Pure PEGs have no lookahead, as Treetop does. It's often necessary
though, in the light of the above.

What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.
That seems to rule out PEG parsers.

Clifford said:
Actually, I think Citrus does a better job than Treetop (of which I'm primary maintainer and a serious user). It provides a BNF built on ruby functions and operators, and also a Treetop/PEG-like parser for grammar files as an alternative. A brief look at Kanocc indicates Citrus may be preferable over it also.

Citrus doesn't memoize by default however, you have to enable that. It also doesn't have support for semantic predicates, an important feature (for me) which I added to Treetop.

<http://github.com/mjijackson/citrus/>

Thanks a lot for the pointers Clifford and Christian!
Will have a look as soon as time permits.

Philipp

Clifford Heath · Aug 20, 2010

Philipp said:
Clifford Heath wrote:
What I wanted to do is to construct/generate a parser from a bunch
of (A)BNF rules (dozends) from an RFC so backtracking is a
requirement while lookahead is not. The grammar specification for
the parser (/ parser generator) does not need to be exactly like BNF
but should not require me to think about where lookahead is needed
as this manual conversion can be error-prone.

Most of the BNFs in RFCs cannot be directly automated by any parser
generator. In addition, quite a few of them are slightly incorrect
and/or incomplete, but haven't been fixed because they're always
implemented by humans, who paper over the issues, and introduce new
errors instead ;-)

The upshot is that you're not going to find a direct implementation
that works without thinking.

That seems to rule out PEG parsers.

On the contrary, I believe that a PEG parser might represent the most
direct mapping. PEGs do backtrack, just not on a successful rule or
iteration. You simply have to introduce enough lookahead to prevent
false success or excessive iteration... which is actually pretty easy
once you get the hang of it... the point here is that the *rest* of
the grammar is much more likely to look like the EBNF you started
with. Oh, yeah, you have to refactor any left-recursion too, which can
be difficult.

Clifford Heath.

JÃ¶rg W Mittag · Sep 3, 2010

Clifford said:
Most of the BNFs in RFCs cannot be directly automated by any parser
generator. In addition, quite a few of them are slightly incorrect
and/or incomplete, but haven't been fixed because they're always
implemented by humans, who paper over the issues, and introduce new
errors instead ;-)

The upshot is that you're not going to find a direct implementation
that works without thinking.

On the contrary, I believe that a PEG parser might represent the most
direct mapping. PEGs do backtrack, just not on a successful rule or
iteration. You simply have to introduce enough lookahead to prevent
false success or excessive iteration... which is actually pretty easy
once you get the hang of it... the point here is that the *rest* of
the grammar is much more likely to look like the EBNF you started
with. Oh, yeah, you have to refactor any left-recursion too, which can
be difficult.

Actually, while it is true that PEGs as specificied in Bryan Ford's
paper do not support left recursion, PEGs are usually implemented
using packrat parsers which *can* support direct left recursion just
fine as Alessandro Warth showed in his aptly titled paper "Packrat
Parsers Can Support Left Recursion", although with the caveat that you
lose the linear time complexity guarantee that makes packrat parsing
so compelling in the first place.

jwm

Clifford Heath · Sep 7, 2010

Jörg W Mittag said:
Actually, while it is true that PEGs as specificied in Bryan Ford's
paper do not support left recursion, PEGs are usually implemented
using packrat parsers which *can* support direct left recursion just
fine as Alessandro Warth showed in his aptly titled paper "Packrat
Parsers Can Support Left Recursion",

Nice, thanks for the pointer. I'm not brave enough to try to implement
it in Treetop (the builder used for code gen is a bit of a pain), but
it'd certainly be cool to have. Probably easier to add to Citrus, where
there's no code generator in the way.

Clifford Heath.

Help needed about Treetop::Runtime::SyntaxNode.	1	Jul 23, 2009
Treetop parser (or PEG in general?) questions	18	Jan 27, 2008
Treetop Email Parser	13	Jan 28, 2008
[ANN] parslet 1.1 - a PEG parser library	1	Feb 2, 2011
help with "grammar"	0	May 13, 2009
[SUMMARY] Parsing JSON (#155)	12	Feb 7, 2008
Lalr(n) parsing with reg	1	Apr 25, 2005
[ANN] parslet 1.0 - a PEG parser library	2	Dec 29, 2010

Parsing, BNF, TreeTop, GhostWheel, ...

Philipp Kempgen

Philipp Kempgen

Iain Barnett

Robert Klemme

Iain Barnett

Philipp Kempgen

Clifford Heath

Clifford Heath

Philipp Kempgen

Clifford Heath

JÃ¶rg W Mittag

Clifford Heath

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads