Treetop parser (or PEG in general?) questions


Phrogz

I've been looking for something like treetop for a while now. Very
excited to have found it, and to play with it.

I'm rather new to PEGs, having only experienced them briefly in Lua.
Following are some questions as I try to adapt my thinking to their
ways.

Let's assume that I'm trying to parse the following (wiki) markup:
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
= Welcome! =
Hello world! This is a **single paragraph**
that **wraps over //two//** physical lines.

This is a second paragraph. All on one line.

== A Table Example ==
|| **Head1** || **Head2** ||
|| row1col1 || row1col2 ||
|| r2c1 || //r2c2// ||

This is the last paragraph. There is no newline
after this final period.
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_


I'm trying to convert it to this (HTML):
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
<h1>Welcome!</h1>
<p>Hello world! This is a <strong>single paragraph</strong>
that <strong>wraps over <em>two</em></strong> physical lines.</p>
<p>This is a second paragraph. All on one line.</p>
<h2>A Table Example</h2>
<table>
<tr>
<td><strong>Head1</strong></td>
<td><strong>Head2</strong></td>
</tr>
<tr>
<td>row1col1</td>
<td>row1col2</td>
</tr>
<tr>
<td>r2c1</td>
<td><em>r2c2</em></td>
</tr>
</table>
<p>This is the last paragraph. There is no newline
after this final period.</p>
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_

Three questions in particular (of many) jump out at me:


LINE ANCHORING
In the above, there is 'block' content, and 'inline' content. Headers,
paragraphs, and tables are block level items. Among other things, this
means that their markup is only valid at the start of a line. For
example, a line that started " = Hi =" would not constitute a valid
header. In Regexp land, this would be handled simply with a ^ anchor.
How do you handle this in Treetop? Do you simply ensure that the root
rule of the document contains only the block-level rules that consume
up to and including one or more newlines?

grammar SimpleWiki
  rule document
    heading / paragraph / table / newlines
  end
  rule heading
    stuff "\n"
  end
  rule paragraph
    stuff "\n\n"
  end
  rule table
    stuff "\n"
  end
  rule newlines
    "\n"+
  end
end

HANDLING EOF
As seen in the example above, a paragraph (or any block level element,
really) is allowed to not have a newline if it's the last thing in the
file. Do you handle this case normally by just preprocessing the input
and shoving a newline on the end if it doesn't exist, or is there a
way in Treetop to recognize the /\Z/ anchor from a Regexp?


BACKREFERENCES
A valid heading must match this regexp: /^(=+) (.+) \1$/
It must start at the front of the line with one or more = characters.
It can have anything in between (including some = characters).
It ends with the same number of = characters, which must be followed
by a newline.

I know one of the sweet things about PEGs over Regexps is their
ability to match grammars with nested rules. I can't figure out how to
use this to my advantage in the above. If I adapt the nested parens
example from the Treetop documentation...

# Assume we're already at the start of a line.
rule heading
  '=' heading '='
  /
  ' ' string_of_words ' '
end

....then I fail to account for the necessary newline that follows the
last equals sign. But I can't figure out how to change this to use a
newline without messing up the recursive parsing.


Other questions I won't dive into: what's a reasonable way to eat
inline content (words) while allowing inline markup? Am I necessarily
going to end up with a tree for a paragraph that has one child for
each word (or letter)? Can I consume and throw away the newline in the
middle of a paragraph (as part of a string_of_words) without messing
up the end delimiters for the paragraph? (I think so.) Will lookaheads
suffice to wrap <table>...</table> around all the rows, and <tr>...</tr>
around all the cells in a row? (I think so.)
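
Here's a rough, untested sketch of the kind of thing I have in mind for the
table part (the rule names and the to_html helpers are just guesses, and
inline markup inside the cells is ignored):

rule table
  row+ "\n"* {
    def to_html
      "<table>\n" + elements[0].elements.map { |r| r.to_html }.join + "</table>\n"
    end
  }
end

rule row
  '||' cells:( cell '||' )+ "\n" {
    def to_html
      "<tr>\n" +
      cells.elements.map { |c| "<td>#{c.cell.text_value.strip}</td>\n" }.join +
      "</tr>\n"
    end
  }
end

rule cell
  (!'||' !"\n" .)*
end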

Any help or ideas appreciated.
 

Phrogz

An additional Treetop question that has me stumped:

I have a simple grammar as listed at the end of this post. In the
following code, why can I not get the 'to_xml' method to flow through
the 'inline_atom' rule, to use the to_xml of the underlying wrrd and
numz classes? Why does inline_atom not provide the #wrrd and #numz
SyntaxNode methods inside the handlers for inline_atom? How can I
rewrite inline_atom to allow the flow through?


require 'rubygems'
require 'treetop'

Treetop.load "t1.treetop"
@parser = T1Parser.new
@root = @parser.parse( "Hi 123\n\n" )
p @root
#=> SyntaxNode+Paragraph1+Paragraph0 offset=0, "Hi 123\n\n" (inline_content,to_xml):
#=>   SyntaxNode+InlineContent2+InlineContent1 offset=0, "Hi 123" (all_items,to_xml,items,last_item):
#=>     SyntaxNode offset=0, "Hi ":
#=>       SyntaxNode+InlineContent0 offset=0, "Hi " (wspace,item):
#=>         SyntaxNode+InlineAtom1+Wrrd0 offset=0, "Hi" (to_xml):
#=>           SyntaxNode offset=0, "H"
#=>           SyntaxNode offset=1, "i"
#=>         SyntaxNode offset=2, " ":
#=>           SyntaxNode offset=2, " "
#=>     SyntaxNode+InlineAtom0+Numz0 offset=3, "123" (to_xml):
#=>       SyntaxNode offset=3, "1"
#=>       SyntaxNode offset=4, "2"
#=>       SyntaxNode offset=5, "3"
#=>   SyntaxNode offset=6, "\n\n"

p @root.to_xml
#=> NameError: undefined local variable or method 'wrrd' for #<Treetop::Runtime::SyntaxNode:0x5c7808>


-_- t1.treetop -_-

grammar T1
  rule paragraph
    inline_content "\n\n" {
      def to_xml
        "<p>#{inline_content.to_xml}</p>\n"
      end
    }
  end

  rule inline_content
    items:( item:inline_atom wspace )* last_item:inline_atom {
      def to_xml
        all_items.map{ |atom| atom.to_xml }.join( ' ' )
      end
      def all_items
        all = []
        all << first_item if methods.include?( 'first_item' )
        all.concat items.elements.map{ |el| el.item }
        all << last_item if methods.include?( 'last_item' )
        all
      end
    }
  end

  rule inline_atom
    numz {
      def to_xml
        numz.to_xml
      end
    }
    /
    wrrd {
      def to_xml
        wrrd.to_xml
      end
    }
  end

  rule wrrd
    [A-Za-z]+ {
      def to_xml
        "<word>#{text_value}</word>"
      end
    }
  end

  rule numz
    ([1-9]+ / '0') {
      def to_xml
        "<i>#{text_value}</i>"
      end
    }
  end

  rule wspace
    [ \t]+
  end

end
 

Phil Tomson

I've been playing with TreeTop for a while now as well...
I would suggest that you use tt to generate the parser and then take a
look at the resulting code:

tt t1.treetop

The resulting code will be in t1.rb.

I think you'll find that the call to wrrd.to_xml doesn't work there
because wrrd is not in scope. Not sure how to fix it, though; maybe
you could use elements[0].wrrd?

Phil
 

Clifford Heath

Phrogz said:
I've been looking for something like treetop for a while now. Very
excited to have found it, and to play with it.

So was I - thanks Nathan! - and many of the recent improvements are mine.
Let's assume that I'm trying to parse the following (wiki) markup: .... snip...
Three questions (in particular, of many) jump out at me:
LINE ANCHORING
In the above, there is 'block' content, and 'inline' content. Headers,
paragraphs, and tables are block level items. Among other things, this
means that their markup is only valid at the start of a line. For
example, a line that started " = Hi =" would not constitute a valid
header. In Regexp land, this would be handled simply with a ^ anchor.
How do you handle this in Treetop?

Treetop (and PEGs in general) requires no separate lexer - all lexing
can be done with the same algorithmic efficiency as you get from a
typical DFA-based lexer anyhow.

As a result, Treetop has no notion of whitespace or newlines - they're
just characters, and you have to have syntax rules that match them.
If you want a rule to match after a newline, call it after a newline,
or at the start of the text. If you want a rule to match only if
followed by a newline, follow it with: & '\n'
You can use any rule after & - the rule must match, but the input won't
be consumed by the calling rule.
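
For example, a small illustrative rule (not from any grammar in this
thread) that matches a run of '=' characters only when a newline comes
next, without consuming the newline itself:

rule trailing_equals
  '='+ &"\n"
end
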
HANDLING EOF
As seen in the example above, a paragraph (or any block level element,
really) is allowed to not have a newline if it's the last thing in the
file. Do you handle this case normally by just preprocessing the input
and shoving a newline on the end if it doesn't exist,

That's what I'd do.
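
Something along these lines before handing the text over, say (input and
parser here stand for whatever you already have in hand):

input += "\n" unless input =~ /\n\z/
root = parser.parse(input)
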
or is there a
way in Treetop to recognize the /\Z/ anchor from a Regexp?

No. There's no EOF symbol. Perhaps there should be..
BACKREFERENCES
A valid heading must match this regexp: /^(=+) (.+) \1$/
It must start at the front of the line with one or more =
characters.
It can have anything (including some = characters).
It ends with the same number of = characters, which must be followed
by a newline.

I don't think there's a way of saying that the trailing =s must be
equal in number to those matched by the leading '='+.

In general, there's no way to inject custom code that affects the
parsing process (what ANTLR calls a semantic predicate), which is
a pity. It's something I want sometimes too, so if you can suggest
a clean enough way to specify it, I'll think about implementing it.
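
In the meantime, if the markup only ever uses a couple of heading levels,
one workaround is simply to spell each level out as its own alternative
instead of trying to count the '=' characters. An untested sketch (the
(!' =' .)+ part just grabs everything up to the trailing marker):

rule heading
  h2 / h1
end

rule h1
  '= ' (!' =' .)+ ' =' &"\n"
end

rule h2
  '== ' (!' =' .)+ ' ==' &"\n"
end
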
Other questions I won't dive into: what's a reasonable way to eat
inline content (words) while allowing inline markup?

Your problem here, if I read it correctly, is that you want to read anything
that's not markup. Here's where the ! operator comes in. You might have a
rule called "markup", which matches any markup, and a rule "word" that you
call like this:

rule words
  (!markup word)*
end

That will match any sequence of zero or more "word", where no word matches
"markup". Here, the rule following ! may be of any complexity, as with &.

This is essentially the "C comment matching problem". Here's what I use
for C-style comments, C++ style comments, and whitespace:

rule s # Optional space
  S?
end

rule S # Mandatory space
  (white / comment_to_eol / comment_c_style)+
end

rule white
  [ \t\n\r]+
end

rule comment_to_eol
  '//' (!"\n" .)+
end

rule comment_c_style
  '/*' (!'*/' . )* '*/'
end
Am I necessarily
going to end up with a tree for a paragraph that has one child for
each word (or letter)?

You'll have one leaf per leaf rule (lexical rule) - but you don't need
to look at it, you can use "text_value" of any node, which is just the
substring of the input spanned by that rule.
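
For example, with the T1 grammar posted earlier (the values here are just
read off the dump above):

root = T1Parser.new.parse( "Hi 123\n\n" )
root.text_value                  #=> "Hi 123\n\n"
root.inline_content.text_value   #=> "Hi 123"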

I hope that's some help. My CQL parser might give you some more ideas, at
<http://activefacts.rubyforge.org/svn/lib/activefacts/cql/CQLParser.treetop>.
It's a different style of language than what you're parsing, but also needs
large amounts of backtracking at times.

Clifford Heath.
 

Clifford Heath

Phrogz said:
require 'rubygems'
require 'treetop'

Treetop.load "t1.treetop"
@parser = T1Parser.new
@root = @parser.parse( "Hi 123\n\n" )

Just a tip: Treetop now uses my Polyglot gem, which hooks require,
so that instead of calling "Treetop.load 't1.treetop'", you can
just say:

require 'treetop'
require 't1'

If the .rb file is found first, that'll be loaded; if not, Treetop will
compile 't1.treetop' (or 't1.tt', whichever you use).

In my CQL parser, I have a file cql.rb, which does the require 'treetop'
and require 'CQLParser', and also uses Polyglot to define a CQL load
function, so if you have a file "model.cql", you can just:

require 'cql'
require 'model'

and the model.cql file is compiled by the CQL parser (which itself
is created dynamically by Treetop if needed). Cute stuff.

Clifford Heath.
 

Clifford Heath

Phrogz said:
An additional Treetop question that has me stumped:

I have a simple grammar as listed at the end of this post. In the
following code, why can I not get the 'to_xml' method to flow through
the 'inline_atom' rule, to use the to_xml of the underlying wrrd and
numz classes?

I'll make a commentary first, leading up to your answer :).

In inline_content, you have a sequence containing a sequence:
items:( item:inline_atom wspace )* last_item:inline_atom {...

which creates *two* SyntaxNodes. Your code block is emitted into the
module InlineContent2, which is extended into the outer SyntaxNode,
as you see in the dump of the syntax tree. The notation:
SyntaxNode+InlineContent2+InlineContent1 in the dump says that the
object is of class SyntaxNode, but is extended with the two modules
named. The dump also shows the interesting methods that have been
added to your nodes... cute, eh?

This code block does: methods.include?('first_item') which will
always fail, since first_item is nowhere defined. I think you
meant to say:

def all_items
  items.elements.map{ |e| e.item } + [last_item]
end

You should use respond_to?, not methods.include?, anyhow if you want
to test for an alternative being taken - it saves building an array
just to see whether it contains your element.
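
That is, something along these lines (a sketch for a hypothetical grammar
in which first_item really is an optional labelled element):

def all_items
  all = []
  all << first_item if respond_to?( :first_item )
  all.concat( items.elements.map { |el| el.item } )
  all << last_item
  all
end
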
Why does inline_atom not provide the #wrrd and #numz
SyntaxNode methods inside the handlers for inline_atom? How can I
rewrite inline_atom to allow the flow through?

The alternative that contains wrrd has only that element, i.e. it's
not a sequence or a repetition, so it doesn't get a SyntaxNode of its
own. Any code block you add is a module that gets extended into the
node returned from the wrrd rule (and must refer to that node as self,
not by the name wrrd), as you'll see in the dump: the SyntaxNode which
is extended with the Wrrd0 module (which is created by the wrrd rule)
is a direct child of the inline_content node.

So in this case, you can remove the two code blocks in the inline_atom
rule and your program just works.
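
That is, the whole rule can shrink to this, and the to_xml defined by the
wrrd and numz rules is reached directly:

rule inline_atom
  numz / wrrd
end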

Note that you can have a code block on each alternative as well as the
rule as a whole:

rule inline_atom
  ( numz / wrrd { def bar; ... end } ) { def foo; ... end }
end

This adds the method foo to whichever alternative was taken, but the
method bar only if wrrd was taken. In both cases, the methods are
defined in a module that is extended into the node; no additional
nodes are created by this rule.

Phil's advice is good - run tt and read the emitted code - it's not
difficult and will help you understand what's going on.

Clifford Heath.
 

Phil Tomson

Note that you can have a code block on each alternative as well as the
rule as a whole:

rule inline_atom
  ( numz / wrrd { def bar; ... end } ) { def foo; ... end }
end

This adds the method foo to whichever alternative was taken, but the
method bar only if wrrd was taken. In both cases, the methods are
defined in a module that is extended into the node; no additional
nodes are created by this rule.

This is good to know. So you could also do this, correct?:

rule inline_atom
  ( numz { def bar; ... end } / wrrd { def bar; ... end } ) { def foo; ... end }
end

so that each alternative gets its own bar method.

Here's something I'm trying to figure out. I want to parse certain
types of declarations in a language - port declarations in VHDL. I
don't want to have to create a parser for the whole VHDL language. So
let's say I have this VHDL code:

library IEEE;
use IEEE.std_logic_1164.all;

entity foo is
  port( a,b : in bit;
        c : out bit
      );
end foo;
other stuff blah blah...;

The only thing I care about is that port declaration in the middle. I
want to extract the signals names from it (a,b,c) and the directions
(in,out).

I thought this might work:

grammar VHDL
  rule top_level
    ( port_decl / . )* {
      def get_ports
        if elements[0].respond_to? :ports
          elements[0].ports
        else
          []
        end
      end
    }
  end

  rule port_decl
    spc port_keyword spc '(' spc io_ports:interface_list ')' spc ';'
    spc <PortDeclNode> {
      def ports
        io_ports.ports
      end
    }
  end

  rule interface_list
    psd:interface_signal_decl more_port_signal_decls:( spc ';' spc
      sig_decl:interface_signal_decl spc )* {

      def ports
        ([psd.port_decls] + more_port_signal_decls).flatten
      end

      def more_port_signal_decls
        super.elements.map { |elt| elt.sig_decl.port_decls }
      end
    }
  end

  rule interface_signal_decl
    p_name:name more_names:( spc ',' spc other_name:name )* spc ':' spc
    dir:mode spc sig_typ:sig_type spc <PortSigDeclNode> {

      class InterfaceSigDecl
        attr_accessor :name, :mode, :type
        def initialize name, mode, type
          @name = name
          @mode = mode
          @type = type
          puts "name: #{name} mode: #{mode} type: #{type}"
        end

        def to_s
          "#{@name} : #{@mode} #{@type} \n"
        end
      end

      def port_name
        p_name.text_value.downcase
      end

      def port_decls
        port_names.map { |pn| InterfaceSigDecl.new pn, direction, type }
      end

      def port_names
        [p_name.text_value.downcase] + more_names
      end

      def more_names
        super.elements.map { |elt| elt.other_name.text_value.downcase }
      end

      def direction
        dir.text_value.downcase
      end

      def type
        sig_typ.text_value.downcase
      end
    }
  end

... end grammar (lots of other stuff, but not important for the example)

I can pass a port declaration on its own followed by garbage, like:

ports = parse 'port( x : in bit ) ; dfe;'
ps = ports.get_ports

And that will work, I get the ports list out.

However, if I try:

ports = parse 'xyz; port( x : in bit ) ; dfe;'
ps = ports.get_ports

The ports list is empty because the '.' matches the whole string due
to the "xyz;" at the beginning of the line.

So how would one go about extracting one valid syntactic element (the
port_decl in this case) from surrounding elements that one doesn't
care about?

Phil
 

Clifford Heath

Phil said:
This is good to know. So you could also do this, correct?:

rule inline_atom
( numz { def bar; ... end} / wrrd { def bar; ... end} ) { def foo; ... end}
end

so that each alternative gets it's own bar method.
Yes.

So how would one go about extracting one valid syntactic element (the
port_decl in this case) from surrounding elements that one doesn't
care about?

You're trying to skip any amount of stuff up to the port declaration,
then parse that, then skip to the end. Now first I'll ignore that your
"stuff" can presumably contain comments, which may contain the word
"port", but you do it like this:

rule vhdl_file_wth_port
  ( !'port' . )* port_decl .*
end

This says to parse any number of single characters as long as you aren't
looking at the word "port", then parse the port_decl and skip the rest.

Just beware of the fact that here, "port" might be embedded inside
another word, like "supportable". I tend to create keyword rules to
handle that:

rule port
  'port' !alphanumeric
end

rule alphanumeric
  [a-zA-Z0-9]
end

The keyword rule should only be called when you know you've just seen
something that can't be part of a word - never after an arbitrary character.

If you want to handle comments etc that might contain the word port,
you'll need to detect those separately in a sub-rule, and skip a
sequence of comments and "stuff not containing 'port'" before parsing
your decl:

rule vhdl_file_wth_port
  non_port* port_decl non_port*
end

rule non_port
  comment / string / white / !port alphanumeric+
end

rule string
  "'" ( '\\' [befntr\\'] / !"'" . )* "'"
end

rule white
  [ \t\n\r]+
end

rule comment
  '/*' (!'*/' . )* '*/'
end

Clifford Heath.
 

Clifford Heath

Clifford said:
The keyword rule should only be called when you know you've just seen
something that can't be part of a word - never after an arbitrary
character.
rule vhdl_file_wth_port
  non_port* port_decl non_port*
end

rule non_port
  comment / string / white / !port alphanumeric+
end

Hmmm, it looks like I broke my own rule here, but I didn't.

There's no need to worry about non_port choosing the alphanumeric
alternative but then stopping partway through a word, just in time for
the word "port" inside it to be seen, because the + is greedy. If
"alphanumeric+" sees the "port" in "supportable", it'll walk right on by.
The only time the negative assertion is needed is at the start of an
alphanumeric string.

In other words, "port_decl" will only ever get called here either at
the start of the input, or after a comment, string, white or a
*non-alphanumeric* (because an alphanumeric would have been eaten).

Hope that clarifies. In this case greediness makes things easier :).

Clifford Heath.
 

Phrogz

That's what I'd do.


No. There's no EOF symbol. Perhaps there should be..

For the archives, the original PEG/packrat paper defines:
EndOfFile <- !.

That looks like it'll do just fine. It's interesting to note that it
defines this because technically the grammar can match an input string
without parsing all the text. In the paper, EndOfFile is used to
anchor the end of the root rule to ensure all the text is parsed.

In Treetop, this is obviated by the
Treetop::Runtime::CompiledParser#consume_all_input flag, which
defaults to true. Only by setting it to false do you get the
(dubiously useful) behavior of ignoring the rest of the input if
you've matched the root rule.
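
For example (a quick sketch reusing the T1 parser from earlier in the
thread):

parser = T1Parser.new
parser.consume_all_input = false
node = parser.parse( "Hi 123\n\ntrailing junk" )
# node spans only the text matched by the root rule; the trailing junk
# is silently ignored rather than causing the parse to fail.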
 

Clifford Heath

Phrogz said:
For the archives, the original PEG/packrat paper defines:
EndOfFile <- !

That would be a good thing to add, and I think the meta-grammar
would still parse.
In Treetop, this is obviated by the
Treetop::Runtime::CompiledParser#consume_all_input flag, which
defaults to true. Only by setting it to false do you get the
(dubiously useful) behavior of ignoring the rest of the input if
you've matched the root rule.

I use this in a loop to consume all the individual declarations in
a CQL file, because I need to act on each one in turn - I don't want
to process an entire file in one pass, with the possibility of a
syntax error backtracking through all the input and every parse
rule being memoized. But then, perhaps my "parse_all" method should
be added to Treetop, and then the flag wouldn't be as necessary.

I think Nathan would oppose it, but I'd also like to add regexes
as terminals for performance, so that a SyntaxNode isn't needed for
every character.

I've also suggested to him that certain rules could be designated as
"skip" rules, for which no SyntaxNode is built. Perhaps also that
a normal rule could identify another rule as a skip rule, which is
implicitly inserted between (but not around) each node of this rule.
This would allow such rules to implement the whitespace behaviour of
other parser generators, and perhaps also to build lists. Something
like:

rule statement skip whitespace
  'if' expression statement ( 'else' statement )?
  / etc...
end

where the whitespace rule is implicitly inserted, or

rule parameter_list skip comma_white
  item+
end

which would function as if I'd said

item (comma_white item)*

Clifford Heath.
 

Phrogz

That would be a good thing to add, and I think the meta-grammar
would still parse.

I suppose I'm suggesting that it isn't badly needed, since it's
trivial to add:
rule EOF
  !.
end
to any grammar where such a marker is useful. It probably would be
nice to have in the official language, in one form or another, though
I could see that starting to open the door to adding other
'convenience' rules built in and made available to every grammar.
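
For example, reusing the SimpleWiki sketch from my first post (illustration
only):

rule document
  ( heading / paragraph / table / newlines )* EOF
end

rule EOF
  !.
end
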
I think Nathan would oppose it, but I'd also like to add regex's
as terminals for performance, so that a SyntaxNode isn't needed for
every character.

I would be very, very much in favor of this. Not just for performance,
but for the simplicity of consuming a few nodes that I don't need
granularity on, where it would be easier to match using a regexp.

I read a snippet that made it sound like Perl6 combines regexps and
PEG in some way; haven't looked into it any further to find out,
though.
I've also suggested to him that certain rules could be designated as
"skip" rules, for which no SyntaxNode is built. Perhaps also that
a normal rule could identify another rule as a skip rule, which is
implicitly inserted between (but not around) each node of this rule.
This would allow such rules to implement the whitespace behaviour of
other parser generators, and perhaps also to build lists. Something
like:

rule statement skip whitespace
   'if' expression statement ( 'else' statement )?
    / etc...
end

where the whitespace rule is implicitly inserted, or

rule parameter_list skip comma_white
    item+
end

which would function as if I'd said

    item (comma_white item)*

Hrm, not wild about implicit insertion into rules. (But then I'm just
a bumpkin.) I would think instead you'd want some way to label each
part of a rule, or a rule as a whole:

rule some_name:no_node
  ...
end

rule some_name
  (item whitespace:no_node)+
end

Additionally, something that I've wanted a few times (like the email
parser) would be a parser command that skips ALL node creation,
determining only if the parse is possible. Something like:
@parser.parse( str, :nodes => :omit )
or
@parser.validate( str )

No idea how much memory or speed would be saved with this; my
assumption is "a good amount", simply given the proliferation of nodes
and modules and extending.
 

Phil Tomson

You're trying to skip any amount of stuff up to the port declaration,
then parse that, then skip to the end. Now first I'll ignore that your
"stuff" can presumably contain comments, which may contain the word
"port", but you do it like this:

rule vhdl_file_wth_port
  ( !'port' . )* port_decl .*
end

This says to parse any number of single characters as long as you aren't
looking at the word "port", then parse the port_decl and skip the rest.

OK, this approach seems to work (I'll need to try the additions you
outlined later as well, to prevent matching words which have "port" in
them).

Just to up the ante a bit:

In VHDL, entity declarations can have port declarations in them, as
can component declarations:

entity Foo is
  port( a,b : in bit; c : out bit);
end Foo;

--later

component CPU is
  port( clock : in bit;
        data_bus : inout bit_vector( 15 downto 0);
        address_bus : out bit_vector( 31 downto 0)
      );
end CPU;

There should only be 1 entity declaration in a file, but there could
be multiple component declarations (or none) in a file. I'd like to
extract the entity port declaration and component port declarations
(keeping a list of component ports).

Any suggestions? Again, there's a lot of syntax that can occur around
these things and I don't care about any of that. I just want to get
the connections between things.



rule string
  "'" ( '\\' [befntr\\'] / !"'" . )* "'"
end

rule white
  [ \t\n\r]+
end

BTW: I've noticed that \w doesn't seem to work as the universal
whitespace designator. I tried this comment rule:

rule comment
  '--' [0-9a-zA-Z\w]* [\n]
end

But it didn't work when I had a comment like:
-- this is a comment

But this one did:
--thisisacomment

Then I changed the rule to add a ' ':
rule comment
  '--' [0-9a-zA-Z ]* [\n]
end

And that worked.

Phil
 

Clifford Heath

Phrogz said:
I suppose I'm suggesting that it isn't badly needed, since it's
trivial to add:
rule EOF
!.
end

Oh, ok, I missed the . in your original. I thought that ! by itself
was a special token. I agree, !. should work and there's no need to
add it to the language.
I would be very, very in favor of this. Not just for performance, for
simplicity in consuming a few nodes that I don't need granularity on
and where it would be easier to match using a regexp.

Certainly... and you'd need to save the MatchData so you can grab
$1, $2, etc. I don't think it's even very hard...
I read a snippet that made it sound like Perl6 combines regexps and
PEG in some way; haven't looked into it any further to find out,
though.

Not sure about PEG, but you can do recursive parsing and embed arbitrary
code in Perl6 REs.
Hrm, not wild about implicit insertion into rules. (But then I'm just
a bumpkin.)

No, I agree it's ugly, but it is essentially what other parser
generators do by using a separate lexer. I just thought it would
be cool to have a skip rule rather than some implicit one. It'd
be a wart though.

However, my original proposal still works and went down well with
Nathan, which is to be able to declare a non-building rule using
the keyword skip:

skip white
  [ \t\r\n]
end

You still need to call that rule anywhere you want whitespace
skipped, but you don't get nodes for it.

As a further optimization, if you had

(thing white)*

then it could build an array of thing, instead of an array of
sequences, each sequence containing one thing. Not sure how
Treetop would know to do that though, as it only ever compiles
a single rule at a time.
Additionally, something that I've wanted a few times (like the email
parser) would be a parser command that skips ALL node creation,

Ok, but there needs to be something created to be able to memoize
the parse. I'm not sure how much you'd save.

BTW, for those who were hoping I might do all of this (or multi-language
support) any time soon, well, I have another project underway that's
taking all my time ATM :).

Clifford Heath.
 

Clifford Heath

Phil said:
Just to up the ante a bit:
In VHDL entity declarations can have port declarations in them as can
component declarations:

entity Foo is
  port( a,b : in bit; c : out bit);
end Foo;

--later

component CPU is
  port( clock : in bit;
        data_bus : inout bit_vector( 15 downto 0);
        address_bus : out bit_vector( 31 downto 0)
      );
end CPU;

There should only be 1 entity declaration in a file, but there could
be multiple component declarations (or none) in a file. I'd like to
extract the entity port declaration and component port declarations
(keeping a list of component ports).

Any suggestions?

Same sort of thing: scan any rubbish not looking like "entity", then,
continuing until you see the matching 'end', scan rubbish until you
see 'port'. Scan that, repeat until you see the end of the entity,
then do the same sort of thing for component.

Might be quicker to build most of the grammar if you have multiply-nested
things terminating in 'end', but something like this should work:

rule file
  (!'entity' .)* entity components
end

rule entity
  'entity' ports 'end'
end

rule ports
  ((!'port' .)* port)*
end

rule port
  'port' port_decl
end

rule components
  ((!'component' .)* component)*
end

.... etc.


Clifford Heath.
 

Phil Tomson

This is essentially the "C comment matching problem". Here's what I use
for C-style comments, C++ style comments, and whitespace:

rule s # Optional space
  S?
end

rule S # Mandatory space
  (white / comment_to_eol / comment_c_style)+
end

rule white
  [ \t\n\r]+
end

rule comment_to_eol
  '//' (!"\n" .)+
end

rule comment_c_style
  '/*' (!'*/' . )* '*/'
end

I gave this a try with the following grammar:

# foo_grammar.treetop
grammar Foo_grammar
  rule top_level
    ( comment_to_eol / comment_c_style ) {
      def is_comment?
        true
      end
    }
  end

  rule comment_to_eol
    '--' (!"\n" .)+
  end

  rule comment_c_style
    '/*' (!'*/' . )* '*/'
  end
end

And then tested it like so:

class FooParserTest < Test::Unit::TestCase
  include ParserTestHelper

  def setup
    puts "setup..."
    @parser = Foo_grammarParser.new
  end

  def test_eol_comment
    assert( @parser.parse("-- this is a comment. \n") )
  end

  def test_c_style_comment
    comment = @parser.parse("/* this is a comment. */")
    assert comment.is_comment?
  end
end

The first testcase (test_eol_comment) fails, but the second one
passes. Any idea what's wrong with that comment_to_eol rule?

Phil
 

Iñaki Baz Castillo

On Tuesday, 22 April 2008, Phil Tomson wrote:
The first testcase (test_eol_comment) fails, but the second one
passes. Any idea what's wrong with that comment_to_eol rule?

Are you sure? Both cases work for me.

--
Iñaki Baz Castillo
 

Phil Tomson

Iñaki Baz Castillo said:

Are you sure? Both cases work for me.

I found out on #treetop IRC that I need to change the comment_to_eol rule to:

rule comment_to_eol
  '--' (!"\n" .)+ "\n"
end

Then it'll match: "--this is a comment\n"
 

Iñaki Baz Castillo

On Tuesday, 22 April 2008, Phil Tomson wrote:
I found out on #treetop IRC that I need to change the comment_to_eol rule to:

rule comment_to_eol
  '--' (!"\n" .)+ "\n"
end

Then it'll match: "--this is a comment\n"

Yes, sorry: I tested it via a bash script with command-line parameters,
so writing "\n" doesn't produce a LF but a literal \n.

--
Iñaki Baz Castillo
 
