Phrogz
I've been looking for something like Treetop for a while now. Very
excited to have found it, and to play with it.
I'm rather new to PEGs, having only experienced them briefly in Lua.
Following are some questions as I try to adapt my thinking to their
ways.
Let's assume that I'm trying to parse the following (wiki) markup:
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
= Welcome! =
Hello world! This is a **single paragraph**
that **wraps over //two//** physical lines.
This is a second paragraph. All on one line.
== A Table Example ==
|| **Head1** || **Head2** ||
|| row1col1 || row1col2 ||
|| r2c1 || //r2c2// ||
This is the last paragraph. There is no newline
after this final period.
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
I'm trying to convert it to this (HTML):
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
<h1>Welcome!</h1>
<p>Hello world! This is a <strong>single paragraph</strong>
that <strong>wraps over <em>two</em></strong> physical lines.</p>
<p>This is a second paragraph. All on one line.</p>
<h2>A Table Example</h2>
<table>
<tr>
<td><strong>Head1</strong></td>
<td><strong>Head2</strong></td>
</tr>
<tr>
<td>row1col1</td>
<td>row1col2</td>
</tr>
<tr>
<td>r2c1</td>
<td><em>r2c2</em></td>
</tr>
</table>
<p>This is the last paragraph. There is no newline
after this final period.</p>
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
Three questions (in particular, of many) jump out at me:
LINE ANCHORING
In the above, there is 'block' content and 'inline' content. Headers,
paragraphs, and tables are block-level items. Among other things, this
means that their markup is only valid at the start of a line. For
example, a line that started " = Hi =" would not constitute a valid
header. In Regexp land, this would be handled simply with a ^ anchor.
How do you handle this in Treetop? Do you simply ensure that the root
rule of the document contains only the block-level rules that consume
up to and including one or more newlines?
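  grammar SimpleWiki
    rule document
      # Repeat, so the whole file is consumed block by block.
      ( heading / paragraph / table / newlines )*
    end

    # `stuff` is a placeholder for each block's real content.
    rule heading
      stuff "\n"
    end

    rule paragraph
      stuff "\n\n"
    end

    rule table
      stuff "\n"
    end

    rule newlines
      "\n"+
    end
  end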
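To make the idea concrete outside of Treetop, here is a minimal sketch in plain Ruby using the standard library's StringScanner, whose `scan` only matches at the current position (much like a PEG rule). The helper name `scan_blocks` and the block patterns are hypothetical. It assumes that a paragraph runs until a blank line and that the input ends with a newline. Because every branch consumes its block through the trailing newline, the next attempt always begins at the start of a line, so no ^ anchor is needed.

```ruby
require "strscan"

# Hypothetical helper: splits wiki text into [type, text] pairs.
# Every branch consumes through the newline that ends its block, so each
# new attempt starts at the beginning of a line.
def scan_blocks(source)
  scanner = StringScanner.new(source)
  blocks = []
  until scanner.eos?
    if scanner.skip(/\n+/)
      # Blank lines between blocks carry no content.
    elsif scanner.scan(/(=+) (.+) \1\n/)
      blocks << [:heading, scanner[2]]
    elsif scanner.scan(/(?:\|\|.*\n)+/)
      blocks << [:table, scanner.matched]
    elsif scanner.scan(/(?:.+\n)+/)
      blocks << [:paragraph, scanner.matched.chomp]
    else
      raise ArgumentError, "no block rule matches at offset #{scanner.pos}"
    end
  end
  blocks
end
```

A line such as " = Hi =" fails the heading branch, because the scanner is sitting on the leading space, and falls through to the paragraph branch instead.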
HANDLING EOF
As seen in the example above, a paragraph (or any block-level element,
really) is allowed to not have a newline if it's the last thing in the
file. Do you handle this case normally by just preprocessing the input
and shoving a newline on the end if it doesn't exist, or is there a
way in Treetop to recognize the /\Z/ anchor from a Regexp?
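For comparison, both options can be sketched in plain Ruby (not Treetop). The helper name `paragraphs` is hypothetical. Note that in Ruby `\z` matches only at the very end of the string, while `\Z` also matches just before a final newline. PEG grammars usually spell end of input as a negative lookahead on any character (`!.`); whether that is the idiomatic Treetop spelling is an assumption here, not something confirmed above.

```ruby
# Hypothetical helper: returns the paragraphs of a text, where each line
# of a paragraph ends with either a newline or the end of the input.
def paragraphs(source)
  # Option 1 (preprocessing) would be:
  #   source += "\n" unless source.end_with?("\n")
  # Option 2 (used here): the line terminator itself accepts end of input.
  source.scan(/(?:.+(?:\n|\z))+/).map(&:chomp)
end
```

With the second option the input is left untouched, and a final paragraph without a trailing newline is still recognized.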
BACKREFERENCES
A valid heading must match this regexp: /^(=+) (.+) \1$/
It must start at the front of the line with one or more =
characters.
It can have anything (including some = characters).
It ends with the same number of = characters, which must be followed
by a newline.
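As a reference point, here is the regexp version as a small Ruby sketch. The names `HEADING` and `heading_to_html` are hypothetical. The opening run is captured with `(=+)` so that `\1` can refer to it, and its length gives the heading level. Inline markup and HTML escaping are deliberately ignored.

```ruby
# The opening run of "=" is group 1; the closing run must repeat it exactly.
HEADING = /^(=+) (.+) \1$/

# Hypothetical helper: returns an HTML heading, or nil if the line is not
# a valid heading.
def heading_to_html(line)
  match = HEADING.match(line)
  return nil unless match

  level = match[1].length
  "<h#{level}>#{match[2]}</h#{level}>"
end
```

Since `(.+)` may itself contain = characters, a line like "= a = b =" is still a level-1 heading whose text is "a = b".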
I know one of the sweet things about PEGs over Regexps is their
ability to match grammars with nested rules. I can't figure out how to
use this to my advantage in the above. If I adapt the nested parens
example from the Treetop documentation...
  # Assume we're already at the start of a line.
  rule heading
    '=' heading '='
    /
    ' ' string_of_words ' '
  end
...then I fail to account for the necessary newline that follows the
last equals sign. But I can't figure out how to change this to use a
newline without messing up the recursive parsing.
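One possible shape, sketched here in plain Ruby rather than Treetop, is to keep the recursion in an inner rule and let a separate, non-recursive outer rule deal with the line ending. The names `heading` and `heading_core` are hypothetical, and the sketch assumes it is handed exactly one line.

```ruby
# PEG shape being imitated (rule names are hypothetical):
#   heading      <- heading_core ("\n" / end of input)
#   heading_core <- '=' heading_core '=' / ' ' text ' '

# Recursive part: peels one "=" off each end per level of nesting.
def heading_core(body)
  if body.length >= 2 && body.start_with?("=") && body.end_with?("=")
    inner = heading_core(body[1..-2])
    inner && { level: inner[:level] + 1, text: inner[:text] }
  elsif body.length >= 3 && body.start_with?(" ") && body.end_with?(" ")
    { level: 0, text: body[1..-2] }
  end
end

# Non-recursive wrapper: the only place that knows about the newline.
def heading(line)
  heading_core(line.chomp)
end
```

Because the newline is handled once in the wrapper, the recursive rule stays symmetric around the heading text.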
Other questions I won't dive into: what's a reasonable way to eat
inline content (words) while allowing inline markup? Am I necessarily
going to end up with a tree for a paragraph that has one child for
each word (or letter)? Can I consume and throw away the newline in the
middle of a paragraph (as part of a string_of_words) without messing
up the end delimiters for the paragraph? (I think so.) Will lookaheads
suffice to wrap <table>...</table> around all the rows, and
<tr>...</tr> around all the cells in a row? (I think so.)
Any help or ideas appreciated.