Parsers vs. Homemade "Parsing" via REs


Randy Kramer

I have the need to translate several megabytes of TWiki-marked-up text to
HTML. In fact, it may not even be a one-time thing--I'm planning to build a
wiki-like thing, and will probably keep TWiki markup as the native language
for storage. Over time, I may extend the TWiki markup syntax.

(Aside: I'm aware that Ruby has two other markup languages sometimes used for
wikis (Red Cloth (?) and Text <something>?)--I may someday support those as
well, but I'm not immediately interested in converting the megabytes of data
I have and learning a different markup language.)

My impression is that some or many wikis (this is my impression of TWiki)
don't use a "real parser" (like YACC or whatever), but instead simply search
and replace in the text using (many) Regular Expressions. Conceptually, that
seems an easy approach (less learning on my part, but probably tedious
creation of many REs, or borrowing from TWiki's Perl).

I've never used a parser, and am not really that familiar with them, but I'm
wondering what the tradeoffs might be. I've heard that a parser may be
easier to modify to extend the syntax. (But I have a feeling that the TWiki
markup language might not be "regular" enough to be parsed by something like
YACC.)

* If I did create the proper grammar rules, would parsing using something
like YACC be faster than a bunch of RE replacements?

* Any recommendations for a parser in Ruby? I think there are a couple,
I've been doing some Googling / reading and have come across references to
parse.rb and (iirc) something called Coco (??).

* Anybody know the approach followed by existing Ruby wikis like Instiki,
Ruwiki, etc.?

Other comments, hints, suggestions?

Randy Kramer
 

Austin Ziegler

* If I did create the proper grammar rules, would parsing using
something like YACC be faster than a bunch of RE replacements?
* Any recommendations for a parser in Ruby? I think there are a
couple, I've been doing some Googling / reading and have come
across references to parse.rb and (iirc) something called Coco
(??).

* Anybody know the approach followed by existing Ruby wikis like
Instiki, Ruwiki, etc.?

Ruwiki uses regular expressions to convert. There are portions of
this that I will probably be moving toward a more traditional parser
-- particularly those things that aren't converting well to XHTML
(I am attempting to ensure XHTML compliance in Ruwiki's generated
output).

As I understand it, RedCloth (which supports both Markdown and
Textile in version 3) does everything through regexen as well.

The biggest problem with regexen is not speed -- they're FAST. The
biggest problem is interaction of regexen. Ruwiki solves this, in
part, by adding priorities to tokens.
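
A minimal sketch of the priority idea (the token names, patterns, and
priorities here are invented for illustration, not Ruwiki's actual code):

  # Each token carries a priority; substitutions run in priority
  # order, so the interactions between rules stay predictable.
  Token = Struct.new(:priority, :pattern, :replacement)

  TOKENS = [
    Token.new(20, /_([^_]+)_/,   '<em>\1</em>'),          # italic
    Token.new(10, /\*([^*]+)\*/, '<strong>\1</strong>'),  # bold
  ]

  def to_html(text)
    TOKENS.sort_by(&:priority).inject(text) do |t, tok|
      t.gsub(tok.pattern, tok.replacement)
    end
  end

  puts to_html("*bold* and _italic_")
  # => <strong>bold</strong> and <em>italic</em>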

-austin
 

Nikolai Weibull

* Randy Kramer (Feb 27, 2005 16:00):
My impression is that some or many wikis (this is my impression of
TWiki) don't use a "real parser" (like YACC or whatever), but instead
simply search and replace in the text using (many) Regular
Expressions. Conceptually, that seems an easy approach (less learning
on my part, but probably tedious creation of many REs, or borrowing
from TWiki's Perl).

Yes, this is a rather horrendous misuse of regular expressions. It
works, but it's very brittle. As Robert Klemme said, the biggest
problem isn't speed (although it is certainly an issue to factor in),
but rather that the interaction between the regular expressions may be
non-obvious. With a proper grammar, you can be clear about how your
data is being parsed.

The place for regular expressions is not in describing grammars, but
in describing tokens.
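
A sketch of that division of labor, using Ruby's standard-library
StringScanner (the token set here is invented for illustration):

  require 'strscan'

  # The regexes describe individual tokens only; whoever consumes
  # the stream decides what the tokens mean.
  def each_token(text)
    s = StringScanner.new(text)
    until s.eos?
      if    s.scan(/_/)        then yield [:underscore, '_']
      elsif s.scan(%r{/})      then yield [:slash, '/']
      elsif s.scan(%r{[^_/]+}) then yield [:text, s.matched]
      end
    end
  end

  each_token("_/hi/, there_") { |type, value| p [type, value] }
  # [:underscore, "_"], [:slash, "/"], [:text, "hi"], [:slash, "/"],
  # [:text, ", there"], [:underscore, "_"]
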
... I have a feeling that the TWiki markup language might not be
"regular" enough to be parsed by something like YACC ...

Eh, "regular" is a weird word to use. Regular expressions map onto
something called regular languages or regular sets. Regular languages
are more constrained in what constructions may be used in the language
than for a context-free language. Context-free languages are what are
usually described by a parser-generator like YACC or RACC (the Ruby
equivalent of the heavily C-bound YACC).

Most markup languages (and perhaps most non-natural languages) are
context-free in nature. Thus, using a parser-generator like RACC fits
well with what you are trying to do.

I couldn't say what kind of language the TWiki markup is, though, as I
am not familiar with its syntax/grammar.

* If I did create the proper grammar rules, would parsing using
something like YACC be faster than a bunch of RE replacements?

Yes. For any non-trivial problem, this would be the case. Note,
however, that you still need to feed tokens to [YR]ACC, and this can be
slow if done wrong. Also, it may be appropriate to write your own
parser from scratch, if the grammar is simple enough. A
recursive-descent parser is easier to implement and understand than
a state-machine one (like those created by [YR]ACC). Of course, the
fact that they create the state machine for you is good, but it may
still be troublesome to understand how the input interacts with the
grammar rules. That interaction is obvious in a recursive-descent
parser.

* Any recommendations for a parser in Ruby? I think there are a
couple, I've been doing some Googling / reading and have come
across references to parse.rb and (iirc) something called Coco
(??).

Check out RACC.
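
To give a taste of what that looks like, here is a fragment of a RACC
grammar file (a made-up miniature, not a real TWiki grammar; the racc
command turns the .y file into a plain Ruby parser class):

  class TinyMarkup
  rule
    doc   :                            { result = [] }
          | doc chunk                  { result = val[0] << val[1] }
    chunk : TEXT                       { result = [:text, val[0]] }
          | BOLD_OPEN doc BOLD_CLOSE   { result = [:bold, val[1]] }
  end

RACC builds the LALR state machine from this; you still write a
next_token method yourself to feed it [type, value] pairs, which is
where regular expressions do their proper job of describing tokens
(the tokenizer, not the grammar, decides whether a given delimiter
opens or closes a span).
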
nikolai
 

why the lucky stiff

Randy said:
* If I did create the proper grammar rules, would parsing using something
like YACC be faster than a bunch of RE replacements?

* Any recommendations for a parser in Ruby? I think there are a couple,
I've been doing some Googling / reading and have come across references to
parse.rb and (iirc) something called Coco (??).

I've written probably twenty different parsing libraries -- including
six complete revisions of the YAML parser -- and I personally favor
writing the parser in C (with Bison and Re2c). I find the two as easy
to use as any Ruby equivalents and re2c's lexer lets me control the
buffer better -- in case I need to turn a token into multiple events or
backtrack.

I hope no one is using RedCloth's parser as an example. It's awful.
It's like when Ron Popeil injects garlic into a turkey using that freaky
flavor syringe and you're like, "Grrrosss, he killed the turkey,
polished it up and forced it to be a junkie." RedCloth is all the
regexps from textile.php and markdown.pl forced into a hot, enclosed area --

It's a breeding ground for moth and ringworm. I have two evil Bad Luke
hands afterwards, which first try to strangle me and, failing that, then
go after people on the street with Ron Popeil's patented rib cage shears.

_why
 

Thomas

(Aside: I'm aware that Ruby has two other markup languages sometimes used for
wikis (Red Cloth (?) and Text <something>?)--I may someday support those as
well, but I'm not immediately interested in converting the megabytes of data
I have and learning a different markup language.)

My impression is that some or many wikis (this is my impression of TWiki)
don't use a "real parser" (like YACC or whatever), but instead simply search
and replace in the text using (many) Regular Expressions. Conceptually, that
seems an easy approach (less learning on my part, but probably tedious
creation of many REs, or borrowing from TWiki's Perl).

<plug>deplate (http://deplate.sf.net)</plug> uses a syntax that has some
similarities with TWiki. It uses REs for tokenizing the text and
turning the input string into arrays of objects. Basically, deplate is
meant to be adaptable to different wiki-like markups, but I haven't had
the time yet to get there.

Cheers,
Thomas.
 

Rob .

My implementation of an "auto-indent and insert end" feature in a
jEdit Ruby Plugin[1] uses regular expressions to determine if an end
is required. I also use regexes to identify classes and methods for a
"file structure popup" feature.

In my experience it was easy to implement with regexes early on, but
the maintenance and enhancement have proved troublesome as I've come
across more obscure syntax situations. I'm now investigating using a
version of JRuby[2] that has been modified by the Eclipse Ruby
Development Tool[3] team.

Rob

[1]http://community.jedit.org/cgi-bin/TWiki/view/Main/RubyPlugin
[2]http://jruby.sourceforge.net/
[3]http://rubyeclipse.sourceforge.net/
 

Mark Probert

Hi ..

I have the need to translate several megabytes of TWiki-marked-up text to
HTML. ...

* If I did create the proper grammar rules, would parsing using
something like YACC be faster than a bunch of RE replacements?

I think that your mileage may vary, depending on the nature of the text markup
and how complicated you want the HTML. If it is something like tag
replacement, then regexp is probably the way to go. It will be, on the
whole, faster to implement for such a "one-off" job.
* Any recommendations for a parser in Ruby? I think there are a couple,
I've been doing some Googling / reading and have come across references to
parse.rb and (iirc) something called Coco (??).

I am responsible for one of the Coco/R implementations (coco/rb). The
"something" is what's called an attributed LL(1) parser-generator, as
opposed to YACC, which is LALR. These are just different ways that the
parser attacks decomposition of the target file. You can find out more at

http://www.scifac.ru.ac.za/compilers/

There are some basic tradeoffs. Coco is very fast, and the grammar and scanner
are all in one place (YACC uses LEX as a tokeniser, which is a separate
program and input file). YACC is somewhat more flexible in the grammars it can
accept. For example, it is not possible to write a Ruby parser in Coco/R,
though you can do it in YACC. However, for 99% of little languages, that is
not such a big problem; just avoid constructs like 'goto'.
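
One classic instance of that difference is left recursion: an LALR
generator takes a left-recursive rule as-is, while LL(1) (and any
hand-written recursive-descent parser) needs it rewritten as
iteration. A sketch in Ruby (an invented example, not Coco/R input):

  require 'strscan'

  # Grammar rule, left-recursive:   list : list ',' item | item
  # LALR (YACC/RACC) accepts that directly. Translated naively into
  # a descent parser, it recurses before consuming any input:
  #
  #   def list
  #     list        # infinite recursion
  #     ...
  #   end
  #
  # The LL(1) rewrite turns the recursion into iteration:
  #   list : item { ',' item }
  def parse_list(s)
    items = [s.scan(/\w+/)]                    # item
    items << s.scan(/\w+/) while s.scan(/,/)   # { ',' item }
    items
  end

  p parse_list(StringScanner.new("a,b,c"))     # => ["a", "b", "c"]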

Regards,
 

Aredridel

My impression is that some or many wikis (this is my impression of TWiki)
don't use a "real parser" (like YACC or whatever), but instead simply search
and replace in the text using (many) Regular Expressions. Conceptually, that
seems an easy approach (less learning on my part, but probably tedious
creation of many REs, or borrowing from TWiki's Perl).

I've written a wiki, though in PHP, not Ruby (yet; I'm porting).

The parser is basically recursive-descent, hand-coded, with an
insanely complex regex tokenizer.

Parsing human text with a strict parser (like racc/yacc) is Really
Hard -- the LALR(1) algorithm has too many limits to do it justice.
Even fitting it into a GLR-shaped box is hard, though I'm trying to in
my rewrite.

Your best bet is probably to use a hand-coded recursive-descent
parser, where you drag in big blobs of text and divvy them up into
successively smaller pieces, calling the parser for each recursion,
with fewer and fewer options remaining. The parsing might go like so:

With the input paragraph:

_/hi/, there_

and the markup objects "Bold", "Italic", and "Plain"

your parser might match a single token: "bold text"

Then you strip the _ _, and you have

/hi/, there

Your parser would recognize two tokens: "italic text" (hi), and plain
(", there")

It'd strip each: plain(hi), and then nothing further would match.

Your structure in memory would look like this:

Bold(Italic("hi"), ", there")

Which is then pretty trivial to mark up as HTML.
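
A hand-coded recursive-descent sketch along those lines (invented
here, not Aredridel's actual code; it knows only the Bold, Italic,
and Plain markup above):

  require 'strscan'

  # Each call handles one nesting level: strip a delimiter, recurse
  # on what's inside, with fewer options remaining as we descend.
  def parse(s, stop = nil)
    out = []
    until s.eos? || (stop && s.check(stop))
      if s.scan(/_/)
        out << [:bold, parse(s, /_/)]
        s.scan(/_/)                      # consume the closing _
      elsif s.scan(%r{/})
        out << [:italic, parse(s, %r{/})]
        s.scan(%r{/})                    # consume the closing /
      else
        out << [:plain, s.scan(%r{[^_/]+})]
      end
    end
    out
  end

  def to_html(nodes)
    nodes.map do |type, body|
      case type
      when :bold   then "<strong>#{to_html(body)}</strong>"
      when :italic then "<em>#{to_html(body)}</em>"
      else body                          # :plain carries a string
      end
    end.join
  end

  p to_html(parse(StringScanner.new("_/hi/, there_")))
  # => "<strong><em>hi</em>, there</strong>"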

The funky cases come in things not easily tokenized: lists,
particularly, are a pain. My current parser does ugly things like
guessing whether something is list-like and handing the whole blob to a
listifier routine, which then throws out any plaintext bits for
rendering. Ugly, but it works.

A GLR parser, on the other hand, instead of breaking things down,
builds them up -- with rules like "_", anything, "_" is bold, you'd
break your stream down into tokens like so:

Literal(_) Literal(/) Text(hi) Literal(/) Text(, there) Literal(_)

And then with the rules:

  "_", anything, "_"   -> bold
  "/", anything, "/"   -> italic
  italic*|bold*|plain* -> anything
  else                 -> plain

The parser could then reduce that to the same structure as above. Problem
is, that takes many tokens of look-ahead, so the parser could get
slow.

An LALR(1) parser is just like above, except that the rules can only
use one token of look-ahead to figure out what to do. Obviously, for
the enormous variety in human text, that's a huge limit.

Hope that makes some sense.

Ari

P.S. Why not search and replace? Assume "/" means italic and "-" means
bold. Then parse this:

http://www.amazon.com/exec/obidos/t...002-2563138-3700849/-/foo/002-2563138-3700849

And try not to get <a href='....<em> <strong>....' ...
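
That failure is easy to reproduce in a couple of lines (hypothetical
rule: anything between slashes is italic):

  text = "see http://example.com/exec/obidos/foo/bar for details"

  # Naive search-and-replace, with no notion of context:
  puts text.gsub(%r{/(.+?)/}, '<em>\1</em>')
  # => see http:<em>/example.com</em>exec<em>obidos</em>foo/bar for details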
 

Randy Kramer

Thanks to everyone who responded on this thread, and especially to
Aredridel--the examples here helped me understand at least two varieties of
parsers considerably better than I did.

There is a followup for Aredridel below.

I think I'm heading in the direction of writing something that might be called
a parser (or might not be). My basic aim is to minimize the number of passes
through the document. (Simple RE substitution requires a pass through the
document for each RE (IIUC).) I think I can do it with one or two passes,
with a rather elaborate case-type statement (or many ifs) on examination of
each "token" (word).

I've written some other questions to the list. I'm going through TWiki and
trying to develop a list of all the TWiki markup--there is a lot I don't use
(so far) and I may initially deal only with the more common stuff.

(In the back of my mind I am still reserving some last resort options--I can
always install an instance of TWiki locally, and "rig" a way to feed markup
to it and recover the HTML.)


The funky cases come in things not easily tokenized: lists,
particularly, are a pain. My current parser does ugly things like
guess whether something is list-like, hands the whole blob to a
listifier routine, which then throws out any plaintext bits for
rendering. Ugly, but works.

How are your lists marked up? (Or are you trying to recognize them without a
specific markup, more of a natural language type thing?)

thanks again to all,
Randy Kramer
 
