Simple parsing of sloppy HTML - LittleLexer

John Carter

There have been a couple of threads on parsing HTML.

If you're looking for something excruciatingly simple (and simplistic) that
will parse the sloppiest of HTML pages, you could do worse than look at
LittleLexer.

http://littlelexer.rubyforge.org/

At a mere 44 non-blank, non-comment lines of Ruby, it has the
virtue of a certain elegant simplicity.

It includes a rudimentary HTML parser as an example.


John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : (e-mail address removed)
New Zealand

Note to all marketers. If you want to sell things to me, buy Google words.

I refuse on principle to buy anything sold by spam or popup and I
never follow any links found in a spam. I do however use Google and
will often follow the neat non-irritating Google Word ads that are of
interest to me.
 
gabriele renzi

On Wed, 11 Feb 2004 10:59:41 +0900, John Carter wrote:
http://littlelexer.rubyforge.org/

At a mere 44 lines of non-blank non-comment lines of Ruby, it has the
virtue of a certain elegant simplicity.

It includes a rudimentary HTML parser as an example.

I wonder if you're going to add some more documentation. It seems that
LittleLexer is somewhat interesting, but I could not understand it the
last time I looked at it (note that I'm stupid).
 
John Carter

Cute, but it's not a parser, it's a composite lexer.
A regular expression engine cannot *parse* context-free grammars (like
most programming languages), even if you can layer them like this.


Yes and no.

It's a while since I did the theory courses, so forgive me if my
details as to whether something is LR or LL are wrong....

Note 1: If I remember correctly, even a full "parser generator" like yacc/Bison
cannot recognize an arbitrary CFG. They only recognize a limited subset
such as LALR(1).

Thus LittleLexer is a parser in the same sense that yacc/Bison are,
but the subset of CFGs it can recognize is more limited.

Note 2: The fundamental limitation on regexes as parsers is that they
don't cope with nested structures.
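A two-line Ruby sketch illustrates the limitation (this is an ordinary Regexp, not part of LittleLexer):

```ruby
# A single, non-recursive regex only sees one level of bracketing:
flat = /\([^()]*\)/
p '(+ (- 30 20) 10)'[flat]  # matches only the innermost group, "(- 30 20)"
```

The regex has no way to count opening and closing parentheses, so it can never match the outer expression as a whole.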

So as a mental exercise I was trying to visualize how I would get
LittleLexer to chew on that simplest and best poster child example of
a "nested" language, s-expressions. (Lisp like expressions)

The answer in that case is quite simple: have a lexer pass, no surprise,
and then have a pass that recognizes bracketed regions containing no
nested brackets.
%r{\([^\(\)]*\)} maps to 'f'

eg.

(+ (- 30 20) 10)

Feed it to the lexer and it outputs...
'(+(-dd)d)', ['(','+','(','-','30','20',')','10',')']

Feed it to the parser and it produces....

'(+fd)', ['(','+','(-dd)','d',')']

Feed it back into exactly the same little lexer instance and you have...
'f', ['(+fd)']

Now there is an option on the LittleLexer scan method that allows you
to pass in the resulting token list from the previous pass, so what you
get is an AST (Abstract Syntax Tree) of the s-expression. Voila!
LittleLexer parsed a language that is not regular.
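The reduction above can be sketched in plain Ruby (this is an illustration of the idea, not LittleLexer's actual API):

```ruby
# Repeatedly collapse innermost bracketed groups in the lexer's
# summary string, exactly as in the passes shown above.
summary = '(+(-dd)d)'     # output of the lexer pass
pattern = /\([^()]*\)/    # a bracketed region with no nested brackets

passes = [summary]
while summary =~ pattern
  summary = summary.gsub(pattern, 'f')
  passes << summary
end

p passes  # => ["(+(-dd)d)", "(+fd)", "f"]
```

Each iteration is one "pass": the first rewrites '(-dd)' to 'f', the second collapses the now-flat outer expression, and the loop stops at the fixed point 'f'.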

The LispParser as a proof of concept is easy to do. What I want to do
for version 2 of LittleLexer is generalize the notion of handling
nested structures like that. But since the whole thing is a "late
night, can't sleep" project, I'm still mulling over what I mean by
"nested structure".

For example, consider Ruby if/else expressions...

if a
  dostuff
  if b
    do_more_stuff
  else
    dothing
  end
end

Would map to something like....

if         => 'i'
statement  => 's'
else       => 'l'
end        => 'e'
expression => 'x'

So that would be...
'ixsixslsee'

So we could have a regex that recognized "atomic" if/else statements...
[/ixs+(ls)?e/, 's']
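The same fixed-point reduction works here too (again a hypothetical sketch, not LittleLexer's API): keep rewriting atomic if/else expressions to 's' until nothing changes.

```ruby
# Reduce nested if/else summaries by repeatedly collapsing "atomic"
# if-expressions (ones containing no nested if).
summary   = 'ixsixslsee'
atomic_if = /ixs+(ls)?e/

loop do
  reduced = summary.gsub(atomic_if, 's')
  break if reduced == summary
  summary = reduced
end

p summary  # => "s" -- the whole construct reduces to one statement
```

The first pass collapses the inner 'ixslse' to 's', giving 'ixsse'; the second collapses the outer if, leaving a single statement.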

For example, an LL(1) parser must be able to decide what path it is
going down with one token of lookahead. In a sense this iterative
approach I'm using is a bit better off: I can use the entire string to
decide whether I have an atomic item or a nested item.

So why don't Bison and the like do this? Because it is slow; the
whole art and science of compiler-compilers evolved in the days of very
slow computers trying to do far more than they had power for.

On the other hand, it is not too slow, as Ruby's regex engine is really
very good and I'm only manipulating strings and pointers that are
typically an order of magnitude smaller than the original file.






John Carter

Sure you can wrap them up in loops and other things, and that's
what parsers do when they wrap themselves around a lexer, but
you must have a layer that is not a regular expression, or it
doesn't work. This has been mathematically proven from the
definition of RE, so if you find a way around it, you don't
have an RE.

Correct, and the fact that I can and do parse Lisp via this trick
demonstrates that it recognizes a larger class of languages than a
single RE can.



Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top