HTML parser to DOM via SAX?


Rogan Dawes

Hi folks,

I am trying to build an "advanced" spider that can follow
JavaScript/DHTML links.

To do this, I'm looking for a way to parse an HTML page into a DOM
while still allowing JavaScript to execute where needed.

It seems that the best way to do this is to parse the HTML using a
SAX-like HTML parser (e.g. TagSoup), check each tag as it is processed
to see if it is script related, and if so, process the tag (e.g. source
the script from the provided URL, or evaluate the inline script) before
passing the SAX event to a class that builds the DOM.

I imagine the following structure:

Create a JavaScript interpreter (Rhino)
Create a new empty W3C Document, pass that to Rhino, so that script
calls that reference "document" will have something to work with.
Create a reader from the HTML source, and use that as input to the
TagSoup parser.

Register a SAX handler with the TagSoup parser that checks whether each
tag is script related. If there is anything script related that needs
doing (e.g. sourcing the script in the Rhino interpreter), it does that
first, and then passes the event on to a second SAX handler that creates
the corresponding DOM element.

When the parser has finished, identify any "onload" events (or other
events that should be executed post-load), and use the Rhino interpreter
to execute them.
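
To make the handler step concrete, here is a rough sketch of the kind of filter I have in mind. It is not tied to TagSoup or Rhino: it assumes the JDK's built-in XML parser as a stand-in for TagSoup (so the input must be well-formed), and it only records what it would hand to the interpreter. The class name ScriptSniffingFilter, the "pending" list and the demo() helper are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: notes script-related tags, then forwards every event unchanged. */
class ScriptSniffingFilter extends XMLFilterImpl {
    // Stand-in for calls into the JavaScript interpreter (assumption: Rhino would go here).
    final List<String> pending = new ArrayList<>();
    // Non-null only while we are inside an inline <script> element.
    private StringBuilder inline;

    ScriptSniffingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("script".equalsIgnoreCase(qName)) {
            String src = atts.getValue("src");
            if (src != null) {
                pending.add("src:" + src);      // would fetch and evaluate the script here
            } else {
                inline = new StringBuilder();   // start collecting the inline script body
            }
        }
        // Crude check: treat any attribute whose name starts with "on" as an event handler.
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            if (name.regionMatches(true, 0, "on", 0, 2)) {
                pending.add("event:" + name + "=" + atts.getValue(i)); // deferred until post-load
            }
        }
        super.startElement(uri, local, qName, atts); // pass the event to the next handler
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (inline != null) {
            inline.append(ch, start, length);
        }
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if ("script".equalsIgnoreCase(qName) && inline != null) {
            pending.add("inline:" + inline);    // would evaluate the inline script here
            inline = null;
        }
        super.endElement(uri, local, qName);
    }

    /** Parses well-formed markup and returns what the filter noticed, in document order. */
    static List<String> demo(String xhtml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ScriptSniffingFilter filter = new ScriptSniffingFilter(parser);
        filter.parse(new InputSource(new StringReader(xhtml)));
        return filter.pending;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo("<html><head><script src=\"a.js\"/></head>"
                + "<body onload=\"init()\"/></html>"));
    }
}
```

The DOM-building handler would be attached with filter.setContentHandler(...), so it sees exactly the same events after the script check has run.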

So, the missing bits are basically the logic that checks whether a tag
is script related, and the actual document builder that converts SAX
events into DOM elements.

Firstly, does this sound like a reasonable approach?

Secondly, does anyone know of any GPL-compatible implementations of a
SAX-to-DOM converter? Or is there some (hidden?) interface in JAXP that
supports this functionality?
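
One possible candidate is the identity TransformerHandler that JAXP's SAXTransformerFactory hands out: it is a SAX ContentHandler, and it can write into a DOMResult. A minimal sketch, again assuming the JDK's built-in XML parser as a stand-in for TagSoup and a made-up class name SaxToDom; whether it copes with everything TagSoup emits is untested:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Hypothetical helper: turns a stream of SAX events into a W3C DOM Document. */
class SaxToDom {
    static Document build(String xhtml) throws Exception {
        // The default JDK TransformerFactory is expected to be a SAXTransformerFactory.
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = tf.newTransformerHandler(); // identity transform
        DOMResult result = new DOMResult();                      // empty: a Document is created
        handler.setResult(result);

        // Stand-in for TagSoup: any XMLReader that emits namespace-aware SAX events.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));

        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document doc = build("<html><body><p>hi</p></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

In the structure described above, the same handler would sit behind the script-checking SAX handler rather than being attached to the parser directly.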

Thanks

Rogan

P.S. This will eventually make its way into WebScarab, a GPL web
application security analysis tool. More info at
http://www.owasp.org/software/webscarab.html
 
