Parsing imperfect HTML documents

Frank LaRosa · Jul 23, 2003

Hi,

What's the recommended way to parse an arbitrary HTML document which
may or may not conform to strict XML syntax requirements?

I tried using a DocumentBuilder, but immediately got the exception
message "Value must be quoted". The exception is thrown out of the
parse method so I have no way to ignore it.

Would a SAXParser be a better idea?

I don't have the option to fix the errors in the source documents.
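
For anyone who wants to reproduce the failure described above, here is a minimal sketch that uses only the standard JAXP classes. The class name `StrictParserDemo`, the helper names, and the sample markup are hypothetical and chosen for illustration. It feeds the same fragment to a DocumentBuilder and to a SAXParser, once with an unquoted attribute value and once with the value quoted. The exact wording of the error message depends on the parser implementation, so none is shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not taken from any post in this thread.
class StrictParserDemo {

    private static InputStream stream(String markup) {
        return new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns null if the DOM parser accepts the markup, else a description of the error. */
    static String domError(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores warnings,
            // which keeps the default handler from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(stream(markup));
            return null;
        } catch (SAXException e) {
            return e.toString();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Same check with a SAX parser; DefaultHandler throws on fatal errors. */
    static String saxError(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(stream(markup), new DefaultHandler());
            return null;
        } catch (SAXException e) {
            return e.toString();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sloppy = "<p><a href=x>link</a></p>";     // unquoted attribute value
        String strict = "<p><a href=\"x\">link</a></p>"; // well-formed equivalent

        System.out.println("DOM, sloppy: " + domError(sloppy));
        System.out.println("SAX, sloppy: " + saxError(sloppy));
        System.out.println("DOM, strict: " + domError(strict));
        System.out.println("SAX, strict: " + saxError(strict));
    }
}
```

Both parsers report a fatal error on the unquoted attribute, because an XML parser has to stop at the first well-formedness violation. Switching from DocumentBuilder to SAXParser therefore does not help by itself. The input has to be cleaned up first, or read with a parser built for HTML.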

Marco Schmidt · Jul 23, 2003

Frank LaRosa:

What's the recommended way to parse an arbitrary HTML document which
may or may not conform to strict XML syntax requirements?

HTMLParser is supposedly very stable when confronted with ugly
real-world HTML: <http://htmlparser.sourceforge.net/>.

[...]

Regards,
Marco

Adam Maass · Jul 23, 2003

Frank LaRosa said:
What's the recommended way to parse an arbitrary HTML document which
may or may not conform to strict XML syntax requirements?

I tried using a DocumentBuilder, but immediately got the exception
message "Value must be quoted". The exception is thrown out of the
parse method so I have no way to ignore it.

Would a SAXParser be a better idea?

I don't have the option to fix the errors in the source documents.

JTidy deals nicely with parsing real-world HTML.

-- Adam Maass

Drew Volpe · Jul 23, 2003

Adam Maass said:
JTidy deals nicely with parsing real-world HTML.

I second the JTidy recommendation. I've used it a fair amount and
it's very good, easy to use, and will output XHTML. The only problem
I had with it was that it's very strict in what it outputs. If the input
HTML has no strict equivalent, the page can get mangled. This
comes up primarily with forms that are laid out using tables.

dv

--
--------------------------------------------------------------------------
The geographical center of Boston is in Roxbury. Due north of the
center we find the South End. This is not to be confused with South
Boston which lies directly east from the South End. North of the South
End is East Boston and southwest of East Boston is the North End.

Drew Volpe, mylastname at hcs o harvard o edu
