Parsing imperfect HTML documents

Frank LaRosa

Hi,

What's the recommended way to parse an arbitrary HTML document which
may or may not conform to strict XML syntax requirements?

I tried using a DocumentBuilder, but immediately got the exception
message "Value must be quoted". The exception is thrown out of the
parse method so I have no way to ignore it.

Would a SAXParser be a better idea?

I don't have the option to fix the errors in the source documents.
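
The failure described above can be reproduced with a short sketch that uses only the JDK's built-in JAXP parsers. The class name and the sample markup are made up for illustration. It suggests that switching to a SAXParser would not help: an unquoted attribute value is a well-formedness error, which XML parsers must treat as fatal, so both parsers are expected to reject the input.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictParserDemo {
    // Hypothetical sample: the href value is unquoted, which browsers
    // accept in HTML but which is not well-formed XML.
    static final String HTML =
        "<html><body><a href=page.html>link</a></body></html>";

    static InputStream input() {
        return new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns true if the DOM parser rejects the sample input. */
    static boolean domRejects() throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(input());
            return false;
        } catch (SAXParseException e) {
            return true; // fatal well-formedness error, no document is produced
        }
    }

    /** Returns true if the SAX parser rejects the sample input. */
    static boolean saxRejects() throws Exception {
        try {
            // DefaultHandler rethrows fatal errors, so parse() aborts here too.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(input(), new DefaultHandler());
            return false;
        } catch (SAXParseException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DocumentBuilder rejects input: " + domRejects());
        System.out.println("SAXParser rejects input: " + saxRejects());
    }
}
```

Installing a custom ErrorHandler changes how the error is reported, but the parser is still not required to continue after a fatal error, so a lenient HTML-aware parser is the more practical route.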
 
Adam Maass

Frank LaRosa said:
What's the recommended way to parse an arbitrary HTML document which
may or may not conform to strict XML syntax requirements?

I tried using a DocumentBuilder, but immediately got the exception
message "Value must be quoted". The exception is thrown out of the
parse method so I have no way to ignore it.

Would a SAXParser be a better idea?

I don't have the option to fix the errors in the source documents.

JTidy deals nicely with parsing real-world HTML.

-- Adam Maass
 
Drew Volpe

Adam Maass said:
JTidy deals nicely with parsing real-world HTML.

I second the JTidy recommendation. I've used it a fair amount and
it's very good, easy to use, and will output XHTML. The only problem
I had with it was that it's very strict in what it outputs. If the input
HTML has no strict equivalent, the page can get mangled. This
comes up primarily with forms that are laid out using tables.
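
For reference, a minimal sketch of the JTidy approach might look like the following. It assumes the JTidy jar is on the classpath. The class name, helper method, and sample markup are hypothetical, and the exact cleanup JTidy applies can vary by version, so treat this as a starting point rather than a definitive recipe.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyDemo {
    /** Parses possibly malformed HTML into a DOM tree (hypothetical helper). */
    static Document parseLenient(InputStream in) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // ask for XHTML-style output
        tidy.setQuiet(true);          // suppress the summary banner
        tidy.setShowWarnings(false);  // do not print a warning per repair
        // Passing null as the output stream means: build the DOM only,
        // do not write the cleaned-up markup anywhere.
        return tidy.parseDOM(in, null);
    }

    public static void main(String[] args) {
        // Hypothetical sample: unquoted attribute value and unclosed <p>.
        String html = "<html><body><p>See <a href=page.html>this page</a></body></html>";
        InputStream in =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));

        Document doc = parseLenient(in);

        // Walk the repaired tree with the ordinary W3C DOM API.
        NodeList links = doc.getElementsByTagName("a");
        for (int i = 0; i < links.getLength(); i++) {
            Element a = (Element) links.item(i);
            System.out.println(a.getAttribute("href"));
        }
    }
}
```

Given the mangling issue mentioned above, it is probably worth comparing the repaired tree against the original page for table-based forms before relying on the output.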


dv

--
--------------------------------------------------------------------------
The geographical center of Boston is in Roxbury. Due north of the
center we find the South End. This is not to be confused with South
Boston which lies directly east from the South End. North of the South
End is East Boston and southwest of East Boston is the North End.

Drew Volpe, mylastname at hcs o harvard o edu
 
