Good HTML Parser


Chris

Can anyone recommend a good HTML/XHTML parser, similar to
HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
know that certain tags, like <br>, are implicitly closed? I need to
iterate through the entire DOM, building up a DOM path, but the stdlib
parsers aren't calling handle_endtag() for any implicitly closed tags.
I looked at BeautifulSoup, but it only seems to work by first parsing
the entire document, then allowing you to query the document
afterwards. I need something like a SAX parser.
 

Diez B. Roggisch

Chris said:
Can anyone recommend a good HTML/XHTML parser, similar to
HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
know that certain tags, like <br>, are implicitly closed? I need to
iterate through the entire DOM, building up a DOM path, but the stdlib
parsers aren't calling handle_endtag() for any implicitly closed tags.
I looked at BeautifulSoup, but it only seems to work by first parsing
the entire document, then allowing you to query the document
afterwards. I need something like a SAX parser.

This isn't possible. Your own example of tags that may or may not be closed
needs context that a purely SAX-like parser can't provide.

I suggest you use BeautifulSoup and, if you must, build your own event
generation around it that you can attach consumers to.
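
A minimal sketch of what such an event layer could look like, assuming the
bs4 package with its "html.parser" builder. The names walk and EventRecorder
and the start/end/text callbacks are made up for illustration; they are not
part of BeautifulSoup. The document is still parsed completely first, and the
events are generated from the finished tree:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def walk(node, handler, path=()):
    """Depth-first walk over a parsed tree, firing SAX-style callbacks.

    `handler` is any object with start(), end() and text() methods
    (a hypothetical interface, not something bs4 defines).
    """
    for child in node.children:
        if isinstance(child, Tag):
            child_path = path + (child.name,)
            handler.start(child.name, dict(child.attrs), child_path)
            walk(child, handler, child_path)
            # Fired for every element, including void ones such as <br>.
            handler.end(child.name, child_path)
        elif type(child) is NavigableString and child.strip():
            # Plain, non-blank text only; comments, doctypes and other
            # special string types are skipped.
            handler.text(str(child), path)


class EventRecorder:
    """Example consumer that records each event with its DOM path."""

    def __init__(self):
        self.events = []

    def start(self, name, attrs, path):
        self.events.append(("start", "/".join(path)))

    def end(self, name, path):
        self.events.append(("end", "/".join(path)))

    def text(self, data, path):
        self.events.append(("text", "/".join(path)))


soup = BeautifulSoup("<p>one<br>two</p>", "html.parser")
recorder = EventRecorder()
walk(soup, recorder)
for event in recorder.events:
    print(event)
```

The walk is recursive, so a very deeply nested document could hit Python's
recursion limit; an explicit stack would avoid that.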

Diez
 

Stefan Behnel

Chris said:
Can anyone recommend a good HTML/XHTML parser, similar to
HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently
know that certain tags, like <br>, are implicitly closed? I need to
iterate through the entire DOM, building up a DOM path, but the stdlib
parsers aren't calling handle_endtag() for any implicitly closed tags.
I looked at BeautifulSoup, but it only seems to work by first parsing
the entire document, then allowing you to query the document
afterwards. I need something like a SAX parser.

Try lxml.html. It's very memory friendly and extremely fast, so you may end up
without any reason to use SAX anymore.

http://codespeak.net/lxml/
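
A rough sketch of how the DOM path could be built this way, assuming lxml is
installed. lxml.html.fromstring() and lxml.etree.iterwalk() are lxml calls;
the helper iter_paths is only an illustrative name. The tree is parsed first
and then walked with start/end events, so void tags like <br> get an end event
too:

```python
import lxml.html
from lxml import etree


def iter_paths(markup):
    """Yield (event, path) pairs for every element in the parsed markup."""
    root = lxml.html.fromstring(markup)
    path = []
    for event, element in etree.iterwalk(root, events=("start", "end")):
        if not isinstance(element.tag, str):
            # Comments and processing instructions have a non-string tag.
            continue
        if event == "start":
            path.append(element.tag)
            yield "start", "/".join(path)
        else:
            yield "end", "/".join(path)
            path.pop()


events = list(iter_paths('<div><p>one<br>two</p><img src="x.png"></div>'))
for event, path in events:
    print(event, path)
```

Text is not reported as a separate event here; it lives on each element's
.text and .tail attributes.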

Stefan
 
