HTMLParser fragility

  • Thread starter Lawrence D'Oliveiro
  • Start date
L

Lawrence D'Oliveiro

I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

The way I'm currently working around this is to do a dummy pre-parsing
run with a dummy (non-subclassed) HTMLParser object. Every time I hit
HTMLParseError, I note the line number in a set of lines to skip, then
create a new HTMLParser object and restart the scan from the beginning,
skipping all the lines I've noted so far. Only when I get to the end
without further errors do I do the proper parse with all my appropriate
actions.
 
R

Rene Pijlman

D

Daniel Dittmar

Lawrence said:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

The way I'm currently working around this is to do a dummy pre-parsing
run with a dummy (non-subclassed) HTMLParser object. Every time I hit
HTMLParseError, I note the line number in a set of lines to skip, then
create a new HTMLParser object and restart the scan from the beginning,
skipping all the lines I've noted so far. Only when I get to the end
without further errors do I do the proper parse with all my appropriate
actions.

You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well formed HTML.

Daniel
 
R

Richie Hindle

[Daniel]
You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well formed HTML.

But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:
from mx.Tidy import tidy
results = tidy("<html><body><pree>Hello world!</pre></body></html>")
print results[3]
line 1 column 7 - Warning: inserting missing 'title' element
line 1 column 13 - Error: <pree> is not recognized!
line 1 column 13 - Warning: discarding unexpected <pree>
line 1 column 31 - Warning: discarding unexpected </pre>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Is there a Python HTML tidier which will do as good a job as a browser?
 
P

Paul Boddie

Richie said:
But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:

[Various error messages]
Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>

See how it fixes up the mismatching tags. The libxml2dom package is
available in the usual place:

http://www.python.org/pypi/libxml2dom

Paul
 
R

Richie Hindle

[Richie]
But Tidy fails on huge numbers of real-world HTML pages. [...]
Is there a Python HTML tidier which will do as good a job as a browser?
[Walter]
You can also use the HTML parser from libxml2
[Paul]
libxml2 will attempt to parse HTML if asked to [...] See how it fixes
up the mismatching tags.

Great! Many thanks.
 
J

John J. Lee

Lawrence D'Oliveiro said:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.
[...]

sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
HTMLParser.HTMLParser.

BeautifulSoup derives from sgmllib.SGMLParser, and introduces extra
robustness, of a sort.


John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,774
Messages
2,569,599
Members
45,177
Latest member
OrderGlucea
Top