HTMLParser fragility

Lawrence D'Oliveiro · Apr 5, 2006

I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

The way I'm currently working around this is to do a dummy pre-parsing
run with a dummy (non-subclassed) HTMLParser object. Every time I hit
HTMLParseError, I note the line number in a set of lines to skip, then
create a new HTMLParser object and restart the scan from the beginning,
skipping all the lines I've noted so far. Only when I get to the end
without further errors do I do the proper parse with all my appropriate
actions.
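
The restart loop described above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: it assumes Python 3, and `ParseError`, `ToyStrictParser` and `find_lines_to_skip` are hypothetical names. The toy parser stands in for a strict HTMLParser so that the sketch is self-contained.

```python
class ParseError(Exception):
    """Hypothetical stand-in for HTMLParseError; records a 1-based line number."""

    def __init__(self, message, lineno):
        super().__init__(message)
        self.lineno = lineno


class ToyStrictParser:
    """Hypothetical strict parser: rejects any line that contains '<<'."""

    def __init__(self):
        self.lineno = 0

    def feed(self, line):
        self.lineno += 1
        if "<<" in line:
            raise ParseError("malformed markup", self.lineno)

    def close(self):
        pass


def find_lines_to_skip(lines, make_parser):
    """Restart with a fresh parser after each error until a pass succeeds."""
    skip = set()
    while True:
        parser = make_parser()  # never reuse a parser that has raised
        try:
            for number, line in enumerate(lines, start=1):
                # Feed a bare newline for skipped lines so that the parser's
                # own line count stays aligned with the original document.
                parser.feed("\n" if number in skip else line)
            parser.close()
            return skip
        except ParseError as err:
            if err.lineno in skip:
                raise  # no progress is possible; give up rather than loop
            skip.add(err.lineno)


page = ["<p>ok</p>\n", "<<broken\n", "<p>fine</p>\n", "<<also broken\n"]
bad_lines = find_lines_to_skip(page, ToyStrictParser)
```

Once the set of bad lines is known, the real parse can feed the same lines to the subclassed parser, blanking out the noted ones in the same way.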

Rene Pijlman · Apr 5, 2006

Lawrence D'Oliveiro:

I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not.

There are two solutions to this:

1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/

Daniel Dittmar · Apr 5, 2006

Lawrence said:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

The way I'm currently working around this is to do a dummy pre-parsing
run with a dummy (non-subclassed) HTMLParser object. Every time I hit
HTMLParseError, I note the line number in a set of lines to skip, then
create a new HTMLParser object and restart the scan from the beginning,
skipping all the lines I've noted so far. Only when I get to the end
without further errors do I do the proper parse with all my appropriate
actions.

You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well-formed HTML.

Daniel

Richie Hindle · Apr 5, 2006

[Daniel]

You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
as a first step to get well-formed HTML.

But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:

from mx.Tidy import tidy
results = tidy("<html><body><pree>Hello world!</pre></body></html>")
print results[3]

line 1 column 7 - Warning: inserting missing 'title' element
line 1 column 13 - Error: <pree> is not recognized!
line 1 column 13 - Warning: discarding unexpected <pree>
line 1 column 31 - Warning: discarding unexpected </pre>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Is there a Python HTML tidier which will do as good a job as a browser?

Walter Dörwald · Apr 6, 2006

Rene said:
Lawrence D'Oliveiro:

There are two solutions to this:

1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/

You can also use the HTML parser from libxml2 or any of the available
wrappers for it.

Bye,
Walter Dörwald

Paul Boddie · Apr 6, 2006

Richie said:
But Tidy fails on huge numbers of real-world HTML pages. Simple things like
misspelled tags make it fail:

[Various error messages]

Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>

See how it fixes up the mismatching tags. The libxml2dom package is
available in the usual place:

http://www.python.org/pypi/libxml2dom

Paul

Lawrence D'Oliveiro · Apr 7, 2006

Rene Pijlman said:
2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/

That sounds like what I'm after!

Richie Hindle · Apr 7, 2006

[Richie]

But Tidy fails on huge numbers of real-world HTML pages. [...]
Is there a Python HTML tidier which will do as good a job as a browser?
[Walter]
You can also use the HTML parser from libxml2
[Paul]
libxml2 will attempt to parse HTML if asked to [...] See how it fixes
up the mismatching tags.

Great! Many thanks.

John J. Lee · Apr 10, 2006

Lawrence D'Oliveiro said:
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object then gets into a
confused state after that so you cannot continue using it.

[...]

sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
HTMLParser.HTMLParser.

BeautifulSoup derives from sgmllib.SGMLParser, and introduces extra
robustness, of a sort.

John
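
For readers who want to see what tolerant parsing looks like in practice, here is a minimal sketch. It assumes Python 3, where sgmllib is no longer available and the standard library's `html.parser.HTMLParser` reports malformed markup through its handlers instead of raising. The `TagLogger` class and `log_events` helper are hypothetical names, and the input is the misspelled-tag snippet quoted earlier in the thread.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records parser events instead of acting on them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def log_events(markup):
    """Parse markup and return the list of (kind, value) events seen."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


events = log_events("<html><body><pree>Hello world!</pre></body></html>")
for kind, value in events:
    print(kind, value)
```

Note that this parser only reports the mismatched `<pree>` and `</pre>` tags as it finds them. It does not repair the document the way Tidy, libxml2 or BeautifulSoup try to, so any fixing up is left to the handler code.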
