Lawrence D'Oliveiro
I've been using HTMLParser to scrape Web sites. The trouble with this
is, there's a lot of malformed HTML out there. Real browsers have to be
written to cope gracefully with this, but HTMLParser does not. Not only
does it raise an exception, but the parser object is then left in a
confused state, so you cannot continue using it.
The way I'm currently working around this is to do a preliminary parsing
run with a dummy (non-subclassed) HTMLParser object. Every time I hit
HTMLParseError, I note the line number in a set of lines to skip, then
create a new HTMLParser object and restart the scan from the beginning,
skipping all the lines I've noted so far. Only when I get to the end
without further errors do I do the proper parse with all my appropriate
actions.
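
A minimal sketch of that retry loop, under some loud assumptions:
HTMLParseError is not available in current Python releases (it was
removed in 3.5), so StrictParseError and StrictParser below are
hypothetical stand-ins that only mimic a parser which raises an error
carrying a line number. The helper names feed_skipping and
find_bad_lines are likewise made up for illustration. Skipped lines are
replaced by a bare newline so that the reported line numbers keep
matching the original text.

```python
from html.parser import HTMLParser


class StrictParseError(Exception):
    """Hypothetical stand-in for HTMLParseError; records the failing line."""

    def __init__(self, msg, lineno):
        super().__init__(msg)
        self.lineno = lineno


class StrictParser(HTMLParser):
    """Hypothetical strict parser: rejects start tags with duplicate attributes."""

    def handle_starttag(self, tag, attrs):
        names = [name for name, _ in attrs]
        if len(names) != len(set(names)):
            raise StrictParseError("duplicate attribute in <%s>" % tag,
                                   self.getpos()[0])


def feed_skipping(parser, lines, skip):
    """Feed lines (each ending in a newline) to parser, substituting a bare
    newline for every skipped line so line numbers stay aligned."""
    for lineno, line in enumerate(lines, start=1):
        parser.feed("\n" if lineno in skip else line)
    parser.close()


def find_bad_lines(lines, make_parser, error_type):
    """Return the set of 1-based line numbers to skip for an error-free parse."""
    skip = set()
    while True:
        parser = make_parser()  # fresh object: the old one is unusable after an error
        try:
            feed_skipping(parser, lines, skip)
        except error_type as err:
            if err.lineno in skip:
                raise  # same line failed again: give up rather than loop forever
            skip.add(err.lineno)
        else:
            return skip


page = [
    '<p>ok</p>\n',
    '<img src="a" src="b">\n',
    '<p>fine</p>\n',
    '<a href="x" href="y">z</a>\n',
]
skip = find_bad_lines(page, StrictParser, StrictParseError)
# The proper parse would use the real subclass in place of StrictParser here.
feed_skipping(StrictParser(), page, skip)
```

Each error costs a full rescan from the top, so a page with many bad
lines gets parsed many times over; that is the price of never reusing a
parser object after it has raised.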