trying to parse non valid html documents with HTMLParser

florent · Aug 2, 2005

I'm trying to parse HTML documents from the web using the HTMLParser
class of the HTMLParser module (Python 2.3), but some web documents are
not fully valid. When the parser finds an invalid tag, it raises an
exception, and it then seems impossible to resume parsing just after the
point where the exception was raised. I'd like to continue parsing an HTML
document even if an invalid tag is found. Is this possible?

Here is a small invalid HTML document:
----------
<html>
<body>
<a href="""">bogus link</a>
</body>
</html>
----------

Benjamin Niemann · Aug 2, 2005

florent said:
I'm trying to parse HTML documents from the web using the HTMLParser
class of the HTMLParser module (Python 2.3), but some web documents are
not fully valid.

Some?? Most of them

When the parser finds an invalid tag, it raises an
exception, and it then seems impossible to resume parsing just after the
point where the exception was raised. I'd like to continue parsing an HTML
document even if an invalid tag is found. Is this possible?

AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
haven't already) and see which parser is more forgiving.

You might pipe the document through an external tool like HTML Tidy
<http://www.w3.org/People/Raggett/tidy/> before you feed it into
HTMLParser.
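
A rough sketch of that pre-cleaning step, written for current Python 3 rather than the 2.3 of this thread. It assumes the `tidy` command-line tool may or may not be on the PATH; the function name `tidy_html` and its `tidy_path` parameter are made up for illustration. If Tidy is missing or returns nothing, the input is handed back unchanged.

```python
import shutil
import subprocess


def tidy_html(document, tidy_path="tidy"):
    """Return the document cleaned by HTML Tidy, or unchanged if Tidy is unavailable.

    `tidy_html` and `tidy_path` are hypothetical names used for this sketch.
    """
    executable = shutil.which(tidy_path)
    if executable is None:
        # Tidy is not installed: fall back to the raw document.
        return document

    # -q: suppress the summary, -utf8: UTF-8 in and out,
    # --force-output yes: emit markup even if Tidy reports errors.
    result = subprocess.run(
        [executable, "-q", "-utf8", "--force-output", "yes"],
        input=document.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=False,  # Tidy exits non-zero on warnings and errors
    )
    cleaned = result.stdout.decode("utf-8", errors="replace")
    # Keep the original if Tidy produced no usable output.
    return cleaned if cleaned.strip() else document
```

The returned string can then be passed to the parser's `feed()` method in place of the raw page.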

Benji York · Aug 2, 2005

florent said:
I'm trying to parse HTML documents from the web using the HTMLParser
class of the HTMLParser module (Python 2.3), but some web documents are
not fully valid.

From http://www.crummy.com/software/BeautifulSoup/:

You didn't write that awful page. You're just trying to get
some data out of it. Right now, you don't really care what
HTML is supposed to look like.

Neither does this parser.

florent · Aug 3, 2005

AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
haven't already) and see which parser is more forgiving.

Thanks, I'll try htmllib.
Otherwise, I found a workaround: feeding the data to the HTMLParser in
chunks obtained with string.split("<") means I lose only one tag at a
time when an exception is raised!
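
A minimal sketch of this chunked-feeding idea, under some assumptions: it targets current Python 3, where the parser lives in `html.parser` (the thread uses the Python 2.3 HTMLParser module); `LinkCollector` and `feed_in_chunks` are hypothetical names; and a broad `except Exception` stands in for whatever the parser raises. Recent Python 3 versions are far more tolerant and rarely raise at all, so the recovery branch matters mainly for strict parsers.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag it manages to parse."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


def feed_in_chunks(parser, document):
    """Feed the document one '<'-delimited chunk at a time.

    Returns the list of chunks that were dropped because the parser raised.
    """
    head, *rest = document.split("<")
    # split() removes the separator, so put the '<' back on each piece.
    chunks = [head] + ["<" + piece for piece in rest]

    dropped = []
    for chunk in chunks:
        try:
            parser.feed(chunk)
        except Exception:
            # Discard the parser's buffered input and carry on with the
            # next chunk; only this one chunk is lost.
            parser.reset()
            dropped.append(chunk)
    return dropped
```

Because each chunk starts at a "<", a failure costs at most one tag plus the text that follows it. Any state kept across tags (open-element stacks, for instance) would need to survive the `reset()` call, which this sketch does not attempt.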

florent · Aug 3, 2005

From http://www.crummy.com/software/BeautifulSoup/:

You didn't write that awful page. You're just trying to get
some data out of it. Right now, you don't really care what
HTML is supposed to look like.

Neither does this parser.

True, I just want to extract some data from HTML documents. But the
problem is the same: the parser loses its position in the string
when it encounters a bad tag.

Benji York · Aug 3, 2005

florent said:
True, I just want to extract some data from HTML documents. But the
problem is the same: the parser loses its position in the string
when it encounters a bad tag.

Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
sure the author would like an example so he can "fix" it.

florent · Aug 3, 2005

AFAIK not with HTMLParser or htmllib. You might try htmllib (if you
haven't already) and see which parser is more forgiving.

You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags !

florent · Aug 3, 2005

Are you saying that Beautiful Soup can't parse the HTML? If so, I'm
sure the author would like an example so he can "fix" it.

In the end I used the htmllib module, which is more permissive than the
HTMLParser module when parsing bad HTML documents.
Anyway, where can I find the author's contact information?

Steve M · Aug 3, 2005

You were right, the HTMLParser of htmllib is more permissive. He just
ignores the bad tags !

The HTMLParser on my distribution is a she. But then again, I am using
ActivePython on Windows...

Benjamin Niemann · Aug 3, 2005

Steve said:
ignores the bad tags !

The HTMLParser on my distribution is a she. But then again, I am using
ActivePython on Windows...

Although building parsers is for some strange reason one of my favourite
programming adventures, I do not have such a personal relationship with my
classes.
