Looking for a decent HTML parser for Python...

Just Another Victim of the Ambient Morality · Dec 6, 2006

I'm trying to parse HTML in a very generic way.
So far, I'm using SGMLParser in the sgmllib module. The problem is that
it forces you to parse very specific tags through object methods like
start_a(), start_p() and the like, forcing you to know exactly which tags
you want to handle. I want to be able to handle the start tags of any and
all tags, like how one would do in the Xerces C++ XML parser. In other
words, I would like a simple start() method that is called whenever any tag
is encountered. How may I do this?
Thank you...

Just Another Victim of the Ambient Morality · Dec 6, 2006

Just Another Victim of the Ambient Morality said:
I'm trying to parse HTML in a very generic way.
So far, I'm using SGMLParser in the sgmllib module. The problem is
that it forces you to parse very specific tags through object methods like
start_a(), start_p() and the like, forcing you to know exactly which tags
you want to handle. I want to be able to handle the start tags of any and
all tags, like how one would do in the Xerces C++ XML parser. In other
words, I would like a simple start() method that is called whenever any
tag is encountered. How may I do this?
Thank you...

Okay, I think I found what I'm looking for in HTMLParser in the
HTMLParser module.
Thanks...
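
For readers landing on this thread, here is a minimal sketch of the "one handler for every start tag" idea using HTMLParser. It assumes Python 3, where the module is named html.parser rather than HTMLParser. The class name TagCollector and the function collect_start_tags are hypothetical, chosen only for illustration; handle_starttag() is the parser's own hook and is called for every opening tag regardless of its name.

```python
from html.parser import HTMLParser  # this module was called HTMLParser in Python 2


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag, whatever its name."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


def collect_start_tags(html):
    """Feed a complete HTML string to the parser and return the tags seen."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Calling collect_start_tags() on a page's source returns a list of (tag, attributes) tuples in document order, so no per-tag start_a() or start_p() methods are needed.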

Just Another Victim of the Ambient Morality · Dec 6, 2006

Just Another Victim of the Ambient Morality said:
Okay, I think I found what I'm looking for in HTMLParser in the
HTMLParser module.

Except it appears to be buggy or, at least, not very robust. There are
websites for which it falsely terminates early in the parsing. I have a
sneaking feeling the sgml parser will be more robust, if only it had that
one feature I am looking for.
Can someone help me out here?
Thank you...

Fredrik Lundh · Dec 6, 2006

Just Another Victim of the Ambient Morality said:
Except it appears to be buggy or, at least, not very robust. There are
websites for which it falsely terminates early in the parsing.

which probably means that the sites are broken. the amount of broken
HTML on the net is staggering, as is the amount of code in a typical web
browser for dealing with all that crap. for a more tolerant parser, see:

http://www.crummy.com/software/BeautifulSoup/

</F>

Stephen Eilert · Dec 6, 2006

Fredrik Lundh wrote:

which probably means that the sites are broken. the amount of broken
HTML on the net is staggering, as is the amount of code in a typical web
browser for dealing with all that crap. for a more tolerant parser, see:

http://www.crummy.com/software/BeautifulSoup/

</F>

+1 for BeautifulSoup.

The documentation is quite brief and sometimes confusing, but I've
found it the easiest parser I've ever worked with.

Stephen

hubritic · Dec 6, 2006

Agreed that the web sites are probably broken. Try running the HTML
through HTMLTidy (http://tidy.sourceforge.net/). Doing that has allowed
me to parse pages where I had problems such as yours.

I have also had luck with BeautifulSoup, which also includes a tidy
function in it.
