Looking for a decent HTML parser for Python...

  • Thread starter Just Another Victim of the Ambient Morality

Just Another Victim of the Ambient Morality

I'm trying to parse HTML in a very generic way.
So far, I'm using SGMLParser in the sgmllib module. The problem is that
it forces you to parse very specific tags through object methods like
start_a(), start_p() and the like, forcing you to know exactly which tags
you want to handle. I want to be able to handle the start tags of any and
all tags, like how one would do in the Xerces C++ XML parser. In other
words, I would like a simple start() method that is called whenever any tag
is encountered. How may I do this?
Thank you...
 

Just Another Victim of the Ambient Morality

Just Another Victim of the Ambient Morality said:
I'm trying to parse HTML in a very generic way.
So far, I'm using SGMLParser in the sgmllib module. The problem is
that it forces you to parse very specific tags through object methods like
start_a(), start_p() and the like, forcing you to know exactly which tags
you want to handle. I want to be able to handle the start tags of any and
all tags, like how one would do in the Xerces C++ XML parser. In other
words, I would like a simple start() method that is called whenever any
tag is encountered. How may I do this?
Thank you...

Okay, I think I found what I'm looking for in HTMLParser in the
HTMLParser module.
Thanks...
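
For readers landing here later, a minimal sketch of the "one start() method for every tag" idea. It assumes Python 3, where the HTMLParser module lives at html.parser; the class name TagLogger and the sample markup are made up for illustration. (sgmllib's SGMLParser also has an unknown_starttag() hook that fires for any tag without a specific start_xxx() method, but sgmllib was removed in Python 3, so it is not used here.)

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical example class: records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, whatever its name.
        # attrs arrives as a list of (name, value) pairs.
        self.start_tags.append((tag, dict(attrs)))


parser = TagLogger()
parser.feed('<html><body><p class="intro">Hi <a href="/x">link</a></p></body></html>')
parser.close()

tag_names = [name for name, _ in parser.start_tags]
print(tag_names)
```

handle_endtag() and handle_data() can be overridden the same way if end tags and text are needed too.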
 

Just Another Victim of the Ambient Morality

Just Another Victim of the Ambient Morality said:
Okay, I think I found what I'm looking for in HTMLParser in the
HTMLParser module.

Except it appears to be buggy or, at least, not very robust. There are
websites for which it falsely terminates early in the parsing. I have a
sneaking feeling the sgml parser will be more robust, if only it had that
one feature I am looking for.
Can someone help me out here?
Thank you...
 

Fredrik Lundh

Except it appears to be buggy or, at least, not very robust. There are
websites for which it falsely terminates early in the parsing.

which probably means that the sites are broken. the amount of broken
HTML on the net is staggering, as is the amount of code in a typical web
browser for dealing with all that crap. for a more tolerant parser, see:

http://www.crummy.com/software/BeautifulSoup/

</F>
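
A rough sketch of what the more tolerant route can look like, assuming the current BeautifulSoup release (the third-party bs4 package, installed separately) rather than the version linked above. The broken markup is invented for illustration, with unclosed p, b and a tags.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Deliberately sloppy markup: none of the p, b or a tags are closed.
broken_html = '<html><body><p>First paragraph<p>Second <b>bold <a href="/x">link</body>'

# "html.parser" selects the standard-library backend, so no extra parser is needed.
soup = BeautifulSoup(broken_html, "html.parser")

# find_all(True) matches every tag, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]
link_target = soup.a["href"]
print(tag_names, link_target)
```

The whole document is still walked despite the missing end tags, which is the behaviour the original poster was after.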
 

Stephen Eilert

Fredrik Lundh wrote:
which probably means that the sites are broken. the amount of broken
HTML on the net is staggering, as is the amount of code in a typical web
browser for dealing with all that crap. for a more tolerant parser, see:

http://www.crummy.com/software/BeautifulSoup/

</F>

+1 for BeautifulSoup.

The documentation is quite brief and sometimes confusing, but I've
found it the easiest parser I've ever worked with.


Stephen
 

hubritic

Agreed that the web sites are probably broken. Try running the HTML
through HTML Tidy (http://tidy.sourceforge.net/). Doing that has allowed
me to parse pages where I had problems such as yours.

I have also had luck with BeautifulSoup, which includes its own tidy
function.
 
