Parsing HTML with JavaScript

mtfulmer

I am trying to extract some information from a few web pages, and I was
using the HTMLParser module. It worked fine until it got to the
JavaScript, at which point it gave a parse error. Is there a good way to
work around this, or should I just preparse the file to remove the
JavaScript manually? This is my first Python program.
 
Richard Brodie

> I am trying to extract some information from a few web pages, and I was
> using the HTMLParser module. It worked fine until it got to the
> JavaScript, at which point it gave a parse error.

It's fairly common for pages with JavaScript to also be invalid HTML.
HTMLParser isn't an 'ignore all errors silently and guess what it's
meant to be' parser. Unless you have known-good inputs, it's often
best to use an alternative. Some options are discussed in Uche's article
here: http://www.xml.com/pub/a/2004/09/08/pyxml.html
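
For readers on a current Python 3, here is a minimal sketch of skipping
script content with the standard library. It assumes the Python 3
html.parser module, which is more forgiving than the HTMLParser module
discussed in this thread; the names TextExtractor and visible_text are
made up for illustration and do not come from the thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text chunks, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}  # elements whose content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the list of text chunks found outside script/style elements."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    return parser.chunks
```

Calling visible_text() on a page with unclosed tags and an inline script
should return only the page text. Whether this is robust enough for a
particular site is something to check against real inputs.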
 
John J. Lee

> I am trying to extract some information from a few web pages, and I was
> using the HTMLParser module. It worked fine until it got to the
> JavaScript, at which point it gave a parse error. Is there a good way to
> work around this, or should I just preparse the file to remove the
> JavaScript manually? This is my first Python program.

sgmllib is very similar to HTMLParser but doesn't break so easily
(though it has some problems with XHTML -- swings and roundabouts).

Or, try BeautifulSoup.


John
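
As a rough illustration of the BeautifulSoup suggestion, the sketch below
pulls links out of a sloppy page while dropping script blocks. It assumes
the current third-party bs4 package (pip install beautifulsoup4), which
may differ from the version available when this thread was written. The
names extract_links and SAMPLE are hypothetical, and the import is
guarded only so the snippet still loads when the package is missing.

```python
try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:  # assumption: the package may not be installed
    BeautifulSoup = None

# Hypothetical input: an inline script plus an unclosed <p> tag.
SAMPLE = """<html><body>
<script>document.write("<a href='x'>bad</a>");</script>
<a href="/news">News</a>
<p>unclosed paragraph
<a href="/about">About</a>
</body></html>"""


def extract_links(html):
    """Return (text, href) pairs for each <a href=...> outside scripts."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so their contents are never searched.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


if __name__ == "__main__" and BeautifulSoup is not None:
    for text, href in extract_links(SAMPLE):
        print(text, href)
```

The "html.parser" argument selects the standard-library backend; other
backends such as lxml or html5lib can be swapped in if they are
installed, and they may repair broken markup differently.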
 
