Parsing HTML--looking for info/comparison of HTMLParser vs. htmllibmodules.

  • Thread starter Kenneth McDonald
  • Start date
K

Kenneth McDonald

I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. The two Python modules I'm aware of to do this are
HTMLParser and htmllib. However, I'm currently experiencing either real
or conceptual difficulty with both, and was wondering if I could get
some advice.

The problem I'm having with HTMLParser is simple; I don't seem to be
getting the actual text in the HTML document. I've implemented the
do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
it never seems to receive any data. Is there another way to access the
text chunks as they come along?

HTMLParser would probably be the way to go if I can figure this out. It
seems much simpler than htmllib, and satisfies my requirements.

htmllib will write out the text data (using the AbstractFormatter and
AbstractWriter), but my problem here is conceptual. I simply don't
understand why all of these different "levels" of abstractness are
necessary, nor how to use them. As an example, the html <i>text</i>
should be converted to ''text'' (double single-quotes at each end) in my
mediawiki markup output. This would obviously be easy to achieve if I
simply had an html parse that called a method for each start tag, text
chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
and then does more things with both a formatter and a writer. To me,
both seem unnecessarily complex (though I suppose I can see the benefits
of a writer before generators gave the opportunity to simply yield
chunks of output to be processed by external code.) In any case, I don't
really have a good idea of what I should do with htmllib to get my
converted tags, and then content, and then closing converted tags,
written out.

Please feel free to point to examples, code, etc. Probably the simplest
solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken
 
W

wes weston

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
def __init__(self):
HTMLParser.__init__(self)
self.TokenList = []
def handle_data( self,data):
data = data.strip()
if data and len(data) > 0:
self.TokenList.append(data)
#print data
def GetTokenList(self):
return self.TokenList


try:
url = "http://....your url here.............."
f = urllib.urlopen(url)
res = f.read()
f.close()
except:
print "bad read"
return

h = MyHTMLParser()
h.feed(res)
tokensList = h.GetTokenList()
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,060
Latest member
BuyKetozenseACV

Latest Threads

Top