HTMLParser question

Rajarshi Guha

Hi,
I have some HTML that essentially consists of a series of <div>'s
and each <div> having one of two classes (tnt-question or tnt-answer).
I'm using HTMLParser to handle the tags as:

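import HTMLParser

class MyHTMLParser(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if len(attrs) == 1:
            cls, whichcls = attrs[0]
            if whichcls == 'tnt-question':
                print self.get_starttag_text(), self.getpos()

    def handle_endtag(self, tag):
        pass

    def handle_data(self, data):
        # called for the text inside every element, not only the matching <div>
        print data

if __name__ == '__main__':
    # read() keeps the file as-is; string.join() on readlines() would add a space between lines
    htmldata = open('tt.html', 'r').read()
    parser = MyHTMLParser()
    parser.feed(htmldata)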

However what I would like is that when the parser reaches some HTML like
this:

<div class="tnt-question">
How do I add a user to a MySQL system?
</div>

I should get back the data between the open and close tags. However the
above code prints the text contained between all tags, not just the <div>
tags with the class='tnt-question'.

Is there a way to call handle_data() when a specific tag is being handled?
Placing a call to handle_data() in handle_starttag seems to be the way -
but I'm not sure how to actually do it - what data should I pass to the
call?

Any pointers would be appreciated
Thanks,
Rajarshi
 
Benjamin Niemann

Rajarshi said:
Is there a way to call handle_data() when a specific tag is being handled?
Placing a call to handle_data() in handle_starttag seems to be the way -
but I'm not sure how to actually do it - what data should I pass to the
call?
Set a flag when the parser calls handle_starttag() and the tag matches
your criteria, and unset it when the corresponding end tag is found
(you'll probably have to count the nesting depth, so that for
<div class="printme">Yo <div>man</div>!</div>
the flag is unset on the second </div>). Then in handle_data(), print
the data only when the flag is set.
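
The flag-plus-depth idea described above could look roughly like the sketch below. It is only one way to do it, and it makes a few assumptions: it targets Python 3, where the module is html.parser rather than the Python 2 HTMLParser used in the question, and the class name QuestionParser and the attributes depth, chunks and questions are made up for illustration. Instead of a separate boolean flag it uses the depth counter itself: depth > 0 means "inside a matching <div>".

```python
from html.parser import HTMLParser


class QuestionParser(HTMLParser):
    """Collect the text inside <div class="tnt-question"> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # 0 = outside a matching div, >0 = nesting depth inside one
        self.chunks = []     # text pieces of the div currently being read
        self.questions = []  # finished text, one entry per matching div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth > 0:
            # a nested <div> inside a matching one: just count it
            self.depth += 1
        elif 'tnt-question' in (dict(attrs).get('class') or '').split():
            # a matching <div> opens: start collecting
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # the </div> that closes the matching div: stop collecting
                self.questions.append(''.join(self.chunks).strip())

    def handle_data(self, data):
        # keep text only while inside a matching div
        if self.depth > 0:
            self.chunks.append(data)


parser = QuestionParser()
parser.feed('<div class="tnt-question">How do I add a user to a MySQL system?</div>'
            '<div class="tnt-answer">(answer text)</div>')
parser.close()
print(parser.questions)
```

Note that handle_data() is never called by hand here; the parser calls it, and the method simply ignores text that arrives while depth is 0. The same pattern should carry over to the Python 2 HTMLParser class with print statements and an explicit call to HTMLParser.HTMLParser.__init__(self).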
 
