HTMLParser handle_starttag misses lots of tags!

Matthew Wilson

I want to parse an HTML file and extract my router's IP address. I
wrote this code, and I have Python 2.3 installed:

#! /usr/bin/env python

import HTMLParser

class HP(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, data):
        print "tag is %s." % (tag)

    def handle_comment(self, data):
        print "caught a comment: %s." % (data)

    def handle_data(self, data):
        if "IP" in data:
            print "Caught %s." % data

hp = HP()
out = open('routerstatus.html')
for line in out:
    hp.feed(line)


I figured that when I ran this on the html code at the bottom of this
file, it would print every tag, but instead, this is what I got:

tag is html.
tag is head.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is title.
tag is link.
tag is script.
tag is body.
tag is form.

The program seems to take a vacation after the opening form tag. What
am I doing wrong?

Finally, this is the html code I am trying to parse:



<html>

<head>
<meta http-equiv="content-type" content="text/html;charset=ISO-8859-1">
<meta name="generator" content="Adobe GoLive 5">
<META http-equiv='Pragma' CONTENT='no-cache'>
<META HTTP-EQUIV="Cache-Control" CONTENT="no-cache">
<META http-equiv='Refresh' CONTENT='20'>
<title>router form</title>
<link rel="stylesheet" href="form.css">
<script language="javascript" type="text/javascript">
<!-- hide script from old browsers
function loadhelp(num) {

parent.helpframe.document.location.href="help/help"+num+".html"

}
function newwindow(F)
{
if((F.status.value =="checked")||(F.EncapPTelstra.value=="checked")||(F.EncapAolDhcp.value=="checked"))
window.open('enatherstatus.htm', 'enstatherstatus', 'width=380,height=450,status=yes');
else if((F.EncapPPTP.value =="checked"))
window.open('pptpstatus.htm', 'pptpstatus', 'width=380,height=320,status=yes');
else
window.open('pppoestatus.htm', 'pppoestatus', 'width=380,height=320,status=yes');



}
//-->
</script>
</head>
<body bgcolor="#ffffff" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="loadhelp('_SysStatus')">
<form method="POST">
<input type=hidden name=status value=>
<input type="hidden" name=EncapPTelstra value=>
<input type="hidden" name=EncapPPTP value=>
<input type="hidden" name=EncapAolDhcp value=>
<table border="0" cellpadding="0" cellspacing="3" width="100%">
<tr>
<td colspan="2">
<h1>Router Status</h1>
</td>
</tr>
<!-- RULE //-->
<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<!-- END RULE //-->
<tr>
<td width="60%">
<b>Account Name</b>
</td>
<td width="40%">

</td>
</tr>

<tr>
<td width="60%">
<b>Firmware Version </b>
</td>
<td width="40%">
4.13 Aug 20 2003
</td>
</tr>

<!-- RULE //-->
<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<!-- END RULE //-->
<tr>
<td colspan="2">
<span class="subhead">Internet Port </span>
</td>
</tr>
<tr>
<td width="60%">
<b>MAC Address </b>
</td>
<td width="40%">
00:09:5b:29:3d:b4
</td>
</tr>
<tr>
<td width="60%">
<b>IP Address </b>
</td>
<td width="40%">
66.72.206.129
</td>
</tr>
<tr>
<td width="60%">
<b>DHCP </b>
</td>
<td width="40%">
None
</td>
</tr>
<tr>
<td width="60%">
<b>IP Subnet Mask </b>
</td>
<td width="40%">
None
</td>
</tr>
<tr>
<td width="60%">
<b>Domain Name Server</b>
</td>
<td width="40%">
66.73.20.40
</td>
</tr>
<tr>
<td width="60%">
<b></b>
</td>
<td width="40%">
206.141.193.55
</td>
</tr>
<!-- RULE //-->
<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<!-- END RULE //-->
<tr>
<td colspan="2">
<span class="subhead">LAN Port </span>
</td>
</tr>
<tr>
<td width="60%">
<b>MAC Address </b>
</td>
<td width="40%">
00:09:5b:29:3d:b3
</td>
</tr>
<tr>
<td width="60%">
<b>IP Address </b>
</td>
<td width="40%">
192.168.0.1
</td>
</tr>
<tr>
<td width="60%">
<b>DHCP </b>
</td>
<td width="40%">
Server
</td>
</tr>
<tr>
<td width="60%">
<b>IP Subnet Mask </b>
</td>
<td width="40%">
255.255.255.0
</td>
</tr>

</table>
<TABLE border=0 width="100%">
<tr width="100%">
<td>
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>
<TR width="100%">
<TD>

<span class="subhead">Wireless Port </span>
</TD>
</TR>

</TABLE>

<TABLE width="100%" border=0>

<TR>
<TD width="60%"><b>MAC Address
(BSSID) </b></TD>
<TD width="40%">00:09:5b:29:3d:b3</TD></TR>
</table>
<TABLE width="100%" cellSpacing=2 border=0>
<TD width="60%"><b>Name (SSID)</b></TD>
<TD width="40%">natchieland</TD></tr>
<TD width="60%"><b>Region</b></TD>
<TD width="40%">USA</TD></tr>
<TD width="60%"><b>Channel</b></TD>
<TD width="40%">1</TD></tr>

</table>
<TABLE width="100%" cellSpacing=2 border=0>

<tr>
<td colspan="2">
<img src="img/liteblue.gif" width="100%" height="2" border="0">
</td>
</tr>

<tr>
<td align='center'>
<input type="BUTTON" value="Show Statistics" onclick="window.open('mtenSysStatistics.htm','static','width=500,height=200,status=yes, resizable=yes');">
<INPUT onclick="newwindow(this.form);" type=button value="Connection Status">
</TD>
</tr>
</TABLE>
</form>
</body>

</html>
 
Diez B. Roggisch

The program seems to take a vacation after the opening form tag. What
am I doing wrong?
<input type=hidden name=status value=>

I can't believe that this value=-thingy is valid html....

Regards,

Diez
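
To see how many tags in the page have this shape, a small sketch along these lines could list them. The regular expression and the name find_empty_unquoted_values are assumptions made for illustration, not part of HTMLParser, and the sketch is written for Python 3 rather than the 2.3 used in the question. It is a plain text scan, so it can also flag look-alike text inside scripts.

```python
import re

# Hypothetical pattern: an attribute name, "=", then (ignoring whitespace)
# the ">" that closes the tag, i.e. an attribute written like value=>
EMPTY_VALUE = re.compile(r'([A-Za-z_][-.:\w]*)\s*=\s*(?=>)')

def find_empty_unquoted_values(html):
    """Return (line_number, attribute_name) pairs for attributes like value=>."""
    hits = []
    for lineno, line in enumerate(html.splitlines(), 1):
        for match in EMPTY_VALUE.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits
```

Run against the posted page, this should point at the four hidden input lines directly after the form tag, which is where the tag output stops.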
 
Peter Otten

Matthew said:
The program seems to take a vacation after the opening form tag. What
am I doing wrong?

Nothing, but your input file is not valid HTML and seems to puzzle the
parser. I recommend running it through tidy before you feed it to the
parser.

Peter
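
If tidy is not at hand, a narrower workaround is to patch the empty attribute values before the text reaches the parser. What follows is only a minimal sketch, assuming Python 3 (where the module is html.parser) and assuming that attributes written as value=> are the only construct causing trouble in this page. The names repair_empty_values, TagLister and list_tags are hypothetical.

```python
import re
from html.parser import HTMLParser

def repair_empty_values(html):
    """Rewrite attr=> as attr="">, giving the attribute an explicit empty value."""
    # Assumption: any name followed by "=" and then ">" is a broken attribute.
    return re.sub(r'([A-Za-z_][-.:\w]*)\s*=\s*(?=>)', r'\1=""', html)

class TagLister(HTMLParser):
    """Collect the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def list_tags(html):
    """Repair the markup, parse it, and return the start tag names in order."""
    parser = TagLister()
    parser.feed(repair_empty_values(html))
    parser.close()  # force processing of any data still buffered
    return parser.tags
```

Something like list_tags(open('routerstatus.html').read()) would then return the tag names instead of printing them. Note the close() call at the end, which forces the parser to process whatever it is still holding; the script in the question never calls it.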
 
