BobAalsma
I'm trying to understand HTMLParser, so I copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and ran it against my LinkedIn page.
It runs without errors, but some of the tags go missing from the output for no apparent reason. Any advice?
I have searched extensively for this, but I seem to be the only one with missing data from HTMLParser.
Code:
import urllib2
from HTMLParser import HTMLParser
from GetHttpFileContents import getHttpFileContents

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Start tag:\n\t", tag
        for attr in attrs:
            print "\t\tattr:", attr
        # end for attr in attrs:
    #
    def handle_endtag(self, tag):
        print "End tag :\n\t", tag
    #
    def handle_data(self, data):
        if data != '\n\n':
            if data != '\n':
                print "Data :\t\t", data
            # end if 1
        # end if 2
    #
#
# ---------------------------------------------------------------------
#
def removeHtmlFromFileContents():
    TextOut = ''
    parser = MyHTMLParser()
    parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())
    return TextOut
#
# ---------------------------------------------------------------------
#
if __name__ == '__main__':
    TextOut = removeHtmlFromFileContents()
Part of the output:
End tag :
    script
Start tag:
    title
Data :        Bob Aalsma - Nederland | LinkedIn
End tag :
    title
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://www.linkedin.com/uas/authping?url=http://nl.linkedin.com/in/bobaalsma')
End tag :
    script
Start tag:
    link
        attr: ('rel', 'stylesheet')
        attr: ('type', 'text/css')
        attr: ('href', 'http://s3.licdn.com/scds/concat/com...dljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://s4.licdn.com/scds/concat/com...9o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
End tag :
    script
End tag :
    head
But the source text for this part of the page is as follows, and all of the <meta ...> tags seem to go missing from the output:
</script>
<title>Bob Aalsma | LinkedIn</title>
<link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
<link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/...j6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
<meta name="UniqueID" content="24198692">
<meta name="SaveURL" content="/profile/view?id=24198692&authType=name&authToken=KhOG">
</head>
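One way to narrow this down is to check whether the parser drops the tags or whether they were never in the text that was fed to it. The sketch below is only an illustration, not a confirmed diagnosis: TagCollector, compare_meta_counts and SAMPLE_HEAD are hypothetical names, and the sample markup is a shortened copy of the source pasted above rather than the real response. It counts the literal "<meta" occurrences in a string and compares that with the number of meta start tags HTMLParser reports.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the question


class TagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        self.start_tags.append(tag)


def compare_meta_counts(html):
    """Return (count of '<meta' in the raw text, meta start tags parsed)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    raw_count = html.lower().count('<meta')
    parsed_count = parser.start_tags.count('meta')
    return raw_count, parsed_count


# Assumption: a shortened copy of the <head> shown above, not the live page.
SAMPLE_HEAD = (
    '<head>'
    '<title>Bob Aalsma | LinkedIn</title>'
    '<meta name="ShortTitle" content="Bob Aalsma">'
    '<meta name="UniqueID" content="24198692">'
    '</head>'
)

raw_count, parsed_count = compare_meta_counts(SAMPLE_HEAD)
print("raw: %d, parsed: %d" % (raw_count, parsed_count))
```

Running the same comparison on the string returned by urllib2.urlopen(...).read() would show which case applies. If both counts are zero, the <meta> tags were not in the downloaded text at all. The title in the output ("Bob Aalsma - Nederland | LinkedIn") differs from the title in the pasted source ("Bob Aalsma | LinkedIn"), which suggests the script may be receiving different markup than the browser.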