HTMLParser skipping HTML? [newbie]

BobAalsma

I'm trying to understand the HTMLParser so I've copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and tried that on my LinkedIn page.
No errors, but some of the tags seem to go missing for no apparent reason - any advice?
I have searched extensively for this, but seem to be the only one with missing data from HTMLParser :(

Code:
import urllib2
from HTMLParser import HTMLParser

from GetHttpFileContents import getHttpFileContents

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Start tag:\n\t", tag
        for attr in attrs:
            print "\t\tattr:", attr
        # end for attr in attrs:
    #
    def handle_endtag(self, tag):
        print "End tag :\n\t", tag
    #
    def handle_data(self, data):
        if data != '\n\n':
            if data != '\n':
                print "Data :\t\t", data
            # end if 1
        # end if 2
    #
#
# ---------------------------------------------------------------------
#
def removeHtmlFromFileContents():
    TextOut = ''

    parser = MyHTMLParser()
    parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())

    return TextOut
#
# ---------------------------------------------------------------------
#
if __name__ == '__main__':
    TextOut = removeHtmlFromFileContents()





Part of the output:
End tag :
    script
Start tag:
    title
Data :          Bob Aalsma - Nederland | LinkedIn
End tag :
    title
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://www.linkedin.com/uas/authping?url=http://nl.linkedin.com/in/bobaalsma')
End tag :
    script
Start tag:
    link
        attr: ('rel', 'stylesheet')
        attr: ('type', 'text/css')
        attr: ('href', 'http://s3.licdn.com/scds/concat/com...dljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://s4.licdn.com/scds/concat/com...9o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
End tag :
    script
End tag :
    head



But the source text for this is as follows, and all of the "<meta ...>" tags seem to go missing:
</script>
<title>Bob Aalsma | LinkedIn</title>
<link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
<link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/...j6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
<meta name="UniqueID" content="24198692">
<meta name="SaveURL" content="/profile/view?id=24198692&amp;authType=name&amp;authToken=KhOG">
</head>
 
Peter Otten

BobAalsma said:
    def handle_data(self, data):
        if data != '\n\n':
            if data != '\n':
                print "Data :\t\t", data
            # end if 1
        # end if 2

Please no! A kitten dies every time you write one of those comments ;)

After removing
from GetHttpFileContents import getHttpFileContents

from your script I get the following output (using python 2.7):

$ python parse_orig.py | grep meta -C2
script
Start tag:
meta
attr: ('http-equiv', 'content-type')
attr: ('content', 'text/html; charset=UTF-8')
Start tag:
meta
attr: ('http-equiv', 'X-UA-Compatible')
attr: ('content', 'IE=8')
Start tag:
meta
attr: ('name', 'description')
attr: ('content', 'Bekijk het (Nederland) professionele
profiel van Bob Aalsma op LinkedIn. LinkedIn is het grootste zakelijke
netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne
connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners
vinden.')
Start tag:
meta
attr: ('name', 'pageImpressionID')
attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
meta
attr: ('name', 'pageKey')
attr: ('content', 'nprofile-public-success')
Start tag:
meta
attr: ('name', 'analyticsURL')
attr: ('content', '/analytics/noauthtracker')
$

So there definitely are some meta tags.

Note that if you're logged in to a site, the HTML the browser is "seeing"
may differ from the HTML you are retrieving via urllib2.urlopen(...).read().
Perhaps that is the reason why you don't get what you expect.
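
One way to separate the two possible causes (the parser versus the page the server actually sends) is to feed the parser a fixed fragment instead of a live download. The sketch below does that with part of the <head> quoted in the original post. It assumes Python 3, where the class lives in html.parser rather than in the Python 2 HTMLParser module used in this thread; MetaCollector and SAMPLE_HEAD are names made up for this illustration.

```python
from html.parser import HTMLParser  # Python 2 (as in this thread): from HTMLParser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical helper: remember the attributes of every <meta> start tag."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is easier to inspect.
        if tag == "meta":
            self.metas.append(dict(attrs))


# Fixed input, copied from the <head> fragment quoted in the original post.
SAMPLE_HEAD = """<head>
<title>Bob Aalsma | LinkedIn</title>
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="UniqueID" content="24198692">
</head>"""

collector = MetaCollector()
collector.feed(SAMPLE_HEAD)
collector.close()

meta_names = [meta.get("name") for meta in collector.metas]
print(meta_names)
```

If the names show up here but not for the downloaded page, the difference most likely lies in the HTML that was retrieved, not in the parser.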
 
BobAalsma

On Wednesday 5 September 2012 at 14:57:05 UTC+2, BobAalsma wrote:
I'm trying to understand the HTMLParser so I've copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and tried that on my LinkedIn page.

Hmm, OK, Peter, thanks. I didn't consider the effect of logging in; that could certainly be a reason. So how could I have the script log in?

[Didn't understand the bit about the kittens, though. How about that?]
 
BobAalsma

On Wednesday 5 September 2012 at 19:23:45 UTC+2, BobAalsma wrote:
Hmm, OK, Peter, thanks. I didn't consider the effect of logging in, that could certainly be a reason. So how could I have the script log in?



[Didn't understand the bit about the kittens, though. How about that?]

Oops, sorry, found that bit about logging in - asked too soon; still wonder about the kittens ;)
 
Peter Otten

BobAalsma said:
[Didn't understand the bit about the kittens, though. How about that?]

Oops, sorry, found that bit about logging in - asked too soon; still
wonder about the kittens ;)

I just wanted to tell you not to mark the end of an if-suite with an "# end
if" comment. As soon as you become familiar with the language that will look
like noise that detracts from the actual code.
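
A minimal sketch of the same handler without the end-of-block comments. It assumes Python 3 (html.parser instead of the Python 2 HTMLParser module) and a hypothetical TextCollector class that stores text instead of printing it. Note that it skips every whitespace-only chunk, which is slightly broader than the original '\n' and '\n\n' checks.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical variant of MyHTMLParser that collects text chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The indentation alone shows where the block ends.
        text = data.strip()
        if text:
            self.chunks.append(text)


collector = TextCollector()
collector.feed("<title>Bob Aalsma | LinkedIn</title>\n\n<p>Hello</p>\n")
collector.close()
print(collector.chunks)
```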

In an attempt to make this advice appear less patronizing I wrapped it in
a lame joke by alluding to

http://en.wikipedia.org/wiki/Every_time_you_masturbate..._God_kills_a_kitten

Sorry for the confusion -- I hope you aren't offended.
 
BobAalsma

No offense and thanks for the reminder.
My background is software packages in 3GL, where different platforms mean different editors, which means it is sometimes difficult to recognize the end of blocks, especially when nested.
No need for that here, no.
I think it also means I'm still not really satisfied with my commenting in Python...
 
BobAalsma

I can see that my Tester is not logging in: the reply from the site reads "<title>Sign In | LinkedIn</title>" rather than "<title>Bob Aalsma | LinkedIn</title>".
How can I tell which part is not correct?
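
A first step could be to have the script itself check which page came back, using the title quoted above. The sketch below is one possible way to do that, assuming Python 3 (html.parser instead of the Python 2 HTMLParser module); TitleParser, page_title and is_sign_in_page are hypothetical names, and the "Sign In" prefix is taken from the title quoted in this post, not from any documented behaviour of the site.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper: capture the text inside <title>...</title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def is_sign_in_page(html):
    # Assumption: the sign-in page is recognisable by its title alone.
    return page_title(html).startswith("Sign In")


signed_out = "<html><head><title>Sign In | LinkedIn</title></head></html>"
signed_in = "<html><head><title>Bob Aalsma | LinkedIn</title></head></html>"
print(page_title(signed_out), is_sign_in_page(signed_out))
print(page_title(signed_in), is_sign_in_page(signed_in))
```

Running a check like this after each request would at least show at which step the response stops being the expected page.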
 
