HTMLParser skipping HTML? [newbie]

BobAalsma · Sep 5, 2012

I'm trying to understand HTMLParser, so I copied some code from http://docs.python.org/library/htmlparser.html?highlight=html#HTMLParser and tried it on my LinkedIn page.
There are no errors, but some of the tags seem to go missing for no apparent reason. Any advice?
I have searched extensively for this, but I seem to be the only one with missing data from HTMLParser.

Code:
import urllib2
from HTMLParser import HTMLParser

from GetHttpFileContents import getHttpFileContents

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Start tag:\n\t", tag
        for attr in attrs:
            print "\t\tattr:", attr
        # end for attr in attrs:
    #
    def handle_endtag(self, tag):
        print "End tag :\n\t", tag
    #
    def handle_data(self, data):
        if data != '\n\n':
            if data != '\n':
                print "Data :\t\t", data
            # end if 1
        # end if 2
    #
#
# ---------------------------------------------------------------------
#
def removeHtmlFromFileContents():
    TextOut = ''

    parser = MyHTMLParser()
    parser.feed(urllib2.urlopen('http://nl.linkedin.com/in/bobaalsma').read())

    return TextOut
#
# ---------------------------------------------------------------------
#
if __name__ == '__main__':
    TextOut = removeHtmlFromFileContents()

Part of the output:
End tag :
    script
Start tag:
    title
Data :          Bob Aalsma - Nederland | LinkedIn
End tag :
    title
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://www.linkedin.com/uas/authping?url=http://nl.linkedin.com/in/bobaalsma')
End tag :
    script
Start tag:
    link
        attr: ('rel', 'stylesheet')
        attr: ('type', 'text/css')
        attr: ('href', 'http://s3.licdn.com/scds/concat/com...dljf1bvpack85gyxhv4-5xxmkfcm1ny97biv0pwj7ch69')
Start tag:
    script
        attr: ('type', 'text/javascript')
        attr: ('src', 'http://s4.licdn.com/scds/concat/com...9o6xkxgppoxivctlunb-8v6o0480wy5u6j7f3sh92hzxo')
End tag :
    script
End tag :
    head

But the source text for this part is shown below, and all of the "<meta ...>" tags seem to go missing:
</script>
<title>Bob Aalsma | LinkedIn</title>
<link rel="stylesheet" type="text/css" href="https://s3-s.licdn.com/scds/concat/common/css?h=7d22iuuoi1bmp3a2jb6jyv5z5">
<link rel="stylesheet" type="text/css" href="https://s4-s.licdn.com/scds/concat/...j6nlhvdvzx7rmluambv-69sgyia02rmcjmco0t9d3xpvo">
<meta name="LinkedInBookmarkType" content="profile">
<meta name="ShortTitle" content="Bob Aalsma">
<meta name="Description" content="Bob Aalsma: Project Manager at DripFeed in the Information Services industry (Amsterdam Area, Netherlands)">
<meta name="UniqueID" content="24198692">
<meta name="SaveURL" content="/profile/view?id=24198692&authType=name&authToken=KhOG">
</head>

Peter Otten · Sep 5, 2012

BobAalsma said:
    def handle_data(self, data):
        if data != '\n\n':
            if data != '\n':
                print "Data :\t\t", data
            # end if 1
        # end if 2

Please no! A kitten dies every time you write one of those comments.

After removing

from GetHttpFileContents import getHttpFileContents

from your script, I get the following output (using Python 2.7):

$ python parse_orig.py | grep meta -C2
    script
Start tag:
    meta
        attr: ('http-equiv', 'content-type')
        attr: ('content', 'text/html; charset=UTF-8')
Start tag:
    meta
        attr: ('http-equiv', 'X-UA-Compatible')
        attr: ('content', 'IE=8')
Start tag:
    meta
        attr: ('name', 'description')
        attr: ('content', 'Bekijk het (Nederland) professionele profiel van Bob Aalsma op LinkedIn. LinkedIn is het grootste zakelijke netwerk ter wereld. Professionals als Bob Aalsma kunnen hiermee interne connecties met aanbevolen kandidaten, branchedeskundigen en businesspartners vinden.')
Start tag:
    meta
        attr: ('name', 'pageImpressionID')
        attr: ('content', '711eedaa-8273-45ca-a0dd-77eb96749134')
Start tag:
    meta
        attr: ('name', 'pageKey')
        attr: ('content', 'nprofile-public-success')
Start tag:
    meta
        attr: ('name', 'analyticsURL')
        attr: ('content', '/analytics/noauthtracker')
$

So there definitely are some meta tags.

Note that if you're logged in to a site, the HTML the browser is "seeing"
may differ from the HTML you are retrieving via urllib2.urlopen(...).read().
Perhaps that is the reason why you don't get what you expect.
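
One way to check this is to compare the meta tags in the HTML the script receives with those in the page source the browser shows. The sketch below is written for Python 3, where the parser lives in `html.parser` rather than the Python 2 `HTMLParser` module used in this thread. The names `MetaCollector` and `meta_names` are made up for illustration, and the two short strings are hypothetical stand-ins for the real responses.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Record the name/content pair of every <meta name=...> tag seen."""

    def __init__(self):
        super().__init__()
        self.metas = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            if "name" in attr_map:
                self.metas[attr_map["name"]] = attr_map.get("content")


def meta_names(html_text):
    """Return the set of meta names found in html_text."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return set(collector.metas)


# Hypothetical stand-ins for the two responses; real pages are much longer.
browser_html = '<head><meta name="ShortTitle" content="Bob Aalsma"></head>'
script_html = '<head><meta name="pageKey" content="nprofile-public-success"></head>'

# Meta names the browser sees but the script does not.
only_in_browser = meta_names(browser_html) - meta_names(script_html)
print(sorted(only_in_browser))
```

If the set is not empty, the parser is probably fine and the two requests are simply getting different pages.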

BobAalsma · Sep 5, 2012

Hmm, OK, Peter, thanks. I didn't consider the effect of logging in; that could certainly be the reason. So how could I have the script log in?

[Didn't understand the bit about the kittens, though. How about that?]

BobAalsma · Sep 5, 2012

Oops, sorry, I found the bit about logging in; I asked too soon. I still wonder about the kittens, though.

Peter Otten · Sep 5, 2012

BobAalsma said:
[Didn't understand the bit about the kittens, though. How about that?]

Oops, sorry, found that bit about logging in - asked too soon; still
wonder about the kittens

I just wanted to tell you not to mark the end of an if-suite with an
"# end if" comment. As soon as you become familiar with the language, that
will look like noise that detracts from the actual code.
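
As a rough sketch of what that looks like, here is the `handle_data` method with the end-of-block comments removed and the two nested tests folded into one. It targets Python 3 (`html.parser`, `print()` as a function), not the Python 2 used in the original script. The `chunks` list and the sample input are additions for illustration only.

```python
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []  # illustrative addition: remember what was printed

    def handle_data(self, data):
        # Same filter as the original: skip chunks that are exactly
        # one or two newlines. Indentation alone marks the end of the block.
        if data not in ("\n", "\n\n"):
            print("Data :\t\t", data)
            self.chunks.append(data)


parser = MyHTMLParser()
parser.feed("<title>Bob Aalsma | LinkedIn</title>\n\n<p>hello</p>")
parser.close()
```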

In an attempt to make this advice appear less patronizing I wrapped it into
a lame joke by alluding to

http://en.wikipedia.org/wiki/Every_time_you_masturbate..._God_kills_a_kitten

Sorry for the confusion -- I hope you aren't offended.

BobAalsma · Sep 6, 2012

No offense and thanks for the reminder.
My background is software packages in 3GL, where different platforms mean different editors, which sometimes makes it difficult to recognize the end of a block, especially when blocks are nested.
There is no need for that here, no.
I think it also means I'm still not really satisfied with my commenting in Python...

BobAalsma · Sep 6, 2012

I can see that my Tester is not logging in: the reply from the site reads "<title>Sign In | LinkedIn</title>" rather than "<title>Bob Aalsma | LinkedIn</title>".
How can I tell which part is not correct?
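
One way to narrow this down is to have the script test the page title before parsing anything else, so that a failed login is reported right away. The sketch below is for Python 3 (`html.parser`). The names `TitleParser`, `page_title`, and `looks_like_login_page` are made up for illustration, and the "Sign In" test is only a heuristic based on the title quoted above. It does not perform the login itself.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html_text):
    """Return the stripped text of the first <title>, or '' if none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()


def looks_like_login_page(html_text):
    # Heuristic (assumption): the sign-in page's title starts with "Sign In".
    return page_title(html_text).startswith("Sign In")


print(looks_like_login_page("<title>Sign In | LinkedIn</title>"))
print(looks_like_login_page("<title>Bob Aalsma | LinkedIn</title>"))
```

Running this check on the response after each step of the login attempt would show at which step the site stops returning the profile page.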
