help with link parsing?


Littlefield, Tyler

Hello all,
I have a question. I guess this worked pre 2.6; I don't remember the
last time I used it, but it was a while ago, and now it's failing.
Anyone mind looking at it and telling me what's going wrong? Also, is
there a quick way to match on a certain site? like links from google.com
and only output those?
#!/usr/bin/env python

#This program is free software: you can redistribute it and/or modify it
#under the terms of the GNU General Public License as published
#by the Free Software Foundation, either version 3 of the License, or
#(at your option) any later version.
#
#This program is distributed in the hope that it will be useful, but
#WITHOUT ANY WARRANTY; without even the implied warranty of
#MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
#General Public License for more details.
#
#You should have received a copy of the GNU General Public License along
#with this program. If not, see http://www.gnu.org/licenses/.

"""
This script will parse out all the links in an html document and write
them to a textfile.
"""
import sys
import htmllib, formatter

#program class declarations:
class Links(htmllib.HTMLParser):
    def __init__(self, formatter):
        htmllib.HTMLParser.__init__(self, formatter)
        self.links = []
    def start_a(self, attrs):
        for a in attrs:
            if a[0] == "href":
                self.links.append(a[1])
                print a[1]
                break

def main(argv):
    if len(argv) != 3:
        print("Error:\n" + argv[0] + " <input> <output>.\nParses <input> for all links and saves them to <output>.")
        return 1
    lcount = 0
    format = formatter.NullFormatter()
    html = Links(format)
    print "Retrieving data:"
    page = open(argv[1], "r")
    print "Feeding data to parser:"
    html.feed(page.read())
    page.close()
    print "Writing links:"
    output = open(argv[2], "w")
    for i in html.links:
        output.write(i + "\n")
        lcount += 1
    output.close()
    print("Wrote " + str(lcount) + " links to " + argv[2] + ".")
    print("done.")

if __name__ == "__main__":
    #we call the main function, passing the list of args, and exit
    #with the return code passed back.
    sys.exit(main(sys.argv))
 

Jon Clements

This doesn't answer your original question, but excluding the command
line handling, how does this do for you?:

import lxml.html
from urlparse import urlsplit

doc = lxml.html.parse('http://www.google.com')
print map(urlsplit, doc.xpath('//a/@href'))

[SplitResult(scheme='http', netloc='www.google.co.uk', path='/imghp',
query='hl=en&tab=wi', fragment=''), SplitResult(scheme='http',
netloc='video.google.co.uk', path='/', query='hl=en&tab=wv',
fragment=''), SplitResult(scheme='http', netloc='maps.google.co.uk',
path='/maps', query='hl=en&tab=wl', fragment=''),
SplitResult(scheme='http', netloc='news.google.co.uk', path='/nwshp',
query='hl=en&tab=wn', fragment=''), ...]

Much nicer IMHO, plus lxml.html has iterlinks() and other
convenience functions for handling HTML.
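For the "only links from google.com" part of the question, the netloc
field that urlsplit returns is the host, so a suffix check on it does
the job. A sketch (the helper name and the sample hrefs are invented
for illustration; on Python 2, urlsplit lives in urlparse instead):

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

def links_from_site(hrefs, site):
    """Keep only absolute links whose host is `site` or a subdomain of it."""
    kept = []
    for href in hrefs:
        netloc = urlsplit(href).netloc
        if netloc == site or netloc.endswith("." + site):
            kept.append(href)
    return kept

hrefs = [
    "http://www.google.com/imghp?hl=en&tab=wi",
    "http://news.google.com/nwshp?hl=en&tab=wn",
    "http://example.org/page",
    "/relative/link",  # no netloc, so it is filtered out
]
print(links_from_site(hrefs, "google.com"))
# ['http://www.google.com/imghp?hl=en&tab=wi', 'http://news.google.com/nwshp?hl=en&tab=wn']
```

The suffix check with a leading dot avoids matching hosts like
"notgoogle.com" while still catching subdomains such as news.google.com.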

hth

Jon.
 

Colin J. Williams


Jon,

What version of Python was used to run this?

Colin W.
 

Jon Clements


2.6.5 - the lxml library is not a standard module though and needs to
be installed.
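If the original script is failing because of htmllib (deprecated in
2.6 and removed in 3.0), the stdlib html.parser module offers the same
start-tag hook style with no third-party install. A minimal Python 3
sketch (the class name is invented; Python 2 named this module
HTMLParser):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, much like htmllib's start_a
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)
                    break

parser = LinkCollector()
parser.feed('<a href="http://example.com/">one</a> <a href="/two">two</a>')
print(parser.links)
# ['http://example.com/', '/two']
```

Unlike htmllib, html.parser needs no formatter object; you subclass it
and override handle_starttag directly.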
 
