Pulling stuff out of the internet archive (wayback machine)

David Fraser · Jul 1, 2004

I used this when trying to retrieve the McMillan site. Others might find
it useful...

David

#!/usr/bin/env python

import urlparse
import urllib2
import os
import HTMLParser
import sre

class HTMLLinkScanner(HTMLParser.HTMLParser):
    # tags to scan, mapped to the attribute that holds the link
    tags = {'a':'href','img':'src','frame':'src','base':'href'}

    def reset(self):
        self.links = {}
        self.replacements = []
        HTMLParser.HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            checkattrs = self.tags[tag]
            if isinstance(checkattrs, (str, unicode)):
                checkattrs = [checkattrs]
            for attr, value in attrs:
                if attr in checkattrs:
                    if tag != 'base':
                        # strip the #fragment so each page is only recorded once
                        link = urlparse.urldefrag(value)[0]
                        self.links[link] = True
                    self.replacements.append((self.get_starttag_text(), attr, value))

class MirrorRetriever:
    def __init__(self, archivedir):
        self.archivedir = archivedir
        # maps each url seen so far to its local filename
        self.urlmap = {}

    def url2filename(self, url):
        scheme, location, path, query, fragment = urlparse.urlsplit(url)
        if not path or path.endswith('/'):
            path += 'index.html'
        path = os.path.join(*path.split('/'))
        if scheme.lower() != 'http':
            location = os.path.join(scheme, location)
        # ignore query for the meantime
        return os.path.join(self.archivedir, location, path)

    def testinclude(self, url):
        scheme, location, path, query, fragment = urlparse.urlsplit(url)
        if scheme in ('mailto', 'javascript'): return False
        # TODO: add ability to specify site
        # return location.lower() == 'www.mcmillan-inc.com'
        return True

    def ensuredir(self, pathname):
        # create pathname and any missing parent directories
        if not os.path.isdir(pathname):
            self.ensuredir(os.path.dirname(pathname))
            os.mkdir(pathname)

    def retrieveurl(self, url):
        return urllib2.urlopen(url).read()

    def mirror(self, url):
        if url in self.urlmap:
            return
        else:
            filename = self.url2filename(url)
            if not self.testinclude(url):
                return
            print url,'->',filename
            self.urlmap[url] = filename # TODO: add an op...th('.'), sys.argv[2]) m.mirror(sys.argv[1])
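
For anyone trying this on Python 3, where `urlparse`, `urllib2` and `HTMLParser` were renamed to `urllib.parse`, `urllib.request` and `html.parser`, here is a rough sketch of the link-scanning and URL-to-filename ideas from the script above. It is not a port of the whole script (the rest of the posted code is cut off), and the names `LinkScanner` and `url_to_filename` are made up for illustration.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlsplit


class LinkScanner(HTMLParser):
    """Collect link targets from the same tags the posted script checks."""

    tags = {'a': 'href', 'img': 'src', 'frame': 'src', 'base': 'href'}

    def reset(self):
        # HTMLParser.__init__ calls reset(), so per-document state lives here.
        self.links = set()
        super().reset()

    def handle_starttag(self, tag, attrs):
        wanted = self.tags.get(tag)
        # As in the posted script, a <base> tag is not followed as a link.
        if wanted is None or tag == 'base':
            return
        for attr, value in attrs:
            if attr == wanted and value:
                # Drop any #fragment so each page is recorded once.
                self.links.add(urldefrag(value)[0])


def url_to_filename(archivedir, url):
    """Map a URL to a local path, ignoring the query string."""
    scheme, location, path, query, fragment = urlsplit(url)
    if not path or path.endswith('/'):
        path += 'index.html'
    parts = [p for p in path.split('/') if p]
    if scheme.lower() != 'http':
        # Keep non-http schemes in their own subdirectory.
        location = os.path.join(scheme, location)
    return os.path.join(archivedir, location, *parts)


scanner = LinkScanner()
scanner.feed('<a href="page.html#top">x</a><img src="logo.png">')
scanner.close()
print(sorted(scanner.links))
print(url_to_filename('mirror', 'http://example.com/docs/'))
```

The sketch only collects links; it leaves out the `replacements` list that the posted scanner also builds.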

Paul Rubin · Jul 1, 2004

David Fraser said:
I used this when trying to retrieve the McMillan site. Others might
find it useful...

Cool, thanks. I've done stuff like that a bunch of times, but I
usually just examine the HTML manually and identify a few fixed
strings to search for, to locate the links I want.
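
A rough illustration of that fixed-string approach, written for Python 3. The marker strings, the sample page and the function name `links_between` are invented for the example and not taken from any real site.

```python
def links_between(html, start_marker, end_marker):
    """Return every substring found between start_marker and end_marker."""
    found = []
    pos = 0
    while True:
        start = html.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        end = html.find(end_marker, start)
        if end == -1:
            break
        found.append(html[start:end])
        # Continue searching after the end marker just consumed.
        pos = end + len(end_marker)
    return found


page = ('<td><a class="dl" href="a.zip">A</a></td>'
        '<td><a class="dl" href="b.zip">B</a></td>')
print(links_between(page, '<a class="dl" href="', '"'))
```

This only works while the page keeps exactly the same markup around the links, which is the trade-off against running a real HTML parser.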
