Link Checking Issues - Subdomains


rpupkin77

Hi,

I have written this script to run as a cron job that loops through a
text file containing a list of URLs. It works fine for most of the links;
however, a number of the URLs are subdomains (they are government sites),
such as http://basename.airforce.mil, and these links always throw 400
errors even though the site exists.

Is there a way to get around this?

Here is the script:

import httplib
from urlparse import urlparse

class LinkChecker:

    def oldStuff(self, url):
        # earlier HEAD-based check, kept for reference
        p = urlparse(url)
        h = httplib.HTTP(p[1])
        h.putrequest('HEAD', p[2])
        h.endheaders()
        if h.getreply()[0] == 200: return 1
        else: return 0

    def check(self):
        print "\nLooping through the file, line by line."

        # define default values for the parameters
        text_file = open("/home/jjaffe/pythonModules/JAMRSscripts/urls.txt", "r")
        output = ""
        errors = "=================== ERRORS (website exists but 404, 503 etc.): ===================\n"
        failures = "\n=================== FAILURES (cannot connect to website at all): ===================\n"
        eCount = 0
        fCount = 0

        # loop through each line and see what the response code is
        for line in text_file:
            p = urlparse(line)
            try:
                conn = httplib.HTTPConnection(p[1])
                conn.request("GET", p[2])
                r1 = conn.getresponse()
                if r1.status != 200:  # the response code was not success (200), so report the error
                    errors += "\n "+str(r1.status)+" error for: "+p[1]+p[2]
                    eCount = (eCount + 1)
                data1 = r1.read()
                conn.close()
            except:  # the connection attempt failed (e.g. timed out), so the site is treated as unreachable
                failures += "\n Could not create connection object: "+p[1]+p[2]
                fCount = (fCount + 1)
        text_file.close()

        # see if there were errors and create output string
        if (eCount == 0) and (fCount == 0):
            output = "No errors or failures to report"
        else:
            output = errors+"\n\n"+failures

        print output

if __name__ == '__main__':
    lc = LinkChecker()
    lc.check()
    del lc


Thanks in advance.
 

Terry Reedy

rpupkin77 said:
Hi,

I have written this script to run as a cron job that loops through a
text file containing a list of URLs. It works fine for most of the links;
however, a number of the URLs are subdomains (they are government sites),
such as http://basename.airforce.mil, and these links always throw 400
errors even though the site exists.

Have you looked at urllib/urllib2 (urllib.request in 3.0)
for checking links?
If 'http://basename.airforce.mil' works when typed into your browser,
this passage from the doc for urllib.request.Request might be relevant:

"headers should be a dictionary, and will be treated as if add_header()
was called with each key and value as arguments. This is often used to
“spoof” the User-Agent header, which is used by a browser to identify
itself – some HTTP servers only allow requests coming from common
browsers as opposed to scripts. For example, Mozilla Firefox may
identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127
Firefox/2.0.0.11", while urllib's default user agent string is
"Python-urllib/2.6" (on Python 2.6)."
 
