Hello patient and tolerant Pythonistas,
Iterating through a long list of arbitrary (and possibly syntactically flawed) URLs with a urllib2 pinging function, I get a hang. No exception is raised; however, according to Windows Task Manager, python.exe stops using any CPU time, its memory use neither increases nor decreases, and the script does not progress (permanently stalled, it seems). As an example, the function below has been stuck on URL number 364 for ~40 minutes.
Does this simply indicate the need for a time-out function, or could there be something else going on (error in my usage) I've overlooked?
If it requires a time-out control, is there a way to implement that without using separate threads? Any best practice recommendations?
Here's my function:
--------------------------------------------------
def testLinks2(urlList=[]):
    import urllib2
    goodLinks=[]
    badLinks=[]
    user_agent = 'mySpider Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    print len(urlList), " links to test"
    count=0
    for url in urlList:
        count+=1
        print count,
        try:
            request = urllib2.Request(url)
            request.add_header('User-Agent', user_agent)
            handle = urllib2.urlopen(request)
            goodLinks.append(url)
        except urllib2.HTTPError, e:
            badLinks.append({url:e.code})
            print e.code,": ",url
        except:
            print "unknown error: ",url
            badLinks.append({url:"unknown error"})
    print len(goodLinks)," working links found"
    return goodLinks, badLinks

good, bad=testLinks2(linkList)
--------------------------------------------------
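
To make the time-out question concrete, here is a minimal sketch of one thread-free approach: setting a process-wide default socket timeout with socket.setdefaulttimeout before calling urlopen. It is an illustration under assumptions, not a confirmed fix for the stall described above. The name check_link and the 10-second default are hypothetical, and the import fallback exists only so the sketch also runs on Python 3.

```python
import socket

try:
    import urllib2  # Python 2, as used in testLinks2 above
except ImportError:
    # Assumption: fall back to the Python 3 equivalent so the sketch runs there too.
    import urllib.request as urllib2


def check_link(url, timeout_seconds=10):
    """Hypothetical helper: return (ok, detail) for a single URL.

    The timeout is a process-wide default applied to sockets created
    after this call; no separate threads are involved.
    """
    socket.setdefaulttimeout(timeout_seconds)
    try:
        handle = urllib2.urlopen(url)
        handle.close()
        return True, "ok"
    except Exception as e:
        # Covers HTTP errors, malformed URLs, and socket time-outs alike.
        return False, str(e)
```

Note that the timeout applies to each blocking socket operation rather than to the request as a whole, so a server that trickles data slowly could still take longer than timeout_seconds overall.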
Thanks in advance for your thoughts.
Eric Pederson