urllib timeout issues


supercooper

I am downloading images using the script below. Sometimes it will go
for 10 mins, sometimes 2 hours before timing out with the following
error:

Traceback (most recent call last):
  File "ftp_20070326_Downloads_cooperc_FetchLibreMapProjectDRGs.py", line 108, in ?
    urllib.urlretrieve(fullurl, localfile)
  File "C:\Python24\lib\urllib.py", line 89, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "C:\Python24\lib\urllib.py", line 222, in retrieve
    fp = self.open(url, data)
  File "C:\Python24\lib\urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "C:\Python24\lib\urllib.py", line 322, in open_http
    return self.http_error(url, fp, errcode, errmsg, headers)
  File "C:\Python24\lib\urllib.py", line 335, in http_error
    result = method(url, fp, errcode, errmsg, headers)
  File "C:\Python24\lib\urllib.py", line 593, in http_error_302
    data)
  File "C:\Python24\lib\urllib.py", line 608, in redirect_internal
    return self.open(newurl)
  File "C:\Python24\lib\urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "C:\Python24\lib\urllib.py", line 313, in open_http
    h.endheaders()
  File "C:\Python24\lib\httplib.py", line 798, in endheaders
    self._send_output()
  File "C:\Python24\lib\httplib.py", line 679, in _send_output
    self.send(msg)
  File "C:\Python24\lib\httplib.py", line 646, in send
    self.connect()
  File "C:\Python24\lib\httplib.py", line 630, in connect
    raise socket.error, msg
IOError: [Errno socket error] (10060, 'Operation timed out')


I have searched this forum extensively and tried to avoid timing out,
but to no avail. Does anyone have any ideas as to why I keep getting a
timeout? I thought setting the socket timeout did it, but it didn't.

Thanks.

<--- CODE --->

# Imports needed by the script below (truncated from the original post).
# AddPrintMessage, StartFinishMessage and Timer are helper functions defined
# elsewhere in the poster's script and are not shown here.
import os, sys, socket, string, time, urllib
from time import strftime, localtime

images = [['34095e3','Clayton'],
['35096d2','Clearview'],
['34095d1','Clebit'],
['34095c3','Cloudy'],
['34096e2','Coalgate'],
['34096e1','Coalgate SE'],
['35095g7','Concharty Mountain'],
['34096d6','Connerville'],
['34096d5','Connerville NE'],
['34096c5','Connerville SE'],
['35094f8','Cookson'],
['35095e6','Council Hill'],
['34095f5','Counts'],
['35095h6','Coweta'],
['35097h2','Coyle'],
['35096c4','Cromwell'],
['35095a6','Crowder'],
['35096h7','Cushing']]

exts = ['tif', 'tfw']
envir = 'DEV'
# URL of our image(s) to grab
url = 'http://www.archive.org/download/'
logRoot = '//fayfiler/seecoapps/Geology/GEOREFRENCED IMAGES/TOPO/Oklahoma UTMz14meters NAD27/'
logFile = os.path.join(logRoot, 'FetchLibreDRGs_' + strftime('%m_%d_%Y_%H_%M_%S', localtime()) + '_' + envir + '.log')

# Local dir to store files in
fetchdir = logRoot
# Entire process start time
start = time.clock()

msg = envir + ' - ' + "Script: " + os.path.join(sys.path[0], sys.argv[0]) + \
      ' - Start time: ' + strftime('%m/%d/%Y %I:%M:%S %p', localtime()) + \
      '\n--------------------------------------------------------------------------------------------------------------\n\n'
AddPrintMessage(msg)
StartFinishMessage('Start')

# Loop thru image list, grab each tif and tfw
for image in images:
    # Try and set socket timeout default to none
    # Create a new socket connection for every time through list loop
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('archive.org', 80))
    s.settimeout(None)

    s2 = time.clock()
    msg = '\nProcessing ' + image[0] + ' --> ' + image[1]
    AddPrintMessage(msg)
    print msg
    for ext in exts:
        fullurl = url + 'usgs_drg_ok_' + image[0][:5] + '_' + image[0][5:] + '/o' + image[0] + '.' + ext
        localfile = fetchdir + image[0] + '_' + string.replace(image[1], ' ', '_') + '.' + ext
        urllib.urlretrieve(fullurl, localfile)
    e2 = time.clock()
    msg = '\nDone processing ' + image[0] + ' --> ' + image[1] + \
          '\nProcess took ' + Timer(s2, e2)
    AddPrintMessage(msg)
    print msg
    # Close socket connection, only to reopen with next run thru loop
    s.close()

end = time.clock()
StartFinishMessage('Finish')
msg = '\n\nDone! Process completed in ' + Timer(start, end)
AddPrintMessage(msg)
 

Gabriel Genellina

I am downloading images using the script below. Sometimes it will go
for 10 mins, sometimes 2 hours before timing out with the following
error:

urllib.urlretrieve(fullurl, localfile)
IOError: [Errno socket error] (10060, 'Operation timed out')

I have searched this forum extensively and tried to avoid timing out,
but to no avail. Anyone have any ideas as to why I keep getting a
timeout? I thought setting the socket timeout did it, but it didnt.

You should do the opposite: timing out *early* -not waiting 2 hours- and
handling the error (maybe using a queue to hold pending requests)
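For example, here is a minimal sketch of that approach, in the Python 2.4
style of the script above (the 30-second timeout, the retry count and the
single example URL are illustrative assumptions, not anything the script
requires):

import socket, urllib

# Fail fast: abort a stalled connection after 30 seconds instead of
# letting it hang for hours (the value is just an example).
socket.setdefaulttimeout(30)

# Pending work: (url, local filename) pairs still to fetch.
# One hypothetical entry, in the same pattern as the script above.
pending = [('http://www.archive.org/download/usgs_drg_ok_34095_e3/o34095e3.tif',
            'o34095e3.tif')]
failed = []

MAX_RETRIES = 3
while pending:
    fullurl, localfile = pending.pop(0)
    for attempt in range(MAX_RETRIES):
        try:
            urllib.urlretrieve(fullurl, localfile)
            break                                   # success, move on
        except (IOError, socket.error), e:
            print 'attempt %d failed for %s: %s' % (attempt + 1, fullurl, e)
    else:
        failed.append((fullurl, localfile))         # give up on this one for now

if failed:
    print 'Could not fetch:', failed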
 

supercooper

I am downloading images using the script below. Sometimes it will go
for 10 mins, sometimes 2 hours before timing out with the following
error:
urllib.urlretrieve(fullurl, localfile)
IOError: [Errno socket error] (10060, 'Operation timed out')
I have searched this forum extensively and tried to avoid timing out,
but to no avail. Anyone have any ideas as to why I keep getting a
timeout? I thought setting the socket timeout did it, but it didnt.

You should do the opposite: timing out *early* -not waiting 2 hours- and
handling the error (maybe using a queue to hold pending requests)

Gabriel, thanks for the input. So are you saying there is no way to
realistically *prevent* the timeout from occurring in the first
place? And by timing out early, do you mean to set the timeout for x
seconds and if and when the timeout occurs, handle the error and start
the process again somehow on the pending requests? Thanks.

chad
 

Gabriel Genellina

I am downloading images using the script below. Sometimes it will go
for 10 mins, sometimes 2 hours before timing out with the following
error:
urllib.urlretrieve(fullurl, localfile)
IOError: [Errno socket error] (10060, 'Operation timed out')
I have searched this forum extensively and tried to avoid timing out,
but to no avail. Anyone have any ideas as to why I keep getting a
timeout? I thought setting the socket timeout did it, but it didnt.

You should do the opposite: timing out *early* -not waiting 2 hours- and
handling the error (maybe using a queue to hold pending requests)

Gabriel, thanks for the input. So are you saying there is no way to
realistically *prevent* the timeout from occurring in the first

Exactly. The error is out of your control: maybe the server is down,
unresponsive or overloaded, a proxy has problems, there is some network
problem, etc.
place? And by timing out early, do you mean to set the timeout for x
seconds and if and when the timeout occurs, handle the error and start
the process again somehow on the pending requests? Thanks.

Exactly!
Another option: Python is cool, but there is no need to reinvent the
wheel. Use wget instead :)
 

Nick Vatamaniuc

On Tue, 27 Mar 2007 16:21:55 -0300, supercooper <[email protected]>
wrote:
I am downloading images using the script below. Sometimes it will go
for 10 mins, sometimes 2 hours before timing out with the following
error:
urllib.urlretrieve(fullurl, localfile)
IOError: [Errno socket error] (10060, 'Operation timed out')
I have searched this forum extensively and tried to avoid timing out,
but to no avail. Anyone have any ideas as to why I keep getting a
timeout? I thought setting the socket timeout did it, but it didnt.
You should do the opposite: timing out *early* -not waiting 2 hours- and
handling the error (maybe using a queue to hold pending requests)

Gabriel, thanks for the input. So are you saying there is no way to
realistically *prevent* the timeout from occurring in the first
place? And by timing out early, do you mean to set the timeout for x
seconds and if and when the timeout occurs, handle the error and start
the process again somehow on the pending requests? Thanks.

chad

Chad,

Just run the retrieval in a Thread. If the thread is not done after x
seconds, then handle it as a timeout and then retry, ignore, quit or
anything else you want.
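
A rough sketch of that (the helper name and the 60-second limit are made
up for illustration; note that the hung download itself cannot be killed,
only abandoned):

import threading, urllib

def fetch(fullurl, localfile):
    urllib.urlretrieve(fullurl, localfile)

def fetch_with_timeout(fullurl, localfile, seconds=60):
    """Run one retrieval in a worker thread; a slow worker counts as a timeout."""
    worker = threading.Thread(target=fetch, args=(fullurl, localfile))
    worker.setDaemon(True)       # a hung download won't keep the process alive at exit
    worker.start()
    worker.join(seconds)         # wait at most `seconds` for it to finish
    return not worker.isAlive()  # True = finished in time, False = treat as timeout

# Example use: retry once before giving up on this (hypothetical) image.
fullurl = 'http://www.archive.org/download/usgs_drg_ok_34095_e3/o34095e3.tif'
if not fetch_with_timeout(fullurl, 'o34095e3.tif'):
    print 'timed out, retrying once...'
    fetch_with_timeout(fullurl, 'o34095e3.tif')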

Even better, what I did for my program was first gather all the URLs (I
assume you can do that), then group them by server, i.e. n images from
foo.com, m from bar.org, and so on. Then start a thread for each server
(possibly with some maximum number of threads); each of those threads is
responsible for retrieving images from only one server (this is to
prevent a DoS pattern). Let each of the server threads start a 'small'
retriever thread for each image (this is to handle the timeout you
mention).

So you have two kinds of threads: one per server to parallelize
downloading, each of which in turn spawns one thread per download to
handle the timeout. This way you will (ideally) saturate your bandwidth,
but you only fetch one image per server at a time, so you still 'play
nice' with each of the servers. If you want a maximum number of server
threads running (in case you have way too many servers to deal with),
then run batches of server threads.
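
In outline, the grouping part might look something like this (it reuses
the fetch_with_timeout helper sketched above; the urlparse-based grouping
and the single example job are assumptions for illustration):

import threading, urlparse

def group_by_server(jobs):
    """jobs: list of (url, localfile) pairs -> dict mapping hostname to its jobs."""
    groups = {}
    for url, localfile in jobs:
        host = urlparse.urlsplit(url)[1]          # network location, e.g. 'www.archive.org'
        groups.setdefault(host, []).append((url, localfile))
    return groups

def server_worker(job_list):
    """One thread per server: fetch that server's images one at a time."""
    for url, localfile in job_list:
        fetch_with_timeout(url, localfile)        # per-download timeout thread, as above

jobs = [('http://www.archive.org/download/usgs_drg_ok_34095_e3/o34095e3.tif',
         'o34095e3.tif')]
threads = [threading.Thread(target=server_worker, args=(job_list,))
           for job_list in group_by_server(jobs).values()]
for t in threads:
    t.start()
for t in threads:
    t.join()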

Hope this helps,
Nick Vatamaniuc
 

supercooper

On Tue, 27 Mar 2007 17:41:44 -0300, supercooper <[email protected]>
wrote:


On Tue, 27 Mar 2007 16:21:55 -0300, supercooper <[email protected]>
wrote:
I am downloading images using the script below. Sometimes it will go
for 10 mins, sometimes 2 hours before timing out with the following
error:
urllib.urlretrieve(fullurl, localfile)
IOError: [Errno socket error] (10060, 'Operation timed out')
I have searched this forum extensively and tried to avoid timing out,
but to no avail. Anyone have any ideas as to why I keep getting a
timeout? I thought setting the socket timeout did it, but it didnt.
You should do the opposite: timing out *early* -not waiting 2 hours- and
handling the error (maybe using a queue to hold pending requests)
Gabriel, thanks for the input. So are you saying there is no way to
realistically *prevent* the timeout from occurring in the first

Exactly. The error is out of your control: maybe the server is down,
unresponsive or overloaded, a proxy has problems, there is some network
problem, etc.
place? And by timing out early, do you mean to set the timeout for x
seconds and if and when the timeout occurs, handle the error and start
the process again somehow on the pending requests? Thanks.

Exactly!
Another option: Python is cool, but there is no need to reinvent the
wheel. Use wget instead :)

Gabriel...thanks for the tip on wget...it's awesome! I even built it on
my Mac. It is working like a champ for hours on end...

Thanks!

chad




import os, shutil, string

images = [['34095d2','Nashoba'],
['34096c8','Nebo'],
['36095a4','Neodesha'],
['33095h7','New Oberlin'],
['35096f3','Newby'],
['35094e5','Nicut'],
['34096g2','Non'],
['35096h6','North Village'],
['35095g3','Northeast Muskogee'],
['35095g4','Northwest Muskogee'],
['35096f2','Nuyaka'],
['34094e6','Octavia'],
['36096a5','Oilton'],
['35096d3','Okemah'],
['35096c3','Okemah SE'],
['35096e2','Okfuskee'],
['35096e1','Okmulgee Lake'],
['35095f7','Okmulgee NE'],
['35095f8','Okmulgee North'],
['35095e8','Okmulgee South'],
['35095e4','Oktaha'],
['34094b7','Old Glory Mountain'],
['36096a4','Olive'],
['34096d3','Olney'],
['36095a6','Oneta'],
['34097a2','Overbrook']]

wgetDir = 'C:/Program Files/wget/o'
exts = ['tif', 'tfw']
url = 'http://www.archive.org/download/'
home = '//fayfiler/seecoapps/Geology/GEOREFRENCED IMAGES/TOPO/Oklahoma UTMz14meters NAD27/'

for image in images:
    for ext in exts:
        fullurl = url + 'usgs_drg_ok_' + image[0][:5] + '_' + image[0][5:] + '/o' + image[0] + '.' + ext
        os.system('wget %s -t 10 -a log.log' % fullurl)
        shutil.move(wgetDir + image[0] + '.' + ext,
                    home + 'o' + image[0] + '_' + string.replace(image[1], ' ', '_') + '.' + ext)
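
For what it's worth, wget also has a -T/--timeout option, and the return
value of os.system() can be checked so a failed tile is remembered rather
than silently skipped. A possible variation on the loop above (the
30-second value and the failed list are illustrative additions):

failed = []
for image in images:
    for ext in exts:
        fullurl = url + 'usgs_drg_ok_' + image[0][:5] + '_' + image[0][5:] + '/o' + image[0] + '.' + ext
        # -T: network timeout in seconds, -t: number of tries, -a: append to a log file
        status = os.system('wget %s -T 30 -t 10 -a log.log' % fullurl)
        if status != 0:
            failed.append(fullurl)     # remember it so it can be retried or reported later
        else:
            shutil.move(wgetDir + image[0] + '.' + ext,
                        home + 'o' + image[0] + '_' + string.replace(image[1], ' ', '_') + '.' + ext)

if failed:
    print 'Could not fetch:', failed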
 
