urllib (54, 'Connection reset by peer') error

chrispoliquin

Hi,

I have a small Python script to fetch some pages from the internet.
There are a lot of pages and I am looping through them and then
downloading the page using urlretrieve() in the urllib module.

The problem is that after 110 pages or so the script sort of hangs and
then I get the following traceback:
Traceback (most recent call last):
  File "volume_archiver.py", line 21, in <module>
    urllib.urlretrieve(remotefile,localfile)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 89, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 222, in retrieve
    fp = self.open(url, data)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 328, in open_http
    errcode, errmsg, headers = h.getreply()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 1195, in getreply
    response = self._conn.getresponse()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 924, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 385, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 343, in _read_status
    line = self.fp.readline()
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/socket.py", line 331, in readline
    data = recv(1)
IOError: [Errno socket error] (54, 'Connection reset by peer')
My script code is as follows:
-----------------------------------------
import os
import urllib

volume_number = 149  # Volumes are numbered 150 to 544

while volume_number < 544:
    volume_number = volume_number + 1
    localfile = '/Users/Chris/Desktop/Decisions/' + str(volume_number) + '.html'
    remotefile = 'http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=us&navby=vol&vol=' + str(volume_number)
    print 'Getting volume number:', volume_number
    urllib.urlretrieve(remotefile, localfile)

print 'Download complete.'
-----------------------------------------

Once I get the error, running the script again doesn't do much
good. It usually gets two or three pages and then hangs again.

What is causing this?
 

Chris


The server is causing it; you could alter your code like this:

import os
import urllib
import time

volume_number = 149  # Volumes are numbered 150 to 544
localfile = '/Users/Chris/Desktop/Decisions/%s.html'
remotefile = 'http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=us&navby=vol&vol=%s'

while volume_number < 544:
    volume_number += 1
    print 'Getting volume number:', volume_number
    try:
        urllib.urlretrieve(remotefile % volume_number, localfile % volume_number)
    except IOError:
        volume_number -= 1
        time.sleep(5)

print 'Download complete.'

That way if the attempt fails it rolls back the volume number, pauses
for a few seconds and tries again.
 

Jeff McNeil

It means your client received a TCP segment with the reset (RST) bit
set. The 'peer' will toss one your way if it determines that a
connection is no longer valid or if it receives a bad sequence number.
If I had to hazard a guess, I'd say it's probably a network device on
the server side trying to stop you from running a mass download
(especially if it's easily repeatable and happens at about the same
byte range).

-Jeff
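For anyone who wants to retry only on this specific condition rather than on any IOError, the reset surfaces as an errno on the exception. A minimal check might look like this in modern Python 3 (where socket.error is just an alias of OSError; the function name is made up for illustration):

```python
import errno
import socket

def is_connection_reset(exc):
    """True if `exc` is the OS-level 'Connection reset by peer' error.

    Errno 54 is the BSD/macOS value seen in the traceback above;
    Linux uses 104. errno.ECONNRESET maps to the right value locally.
    """
    return isinstance(exc, (socket.error, OSError)) and exc.errno == errno.ECONNRESET
```

This lets a retry loop distinguish a server-side reset from, say, a local disk error that should abort the run.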




 

chrispoliquin

Thanks for the help. The error handling worked to a certain extent,
but after a while the server does seem to stop responding to my
requests.

I have a list of about 7,000 links to pages whose HTML I want to parse
(it's basically a web crawler), but after a certain number of
urlretrieve() or urlopen() calls the server just stops responding.
Does anyone know of a way to get around this? I don't own the server,
so I can't make any modifications on that side.
 

Tim Golden


I think someone's already mentioned this, but it's almost
certainly explicit or implicit throttling on the remote server.
If you're pulling 7,000 pages from a single server you need to
be sure that you're within the Terms of Use of that service, or
at the least you should contact the maintainers as a courtesy to
confirm that this is acceptable.

If you don't, you may well cause your IP block to be banned on
their network, which could affect others as well as yourself.

TJG
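On the courtesy point: if the site publishes a robots.txt, the standard library can tell you whether crawling a given path is welcome. A sketch with urllib.robotparser follows (modern Python 3; the example.com rules are invented for illustration, and nothing in this thread establishes what findlaw actually publishes):

```python
from urllib.robotparser import RobotFileParser

# Normally you'd call rp.set_url('http://example.com/robots.txt') and
# rp.read(); here the rules are parsed inline to keep the sketch offline.
sample_rules = """\
User-agent: *
Disallow: /scripts/
"""

rp = RobotFileParser()
rp.parse(sample_rules.splitlines())

print(rp.can_fetch('*', 'http://example.com/scripts/getcase.pl'))  # False
print(rp.can_fetch('*', 'http://example.com/index.html'))          # True
```

Checking this before a 7,000-page run costs one request and may save your IP block from a ban.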
 

John Nagle


Interestingly, "lp.findlaw.com" doesn't have any visible terms of service.
The information being downloaded is case law, which is public domain, so
there's no copyright issue. Some throttling and retry logic is needed to
slow the process down, but it should be fixable.

Try this: put in the retry code someone else suggested, but use a
variable retry delay, and wait one retry delay between downloads.
Whenever a download fails, double the retry delay and try again; don't
let it get bigger than, say, 256 seconds. Whenever a download succeeds,
halve the retry delay, but don't let it get smaller than 1 second.
That will make your downloader self-tune to the throttling imposed by
the server.

John Nagle
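That adaptive scheme can be sketched as follows. This is modern Python 3, and the `fetch` callable is injected (a stand-in for the urlretrieve call in the scripts above) so the pacing logic stands on its own:

```python
import time

def fetch_all(urls, fetch, min_delay=1.0, max_delay=256.0):
    """Download every URL, self-tuning the delay to the server's throttling.

    On each failure the delay doubles (capped at max_delay); on each
    success it halves (floored at min_delay). One delay is also slept
    between consecutive downloads, as suggested above.
    """
    delay = min_delay
    for url in urls:
        done = False
        while not done:
            try:
                fetch(url)
                done = True
                delay = max(delay / 2, min_delay)  # success: relax
            except IOError:
                delay = min(delay * 2, max_delay)  # failure: back off
            time.sleep(delay)  # pause before the next attempt or file
```

The 1-second floor and 256-second cap are the defaults suggested in the post above; both are parameters so you can tune them to the server's behavior.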
 
