urllib2 pinger : insight as to use, cause of hang-up?

Discussion in 'Python' started by EP, Jun 6, 2005.

  1. EP

    EP Guest

    Hello patient and tolerant Pythonistas,

    Iterating through a long list of arbitrary (and possibly syntactically flawed) URLs with a urllib2 pinging function, I get a hang-up. No exception is raised, however; according to Windows Task Manager, python.exe stops using any CPU time, neither increases nor decreases the memory it uses, and the script does not progress (permanently stalled, it seems). As an example, the function below has been stuck on url number 364 for ~40 minutes.

    Does this simply indicate the need for a time-out function, or could there be something else going on (error in my usage) I've overlooked?

    If it requires a time-out control, is there a way to implement that without using separate threads? Any best practice recommendations?

    Here's my function:

    --------------------------------------------------
    def testLinks2(urlList=[]):
        import urllib2
        goodLinks = []
        badLinks = []
        user_agent = 'mySpider Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        print len(urlList), " links to test"
        count = 0
        for url in urlList:
            count += 1
            print count,
            try:
                request = urllib2.Request(url)
                request.add_header('User-Agent', user_agent)
                handle = urllib2.urlopen(request)
                goodLinks.append(url)
            except urllib2.HTTPError, e:
                badLinks.append({url: e.code})
                print e.code, ": ", url
            except:
                print "unknown error: ", url
                badLinks.append({url: "unknown error"})
        print len(goodLinks), " working links found"
        return goodLinks, badLinks

    good, bad = testLinks2(linkList)
    --------------------------------------------------

    Thanks in advance for your thoughts.



    Eric Pederson
    EP, Jun 6, 2005
    #1

  2. Mahesh

    Mahesh Guest

    Timing it out will probably solve it.
    Mahesh, Jun 6, 2005
    #2

  3. EP

    EP Guest

    "Mahesh" advised:
    >
    > Timing it out will probably solve it.
    >



    Thanks.

    Follow-on question regarding implementing a timeout for use by urllib2: I am guessing the simplest way to do this is via socket.setdefaulttimeout(), but I am not sure whether this sets a global parameter, and if so, whether it might be reset by instantiations of urllib, urllib2, httplib, etc. I assume socket and its timeout parameter are in the global namespace and that I can simply reset the timeout at will, applying it to all 'users' of the socket module. Is that right?

    (TIA)


    [experimenting]

    >>> import urllib2plus
    >>> urllib2plus.setSocketTimeOut(1)
    >>> urllib2plus.urlopen('http://zomething.com')


    Traceback (most recent call last):
      File "<pyshell#52>", line 1, in -toplevel-
        urllib2plus.urlopen('http://zomething.com')
      File "C:\Python24\lib\urllib2plus.py", line 130, in urlopen
        return _opener.open(url, data)
      File "C:\Python24\lib\urllib2plus.py", line 361, in open
        response = self._open(req, data)
      File "C:\Python24\lib\urllib2plus.py", line 379, in _open
        '_open', req)
      File "C:\Python24\lib\urllib2plus.py", line 340, in _call_chain
        result = func(*args)
      File "C:\Python24\lib\urllib2plus.py", line 1024, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "C:\Python24\lib\urllib2plus.py", line 999, in do_open
        raise URLError(err)
    URLError: <urlopen error timed out>

    >>> urllib2plus.setSocketTimeOut(10)
    >>> urllib2plus.urlopen('http://zomething.com')

    <addinfourl at 12449152 whose fp = <socket._fileobject object at 0x00BE1340>>

    >>> import socket
    >>> socket.setdefaulttimeout(0)
    >>> urllib2plus.urlopen('http://zomething.com')

    Traceback (most recent call last):
      File "<pyshell#60>", line 1, in -toplevel-
        urllib2plus.urlopen('http://zomething.com')
      File "C:\Python24\lib\urllib2plus.py", line 130, in urlopen
        return _opener.open(url, data)
      File "C:\Python24\lib\urllib2plus.py", line 361, in open
        response = self._open(req, data)
      File "C:\Python24\lib\urllib2plus.py", line 379, in _open
        '_open', req)
      File "C:\Python24\lib\urllib2plus.py", line 340, in _call_chain
        result = func(*args)
      File "C:\Python24\lib\urllib2plus.py", line 1024, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "C:\Python24\lib\urllib2plus.py", line 999, in do_open
        raise URLError(err)
    URLError: <urlopen error (10035, 'The socket operation could not complete without blocking')>
    >>> socket.setdefaulttimeout(1)
    >>> urllib2plus.urlopen('http://zomething.com')

    <addinfourl at 12449992 whose fp = <socket._fileobject object at 0x00BE1420>>
    EP, Jun 6, 2005
    #3
  4. Mahesh

    Mahesh Guest

    socket.setdefaulttimeout() is what I have used in the past and it has
    worked well. I think it is set in the global namespace, though I could
    be wrong. I think it retains its value within the module it is called
    in. If you use it in a different module it will probably get reset,
    though it is easy enough to test that out.
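    [Easy enough to test indeed: the value lives inside the socket module itself, so it is one setting per process, not per importing module. A quick sketch, in current Python where the behaviour is the same:]

```python
import socket

# setdefaulttimeout() stores a single process-wide value inside the
# socket module; every module in the same process sees the same value.
socket.setdefaulttimeout(5.0)
print(socket.getdefaulttimeout())   # -> 5.0, whichever module asks

# Sockets pick the default up at creation time:
s = socket.socket()
print(s.gettimeout())               # -> 5.0
s.close()

# None restores the original "block forever" behaviour.
socket.setdefaulttimeout(None)
```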
    Mahesh, Jun 6, 2005
    #4
