404 errors

Discussion in 'Python' started by Derek Fountain, Apr 27, 2004.

  1. I'm probably a bit off topic with this, but I'm not sure where else to ask.
    Hopefully someone here will know the answer.

    I'm writing a script (in Python) which reads a webpage from a user supplied
    URL using urllib.urlopen. I want to detect an error from the server. If the
    server doesn't exist, that's easy - catch the IOError. However, if the
    server exists, but the path in the URL is wrong, how do I detect the error?
    Some servers respond with a nicely formatted bit of HTML explaining the
    problem, which is fine for a human, but not for a script. Is there some
    flag or something definitive on the response which says "this is a 404
    error"?
    Derek Fountain, Apr 27, 2004
    #1

  2. Tut (Guest)

    Tue, 27 Apr 2004 11:00:57 +0800, Derek Fountain wrote:

    > Some servers respond with a nicely formatted bit of HTML explaining the
    > problem, which is fine for a human, but not for a script. Is there some
    > flag or something definitive on the response which says "this is a 404
    > error"?


    Maybe catch the urllib2.HTTPError?
    Tut, Apr 27, 2004
    #2

  3. Ivan Karajas (Guest)

    On Tue, 27 Apr 2004 10:46:47 +0200, Tut wrote:

    > Tue, 27 Apr 2004 11:00:57 +0800, Derek Fountain wrote:
    >
    >> Some servers respond with a nicely formatted bit of HTML explaining the
    >> problem, which is fine for a human, but not for a script. Is there some
    >> flag or something definitive on the response which says "this is a 404
    >> error"?

    >
    > Maybe catch the urllib2.HTTPError?


    This kind of answers the question. urllib will let you read whatever it
    receives, regardless of the HTTP status; you need to use urllib2 if you
    want to find out the status code when a request results in an error (any
    HTTP status beginning with a 4 or 5). This can be done like so:

    import urllib2
    try:
        asock = urllib2.urlopen("http://www.foo.com/qwerty.html")
    except urllib2.HTTPError, e:
        print e.code

    The value in urllib2.HTTPError.code comes from the first line of the web
    server's HTTP response, just before the headers begin, e.g. "HTTP/1.1 200
    OK", or "HTTP/1.1 404 Not Found".

    One thing you need to be aware of is that some web sites don't behave as
    you would expect them to; e.g. responding with a redirection rather than a
    404 error when you request a page that doesn't exist. In these
    cases you might still have to rely on some clever scripting.
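
    A rough way to catch that case (untested sketch; it just compares the final
    URL after redirects with the one you asked for -- the URL here is only the
    example from above):

    import urllib2

    requested = "http://www.foo.com/qwerty.html"
    response = urllib2.urlopen(requested)
    # urlopen follows redirects silently, so a changed URL is a hint that
    # the server bounced us to an error page instead of returning a 404
    if response.geturl() != requested:
        print "redirected to", response.geturl()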

    Cheers,

    Ivan
    Ivan Karajas, Apr 29, 2004
    #3
  4. John J. Lee (Guest)

    Ivan Karajas <> writes:

    > On Tue, 27 Apr 2004 10:46:47 +0200, Tut wrote:
    >
    > > Tue, 27 Apr 2004 11:00:57 +0800, Derek Fountain wrote:
    > >
    > >> Some servers respond with a nicely formatted bit of HTML explaining the
    > >> problem, which is fine for a human, but not for a script. Is there some
    > >> flag or something definitive on the response which says "this is a 404
    > >> error"?

    > >
    > > Maybe catch the urllib2.HTTPError?

    >
    > This kind of answers the question. urllib will let you read whatever it
    > receives, regardless of the HTTP status; you need to use urllib2 if you
    > want to find out the status code when a request results in an error (any
    > HTTP status beginning with a 4 or 5). This can be done like so:


    FWIW, note that urllib2's own idea of an error (ie. something for
    which it throws a response object as an HTTPError exception rather
    than returning it) is: 'anything other than 200 is an error'. The
    only exceptions are where some responses happen to be handled by
    urllib2 handlers (eg. 302), or at a lower level by httplib (eg. 100).
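
    (An HTTPError is itself a file-like response object, so you can still read
    the server's error page after catching it -- a quick illustration, reusing
    the URL from Ivan's example:)

    import urllib2

    try:
        r = urllib2.urlopen("http://www.foo.com/qwerty.html")
    except urllib2.HTTPError, e:
        print e.code           # eg. 404
        print e.read()[:200]   # start of the server's own error page, if any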


    > import urllib2
    > try:
    >     asock = urllib2.urlopen("http://www.foo.com/qwerty.html")
    > except urllib2.HTTPError, e:
    >     print e.code
    >
    > The value in urllib2.HTTPError.code comes from the first line of the web
    > server's HTTP response, just before the headers begin, e.g. "HTTP/1.1 200
    > OK", or "HTTP/1.1 404 Not Found".
    >
    > One thing you need to be aware of is that some web sites don't behave as
    > you would expect them to; e.g. responding with a redirection rather than a
    > 404 error when you request a page that doesn't exist. In these
    > cases you might still have to rely on some clever scripting.


    The following kind of functionality is in urllib2 in Python 2.4 (there
    are some loose ends, which I will tie up soon). It's slightly simpler
    in 2.4 than in my ClientCookie clone of that module, but (UNTESTED):

    import ClientCookie
    from ClientCookie._Util import response_seek_wrapper

    class BadResponseProcessor(ClientCookie.BaseProcessor):
        # Convert apparently-successful 200 OK or 30x redirection responses to
        # 404s iff they contain tell-tale text that indicates failure.

        def __init__(self, diagnostic_text):
            self.diagnostic_text = diagnostic_text

        def http_response(self, request, response):
            if not hasattr(response, "seek"):
                response = response_seek_wrapper(response)

            if response.code in [200, 301, 302, 303, 307]:
                ct = response.info().getheaders("content-type")
                if ct and ct[0].startswith("text/html"):
                    try:
                        data = response.read(4096)
                        if self.diagnostic_text in data:
                            response.code = 404
                    finally:
                        response.seek(0)
            return response

        https_response = http_response

    brp = BadResponseProcessor("Whoops, an error occurred.")
    opener = ClientCookie.build_opener(brp)

    r = opener.open("http://nonstandard.com/bad/url")
    assert r.code == 404


    Hmm, looking at that, I suppose it would be better done *after*
    redirection (which is quite possible, with the modifications I've
    made, without needing any heavy subclassing or other hacks -- use the
    processor_order attribute). You'd then just check for 200 rather than
    200 or 30x in the code above.
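
    Something like this, I think (UNTESTED, and the processor_order value is
    only a guess -- it just needs to sort the processor after the redirect
    handling):

    class PostRedirectBadResponseProcessor(BadResponseProcessor):
        # guess at a value large enough to run after redirects are followed
        processor_order = 1000

        def http_response(self, request, response):
            if not hasattr(response, "seek"):
                response = response_seek_wrapper(response)
            # redirects have already been followed here, so only 200 matters
            if response.code == 200:
                ct = response.info().getheaders("content-type")
                if ct and ct[0].startswith("text/html"):
                    try:
                        if self.diagnostic_text in response.read(4096):
                            response.code = 404
                    finally:
                        response.seek(0)
            return response

        https_response = http_response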

    A similar problem: as I mention above, by default, urllib2 only
    returns 200 responses, and always raises an exception for other HTTP
    response codes. Occasionally, it's much more convenient to have an
    OpenerDirector that behaves differently:

    class HTTPErrorProcessor(ClientCookie.HTTPErrorProcessor):
        # return most error responses rather than raising an exception

        def http_response(self, request, response):
            code, msg, hdrs = response.code, response.msg, response.info()

            category = divmod(code, 100)[0]  # eg. 200 --> 2
            if category not in [2, 4, 5] or code in [401, 407]:
                response = self.parent.error(
                    'http', request, response, code, msg, hdrs)

            return response

        https_response = http_response
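
    Used the same way as before: build an opener with it, then check the code
    on the returned response instead of catching an exception (again UNTESTED):

    opener = ClientCookie.build_opener(HTTPErrorProcessor())
    r = opener.open("http://www.foo.com/qwerty.html")
    if r.code == 404:
        print "not found"
    else:
        print r.read()[:200]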


    John
    John J. Lee, Apr 29, 2004
    #4
