urllib to cache 301 redirections?


O.R.Senthil Kumaran

Hi,
There is an open tracker item against the urllib2 library, python.org/sf/735515,
which states that:
urllib / urllib2 should cache the results of 301 (permanent) redirections.
This shouldn't break anything, since it's just an internal optimisation
from one point of view -- but it's also what the RFC (2616, section 10.3.2, first para) says
SHOULD happen.

I am trying to understand what it means.
Should the original URL be available to the user upon request, since urllib
automatically calls redirect_request and provides only the redirected URL?

I am not completely getting what "cache - redirection" implies and what should
be done with the urllib2 module. Any pointers?

Thanks,
 

John J. Lee

O.R.Senthil Kumaran said:
Hi,
There is an open tracker item against the urllib2 library, python.org/sf/735515,
which states that:
urllib / urllib2 should cache the results of 301 (permanent) redirections.
This shouldn't break anything, since it's just an internal optimisation
from one point of view -- but it's also what the RFC (2616, section 10.3.2, first para) says
SHOULD happen.

I am trying to understand what it means.
Should the original URL be available to the user upon request, since urllib
automatically calls redirect_request and provides only the redirected URL?

urllib2, you mean.

Regardless of this bug, Request.get_full_url() should be (and is)
whatever URL the request instance was originally constructed with.
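
For example (a quick illustration, not from the tracker item; the URL is made up):

    import urllib2

    req = urllib2.Request("http://www.example.com/old-path")
    print req.get_full_url()    # prints the URL the Request was built with,
                                # regardless of any redirect that happens later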

I am not completely getting what "cache - redirection" implies and what should
be done with the urllib2 module. Any pointers?

When a 301 redirect occurs after a request for URL U, via
urllib2.urlopen(U), urllib2 should remember the result of that
redirection, viz a second URL, V. Then, when another
urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
for V, not U. urllib2 does not currently do this. (Obviously the
cache -- that is, the dictionary or whatever that stores the mapping
from URLs U to V -- should not be maintained by function urlopen
itself. Perhaps it should live on the redirect handler.)
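
Just to illustrate the shape of the idea -- this is only a sketch, not the
eventual patch, and the class name and details are made up -- the mapping could
sit on a redirect-handler subclass and be consulted before each request:

    import urllib2

    class CachingRedirectHandler(urllib2.HTTPRedirectHandler):
        """Sketch only: remember 301 targets and reuse them on later requests."""

        def __init__(self):
            self.cache = {}   # maps original URL U -> permanently redirected URL V

        def http_request(self, req):
            # Pre-processor hook: if we have already seen a 301 for this URL,
            # rewrite the request to point at the cached target.
            target = self.cache.get(req.get_full_url())
            if target:
                req = urllib2.Request(target, req.get_data(), req.headers)
            return req

        def http_error_301(self, req, fp, code, msg, headers):
            result = urllib2.HTTPRedirectHandler.http_error_301(
                self, req, fp, code, msg, headers)
            # Record where the permanent redirect ended up.
            if result is not None:
                self.cache[req.get_full_url()] = result.geturl()
            return result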

302 redirections are temporary and are handled correctly in this
respect already by urllib2.


John
 

O.R.Senthil Kumaran

Thank you for the reply, John, and I apologize for the very late response
on my end.

* John J. Lee said:
When a 301 redirect occurs after a request for URL U, via
urllib2.urlopen(U), urllib2 should remember the result of that
redirection, viz a second URL, V. Then, when another
urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
for V, not U. urllib2 does not currently do this. (Obviously the
cache -- that is, the dictionary or whatever that stores the mapping
from URLs U to V -- should not be maintained by function urlopen
itself. Perhaps it should live on the redirect handler.)

I spent a little time thinking about a solution and figured out that the
following changes to HTTPRedirectHandler might be helpful in implementing
this.

class HTTPRedirectHandler(BaseHandler):
    # ... omitted ...

    def __init__(self):
        # Initialize a dictionary to hold the cache.
        self.cache = {}

    # Handle 301 errors separately, in a function which maintains the cache.
    def http_error_301(self, req, fp, code, msg, headers):
        if req in self.cache:
            # Look for a loop: if a particular URL appears as both a key and
            # a value, there is a loop, so raise HTTPError.
            if len(set(self.cache.keys()) & set(self.cache.values())) > 0:
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
            return self.cache[req]

        self.cache[req] = self.http_error_302(req, fp, code, msg, headers)
        return self.cache[req]
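
For reference, here is roughly how I expect it to be exercised once the patch
is in place; the URL is made up and I have not run this yet:

    import urllib2

    # The same opener (and therefore the same HTTPRedirectHandler instance,
    # with its cache dictionary) has to be reused across calls.
    opener = urllib2.build_opener()
    opener.open("http://www.example.com/old-location")   # follows the 301 normally
    opener.open("http://www.example.com/old-location")   # should now come from the cache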


John, let me know your comments on this approach.
I have not tested this code in a real scenario with a 301 redirect yet.
If it's okay, I shall test it and submit a patch for the tracker item.

Thanks,
Senthil
 

John J Lee

O.R.Senthil Kumaran said:
I spent a little time thinking about a solution and figured out that the
following changes to HTTPRedirectHandler, might be helpful in implementing
this.
[...]

Did you post it on the Python SF patch tracker?

If not, please do, and point us at it. I'll comment there.


John
 

John Nagle

O.R.Senthil Kumaran said:
Thank you for the reply, John, and I apologize for the very late response
on my end.

When a 301 redirect occurs after a request for URL U, via
urllib2.urlopen(U), urllib2 should remember the result of that
redirection, viz a second URL, V. Then, when another
urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
for V, not U. urllib2 does not currently do this. (Obviously the
cache -- that is, the dictionary or whatever that stores the mapping
from URLs U to V -- should not be maintained by function urlopen
itself. Perhaps it should live on the redirect handler.)


I spent a little time thinking about a solution and figured out that the
following changes to HTTPRedirectHandler might be helpful in implementing
this.

class HTTPRedirectHandler(BaseHandler):
    # ... omitted ...

    def __init__(self):
        # Initialize a dictionary to hold the cache.
        self.cache = {}

    # Handle 301 errors separately, in a function which maintains the cache.
    def http_error_301(self, req, fp, code, msg, headers):
        if req in self.cache:
            # Look for a loop: if a particular URL appears as both a key and
            # a value, there is a loop, so raise HTTPError.
            if len(set(self.cache.keys()) & set(self.cache.values())) > 0:
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
            return self.cache[req]

        self.cache[req] = self.http_error_302(req, fp, code, msg, headers)
        return self.cache[req]


John, let me know your comments on this approach.
I have not tested this code in a real scenario with a 301 redirect yet.
If it's okay, I shall test it and submit a patch for the tracker item.

That assumes you're reusing the same object to reopen another URL.

Is this thread-safe?

That's also an inefficient way to test for an empty dictionary.

John Nagle
 

O.R.Senthil Kumaran

* John Nagle said:
That assumes you're reusing the same object to reopen another URL.

Is this thread-safe?

I don't know. I looked into a few other caching handlers (the FTP cache, for
example) and saw how they were implemented. I am not seeing how this wouldn't be thread-safe.
That's also an inefficient way to test for an empty dictionary.

How should it be done otherwise? I am looking for alternative methods as
well.
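
For what it's worth, assuming the cache ends up mapping URL strings to URL
strings, the two checks could probably be written more directly, something
like this (untested):

    # A hypothetical cache with a deliberate loop in it:
    cache = {"http://a.example/": "http://b.example/",
             "http://b.example/": "http://a.example/"}

    if cache:      # idiomatic emptiness test: an empty dict is false
        # Loop detection: is any redirect target itself the source of another redirect?
        print any(target in cache for target in cache.values())   # -> True here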
 
