urllib to cache 301 redirections?


O.R.Senthil Kumaran

Hi,
There is an open tracker item against the urllib2 library, python.org/sf/735515,
which states that:
urllib / urllib2 should cache the results of 301 (permanent) redirections.
This shouldn't break anything, since it's just an internal optimisation
from one point of view -- but it's also what the RFC (2616, section 10.3.2, first para) says
SHOULD happen.

I am trying to understand what it means.
Should the original URL be available to the user upon request, since urllib
automatically calls redirect_request and provides only the redirected URL?

I am not completely getting what "cache - redirection" implies and what should
be done with the urllib2 module. Any pointers?

Thanks,
 

John J. Lee

O.R.Senthil Kumaran said:
Hi,
There is an open tracker item against the urllib2 library, python.org/sf/735515,
which states that:
urllib / urllib2 should cache the results of 301 (permanent) redirections.
This shouldn't break anything, since it's just an internal optimisation
from one point of view -- but it's also what the RFC (2616, section 10.3.2, first para) says
SHOULD happen.

I am trying to understand what it means.
Should the original URL be available to the user upon request, since urllib
automatically calls redirect_request and provides only the redirected URL?

urllib2, you mean.

Regardless of this bug, Request.get_full_url() should be (and is)
whatever URL the request instance was originally constructed with.
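
For example (a quick illustration, not from the tracker item; the URL is made up):

    import urllib2

    req = urllib2.Request("http://www.example.com/old-path")
    print req.get_full_url()    # prints the URL the Request was built with,
                                # regardless of any redirect that happens later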

I am not completely getting what "cache - redirection" implies and what should
be done with the urllib2 module. Any pointers?

When a 301 redirect occurs after a request for URL U, via
urllib2.urlopen(U), urllib2 should remember the result of that
redirection, viz a second URL, V. Then, when another
urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
for V, not U. urllib2 does not currently do this. (Obviously the
cache -- that is, the dictionary or whatever that stores the mapping
from URLs U to V -- should not be maintained by function urlopen
itself. Perhaps it should live on the redirect handler.)
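
Just to illustrate the shape of the idea -- this is only a sketch, not the
eventual patch, and the class name and details are made up -- the mapping could
sit on a redirect-handler subclass and be consulted before each request:

    import urllib2

    class CachingRedirectHandler(urllib2.HTTPRedirectHandler):
        """Sketch only: remember 301 targets and reuse them on later requests."""

        def __init__(self):
            self.cache = {}   # maps original URL U -> permanently redirected URL V

        def http_request(self, req):
            # Pre-processor hook: if we have already seen a 301 for this URL,
            # rewrite the request to point at the cached target.
            target = self.cache.get(req.get_full_url())
            if target:
                req = urllib2.Request(target, req.get_data(), req.headers)
            return req

        def http_error_301(self, req, fp, code, msg, headers):
            result = urllib2.HTTPRedirectHandler.http_error_301(
                self, req, fp, code, msg, headers)
            # Record where the permanent redirect ended up.
            if result is not None:
                self.cache[req.get_full_url()] = result.geturl()
            return result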

302 redirections are temporary and are handled correctly in this
respect already by urllib2.


John
 

O.R.Senthil Kumaran

Thank you for the reply, John, and I apologize for the very late response
on my end.

* John J. Lee said:
When a 301 redirect occurs after a request for URL U, via
urllib2.urlopen(U), urllib2 should remember the result of that
redirection, viz a second URL, V. Then, when another
urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
for V, not U. urllib2 does not currently do this. (Obviously the
cache -- that is, the dictionary or whatever that stores the mapping
from URLs U to V -- should not be maintained by function urlopen
itself. Perhaps it should live on the redirect handler.)

I spent a little time thinking about a solution and figured out that the
following changes to HTTPRedirectHandler might be helpful in implementing
this.

class HTTPRedirectHandler(BaseHandler):
    # ... omitted ...

    def __init__(self):
        # Initialize a dictionary to hold the cache.
        self.cache = {}

    # Handle 301 errors separately, in a function which maintains the cache.
    def http_error_301(self, req, fp, code, msg, headers):
        if req in self.cache:
            # Look for a loop: if a particular URL appears as both a key and
            # a value, there is a loop, so raise HTTPError.
            if len(set(self.cache.keys()) & set(self.cache.values())) > 0:
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
            return self.cache[req]

        self.cache[req] = self.http_error_302(req, fp, code, msg, headers)
        return self.cache[req]
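
For reference, here is roughly how I expect it to be exercised once the patch
is in place; the URL is made up and I have not run this yet:

    import urllib2

    # The same opener (and therefore the same HTTPRedirectHandler instance,
    # with its cache dictionary) has to be reused across calls.
    opener = urllib2.build_opener()
    opener.open("http://www.example.com/old-location")   # follows the 301 normally
    opener.open("http://www.example.com/old-location")   # should now come from the cache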


John, let me know your comments on this approach.
I have not tested this code in a real scenario with a 301 redirect yet.
If it's okay, I shall test it and submit a patch for the tracker item.

Thanks,
Senthil
 

John J Lee

O.R.Senthil Kumaran said:
I spent a little time thinking about a solution and figured out that the
following changes to HTTPRedirectHandler, might be helpful in implementing
this.
[...]

Did you post it on the Python SF patch tracker?

If not, please do, and point us at it. I'll comment there.


John
 

John Nagle

O.R.Senthil Kumaran said:
Thank you for the reply, John, and I apologize for the very late response
on my end.

When a 301 redirect occurs after a request for URL U, via
urllib2.urlopen(U), urllib2 should remember the result of that
redirection, viz a second URL, V. Then, when another
urllib2.urlopen(U) takes place, urllib2 should send an HTTP request
for V, not U. urllib2 does not currently do this. (Obviously the
cache -- that is, the dictionary or whatever that stores the mapping
from URLs U to V -- should not be maintained by function urlopen
itself. Perhaps it should live on the redirect handler.)


I spent a little time thinking about a solution and figured out that the
following changes to HTTPRedirectHandler might be helpful in implementing
this.

class HTTPRedirectHandler(BaseHandler):
    # ... omitted ...

    def __init__(self):
        # Initialize a dictionary to hold the cache.
        self.cache = {}

    # Handle 301 errors separately, in a function which maintains the cache.
    def http_error_301(self, req, fp, code, msg, headers):
        if req in self.cache:
            # Look for a loop: if a particular URL appears as both a key and
            # a value, there is a loop, so raise HTTPError.
            if len(set(self.cache.keys()) & set(self.cache.values())) > 0:
                raise HTTPError(req.get_full_url(), code,
                                self.inf_msg + msg, headers, fp)
            return self.cache[req]

        self.cache[req] = self.http_error_302(req, fp, code, msg, headers)
        return self.cache[req]


John, let me know your comments on this approach.
I have not tested this code in a real scenario with a 301 redirect yet.
If it's okay, I shall test it and submit a patch for the tracker item.

That assumes you're reusing the same object to reopen another URL.

Is this thread-safe?

That's also an inefficient way to test for an empty dictionary.

John Nagle
 

O.R.Senthil Kumaran

* John Nagle said:
That assumes you're reusing the same object to reopen another URL.

Is this thread-safe?

I don't know. I looked into a few other caching handlers (the FTP cache, for
example) and saw how they were implemented. I am not seeing how this wouldn't be thread-safe.
That's also an inefficient way to test for an empty dictionary.

How should it be done otherwise? I am looking for alternative methods as
well.
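
For what it's worth, assuming the cache ends up mapping URL strings to URL
strings, the two checks could probably be written more directly, something
like this (untested):

    # A hypothetical cache with a deliberate loop in it:
    cache = {"http://a.example/": "http://b.example/",
             "http://b.example/": "http://a.example/"}

    if cache:      # idiomatic emptiness test: an empty dict is false
        # Loop detection: is any redirect target itself the source of another redirect?
        print any(target in cache for target in cache.values())   # -> True here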
 
