OO Design question for Net::HTTP caching extension

Aredridel

I'm in the process of writing an HTTP/1.1 extension to Net::HTTP. At
the moment, I'm replacing Net::HTTP#get with a version that will check
for a valid cache entry and return that instead of the fresh result, and
the original, uncached functionality is in #get_fresh.

At the moment, I've made the cache per-class: a single class variable
for the cache, something like Net::HTTP.cache = Net::HTTPCache.new, then
the new functionality starts.
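
In rough code, the idea is something like this (Net::HTTPCache and its
#get/#put/#valid? methods are work-in-progress names, not an existing API):

  require 'net/http'

  module Net
    class HTTP
      @@cache = nil   # one cache shared by every Net::HTTP instance

      def HTTP.cache=(cache)
        @@cache = cache
      end

      # Keep the original, uncached behaviour available as #get_fresh.
      alias_method :get_fresh, :get

      # Serve a valid cache entry if we have one; otherwise do the real
      # request and remember the response.
      def get(path, initheader = nil, &block)
        key = [address, port, path]
        return @@cache.get(key) if @@cache && @@cache.valid?(key)
        response = get_fresh(path, initheader, &block)
        @@cache.put(key, response) if @@cache
        response
      end
    end
  end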

Is this sane? Should it be an instance variable instead of a class
variable? Is there a more orthogonal way to do this?

Ari
 
Brian Candler

> I'm in the process of writing an HTTP/1.1 extension to Net::HTTP. At
> the moment, I'm replacing Net::HTTP#get with a version that will check
> for a valid cache entry and return that instead of the fresh result, and
> the original, uncached functionality is in #get_fresh.
>
> At the moment, I've made the cache per-class: a single class variable
> for the cache, something like Net::HTTP.cache = Net::HTTPCache.new, then
> the new functionality starts.
>
> Is this sane? Should it be an instance variable instead of a class
> variable? Is there a more orthogonal way to do this?

One architectural suggestion: rather than extending/overriding Net::HTTP,
make your cache a separate object.

Then Net::Cache#get can return an object from its cache (which is just an
instance variable of Net::Cache); the cache also holds instances of the
object(s) which do the fetching from remote hosts when there is a cache miss.
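
A minimal sketch of the shape I mean (the class name and the one-argument
fetcher protocol are illustrative):

  require 'net/http'
  require 'uri'

  module Net
    class Cache
      def initialize(fetcher)
        @objects = {}       # the cache: just an instance variable
        @fetcher = fetcher  # performs the retrieval on a cache miss
      end

      # Hand back the cached copy if we have one, else fetch and keep it.
      def get(url)
        @objects[url] ||= @fetcher.call(url)
      end
    end
  end

  # Usage, with HTTP as the retrieval method (FTP would plug in the same way):
  fetch_http = lambda do |url|
    uri = URI.parse(url)
    Net::HTTP.start(uri.host, uri.port) { |http| http.get(uri.path) }
  end
  cache = Net::Cache.new(fetch_http)
  page  = cache.get('http://www.example.com/index.html')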

I think there are a number of advantages in this approach:
- no messing with an existing foundation class
- the potential to have multiple methods for retrieving objects
(e.g. Net::FTP as well as Net::HTTP)
- a clear division of responsibility between the cache and the
object retrieval protocol

With some care, you should be able to make your cache thread-safe: if a
request for object X comes in while the cache is already fetching that
object, it could wait until the retrieval is complete. The cache can then
sit behind DRb, for example, so it can be accessed by multiple processes
simultaneously. Equally, if you get a request for object Y while a fetch for
object X is taking place, you can perform a parallel fetch in another
thread.

I think it would be much easier to deal with this sort of threading issue
when the cache is a separate object from the protocol.

In fact, I think I would break it up a bit more: have the raw cache as one
object (very simple: just 'put' and 'fetch' methods into a hash, but it has
its own semaphore for protecting concurrent accesses), and a cache manager
which takes the incoming 'get' requests, checks the cache, and if necessary
performs the actual fetch before returning the object.
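
Roughly like this, say (class names illustrative; the fetcher is whatever
object performs the real retrieval):

  require 'thread'

  # Raw cache: nothing but a hash guarded by its own semaphore.
  class RawCache
    def initialize
      @mutex = Mutex.new
      @store = {}
    end

    def put(key, value)
      @mutex.synchronize { @store[key] = value }
    end

    def fetch(key)
      @mutex.synchronize { @store[key] }
    end
  end

  # Cache manager: takes the incoming 'get' requests, checks the cache,
  # and performs the actual fetch on a miss. A request for an object
  # already being fetched waits for that fetch to finish; requests for
  # other objects fetch in parallel.
  class CacheManager
    def initialize(cache, fetcher)
      @cache    = cache
      @fetcher  = fetcher
      @lock     = Mutex.new
      @inflight = {}   # key => ConditionVariable for waiting threads
    end

    def get(key)
      @lock.synchronize do
        @inflight[key].wait(@lock) while @inflight[key]
        if (hit = @cache.fetch(key))
          return hit
        end
        @inflight[key] = ConditionVariable.new
      end
      begin
        value = @fetcher.get(key)   # outside the lock, so fetches for
        @cache.put(key, value)      # other keys can run in parallel
        value
      ensure
        @lock.synchronize { @inflight.delete(key).broadcast }
      end
    end
  end

Since the manager is an ordinary object, putting it behind DRb is then just
a matter of DRb.start_service(uri, manager).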

Regards,

Brian.
 
Aredridel

> One architectural suggestion: rather than extending/overriding Net::HTTP,
> make your cache a separate object.

I'm intending to make the cache itself a separate object.

However, I am planning to extend Net::HTTP#get to use the cache,
because, in theory, with HTTP/1.1 semantics on the cache, it will make
no difference -- you could be getting a cached response as it is...
Honestly, I think that integrating it with get makes the most sense
there.

> Then Net::Cache#get can return an object from its cache (which is just an
> instance variable of Net::Cache); the cache also holds instances of the
> object(s) which do the fetching from remote hosts when there is a cache miss.
>
> I think there are a number of advantages in this approach:
> - no messing with an existing foundation class

Consider me ambitious, but I'd like to see code like this make it as far
as inclusion with ruby some day. ;-)

> - the potential to have multiple methods for retrieving objects
> (e.g. Net::FTP as well as Net::HTTP)

Those are different objects ;-)

> - a clear division of responsibility between the cache and the
> object retrieval protocol

That's hard to do while staying HTTP/1.1 compliant. The caching requires
metadata that doesn't exist for FTP or filesystem objects... Some
basic support would be possible, but at least some metadata object would
be required. I'd say that the HTTP headers are that data at the moment.

Even so, I'll keep it abstract enough to not invite duplication.

> With some care, you should be able to make your cache thread-safe: if a
> request for object X comes in while the cache is already fetching that
> object, it could wait until the retrieval is complete. The cache can then
> sit behind DRb, for example, so it can be accessed by multiple processes
> simultaneously. Equally, if you get a request for object Y while a fetch for
> object X is taking place, you can perform a parallel fetch in another
> thread.

I intend to do that in my second version. That's one thing most
caching systems should do but don't.

> I think it would be much easier to deal with this sort of threading issue
> when the cache is a separate object from the protocol.

I don't think it'll be too difficult either way. The cache will be
separate, and I'm thinking of a more-or-less callback system for things
to be added to the cache -- an object can register that it will be
adding something to the cache when retrieval is complete. I think
having the cache /do/ as little work as possible (such as the actual
retrieval) is more sane. Partly, this is because, theoretically, one
could be caching outbound requests as well, à la Apache's mod_cache, and
if they're dynamically generated, a callback system would make
that simpler than having the cache do the retrieval.
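
Sketched, the registration idea might look like this (all names
hypothetical):

  # The cache does no retrieval itself; retrievers (or response
  # generators, for mod_cache-style outbound caching) register up
  # front and hand over the object via a callback when done.
  class Cache
    def initialize
      @store   = {}
      @pending = {}
    end

    # A retriever announces that it will add this entry later.
    def register(key)
      @pending[key] = true
    end

    def pending?(key)
      @pending.key?(key)
    end

    # Callback invoked when retrieval (or generation) is complete.
    def completed(key, object)
      @pending.delete(key)
      @store[key] = object
    end
  end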

> In fact, I think I would break it up a bit more: have the raw cache as one
> object (very simple: just 'put' and 'fetch' methods into a hash, but it has
> its own semaphore for protecting concurrent accesses), and a cache manager
> which takes the incoming 'get' requests, checks the cache, and if necessary
> performs the actual fetch before returning the object.

Basically, I think I'll let Net::HTTP#get (and other methods) be
extended to use the cache, effectively making them the cache manager,
and the base cache will be more or less four methods:

get
put
valid?
delete
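
In skeleton form (the validity check is simplified; the real rules would
come from the HTTP/1.1 headers -- Expires, Cache-Control and friends):

  class HTTPCache
    def initialize
      @entries = {}   # key => [response, expiry time]
    end

    def put(key, response, expires)
      @entries[key] = [response, expires]
    end

    def get(key)
      valid?(key) ? @entries[key][0] : nil
    end

    # Simplified: an entry is valid until its expiry time passes.
    def valid?(key)
      entry = @entries[key]
      !entry.nil? && Time.now < entry[1]
    end

    def delete(key)
      @entries.delete(key)
    end
  end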

Thank you /very/ much for the insight, by the way. My idea's
implementation is getting cleaner in my head as I talk about this.

Ari
 
Aredridel

> One architectural suggestion: rather than extending/overriding Net::HTTP,
> make your cache a separate object.

On second thought, integrating as tightly as I'd like with Net::HTTP,
while it works, isn't practical because the cache can't be checked
easily until after the socket's opened.

I'll probably go with a delegator, much like Net::HTTP::proxy. We'll
see. This is useful code, and it's already nearly good enough for my
uses, but it deserves some fuller development because I can see a lot of
need for it.

My use for this is an RSS aggregator that's not going to waste
bandwidth.

Ari
 
Brian Candler

> I'm intending to make the cache itself a separate object.
>
> However, I am planning to extend Net::HTTP#get to use the cache,
> because, in theory, with HTTP/1.1 semantics on the cache, it will make
> no difference -- you could be getting a cached response as it is...
> Honestly, I think that integrating it with get makes the most sense
> there.

It doesn't really make sense to me. When I call Net::HTTP#get, I am asking a
protocol to perform an operation; in many cases I do not want it
transparently intercepted and modified, nor do I want my application's
RAM footprint bloated by extra copies of objects when my application's
usage pattern may not warrant it.

However, if you make a different object which implements the same API, I can
use it as a plug-in replacement for Net::HTTP in those cases where using a
cache is appropriate.

(If you want a really serious cache then you wouldn't implement it in Ruby
anyway; you'd run Squid on your box and proxy through it. Looking at the
Squid documentation may give you some ideas on cache management, how to
handle object expiry in the absence of HTTP/1.1 information, and so on.)

Regards,

Brian.
 
Aredridel

As far as modifying Net::HTTP#get and others to use the cache
transparently, I think it makes sense since they could be cached anyway
-- there's nothing that says the protocol can't have a local cache, as
long as it's HTTP/1.1.

Anyway, though, I've decided to do otherwise for the moment, simply
because Net::HTTP connects a socket when new or start is called, and
there's no good place to hook the cache in there and still implement
keepalive cleanly.

> (If you want a really serious cache then you wouldn't implement it in Ruby
> anyway; you'd run Squid on your box and proxy through it. Looking at the
> Squid documentation may give you some ideas on cache management, how to
> handle object expiry in the absence of HTTP/1.1 information, and so on.)

I have a really serious cache in mind, but no need for one as fast as
Squid, since there's just one process accessing it... No need for a
daemon and a massive cache filesystem when a dbm file is plenty ;-)

Anyway, I'm now re-working it to be an object with Net::HTTP's interface,
and running into design issues around Net::HTTP and Net::HTTP::proxy,
thus:

Net::HTTP::proxy creates proxy objects that call the Net::HTTP
equivalents to do the dirty work, but with some methods overridden. The
problem I have is that I have a third class with the same API, but one
that may need to delegate to either the Proxy or the raw HTTP class, and
that should be selectable at runtime. Proxy, however, assumes that it's
the only wrapper around Net::HTTP, so instantiating the cache will
likely be harder than just using Net::HTTP or Net::HTTP::proxy, since
now instead of two options:

* Net::HTTP
* Net::HTTP::proxy delegating to Net::HTTP

there are four variants:

* Net::HTTP
* Net::HTTP::proxy delegating to Net::HTTP
* Net::HTTP::Cache delegating to Net::HTTP
* Net::HTTP::Cache delegating to Net::HTTP::proxy
delegating to Net::HTTP

so the Cache has to provide a way to select that easily if it is to be
Net::HTTP API compatible.
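
One way to keep it selectable is to pass the delegate class into the
cache's constructor, so all four variants fall out of one interface -- a
sketch (Net::HTTP::Cache is a name I'm only toying with):

  require 'net/http'

  module Net
    class HTTP
      class Cache
        # http_class is Net::HTTP itself, or the anonymous subclass
        # returned by Net::HTTP::Proxy(host, port).
        def initialize(store, http_class = Net::HTTP)
          @store      = store
          @http_class = http_class
        end

        def get(host, path, port = 80)
          key = [host, port, path]
          @store[key] ||= @http_class.start(host, port) { |http|
            http.get(path)
          }
        end
      end
    end
  end

  # The four variants, selected at runtime:
  #   Net::HTTP                                             # plain
  #   Net::HTTP::Proxy('proxy.example.com', 8080)           # proxied
  #   Net::HTTP::Cache.new({})                              # cached, direct
  #   Net::HTTP::Cache.new({}, Net::HTTP::Proxy('proxy.example.com', 8080))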

Ari
 
Aredridel

> Actually, I found it quite irritating that proxy isn't integrated
> more transparently into Net::HTTP (not to mention the complete lack of
> support for FTP proxying).
> Every Perl or Python tool will use the http_proxy and no_proxy
> environment variables by default, and I see no reason Ruby tools
> should do differently.

Well, for one, there are environments where environment variables are
not settable -- for example, within mod_ruby -- you may want several
sets of settings, but they all run with a shared process pool.

The lack of transparency is annoying to me too -- I'd expect a
"use_proxy" method on Net::HTTP itself, rather than a different
constructor. That said, code-wise, Net::HTTP is a very clean animal.

Ari
 
Richard Zidlicky

Hi,

In mail "Re: OO Design question for Net::HTTP caching extension":

> OK, I'll implement it in 1.8.

Same old request: could Net::HTTP use a proxy by default?

Far too many Ruby programs and libraries do not provide support for HTTP
proxies, and of those that do, only a minimal fraction get it right,
i.e. respect not only http_proxy but also no_proxy.
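
Getting it right means roughly the following (the helper name is mine,
and the no_proxy matching is simplified; real handling also has to deal
with ports and wildcards):

  require 'net/http'
  require 'uri'

  # Pick the class to use for a given host, honouring both http_proxy
  # and no_proxy.
  def http_class_for(host)
    proxy      = ENV['http_proxy'] || ENV['HTTP_PROXY']
    exceptions = (ENV['no_proxy'] || '').split(',').
                   map { |d| d.strip }.reject { |d| d.empty? }
    if proxy.nil? || exceptions.any? { |dom| host =~ /#{Regexp.escape(dom)}\z/ }
      Net::HTTP
    else
      uri = URI.parse(proxy)
      Net::HTTP::Proxy(uri.host, uri.port)
    end
  end

  http_class_for('www.example.com').start('www.example.com') { |http|
    http.get('/')
  }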

Regards
Richard
 
