Servlet caching strategies?

markspace

I was poking around the internet looking for caching ideas for servlets
and JEE containers. Here's an excerpt from an older book (2002)
published by O'Reilly:


"Pregeneration and caching of content can be key to providing your site
visitors with a quality experience. With the right pregeneration and
caching, web pages pop up rather than drag, and loads are
reduced--sometimes dramatically--on the client, server, and network....

There's no need to dynamically regenerate content that doesn't change
between requests. "

<http://events.oreilly.com/pub/a/onjava/excerpt/jebp_3/index2.html>

Since this source is almost 8 years old, I thought I'd ask here. Is
"pregeneration" of content still considered a best practice? What tools
do you use to manage the process? The article recommends Tomcat & Ant,
apparently implying that you should build everything possible statically
with Ant before deploying it.

Any different thoughts?
 

Arved Sandstrom

markspace said:
I was poking around the internet looking for caching ideas for servlets
and JEE containers. Here's an excerpt from an older book (2002)
published by O'Reilly:


"Pregeneration and caching of content can be key to providing your
site visitors with a quality experience. With the right pregeneration
and caching, web pages pop up rather than drag, and loads are
reduced--sometimes dramatically--on the client, server, and
network....
There's no need to dynamically regenerate content that doesn't change
between requests. "

<http://events.oreilly.com/pub/a/onjava/excerpt/jebp_3/index2.html>

Since this source is almost 8 years old, I thought I'd ask here. Is
"pregeneration" of content still considered a best practice? What
tools do you use to manage the process? The article recommends
Tomcat & Ant, apparently implying that you should build everything
possible statically with Ant before deploying it.

Any different thoughts?

As far as I'm concerned it's still a best practice. I have to confess, while
some of the web apps I've worked on in the past decade could have profitably
used a certain amount of deliberate thought regarding caching and/or
pregeneration, neither I nor anyone else on the dev teams ever really
thought about it; hardware advances seemed to save the day. Most times we
focused on optimizing the connection to the database.

But then again I've never worked on a LARGE web application, and there I
suspect people think about this frequently, and actually do it. Some casual
Googling tells me that this is the case.

I thank you for bringing this to my attention, because now I'm going to
think about it. :)

AHS
 

Arne Vajhøj

markspace said:
I was poking around the internet looking for caching ideas for servlets
and JEE containers. Here's an excerpt from an older book (2002)
published by O'Reilly:

"Pregeneration and caching of content can be key to providing your site
visitors with a quality experience. With the right pregeneration and
caching, web pages pop up rather than drag, and loads are
reduced--sometimes dramatically--on the client, server, and network....

There's no need to dynamically regenerate content that doesn't change
between requests. "

<http://events.oreilly.com/pub/a/onjava/excerpt/jebp_3/index2.html>

Since this source is almost 8 years old, I thought I'd ask here. Is
"pregeneration" of content still considered a best practice? What tools
do you use to manage the process? The article recommends Tomcat & Ant,
apparently implying that you should build everything possible statically
with Ant before deploying it.

Any different thoughts?

There is no need to pregenerate static content.

Some pregeneration happens automatically in the Java world:
.class -> native
.jsp -> .java -> .class -> native

Data can and should be cached. Several cache products exist for
this purpose, such as Ehcache.
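
For instance, here is a minimal sketch using the Ehcache 2.x API (the
cache name, the FragmentCache class, and its render() method are
illustrative assumptions, not anything from the posts; the cache must
be declared in ehcache.xml):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class FragmentCache {

        // Assumes a cache named "fragments" is configured in ehcache.xml.
        private final Cache cache = CacheManager.create().getCache("fragments");

        public String lookup(String key) {
            Element hit = cache.get(key);   // returns null on a miss
            if (hit != null) {
                return (String) hit.getObjectValue();
            }
            String value = render(key);     // the expensive work we want to skip
            cache.put(new Element(key, value));
            return value;
        }

        // Illustrative stand-in for whatever is costly to produce.
        private String render(String key) {
            return "<p>expensive content for " + key + "</p>";
        }
    }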

If the context is a CMS, it is common to have the CMS
publish something less dynamic that is generated from a more
dynamic source.

To speed up traditional dynamic content, instead of resorting
to all sorts of ugly hacks, simply put a cache server
like Varnish in front of your app server.

Arne
 

Tom Anderson

I have no direct experience with this; my company's practice seems to be
to not worry about it at all. Static content (images, mostly) gets pushed
forward to webservers or a content distribution network, but all page
content is generated on the fly. I doubt much of it gets cached; we lean
towards doing lots of per-user customisation on the page, which militates
against that somewhat. I suspect we are Bad People for doing this, though.

Arne Vajhøj said:
There is no need to pregenerate static content.

Agreed. To repeat what Arne said with slightly different emphasis, I think
pages should be cached - I think it's vastly preferable to caching data
further back, because it shortcuts page generation - and I think the place
to do it is in a reverse proxy sitting in front of your app servers. As
Arne said, something like Denmark's finest software product, Varnish:

http://www.varnish-cache.org/

Varnish was built from the ground up (by a Dane, Arne!) to be fast at
serving cached pages. It can pump pages out vastly faster than any app
server, using runtime resources very efficiently in doing so.

But I don't think you need to pregenerate per se to fill the cache.
Rather, you spider the site, via the cache, thus causing it to be filled.
If you can produce a list of URLs for every page on your site, wget
running on a box out on the net will do the job nicely.

As an aside, on the subject of high-performance web architecture in
reality, this is old, but it's a classic:

http://perl.apache.org/docs/tutorials/apps/scale_etoys/etoys.html

tom
 

markspace

Tom Anderson said:
Agreed. To repeat what Arne said with slightly different emphasis, I
think pages should be cached - I think it's vastly preferable to
caching data further back, because it shortcuts page generation - and I
think the place to do it is in a reverse proxy sitting in front of your
app servers.


So I agree that the author I linked to, Jason Hunter, seems a little
goofy in this regard. I've never heard of pregenerating websites being
a best practice, which is why I asked here about it. The idea of using
separate (and pre-written, and pre-debugged) cache software to do the
caching seems much, much better.

However, Mr Hunter raises some good points. One I think is that it's
impossible for a cache to function if it can't determine the age of a
page (or other resource). And he admonishes the coder to implement
getLastModified() on the servlet, which will automatically add date
information to the page (or other generated content). This seems
fundamental in enabling a cache to work, because the spec for caching
seems to imply that without some sort of cache control in the HTTP
header, data cannot be cached.
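
For example, a plain-servlet sketch of what Hunter describes (the
once-a-minute timestamp below is only a stand-in for a real resource's
modification time):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ReportServlet extends HttpServlet {

        // HttpServlet.service() calls this to add a Last-Modified header and
        // to answer If-Modified-Since requests with 304, skipping doGet().
        @Override
        protected long getLastModified(HttpServletRequest req) {
            return lastUpdateMillis();   // millis since epoch; -1 means unknown
        }

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            resp.setContentType("text/html");
            resp.getWriter().println("<p>generated at " + lastUpdateMillis() + "</p>");
        }

        // Stand-in: pretend the resource changes once a minute. In practice
        // this would be a file's or database row's modification timestamp.
        private long lastUpdateMillis() {
            return (System.currentTimeMillis() / 60000L) * 60000L;
        }
    }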

So am I correct in assuming that the cache software needs some sort of
control? Either Last-Modified, ETag, or some sort of explicit cache
control. That's how I read the spec.

And what cache control do you use when planning out a site?


Thanks, I'll look into this when I get a chance.
 

Tom Anderson

markspace said:
So I agree that the author I linked to, Jason Hunter, seems a little goofy in
this regard. I've never heard of pregenerating websites being a best
practice, which is why I asked here about it. The idea of using separate
(and pre-written, and pre-debugged) cache software to do the caching seems
much, much better.

However, Mr Hunter raises some good points. One I think is that it's
impossible for a cache to function if it can't determine the age of a
page (or other resource). And he admonishes the coder to implement
getLastModified() on the servlet, which will automatically add date
information to the page (or other generated content). This seems
fundamental in enabling a cache to work, because the spec for caching
seems to imply that without some sort of cache control in the HTTP
header, data cannot be cached.

So am I correct in assuming that the cache software needs some sort of
control? Either Last-Modified, ETag, or some sort of explicit cache
control.

Yes, absolutely. IIRC, the minimum you need is a last-modified header and
a lack of a no-store cache-control header. With that, clients will still
have to make GET requests for the page, but they can be conditional; if
your server supports conditional GETs (I have no idea if there's any
support for this in web frameworks; it's straightforward with a plain
servlet), you can avoid rendering pages. I believe a reverse proxy can use
that to serve cached pages to users making non-conditional GETs (e.g. who
have never seen the page before): it passes on a conditional GET to the
app server, using the last-modified on its cached copy of the page, and if
it gets a 304 Not Modified response, it serves the cached page.
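
To see that handshake from the client side, here is a small java.net
sketch (the URL is a placeholder):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConditionalGetDemo {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.com/report");   // placeholder

            // First request: unconditional. Note the Last-Modified we get back.
            HttpURLConnection first = (HttpURLConnection) url.openConnection();
            System.out.println("first:  " + first.getResponseCode());   // 200
            long lastModified = first.getLastModified();

            // Second request: conditional. Sends an If-Modified-Since header.
            HttpURLConnection second = (HttpURLConnection) url.openConnection();
            second.setIfModifiedSince(lastModified);
            // 304 Not Modified means "use your cached copy"; no body is resent.
            System.out.println("second: " + second.getResponseCode());
        }
    }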

You then have a plethora of options for doing things better or
differently. Off the top of my head:

* If you can't or don't want to issue last-modified headers, you can use
etags instead.

* You can send an expires or cache-control max-age header, which lets
browser and proxy caches reuse cached pages without revalidating them with
the app server (see the sketch after this list).

* You can send all sorts of other things in cache-control headers to tune
cache behaviour; I don't think any of these are colossally important,
though.

* You can use the 'far-future expires' method, in which you serve all
content with expiry dates in the far future, so it can be cached forever,
and make sure that if you change it, you change the URL it's reached
through, so that the cached old version won't be used. For example, when
you change your logo, logo.v1.png becomes logo.v2.png.

* If you have a highly configurable reverse proxy like Varnish, you can do
some trickery to offload work to it while retaining control. The thing is
to send a freshness-enabling header, like cache-control s-maxage, with a
value which guarantees freshness far into the future, which will let the
proxy cache the response without needing to revalidate it, but strip this
off before sending it out into the web. You thus ensure that the response
is cached, but only locally. That doesn't save you any bandwidth, but it
does save you server load. You keep control over the caching, because you
can always purge the proxy's cache if things change.
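
As a sketch of those header options in servlet terms (the max-age and
s-maxage values are purely illustrative):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CacheHeaderServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Browsers may reuse this page for 60 seconds, and shared caches
            // (a reverse proxy, say) for an hour, without revalidating.
            resp.setHeader("Cache-Control", "public, max-age=60, s-maxage=3600");
            // Expires is the HTTP/1.0 ancestor of max-age.
            resp.setDateHeader("Expires", System.currentTimeMillis() + 60000L);
            resp.setContentType("text/html");
            resp.getWriter().println("<p>cacheable page</p>");
        }
    }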

markspace said:
And what cache control do you use when planning out a site?

How do you mean?

markspace said:
Thanks, I'll look into this when I get a chance.

I'll add a couple more. Basic info that it sounds like you already
know:

http://www.mnot.net/cache_docs/

Yet more interesting Danish cache science (although
not directly useful in this discussion):

http://iwaw.europarchive.org/04/Clausen.pdf

tom
 

Arne Vajhøj

Tom Anderson said:
Agreed. To repeat what Arne said with slightly different emphasis, I
think pages should be cached - I think it's vastly preferable to
caching data further back, because it shortcuts page generation - and I
think the place to do it is in a reverse proxy sitting in front of your
app servers. As Arne said, something like Denmark's finest software
product, Varnish:

http://www.varnish-cache.org/

Varnish was built from the ground up (by a Dane, Arne!) to be fast at
serving cached pages. It can pump pages out vastly faster than any app
server, using runtime resources very efficiently in doing so.

I know Varnish is maintained by PHK.

There are a couple of other Danes that have contributed to software
over the years (even though they all seem to migrate to other
countries):
Anders Hejlsberg (Turbo Pascal, Delphi, C#)
Bjarne Stroustrup (C++)
Rasmus Lerdorf (PHP)
David Heinemeier Hansson (Ruby on Rails)

Tom Anderson said:
But I don't think you need to pregenerate per se to fill the cache.
Rather, you spider the site, via the cache, thus causing it to be
filled. If you can produce a list of URLs for every page on your site,
wget running on a box out on the net will do the job nicely.

I would not bother preloading. Just cache each page on the first real
request.

Arne

 

Arne Vajhøj

markspace said:
So I agree that the author I linked to, Jason Hunter, seems a little
goofy in this regard. I've never heard of pregenerating websites being a
best practice, which is why I asked here about it. The idea of using
separate (and pre-written, and pre-debugged) cache software to do the
caching seems much, much better.

However, Mr Hunter raises some good points. One I think is that it's
impossible for a cache to function if it can't determine the age of a
page (or other resource). And he admonishes the coder to implement
getLastModified() on the servlet, which will automatically add date
information to the page (or other generated content). This seems
fundamental in enabling a cache to work, because the spec for caching
seems to imply that without some sort of cache control in the HTTP
header, data cannot be cached.

So am I correct in assuming that the cache software needs some sort of
control? Either Last-Modified, ETag, or some sort of explicit cache
control. That's how I read the spec.

getLastModified allows HTTP conditional GET to work, which
is a lot better than fetching everything again, but not as good as
caching in front.

I would consider headers that assist caching in front to be more
important if you need to go high volume.

Arne
 
