Saturday, February 28, 2015

Caching and Etag headers - A day's work in Apache and TileStache

Some months ago, I set up an Amazon EC2 instance, installed Varnish on it, and had it act as a caching proxy for our map tiles. They're gorgeous map tiles, but they add up to rather a lot of data every month, and it would be best to offload that onto Amazon's fast network.

It's working well: after a little fussing with Cache-Control and Expires headers via Apache's Expires and Headers modules, browsers definitely have expiration dates and are caching the map tiles. And that's where I let it lie for some months, on account of more pressing duties.

But I got to revisit this last week and noticed that only Chrome was reporting that the map tiles were being loaded from cache. Firebug and IE's F12 tools were showing code 200 for those map tiles. That is to say, they had the map tiles in their cache... but were requesting them anew anyway!

Cache Expiration


These map tiles change very rarely, about once per year, so we were very generous about their maximum age in the cache. We advise the browser that 90 days (7,776,000 seconds) is good.
ExpiresActive On
Header set Cache-Control "max-age=7776000"
But so what? The browser has in its cache an image fetched on Feb 22 (6 days ago), and it knows that the file is definitely too old come late May... But what about today? How does the browser know that there isn't a newer version?
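
To make the arithmetic behind that max-age concrete, here is a small sketch of how 90 days becomes 7776000 seconds and when a tile fetched on Feb 22 would stop being fresh. The function name expiry_date is just an illustration, not anything from Apache or TileStache.

```python
from datetime import date, timedelta

# 90 days expressed in seconds, the value used in the max-age directive
MAX_AGE_SECONDS = 90 * 24 * 60 * 60


def expiry_date(fetched_on, max_age_seconds=MAX_AGE_SECONDS):
    """Return the calendar date on which a cached copy stops being fresh."""
    return fetched_on + timedelta(seconds=max_age_seconds)


print(MAX_AGE_SECONDS)                 # 7776000
print(expiry_date(date(2015, 2, 22)))  # 2015-05-23
```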

If the image file had a Last-Modified time, the browser could send along an If-Modified-Since header, and the server could send back a 304 (Not Modified) if the file hasn't been modified. But this is not a static file. It's generated by TileStache (perhaps from on-disk cache, perhaps not), and TileStache does not include a Last-Modified header.
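
For a server that did track modification times, the If-Modified-Since check described above might look roughly like this. It is a sketch only: status_for is a hypothetical helper, and it assumes the server knows the tile's last change as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def status_for(if_modified_since, last_modified):
    """Return 304 if the client's copy is still current, else 200.

    if_modified_since: raw header value from the request, or None.
    last_modified: timezone-aware datetime of the resource's last change.
    """
    if not if_modified_since:
        return 200  # no conditional header: send the full response
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: fall back to a full response
    if client_time.tzinfo is None:
        # HTTP dates are always GMT, so treat a zone-less value as UTC.
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200


# The browser cached the tile on Feb 22; the tile last changed on Jan 10.
header = format_datetime(
    datetime(2015, 2, 22, 12, 0, tzinfo=timezone.utc), usegmt=True
)
print(status_for(header, datetime(2015, 1, 10, tzinfo=timezone.utc)))  # 304
```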

From the standpoint of TileStache's capability to support multiple storage backends (Caches class types), this kinda makes sense. An S3 backend, an on-disk backend, memcache, or even hybrids of multiple cache backends... I can see why they skipped that.

As a result, the browser has it in the cache but the only way to know that it's the most recent version... is to download it again.

Etag


The other mechanism for identifying that a file has not changed since you last downloaded it is the humble Etag header. An Etag is a string that identifies a particular version of the file: it could be a timestamp, a version number, an MD5 or CRC32 checksum, just about anything. Apache's own module combines the file's mtime, inode number, and filesize to generate a nice scramble.
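
As a rough illustration of those two flavors, here is a sketch of an Etag built from a content checksum and one built from file metadata. Both function names are made up for this example, and the metadata version only imitates the spirit of Apache's scheme, not its exact output format.

```python
import hashlib
import os


def etag_from_content(body):
    """Etag from an MD5 digest of the bytes; changes whenever the content does."""
    return '"%s"' % hashlib.md5(body).hexdigest()


def etag_from_stat(path):
    """Etag from inode, size, and mtime; no need to read the file itself."""
    st = os.stat(path)
    return '"%x-%x-%x"' % (st.st_ino, st.st_size, int(st.st_mtime))


# Identical bytes always give an identical Etag; one changed byte gives a new one.
print(etag_from_content(b"tile bytes") == etag_from_content(b"tile bytes"))  # True
print(etag_from_content(b"tile bytes") == etag_from_content(b"tile bytez"))  # False
```

The checksum version works for any backend that can hand over the bytes, which matters for something like TileStache where the tile may not be a file on disk at all.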

The server sends the Etag along with the file content:
Content-type: image/jpeg
Etag: "abcdefgh12345678"
Content-length: 213452
begin file payload here
On a subsequent request to that same URL, the browser would send that same Etag back to the server as an advisory, "If your Etag matches this, then I already have the current copy and I don't need it again":
GET /12/34/56.jpeg
If-None-Match: "abcdefgh12345678"
An added benefit of using Etag instead of a Last-Modified time is that the Etag won't run afoul of your clock being borked. For our extreme max-age of 90 days that's not a big issue, but if you're doing more typical caching of 1 or 2 hours and someone's PC gets set to the wrong timezone, they miss out. Etag abc123, though, will still be abc123 no matter what time it is.

But TileStache doesn't generate Etag headers either, so now what?

TileStache Issue 22


Looks like 2 years ago someone else had run into this, and asked for someone to add either Last-Modified or else Etag support into TileStache. Looks like it didn't get done. Fortunately, shr3k's advice was pretty close to the mark and it proved quite easy to add an MD5 digest to all three entry points: WSGI, CGI, and ModPython.

So, if they ever accept that pull, or you patch your own installation of TileStache, your TileStache will automagically generate Etag headers.
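
For anyone who would rather not patch TileStache itself, the same idea can be sketched as a generic WSGI wrapper. To be clear, this is not the patch from Issue 22: add_etag and fake_tile_app are hypothetical names, and the wrapper simply buffers the response body and hashes it with MD5.

```python
import hashlib


def add_etag(app):
    """Wrap a WSGI app so every 200 response carries an MD5-based Etag."""
    def wrapped(environ, start_response):
        captured = {}
        chunks = []

        def capture(status, headers, exc_info=None):
            captured["status"] = status
            captured["headers"] = list(headers)
            return chunks.append  # legacy write() callable required by WSGI

        result = app(environ, capture)
        try:
            chunks.extend(result)  # buffer the whole body so it can be hashed
        finally:
            if hasattr(result, "close"):
                result.close()
        body = b"".join(chunks)

        headers = captured["headers"]
        if captured["status"].startswith("200"):
            # Replace any existing Etag with one derived from the content.
            headers = [(k, v) for k, v in headers if k.lower() != "etag"]
            headers.append(("Etag", '"%s"' % hashlib.md5(body).hexdigest()))
        start_response(captured["status"], headers)
        return [body]

    return wrapped


def fake_tile_app(environ, start_response):
    """Stand-in for a tile server; returns fixed bytes instead of a real tile."""
    start_response("200 OK", [("Content-Type", "image/jpeg")])
    return [b"fake tile bytes"]


app = add_etag(fake_tile_app)
```

Buffering the body is fine for map tiles, which are small; it would be a poor fit for large streamed responses.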

At first I was concerned that it would add significant processing time, generating that Etag. But in fact it's not even measurable: if there's a millisecond or two of difference, it's nothing detectable by the time a browser and a network are involved.

Hooray! (well, for caching proxies)


So we're all set: our Varnish CDN is sending the Etag on to the client and is also caching it for itself. Browsers are using the Etag and sending it back to Varnish, and they're getting back 304 responses instead of kilobytes of image content. For tiles being fetched, response times are 100-150 ms; for a tile already in cache, more like 20-40 ms start-to-finish for the browser to find out that it's already set.

But there is a limitation: the Etag is only useful to caching proxies such as Varnish, and not if your browser is accessing TileStache directly. TileStache will not heed an If-None-Match header from the browser; TileStache would simply send you back the image content. This trick is definitely limited to caching proxies which are smart enough to handle Etag and know to bail with a 304.

Then again, how hard would it be to detect the header and fetch the tile content, then calculate the Etag? It'd come after fetching the content, so perhaps not a win for the server load; but could save some network load... An interesting idea...
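
That idea could be sketched like this: fetch the tile bytes, compute the Etag, and only then decide between a 304 and the full image. The names respond and etag_matches are hypothetical, and this is not TileStache code. As noted above, the server still does the work of producing the tile; only the network transfer is saved.

```python
import hashlib


def etag_matches(header_value, etag):
    """True if the If-None-Match header lists this Etag (or is the wildcard)."""
    # Splitting on commas is safe here because hex MD5 Etags contain none.
    candidates = [c.strip() for c in header_value.split(",")]
    # If-None-Match uses weak comparison, so a W/ prefix is ignored.
    candidates = [c[2:] if c.startswith("W/") else c for c in candidates]
    return "*" in candidates or etag in candidates


def respond(tile_bytes, if_none_match=None):
    """Return (status, headers, body) for a tile, honoring If-None-Match."""
    etag = '"%s"' % hashlib.md5(tile_bytes).hexdigest()
    headers = [("Etag", etag)]
    if if_none_match is not None and etag_matches(if_none_match, etag):
        return "304 Not Modified", headers, b""  # client copy is current
    headers.append(("Content-Length", str(len(tile_bytes))))
    return "200 OK", headers, tile_bytes


# First request: full tile. Second request echoes the Etag and gets a 304.
status, headers, body = respond(b"fake tile bytes")
etag = dict(headers)["Etag"]
print(respond(b"fake tile bytes", etag)[0])  # 304 Not Modified
```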
