Static Content Hosting

Most people host static content (HTML, JavaScript, images, etc.) inefficiently. This page documents best practices for hosting it.

The Ideal Method

The file hierarchy resembles:

/
/version-1
/version-1/foo
/version-1/bar
/version-2
/version-2/foo
/version-2/baz
/latest -> version-2

These translate to URLs like http://example.com/version-1/foo

Each time you change content, you create a new URL space for that version of the content set. Content in a set is immutable: the contents of a file inside a versioned directory never change. The only operations allowed on a versioned directory are create and delete.
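
As a rough illustration of the create-and-delete-only rule, here is a minimal Python sketch. The function names publish_version and delete_version and their parameters are hypothetical, not part of any existing tool. It relies on shutil.copytree refusing to write into a directory that already exists.

```python
import shutil
from pathlib import Path


def publish_version(source_dir: str, root: str, version: str) -> Path:
    """Copy source_dir into root/version as a brand-new content set.

    shutil.copytree raises FileExistsError when the destination already
    exists, so an already-published version can never be overwritten.
    """
    destination = Path(root) / version
    shutil.copytree(source_dir, destination)
    return destination


def delete_version(root: str, version: str) -> None:
    """Remove a whole version; individual files are never edited in place."""
    shutil.rmtree(Path(root) / version)
```

Publishing "version-2" next to an existing "version-1" leaves the older set untouched, and calling publish_version twice with the same version name fails instead of silently changing published files.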

Content in versioned directories is served with a caching policy that effectively allows caching forever. For example:

GET /version-1/foo HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Last-Modified: Mon, 09 Jan 2012 13:37:07 GMT
Cache-Control: max-age=31546000
ETag: "16d7a4fca7442dda3ad93c9a726597e4"
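
One way to produce headers like these is sketched below in Python. The helper name versioned_headers is hypothetical, and deriving the ETag from an MD5 hash of the body is an assumption made for illustration; any value that depends only on the content would serve the same purpose.

```python
import hashlib
from email.utils import formatdate

MAX_AGE_SECONDS = 31546000  # value from the example response, roughly one year


def versioned_headers(body: bytes, mtime: float) -> dict:
    """Build response headers for a file served from a versioned directory.

    mtime is the file's modification time in seconds since the epoch.
    """
    return {
        # formatdate(..., usegmt=True) emits the HTTP date format
        "Last-Modified": formatdate(mtime, usegmt=True),
        "Cache-Control": f"max-age={MAX_AGE_SECONDS}",
        # Assumption: the ETag is a hash of the bytes, so it depends only
        # on the content and not on which machine serves it.
        "ETag": '"' + hashlib.md5(body).hexdigest() + '"',
    }
```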

When clients connect to your service, you need to send them to the current (preferred) version of the hosted content. You can do this at the application layer by including the latest version identifier in a response that the client parses. Or you can do it with an HTTP redirect:

GET /latest HTTP/1.1
Host: example.com

HTTP/1.1 307 Temporary Redirect
Location: http://example.com/version-2/
Cache-Control: max-age=60
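
The routing decision behind the two kinds of response can be sketched as a small Python function. This is only an illustration: the name respond, its parameters, and the 404 fallback are assumptions, and a real server would also read and return the file body.

```python
def respond(path: str, current_version: str, base_url: str = "http://example.com"):
    """Return (status, headers) for a request path.

    current_version is a directory name such as "version-2".
    """
    if path == "/latest" or path.startswith("/latest/"):
        # Redirect the pointer to the current versioned URL space.
        rest = path[len("/latest"):] or "/"
        return 307, {
            "Location": f"{base_url}/{current_version}{rest}",
            "Cache-Control": "max-age=60",  # short, so clients notice new versions
        }
    if path.startswith("/version-"):
        # Immutable content: cache for as long as possible.
        return 200, {"Cache-Control": "max-age=31546000"}
    return 404, {}
```

Only the pointer has a short lifetime, so that is the only request a client with a warm cache repeats about once a minute.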

There are a number of advantages to this hosting strategy:

  • If the client has the latest version cached, it doesn't need to hit your server at all because of Cache-Control: max-age. Instead, it pulls it from the local cache and ignores validation (If-Modified-Since, If-None-Match).
  • HTTP caches in front of your server will absorb almost all load.
  • Clients never see mixed versions of content.
  • HTTP server load is significantly reduced (fewer overall HTTP requests due to cache hit rate).
  • Faster client responsiveness. Clients can typically load content from local cache without going to network for validation.

Some caveats include:

  • Isn't as suitable for highly-dynamic content. If your generated/static content changes with great frequency (seconds to minutes), you won't see as pronounced a boost from this strategy.
  • Version management. You'll have multiple versions floating around on your server. You'll need to manage which ones need to be active and which can be deleted.
  • URLs aren't as friendly. People won't notice for CSS, JS, or images, but the version component might be annoying noise in the browser's URL bar.

When you have a cluster of machines serving content, there are some things to watch out for:

  • ETag must be consistent across all machines. Some HTTP servers derive the ETag from filesystem metadata such as the inode number, which differs between filesystem instances. If you can't provide a consistent ETag efficiently, turn it off: you don't need it for this strategy to work.
  • You must deploy new versions to all servers before switching over the latest pointer. If you don't, a client could read the pointer from a ready host and then attempt to fetch data from a host that doesn't have the new version yet.
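
The deploy-then-switch ordering can be sketched in Python as follows. Both function names are hypothetical, the latest pointer is assumed to be a POSIX symlink as in the hierarchy above, and how you learn which versions each host has is left to your deployment tooling.

```python
import os


def all_hosts_ready(deployed: dict, version: str) -> bool:
    """True only if every host in `deployed` already has `version`.

    `deployed` maps a host name to the set of versions present on it.
    """
    return all(version in versions for versions in deployed.values())


def switch_latest(root: str, version: str) -> None:
    """Atomically repoint root/latest at `version` (POSIX symlinks assumed)."""
    if not os.path.isdir(os.path.join(root, version)):
        raise FileNotFoundError(f"{version} is not deployed under {root}")
    temporary = os.path.join(root, "latest.tmp")
    if os.path.lexists(temporary):
        os.remove(temporary)
    os.symlink(version, temporary)  # relative link, like "latest -> version-2"
    # Renaming over the old link means readers see either the old or the
    # new target, never a missing pointer.
    os.replace(temporary, os.path.join(root, "latest"))
```

A deployment script would copy the new version to every host, confirm all_hosts_ready returns True, and only then call switch_latest on each host.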

Additional Improvements:

  • Use compression on HTTP responses. Hopefully your HTTP stack does this automatically. It will likely cut down on wire transfer time at the small expense of increased CPU utilization. Some HTTP servers might even cache the compressed entity to save redundant work.
  • Date strings are good version identifiers. e.g. 0/20120109T2130/foo
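
A date-string identifier in the style shown above could be generated with a few lines of Python. The function name version_id is hypothetical, and using UTC is an assumption made so that every build machine produces comparable identifiers.

```python
from datetime import datetime, timezone


def version_id(when: datetime) -> str:
    """Format a timezone-aware datetime as YYYYMMDDTHHMM in UTC."""
    return when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M")


# Typical use at build time:
current = version_id(datetime.now(timezone.utc))
```

Because the fields run from most to least significant, sorting these identifiers as plain strings also sorts them chronologically, which helps when deciding which old versions to delete.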