Static Content Hosting
Most people host static content (HTML, JavaScript, images, etc.)
inefficiently. This page documents best practices for hosting it.
The Ideal Method
The file hierarchy resembles:
/
/version-1
/version-1/foo
/version-1/bar
/version-2
/version-2/foo
/version-2/baz
/latest -> version-2
These translate to URLs like http://example.com/version-1/foo
Each time you change content, you create a new URL space for that
version of the content set. Content in a set is immutable: the contents
of a file inside a versioned directory never change. The only
operations you can perform on a version are create and delete.
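The create-only rule and the latest pointer can be sketched in a few lines. This is a minimal sketch, not a definitive deployment tool: it assumes a POSIX filesystem where latest is a relative symlink, and the function names publish_version and point_latest_at are made up for illustration.

```python
import os
import shutil


def publish_version(root, version, source_dir):
    """Copy source_dir to root/version, refusing to touch an existing version."""
    target = os.path.join(root, version)
    # copytree raises FileExistsError when target already exists, so a
    # published version can never be modified in place.
    shutil.copytree(source_dir, target)
    return target


def point_latest_at(root, version):
    """Atomically repoint root/latest at an already published version."""
    if not os.path.isdir(os.path.join(root, version)):
        raise ValueError("unknown version: " + version)
    # Create the new symlink under a temporary name, then rename it over
    # the old one. The rename is atomic on POSIX filesystems, so readers
    # see either the old target or the new one, never a missing link.
    tmp_link = os.path.join(root, ".latest.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(version, tmp_link)
    os.replace(tmp_link, os.path.join(root, "latest"))
```

Deleting a version is then just removing its directory once nothing points at it anymore.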
Content in versioned directories is served with a caching policy that
effectively allows caching forever, e.g.:
GET /version-1/foo HTTP/1.1
Host: example.com
HTTP/1.1 200 OK
Last-Modified: Mon, 09 Jan 2012 13:37:07 GMT
Cache-Control: max-age=31536000
ETag: "16d7a4fca7442dda3ad93c9a726597e4"
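One way to pick the caching policy per request is a small path check. The sketch below assumes the /version-N/ naming used in the hierarchy above. The one-year lifetime (31536000 seconds) and the 60-second fallback for everything else are illustrative choices, and cache_headers is a hypothetical helper, not part of any server.

```python
import re

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000

# Assumed naming scheme: versioned content lives under /version-<n>/.
VERSIONED_PATH = re.compile(r"^/version-\d+/")


def cache_headers(path):
    """Return the caching headers to send for a request path."""
    if VERSIONED_PATH.match(path):
        # Versioned content never changes, so let clients keep it for a year.
        return {"Cache-Control": "max-age=%d" % ONE_YEAR_SECONDS}
    # Anything else (such as the latest pointer) gets a short lifetime.
    return {"Cache-Control": "max-age=60"}
```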
When clients connect to your service, you need to send them to the
preferred/current version of the hosted content. You can do this at the
application layer by embedding the latest version in a response that
the client parses. Or, you can address it via HTTP:
GET /latest HTTP/1.1
Host: example.com
HTTP/1.1 307 Temporary Redirect
Location: http://example.com/version-2/
Cache-Control: max-age=60
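The HTTP variant of the pointer could look roughly like the WSGI application below. It is a sketch under assumptions: make_app and its parameters are invented for illustration, the current version is passed in rather than read from a symlink, and the fallback host example.com is only a placeholder.

```python
def make_app(current_version, pointer_max_age=60):
    """Build a WSGI app that redirects /latest to the current version."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "").rstrip("/")
        if path == "/latest":
            host = environ.get("HTTP_HOST", "example.com")
            start_response("307 Temporary Redirect", [
                ("Location", "http://%s/%s/" % (host, current_version)),
                # Short lifetime: clients re-check the pointer every minute.
                ("Cache-Control", "max-age=%d" % pointer_max_age),
                ("Content-Length", "0"),
            ])
            return [b""]
        start_response("404 Not Found", [("Content-Length", "0")])
        return [b""]

    return app
```

The pointer's max-age bounds how long a client can keep using an old version after you switch, so it trades freshness against request volume.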
There are a number of advantages to this hosting strategy:
- If the client has the latest version cached, it doesn't need to hit
your server at all because of Cache-Control: max-age. Instead, it
pulls the content from the local cache and skips validation
(If-Modified-Since, If-None-Match) entirely.
- HTTP caches in front of your server will absorb almost all load.
- Clients never see mixed versions of content.
- HTTP server load is significantly reduced (fewer overall HTTP requests
due to a high cache hit rate).
- Faster client responsiveness. Clients can typically load content from
local cache without going to network for validation.
Some caveats include:
- It is less suitable for highly dynamic content. If your generated/static
content changes very frequently (every few seconds to minutes), you
won't see as pronounced a boost from this strategy.
- Version management. You'll have multiple versions floating around on
your server. You'll need to manage which ones need to be active and
which can be deleted.
- URLs aren't as friendly. Nobody will notice for CSS, JS, or images,
but the version prefix can be annoying noise for pages whose URL shows
up in the browser's URL bar.
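For the version management caveat, a simple retention rule is often enough. The sketch below keeps the newest few versions plus whatever latest points at, and reports the rest as deletable. The function name, the keep parameter, and the oldest-to-newest ordering of the input are all assumptions made for illustration.

```python
def versions_to_delete(versions, latest, keep=2):
    """Return the versions that are safe to remove.

    versions is ordered oldest to newest; latest is the version the
    pointer currently references and is never deleted.
    """
    # Guard keep=0 explicitly: versions[-0:] would select the whole list.
    keep_set = set(versions[-keep:]) if keep > 0 else set()
    keep_set.add(latest)
    return [v for v in versions if v not in keep_set]
```

Keeping at least the previous version around for a while is prudent, since clients that cached the old pointer may still request it.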
When you have a cluster of machines serving content, there are some
things to watch out for:
- ETag must be consistent across all machines. Some HTTP servers derive
the ETag from filesystem metadata such as the inode number, which
differs between filesystem instances. If you can't provide a
consistent ETag efficiently, turn it off: you don't need it for this
strategy to work.
- You must deploy new versions to all servers before switching over the
latest pointer. If you don't, a client could read the pointer from a
ready host and then attempt to fetch data from a host that doesn't
have the new version yet.
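Both cluster concerns can be illustrated with two small helpers. This is a sketch under assumptions: content_etag hashes the file bytes (SHA-256 here, though any stable digest would do) so every machine produces the same value, and safe_to_switch expects a hypothetical mapping of host name to the set of versions deployed there, which you would have to collect yourself.

```python
import hashlib


def content_etag(data):
    """Derive a quoted ETag from the bytes alone, identical on every host."""
    return '"%s"' % hashlib.sha256(data).hexdigest()


def safe_to_switch(version, versions_by_host):
    """True only if at least one host exists and every host has the version."""
    return bool(versions_by_host) and all(
        version in deployed for deployed in versions_by_host.values()
    )
```

Only flip the latest pointer once safe_to_switch reports True for the new version.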
Additional Improvements:
- Use compression on HTTP responses. Hopefully your HTTP stack does this
automatically. It will likely cut down on wire transfer time at the
small expense of increased CPU utilization. Some HTTP servers might
even cache the compressed entity to save redundant work.
- Date strings are good version identifiers. e.g. 0/20120109T2130/foo
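A date-based identifier in the format shown above can be generated as follows. This is a minimal sketch: version_id is a made-up name, and using UTC with minute resolution is an assumption, which means two releases within the same minute would collide.

```python
from datetime import datetime, timezone


def version_id(when=None):
    """Format a timezone-aware datetime as a sortable version string."""
    when = when or datetime.now(timezone.utc)
    # Normalize to UTC so every build machine produces comparable IDs.
    return when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M")
```

Because the fields run from most to least significant, sorting these strings alphabetically also sorts the versions chronologically.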