The Importance of Time on Automated Machine Configuration

June 24, 2013 at 09:00 PM | categories: sysadmin, Mozilla, Puppet

Usage of machine configuration management software like Puppet and Chef has taken off in recent years. And rightly so - these pieces of software make the lives of countless system administrators much better (in theory).

In their default (and common) configuration, these tools do a terrific job of ensuring a machine is provisioned with today's configuration. However, for many server provisioning scenarios, we actually care about yesterday's configuration.

In this post, I will talk about the importance of time when configuring machines.

Describing the problem

If you've worked on any kind of server application, chances are you've had to deal with a rollback. Some new version of a package or web application is rolled out to production. However, due to unforeseen problems, it needs to be rolled back.

Or, perhaps you operate a farm of machines that continuously build or compile software from version control. It's desirable to be able to reproduce the output from a previous build (ideally bit identical).

In these scenarios, the wall time plays a crucial role when dealing with a central, master configuration server (such as a Puppet master).

Since a client will always pull the latest revision of its configuration from the server, it's very easy to define your configurations such that the result of machine provisioning today is different from yesterday (or last week or last month).

For example, let's say you are running Puppet to manage a machine that sits in a continuous integration farm and recompiles a source tree over and over. In your Puppet manifest you have:

package { "gcc":
    ensure => latest,
}

If you run Puppet today, you may pull down GCC 4.7 from the remote package repository because 4.7 is the latest version available. But if you run Puppet tomorrow, you may pull down GCC 4.8 because the package repository has been updated! If for some reason you need to reproduce one of today's builds tomorrow (perhaps you want to rebuild that revision plus a minor patch), the two builds will use different compiler versions (and the same goes for any other package) and the output may not be consistent - it may not even work at all! So much for repeatability.
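The most obvious mitigation is to pin the package to an exact version instead of latest. A minimal sketch (the version string here is illustrative, not a recommendation):

```puppet
# Pin GCC to an exact, known version so that a Puppet run today and a
# Puppet run tomorrow install the same compiler.
package { "gcc":
    ensure => "4.7.2-5",
}
```

Pinning alone removes the most obvious source of drift, though it is not sufficient by itself: the pinned version must also remain available in the repository over time.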

File templates are another example. In Puppet, file templates are evaluated on the server and the results are sent to the client. So, the output of file template execution today might be different from the output tomorrow. If you needed to roll back your server to an old version, you may not be able to do that because the template on the server isn't backwards compatible! This can be worked around, sure (commonly by copying the template and branching differences), but over time these hacks accumulate into a giant pile of complexity.
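The copy-and-branch workaround looks something like the following sketch (module and file names are hypothetical): rather than evolving a single template in place, you keep versioned copies and select one explicitly.

```puppet
# Hypothetical versioned template copies: old nodes keep referencing
# app.conf-v1.erb while newly provisioned nodes get app.conf-v2.erb.
file { "/etc/myapp/app.conf":
    content => template("myapp/app.conf-v2.erb"),
}
```

Every template change means yet another copy to select between, which is exactly the pile of complexity described above.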

The common issue here is that time has an impact on the outcome of machine configuration. I refer to this issue as time-dependent idempotency. In other words, does time play a role in the supposedly idempotent configuration process? If the output is consistent no matter when you run the configuration, it is time-independent and truly idempotent. If it varies depending on when configuration is performed, it is time-dependent and thus not truly idempotent.

Solving the problem

My attitude towards machine configuration and automation is that it should be as time independent as possible. If I need to revert to yesterday's state or want to reproduce something that happened months ago, I want strong guarantees that it will be similar, if not identical. Now, this is just my opinion. I've worked in environments where we had these strong guarantees. And having had this luxury, I abhor the alternative, where so many pieces of configuration vary over time as the central configuration moves forward without the ability to turn back the clock. As always, your needs may be different and this post may not apply to you!

I said "as possible" a few times in the previous paragraph. While you could likely make all parts of your configuration time independent, it's not a good idea. In the real world, things change over time, and making all configuration data static regardless of time will produce a broken or bad configuration.

User access is one such piece of configuration. Employees come and go. Passwords and SSH keys change. You don't want to revert user access to the way it was two months ago, restoring access to a disgruntled former employee or allowing access via a compromised password. Network configuration is another. Say the network topology changed and the firewall rules need updating. If you reverted the networking configuration, the machine likely wouldn't work on the network!

This highlights an important fact: if making your machine configuration time independent is a goal, you will need to bifurcate configuration by time dependency and solve for both. You'll need to identify every piece of configuration and ask: does this go in the bucket that is constant over time or the bucket that changes over time?

Machine configuration software can do a terrific job of ensuring an applied configuration is idempotent. The problem is it typically can't manage both time-dependent and time-independent attributes at the same time. Solving this requires a little brain power, but is achievable if there is the will. In the next section, I'll describe how.

Technical implementation

Time-dependent machine configuration is a solved problem. Deploy a Puppet master (or similar) and you are good to go.

Time-independent configuration is a bit more complicated.

As I mentioned above, the first step is to isolate all of the configuration you want to be time independent. Next, you need to ensure time dependency doesn't creep into that configuration. You need to identify things that can change over time and take measures to ensure those changes won't affect the configuration. I encourage you to employ the external system test: does this aspect of configuration depend on an external system or entity? If so, how will I prevent changes to it over time from affecting us?

Package repositories are one such external system. New package versions are released all the time. Old packages are deleted. If your configuration says to install the latest package, there's no guarantee the package version won't change, because the repository itself changes over time. If you simply pin a package to a specific version, that version may eventually disappear from the server. The solution: pin packages to specific versions and run your own package mirror that doesn't delete or modify existing packages.
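Putting those two pieces together in Puppet might look like this sketch, assuming a Yum-based system (the mirror hostname and package version are hypothetical):

```puppet
# Point the machine at an append-only internal mirror and pin the
# package to an exact version. Hostname and version are illustrative.
yumrepo { "internal-mirror":
    baseurl  => "http://mirror.internal.example.com/centos/6/x86_64",
    enabled  => 1,
    gpgcheck => 1,
}

package { "gcc":
    ensure  => "4.7.2-5",
    require => Yumrepo["internal-mirror"],
}
```

Because the mirror never deletes or modifies packages, the pinned version stays installable no matter when the configuration is applied.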

Does your configuration fetch a file from a remote server or use a file as a template? Cache that file locally (in case it disappears) and put it under version control. Have the configuration reference the version control revision of that file. As long as the repository is accessible, the exact version of the file can be retrieved at any time without variation.
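In Puppet terms, that might look like the following sketch (paths and module names are hypothetical): the file is downloaded once, committed to the configuration repository, and served from the module's files directory instead of being fetched from the external server at apply time.

```puppet
# The file lives in the module's files/ directory, which is part of
# the version-controlled configuration repo. Checking out revision X
# of the repo pins the file's exact contents.
file { "/etc/myapp/geoip.dat":
    source => "puppet:///modules/myapp/geoip.dat",
    owner  => "root",
    mode   => "0644",
}
```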

In my professional career, I've used two separate systems for managing time-independent configuration data. Both relied heavily on version control. Essentially, all the time-independent configuration data is collected into a single repository - a repository independent from all the time-dependent data (although that's technically an implementation detail). For Puppet, this would include all the manifests, modules, and files used directly by Puppet. When you want to activate a machine with a configuration, you simply say "check out revision X of this repository and apply its configuration." Since revision X of the repository is constant over time, the set of configuration data being used to configure the machine is constant. And, if you've done things correctly, the output is idempotent over time.

In one of these systems, we actually had two versions of Puppet running on a machine. First, we had the daemon communicating with a central Puppet master. It was continually applying time-dependent configuration (user accounts, passwords, networking, etc). We supplemented this with a manually executed standalone Puppet instance. When you ran a script, it asked the Puppet master for its configuration. Part of that configuration was the revision of the time-independent Git repository containing the Puppet configuration files the client should use. It then pulled the Git repo, checked out the specified revision, merged the Puppet master's settings for the node with that config (not the manifests, just some variables), then ran Puppet locally to apply the configuration. While a machine's configuration typically referenced a SHA-1 of a specific Git commit to use, we could use anything git checkout understood. We had some machines running master or other branches if we didn't care about time-independent idempotency for that machine at that time. What this all meant was that if you wanted to roll back a machine's configuration, you simply specified an earlier Git commit SHA-1 and then re-ran local Puppet.

We were largely satisfied with this model. We felt like we got the best of both worlds. And, since we were using the same technology (Puppet) for time-dependent and time-independent configuration, it was a pretty simple-to-understand system. A downside was there were two Puppet instances instead of one. With a little effort, someone could probably devise a way for the Puppet master to merge the two configuration trees. I'll leave that as an exercise for the reader. Perhaps someone has done this already! If you know of someone, please leave a comment!

Challenges

The solution I describe does not come without its challenges.

First, deciding whether a piece of configuration is time dependent or time independent can be quite complicated. For example, should a package update for a critical security fix be time dependent or time independent? It depends! What's the risk of the machine not receiving that update? How often is that machine rolled back? Is that package important to the operation/role of that machine? (If so, I'd lean more towards time independent.)

Second, minimizing exposure to external entities is hard. While I recommend putting as much as possible under version control in a single repository and pinning versions everywhere when you interface with an external system, this isn't always feasible. It's probably a silly idea to have your 200 GB Apt repository under version control and distributed locally to every machine in your network. So, you end up introducing specialized one-off systems as necessary. For our package repository, we just ran an internal HTTP server that only allowed inserts (no deletes or modifications). If we were creative, we could have likely devised a way for the client to pass a revision with the request and have the server dynamically serve from that revision of an underlying repository. That may not work for every server type, though, due to limited control over client behavior.

Third, ensuring compatibility between the time-dependent configuration and time-independent configuration is hard. This is a consequence of separating those configurations. Will a time-independent configuration from a revision two years ago work with the time-dependent configuration of today? This issue can be mitigated by first having as much configuration as possible be time independent and second not relying on wide support windows. If it's good enough to only support compatibility for time-independent configurations less than a month old, then it's good enough! With this issue, I feel you are trading long-term future incompatibility for well-defined and understood behavior in the short to medium term. That's a trade-off I'm willing to make.

Conclusion

Many machine configuration management systems only care about idempotency today. However, it's possible to achieve consistent state over time. This requires a little extra effort and brain power, but it's certainly doable.

The next time you are programming your system configuration tool, I hope you take the time to consider the effects time will have and that you will take the necessary steps to ensure consistency over time (assuming you want that, of course).