Reproducing Mozilla's Mercurial Server

September 05, 2014 at 02:50 PM | categories: Mercurial, Mozilla

Of of my first tasks in my new role as a Developer Productivity Engineer is to help make Mozilla's Mercurial server better. Many of the awesome things we have planned rely on features in newer versions of Mercurial. It's therefore important for us to upgrade our Mercurial server to a modern version (we are currently running 2.5.4) and to keep our Mercurial server upgraded as time passes.

There are a few reasons why we haven't historically upgraded our Mercurial server. First, as anyone who has maintained high-availability systems will tell you, there is the attitude of if it isn't broken, don't fix it. In other words, Mercurial 2.5.4 is working fine, so why mess with a good thing. This was all fine and dandy - until Mercurial started falling over in the last few weeks.

But the blocker towards upgrading that I want to talk about today is systems verification. There has been extreme caution around upgrading Mercurial at Mozilla because it is a critical piece of Mozilla's infrastructure and if the upgrade were to not go well, the outage would be disastrous for developer productivity and could even jeopardize an emergency Firefox release.

As much as I'd like to say that a modern version of Mercurial on the server would be a drop-in replacement (Mercurial has a great committment to backwards compatibility and has loose coupling between clients and servers such that upgrading servers should not impact clients), there is always a risk that something will change. And that risk is compounded by the amount of custom code we have running on our server.

The way you protect against unexpected changes is testing. In the ideal world, you have a robust test suite that you run against a staging instance of a service to validate that any changes have no impact. In the absence of testing, you are left with fear, uncertainty, and doubt. FUD is an especially horrible philosophy when it comes to managing servers.

Unfortunately, we don't really have a great testing infrastructure for Mozilla's Mercurial server. And I want to change that.

Reproducing the Server Environment

When writing tests, it is important for the thing being tested to be as similar as possible to the real thing. This is why so many people have an aversion to mocking: every time you alter the test environment, you run the risk that those differences from reality will mask changes seen in the real environment.

So, it makes sense that a good first goal for creating a test suite against our Mercurial server should be to reproduce the production server and environment as closely as possible.

I'm currently working on a Vagrant environment that attempts to reproduce the official environment as closely as possible. It starts one virtual machine for the SSH/master server. It starts a separate virtual machine for the hgweb/slave servers. The virtual machines are booting CentOS. This is different than production, where we run RHEL. But they are similar enough (and can share the same packages) that the differences shouldn't matter too much, at least for now.

Using Puppet

In production, Mozilla is using Puppet to manage the Mercurial servers. Unfortunately, the actual Puppet configs that Mozilla is running are behind a firewall, mainly for security reasons. This is potentially a huge setback for my reproducibility effort, as I'd like to have my virtual machines use the same exact Puppet configs as whats used in production so the environments match as closely as possible. This would also save me a lot of work from having to reinvent the wheel.

Fortunately, Ben Kero has extracted the Mercurial-relevant Puppet config files into a standalone repository. Apparently that repository gets rolled into the production Puppet configs periodically. So, my virtual machines and production can share the same Mercurial Puppet files. Nice!

It wasn't long after starting to use the standalone Puppet configs that I realized this would be a rabbit hole. This first manifests in the standalone Puppet code referencing things that exist in the hidden Mozilla Puppet files. So the liberation was only partially successful. Sad panda.

So, I'm now in the process of creating a fake Mozilla Puppet environment that mimics the base Mozilla environment (from the closed repo) and am modifying the shared Puppet Mercurial code to work with both versions. This is a royal pain, but it needs to be done if we want to reproduce production and maintain peace of mind that test results reflect reality.

Because reproducing runtime environments is important for reproducing and solving bugs and for testing, I call on the maintainers of Mozilla's closed Puppet repository to liberate it from behind its firewall. I'd like to see a public Puppet configuration tree available for all to use so that anyone anywhere can reproduce the state of a server or service operated by Mozilla to within reasonable approximation. Had this already been done, it would have saved me hours of work. As it stands, I'm reverse engineering systems and trying to cobble together understanding of how the Mozilla Puppet configs work and what parts of them can safely be ignored to reproduce an approximate testing environment.

Along that vein, I finally got access to Mozilla's internal Puppet repository. This took a few meetings and apparently a lot of backroom chatter was generated - "developer's don't normally get access, oh my!" All I wanted was to see how systems are configured so I can help improve them. Instead, getting access felt like pulling teeth. This feels like a major roadblock towards productivity, reproducibility, and testing.

Facebook gives its developers access to most production machines and trusts them to not be stupid. I know we (Mozilla) like to hold ourselves to a high standard of security and privacy. But not giving developers access to the configurations for the systems their code runs on feels like a very silly policy. I hope Mozilla invests in opening up this important code and data, if not to the world, at least to its trusted employees.

Anyway, hopefully I'll soon have a Vagrant environment that allows people to build a standalone instance of Mozilla's Mercurial server. And once that's in place, I can start writing tests that basic services and workflows (including repository syncing) work as expected. Stay tuned.