Mercurial Extension for Gecko Development
July 22, 2013 at 10:27 AM | categories: Mercurial, Mozilla

My weekend was spent hacking on Mercurial extensions. First, I worked on porting the pushlog extension off SQLite. This will eventually enable Mozilla to move Mercurial hosting off NFS and should make hg.mozilla.org much faster as a result!
But the main purpose of this blog post is to introduce a new Mercurial extension I wrote this weekend!
Gecko developers perform a number of common tasks with Mercurial, so I thought it would be handy to package them up in an extension.
To install the extension:
hg clone https://hg.mozilla.org/hgcustom/version-control-tools
Then add this extension to your hgrc file (either the global hgrc or a per-repository one will suffice):
[extensions]
mozext = /path/to/version-control-tools/hgext/mozext
Since I believe tools should be self-documenting, run the following for usage info:
$ hg help mozext
Here are some examples:
# Clone mozilla-central into the mc directory.
hg clone central mc

# Create a unified Mercurial repository containing changesets
# from all the release repositories.
hg cloneunified gecko

# Pull changes from the central and inbound repositories.
hg pull central
hg pull inbound

# Update the working tree to the tip of inbound.
hg up inbound/default

# View the tree open/closed status.
hg treestatus

# Show a list of all known trees and their aliases.
hg moztrees

# Open TBPL for the push containing a changeset.
hg tbpl inbound 821e984ef423
hg tbpl inbound inbound/default

# Push the tip of inbound to mozilla-central.
hg pushtree -r inbound/default central
I've only tested this extension with Mercurial 2.6 (which every Mozilla developer should be running). I'm not willing to support older versions. Upgrade already!
There are a number of features I'd like to implement:
- hg importtry - Automatically import changesets for a Try push into the repository.
- hg land - Automatically land patches on an integration tree (like inbound). Will handle rebasing automatically.
- hg critic - Perform style checking and other analysis on a changeset or group of changesets.
- Ability to integrate build status into changeset info. This will allow things such as pulling only the last green changeset. I'd also like a build status field to appear in the log output. Unfortunately, I believe the latency of the build lookup API is prohibitively high to perform the kind of tight integration I'd like.
- Move mozautomation Python package into a standalone package or integrate already existing code (did I reinvent the wheel?).
- Log fetching. Specify a changeset and fetch build/test logs.
- Possibly move code into mozilla-central.
- Possibly add mach commands for some of this functionality.
There's no bug component for this extension (yet). If you find any issues or wish to add a feature, just email a patch to me at gps@mozilla.com.
Please let me know if you find this useful or if you have any questions.
Analysis of Firefox's Build Automation
July 16, 2013 at 06:15 PM | categories: Mozilla

Mozilla operates thousands of machines whose sole role is to build Firefox (and related applications), run tests, perform static analysis, etc. This is collectively referred to as the Firefox/Mozilla build automation or just automation. The output of all this automation can be seen at tbpl.mozilla.org.
In this post, I'll give an overview of how all this automation works followed by a critical analysis identifying what I like and what I feel should be improved.
How Firefox automation works
Let's take a journey through what happens when you push a new revision of mozilla-central (the main Firefox repository) to Mozilla's canonical Mercurial server. (For Mozilla people, this journey is roughly the same regardless of which automation-enabled project branch you push to.) While Firefox's automation infrastructure kicks off builds on several platforms and operating systems, for simplicity I'm going to limit low-level technical details to our 64-bit Linux builds.
Before I begin, a disclaimer: I'm not a subject matter expert in much of what I'm about to say. There are people who spend an order of magnitude more time than I do touching the systems I'm about to describe. If I get something wrong, please contact me and I'll update this post.
Let's begin.
Buildbot
The heart of Firefox's build automation is a piece of software called Buildbot. Buildbot is essentially a glorified job scheduling system. I find the buildbot basics covers, well, the basics pretty well. What you need to know is that Mozilla maintains a buildbot repository that appears to contain the buildbot core plus basic customization for Mozilla.
There are buildbot masters and slaves. Masters do all the coordination and scheduling; slaves do all the real work (such as compiling Firefox). Mozilla operates a handful of masters and a few thousand slaves.
When you push code to a project branch (like mozilla-central), a buildbot master sees the push then figures out what needs to happen. For mozilla-central, the push gets translated to a request to build on several different platforms. These requests then go to a scheduler (possibly getting collapsed into a single request). These requests then get turned into jobs that run on slaves.
This logic mostly lives in the buildbot-configs repository. Of particular interest is the config.py file, which pretty much defines how buildbot is configured at Mozilla - at a high-level anyway.
When a scheduled job executes, the high-level job request is converted into low-level actions (or steps in buildbot parlance) that get executed on slaves. For example, a request to build might clone the source repository, run client.mk, package the results, etc. This logic lives in the buildbotcustom repository. It's worth highlighting the factory.py file. This file contains the beef of the logic for converting high-level jobs into actions on slaves. Start at the MozillaBuildFactory class to see exactly what goes into performing a build. Then move on to addDoBuildSteps(), which contains the command for invoking the actual build system. As you can see, there's a lot that goes into building besides just invoking the build system (which is all most developers do locally)!
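To make the relationship between factories and steps concrete, here is a minimal, schematic example of my own. It is not Mozilla's actual factory.py, and the builder name, slave names, and exact commands are placeholders; it just shows how a factory turns a high-level job into a sequence of addStep() calls:

# A minimal, hypothetical buildbot factory (not Mozilla's real factory.py).
# Each addStep() call becomes one "Started"/"Finished" section in the job log.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand
from buildbot.config import BuilderConfig

c = BuildmasterConfig = {'builders': []}  # in a real master.cfg this dict holds the full config

factory = BuildFactory()
factory.addStep(ShellCommand(
    name='clone',
    command=['hg', 'clone', 'https://hg.mozilla.org/mozilla-central', '.'],
    haltOnFailure=True,
))
factory.addStep(ShellCommand(
    name='compile',
    command=['make', '-f', 'client.mk'],
    haltOnFailure=True,
))
factory.addStep(ShellCommand(
    name='package',
    command=['make', '-C', 'obj-firefox', 'package'],
))

c['builders'].append(BuilderConfig(
    name='linux64-build',             # placeholder builder name
    slavenames=['linux64-slave-01'],  # placeholder slave names
    factory=factory,
))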
For many automation jobs, there is an additional component that comes into play: mozharness. mozharness is relatively new to the Firefox build automation landscape so you may not be familiar with it. A goal of mozharness is to largely migrate the low-level logic from buildbotcustom - the logic that converts a high-level job request into low-level buildbot steps (typically command invocations) - into a separate, standalone entity that doesn't depend on buildbot. A goal is to enable developers to run mozharness locally and run automation jobs just like the official automation infrastructure does. If you have time, I encourage you to read the mozharness FAQ to learn more. My understanding is mozharness will eventually power all of the jobs currently defined in buildbotcustom, so I recommend getting acquainted with mozharness.
In the mozharness world, automation jobs are defined as scripts. Here's the marionette script. You just execute a script (with ideally as few arguments as possible) and mozharness does the rest. Instead of buildbot defining a job with, say, 12 steps and holding the logic for configuring those steps itself, buildbot just says run the marionette mozharness script. Since very little business logic now lives in buildbot, this essentially reduces buildbot's role to just job scheduling.
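To illustrate the shape of such a script without quoting mozharness's actual API (which I won't reproduce from memory), here is a toy, self-contained example in the same spirit: a flat list of named actions, each backed by a method, driven by a run() loop. Everything here, including the repository URL, is invented for illustration:

# Toy illustration of an action-driven automation script in the spirit of
# mozharness. This does NOT use the real mozharness classes or signatures.
import subprocess


class MarionetteLikeScript(object):
    all_actions = ['clobber', 'pull', 'install-deps', 'run-tests']

    def clobber(self):
        subprocess.check_call(['rm', '-rf', 'build'])

    def pull(self):
        # Placeholder repository URL.
        subprocess.check_call(['hg', 'clone', 'https://example.com/repo', 'build'])

    def install_deps(self):
        subprocess.check_call(['pip', 'install', '-r', 'build/requirements.txt'])

    def run_tests(self):
        subprocess.check_call(['python', 'build/runtests.py'])

    def run(self, actions=None):
        # buildbot (or a developer) just says "run these actions"; all of the
        # business logic lives in the script itself.
        for action in actions or self.all_actions:
            getattr(self, action.replace('-', '_'))()


if __name__ == '__main__':
    MarionetteLikeScript().run()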
And that is essentially how the automation determines what to run. Now let's talk about the machines automation runs on.
Machine provisioning
Earlier, I said Mozilla operates thousands of buildbot slaves. Let's talk about how those slaves come into existence.
A slave is just a fancy name for a machine, either physical or virtual. These machines are owned or operated by Mozilla. Mozilla either buys a physical machine or rents one from a cloud provider, like Amazon EC2.
For hopefully obvious reasons, it is important for the configuration of these machines to be consistent. Let's talk about how that is done. Keep in mind I'm talking about Linux machines. OS X and Windows machines go through a different procedure.
When a new machine is acquired, it needs an operating system. There is a kickstart config file that installs CentOS 6.2. At the end of the base OS flash/install, it configures Puppet to talk to a central Puppet master. This Puppet infrastructure is called PuppetAgain and its files are stored in the puppet repository.
Puppet is let loose on the fresh OS install and eventually the machine is configured so it is homogeneous with other similar machines in the automation infrastructure. Presumably, Puppet continually polls the central Puppet master and applies the latest configuration.
Part of the puppetization of this machine involves installing the buildbot client. The client eventually registers with the buildbot master and waits for jobs to process.
So, we've described how machines are provisioned as buildbot slaves and how buildbot jobs are converted to actions/steps/commands to be performed on slaves. Let's examine a job in more detail.
Running a build job
Before I talk about the details of a build job, it's worth mentioning that nearly everything I described up until this point is largely hidden from view for most Firefox developers. As far as I know, things like Puppet logs are hidden from public view. And there shouldn't be anything terribly wrong with that: the Puppet configs are public, after all. Unless you are affiliated with Release Engineering or the Automation and Tools Team (A*Team) or hack on a component that warrants its own piece of automation, you probably aren't too concerned with how all of this works.
Anyway, it's finally time to start talking about something almost every Firefox developer has done: build Firefox from source.
As I mentioned above, code in buildbotcustom (to be replaced by mozharness someday) is responsible for turning a Firefox build job into a series of actions/steps/commands to run on a slave. And, lucky for us, the activity of a slave is captured and saved to text log files! If you've ever used TBPL, you've almost certainly clicked a link to view one of these logs.
In this section, I'll describe the steps performed in this log from a recent mozilla-central build on a 64-bit Linux machine. I will be paying particular attention to steps that affect the build environment (for reasons that will be revealed in my critique below).
If you load our log of interest next to factory.py from buildbotcustom, you can start to see how they are related. You may notice self.addStep() calls in factory.py correspond to ========= Started ... ========= Finished sections in the log. That's no accident: every addStep() call produces a section like that in the log.
Now let's look at some of those steps in detail.
The job/log contains a number of set property steps. Search for set props and you'll find them in the log. These steps define named properties in a hash map buildbot uses to represent the current configuration/state. Think of these properties as a way for buildbot to communicate metadata between masters and slaves.
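For example, using the same 0.8-era buildbot API as the schematic factory earlier (values illustrative, not taken from Mozilla's configs), one step can record a property from a command's output and a later step can interpolate it into its own command:

# Illustrative only: one step sets a property, a later step consumes it.
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import SetProperty, ShellCommand
from buildbot.process.properties import WithProperties

factory = BuildFactory()
factory.addStep(SetProperty(
    command=['hg', 'identify', '-i'],
    property='got_revision',
))
factory.addStep(ShellCommand(
    name='echo-revision',
    command=WithProperties('echo building revision %(got_revision)s'),
))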
One of the first interesting steps we see is the cloning of the tools repository. Search for Started clone build tools and you'll find it. This repository contains a lot of support tools and scripts used by all parts of automation. There are lots of useful tools in there!
Skipping over the steps that check whether to clobber the builder or purge old content from disk, the next build steps relevant to our interests involve the population of a mock environment. Search for Started mock-tgt mozilla-centos6-x86_64 to find it in the logs.
Mock is a piece of software that manages chroots. It was written by the Fedora project for creating isolated build environments for software packages. For reasons unknown to me, Mozilla runs a forked version of mock called mock_mozilla.
The build job creates a fresh mock environment on every build. (This is clearly indicated by the INFO: chroot (/builds/mock_mozilla/mozilla-centos6-x86_64) unlocked and deleted line in the log.) Later on, builds are performed inside this mock environment. This means that every build job is mostly isolated from both the underlying operating system and all build jobs that came before.
You can see the creation of the new mock environment by looking for the mock_mozilla -r mozilla-centos6-x86_64 --init command in the log. This is using the mozilla-centos6-x86_64 configuration file to create the new environment. This file is managed by Puppet, so you can see it in the puppet repository. The setup command on line 11 is the most important line in this file: it defines what commands to run to initialize the new mock environment.
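For reference, a mock config file is itself a Python file that populates a config_opts dictionary (mock provides the dict when it execs the file). The sketch below is heavily abbreviated and the package list and URL are invented; the real mozilla-centos6-x86_64.cfg in the puppet repository is the authoritative version:

# Abbreviated, illustrative mock configuration - not the real Mozilla file.
# mock provides the config_opts dict; this file just fills it in.
config_opts['root'] = 'mozilla-centos6-x86_64'
config_opts['target_arch'] = 'x86_64'

# The setup command: what gets installed into every freshly created chroot.
# (Package list invented for illustration.)
config_opts['chroot_setup_cmd'] = 'install autoconf213 ccache gcc-c++ glibc-devel zip'

# Yum configuration used inside the chroot. The baseurl is a placeholder.
config_opts['yum.conf'] = """
[main]
cachedir=/var/cache/yum

[releng]
name=releng
baseurl=http://example.com/releng/yum/
"""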
Populating the mock environment takes a number of buildbot steps. After copying a bunch of files into the mock/chroot, the mock environment is further initialized by installing a number of packages. Search for Started mock-install in the log. Yum is being used to install a number of packages required to build Firefox. This package list appears to come from config.py in the buildbot-configs repository. These packages are downloaded from a Yum repository hosted by Mozilla. Altogether, 249 new packages consuming 821 MB are installed during this step!
After the mock environment is created, the mozharness repository is removed, re-cloned, and updated to the production branch. After that, we pull changes from mozilla-central and update the local checkout to the revision we've been told to build.
If you search for Started got mozconfig, you'll find where the mozconfig file (the build system configuration file) is acquired.
After that, the tooltool configuration is consulted. Tooltool is essentially a content-addressable file store: files are stored and retrieved by the hash value of their content. A manifest file inside mozilla-central defines the set of files to fetch from tooltool at build time. At this step of the build, that file is consulted and listed files are downloaded. While the Linux manifest is currently empty, the OS X tooltool manifest defines the digest of the Clang archive to use to build the tree.
Since tooltool files are addressed by content (not merely by name), the same file will be fetched no matter when the build runs. In other words, behavior stays constant even as the contents of the tooltool repository change (or at least it should).
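The content-addressing idea itself fits in a few lines of Python. The manifest fields and fetch URL below are my own illustrative assumptions, not the exact tooltool client or wire format:

# Sketch of content-addressable fetching a la tooltool (illustrative, not the
# real tooltool client). Files are requested by the digest of their contents,
# so the same manifest always yields byte-identical files.
import hashlib
import urllib2

BASE_URL = 'http://example.com/tooltool'  # placeholder server

def fetch(entry):
    # A manifest entry is assumed to look like:
    # {"filename": "clang.tar.bz2", "algorithm": "sha512", "digest": "..."}
    url = '%s/%s/%s' % (BASE_URL, entry['algorithm'], entry['digest'])
    data = urllib2.urlopen(url).read()

    # Verify that what we received really hashes to the requested digest.
    h = hashlib.new(entry['algorithm'])
    h.update(data)
    if h.hexdigest() != entry['digest']:
        raise ValueError('digest mismatch for %s' % entry['filename'])

    with open(entry['filename'], 'wb') as fh:
        fh.write(data)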
After tooltool contents sync, we finally arrive at the actual build step. Search for Started compile in the log. The important detail here is that mock_mozilla is used to build firefox with client.mk inside the fresh mock/chroot environment.
After building, we move on to other tasks, such as creating a distributable package, running tests, etc. I'm just going to glance over them because there's a lot going on and it would take a long time to explain it all! I encourage you to look at the steps in the log and learn.
While these steps are going on, the buildbot master is notified that the build and packaging aspect of the job has completed (sometime before make check is executed). Upon receiving this success notification, the buildbot master schedules derived jobs, notably all the test suites (reftests, mochitests, xpcshell tests, etc). This scheduling occurs before make check has completed so overall turnaround time is reduced as much as possible.
Finally, at the end of the log, the slave/machine is rebooted. At this point, the job has finished. TBPL colors the B next to the job green to signal successful completion. The slave waits for its next job from the buildbot master.
And that is how Firefox is built! I could go into details for all of the derived test jobs, but I won't, as that would take a lot of effort! I leave that as an exercise for the reader.
Analysis
The Firefox build automation is complex and composed of many pieces. It's a testament to a lot of people's hard work that it works as well as it does!
From my experience at a previous job managing large numbers of servers in a production datacenter, I commend Release Engineering for deploying Puppet to help ensure the machines performing Firefox's build automation are in a consistent state. I also like how mock is used to mostly isolate build jobs from one another. These are both very important to ensure Firefox is built consistently over time.
The on-going migration of automation logic from buildbotcustom to mozharness is a fantastic project and I hope it is completed soon. I hold hope that one day we can integrate mozharness into the local development workflow (likely transparently through tools like mach) and local developers can invoke actions using the same code path as the official automation infrastructure.
Areas for improvement
The Firefox build automation largely works well. And I don't mean to take away from that. However, there are a number of areas where things could be improved. In this section, I'll talk about some of them.
Before I get into details, let me share an experience I had about a year ago.
A tale of modifying the xpcshell test harness
In April 2012, I was writing a lot of JavaScript testing code and was frustrated at how difficult it was to share test helper code between tests. I filed bug 748490 to request a new feature in the JavaScript test harnesses: the ability to create testing-only JavaScript modules that weren't shipped as part of Firefox. This would encourage code reuse among tests and make writing tests easier. I thought it would be a relatively simple feature to implement!
And, it was. At first. The initial implementation landed two weeks after I filed the bug (it didn't require too much effort, but I was busy with other tasks). It landed without incident, although nothing was using it and there was no test coverage. However, my local development started relying on it and things were working just fine.
When I attempted to actually land a test that made use of this new feature, it failed because the new directory for these shared modules wasn't being archived. Simple enough to fix, right? Wrong. Start reading bug 755339 from comment #4. It is a trail of agony. I had to uplift my change to all the major branches. Then, once that was done, buildbotcustom could be updated. Uplifts were performed on May 24. buildbotcustom was pushed out to production a week later on May 31. It immediately broke the world. At the time, philor says it was the worst tree bustage he'd ever seen. Literally every tree was red. Achievement unlocked. A workaround was quickly devised on May 31 (after the buildbotcustom change was backed out to restore working automation) and landed. Again, it had to be uplifted to all the trees. This patch conflicted when applied on older trees and I accidentally committed a typo and broke beta, release, and esr in the process. I got an earful for breaking these trees and lost a lot of street cred with the sheriffs (who are charged with keeping law and order in the land of the source trees). The buildbotcustom change finally landed on June 7 and stuck.
Finally, my initial idea of adding testing-only JavaScript modules had been implemented and was available for all to use. It only required uplifting a patch to all project branches, having Release Engineering reconfigure buildbotcustom, and breaking all the trees in the process. This experience was truly WTFOMGBBQ. And, the agony above was from adding a seemingly innocuous feature to a single test harness. Can you imagine what it's like to add a new test harness to automation?!
In-tree automation configs different from Release Engineering's
The above example demonstrates many shortcomings in the current automation infrastructure. The first one I will talk about is that the in-tree automation configs (how to perform a specific automation job, such as run an individual test harness) are almost completely different from what Release Engineering runs on the official infrastructure!
A year ago, mozilla-central had testsuite-targets.mk - a make file defining targets like xpcshell-tests and mochitest-plain that allowed developers to run these test suites locally. Release Engineering had buildbotcustom, which didn't use testsuite-targets.mk at all. While invoking most test harnesses is simply a matter of formulating arguments for a Python script, that logic was duplicated between testsuite-targets and buildbotcustom.
Today, Release Engineering has been porting job invocation code to mozharness. And, I have been encouraging people to locally run tests with mach. Initially, mach commands executed make. However, commands are now bypassing make and calling into the Python test harnesses natively. This allows more advanced behavior since mach has more control and insight into the underlying test harness. The downside is mach is now reinventing logic.
So, we now have 3 or 4 separate implementations for performing many of our automation jobs. Medium term, you can expect Release Engineering to consolidate all the buildbotcustom logic into mozharness, eliminating one of them. I also think it is inevitable that testsuite-targets is refactored to invoke mach commands or is just removed altogether. But that still leaves us with 2 independent implementations!
I believe we should work towards consolidating the logic for job invocation to inside mozilla-central. The commands developers run locally should be as similar as possible to what runs on official automation infrastructure. Any differences introduce potential for different results. Differences also increase the burden to roll out changes. I would argue that an in-tree change should be all that is necessary to change the behavior of official automation infrastructure.
How this is accomplished, I'm not entirely certain. In my ideal world, I think mozharness would live in tree. Although, I understand that may not be practical because much of what mozharness does involves downloading just-built packages, uploading results to a server, etc. I think a middle ground that accomplishes most of what we seek is for the Release Engineering configs (mozharness) to contain as little logic as possible for actual job invocation and let something in-tree do the rest. For example, the mozharness job for the xpcshell test harness would simply be execute the run-xpcshell-tests script from the tests archive. mozilla-central could then have full domain over what exactly is done. If mozilla-central changes, everyone picks up those changes immediately.
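In that world, the Release Engineering side of an xpcshell job might shrink to something like this (a hypothetical sketch; run-xpcshell-tests is the hypothetical in-tree entry point mentioned above, not a real script):

# Hypothetical "thin" automation action: all harness knowledge lives in the
# tests archive produced by the mozilla-central build.
import subprocess

def run_xpcshell_job(tests_dir):
    # run-xpcshell-tests is the hypothetical in-tree entry point from above.
    subprocess.check_call(['python', 'run-xpcshell-tests'], cwd=tests_dir)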
Local run-time environments different from official ones
In a similar vein to the previous section, there are discrepancies between local run-time environments and official ones. By run-time environment, I mean the state of the operating system (installed packages, configuration settings, etc). This plays an important role in determining how an automation job executes. An obvious example is the compiler toolchain. GCC 4.5 is obviously going to have different behavior from GCC 4.7.
While Release Engineering has taken steps to ensure consistency in the configuration of the machines powering the official automation infrastructure, local machine configurations are living in the wild west. Developers are effectively unable to recreate the official automation environment. This lowers the likelihood that a local build will have the same outcome when performed on official automation infrastructure.
While supporting diverse run-time environments will result in greater compatibility and is thus a good thing, I think it is important that local developers have the ability to recreate the official run-time environment as closely as possible. This will raise confidence that local results can be trusted and should cut down on development cost by reducing the number of Try pushes and reducing development cycles.
Recreating the official run-time environment varies in difficulty depending on the operating system. For Linux, it should actually be pretty easy! For the build environment, Mozilla simply needs to package the mock environment used to build or at least could publish a script used to create said environment. In bug 886226, I made an initial stab at this by creating a Vagrantfile that will kinda/sorta recreate the official 64-bit Linux build environment. I even published an archive of the mock environment (which can be imported into Docker). As pointed out on the bug, my work isn't perfect. But it's a start. And it's better than what developers have today to reproduce the official environments.
If we decide publishing archives of chroot environments is the way to go, I believe we could extend the Linux solution to OS X as well. OS X has chroot. It also has a sandboxing facility (man sandbox). There are also tools like jailkit. Using chroot environments would also likely make our automation faster, since using archives will almost surely be faster than recreating the environment on every job run. Bug 851294 has been filed to track this.
What about Windows? Well, I don't know. We sort of have an environment with MozillaBuild. But, it's not isolated from the rest of the operating system like chroot environments are. Modulo Windows, two out of three tier one platforms isn't too bad!
Now, even if the software environment is similar, hardware differences can also affect results. Unfortunately, there's little that can be done on this front aside from having everybody use the same hardware models. That's not going to happen. So let's pretend we don't have that problem.
Configurations change over time
I recently wrote a post on the importance of time on machine configurations. The gist is that configurations that always pull from the tip of version control or rely on external resources often vary with time and aren't truly idempotent. This makes it difficult to reproduce specific configurations at future points in time.
Unfortunately, Mozilla's build automation is highly susceptible to configuration variance over time. There are many examples.
The buildbot configuration is periodically pushed to the infrastructure. A job will inherit the config that is currently deployed. If you make a backwards incompatible change to the buildbot configs and push an old revision of mozilla-central, things will blow up.
There are two time-dependent aspects of Puppet bootstrapping. First, the Puppet configs (like buildbot configs) are periodically pushed to a central master server. Whichever version is deployed at a given time is the version picked up by the automation job. Second, we don't do a good job of pinning versions inside the Puppet config. Here are the Mercurial, mock, and Python manifests using the latest package version. This means if a new version of a package is added to Mozilla's Yum repositories that new package version will be deployed on the next Puppet sync. This is definitely not time independent. Fortunately, the base operating system configuration for Linux isn't terribly important because of the mock environment. But, it still varies.
Like the base operating system, the mock environment is full of dependence on time. First, the config file itself is managed by Puppet and thus susceptible to its time-dependent behavior. Second, the packages installed in the mock environment aren't version pinned. This means that whatever packages are deployed in Mozilla's Yum repositories will be used. So, every time the Yum package repository is updated, we potentially switch the configuration of our build environment. On the surface, this terrifies me. Maybe we have a strict update policy in place to prevent excessive package updates. The modified times of packages on the server seem to indicate that. Even if we didn't update the package repository, it still seems a bit fragile: I'd much prefer to pin package versions everywhere than rely on the content of a Yum server.
The default branch of the tools repository is always checked out at build time. As new revisions are added to this repository, builds will pick them up immediately. If you make a backwards incompatible change to the tools repository, old revisions may no longer build.
The production branch of the mozharness repository is always used during jobs. Again, if mozharness changes, every future job sees these changes.
The takeaway here is that many of the tightly-related systems/repositories aren't linked in any way. The automation configuration simply pulls the tip of each. We run into problems when we wish to make backwards incompatible changes. If you make one, you first have to update all repositories to be compatible. This is a huge pain (see my experience above). Furthermore, it makes it impossible to have successful automation runs from old revisions of those repositories!
The solution to this mess is to pin versions and configs everywhere. The repositories being put through automation (namely mozilla-central) should contain revisions of the configs to use. It should say I want to use revision X of mozharness, revision Y of the build machine configuration, etc. This will enable much more consistent automation output over time. It cuts down on surprises (did behavior change on July 14 because of a change to mozilla-central or to a new automation config being rolled out to the server?). It allows people to build very old packages. This means Mozilla wouldn't need to store terabytes of old builds around (we could just trigger the build again and get the same output).
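Concretely, I imagine a small in-tree pin file (entirely hypothetical; the file name, fields, and revisions below are made up) that automation consults before doing anything else:

# Hypothetical in-tree pin file, e.g. testing/automation-pins.json:
#
#   {
#     "mozharness": "8a3f60cbe4a0",
#     "buildbot-configs": "c4d2e9b71f55",
#     "mock-environment": "mozilla-centos6-x86_64-20130714"
#   }
#
# Automation would read it and use exactly those revisions.
import json
import subprocess

def use_pinned_mozharness(mozharness_dir, pins_path):
    with open(pins_path) as fh:
        pins = json.load(fh)
    subprocess.check_call(['hg', 'update', '-r', pins['mozharness']],
                          cwd=mozharness_dir)
    return pins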
I recognize that deterministic automation configuration is likely not completely achievable. But, that doesn't mean we shouldn't work towards it. Having something better than today enables so many more useful scenarios and flexibility in our automation. Let me explain.
Say you want to add or remove a test harness. Or, maybe you want to add a required argument or remove an obsolete argument from a test harness. The procedure for doing this today is far from trivial. It's not enough to simply land your change in mozilla-central and be done with it. Instead, you need to land support in buildbotcustom or mozharness, prepare to land in mozilla-central, then coordinate with Release Engineering to have your changes landed around the time of a server deployment. And since changes likely affect multiple trees, you're likely also landing things in inbound, fx-team, services-central, possibly aurora, beta, and release, etc. If you aren't landing in aurora, beta, release, esr, b2g, etc, you need to remember to have your change merged into these server configurations when those trees inherit your code (although Release Engineering is typically pretty good about tracking this and doing it for you).
Contrast this with simply making a backwards incompatible change by checking in support in mozharness then making the change in mozilla-central along with a revision bump of the mozharness revision to use. When that changeset is pushed to the infrastructure, a compatible version of mozharness is used. When that change gets merged into other trees, the appropriate mozharness revision is used. No extra work needed. Utopia. You just made A*Team and Release Engineering much more productive by eliminating a lot of overhead.
Conclusion
The Mozilla automation infrastructure is a complex beast. There are many moving parts and separate systems. Any newcomer to Mozilla should simply stand in awe that so many systems seamlessly work so well together.
There are improvements that can be made, sure (especially in the area of deterministic behavior over time). But, I think with the mozharness work and the Puppetization of the server infrastructure that Mozilla is on the right course. I look forward to a future where the source code in mozilla-central and the automation infrastructure are more tightly integrated. We're trending there. Give it time.
Quantifying Mozilla's Automation Efficiency
July 14, 2013 at 11:15 PM | categories: Mozilla

Mozilla's build and test automation now records system resource usage (CPU, memory, and I/O) during mozharness jobs. It does this by adding a generic resource collection feature to mozharness. If a mozharness script inherits from a specific class, it magically performs system resource collection and reporting!
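Under the hood it's just periodic polling of psutil counters. Here is a simplified sketch of the idea (my own simplification, not the actual mozharness mixin):

# Simplified resource poller - not the real mozharness code.
import psutil

def collect(duration_seconds, interval=1.0):
    cpu_samples, mem_samples = [], []
    io_start = psutil.disk_io_counters()
    for _ in range(int(duration_seconds / interval)):
        # cpu_percent(interval=...) blocks for `interval` seconds per sample.
        cpu_samples.append(psutil.cpu_percent(interval=interval))
        mem_samples.append(psutil.virtual_memory().percent)
    io_end = psutil.disk_io_counters()

    print('Total resource usage: avg CPU %.1f%%, avg memory %.1f%%, %d MB written' % (
        sum(cpu_samples) / len(cpu_samples),
        sum(mem_samples) / len(mem_samples),
        (io_end.write_bytes - io_start.write_bytes) / 1048576))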
I want to emphasize that the current state of the feature is far from complete. There are numerous shortcomings and areas for improvement:
- At the time I'm writing this, the mozharness patch is only deployed on the Cedar tree. Hopefully it will be deployed to the production infrastructure shortly.
- This feature only works for mozharness jobs. Notably absent are desktop builds.
- psutil - the underlying Python package used to collect data - isn't yet installable everywhere. As Release Engineering rolls it out to other machine classes in bug 893254, those jobs will magically start collecting resource usage.
- While detailed resource usage is captured during job execution, we currently only report a very high-level summary at job completion time. This will be addressed with bug 893388.
- Jobs running on virtual machines appear to misreport CPU usage (presumably due to CPU steal being counted as utilized CPU). Bug 893391 tracks.
- You need to manually open logs to view resource usage. (e.g. open this log and search for Total resource usage.) I hope to one day have key metrics reported in TBPL output and/or easily graphable.
- Resource collection operates at the system level. Because there is only 1 job running on a machine and slaves typically do little else, we assume system resource usage is a sufficient proxy for automation job usage. This obviously isn't always correct. But, it was the easiest to initially implement.
Those are a lot of shortcomings! And, it essentially means only OS X test jobs are providing meaningful data now. But, you have to start somewhere. And, we have more data now than we did before. That's progress.
Purpose and Analysis
Collecting resource usage of automation jobs (something I'm quite frankly surprised we weren't doing before) should help raise awareness of inefficient machine utilization and other hardware problems. It will allow us to answer questions such as are the machines working as hard as they can or is a particular hardware component contributing to slower automation execution.
Indeed a casual look at the first days of the data has shown some alarming readings, notably the abysmal CPU efficiency of our test jobs. For an OS X 10.8 opt build, the xpcshell job only utilized an average of 10% CPU during execution. A browser chrome job only utilized 12% CPU on average. Finally, a reftest job only utilized 13%.
Any CPU cycle not utilized by our automation infrastructure is forever lost and cannot be put to work again. So, to only utilize 10-13% of available CPU cycles during test jobs is wasting a lot of machine potential. This is the equivalent of buying 7.7 to 10 machines and only turning 1 of them on! Or, in terms of time, it would reduce the wall time execution of a 1 hour job to 6:00 to 7:48. Or in terms of overall automation load, it would significantly decrease the backlog and turnaround time. You get my drift. This is why parallelizing test execution within test suites - a means to increase CPU utilization - is such an exciting project to me. This work is all tracked in bug 845748 and in my opinion it cannot complete soon enough. (I'd also like to see more investigation into bottlenecks in test execution. Even small improvements of 1 or 2% can have a measurable impact when multiplied by thousands of tests per suite and hundreds of test job runs per day.)
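For the record, the arithmetic behind those figures:

# Restating 10-13% CPU utilization a few different ways.
for utilization in (0.10, 0.13):
    machines_equivalent = 1 / utilization   # 10.0 and ~7.7 machines
    busy_minutes = 60 * utilization         # 6.0 and ~7.8 minutes of a 1 hour job
    print('%d%% busy: like powering on 1 of %.1f machines; '
          'only %.1f minutes of CPU work per hour' % (
              utilization * 100, machines_equivalent, busy_minutes))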
Another interesting observation is that there is over 1 GB of write I/O during some test jobs. Browser chrome tests write close to 2GB! That is surprisingly high to me. Are the tests really incurring that much I/O? If so, which ones? Do they need to? If not tests, what background service is performing that much work? Could I/O wait be slowing tests down? Should we invest in more SSDs? More science is definitely needed.
I hope people find this data useful and that we put it to use to make more data-driven decisions around Mozilla's automation infrastructure.
The Importance of Time on Automated Machine Configuration
June 24, 2013 at 09:00 PM | categories: sysadmin, Mozilla, Puppet

Usage of machine configuration management software like Puppet and Chef has taken off in recent years. And rightly so - these pieces of software make the lives of countless system administrators much better (in theory).
In their default (and common) configuration, these pieces of software do a terrific job of ensuring a machine is provisioned with today's configuration. However, for many server provisioning scenarios, we actually care about yesterday's configuration.
In this post, I will talk about the importance of time when configuring machines.
Describing the problem
If you've worked on any kind of server application, chances are you've had to deal with a rollback. Some new version of a package or web application is rolled out to production. However, due to unforeseen problems, it needed to be rolled back.
Or, perhaps you operate a farm of machines that continuously build or compile software from version control. It's desirable to be able to reproduce the output from a previous build (ideally bit identical).
In these scenarios, wall time plays a crucial role when dealing with a central, master configuration server (such as a Puppet master).
Since a client will always pull the latest revision of its configuration from the server, it's very easy to define your configurations such that the result of machine provisioning today is different from yesterday (or last week or last month).
For example, let's say you are running Puppet to manage a machine that sits in a continuous integration farm and recompiles a source tree over and over. In your Puppet manifest you have:
package { "gcc":
  ensure => latest
}
If you run Puppet today, you may pull down GCC 4.7 from the remote package repository because 4.7 is the latest version available. But if you run Puppet tomorrow, you may pull down GCC 4.8 because the package repository has been updated! If for some reason you need to rebuild one of today's builds tomorrow (perhaps you want to rebuild that revision plus a minor patch), they'll use different compiler versions (or any package for that matter) and the output may not be consistent - it may not even work at all! So much for repeatability.
File templates are another example. In Puppet, file templates are evaluated on the server and the results are sent to the client. So, the output of file template execution today might be different from the output tomorrow. If you needed to roll back your server to an old version, you may not be able to do that because the template on the server isn't backwards compatible! This can be worked around, sure (commonly by copying the template and branching differences), but over time these hacks accumulate in a giant pile of complexity.
The common issue here is that time has an impact on the outcome of machine configuration. I refer to this issue as time-dependent idempotency. In other words, does time play a role in the supposedly idempotent configuration process? If the output is consistent no matter when you run the configuration, it is time-independent and truly idempotent. If it varies depending on when configuration is performed, it is time-dependent and thus not truly idempotent.
Solving the problem
My attitude towards machine configuration and automation is that it should be as time independent as possible. If I need to revert to yesterday's state or want to reproduce something that happened months ago, I want strong guarantees that it will be similar, if not identical. Now, this is just my opinion. I've worked in environments where we had these strong guarantees. And having had this luxury, I abhor the alternative where so many pieces of configuration vary over time as the central configuration moves forward without the ability to turn back the clock. As always, your needs may be different and this post may not apply to you!
I said as possible a few times in the previous paragraph. While you could likely make all parts of your configuration time independent, it's not a good idea. In the real world, things change over time and making all configuration data static regardless of time will produce a broken or bad configuration.
User access is one such piece of configuration. Employees come and go. Passwords and SSH keys change. You don't want to revert user access to the way it was two months ago, restoring access to a disgruntled former employee or allowing access via a compromised password. Network configuration is another. Say the network topology changed and the firewall rules need updating. If you reverted the networking configuration, the machine likely wouldn't work on the network!
This highlights an important fact: if making your machine configuration time independent is a goal, you will need to bifurcate configuration by time dependency and solve for both. You'll need to identify every piece of configuration and ask do I put this in the bucket that is constant over time or the bucket that changes over time?
Machine configuration software can do a terrific job of ensuring an applied configuration is idempotent. The problem is it typically can't manage both time-dependent and time-independent attributes at the same time. Solving this requires a little brain power, but is achievable if there is will. In the next section, I'll describe how.
Technical implementation
Time-dependent machine configuration is a solved problem. Deploy a Puppet master (or similar) and you are good to go.
Time-independent configuration is a bit more complicated.
As I mentioned above, the first step is to isolate all of the configuration you want to be time independent. Next, you need to ensure time dependency doesn't creep into that configuration. You need to identify things that can change over time and take measures to ensure those changes won't affect the configuration. I encourage you to employ the external system test: does this aspect of configuration depend on an external system or entity? If so how will I prevent changes in it over time from affecting us?
Package repositories are one such external system. New package versions are released all the time. Old packages are deleted. If your configuration says to install the latest package, there's no guarantee the package version won't change unless the package repository doesn't change. If you simply pin a package to a specific version, that version may disappear from the server. The solution: pin packages to specific versions and run your own package mirror that doesn't delete or modify existing packages.
Does your configuration fetch a file from a remote server or use a file as a template? Cache that file locally (in case it disappears) and put it under version control. Have the configuration reference the version control revision of that file. As long as the repository is accessible, the exact version of the file can be retrieved at any time without variation.
In my professional career, I've used two separate systems for managing time-independent configuration data. Both relied heavily on version control. Essentially, all the time-independent configuration data is collected into a single repository - an independent repository from all the time-dependent data (although that's technically an implementation detail). For Puppet, this would include all the manifests, modules, and files used directly by Puppet. When you want to activate a machine with a configuration, you simply say check out revision X of this repository and apply its configuration. Since revision X of the repository is constant over time, the set of configuration data being used to configure the machine is constant. And, if you've done things correctly, the output is idempotent over time.
In one of these systems, we actually had two versions of Puppet running on a machine. First, we had the daemon communicating with a central Puppet master. It was continually applying time-dependent configuration (user accounts, passwords, networking, etc). We supplemented this with a manually executed standalone Puppet instance. When you ran a script, it asked the Puppet master for its configuration. Part of that configuration was the revision of the time-independent Git repository containing the Puppet configuration files the client should use. It then pulled the Git repo, checked out the specified revision, merged the Puppet master's settings for the node with that config (not the manifests, just some variables), then ran Puppet locally to apply the configuration. While a machine's configuration typically referenced a SHA-1 of a specific Git commit to use, we could use anything git checkout understood. We had some machines running master or other branches if we didn't care about time-independent idempotency for that machine at that time. What this all meant was that if you wanted to roll back a machine's configuration, you simply specified an earlier Git commit SHA-1 and then re-ran local Puppet.
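One way to implement that standalone half is a small wrapper along these lines (my own sketch of the general shape, not the code we actually ran; paths and revisions are placeholders):

# Sketch: apply a time-independent Puppet configuration pinned to a git commit.
# Not the actual production code; paths and the revision are placeholders.
import subprocess

def apply_pinned_config(rev, checkout='/var/lib/pinned-puppet'):
    # Fetch the configuration repository and check out the pinned revision.
    subprocess.check_call(['git', 'fetch', 'origin'], cwd=checkout)
    subprocess.check_call(['git', 'checkout', rev], cwd=checkout)

    # Run standalone (masterless) Puppet against that exact tree.
    subprocess.check_call([
        'puppet', 'apply',
        '--modulepath', checkout + '/modules',
        checkout + '/manifests/site.pp',
    ])

# Rolling back is just: apply_pinned_config('<an earlier SHA-1>')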
We were largely satisfied with this model. We felt like we got the best of both worlds. And, since we were using the same technology (Puppet) for time-dependent and time-independent configuration, it was a pretty simple-to-understand system. A downside was there were two Puppet instances instead of one. With a little effort, someone could probably devise a way for the Puppet master to merge the two configuration trees. I'll leave that as an exercise for the reader. Perhaps someone has done this already! If you know of someone, please leave a comment!
Challenges
The solution I describe does not come without its challenges.
First, deciding whether a piece of configuration is time dependent or time independent can be quite complicated. For example, should a package update for a critical security fix be time dependent or time independent? It depends! What's the risk of the machine not receiving that update? How often is that machine rolled back? Is that package important to the operation/role of that machine (if so, I'd lean more towards time independent)?
Second, minimizing exposure to external entities is hard. While I recommend putting as much as possible under version control in a single repository and pinning versions everywhere when you interface with an external system, this isn't always feasible. It's probably a silly idea to have your 200 GB Apt repository under version control and distributed locally to every machine in your network. So, you end up introducing specialized one-off systems as necessary. For our package repository, we just ran an internal HTTP server that only allowed inserts (no deletes or mutates). If we were creative, we could have likely devised a way for the client to pass a revision with the request and have the server dynamically serve from that revision of an underlying repository. Although, that may not work for every server type due to limited control over client behavior.
Third, ensuring compatibility between the time-dependent configuration and time-independent configuration is hard. This is a consequence of separating those configurations. Will a time-independent configuration from a revision two years ago work with the time-dependent configuration of today? This issue can be mitigated by first having as much configuration as possible be time independent and second not relying on wide support windows. If it's good enough to only support compatibility for time-independent configurations less than a month old, then it's good enough! With this issue, I feel you are trading long-term future incompatibility for well-defined and understood behavior in the short to medium term. That's a trade-off I'm willing to make.
Conclusion
Many machine configuration management systems only care about idempotency today. However, with a little effort, it's possible to achieve consistent state over time. This requires a little extra effort and brain power, but it's certainly doable.
The next time you are programming your system configuration tool, I hope you take the time to consider the effects time will have and that you will take the necessary steps to ensure consistency over time (assuming you want that, of course).
Using Docker to Build Firefox
May 19, 2013 at 01:45 PM | categories: Mozilla, Firefox

I have the privilege of having my desk located around a bunch of really intelligent people from the Mozilla Services team. They've been talking a lot about all the new technologies around server provisioning. One that interested me is Docker.
Docker is a pretty nifty piece of software. It's essentially a glorified wrapper around Linux Containers. But, calling it that is doing it an injustice.
Docker interests me because it allows simple environment isolation and repeatability. I can create a run-time environment once, package it up, then run it again on any other machine. Furthermore, everything that runs in that environment is isolated from the underlying host (much like a virtual machine). And best of all, everything is fast and simple.
For my initial experimentation with Docker, I decided to create an environment for building Firefox.
Building Firefox with Docker
To build Firefox with Docker, you'll first need to install Docker. That's pretty simple.
Then, it's just a matter of creating a new container with our build environment:
curl https://gist.github.com/indygreg/5608534/raw/30704c59364ce7a8c69a02ee7f1cfb23d1ffcb2c/Dockerfile | docker build
The output will look something like:
FROM ubuntu:12.10
MAINTAINER Gregory Szorc "gps@mozilla.com"
RUN apt-get update
===> d2f4faba3834
RUN dpkg-divert --local --rename --add /sbin/initctl && ln -s /bin/true /sbin/initctl
===> aff37cc837d8
RUN apt-get install -y autoconf2.13 build-essential unzip yasm zip
===> d0fc534feeee
RUN apt-get install -y libasound2-dev libcurl4-openssl-dev libdbus-1-dev libdbus-glib-1-dev libgtk2.0-dev libiw-dev libnotify-dev libxt-dev mesa-common-dev uuid-dev
===> 7c14cf7af304
RUN apt-get install -y binutils-gold
===> 772002841449
RUN apt-get install -y bash-completion curl emacs git man-db python-dev python-pip vim
===> 213b117b0ff2
RUN pip install mercurial
===> d3987051be44
RUN useradd -m firefox
===> ce05a44dc17e
Build finished. image id: ce05a44dc17e
ce05a44dc17e
As you can see, it is essentially bootstrapping an environment to build Firefox.
When this has completed, you can activate a shell in the container by taking the image id printed at the end and running it:
docker run -i -t ce05a44dc17e /bin/bash

# You should now be inside the container as root.
su - firefox
hg clone https://hg.mozilla.org/mozilla-central
cd mozilla-central
./mach build
If you want to package up this container for distribution, you just find its ID then export it to a tar archive:
docker ps -a
# Find ID of container you wish to export.
docker export 2f6e0edf64e8 > image.tar
# Distribute that file somewhere.
docker import - < image.tar
Simple, isn't it?
Future use at Mozilla
I think it would be rad if Release Engineering used Docker for managing their Linux builder configurations. Want to develop against the exact system configuration that Mozilla uses in its automation - you could do that. No need to worry about custom apt repositories, downloading custom toolchains, keeping everything isolated from the rest of your system, etc: Docker does that all automatically. Mozilla simply needs to publish Docker images on the Internet and anybody can come along and reproduce the official environment with minimal effort. Once we do that, there are few excuses for someone breaking Linux builds because of an environment discrepancy.
Release Engineering could also use Docker to manage isolation of environments between builds. For example, it could spin up a new container for each build or test job. It could even save images from the results of these jobs. Have a weird build failure like a segmentation fault in the compiler? Publish the Docker image and have someone take a look! No need to take the builder offline while someone SSH's into it. No need to worry about the probing changing state because you can always revert to the state at the time of the failure! And, builds would likely start faster. As it stands, our automation spends minutes managing packages before builds begin. This lag would largely be eliminated with Docker. If nothing else, executing automation jobs inside a container would allow us to extract accurate resource usage info (CPU, memory, I/O) since the Linux kernel effectively gives containers their own namespace independent of the global system's.
I might also explore publishing Docker images that construct an ideal development environment (since getting recommended tools in the hands of everybody is a hard problem).
Maybe I'll even consider hooking up build system glue to automatically run builds inside containers.
Lots of potential here.
Conclusion
I encourage Linux users to play around with Docker. It enables some new and exciting workflows and is a really powerful tool despite its simplicity. So far, the only major faults I have with it are that the docs say it should not be used in production (yet) and it only works on Linux.