Mozilla operates thousands of machines whose sole role is to build Firefox (and related applications), run tests, perform static analysis, etc. This is collectively referred to as the Firefox/Mozilla build automation or just automation. The output of all this automation can be seen at tbpl.mozilla.org.
In this post, I'll give an overview of how all this automation works followed by a critical analysis identifying what I like and what I feel should be improved.
How Firefox automation works
Let's take a journey through what happens when you push a new revision of mozilla-central (the main Firefox repository) to Mozilla's canonical Mercurial server. (For Mozilla people, this journey is roughly the same regardless of which automation-enabled project branch you push to.) While Firefox's automation infrastructure kicks off builds on several platforms and operating systems, for simplicity, I'm going to limit low-level technical details to our 64-bit Linux builds.
Before I begin, a disclaimer: I'm not a subject-matter expert in much of what I'm about to say. There are people who spend an order of magnitude more time than I do touching the systems I'm about to describe. If I get something wrong, please contact me and I'll update this post.
The heart of Firefox's build automation is a piece of software called Buildbot. Buildbot is essentially a glorified job scheduling system. I find the buildbot basics covers, well, the basics pretty well. What you need to know is that Mozilla maintains a buildbot repository that appears to contain the buildbot core plus basic customization for Mozilla.
There are buildbot masters and slaves. Masters do all the coordination and scheduling; slaves do all the real work (such as compiling Firefox). Mozilla operates a handful of masters and a few thousand slaves.
When you push code to a project branch (like mozilla-central), a buildbot master sees the push then figures out what needs to happen. For mozilla-central, the push gets translated to a request to build on several different platforms. These requests then go to a scheduler (possibly getting collapsed into a single request). These requests then get turned into jobs that run on slaves.
When a scheduled job executes, the high-level job request is converted into low-level actions (or steps in buildbot parlance) that get executed on slaves. For example, a request to build might clone the source repository, run client.mk, package the results, etc. This logic lives in the buildbotcustom repository. It's worth highlighting the factory.py file. This file contains the beef of the logic for converting high-level jobs into actions on slaves. Start at the MozillaBuildFactory class to see exactly what goes into performing a build. Then move on to addDoBuildSteps(), which contains the command for invoking the actual build system. As you can see, there's a lot that goes into building besides just invoking the build system (like most developers do)!
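To make the idea of a factory concrete, here is a minimal sketch in plain Python. It does not use the real buildbot API and it is not what factory.py contains: the SketchBuildFactory class, the Step structure, the step names, and the exact commands are all hypothetical. It only illustrates how a high-level build request might be expanded into an ordered list of low-level steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One low-level action to run on a slave (hypothetical structure)."""
    name: str
    command: List[str]


@dataclass
class SketchBuildFactory:
    """Expands a high-level build request into ordered steps.

    This is NOT buildbotcustom's MozillaBuildFactory; it only mimics the idea.
    """
    repo_url: str
    revision: str
    steps: List[Step] = field(default_factory=list)

    def add_step(self, name: str, command: List[str]) -> None:
        self.steps.append(Step(name, command))

    def add_source_steps(self) -> None:
        # Illustrative commands; the real automation does considerably more.
        self.add_step("clone", ["hg", "clone", self.repo_url, "build"])
        self.add_step("update", ["hg", "update", "-r", self.revision])

    def add_build_steps(self) -> None:
        self.add_step("compile", ["make", "-f", "client.mk", "build"])
        self.add_step("package", ["make", "package"])


def build_steps(repo_url: str, revision: str) -> List[Step]:
    """Turn one build request into the list of steps a slave would run."""
    factory = SketchBuildFactory(repo_url, revision)
    factory.add_source_steps()
    factory.add_build_steps()
    return factory.steps


if __name__ == "__main__":
    # The URL and revision below are placeholders.
    for step in build_steps("https://example.invalid/repo", "abc123"):
        print(step.name, " ".join(step.command))
```

Each add_step() call appends exactly one entry, so the order in which the factory methods are called is the order in which a slave would execute the work.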
For many automation jobs, there is an additional component that comes into play: mozharness. mozharness is relatively new to the Firefox build automation landscape so you may not be familiar with it. A goal of mozharness is to largely migrate the low-level logic from buildbotcustom - the logic that converts a high-level job request into low-level buildbot steps (typically command invocations) - into a separate, standalone entity that doesn't depend on buildbot. A goal is to enable developers to run mozharness locally and run automation jobs just like the official automation infrastructure does. If you have time, I encourage you to read the mozharness FAQ to learn more. My understanding is mozharness will eventually power all of the jobs currently defined in buildbotcustom, so I recommend getting acquainted with mozharness.
In the mozharness world, automation jobs are defined as scripts. Here's the marionette script. You just execute a script (with ideally as few arguments as possible) and mozharness does the rest. Instead of buildbot defining a job with, say, 12 steps and holding all the logic for configuring those steps, buildbot just says: run the marionette mozharness script. Since very little business logic now lives in buildbot, this essentially reduces buildbot's role to just job scheduling.
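The "job as a script" idea can be sketched as follows. This is not the mozharness API: the SketchScript class, its action names, and its log messages are invented for illustration. The point is only that the ordered list of actions lives in the script itself, so the scheduler (or a developer) needs nothing more than a single command to run the whole job.

```python
import sys
from typing import List, Optional


class SketchScript:
    """A standalone job script made of named actions.

    Hypothetical: this does not reproduce the mozharness API.
    """

    # Actions always run in this order, even when a subset is selected.
    all_actions = ["clobber", "download", "run_tests"]

    def __init__(self) -> None:
        self.log: List[str] = []

    def clobber(self) -> None:
        self.log.append("removed old work directory")

    def download(self) -> None:
        self.log.append("fetched build and tests")

    def run_tests(self) -> None:
        self.log.append("ran test harness")

    def run(self, actions: Optional[List[str]] = None) -> List[str]:
        """Run the selected actions (default: all) and return the log."""
        selected = actions or self.all_actions
        for name in self.all_actions:
            if name in selected:
                getattr(self, name)()
        return self.log


if __name__ == "__main__":
    # Buildbot (or a developer) only needs to invoke this one script.
    for line in SketchScript().run(sys.argv[1:] or None):
        print(line)
```

Running the script with no arguments performs every action; passing action names on the command line runs only those, which is the kind of thing a developer reproducing a failure locally might want.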
And that is essentially how the automation determines what to run. Now let's talk about the machines automation runs on.
Earlier, I said Mozilla operates thousands of buildbot slaves. Let's talk about how those slaves come into existence.
A slave is just a fancy name for a machine, either physical or virtual. These machines are owned or operated by Mozilla. Mozilla either buys a physical machine or rents one from a cloud provider, like Amazon EC2.
For hopefully obvious reasons, it is important for the configuration of these machines to be consistent. Let's talk about how that is done. Keep in mind I'm talking about Linux machines. OS X and Windows machines go through a different procedure.
When a new machine is acquired, it needs an operating system. There is a kickstart config file that installs CentOS 6.2. At the end of the base OS flash/install, it configures Puppet to talk to a central Puppet master. This Puppet infrastructure is called PuppetAgain and its files are stored in the puppet repository.
Puppet is let loose on the fresh OS install and eventually the machine is configured so it is homogeneous with other similar machines in the automation infrastructure. Presumably, Puppet continually polls the central Puppet master and applies the latest configuration.
Part of the puppetization of this machine involves installing the buildbot client. The client eventually registers with the buildbot master and waits for jobs to process.
So, we've described how machines are provisioned as buildbot slaves and how buildbot jobs are converted to actions/steps/commands to be performed on slaves. Let's examine a job in more detail.
Running a build job
Before I talk about the details of a build job, it's worth mentioning that nearly everything I described up until this point is largely hidden from view from most Firefox developers. As far as I know, things like Puppet logs are hidden from public view. And there shouldn't be anything terribly wrong with that: the Puppet configs are public, after all. Unless you are affiliated with Release Engineering or the Automation and Tools Team (A*Team) or hack on a component that warrants its own piece of automation, you probably aren't too concerned with how all of this works.
Anyway, it's finally time to start talking about something almost every Firefox developer has done: build Firefox from source.
As I mentioned above, code in buildbotcustom (to be replaced by mozharness someday) is responsible for turning a Firefox build job into a series of actions/steps/commands to run on a slave. And, lucky for us, the activity of a slave is captured and saved to text log files! If you've ever used TBPL, you've almost certainly clicked a link to view one of these logs.
In this section, I'll describe the steps performed in this log from a recent mozilla-central build on a 64-bit Linux machine. I will be paying particular attention to steps that affect the build environment (for reasons that will be revealed in my critique below).
If you load our log of interest next to factory.py from buildbotcustom, you can start to see how they are related. You may notice self.addStep() calls in factory.py correspond to ========= Started ... ========= Finished sections in the log. That's no accident: every addStep() call produces a section like that in the log.
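If you want to list the steps in a log without scrolling through it, a small parser is enough. The sketch below assumes that each marker line consists of a run of = characters, the word Started or Finished, and then the step description; real logs may carry extra fields on those lines, and the sample log here is invented rather than copied from a real build.

```python
import re
from typing import List

# Assumed marker shape: '=' run, "Started" or "Finished", step description,
# optionally followed by another '=' run. Real logs may differ in detail.
MARKER = re.compile(r"^=+\s+(Started|Finished)\s+(.*?)\s*=*$")


def list_steps(log_text: str) -> List[str]:
    """Return the description of every step that has a 'Started' marker."""
    steps = []
    for line in log_text.splitlines():
        match = MARKER.match(line.strip())
        if match and match.group(1) == "Started":
            steps.append(match.group(2))
    return steps


# Invented sample that mimics the Started/Finished section layout.
SAMPLE_LOG = """\
========= Started clone build tools =========
cloning...
========= Finished clone build tools =========
========= Started compile =========
building...
========= Finished compile =========
"""

if __name__ == "__main__":
    print(list_steps(SAMPLE_LOG))
```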
Now let's look at some of those steps in detail.
The job/log contains a number of set property steps. Search for set props and you'll find them in the log. These steps define named properties in a hash map buildbot uses to represent the current configuration/state. Think of these properties as a way for buildbot to communicate metadata between masters and slaves.
One of the first interesting steps we see is the cloning of the tools repository. Search for Started clone build tools and you'll find it. This repository contains a lot of support tools and scripts used by all parts of automation. There's lots of useful tools in there!
Skipping over the steps that check whether to clobber the builder or purge old content from disk, the next build steps relevant to our interests involve the population of a mock environment. Search for Started mock-tgt mozilla-centos6-x86_64 to find it in the logs.
Mock is a piece of software that manages chroots. It was written by the Fedora project for creating isolated build environments for software packages. For reasons unknown to me, Mozilla runs a forked version of mock called mock_mozilla.
The build job creates a fresh mock environment on every build. (This is clearly indicated by the INFO: chroot (/builds/mock_mozilla/mozilla-centos6-x86_64) unlocked and deleted line in the log.) Later on, builds are performed inside this mock environment. This means that every build job is mostly isolated from both the underlying operating system and all build jobs that came before.
You can see the creation of the new mock environment by looking for the mock_mozilla -r mozilla-centos6-x86_64 --init command in the log. This is using the mozilla-centos6-x86_64 configuration file to create the new environment. This file is managed by Puppet, so you can see it in the puppet repository. The setup command on line 11 is the most important line in this file: it defines what commands to run to initialize the new mock environment.
Populating the mock environment takes a number of buildbot steps. After copying a bunch of files into the mock/chroot, the mock environment is further initialized by installing a number of packages. Search for Started mock-install in the log. Yum is being used to install a number of packages required to build Firefox. This package list appears to come from config.py in the buildbot-configs repository. These packages are downloaded from a Yum repository hosted by Mozilla. Altogether, 249 new packages consuming 821 MB are installed during this step!
After the mock environment is created, the mozharness repository is removed, re-cloned, and updated to the production branch. After that, we pull changes from mozilla-central and update the local checkout to the revision we've been told to build.
If you search for Started got mozconfig, you'll find where the mozconfig file (the build system configuration file) is acquired.
After that, the tooltool configuration is consulted. Tooltool is essentially a content-addressable file store: files are stored and retrieved by the hash value of their content. A manifest file inside mozilla-central defines the set of files to fetch from tooltool at build time. At this step of the build, that file is consulted and listed files are downloaded. While the Linux manifest is currently empty, the OS X tooltool manifest defines the digest of the Clang archive to use to build the tree.
Since tooltool files are addressed by content (not merely by name), this means that the same file will be fetched no matter when the build runs. In other words, behavior is constant as the contents of the tooltool repository itself change (or at least it should be).
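The content-addressing idea can be sketched in a few lines of Python. This is an in-memory illustration, not tooltool's implementation: the ContentStore class is hypothetical, and the choice of SHA-512 as the digest is an assumption made for the example.

```python
import hashlib
from typing import Dict


class ContentStore:
    """In-memory content-addressable store (illustrative only)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store data and return the digest that addresses it."""
        digest = hashlib.sha512(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch data by digest, verifying it still matches."""
        data = self._blobs[digest]
        if hashlib.sha512(data).hexdigest() != digest:
            raise ValueError("content does not match digest")
        return data


if __name__ == "__main__":
    store = ContentStore()
    old = store.put(b"toolchain archive, first upload")
    store.put(b"toolchain archive, later upload")
    # A manifest that recorded `old` still resolves to the first upload.
    print(store.get(old))
```

Because a manifest records the digest rather than a file name, uploading a newer archive never changes what an older manifest resolves to.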
After tooltool contents sync, we finally arrive at the actual build step. Search for Started compile in the log. The important detail here is that mock_mozilla is used to build Firefox with client.mk inside the fresh mock/chroot environment.
After building, we move on to other tasks, such as creating a distributable package, running tests, etc. I'm just going to gloss over them because there's a lot going on and it would take a long time to explain it all! I encourage you to look at the steps in the log and learn.
While these steps are going on, the buildbot master is notified that the build and packaging aspect of the job has completed (sometime before make check is executed). Upon receiving this success notification, the buildbot master schedules derived jobs, notably all the test suites (reftests, mochitests, xpcshell tests, etc). This scheduling occurs before make check has completed so overall turnaround time is reduced as much as possible.
Finally, at the end of the log, the slave/machine is rebooted. At this point, the job has finished. TBPL colors the B next to the job green to signal successful completion. The slave waits for its next job from the buildbot master.
And that is how Firefox is built! I could go into details for all of the derived test jobs, but I won't, as that would take a lot of effort! I leave that as an exercise for the reader.
The Firefox build automation is complex and composed of many pieces. It's a testament to a lot of people's hard work that it works as well as it does!
From my experience at a previous job managing large numbers of servers in a production datacenter, I commend Release Engineering for deploying Puppet to help ensure the machines performing Firefox's build automation are in a consistent state. I also like how mock is used to mostly isolate build jobs from one another. These are both very important to ensure Firefox is built consistently over time.
The on-going migration of automation logic from buildbotcustom to mozharness is a fantastic project and I hope it is completed soon. I hold hope that one day we can integrate mozharness into the local development workflow (likely transparently through tools like mach) and local developers can invoke actions using the same code path as the official automation infrastructure.
Areas for improvement
The Firefox build automation largely works well. And I don't mean to take away from that. However, there are a number of areas where things could be improved. In this section, I'll talk about some of them.
Before I get into details, let me share an experience I had about a year ago.
A tale of modifying the xpcshell test harness
And, it was. At first. The initial implementation landed two weeks after I filed the bug (it didn't require too much effort, but I was busy with other tasks). It landed without incident, although nothing was using it and there was no test coverage. However, my local development started relying on it and things were working just fine.
When I attempted to actually land a test that made use of this new feature, it failed because the new directory for these shared modules wasn't being archived. Simple enough to fix, right? Wrong. Start reading bug 755339 from comment #4. It is a trail of agony. I had to uplift my change to all the major branches. Then, once that was done, buildbotcustom could be updated. Uplifts were performed on May 24. buildbotcustom was pushed out to production a week later on May 31. It immediately broke the world. At the time, philor said it was the worst tree bustage he'd ever seen. Literally every tree was red. Achievement unlocked. A workaround was quickly devised on May 31 (after the buildbotcustom change was backed out to restore working automation) and landed. Again, it had to be uplifted to all the trees. This patch conflicted when applied on older trees and I accidentally committed a typo and broke beta, release, and esr in the process. I got an earful for breaking these trees and lost a lot of street cred with the sheriffs (who are charged with keeping law and order in the land of the source trees). The buildbotcustom change finally landed on June 7 and stuck.
In-tree automation configs different from Release Engineering's
The above example demonstrates many shortcomings in the current automation infrastructure. The first one I will talk about is that the in-tree automation configs (how to perform a specific automation job, such as run an individual test harness) are almost completely different from what Release Engineering runs on the official infrastructure!
A year ago, mozilla-central had testsuite-targets.mk - a make file defining targets like xpcshell-tests and mochitest-plain that allowed developers to run these test suites locally. Release Engineering had buildbotcustom, which didn't use testsuite-targets.mk at all. While invoking most test harnesses is simply a matter of formulating arguments for a Python script, that logic was duplicated between testsuite-targets and buildbotcustom.
Today, Release Engineering has been porting job invocation code to mozharness. And, I have been encouraging people to locally run tests with mach. Initially, mach commands executed make. However, commands are now bypassing make and calling into the Python test harnesses natively. This allows more advanced behavior since mach has more control and insight into the underlying test harness. The downside is mach is now reinventing logic.
So, we now have 3 or 4 separate implementations for performing many of our automation jobs. Medium term, you figure Release Engineering consolidates all the buildbotcustom logic into mozharness, eliminating 1. I also think it is inevitable that testsuite-targets is refactored to invoke mach commands or is just removed altogether. But that still leaves us with 2 independent implementations!
I believe we should work towards consolidating the logic for job invocation to inside mozilla-central. The commands developers run locally should be as similar as possible to what runs on official automation infrastructure. Any differences introduce potential for different results. Differences also increase the burden to roll out changes. I would argue that an in-tree change should be all that is necessary to change the behavior of official automation infrastructure.
How this is accomplished, I'm not entirely certain. In my ideal world, I think mozharness would live in tree. Although, I understand that may not be practical because much of what mozharness does involves downloading just-built packages, uploading results to a server, etc. I think a middle ground that accomplishes most of what we seek is for the Release Engineering configs (mozharness) to contain as little logic as possible for actual job invocation and let something in-tree do the rest. For example, the mozharness job for the xpcshell test harness would simply be execute the run-xpcshell-tests script from the tests archive. mozilla-central could then have full domain over what exactly is done. If mozilla-central changes, everyone picks up those changes immediately.
Local run-time environments different from official ones
Similar in vein to the previous section is that there are discrepancies between local run-time environments and official ones. By run-time environment, I mean the state of the operating system (installed packages, configuration settings, etc). This plays an important role in determining how an automation job executes. An obvious example is the compiler toolchain. GCC 4.5 is obviously going to have different behavior from GCC 4.7.
While Release Engineering has taken steps to ensure consistency in the configuration of the machines powering the official automation infrastructure, local machine configurations are effectively living in the wild west. Developers are effectively unable to recreate the official automation environment. This lowers the likelihood that a local build will have the same outcome when performed on official automation infrastructure.
While supporting diverse run-time environments will result in greater compatibility and is thus a good thing, I think it is important that local developers have the ability to recreate the official run-time environment as closely as possible. This will raise confidence that local results can be trusted and should cut down on development cost by reducing the number of Try pushes and reducing development cycles.
Recreating the official run-time environment varies in difficulty depending on the operating system. For Linux, it should actually be pretty easy! For the build environment, Mozilla simply needs to package the mock environment used to build or at least could publish a script used to create said environment. In bug 886226, I made an initial stab at this by creating a Vagrantfile that will kinda/sorta recreate the official 64-bit Linux build environment. I even published an archive of the mock environment (which can be imported into Docker). As pointed out on the bug, my work isn't perfect. But it's a start. And it's better than what developers have today to reproduce the official environments.
If we decide publishing archives of chroot environments is the way to go, I believe we could extend the Linux solution to OS X as well. OS X has chroot. It also has a sandboxing facility (man sandbox). There are also tools like jailkit. Using archived chroot environments will also likely make our automation faster, since unpacking an archive will almost surely be faster than recreating the environment on every job run. Bug 851294 has been filed to track this.
What about Windows? Well, I don't know. We sort of have an environment with MozillaBuild. But, it's not isolated from the rest of the operating system like chroot environments are. Modulo Windows, two out of three tier one platforms isn't too bad!
Now, even if the software environment is similar, hardware differences can also affect results. Unfortunately, there's little that can be done on this front aside from having everybody use the same hardware models. That's not going to happen. So let's pretend we don't have that problem.
Configurations change over time
I recently wrote a post on the importance of time on machine configurations. The gist is that configurations that always pull from version control tip or rely on external resources often vary with time and aren't truly idempotent. This makes it difficult to reproduce specific configurations at future points in time.
Unfortunately, Mozilla's build automation is highly susceptible to configuration variance over time. There are many examples.
The buildbot configuration is periodically pushed to the infrastructure. A job will inherit the config that is currently deployed. If you make a backwards incompatible change to the buildbot configs and push an old revision of mozilla-central, things will blow up.
There are two time-dependent aspects of Puppet bootstrapping. First, the Puppet configs (like buildbot configs) are periodically pushed to a central master server. Whichever version is deployed at a given time is the version picked up by the automation job. Second, we don't do a good job of pinning versions inside the Puppet config. Here are the Mercurial, mock, and Python manifests using the latest package version. This means if a new version of a package is added to Mozilla's Yum repositories that new package version will be deployed on the next Puppet sync. This is definitely not time independent. Fortunately, the base operating system configuration for Linux isn't terribly important because of the mock environment. But, it still varies.
Like the base operating system, the mock environment is full of dependence on time. First, the config file itself is managed by Puppet and thus susceptible to its time-dependent behavior. Second, the packages installed in the mock environment aren't version pinned. This means that whatever packages are deployed in Mozilla's Yum repositories will be used. So, every time the Yum package repository is updated, we potentially switch the configuration of our build environment. On the surface, this terrifies me. Maybe we have a strict update policy in place to prevent excessive package updates. The modified times of packages on the server seem to indicate that. Even if we didn't update the package repository, it still seems a bit fragile: I'd much prefer to pin package versions everywhere than rely on the content of a Yum server.
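One cheap way to keep an eye on this would be a check that flags any package without an explicit version. The sketch below assumes a simple mapping from package name to requested version, where an empty string or the word latest means "whatever the repository has today". The mapping format, the unpinned_packages() helper, and the version strings are all invented for illustration; they do not reflect how the Puppet manifests or buildbot-configs actually store package lists.

```python
from typing import Dict, List

# Values that mean "take whatever the package repository currently offers".
FLOATING = {"", "latest"}


def unpinned_packages(packages: Dict[str, str]) -> List[str]:
    """Return the names of packages that are not pinned to a version."""
    return sorted(
        name for name, version in packages.items()
        if version.strip().lower() in FLOATING
    )


if __name__ == "__main__":
    # Hypothetical package list; names and versions are made up.
    requested = {
        "example-compiler": "4.7.2-1",
        "example-vcs": "latest",
        "example-chroot-tool": "",
    }
    for name in unpinned_packages(requested):
        print("not pinned:", name)
```

A check like this could run whenever the package list changes, so a floating version has to be a deliberate choice rather than an accident.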
The default branch of the tools repository is always checked out at build time. As new revisions are added to this repository, builds will pick them up immediately. If you make a backwards incompatible change to the tools repository, old revisions may no longer build.
The production branch of the mozharness repository is always used during jobs. Again, if mozharness changes, every future job sees these changes.
The takeaway here is that many of the tightly-related systems/repositories aren't linked in any way. The automation configuration simply pulls the tip of each. We run into problems when we wish to make backwards incompatible changes. If you make one, you first have to update all repositories to be compatible. This is a huge pain (see my experience above). Furthermore, it makes it impossible to have successful automation runs from old revisions of those repositories!
The solution to this mess is to pin versions and configs everywhere. The repositories being put through automation (namely mozilla-central) should contain revisions of the configs to use. It should say I want to use revision X of mozharness, revision Y of the build machine configuration, etc. This will enable much more consistent automation output over time. It cuts down on surprises (did behavior change on July 14 because of a change to mozilla-central or to a new automation config being rolled out to the server?). It allows people to build very old packages. This means Mozilla wouldn't need to store terabytes of old builds around (we could just trigger the build again and get the same output).
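Here is a rough sketch of what in-tree pinning could look like. Everything in it is hypothetical: the manifest format, the update_commands() helper, the /builds checkout root, and the revision values are invented. The only real pieces are Mercurial's --cwd global option and hg update -r. The validation step rejects moving targets such as branch names, since only a full or abbreviated changeset hash identifies the same code forever.

```python
import re
from typing import Dict, List

# A pin must look like a changeset hash (12 to 40 hex digits), not a branch
# or tag name, because branches and tags can move over time.
HEX_REVISION = re.compile(r"^[0-9a-f]{12,40}$")


def update_commands(pins: Dict[str, str],
                    checkout_root: str = "/builds") -> List[List[str]]:
    """Build one `hg update` command per pinned repository.

    `pins` maps a repository name to the revision the tree asks for.
    """
    commands = []
    for name, revision in sorted(pins.items()):
        if not HEX_REVISION.match(revision):
            raise ValueError(f"{name}: {revision!r} is not a pinned revision")
        commands.append(
            ["hg", "--cwd", f"{checkout_root}/{name}", "update", "-r", revision]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical manifest that would be checked in to the tree being built.
    pinned = {"mozharness": "1a2b3c4d5e6f", "tools": "0f9e8d7c6b5a"}
    for command in update_commands(pinned):
        print(" ".join(command))
```

With a manifest like this checked in next to the code, building an old revision of the tree automatically asks for the automation code that was current at the time.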
I recognize that deterministic automation configuration is likely not completely achievable. But, that doesn't mean we shouldn't work towards it. Having something better than today enables so many more useful scenarios and flexibility in our automation. Let me explain.
Say you want to add or remove a test harness. Or, maybe you want to add a required argument or remove an obsolete argument from a test harness. The procedure for doing this today is far from trivial. It's not enough to simply land your change in mozilla-central and be done with it. Instead, you need to land support in buildbotcustom or mozharness, prepare to land in mozilla-central, then coordinate with Release Engineering to have your changes landed around the time of a server deployment. And since changes likely affect multiple trees, you're likely also landing things in inbound, fx-team, services-central, possibly aurora, beta, and release, etc. If you aren't landing in aurora, beta, release, esr, b2g, etc, you need to remember to have your change merged into these server configurations when those trees inherit your code (although Release Engineering is typically pretty good about tracking this and doing it for you).
Contrast this with simply making a backwards incompatible change by checking in support in mozharness then making the change in mozilla-central along with a revision bump of the mozharness revision to use. When that changeset is pushed to the infrastructure, a compatible version of mozharness is used. When that change gets merged into other trees, the appropriate mozharness revision is used. No extra work needed. Utopia. You just made A*Team and Release Engineering much more productive by eliminating a lot of overhead.
The Mozilla automation infrastructure is a complex beast. There are many moving parts and separate systems. Any newcomer to Mozilla should simply stand in awe that so many systems seamlessly work so well together.
There are improvements that can be made, sure (especially in the area of deterministic behavior over time). But, I think with the mozharness work and the Puppetization of the server infrastructure that Mozilla is on the right course. I look forward to a future where the source code in mozilla-central and the automation infrastructure are more tightly integrated. We're trending there. Give it time.