Modern CI is Too Complex and Misdirected

April 07, 2021 at 09:00 AM | categories: CI, build system

The state of CI platforms is much stronger than it was just a few years ago. Overall, this is a good thing: access to powerful CI platforms enables software developers and companies to ship more reliable software more frequently, which benefits their users and customers. Centralized CI platforms like GitHub Actions, GitLab Pipelines, and Bitbucket provide benefits of scale, as the Internet serves as a collective information repository for how to use them. Do a search for how to do X on CI platform Y and you'll typically find some code you can copy and paste. Nobody wants to toil with wrangling their CI configuration after all: they just want to ship.

Modern CI Systems are Too Complex

The advancements in CI platforms have come at a cost: increased complexity. And the more I think about it, the more I come around to the belief that modern CI systems are too complex. Let me explain.

At its core, a CI platform is a specialized remote code execution as a service (it's a feature, not a CVE!) where the code being executed is in pursuit of building, testing, and shipping software (unless you abuse it to mine cryptocurrency). So, CI platforms typically throw in a bunch of value-add features to enable you to ship software more easily. There are vastly different approaches and business models here. (I must tip my hat to GitHub Actions leveraging network effects via community maintained actions: this lowers TCO for GitHub as they don't need to maintain many actions, creates vendor lock-in as users develop a dependence on platform-proprietary actions, all while increasing the value of the platform for end-users - a rare product trifecta.) A common value-add property of CI platforms is some kind of configuration file (often YAML) which itself offers common functionality, such as configuring the version control checkout and specifying what commands to run. This is where we start to get into problems.

(I'm going to focus on GitHub Actions here, not because they are the worst (far from it), but because they seem to be the most popular and readers can relate more easily. But my commentary applies to other platforms like GitLab as well.)

The YAML configuration of modern CI platforms is... powerful. Here are features present in GitHub Actions workflow YAML:

  • An embedded templating system that results in the source YAML being expanded into a final YAML document that is actually evaluated. This includes a custom expression mini language.
  • Triggers for when to run jobs.
  • Named variables.
  • Conditional job execution.
  • Dependencies between jobs.
  • Defining Docker-based run-time environments.
  • Encrypted secrets.
  • Steps constituting each job and what actions those steps should take.

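To make this concrete, here is a minimal sketch of a workflow exercising several of these features. (The project layout, commands, and secret name are illustrative, not taken from any real repository.)

    name: build-and-test
    on:
      push:
        branches: [main]
      pull_request:

    jobs:
      test:
        runs-on: ubuntu-latest
        container: python:3.9-slim
        steps:
          - uses: actions/checkout@v2
          - run: pip install -r requirements.txt
          - run: python -m pytest
            env:
              API_TOKEN: ${{ secrets.API_TOKEN }}

      deploy:
        needs: test
        if: github.ref == 'refs/heads/main'
        runs-on: ubuntu-latest
        steps:
          - run: ./deploy.sh ${{ github.sha }}

Even this trivial example touches triggers, the expression mini language, a Docker-based run-time environment, secrets, conditional execution, and an inter-job dependency. Real-world workflows only grow from here.
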
If we expand scope slightly to include actions maintained by GitHub, we also have steps/actions features for:

  • Performing Git checkouts.
  • Storing artifacts used by workflows/jobs.
  • Caching artifacts used by workflows/jobs.
  • Installing common programming languages and environments (like Java, Node.js, Python, and Ruby).
  • And a whole lot more.

And then of course there are 3rd party Actions. And there are a lot of them!

There's a lot of functionality here and a lot of it is arguably necessary: I'm hard pressed to name a feature to cut. (I'm no fan of using YAML as a programming language, but I concede its use is a fair compromise compared to forcing people to write code that produces YAML or makes equivalent API calls to do what the YAML would do.) All these features seem necessary for a sufficiently powerful CI offering. Nobody would use your offering if it didn't offer turnkey functionality after all.

So what's my complaint?

I posit that a sufficiently complex CI system becomes indistinguishable from a build system. I challenge you: try to convince me or yourself that GitHub Actions, GitLab CI, and other CI systems aren't build systems. The basic primitives are all there. GitHub Actions Workflows composed of jobs composed of steps are little different from, say, Makefiles composed of rules composed of commands to execute for each rule, with dependencies gluing everything together. The main difference is the form factor and the execution model (build systems are traditionally local and single machine while CI systems are remote/distributed).

Then we have a similar conjecture: a sufficiently complex build system becomes indistinguishable from a CI system. Earlier I said that CI systems are remote code execution as a service. While build systems are historically things that run locally (and therefore not a service), modern build systems like Bazel (or Buck or Gradle) are completely different animals. For example, Bazel has remote execution and remote caching as built-in features. Hey - those are built-in features of modern CI systems too! So here's a thought experiment: if I define a build system in Bazel and then define a server-side Git push hook so the remote server triggers Bazel to build, run tests, and post the results somewhere, is that a CI system? I think it is! A crude one. But I think that qualifies as a CI system.

If you squint hard enough, sufficiently complex CI systems and sufficiently complex build systems start to look like the same thing to me. At a very high level, both are providing a pool of servers offering general compute/execute functionality with specialized features in the domain of building/shipping software, like inter-task artifact exchange, caching, dependencies, and a frontend language to define how everything works.

(If you squint really hard you can start to see a value proposition of Kubernetes for even more general compute scheduling, but I'm not going to go that far in this post because it is a much harder point to make and I don't necessarily believe in it myself. But I thought I'd mention it as an interesting thought experiment. But an easier leap to make is to throw batch job execution (as is often found in data warehouses) in with build and CI systems as belonging in the same bucket: batch job execution also tends to have dependencies, exchange of artifacts between jobs, and I think can strongly resemble a CI system and therefore a build system.)

The thing that bugs me about modern CI systems is that I inevitably feel like I'm reinventing a build system and fragmenting build system logic. Your CI configuration inevitably devolves into a bunch of complex YAML with all kinds of caching and dependency optimizations to keep execution time low and reliability in check - just like your build system. You find yourself contorting your project's build system to work in the context of CI and vice versa. You end up managing two complex DAGs and platforms/systems instead of one.

Because build systems are more generic than CI systems (I think a sufficiently advanced build system can do a superset of the things that a sufficiently complex CI system can do), CI systems are redundant with sufficiently advanced build systems. So going beyond the section title: CI systems aren't just too complex: they shouldn't need to exist as separate systems. Your CI functionality should be an extension of the build system.

In addition to the redundancy argument, I think unified systems are more user friendly. By integrating your CI system into your build system (which by definition can be driven locally as part of regular development workflows), you can expose the full power of the CI system to developers more easily. Think running ad-hoc CI jobs without having to push your changes to a remote server first, just like you can with local builds or tests. This is huge for ergonomics and can drastically compress the cycle time for changes to these systems (which are often brittle to change/test).

Don't get me wrong, aspects of CI systems not traditionally found in build systems (such as centralized results reporting and a UI/API for (re)triggering jobs) absolutely need to exist. Instead, it is the remote compute and work definition aspects that are completely redundant with build systems.

Let's explore the implications of build and CI systems being more of the same.

Modern CI Offerings are Targeting the Wrong Abstraction

If you assume that build and CI systems can be / are more of the same, then it follows that many modern CI offerings like GitHub Actions, GitLab CI, and others are targeting the wrong abstraction: they are defined as domain specific platforms for running CI systems when instead they should take a step back and target the broader general compute platform that is also needed for build systems (and maybe batch job execution, such as what's commonly found in data warehouses/pipelines).

Every CI offering is somewhere different on the spectrum here. I would go so far as to argue that GitHub Actions is more a CI product than a platform. Let me explain.

In my ideal CI platform, I have the ability to schedule an ad-hoc graph of tasks against that platform. I have the ability to hit some APIs with definitions of the tasks I want that platform to run and it accepts them, executes them, uploads artifacts somewhere, reports task results so dependent tasks can execute, etc.

There is a GitHub Actions API that allows you to interact with the service. But the critical thing it doesn't let you do is define ad-hoc units of work: the actual remote execute as a service. Rather, the only way to define units of work is via workflow YAML files checked into your repository. That's so constraining!

GitLab Pipelines is a lot better. GitLab Pipelines supports features like parent-child pipelines (dependencies between different pipelines), multi-project pipelines (dependencies between different projects/repos), and dynamic child pipelines (a pipeline job generates a YAML file that defines a new pipeline). (I don't believe GitHub Actions supports any of these features.) Dynamic child pipelines are an important feature, as they mostly divorce the checked-in YAML configuration from the remote execute as a service feature. The main missing feature here is a generic API that allows you to achieve this functionality without having to go through a parent pipeline / YAML first. If that API existed, you could build your own build/CI/batch execute system on top of GitLab Pipelines with fewer constraints imposed on you by GitLab Pipelines' opinionated YAML configuration files and the intended use of its creators. (Generally, I think a good litmus test for a well-designed platform or tool is when its authors are surprised by someone's unintended use for it. Of course this knife cuts both ways, as sometimes people do undesirable things, like mine cryptocurrency.)

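To sketch how dynamic child pipelines work (the generator script below is hypothetical), a job in the parent pipeline writes a YAML file as an artifact and a trigger job runs that file as a new pipeline:

    generate-pipeline:
      stage: build
      script:
        - ./generate-child-pipeline.py > child-pipeline.yml
      artifacts:
        paths:
          - child-pipeline.yml

    run-child-pipeline:
      stage: test
      trigger:
        include:
          - artifact: child-pipeline.yml
            job: generate-pipeline
        strategy: depend

The YAML-in-YAML indirection is clunky, but it is the closest thing GitLab currently offers to a schedule arbitrary work API: the child pipeline's contents are computed at run time rather than checked in.
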
CI offerings like GitHub Actions and GitLab Pipelines are more products than platforms because they tightly couple an opinionated configuration mechanism (YAML files) and web UI (and corresponding APIs) on top of a theoretically generic remote execute as a service offering. For me to consider these offerings as platforms, they need to grow the ability to schedule arbitrary compute via an API, without being constrained by the YAML officially supported out of the box. GitLab is almost there (the critical missing link is a schedule an inline-defined pipeline API). It is unknown if GitHub is - or is even interested in - pursuing this direction. (More on this later.)

Taskcluster: The Most Powerful CI Platform You've Never Heard Of

I wanted to just mention Taskcluster in passing as a counterexample to the CI offerings that GitHub, GitLab, and others are pursuing. But I found myself heaping praises towards it, so you get a full section on Taskcluster. This content isn't critical to the overall post, so feel free to skip. But if you want to know what a CI platform built for engineers looks like or you are a developer of CI platforms and would like to read about some worthwhile ideas to steal, keep reading.

Mozilla's Taskcluster is a generic CI platform originally built for Firefox. At the time it was conceived and initially built out in 2014-2015, there was nothing else quite like it. And I'm still not aware of anything that can match its raw capabilities. There might be something proprietary behind corporate walls. But nothing close to it in the open source domain. And even the proprietary CI platforms I'm aware of often fall short of Taskcluster's feature list.

To my knowledge, Taskcluster is the only publicly available, mega project scale, true CI platform in existence.

Germane to this post, one thing I love about Taskcluster is its core primitives around defining execution units. The core execute primitive in Taskcluster is a task. Tasks are connected together to form a DAG. (This is not unlike how a build system works.)

A task is created by issuing an API request to a queue service. That API request essentially says schedule this unit of work.

Tasks are defined somewhat generically, essentially as units of arbitrary compute along with metadata, such as task dependencies, the permissions/scopes that the task has, etc. That unit of work has many of the primitives that are familiar to you if you use GitHub Actions, GitLab Pipelines, etc: a list of commands to execute, which Docker image to execute in, paths to files constituting artifacts, retry settings, etc.

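For flavor, here is roughly what a task definition submitted to the queue's createTask API looks like. (The API accepts JSON; I'm showing YAML for readability. The worker pool, scopes, and payload values are illustrative, and the exact payload schema depends on the worker implementation, so treat this as a sketch rather than a reference.)

    provisionerId: example-provisioner
    workerType: linux-docker
    created: '2021-04-07T00:00:00Z'
    deadline: '2021-04-07T04:00:00Z'
    # Tasks this task depends on - this is how the DAG is expressed.
    dependencies:
      - aBuildTaskIdGoesHere
    # Fine-grained permissions granted to this task.
    scopes:
      - secrets:get:project/example/api-token
    # Worker-specific payload: what to actually execute.
    payload:
      image: ubuntu:20.04
      command: ['/bin/bash', '-c', './run-tests.sh']
      maxRunTime: 3600
      artifacts:
        public/test-results:
          type: directory
          path: /workspace/results
    metadata:
      name: example test task
      description: Run the test suite against the build task's output
      owner: you@example.com
      source: https://example.com/project

The important part isn't the exact schema. It's that a task is just an API payload: nothing requires it to originate from a YAML file checked into a repository.
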
Taskcluster has features far beyond what are offered by GitHub, GitLab, and others today.

For example, Taskcluster offers an IAM-like scopes feature that moderates access control. Scopes control what actions you can perform, what services you have access to, which runner features you can use (e.g. whether you can use ptrace), which secrets you have access to, and more. As a concrete example, Firefox's Taskcluster settings are such that the cryptographic keys/secrets used to sign Firefox builds are inaccessible to untrusted tasks (like the equivalent of tasks initiated by PRs - the Try Server in Mozilla speak). Taskcluster is the only CI platform I'm aware of that has sufficient protections in place to mitigate the fact that CI platforms are gaping remote code execution as a service risks that can and should keep your internal security and risk teams up at night. Taskcluster's security model makes GitHub Actions, GitLab Pipelines, and other commonly used CI services look like data exfiltration and software supply chain vulnerability factories by comparison.

Taskcluster does support adding a YAML file to your repository to define tasks. However, because there's a generic scheduling API, you don't need to use it and you aren't constrained by its features. You could roll your own configuration/frontend for defining tasks: Taskcluster doesn't care because it is a true platform. In fact, Firefox mostly eschews this Taskcluster YAML, instead building out its own functionality for defining tasks. There's a pile of code checked into the Firefox repository that, when run, will derive the thousands of discrete tasks constituting Firefox's build and release DAG and will register the appropriate sub-graph as Taskcluster tasks. (This also happens to be a pile of YAML. But the programming primitives and control flow are largely absent from the YAML files, making it a bit cleaner than the DSL that GitHub Actions and GitLab CI YAML have evolved into.) This functionality is its own mini build system where the Taskcluster platform is the execution/evaluation mechanism.

Taskcluster's model and capabilities are vastly beyond anything in GitHub Actions or GitLab Pipelines today. There are a lot of great ideas worth copying.

Unfortunately, Taskcluster is very much a power user CI offering. There's no centralized instance that anyone can use (unlike GitHub or GitLab). The learning curve is quite steep. All that power comes at a cost of complexity. I can't in good faith recommend Taskcluster to casual users. But if you want to host your own CI platform, other CI offerings don't quite cut it for you, and you can afford a few people to support your CI platform on an ongoing basis (i.e. your total cost to operate CI including people and machines is >$1M annually), then Taskcluster is worth considering.

Let's get back to the post at hand.

Looking to the Future

In my ideal world there exists a single remote code execution as a service platform purpose built for servicing both near real time and batch/delayed execution. It is probably tailored towards supporting software development, as those domain specific features set it apart from generic compute as a service tools like Kubernetes, Lambda, and others. But something more generic could potentially work.

The concept of a DAG is strongly baked into the execution model so you can define execution units as a graph, capturing dependencies. Sure, you could define isolated, ad-hoc units of work. But if you wanted to define a set of units, you could do that without having to run a persistent agent to coordinate execution through completion like build systems typically do. (Think of this as uploading your DAG to an execution service.)

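Sketching what submitting such a DAG might look like (this platform doesn't exist, so the format below is entirely hypothetical):

    # Hypothetical "submit a DAG" payload for a unified build/CI/batch service.
    tasks:
      compile:
        image: debian:11
        run: ./build.sh
        produces: [dist/app]
      test:
        needs: [compile]
        run: ./run-tests.sh dist/app
      release:
        needs: [test]
        when: branch == "main"
        run: ./release.sh dist/app

You POST the graph once, the service schedules each task as its dependencies complete, and no agent on your machine has to babysit execution through to completion.
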
In my ideal world, there is a single DAG dictating all build, testing, and release tasks. There is no DAG fragmentation at the build, CI, and other batch execute boundaries. There is no N+1 system or configuration to manage and no additional platform to maintain, because everything is unified. Economies of scale apply and overall efficiency improves through consolidation.

The platform consists of pools of workers running agents capable of performing work. There are probably pools for near real time / synchronous RPC style invocations and pools for scheduled / delayed / asynchronous execution. You can define your own worker pools and bring your own workers. Advanced customers will likely throw autoscaling groups consisting of highly ephemeral workers (such as EC2 spot instances) at these pools, scaling capacity to meet demand relatively cheaply, terminating workers and machines when capacity is no longer needed to save on billing costs (this is what Firefox's Taskcluster instance has been doing for at least 6 years).

To end-users, a local build consists of driving or scheduling the subset of the complete task graph necessary to produce the build artifacts you need. A CI build/test consists of the subset of the task graph necessary to achieve that (it is probably a superset of the local build graph). Same for releasing.

As for the configuration frontend and how execution units are defined, this platform only needs to provide a single thing: an API that can be used to schedule/execute work. However, for this product offering to be user-friendly, it should offer turnkey YAML configuration files, as CI systems do today. That's fine: many (most?) users will stick to using the simplified YAML interface. Just as long as power users have an escape/scaling vector and can use the low-level schedule/execute API to write their own driver. People will write plug-ins for their build systems enabling them to integrate with this platform. Someone will coerce existing extensible build systems like Bazel, Buck, and Gradle to convert nodes in the build graph into compute tasks on this platform. This unlocks the unification of the build and CI systems (and maybe things like data pipelines too).

Finally, because we're talking about a specialized system tailored for software development, we need robust result/reporting APIs and interfaces. What good is all this fancy distributed remote compute if nobody can see what it is doing? This is probably the most specialized service of the bunch, as how you track results is exceptionally domain specific. Power users may want to build their own result tracking service, so keep that in mind. But the platform should provide a generic one (like what GitHub Actions and GitLab Pipelines do today) because it is a massive value add and few will use your product without such a feature.

Quickly, my proposed unified world will not alleviate the CI complexity concerns raised above: sufficiently large build/CI systems will always have an intrinsic complexity to them and possibly require specialists to maintain. However, because a complex CI system is almost always attached to a complex build system, by consolidating build and CI systems you reduce the surface area of complexity (you don't have to worry about build/CI interop as much). Lower fragmentation reduces overall complexity and is therefore a net win. (A similar line of thinking applies to justifying monorepositories.)

All of the components for my vision exist in some working form today. Bazel, Gradle Enterprise, and other modern build systems have RPCs for remote execution and/or caching. They are even extensible and you can write your own plugins to change core functionality for how the build system runs (to varying degrees, of course). CI offerings like Taskcluster and GitLab Pipelines support scheduling DAGs of tasks (with Taskcluster's support far more suited to the desired end state). There are batch job execution frameworks like Airflow that look an awful lot like domain-specific, specialized versions of Taskcluster. What we don't have is a single product or service with all these features bundled as a cohesive offering.

I'm convinced that building what I'd like to see is not a question of if it can be done but whether we should and who will do it.

And this is where we probably run into problems. I hate to say it, but I'm skeptical this will exist as a widely available service outside a few corporations' walls any time soon. The reason is the total addressable market.

The value of my vision is through unification of discrete systems (build, CI, and maybe some one-offs like data pipelines) that are themselves complex enough that unification is something you'd want to do for business/efficiency reasons. After all, if it isn't complex/inefficient, you probably don't care about making it simpler/faster. Right here we are probably filtering out >90% of the market because their systems just aren't complex enough for this to matter.

This vision requires adoption of a sufficiently advanced build system so it can serve as the brains behind a unified DAG driving remote execute. Some companies and projects will adopt compatible, advanced build systems like Bazel because they have the resources, technical know-how, and efficiency incentives to pull it off. But many won't. The benefit of a more advanced build system over something simpler is often marginal. Factor in that many companies perceive build and CI support as product development overhead and a virtual cost center whose line item needs to be minimized. If you can get by on a less advanced build system that is good enough for a fraction of the cost without excessive hardship, that's the path many companies and projects will follow. Again, people and companies generally don't care about wrangling build and CI systems: they just want to ship.

The total addressable market for this idea seems too small for me to see any major player with the technical know-how to implement and offer such a service in the next few years. After all, we're not even over the hurdle that what I propose (unifying build and CI systems) is a good idea. Having worked in this space for a decade, witnessed the potential of Taskcluster's model, and seen former, present, and potential employers all struggling in this space to varying degrees, I know that this idea would be extremely valuable to some. (For some companies multiple millions of dollars could be saved annually by eliminating redundant human capital maintaining similar systems, reducing machine idle/run costs, and improving turnaround times of critical development loops.) As important as this would be to some companies, my intuition is they represent such a small sliver of the total addressable market that this slice of pie is too small for an existing CI operator like GitHub or GitLab to care about at this time. There are far more lucrative opportunities. (Such as security scanning, as laws/regulation/litigation are finally catching up to the software industry and forcing companies to take security and privacy more seriously, which translates to spending money on security services. This is why GitHub and GitLab have been stumbling over each other to announce new security features over the past 1-2 years.)

I don't think a startup in this area would be a good idea: customer acquisition is too hard. And because much of the core tech already exists in existing tools, there's not much of a moat in the way of proprietary IP to keep copycats with deep pockets at bay. Your best exit here is likely an early acquisition by a Microsoft/GitHub, GitLab, or wannabe player in this space like Amazon/AWS.

Rather, I think our best hope for seeing this vision realized is that an operator of an existing major CI platform (either private or public) who also has major build system or other ad-hoc batch execute challenges will implement it and release it upon the world, either as open source or as a service offering. GitHub, GitLab, and other code hosting providers are the ideal candidates since their community effect could help drive industry adoption. But I'd happily accept pretty much any high quality offering from a reputable company!

I'm not sure when, but my money is on GitHub/Microsoft executing on this vision first. They have a stronger incentive in the form of broader market/product tie-ins (think integrated build and CI in Visual Studio or GitHub Workspaces [for Enterprises]). Furthermore, they'll feel the call from within. Microsoft has some really massive build systems and CI challenges (notably Windows). It is clear that elements of Microsoft are conducting development on GitHub, in the open even (at this point Satya Nadella's Microsoft has frozen over so many levels of hell that Dante's classics need new revisions). Microsoft engineers will feel the pain and limitations of discrete build and CI systems. Eventually there will be calls for at least a build system remote execute service/offering on GitHub. (This would naturally fall under GitHub's existing apparent market strategy of capturing more and more of the software development lifecycle.) My hope is GitHub (or whomever) will implement this as a unified platform/service/product rather than discrete services because as I've argued they are practically the same problem. But a unified offering isn't the path of least resistance, so who knows what will happen.

Conclusion

If I could snap my fingers and move the industry's discrete build, CI, and maybe batch execute (e.g. data pipelines) systems ahead 10 years, I would:

  1. Take Mozilla's Taskcluster and its best-in-class specialized remote execute as a service platform.
  2. Add support for a real-time, synchronous execute API (like Bazel's remote execute API) to supplement the existing batch/asynchronous functionality.
  3. Define Starlark dialects so you can define CI/release-like primitives in build tools like Bazel. (You could also do YAML here. But if your configuration files devolve into a DSL, just use a real programming language already.)
  4. Teach build tools like Bazel to work better with units of work that can take minutes or even hours to run (a synchronous/online driver model, as classically employed by build systems, isn't appropriate for long-running test, release, or data pipeline tasks).
  5. Throw a polished web UI for platform interaction, result reporting, etc on top.
  6. Release it to the world.

Will this dream become a reality any time soon? Probably not. But I can dream. And maybe I'll have convinced a reader to pursue it.


Investing in the Firefox Build System in 2016

January 11, 2016 at 02:20 PM | categories: Mozilla, build system

Most of Mozilla gathered in Orlando in December for an all hands meeting. If you attended any of the plenary sessions, you probably heard people like David Bryant and Lawrence Mandel make references to improving the Firefox build system and related tools. Well, the cat is out of the bag: Mozilla will be investing heavily in the Firefox build system and related tooling in 2016!

In the past 4+ years, the Firefox build system has mostly been held together and incrementally improved by a loose coalition of people who cared. We had a period in 2013 where a number of people were making significant updates (this is when moz.build files happened). But for the past 1.5+ years, there hasn't really been a coordinated effort to improve the build system - just a lot of one-off tasks and (much-appreciated) individual heroics. This is changing.

Improving the build system is a high priority for Mozilla in 2016. And investment has already begun. In Orlando, we had a marathon 3 hour meeting planning work for Q1. At least 8 people have committed to projects in Q1.

The focus of work is split between immediate short-term wins and longer-term investments. We also decided to prioritize the Firefox and Fennec developer workflows (over Gecko/Platform) as well as the development experience on Windows. This is because these areas have been under-loved and therefore have more potential for impact.

Here are the projects we're focusing on in Q1:

  • Turnkey artifact based builds for Firefox and Fennec (download pre-built binaries so you don't have to spend 20 minutes compiling C++)
  • Running tests from the source directory (so you don't have to copy tens of thousands of files to the object directory)
  • Speed up configure / prototype a replacement
  • Telemetry for mach and the build system
  • NSPR, NSS, and (maybe) ICU build system rewrites
  • mach build faster improvements
  • Improvements to build rules used for building binaries (enables non-make build backends)
  • mach command for analyzing C++ dependencies
  • Deploy automated testing for mach bootstrap on TaskCluster

Work has already started on a number of these projects. I'm optimistic 2016 will be a watershed year for the Firefox build system and the improvements will have a drastic positive impact on developer productivity.


MacBook Pro Firefox Build Times Comparison

November 05, 2013 at 10:00 AM | categories: Mozilla, build system

Many developers use MacBook Pros for day-to-day Firefox development. So, I thought it would be worthwhile to perform a comparison of Firefox build times for various models of MacBook Pros.

Test setup

The numbers in this post are obtained from 3 generations of MacBook Pros:

  1. A 2011 Sandy Bridge 4 core x 2.3 GHz with 8 GB RAM and an aftermarket SSD.

  2. A 2012 Ivy Bridge retina with 4 core x 2.6 GHz, 16 GB RAM, and a factory SSD (or possibly flash storage).

  3. A 2013 Haswell retina with 4 core x 2.6 GHz, 16 GB RAM, and flash storage.

All machines were running OS X 10.9 Mavericks and were using the Xcode 5.0.1 toolchain (Xcode 5 clang: Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)) to build.

The power settings prevented machine sleep and machines were plugged into A/C power during measuring. I did not use the machines while obtaining measurements.

The 2012 and 2013 machines were very vanilla OS installs. However, the 2011 machine was my primary work computer and may have had a few background services running and may have been slower due to normal wear and tear. The 2012 machine was a loaner machine from IT and has an unknown history.

All data was obtained from mozilla-central revision d4a27d8eda28.

The mozconfig used contained:

export MOZ_PSEUDO_DERECURSE=1
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/obj-firefox.noindex

Please note that the objdir name ends with .noindex to prevent Finder from indexing build files.

I performed all tests multiple times and used the fastest time. I used the time command to obtain measurements of wall, user, and system time.

Results

Configure Times

The result of mach configure is as follows:

Machine | Wall time (s) | User time (s) | System time (s)
2011    | 29.748        | 17.921        | 11.644
2012    | 26.765        | 15.942        | 10.501
2013    | 21.581        | 12.597        | 8.595

Clobber build no ccache

mach build was performed after running mach configure. ccache was not enabled.

Machine | Wall time    | User time     | System time | Total CPU time
2011    | 22:29 (1349) | 145:35 (8735) | 12:03 (723) | 157:38 (9458)
2012    | 15:00 (900)  | 94:18 (5658)  | 8:14 (494)  | 102:32 (6152)
2013    | 11:13 (673)  | 69:55 (4195)  | 6:04 (364)  | 75:59 (4559)

(Times in the form minutes:seconds are also given as total seconds in parentheses; the same convention applies to the tables below.)

Clobber build with empty ccache

mach build was performed after running mach configure. ccache was enabled. The ccache cache was cleared before running mach configure.

Machine | Wall time    | User time     | System time  | Total CPU time
2011    | 25:57 (1557) | 161:30 (9690) | 18:21 (1101) | 179:51 (10791)
2012    | 16:58 (1018) | 104:50 (6290) | 12:32 (752)  | 117:22 (7042)
2013    | 12:59 (779)  | 79:51 (4791)  | 9:24 (564)   | 89:15 (5355)

Clobber build with populated ccache

mach build was performed after running mach configure. ccache was enabled and the ccache was populated with the results of a prior build. In theory, all compiler invocations should be serviced by ccache entries.

This is a very crude way to estimate how fast clobber builds would be if compiler invocations were nearly instantaneous.

Machine | Wall time  | User time  | System time
2011    | 3:59 (239) | 8:04 (484) | 3:21 (201)
2012    | 3:11 (191) | 6:45 (405) | 2:53 (173)
2013    | 2:31 (151) | 5:22 (322) | 2:12 (132)

No-op builds

mach build was performed on a tree that was already built.

Machine | Wall time  | User time  | System time
2011    | 1:58 (118) | 2:25 (145) | 0:41 (41)
2012    | 1:42 (102) | 2:02 (122) | 0:37 (37)
2013    | 1:20 (80)  | 1:39 (99)  | 0:28 (28)

binaries no-op

mach build binaries was performed on a fully built tree. This results in nothing being executed. It's a way to test the overhead of the binaries make target.

Machine | Wall time (s) | User time (s) | System time (s)
2011    | 4.21          | 4.38          | 0.92
2012    | 3.17          | 3.37          | 0.71
2013    | 2.67          | 2.75          | 0.56

binaries touch single .cpp

mach build binaries was performed on a fully built tree after touching the file netwerk/dns/nsHostResolver.cpp. ccache was enabled but cleared before running this test. This test simulates common C++ developer workflow of changing C++ and recompiling.

Machine | Wall time (s) | User time (s) | System time (s)
2011    | 12.89         | 13.88         | 1.96
2012    | 10.82         | 11.63         | 1.78
2013    | 8.57          | 9.29          | 1.23

Tier times

The times of each build system tier were measured on the 2013 Haswell MacBook Pro. These timings were obtained out of curiosity to help isolate the impact of different parts of the build. ccache was not enabled for these tests.

Action          | Wall time  | User time    | System time | Total CPU time
export clobber  | 15.75      | 66.11        | 11.33       | 77.44
compile clobber | 9:01 (541) | 64:58 (3898) | 5:08 (308)  | 70:06 (4206)
libs clobber    | 1:34 (94)  | 2:15 (135)   | 0:39 (39)   | 2:54 (174)
tools clobber   | 9.33       | 13.41        | 2.48        | 15.89
export no-op    | 3.01       | 9.72         | 3.47        | 13.19
compile no-op   | 3.18       | 18.02        | 2.64        | 20.66
libs no-op      | 58.2       | 46.9         | 13.4        | 60.3
tools no-op     | 8.82       | 12.68        | 1.72        | 14.40

Observations and conclusions

The data speaks for itself: the 2013 Haswell MacBook Pro is significantly faster than its predecessors. It clocks in at 2x faster than the benchmarked 2011 Sandy Bridge model (keep in mind the 300 MHz base clock difference) and is ~34% faster than the 2012 Ivy Bridge (at similar clock speed). Personally, I was surprised by this. I was expecting speed improvements over Ivy Bridge, but not 34%.

It should go without saying: if you have the opportunity to upgrade to a new, Haswell-based machine: do it. If possible, purchase the upgrade to the 2.6 GHz CPU, as it is clocked ~13% higher than the base 2.3 GHz model: this will make a measurable difference in build times.

It's worth noting the increased efficiency of Haswell over its predecessors. The total CPU time required to build decreased from ~158 minutes to ~103 minutes to 76 minutes! That 76 minute number is worth highlighting because it means if we get 100% CPU saturation during builds, we'll be able to build the tree in under 10 wall time minutes (76 minutes of CPU time spread perfectly across the machine's 8 hardware threads works out to roughly 9.5 minutes of wall time)!

I hadn't performed crude benchmarks of high-level build system actions since the MOZ_PSEUDO_DERECURSE work landed and I wanted to use the opportunity of this hardware comparison to grab some numbers.

The overhead of ccache continues to surprise me. On the 2013 machine, enabling ccache increased the wall time of a clobber build by 1:46 and added 13:16 of CPU time. This is an increase of 16% and 17%, respectively.

It's worth highlighting just how much time is spent compiling C/C++. In our artificial tier measuring results, our clobber build time was ~660 wall time seconds (11 minutes) and used ~4473s CPU time (74:33). Of this, 9:01 wall time and 70:06 CPU time was spent compiling C/C++. This represents ~82% wall time and ~94% CPU time! Please note this does not include linking. Anything we can do to decrease the CPU time used by the compiler will make the build faster.

I also found it interesting to note variances in obtained times. Even on my brand new 2013 Haswell MacBook Pro where I know there aren't many background processes running, wall times could vary significantly. I think I isolated it to CPU bursting and heat issues. If I wait a few minutes between CPU intensive tests, results are pretty consistent. But if I perform CPU intensive tests back-to-back, the run times often vary. The only other thing coming into play could be page caching or filesystem indexing. I accounted for the latter by disabling Finder indexing on the object directory. And, I'd like to think that flash storage is fast enough to remove I/O latency from the equation. Who knows. At the end of the day, laptops aren't servers and OS X is a consumer OS, so I don't expect ultra consistency.

Finally, I want to restate just how fast Haswell is. If you have the opportunity to upgrade, do it.


Distributed Compiling and Firefox

October 31, 2013 at 11:35 AM | categories: Mozilla, build system

If you had infinite CPU cores available and the Firefox build system could distribute compilation across all of them, Firefox clobber build times would likely be 3-5 minutes instead of ~15 minutes on modern machines. This is a massive win. It therefore should come as no surprise that distributed compiling is very interesting to us.

Up until recently, the benefits of distributed compiling in the Firefox build system couldn't be fully realized. This was because the build system was performing recursive make traversal and make only knew about a tiny subset of the tree's total C++ files at one time. For example, when visiting /layout/base it only knew about 35 of the close to 6000 files that get compiled as part of building Firefox. This meant there was a hard ceiling to the max concurrency the build system could achieve. This ceiling was often higher than the number of cores in an individual machine, so it wasn't a huge issue for single machine builds. But it did significantly limit the benefits of distributed compiling. This all changed recently.

As of a few weeks ago, the build system no longer encounters a low ceiling preventing distributed compilation from reaping massive benefits. If you build with make -j128, make will spawn 128 compiler processes when processing the compile tier (which is where most compilation occurs). If your compiler is set to a distributed compiler, you will win.

So, what should you do about it?

I encourage people to set up distributed compilation networks to reap the benefits of distributed compilation. Here are some tools you should know about and some things to keep in mind.

distcc is the tried and proven tool for performing distributed compilation. It's heavily used and gets the job done. It even works on Windows and can perform remote preprocessing, which is a huge win for our tree, where preprocessing can be computationally expensive because of excessive includes. But, it has a few significant drawbacks. Read the next paragraph.

I'm personally more excited about icecream. It has some very compelling advantages over distcc. It has a scheduler that can intelligently distribute load between machines. It uses network broadcast to discover the scheduler. So, you just start the client daemon and if there is a scheduler on the local network, it's all set up. Icecream transfers the compiler toolchain between nodes so you are guaranteed to have consistent output. (With distcc, output may not be consistent if the nodes aren't homogeneous, since distcc relies on the system-local toolchain. If different versions are installed on different nodes, you are out of luck.) Icecream also supports cross-compiling. In theory, you can have Linux machines building for OS X, 32-bit machines building for 64-bit, etc. This is all very difficult (if not impossible) to do with distcc. Unfortunately, icecream doesn't work on Windows and doesn't appear to support server-side preprocessing. Although, I imagine both could be made to work if someone put in the effort.

Distributed compilation is very network intensive. I haven't measured, but I suspect Wi-Fi bandwidth and latency constraints might make it prohibitive there. It certainly won't be good for Wi-Fi saturation! If you are in a Mozilla office, please do not attempt to perform distributed compilation over Wi-Fi! For the same reasons, distributed compilation will likely not benefit you if you are attempting to compile on network-distant nodes.

I have set up an icecream server in the Mozilla San Francisco office. If you install the icecream client daemon (iceccd) on your machine, it should just work. I'm not sure what broadcast nets are configured as, but I've successfully had machines on the 7th floor discover it automatically. I guarantee no SLA for this server. Ping me privately if you have difficulty connecting.

I've started very preliminary talks with Mozilla IT about setting up dedicated compiler farms in Mozilla offices. I'm not saying this is coming any time soon. I feel this will have a major impact on developer productivity and I wanted to get the ball rolling months in advance so nobody can claim this is a fire drill.

For distributed compilation to work well, the build system really needs to be aware of distributed compilation. For example, to yield the benefits of distributed compilation with make, you need to pass -j64 or some other large value for concurrency. However, this value would be universal for every task in the build. There are still thousands of processes that must run locally. Using -j64 on these local tasks could cause memory exhaustion, I/O saturation, excessive context switching, etc. But if you decrease the concurrency ceiling, you lose the benefits of distributed compilation! The build system thus needs to be taught when distributed compilation is available and what tasks can be made concurrent so it can intelligently adjust the -j concurrency limit at run-time. This is why we have a higher-level build wrapper tool: mach build. (This is another reason why people should be building through mach instead of invoking make directly.)

No matter what technical solution we employ, I would like the build system to automatically discover and use distributed compilation if it is available. If we need to hardcode Mozilla IP addresses or hostnames into the build system, I'm fine with that. I just don't want developers missing out on much faster build times simply because they don't know distributed compilation is available. If you are in a physical location with distributed compilation support, you should get that automatically: fast builds should not be hard.

We can and should investigate distributed compilation as part of release automation. Icecream should mitigate the concerns about build reproducibility since the toolchain is transferred at build time.

I have had success getting Icecream to work with Linux builds. However, OS X is problematic. Specifically, Icecream is unable to create the build environment for distribution (likely a modern OS X/Xcode compatibility issue). Details are in bug 927952.

Build peers have a lot on our plate this quarter and making distributed compilation work well is not in our official goals. I would love, love, love if someone could step up and be a hero to make distributed compilation work better with the build system. If you are interested, pop into #build on irc.mozilla.org.

In summary, there are massive developer productivity wins waiting to be realized through distributed compiling. There is nobody tasked to work on this officially. Although, I'd love it if there were. If you find yourself setting up ad-hoc networks in offices, I'd really like to see some kind of discovery in mach. If not, there will be people left behind and that really stinks for those individuals. If you do any work around distributed compiling, please have it tracked under bug 485559.


The State of the Firefox Build System (2013 Q3 Review)

October 15, 2013 at 01:00 PM | categories: Mozilla, build system

As we look ahead to Q4 planning for the Firefox build system, I wanted to take the time to reflect on what was accomplished in Q3 and to simultaneously look forward to Q4 and beyond.

2013 Q3 Build System Improvements

There were notable improvements in the build system during the last quarter.

The issue our customers care most about is speed. Here is a list of accomplishments in that area:

  • MOZ_PSEUDO_DERECURSE work to change how make directory traversal works. This enabled the binaries make target, which can do no-op libxul-only builds in just a few seconds. Of all the changes that landed this quarter, this is the most impactful to local build times. This change also enables C++ compilation to scale out to as many cores as you have. Previously, the build system was starved in many parts of the tree when compiling C++. Mike Hommey is responsible for this work. I reviewed most of it.

  • WebIDL and IPDL bindings are now compiled in unified mode, reducing compile times and linker memory usage. Nathan Froyd wrote the code. I reviewed the patches.

  • XPIDL files are generated much more efficiently. This removed a few minutes of CPU core time from builds. I wrote these patches and Mike Hommey reviewed.

  • Increased reliance on install manifests to process file installs. They have drastically reduced the number of processes required to build by performing all actions inside Python processes as system calls and removing the clownshoes of having to delete parts of the object directory at the beginning of builds. When many mochitests were converted to manifests, no-op build times dropped by ~15% on my machine. Many people are responsible for this work. Mike Hommey wrote the original install code for packaging a few months ago. I built in manifest file support, support for symlinks, and made the code a bit more robust and faster. Mike Hommey reviewed these patches.

  • Many bugs and issues around dependency files on Windows have been discovered and fixed. These were a common source of clobbers. Mike Hommey found most of these, many during his work to make MOZ_PSEUDO_DERECURSE work.

  • The effort to reduce C++ include hell is resulting in significantly shorter incremental builds. While this effort is largely outside the build config module, it is worth mentioning. Ehsan Akhgari is leading this effort. He's been assisted by too many people to mention.

  • The build system now has different build modes favoring faster building vs release build options depending on the environment. Mike Hommey wrote most (all?) of the patches.

A number of other non-speed related improvements have been made:

  • The build system now monitors resource usage during builds and can graph the results. I wrote the code. Ted Mielczarek, Mike Hommey, and Mike Shal had reviews.

  • Support for test manifests has been integrated with the build system. This enabled some build speed wins and is paving the road for better testing UX, such as the automagical mach test command, which will run the appropriate test suite automatically. Multiple people were involved in the work to integrate test manifests with the build system. I wrote the patches. But Ted Mielczarek got primary review. Joel Maher, Jeff Hammel, and Ms2ger provided excellent assistance during the design and implementation phase. The work around mochitest manifests likely wouldn't have happened this quarter if all of us weren't attending an A*Team work week in August.

  • There are now in-tree build system docs. They are published automatically. Efforts have been made to purge MDN of cruft. I am responsible for writing the code and most of the docs. Benjamin Smedberg and Mike Shal performed code reviews.

  • Improvements have been made to object directory detection in mach. This was commonly a barrier to some users using mach. I am responsible for the code. Nearly every peer has reviewed patches.

  • We now require Python 2.7.3 to build, making our future Python 3 compatibility story much easier while eliminating a large class of Python 2.7.2 and below bugs that we constantly found ourselves working around.

  • mach bootstrap has grown many new features and should be more robust than ever. There are numerous contributors here, including many community members that have found and fixed bugs and have added support for additional distributions.

  • The boilerplate from Makefile.in has disappeared. Mike Hommey is to thank.

  • dumbmake was integrated with mach, resulting in a friendlier build interface and a nice UX win. Code by Nick Alexander. I reviewed.

  • Many variables have been ported from Makefile.in to moz.build. We started Q3 with support for 47 variables and now support 73. We started with 1226 Makefile.in and 1517 moz.build and currently have 941 Makefile.in and 1568 moz.build. Many people contributed to this work. Worth mentioning are Joey Armstrong, Mike Shal, Joshua Cranmer, and Ms2ger.

  • Many build actions are moving to Python packages. This enabled pymake inlining (faster builds) and is paving the road towards no .pyc files in the source directory. (pyc files commonly are the source of clobber headaches and make it difficult to efficiently perform builds on read-only filesystems.) I wrote most of the patches and Mike Shal and Mike Hommey reviewed.

  • moz.build is now more strict about what it accepts. We check for missing files at config parse time rather than build time, causing errors to surface faster. Many people are responsible for this work. Mike Shal deserves kudos for work around C/C++ file validation.

  • mach has been added to the B2G repo. Jonathan Griffin and Andrew Halberstadt drove this.

Current status of the build system

Q3 was a very significant quarter for the build system. For the first time in years, we made fundamental changes to how the build system goes about building. The moz.build work to free our build config from the shackles of make files has enabled us to consume that data and do new and novel things with it. This has enabled improvements in build robustness and - most importantly - speed.

This is most evident with the MOZ_PSEUDO_DERECURSE work, which effectively replaces how make traverses directories. The work there has allowed Gecko developers focused on libxul to go from e.g. 50s no-op build times to less than 5s. Combined with optimized building of XPIDL, IPDL, and WebIDL files, processing of file installs via manifests, C++ header dependency reduction, and a host of other changes, we are finally turning a corner on build times! Much of this work wouldn't have been possible without moz.build files providing a whole world view of our build config.

The quarter wasn't all roses. Unfortunately, we also broke things. A lot. The total number of required clobbers this quarter grew slightly from 38 in Q2 to 43 in Q3. Many of these clobbers were regressions from supposed improvements to the build system. Too many of these regressions were Windows/pymake only and surely would have been found prior to landing if more build peers were actively building on Windows. There are various reasons we aren't. We should strive to fix them so more build development occurs on Windows and Windows users aren't unfairly punished.

The other class of avoidable clobbers mostly revolves around the theme that the build system is complicated, particularly when it comes to integration with release automation. Build automation currently has its build logic coded in Buildbot config files. This means it's all but impossible for build peers to test and reproduce that build environment and flow without time-intensive, stop-energy-abundant try pushes or loaning out build slaves. The RelEng effort to extract this code from Buildbot to mozharness can't come soon enough. See my overview on how automation works for more.

This quarter, the sheriffs have been filing bugs whenever a clobber is needed. This has surfaced clobber issues to build peers better and I have no doubt their constant pestering caused clobber issues to be resolved sooner. It's a terrific incentive for us to fix the build system.

I have mixed feelings on the personnel/contribution front in Q3. Kyle Huey no longer participates in active build system development or patch review. Ted Mielczarek is also starting to drift away from active coding and review. Although, he does constantly provide knowledge and historical context, so not all is lost. It is disappointing to see fantastic people and contributors no longer actively participating on the coding front. But, I understand the reasons behind it. Mozilla doesn't have a build team with a common manager and decree (a mistake if you ask me). Ted and Kyle are both insanely smart and talented and they work for teams that have other important goals. They've put in their time (and suffering). So I see why they've moved on.

On the plus side, Mike Hommey has been spending a lot more time on build work. He was involved in many of the improvements listed above. Due to review load and Mike's technical brilliance, I don't think many of our accomplishments would have happened without him. If there is one Mozillian who should be commended for build system work in Q3, it should be Mike Hommey.

Q3 also saw the addition of new build peers. Mike Shal is now a full build config module peer. Nick Alexander is now a peer of a submodule covering just the Fennec build system. Aside from his regular patch work, Mike Shal has been developing his review skills and responsibilities. Without him, we would likely be drowning in review requests and bug investigations due to the departures of Kyle and Ted. Nick is already doing what I'd hope he'd do when put in charge of the Fennec build system: looking at a proper build backend for Java (not make) and Eclipse project generation. (I still can't believe many of our Fennec developers code Java in vanilla text editors, not powerful IDEs. If there is one language that would miss IDEs the most, I'd think it would be Java. Anyway.)

There was a steady stream of contributions from people not in the build config module. Joshua Cranmer has been keeping up with moz.build conversions for comm-central. Nathan Froyd and Boris Zbarsky have helped with all kinds of IDL work. Trevor Saunders has helped keep things clean. Ms2ger has been eager to provide assistance through code and reviews. Various community contributors have helped with moz.build conversion patches and improvements to mach and the bootstrapper. Thank you to everyone who contributed last quarter!

Looking to the future

At the beginning of the quarter, I didn't think it would be possible to attain no-op build speeds with make as quickly as make binaries now does. But Mike Hommey worked some magic and this is now possible. This was a game changer. The code he wrote can be applied to other build actions. And our other solution of converting moz.build files into autogenerated make files seems to be working pretty well too. This raises some interesting questions with regard to prioritization.

Long term, we know we want to move away from make. It is old and clumsy. It's easy to do things wrong. It doesn't scale to handle a single DAG as large as our build system. The latter is particularly important if we are to ever have a build system that doesn't require clobbers periodically.

Up to this point we've prioritized work on moz.build conversion, with the rationale being that it would more quickly enable a clean break from make and thus we'd arrive at drastically faster builds sooner. The assumption in that argument was that drastically faster builds weren't attainable with make. Between the directory traversal overhaul and the release of GNU make 4.0 last week (which actually seems to work on Windows, making the pymake slowness a non-issue), the importance of breaking away from make now seems much less pressing.

While we would like to actively move off make, developments in the past few weeks seem to say that we can reassess priorities. I believe that we can drive down no-op builds with make to a time that satisfies many - let's say under 10s to be conservative. Using clever tricks optimizing for common developer workflows, we can probably get that under 5s everywhere, including Windows (people only caring about libxul can get 2.5s on mozilla-central today). This isn't the 250ms we could get with Tup. But it's much better than 45s. If we got there, I don't think many people would be complaining.

So, the big question for goals setting this quarter will be whether we want to focus on a new build backend (likely Tup) or whether we should continue with an emphasis on make. Now, a lot of the work involved applies to both make and any other build backend. But, I have little doubt it would be less overall work to support one build backend (make) than two. On the other hand, we know we want to support multiple build backends eventually. Why wait? In the balance are numerous other projects that have varying impact for developers and release automation. While important in their own right, it is difficult to balance them against build speed. While we could strive towards instantaneous builds, at some point we'll hit good enough and the diminishing returns that accompany it. There is already a small vocal faction advocating for Ninja support, even though it would only decrease no-op libxul build times from ~2.5s to 250ms. While a 10x improvement, I think this is dangerously close to diminishing returns territory and our time investment would be better spent elsewhere. (Of course, once we can support building libxul with Ninja, we could easily get it for Tup. And I believe Tup wins that tie.) Anyway, I'm sure it will be an interesting discussion!

Whatever the future holds, it was a good quarter for the build system and the future is looking brighter than ever. We have transitioned from a maintain-and-react mode (which I understand has largely been the norm since the dawn of Firefox) to a proactive and future-looking approach that will satisfy the needs of Firefox and its developers for the next ten years. All of this progress is even more impressive when you consider that we still react to an awful lot of fire drills and unwanted maintenance!

The Firefox build system is improving. I'm as anxious as you are to see various milestones in terms of build speed and other features. But it's hard work. Wish us luck. Please help out where you can.

