Gregory Szorc's Digital Home

Modern CI is Too Complex and Misdirected

April 07, 2021 at 09:00 AM | categories: CI, build system

The state of CI platforms is much stronger than it was just a few years ago. Overall, this is a good thing: access to powerful CI platforms enables software developers and companies to ship more reliable software more frequently, which benefits its users/customers. Centralized CI platforms like GitHub Actions, GitLab Pipelines, and Bitbucket provide benefits of scale, as the Internet serves as a collective information repository for how to use them. Do a search for how to do X on CI platform Y and you'll typically find some code you can copy and paste. Nobody wants to toil with wrangling their CI configuration after all: they just want to ship.

Modern CI Systems are Too Complex

The advancements in CI platforms have come at a cost: increased complexity. And the more I think about it, I'm coming around to the belief that modern CI systems are too complex. Let me explain.

At its core, a CI platform is a specialized remote code execution as a service (it's a feature, not a CVE!) where the code being executed is in pursuit of building, testing, and shipping software (unless you abuse it to mine cryptocurrency). So, CI platforms typically throw in a bunch of value-add features to enable you to ship software more easily. There are vastly different approaches and business models here. (I must tip my hat to GitHub Actions leveraging network effects via community maintained actions: this lowers TCO for GitHub as they don't need to maintain many actions, creates vendor lock-in as users develop a dependence on platform-proprietary actions, all while increasing the value of the platform for end-users - a rare product trifecta.) A common value-add property of CI platforms is some kind of configuration file (often YAML) which itself offers common functionality, such as configuring the version control checkout and specifying what commands to run. This is where we start to get into problems.

(I'm going to focus on GitHub Actions here, not because they are the worst (far from it), but because they seem to be the most popular and readers can relate more easily. But my commentary applies to other platforms like GitLab as well.)

The YAML configuration of modern CI platforms is... powerful. Here are features present in GitHub Actions workflow YAML:

An embedded templating system that results in the source YAML being expanded into a final YAML document that is actually evaluated. This includes a custom expression mini language.
Triggers for when to run jobs.
Named variables.
Conditional job execution.
Dependencies between jobs.
Defining Docker-based run-time environment.
Encrypted secrets.
Steps constituting each job and what actions those steps should take.

If we expand scope slightly to include actions maintained by GitHub, we also have steps/actions features for:

Performing Git checkouts.
Storing artifacts used by workflows/jobs.
Caching artifacts used by workflows/jobs.
Installing common programming languages and environments (like Java, Node.js, Python, and Ruby).
And a whole lot more.

And then of course there are 3rd party Actions. And there's a lot of them!

There's a lot of functionality here and a lot of it is arguably necessary: I'm hard pressed to name a feature to cut. (Although I'm no fan of using YAML as a programming language but I concede it use is a fair compromise compared to forcing people to write code to produce YAML or make equivalent API calls to do what the YAML would do.) All these features seem necessary for a sufficiently powerful CI offering. Nobody would use your offering if it didn't offer turnkey functionality after all.

So what's my complaint?

I posit that a sufficiently complex CI system becomes indistinguishable from a build system. I challenge you: try to convince me or yourself that GitHub Actions, GitLab CI, and other CI systems aren't build systems. The basic primitives are all there. GitHub Actions Workflows comprised of jobs comprised of steps are little different from say Makefiles comprised of rules comprised of commands to execute for that rule, with dependencies gluing everything together. The main difference is the form factor and the execution model (build systems are traditionally local and single machine but CI systems are remote/distributed).

Then we have a similar conjecture: a sufficiently complex build system becomes indistinguishable from a CI system. Earlier I said that CI systems are remote code execution as a service. While build systems are historically things that run locally (and therefore not a service), modern build systems like Bazel (or Buck or Gradle) are completely different animals. For example, Bazel has remote execution and remote caching as built-in features. Hey - those are built-in features of modern CI systems too! So here's a thought experiment: if I define a build system in Bazel and then define a server-side Git push hook so the remote server triggers Bazel to build, run tests, and post the results somewhere, is that a CI system? I think it is! A crude one. But I think that qualifies as a CI system.

If you squint hard enough, sufficiently complex CI systems and sufficiently complex build systems start to look like the same thing to me. At a very high level, both are providing a pool of servers offering general compute/execute functionality with specialized features in the domain of building/shipping software, like inter-task artifact exchange, caching, dependencies, and a frontend language to define how everything works.

(If you squint really hard you can start to see a value proposition of Kubernetes for even more general compute scheduling, but I'm not going to go that far in this post because it is a much harder point to make and I don't necessarily believe in it myself. But I thought I'd mention it as an interesting thought experiment. But an easier leap to make is to throw batch job execution (as is often found in data warehouses) in with build and CI systems as belonging in the same bucket: batch job execution also tends to have dependencies, exchange of artifacts between jobs, and I think can strongly resemble a CI system and therefore a build system.)

The thing that bugs me about modern CI systems is that I inevitably feel like I'm reinventing a build system and fragmenting build system logic. Your CI configuration inevitably devolves into a bunch of complex YAML with all kinds of caching and dependency optimizations to keep execution time low and reliability in check - just like your build system. You find yourself contorting your project's build system to work in the context of CI and vice versa. You end up managing two complex DAGs and platforms/systems instead of one.

Because build systems are more generic than CI systems (I think a sufficiently advanced build system can do a superset of the things that a sufficiently complex CI system can do), that means that CI systems are redundant with sufficiently advanced build systems. So going beyond the section title: CI systems aren't too complex: they shouldn't need to exist. Your CI functionality should be an extension of the build system.

In addition to the redundancy argument, I think unified systems are more user friendly. By integrating your CI system into your build system (which by definition can be driven locally as part of regular development workflows), you can expose the full power of the CI system to developers more easily. Think running ad-hoc CI jobs without having to push your changes to a remote server first, just like you can with local builds or tests. This is huge for ergonomics and can drastically compress the cycle time for changes to these systems (which are often brittle to change/test).

Don't get me wrong, aspects of CI systems not traditionally found in build systems (such as centralized results reporting and a UI/API for (re)triggering jobs) absolutely need to exist. Instead, it is the remote compute and work definition aspects that are completely redundant with build systems.

Let's explore the implications of build and CI systems being more of the same.

Modern CI Offerings are Targeting the Wrong Abstraction

If you assume that build and CI systems can be / are more of the same, then it follows that many modern CI offerings like GitHub Actions, GitLab CI, and others are targeting the wrong abstraction: they are defined as domain specific platforms for running CI systems when instead they should take a step back and target the broader general compute platform that is also needed for build systems (and maybe batch job execution, such as what's commonly found in data warehouses/pipelines).

Every CI offering is somewhere different on the spectrum here. I would go so far as to argue that GitHub Actions is more a CI product than a platform. Let me explain.

In my ideal CI platform, I have the ability to schedule an ad-hoc graph of tasks against that platform. I have the ability to hit some APIs with definitions of the tasks I want that platform to run and it accepts them, executes them, uploads artifacts somewhere, reports task results so dependent tasks can execute, etc.

There is a GitHub Actions API that allows you to interact with the service. But the critical feature it doesn't let me do is define ad-hoc units of work: the actual remote execute as a service. Rather, the only way to define units of work is via workflow YAML files checked into your repository. That's so constraining!

GitLab Pipelines is a lot better. GitLab Pipelines supports features like parent-child pipelines (dependencies between different pipelines), multi-project pipelines (dependencies between different projects/repos), and dynamic child pipelines (generate YAML files in pipeline job that defines a new pipeline). (I don't believe GitHub Actions supports any of these features.) Dynamic child pipelines are an important feature, as they mostly divorce the checked-in YAML configuration from the remote execute as a service feature. The main missing feature here is a generic API that allows you achieve this functionality without having to go through a parent pipeline / YAML first. If that API existed, you could build your own build/CI/batch execute system on top of GitLab Pipelines with fewer constraints imposed on you by GitLab Pipeline's opinionated YAML configuration files and the intended use of its creators. (Generally, I think a good litmus test for a well-designed platform or tool is when its authors are surprised by someone's unintended use for it. Of course this knife cuts both ways, as sometimes people do undesirable things, like mine cryptocurrency.)

CI offerings like GitHub Actions and GitLab Pipelines are more products than platforms because they tightly couple an opinionated configuration mechanism (YAML files) and web UI (and corresponding APIs) on top of a theoretically generic remote execute as a service offering. For me to consider these offerings as platforms, they need to grow the ability to schedule arbitrary compute via an API, without being constrained by the YAML officially supported out of the box. GitLab is almost there (the critical missing link is a schedule an inline-defined pipeline API). It is unknown if GitHub is - or is even interested in - pursuing this direction. (More on this later.)

Taskcluster: The Most Powerful CI Platform You've Never Heard Of

I wanted to just mention Taskcluster in passing as a counterexample to the CI offerings that GitHub, GitLab, and others are pursuing. But I found myself heaping praises towards it, so you get a full section on Taskcluster. This content isn't critical to the overall post, so feel free to skip. But if you want to know what a CI platform built for engineers looks like or you are a developer of CI platforms and would like to read about some worthwhile ideas to steal, keep reading.

Mozilla's Taskcluster is a generic CI platform originally built for Firefox. At the time it was conceived and initially built out in 2014-2015, there was nothing else quite like it. And I'm still not aware of anything that can match its raw capabilities. There might be something proprietary behind corporate walls. But nothing close to it in the open source domain. And even the proprietary CI platforms I'm aware of often fall short of Taskcluster's feature list.

To my knowledge, Taskcluster is the only publicly available, mega project scale, true CI platform in existence.

Germane to this post, one thing I love about Taskcluster is its core primitives around defining execution units. The core execute primitive in Taskcluster is a task. Tasks are connected together to form a DAG. (This is not unlike how a build system works.)

A task is created by issuing an API request to a queue service. That API request essentially says schedule this unit of work.

Tasks are defined somewhat generically, essentially as units of arbitrary compute along with metadata, such as task dependencies, permissions/scopes that task has, etc. That unit of work has many of the primitives that are familiar to you if you use GitHub Actions, GitLab Pipelines, etc: a list of commands to execute, which Docker image to execute in, paths to files constituting artifacts, retry settings, etc.

Taskcluster has features far beyond what are offered by GitHub, GitLab, and others today.

For example, Taskcluster offers an IAM-like scopes feature that moderates access control. Scopes control what actions you can perform, what services you have access to, which runner features you can use (e.g. whether you can use ptrace), which secrets you have access to, and more. As a concrete example, Firefox's Taskcluster settings are such that the cryptographic keys/secrets used to sign Firefox builds are inaccessible to untrusted tasks (like the equivalent of tasks initiated by PRs - the Try Server in Mozilla speak). Taskcluster is the only CI platform I'm aware of that has sufficient protections in place to mitigate the fact that CI platforms are gaping remote code execution as a service risks that can and should keep your internal security and risk teams up at night. Taskcluster's security model makes GitHub Actions, GitLab Pipelines, and other commonly used CI services look like data exfiltration and software supply chain vulnerability factories by comparison.

Taskcluster does support adding a YAML file to your repository to define tasks. However, because there's a generic scheduling API, you don't need to use it and you aren't constrained by its features. You could roll your own configuration/frontend for defining tasks: Taskcluster doesn't care because it is a true platform. In fact, Firefox mostly eschews this Taskcluster YAML, instead building out its own functionality for defining tasks. There's a pile of code checked into the Firefox repository that when run will derive the thousands of discrete tasks constituting Firefox's build and release DAG and will register the appropriate sub-graph as Taskcluster tasks. (This also happens to be a pile of YAML. But the programming primitives and control flow are largely absent from YAML files, making it a bit cleaner than the YAML DSL that e.g. GitHub and GitLab CI YAML has evolved into.) This functionality is its own mini build system where the Taskcluster platform is the execution/evaluation mechanism.

Taskcluster's model and capabilities are vastly beyond anything in GitHub Actions or GitLab Pipelines today. There's a lot of great ideas worth copying.

Unfortunately, Taskcluster is very much a power user CI offering. There's no centralized instance that anyone can use (unlike GitHub or GitLab). The learning curve is quite steep. All that power comes at a cost of complexity. I can't in good faith recommend Taskcluster to casual users. But if you want to host your own CI platform, other CI offerings don't quite cut it for you, and you can afford a few people to support your CI platform on an ongoing basis (i.e. your total cost to operate CI including people and machines is >$1M annually), then Taskcluster is worth considering.

Let's get back to the post at hand.

Looking to the Future

In my ideal world there exists a single remote code execution as a service platform purpose built for servicing both near real time and batch/delayed execution. It is probably tailored towards supporting software development, as those domain specific features set it apart from generic compute as a service tools like Kubernetes, Lambda, and others. But something more generic could potentially work.

The concept of a DAG is strongly baked into the execution model so you can define execution units as a graph, capturing dependencies. Sure, you could define isolated, ad-hoc units of work. But if you wanted to define a set of units, you could do that without having to run a persistent agent to coordinate execution through completion like build systems typically do. (Think of this as uploading your DAG to an execution service.)

In my ideal world, there is a single DAG dictating all build, testing, and release tasks. There is no DAG fragmentation at the build, CI, and other batch execute boundaries. No N+1 system or configuration to manage and additional platform to maintain because everything is unified. Economies of scale applies and overall efficiency improves through consolidation.

The platform consists of pools of workers running agents capable of performing work. There are probably pools for near real time / synchronous RPC style invocations and pools for scheduled / delayed / asynchronous execution. You can define your own worker pools and bring your own workers. Advanced customers will likely throw autoscaling groups consisting of highly ephemeral workers (such as EC2 spot instances) at these pools, scaling capacity to meet demand relatively cheaply, terminating workers and machines when capacity is no longer needed to save on billing costs (this is what Firefox's Taskcluster instance has been doing for at least 6 years).

To end-users, a local build consists of driving or scheduling the subset of the complete task graph necessary to produce the build artifacts you need. A CI build/test consists of the subset of the task graph necessary to achieve that (it is probably a superset of the local build graph). Same for releasing.

As for the configuration frontend and how execution units are defined, this platform only needs to provide a single thing: an API that can be used to schedule/execute work. However, for this product offering to be user-friendly, it should offer something like YAML configuration files like CI systems do today. That's fine: many (most?) users will stick to using the simplified YAML interface. Just as long as power users have an escape/scaling vector and can use the low-level schedule/execute API to write their own driver. People will write plug-ins for their build systems enabling it to integrate with this platform. Someone will coerce existing extensible build systems like Bazel, Buck, and Gradle to convert nodes in the build graph to compute tasks in this platform. This unlocks the unification of the build and CI systems (and maybe things like data pipelines too).

Finally, because we're talking about a specialized system tailored for software development, we need robust result/reporting APIs and interfaces. What good is all this fancy distributed remote compute if nobody can see what it is doing? This is probably the most specialized service of the bunch, as how you track results is exceptionally domain specific. Power users may want to build their own result tracking service, so keep that in mind. But the platform should provide a generic one (like what GitHub Actions and GitLab Pipelines do today) because it is a massive value add and few will use your product without such a feature.

Quickly, my proposed unified world will not alleviate the CI complexity concerns raised above: sufficiently large build/CI systems will always have an intrinsic complexity to them and possibly require specialists to maintain. However, because a complex CI system is almost always attached to a complex build system, by consolidating build and CI systems, you reduce the surface area of complexity (you don't have to worry about build/CI interop as much). Lower fragmentation reduces overall complexity, and is therefore a new win. (A similar line of thinking applies to justifying monorepositories.)

All of the components for my vision exist in some working form today. Bazel, Gradle Enterprise, and other modern build systems have RPCs for remote execute and/or caching. They are even extensible and you can write your own plugins to change core functionality for how the build system runs (to varying degrees of course). CI offerings like Taskcluster and GitLab Pipelines support scheduling DAGs of tasks (with Taskcluster's support far more suited for the desired end state). There are batch job execution frameworks like Airflow that look an awful lot like a domain-specific, specialized versions of Taskcluster. What we don't have a is a single product or service with all these features bundled as a cohesive offering.

I'm convinced that building what I'd like to see is not a question of if it can be done but whether we should and who will do it.

And this is where we probably run into problems. I hate to say it, but I'm skeptical this will exist as a widely available service outside a few corporations' walls any time soon. The reason is the total addressable market.

The value of my vision is through unification of discrete systems (build, CI, and maybe some one-offs like data pipelines) that are themselves complex enough that unification is something you'd want to do for business/efficiency reasons. After all, if it isn't complex/inefficient, you probably don't care about making it simpler/faster. Right here we are probably filtering out >90% of the market because their systems just aren't complex enough for this to matter.

This vision requires adoption of a sufficiently advanced build system so it can serve as the brains behind a unified DAG driving remote execute. Some companies and projects will adopt compatible, advanced build systems like Bazel because they have the resources, technical know-how, and efficiency incentives to pull it off. But many won't. The benefit of a more advanced build system over something simpler is often marginal. Factor in that many companies perceive build and CI support as product development overhead and a virtual cost center whose line item needs to be minimized. If you can get by on a less advanced build system that is good enough for a fraction of the cost without excessive hardship, that's the path many companies and projects will follow. Again, people and companies generally don't care about wrangling build and CI systems: they just want to ship.

The total addressable market for this idea seems too small for me to see any major player with the technical know-how to implement and offer such a service in the next few years. After all, we're not even over the hurdle that what I propose (unifying build and CI systems) is a good idea. Having worked in this space for a decade, witnessed the potential of Taskcluster's model, and seen former, present, and potential employers all struggling in this space to varying degrees, I know that this idea would be extremely valuable to some. (For some companies multiple millions of dollars could be saved annually by eliminating redundant human capital maintaining similar systems, reducing machine idle/run costs, and improving turnaround times of critical development loops.) As important as this would be to some companies, my intuition is they represent such a small sliver of the total addressable market that this slice of pie is too small for an existing CI operator like GitHub or GitLab to care about at this time. There are far more lucrative opportunities. (Such as security scanning, as laws/regulation/litigation are finally catching up to the software industry and forcing companies to take security and privacy more seriously, which translates to spending money on security services. This is why GitHub and GitLab have been stumbling over each other to announce new security features over the past 1-2 years.)

I don't think a startup in this area would be a good idea: customer acquisition is too hard. And because much of the core tech already exists in existing tools, there's not much of a moat in the way of proprietary IP to keep copycats with deep pockets at bay. Your best exit here is likely an early acquisition by a Microsoft/GitHub, GitLab, or wannabe player in this space like Amazon/AWS.

Rather, I think our best hope for seeing this vision realized is an operator of an existing major CI platform (either private or public) who also has major build system or other ad-hoc batch execute challenges will implement it and release it upon the world, either as open source or as a service offering. GitHub, GitLab, and other code hosting providers are the ideal candidates since their community effect could help drive industry adoption. But I'd happily accept pretty much any high quality offering from a reputable company!

I'm not sure when, but my money is on GitHub/Microsoft executing on this vision first. They have a stronger incentive in the form of broader market/product tie-ins (think integrated build and CI in Visual Studio or GitHub Workspaces [for Enterprises]). Furthermore, they'll feel the call from within. Microsoft has some really massive build systems and CI challenges (notably Windows). It is clear that elements of Microsoft are conducting development on GitHub, in the open even (at this point Satya Nadella's Microsoft has frozen over so many levels of hell that Dante's classics need new revisions). Microsoft engineers will feel the pain and limitations of discrete build and CI systems. Eventually there will be calls for at least a build system remote execute service/offering on GitHub. (This would naturally fall under GitHub's existing apparent market strategy of capturing more and more of the software development lifecycle.) My hope is GitHub (or whomever) will implement this as a unified platform/service/product rather than discrete services because as I've argued they are practically the same problem. But a unified offering isn't the path of least resistance, so who knows what will happen.

Conclusion

If I could snap my fingers and move industry's discrete build, CI, and maybe batch execute (e.g. data pipelines) ahead 10 years, I would:

Take Mozilla's Taskcluster and its best-in-class specialized remote execute as a service platform.
Add support for a real-time, synchronous execute API (like Bazel's remote execute API) to supplement the existing batch/asynchronous functionality.
Define Starlark dialects so you define CI/release like primitives in build tools like Bazel. (You could also do YAML here. But if your configuration files devolve into DSL, just use a real programming language already.)
Teach build tools like Bazel to work better when units of work that can take minutes or even hours to run (a synchronous/online driver model such as classically employed by build systems isn't appropriate for long-running test, release, or say data pipelines).
Throw a polished web UI for platform interaction, result reporting, etc on top.
Release it to the world.

Will this dream become a reality any time soon? Probably not. But I can dream. And maybe I'll have convinced a reader to pursue it.