Why You Shouldn't Use Git LFS

May 12, 2021 at 10:30 AM | categories: Mercurial, Git

I have long held the opinion that you should avoid Git LFS if possible. Since people keep asking me why, I figured I'd capture my thoughts in a blog post so I have something to refer them to.

Here are my reasons for not using Git LFS.

Git LFS is a Stop Gap Solution

Git LFS was developed outside the official Git project to fulfill a very real market need: Git didn't (and still doesn't) handle large files very well.

I believe it is inevitable that Git will gain better support for handling of large files, as this seems like a critical feature for a popular version control tool.

If you make this long bet, LFS is only an interim solution and its value proposition disappears after Git has better native support for large files.

LFS as a stop gap solution would be tolerable except for the fact that...

Git LFS is a One Way Door

The adoption or removal of Git LFS in a repository is an irreversible decision that requires rewriting history and losing your original commit SHAs.

In some contexts, rewriting history is tolerable. In many others, it is an extremely expensive proposition. My experience maintaining version control in professional contexts aligns with the opinion that rewriting history is expensive and should only be considered a measure of last resort. Maybe if tools made it easier to rewrite history without the negative consequences (e.g. if GitHub redirected references to old SHA-1s in URLs and API calls) I would change my opinion here. Until that day, the drawbacks of losing history are just too high to stomach for many.

The reason adoption or removal of LFS is irreversible comes down to the way Git LFS works. What LFS does is change the blob content that a Git commit/tree references: instead of the content itself, it stores a pointer to the content. At checkout and commit time, LFS blobs/records are treated specially via a mechanism in Git that allows content to be rewritten as it moves between Git's core storage and its materialized representation. (The same filtering mechanism is responsible for normalizing line endings in text files, although that feature is built into the core Git product and doesn't work exactly the same way. The principles are the same, though.)
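For the curious, here is a rough sketch of the moving parts (the tracked pattern and hash below are illustrative; the filter configuration is what git lfs install writes and the pointer format follows the published LFS spec):

    # .gitattributes (committed): route matching paths through the lfs filter
    *.psd filter=lfs diff=lfs merge=lfs -text

    # Git config (written by `git lfs install`): the clean/smudge filter pair
    [filter "lfs"]
        clean = git-lfs clean -- %f
        smudge = git-lfs smudge -- %f
        process = git-lfs filter-process
        required = true

    # What the commit/tree actually references is a small pointer blob like:
    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 104857600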

Since the LFS pointer is part of the Merkle tree that a Git commit derives from, you can't add or remove LFS from a repo without rewriting existing Git commit SHAs.

I want to explicitly call out that even if a rewrite is acceptable in the short term, things may change in the future. If you adopt LFS today, you are committing to a) running an LFS server forever, b) incurring a history rewrite in the future in order to remove LFS from your repo, or c) ceasing to provide an LFS server and locking people out of using older Git commits. I don't think any of these are great options: I would prefer if there were a way to offboard from LFS in the future with minimal disruption. This is theoretically possible, but it requires the Git core product to recognize LFS blobs/records natively. There's no guarantee this will happen. So adoption of Git LFS is a one way door that can't be easily reversed.

LFS is More Complexity

LFS is more complex for Git end users.

Git users have to install, configure, and sometimes know about the existence of Git LFS. Version control should just work. Large file handling should just work. End-users shouldn't have to care that large files are handled slightly differently from small files.
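To make that overhead concrete, here is roughly what every contributor has to know to do (the tracked pattern and file names are invented for illustration):

    git lfs install                # once per machine: wires up the lfs filters
    git lfs track "*.bin"          # per repository: adds the pattern to .gitattributes
    git add .gitattributes big-asset.bin
    git commit -m "add big asset"
    git push                       # pointer goes into Git; content goes to the LFS server

And anyone who clones without git-lfs installed checks out pointer files instead of the actual content, which is exactly the kind of detail end-users shouldn't have to think about.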

The usability of Git LFS is generally pretty good. However, there's an upper limit on that usability as long as LFS exists outside the core Git product. And LFS will likely never be integrated into the core Git product because the Git maintainers know that LFS is only a stop gap solution. They would rather solve large file storage correctly than ~forever carry the legacy baggage of having to support LFS in the core product.

LFS is more complexity for Git server operators as well. Instead of a self-contained Git repository and server to support, you now have to support a likely separate HTTP server to facilitate LFS access. This isn't the hardest thing in the world, especially since we're talking about key-value blob storage, which is arguably a solved problem. But it's another piece of infrastructure to support and secure and it increases the surface area of complexity instead of minimizing it. As a server operator, I would much prefer if the large file storage were integrated into the core Git product and I simply needed to provide some settings for it to just work.

Mercurial Does LFS Slightly Better

Since I'm a maintainer of the Mercurial version control tool, I thought I'd throw out how Mercurial handles large file storage better than Git. Mercurial's large file handling isn't great, but I believe it is strictly better with regards to the trade-offs of adopting large file storage.

In Mercurial, use of LFS is a dynamic feature that server/repo operators can choose to enable or disable whenever they want. When the Mercurial server sends file content to a client, presence of external/LFS storage is a flag set on that file revision. Essentially, the flag says the data you are receiving is an LFS record, not the file content itself, and the client knows how to resolve that record into content.

Conceptually, this is little different from Git LFS records in terms of content resolution. However, the big difference is this flag is part of the repository interchange data, not the core repository data as it is with Git. Since this flag isn't part of the Merkle tree used to derive the commit SHA, adding, removing, or altering the content of the LFS records doesn't require rewriting commit SHAs. The tracked content SHA - the data now stored in LFS - is still tracked as part of the Merkle tree, so the integrity of the commit / repository can still be verified.

In Mercurial, the choice of whether to use LFS and what to use LFS for is made by the server operator and settings can change over time. For example, you could start with no use of LFS and then one day decide to use LFS for all file revisions larger than 10 MB. Then a year later you lower that to all revisions larger than 1 MB. Then a year after that Mercurial gains better native support for large files and you decide to stop using LFS altogether.
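As a minimal sketch of what that operator-controlled policy can look like with Mercurial's bundled lfs extension (the thresholds mirror the hypothetical ones above; consult the extension's own help for authoritative syntax):

    [extensions]
    lfs =

    [lfs]
    # store file revisions larger than 10 MB externally...
    track = size(">10MB")
    # ...later tighten the policy without rewriting any history:
    # track = size(">1MB")
    # ...or stop using LFS for new revisions entirely:
    # track = none()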

Also in Mercurial, it is possible for clients to push a large file inline as part of the push operation. When the server sees that large file, it can be like this is a large file: I'm going to add it to the blob store and advertise it as LFS. Because the large file record isn't part of the Merkle tree, you can have nice things like this.

I suspect it is only a matter of time before Git's wire protocol learns the ability to dynamically advertise remote servers for content retrieval and this feature will be leveraged for better large file handling. Until that day, I suppose we're stuck with having to rewrite history with LFS and/or funnel large blobs through Git natively, with all the pain that entails.

Conclusion

This post summarized reasons to avoid Git LFS. Are there justifiable scenarios for using LFS? Absolutely! If you insist on using Git and insist on tracking many large files in version control, you should definitely consider LFS. (Although, if you are a heavy user of large files in version control, I would consider Plastic SCM instead, as they seem to have the most mature solution for large file handling.)

The main point of this post is to highlight some drawbacks with using Git LFS because Git LFS is most definitely not a magic bullet. If you can stomach the short and long term effects of Git LFS adoption, by all means, use Git LFS. But please make an informed decision either way.


Problems with Pull Requests and How to Fix Them

January 07, 2020 at 12:10 PM | categories: Mercurial, Git

You've probably used or at least heard of pull requests: the pull request is the contribution workflow practiced on and made popular by [code] collaboration sites like GitHub, GitLab, Bitbucket, and others. Someone (optionally) creates a fork, authors some commits, pushes them to a branch, then creates a pull request to track integrating those commits into a target repository and branch. The pull request is then used as a vehicle for code review, tracking automated checks, and discussion until it is ready to be integrated. Integration is usually performed by a project maintainer, often with the click of a merge button on the pull request's web page.

It's worth noting that the term pull request is not universally used: GitLab calls them merge requests, for example. Furthermore, I regard the terms pull request and merge request to be poorly named, as the terms can be conflated with terminology used by your version control tool (e.g. git pull or git merge). And the implementations of a pull or merge request may not even perform a pull or a merge (you can also rebase a pull/merge request, but nobody is calling them rebase requests). A modern day pull request is so much more than a version control tool operation or even a simple request to pull or merge a branch: it is a nexus to track the integration of a proposed change before, during, and after that change is integrated. But alas. Because GitHub coined the term and is the most popular collaboration platform implementing this functionality, I'll refer to this general workflow as implemented on GitHub, GitLab, Bitbucket, and others as pull requests for the remainder of this post.

Pull requests have existed in essentially their current form for over a decade. The core workflow has remained mostly unchanged. What is different is the addition of value-add features, such as integrating status checks like CI results, the ability to rebase or squash commits instead of merging, code review tooling improvements, and lots of UI polish. GitLab deserves a call out here, as their implementation of merge requests tracks so much more than other tools do. (This is a side-effect of GitLab having more built-in features than comparable tools.) I will also give kudos to GitLab for adding new features to pull requests when GitHub was asleep at the wheel as a company a few years ago. (Not having a CEO for clear product/company leadership really showed.) Fortunately, both companies (and others) are now churning out new, useful features at a terrific clip, greatly benefiting the industry!

While I don't have evidence of this, I suspect pull requests (and the forking model used by services that implement them) came into existence when someone thought how do I design a collaboration web site built on top of Git's new and novel distributed nature and branching features. They then proceeded to invent forking and pull requests. After all, the pull request as implemented by GitHub was initially a veneer over a common Git workflow of create a clone, create a branch, and send it somewhere. Without GitHub, you would run git clone, git branch, then some other command like git request-pull (where have I seen those words before) to generate/send your branch somewhere. On GitHub, the comparable steps are roughly create a fork, push a branch to your fork, and submit a pull request. Today, you can even do all of this straight from the web interface without having to run git directly! This means that GitHub can conceptually be thought of as a purely server-side abstraction/implementation of the Git feature branch workflow.

At its core, the pull request is fundamentally a nice UI and feature layer built around the common Git feature branch workflow. It was likely initially conceived as polish and value-add features over this historically client-side workflow. And this core property of pull requests from its very first days has been copied by vendors like Bitbucket and GitLab (and in Bitbucket's case it was implemented for Mercurial - not Git - as Bitbucket was initially Mercurial only).

A decade is an eternity in the computer industry. As they say, if you aren't moving forward, you are moving backward. I think it is time for the industry to scrutinize the pull request model and to evolve it into something better.

I know what you are thinking: you are thinking that pull requests work great and that they are popular because they are a superior model compared to what came before. These statements - aside from some nuance - are true. But if you live in the version control space (like I do) or are paid to deliver tools and workflows to developers to improve productivity and code/product quality (which I am), the deficiencies in the pull request workflow and implementation of that workflow among vendors like GitHub, GitLab, Bitbucket, etc are obvious and begging to be overhauled if not replaced wholesale.

So buckle in: you've started a ten thousand word adventure about everything you didn't think you wanted to know about pull requests!

Problems with Pull Requests

To build a better workflow, we first have to understand what is wrong/sub-optimal with pull requests.

I posit that the foremost goal of a pull request is to foster the incorporation of a high quality and desired change into a target repository with minimal overhead and complexity for submitter, integrator, and everyone in between. Pull requests achieve this goal by fostering collaboration to discuss the change (including code review), tracking automated checks against the change, linking to related issues, etc. In other words, the way I see the world is that a specific vendor's pull request implementation is just that: an implementation detail. And like all implementation details, they should be frequently scrutinized and changed, if necessary.

Let's start dissecting the problems with pull requests by focusing on the size of review units. Research by Google, Microsoft, and others has shown a correlation between review unit size and defect rate: smaller changes tend to produce lower defect rates. In Google's words (emphasis mine):

The size distribution of changes is an important factor in the quality of the code review process. Previous studies have found that the number of useful comments decreases and the review latency increases as the size of the change increases. Size also influences developers' perception of the code review process; a survey of Mozilla contributors found that developers feel that size-related factors have the greatest effect on review latency. A correlation between change size and review quality is acknowledged by Google and developers are strongly encouraged to make small, incremental changes (with the exception of large deletions and automated refactoring). These findings and our study support the value of reviewing small changes and the need for research and tools to help developers create such small, self-contained code changes for review.

Succinctly, larger changes result in fewer useful comments during review (meaning quality is undermined) and make reviews take longer (meaning productivity is undermined). Takeaway: if you care about defect rate / quality and/or velocity, you should be authoring and reviewing more, smaller changes as opposed to fewer, larger changes.

I strongly agree with Google's opinion on this matter and wholeheartedly endorse writing more, smaller changes. Having practiced both forms of change authorship, I can say without a doubt that more, smaller changes is superior: superior for authors, superior for code reviewers, and superior for people looking at repository history later. The main downside with this model is that it requires a bit more knowledge of your version control tool to execute. And, it requires corresponding tooling to play well with this change authorship model and to introduce as little friction as possible along the way since the number of interactions with tooling will increase as change size decreases, velocity increases, and there are more distinct units of change being considered for integration.

That last point is important and is germane to this post because the common implementation of pull requests today is not very compatible with the many small changes workflow. As I'll argue, the current implementation of pull requests actively discourages the many smaller changes workflow. And since smaller changes result in higher quality and faster reviews, today's implementations of pull requests are undermining quality and velocity.

I don't mean to pick on them, but since they are the most popular and the people who made pull requests popular, let's use GitHub's implementation of pull requests to demonstrate my point.

I posit that in order for us to author more, smaller changes, we must either a) create more, smaller pull requests or b) have pull request reviews put emphasis on the individual commits (as opposed to the overall merge diff). Let's examine these individually.

If we were to author more, smaller pull requests, this would seemingly necessitate dependencies between pull requests in order to maintain velocity. And dependencies between pull requests add a potentially prohibitive amount of overhead. Let me explain. We don't want to sacrifice the overall rate at which authors and maintainers are able to integrate proposed changes. If we were to split existing proposed changes into more, smaller pull requests, we would have a lot more pull requests. Without dependencies between them, authors could wait for each pull request to be integrated before sending the next one. But this would incur more round trips between author and integrator and would almost certainly slow down the overall process. That's not desirable. The obvious mitigation to that is to allow multiple, related pull requests in flight simultaneously. But this would necessitate the invention of dependencies between pull requests in order to track relationships so one pull request doesn't integrate before another it logically depends on. This is certainly technically doable. But it imposes considerable overhead of its own. How do you define dependencies? Are dependencies automatically detected or updated based on commits in a DAG? If yes, what happens when you force push and it is ambiguous whether a new commit is a logically new commit or a successor of a previous one? If no, do you really want to impose additional hurdles on submitters to define dependencies between every pull request? In the extreme case of one pull request per commit, do you make someone submitting a series of say twenty commits and pull requests really annotate nineteen dependencies? That's crazy!

There's another, more practical issue at play: the interplay between Git branches and pull requests. As implemented on GitHub, a pull request is tracked by a Git branch. If we have N inter-dependent pull requests, that means N Git branches. In the worst case, we have one Git branch for every Git commit. Managing N in-flight Git branches would be absurd. It would impose considerable overhead on pull request submitters. It would perfectly highlight the inefficiency in Git's game of refs branch management that I blogged about two years ago. (Succinctly, once you are accustomed to workflows - like Mercurial's - which don't require you to name commits or branches, Git's forced naming of branches and all the commands requiring those branch names feels grossly inefficient and a mountain of overhead.) Some tooling could certainly be implemented to enable efficient submission of pull requests. (See ghstack for an example.) But I think the interplay between Git branches and GitHub pull requests is sufficiently complex that the tooling and workflow would be intractable for anything but the most trivial and best-case scenarios. Keep in mind that any sufficiently user-friendly solution to this problem would also entail improving git rebase so it moves branches on rewritten ancestor commits instead of leaving them on the old versions of commits. (Seriously, someone should implement this feature: it arguably makes sense as the default behavior for local branches.) In other words, I don't think you can implement the multiple pull request model reliably and without causing excessive burden on people without fundamentally changing the requirement that a pull request be a Git branch. (I'd love to be proven wrong.)
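To make the bookkeeping concrete, here is a sketch (branch names invented) of a three-commit stack expressed as three dependent GitHub pull requests, and what happens when the bottom of the stack is amended:

    git checkout -b part1 origin/master
    # ...commit...
    git push -u origin part1                  # PR #1, targeting master

    git checkout -b part2 part1
    # ...commit...
    git push -u origin part2                  # PR #2, targeting part1

    git checkout -b part3 part2
    # ...commit...
    git push -u origin part3                  # PR #3, targeting part2

    # After amending part1 in response to review, every descendant branch has
    # to be rebased and force pushed by hand to keep the PRs coherent. (The
    # @{1} reflog references assume the rewrite was the most recent change to
    # each branch.)
    git rebase --onto part1 part1@{1} part2
    git rebase --onto part2 part2@{1} part3
    git push --force-with-lease origin part1 part2 part3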

Therefore, I don't think the more, smaller changes workflow can be easily practiced with multiple pull requests using the common GitHub model without effectively moving the definition of a pull request away from equivalence with a Git branch (more on this later). And I also don't mean to imply that dependencies between pull requests can't be implemented: they can and GitLab is evidence. But GitLab's implementation is somewhat simple and crude (possibly because doing anything more complicated is really hard as I speculate).

So without fundamentally changing the relationship between a pull request and a branch, that leaves us with our alternative of pull requests putting more emphasis on the individual changes rather than the merge diff. Let's talk about that now.

Pull requests have historically placed emphasis on the merge diff. That is, GitHub (or another provider) takes the Git branch you have submitted, runs a git merge against the target branch behind the scenes, and displays that diff front and center for review as the main proposed unit of change: if you click the Files changed tab to commence review, you are seeing this overall merge diff. You can click on the Commits tab then select an individual commit to review just that commit. Or you can use the dropdown on the Files changed tab to select an individual commit to review it. These (relatively new) features are a very welcome improvement and do facilitate performing a commit-by-commit review, which is a requirement to realize the benefits of a more, smaller changes workflow. Unfortunately, they are far from sufficient to fully realize the benefits of that workflow.

Defaults matter and GitHub's default is to show the merge diff when conducting review. (I bet a large percentage of users don't even know it is possible to review individual commits.) Since larger changes result in a higher defect rate and slower review, GitHub's default of showing the merge diff effectively means GitHub is defaulting to lower quality, longer-lasting reviews. (I suppose this is good for engagement numbers, as it inflates service usage both immediately and in the long-term due to subsequent bugs driving further usage. But I sincerely hope no product manager is thinking let's design a product that undermines quality to drive engagement.)

Unfortunately, a trivial change of the default to show individual commits instead of the merge diff is not so simple, as many authors and projects don't practice clean commit authorship practices, where individual commits are authored such that they can be reviewed in isolation.

(One way of classifying commit authorship styles is by whether a series of commits is authored such that each commit is good in isolation or whether the effect of applying the overall series is what matters. A handful of mature projects - like the Linux kernel, Firefox, Chrome, Git, and Mercurial - practice the series of individually-good commits model, which I'll call a commit-centric workflow. I would wager the majority of projects on GitHub and similar services practice the we only care about the final result of the series of commits model. A litmus test for practicing the latter model is whether pull requests contain commits like fixup foo or if subsequent revisions to pull requests create new commits instead of amending existing ones. I'm a strong proponent of a clean commit history where each commit in the final repository history stands as good in isolation. But I tend to favor more grown-up software development practices and am a version control guru. That being said, the subject/debate is fodder for another post.)

If GitHub (or someone else) switched the pull request default to a per-commit review without otherwise changing the relationship between a pull request and a Git branch, that would force a lot of less experienced users to familiarize themselves with history rewriting in Git. This would impose considerable pain and suffering on pull request authors, which would in turn upset users, hurt engagement, etc. Therefore, I don't think this is a feasible global default that can be changed. Maybe if Git's user experience for history rewriting were better or we didn't have a decade of behavior to undo we'd be in a better position... But pull request implementations don't need to make a global change: they could right the ship by offering projects that practice clean commit practices an option to change the review default so it emphasizes individual commits instead of the merge diff. This would go a long way towards encouraging authoring and reviewing individual commits, which should have positive benefits on review velocity and code quality outcomes.

But even if these services did emphasize individual commits by default in pull request reviews, there's still a handful of significant deficiencies that would undermine the more, smaller changes workflow that we desire.

While it is possible to review individual commits, all the review comments are still funneled into a single per pull request timeline view of activity. If submitter and reviewer make the effort to craft and subsequently review individual commits, your reward is that all the feedback for the discrete units of change gets lumped together into one massive pile of feedback for the pull request as a whole. This unified pile of feedback (currently) does a poor job of identifying which commit it applies to and gives the author little assistance in knowing which commits need amending to address the feedback. This undermines the value of commit-centric workflows and effectively pushes commit authors towards the fixup style of commit authorship. In order to execute per-commit review effectively, review comments and discussion need to be bucketed by commit and not combined into a unified pull request timeline. This would be a massive change to the pull request user interface and would be a daunting undertaking, so it is understandable why it hasn't happened yet. And such an undertaking would also require addressing subtly complex issues like how to preserve reviews in the face of force pushes. Today, GitHub's review comments can lose context when force pushes occur. Things are better than they used to be, when review comments left on individual commits would flat out be deleted (yes: GitHub really did effectively lose code review comments for several years.) But even with tooling improvements, problems still remain and should adoption of commit-level review tracking occur, these technical problems would likely need resolution to appease users of this workflow.

Even if GitHub (or someone else) implements robust per-commit review for pull requests, there's still a problem with velocity. And that problem is that if the pull request is your unit of integration (read: merging), then you have to wait until every commit is reviewed before integration can occur. This may sound tolerable (it's what we practice today after all). But I argue this is less optimal than a world where a change integrates as soon as it is ready to, without having to wait for the changes after it. As an author and maintainer, if I see a change that is ready to integrate, I prefer to integrate it as soon as possible, without delay. The longer a ready-to-integrate change lingers, the longer it is susceptible to bit rot (when the change is no longer valid/good due to other changes in the system). Integrating a judged-good change sooner also reduces the time to meaningful feedback: if there is a fundamental problem early in a series of changes that isn't caught before integration, integrating earlier changes sooner without waiting for the ones following will expose problems sooner. This minimizes deltas in changed systems (often making regression hunting easier), often minimizes the blast radius if something goes wrong, and gives the author more time and less pressure to amend subsequent commits that haven't been integrated yet.

And in addition to all of this, integrating more often just feels better. The Progress Principle states that people feel better and perform better work when they are making continuous progress. But setbacks more than offset the power of small wins. While I'm not aware of any explicit research in this area, my interpretation of the Progress Principle as applied to change authorship and project maintenance (which is supported by anecdotal observation) is that a steady stream of integrated changes feels a hell of a lot better than a single monolithic change lingering in review purgatory for what can often seem like an eternity. While you need to be cognizant not to confuse movement with meaningful progress, I think there is real power to the Progress Principle and that we should aim to incorporate changes as soon as they are ready and not any later. Applied to version control and code review, this means integrating a commit as soon as author, reviewer, and our machine overlords reporting status checks all agree it is ready, without having to wait for a larger unit of work, like the pull request. Succinctly, move forward as soon as you are able to!

This desire to integrate faster has significant implications for pull requests. Again, looking at GitHub's implementation of pull requests, I don't see how today's pull requests could adapt to this desired end state without significant structural changes. For starters, review must grow the ability to track per-commit state otherwise integrating individual commits without the entirety of the parts makes little sense. But this entails all the complexity I described above. Then there's the problem of Git branches effectively defining a pull request. What happens when some commits in a pull request are integrated and the author rebases or merges their local branch against their new changes? This may or may not just work. And when it doesn't just work, the author can easily find themselves in merge conflict hell, where one commit after the other fails to apply cleanly and their carefully curated stack of commits quickly becomes a liability and impediment to forward progress. (As an aside, the Mercurial version control tool has a concept called changeset evolution where it tracks which commits - changesets in Mercurial parlance - have been rewritten as other commits and gracefully reacts in situations like a rebase. For example, if you have commits X and Y and X is integrated via a rebase as X', an hg rebase of Y onto X' will see that X was rewritten as X' and skip attempting to rebase X because it is already applied! This cleanly sidesteps a lot of the problems with history rewriting - like merge conflicts - and can make the end-user experience much more pleasant as a result.) While it is certainly possible to integrate changes as soon as they are ready with a pull request workflow, I think that it is awkward and that by the time you've made enough changes to accommodate the workflow, very little is left of the pull request workflow as we know it and it is effectively a different workflow altogether.
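For the Mercurial-inclined, the scenario in that aside plays out roughly like this, assuming the evolve extension is enabled and obsolescence markers are exchanged with the server (X and X' stand in for the relevant revisions):

    hg pull                 # brings in X', the landed rewrite of X, plus the
                            # obsolescence marker recording X -> X'
    hg rebase -s X -d X'    # Mercurial sees X is already applied as X' and
                            # skips it; only Y is actually rebased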

The above arguments overly hinge on the assumption that more, smaller changes is superior for quality and/or velocity and that we should design workflows around this assertion. While I strongly believe in the merits of smaller units of change, others may disagree. (If you do disagree, you should ask yourself whether you believe the converse: that larger units of change are better for quality and velocity. I suspect most people can't justify this. But I do believe there is merit to the argument that smaller units of change impose additional per-unit costs or have second order effects that undermine their touted quality or velocity benefits.)

But even if you don't buy into the change size arguments, there's still a very valid reason why we should think beyond pull requests as they are implemented today: tool scalability.

The implementation of pull requests today is strongly coupled with how Git works out of the box. A pull request is initiated from a Git branch pushed to a remote Git repository. When the pull request is created, the server creates a Git branch/ref referring to that pull request's head commits. On GitHub, these refs are named pull/ID/head (you can fetch these from the remote Git repository but they are not fetched by default). Also when a pull request is created or updated, a git merge is performed to produce a diff for review. On GitHub, the resulting merge commit is saved and pointed to on the open pull request via a pull/ID/merge ref, which can also be fetched locally. (The merge ref is deleted when the pull request is closed.)
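If you want to poke at these refs yourself (the pull request number here is invented), they are plainly visible on the remote:

    git ls-remote origin 'refs/pull/1234/*'
    # refs/pull/1234/head   tip of the submitted branch
    # refs/pull/1234/merge  the precomputed merge commit (while the PR is open)

    # They can be fetched explicitly, even though they aren't fetched by default:
    git fetch origin pull/1234/head:pr-1234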

Herein resides our scalability problem: unbound growth of Git refs and an ever-increasing rate of change for a growing project. Each Git ref adds overhead to graph walking operations and data exchange. While involved operations are continuously getting optimized (often through the use of more advanced data structures or algorithms), there are intrinsic scaling challenges with this unbound growth that - speaking as a version control tool maintainer - I want no part of. Are technical solutions enabling things to scale to millions of Git refs viable? Yes'ish. But it requires high-effort solutions like JGit's Reftable, which required ~90 review rounds spread across ~4 months to land. And that's after design of the feature was first proposed at least as far back as July 2017. Don't get me wrong: am I glad Reftable exists? Yes. It is a fantastic solution to a real problem and reading how it works will probably make you a better engineer. But simultaneously, it is a solution to a problem that does not need to exist. There is a space for scaling graph data structures and algorithms to millions or even billions of nodes, edges, and paths: your version control tool should not be it. Millions or billions of commits and files: that's fine. But scaling the number of distinct paths through that graph by introducing millions of DAG heads is insane given how much complexity it introduces in random areas of the tool. In my opinion it requires unjustifiably large amounts of investment to make work at scale. As an engineer, my inclination when posed with problems like these is to avoid them in the first place. The easiest problems to solve are those you don't have.
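You can get a feel for this growth on any old or busy GitHub-hosted project: every pull request ever opened leaves refs behind on the server, and a one-liner like the following (illustrative, not exhaustive) counts them:

    git ls-remote origin 'refs/pull/*' | wc -l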

Unfortunately, the tight coupling of pull requests to Git branches/refs introduces unbound growth and a myriad of problems associated with it. Most projects may not grow to a size that experiences these problems. But as someone who has experience with this problem space at multiple companies, I can tell you the problem is very real and the performance and scalability issues it creates undermines the viability of using today's implementation of pull requests once you've reached a certain scale. Since we can likely fix the underlying scaling issues with Git, I don't think the explosion of Git refs is a long-term deal breaker for scaling pull requests. But it is today and will remain so until Git and the tools built on top of it improve.

In summary, some high-level problems with pull requests are as follows:

  • Review of merge diff by default encourages larger units of review, which undermines quality and velocity outcomes.
  • Inability to incrementally integrate commits within a pull request, which slows down velocity, time to meaningful feedback, and can lower morale.
  • Tight coupling of pull requests with Git branches adds rigidity and shoehorns users into less flexible and less desirable workflows.
  • Deficiencies in the Git user experience - particularly around what happens when rewrites (including rebase) occur - significantly curtail what workflows can be safely practiced with pull requests.
  • Tight coupling of pull requests with Git branches can lead to performance issues at scale.

We can invert language to arrive at a set of more ideal outcomes:

  • Review experience is optimized for individual commits - not the merge diff - so review units are smaller and quality and velocity outcomes are improved.
  • Ability to incrementally integrate individual commits from a larger set so ready-to-go changes are incorporated sooner, improving velocity, time to meaningful feedback, and morale.
  • How you use Git branches does not impose significant restrictions on handling of pull requests.
  • You can use your version control tool how you want, without having to worry about your workflow being shoehorned by how pull requests work.
  • Pull request server can scale to the most demanding use cases with relative ease.

Let's talk about how we could achieve these more desirable outcomes.

Exploring Alternative Models

A pull request is merely an implementation pattern for the general problem space of integrating a proposed change. There are other patterns used by other tools. Before I describe them, I want to coin the term integration request to refer to the generic concept of requesting some change being integrated elsewhere. GitHub pull requests and GitLab merge requests are implementations of integration requests, for example.

Rather than describe alternative tools in detail, I will outline the key areas where different tools differ from pull requests and assess the benefits and drawbacks to the different approaches.

Use of the VCS for Data Exchange

One can classify implementations of integration requests by how they utilize the underlying version control tools.

Before Git and GitHub came along, you were probably running a centralized version control tool which didn't support offline commits or feature branches (e.g. CVS or Subversion). In this world, the common mechanism for integration requests was exchanging diffs or patches through various media - email, post to a web service of a code review tool, etc. Your version control tool didn't speak directly to a VCS server to initiate an integration request. Instead, you would run a command which would export a text-based representation of the change and then send it somewhere.
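The mechanics were simple: produce a plain text diff and move it around by whatever means were handy. For example (repository state and recipients hypothetical):

    # centralized tools: export a diff, then mail it or upload it to the review tool
    svn diff > my-change.patch

    # the same spirit survives in Git's patch-by-email workflow
    git format-patch -1 HEAD
    git send-email --to=dev-list@example.com 0001-*.patch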

Today, we can classify integration requests by whether they speak the version control tool's native protocol for exchanging data or whether they exchange patches through some other mechanism. Pull requests speak the VCS native protocol. Tools like Review Board and Phabricator exchange patches via custom HTTP web services. Typically, tools using non-native exchange will require additional client-side configuration, potentially including the installation of a custom tool (e.g. RBTools for Review Board or Arcanist for Phabricator), although modern version control tools sometimes have this functionality built in (e.g. Git and Mercurial fulfill Zawinski's law, and Mercurial has a Phabricator extension in its official distribution).
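Concretely, the non-native flow usually boils down to running a vendor-specific client that uploads a diff over HTTP. The commands exist as named below, though exact flags and targets depend on your setup:

    rbt post              # RBTools: posts the current change to a Review Board server
    arc diff              # Arcanist: uploads the working copy's changes to Phabricator
    hg phabsend -r .      # Mercurial's bundled phabricator extension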

An interesting outlier is Gerrit, which ingests its integration requests via git push. (See the docs.) But the way Gerrit's ingestion via git push works is fundamentally different from how pull requests work! With pull requests, you are pushing your local branch to a remote branch and a pull request is built around that remote branch. With Gerrit, your push command is like git push gerrit HEAD:refs/for/master. For the non-gurus, that HEAD:refs/for/master syntax means, push the HEAD commit (effectively the commit corresponding to the working directory) to the refs/for/master ref on the gerrit remote (the SOURCE:DEST syntax specifies a mapping of local revision identifier to remote ref). The wizard behind the curtain here is that Gerrit runs a special Git server that implements non-standard behavior for the refs/for/* refs. When you push to refs/for/master, Gerrit receives your Git push like a normal Git server would. But instead of writing a ref named refs/for/master, it takes the incoming commits and ingests them into a code review request! Gerrit will create Git refs for the pushed commits. But it mainly does that for its internal tracking (Gerrit stores all its data in Git - from Git data to review comments). And if that functionality isn't too magical for you, you can also pass parameters to Gerrit via the ref name! e.g. git push gerrit HEAD:refs/for/master%private will create a private review request that requires special permissions to see. (It is debatable whether overloading the ref name for additional functionality is a good user experience for average users. But you can't argue that this isn't a cool hack!)
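In command form (the remote name and target branch are whatever your project uses; the % option syntax comes from Gerrit's push documentation):

    git push gerrit HEAD:refs/for/master             # create or update a review
    git push gerrit HEAD:refs/for/master%private     # same, but as a private change
    git push gerrit HEAD:refs/for/master%topic=lfs   # pass extra options via the ref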

On the surface, it may seem like using the version control tool's native data exchange is a superior workflow because it is more native and more modern. (Emailing patches is so old school.) Gone are the days of having to configure client-side tools to export and submit patches. Instead, you run git push and your changes can be turned into an integration request automatically or with a few mouse clicks. And from a technical level, this exchange methodology is likely safer, as round-tripping a text-based representation of a change without data loss is surprisingly finicky. (e.g. JSON's lack of lossless binary data exchange without encoding to e.g. base64 first often means that many services exchanging text-based patches are lossy, especially in the presence of content which doesn't conform to UTF-8, which can be commonplace in tests. You would be surprised how many tools experience data loss when converting version control commits/diffs to text. But I digress). Having Git's wire protocol exchange binary data is safer than exchanging text patches and probably easier to use since it doesn't require any additional client-side configuration.

But despite being more native, modern, and arguably robust, exchange via the version control tool may not be better.

For starters, use of the version control tool's native wire protocol inhibits use of arbitrary version control tools on the client. When your integration request requires the use of a version control tool's wire protocol, the client likely needs to be running that version control tool. With other approaches like exchange of text based patches, the client could be running any software it wanted: as long as it could spit out a patch or API request in the format the server needed, an integration request could be created! This meant there was less potential for lock-in, as people could use their own tools on their machines if they wanted and they (hopefully) wouldn't be inflicting their choice on others. Case in point, a majority of Firefox developers use Mercurial - the VCS of the canonical repository - but a large number use Git on the client. Because Firefox is using Phabricator (Review Board and Bugzilla before that) for code review and because Phabricator ingests text-based patches, the choice of the VCS on the client doesn't matter that much and the choice of the server VCS can be made without inciting a holy war among developers who would be forced to use a tool they don't prefer. Yes, there are good reasons for using a consistent tool (including organizational overhead) and sometimes mandates for tool use are justified. But in many cases (such as random open source contributions), it probably doesn't or shouldn't matter. And in cases like Git and Mercurial, where tools like the fantastic git-cinnabar make it possible to easily convert between the repositories without data loss and acceptable overhead, adoption of the version control tool's native wire protocol can exclude or inhibit the productivity of contributors since it can mandate use of specific, undesired tooling.

Another issue with using the version control tool's wire protocol is that it often forces or strongly encourages you to work a certain way. Take GitHub pull requests for example. The pull request is defined around the remote Git branch that you git push. If you want to update that branch, you need to know its name. So that requires some overhead to either create and track that branch or find its name when you want to update it. Contrast with Gerrit, where you don't have an explicit remote branch you push to: you simply git push gerrit HEAD:refs/for/master and it figures things out automatically (more on this later). With Gerrit, I don't have to create a local Git branch to initiate an integration request. With pull requests, I'm compelled to. And this can undermine my productivity by compelling me to practice less-efficient workflows!

Our final point of comparison involves scalability. When you use the version control tool wire protocol as part of integration requests, you have introduced the problem of scaling your version control server. Take it from someone who has had multiple jobs involving scaling version control servers and who is intimately aware of the low-level details of both the Git and Mercurial wire protocols: you don't want to be in the business of scaling a version control server. The wire protocols for both Git and Mercurial were designed in a now-ancient era of computing and weren't designed by network protocol experts. They are fundamentally difficult to scale at just the wire protocol level. I've heard stories that at one time, the most expensive single server at Google was their Perforce or Perforce-derived server (this was several years ago - Google has since moved on to a better architecture).

The poor network protocols of version control tools have many side-effects, including the inability or sheer difficulty of using distributed storage on the server. So in order to scale compute horizontally, you need to invest in expensive network storage solutions or devise a replication and synchronization strategy. And take it from someone who worked on data synchronization products (outside of the source control space) at three companies: this is a problem you don't want to solve yourself. Data synchronization is intrinsically difficult and rife with difficult trade-offs. It's almost always a problem best avoided if you have a choice in the matter.

If creating Git refs is part of creating an integration request, you've introduced a scaling challenge with the number of Git refs. Do these Git refs live forever? What happens when you have thousands of developers - possibly all working in the same repository - and the number of refs or ref mutations grows to the hundreds of thousands or millions per year?

Can your version control server handle ingesting a push every second or two with reasonable performance? Unless you are Google, Facebook, or a handful of other companies I'm aware of, it can't. And before you cry that I'm talking about problems that only plague the 0.01% of companies out there, I can name a handful of companies under 10% the size of these behemoths where this is a problem for them. And I also guarantee that many people don't have client-side metrics for their git push P99 times or reliability and don't even realize there is a problem! Scaling version control is probably not a core part of your company's business. Unfortunately, it all too often becomes something companies have to allocate resources for because of poorly designed or utilized tools.

Contrast the challenges of scaling integration requests with a native version control server versus just exchanging patches. With the more primitive approach, you are probably sending the patch over HTTP to a web service. And with tools like Phabricator and Review Board, that patch gets turned into rows in a relational database. I guarantee it will be easier to scale an HTTP web service fronting a relational database than it will be your version control server. If nothing else, it should be easier to manage and debug, as there are tons more experts in these domains than in the version control server domain!

Yes, it is true that many will not hit the scaling limits of the version control server. And some nifty solutions for scaling do exist. But large segments of this problem space - including the version control tool maintainers having to support crazy scaling vectors in their tools - could be avoided completely if integration requests didn't lean so heavily on the version control tool's default mode of operation. Unfortunately, solutions like GitHub pull requests and Gerrit's use of Git refs for storing everything exert a lot of pressure on scaling the version control server and make this a very real problem once you reach a certain scale.

Hopefully the above paragraphs enlightened you to some of the implications that the choice of a data exchange mechanism has on integration requests! Let's move on to another point of comparison.

Commit Tracking

One can classify implementations of integration requests by how they track commits through their integration lifecycle. What I mean by this is how the integration request follows the same logical change as it evolves. For example, if you submit a commit then amend it, how does the system know that the commit evolved from commit X to X'?

Pull requests don't track commits directly. Instead, a commit is part of a Git branch and that branch is tracked as the entity the pull request is built around. The review interface presents the merge diff front and center. It is possible to view individual commits. But as far as I know, none of these tools have smarts to explicitly track or map commits across new submissions. Instead, they simply assume that the commit order will be the same. If commits are reordered or added or removed in the middle of an existing series, the tool can get confused quite easily. (With GitHub, it was once possible for a review comment left on a commit to disappear entirely. The behavior has since been corrected and if GitHub doesn't know where to print a comment from a previous commit, it renders it as part of the pull request's timeline view.)

If all you are familiar with is pull requests, you may not realize there are alternatives to commit tracking! In fact, the most common alternative (which isn't do nothing) predates pull requests entirely and is still practiced by various tools today.

The way that Gerrit, Phabricator, and Review Board work is the commit message contains a unique token identifying the integration request for that commit. e.g. a commit message for a Phabricator review will contain the line Differential Revision: https://phab.mercurial-scm.org/D7543. Gerrit will have something like Change-Id: Id9bfca21f7697ebf85d6a6fa7bac7de4358d7a43.

The way this annotation appears in the commit message differs by tool. Gerrit's web UI advertises a shell one-liner to clone repositories which not only performs a git clone but also uses curl to download a shell script from the Gerrit server and install it as Git's commit-msg hook in the newly-cloned repositories. This Git hook will ensure that any newly-created commit has a Change-ID: XXX line containing a randomly generated, hopefully unique identifier. Phabricator and Review Board leverage client-side tooling to rewrite commit messages after submission to their respective tool so the commit message contains the URL of the code review. One can debate which approach is better - they each have advantages and drawbacks. Fortunately, this debate is not germane to this post, so we won't cover it here.
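A simplified version of Gerrit's advertised clone-and-hook setup (host, user, and project names are placeholders) looks something like this:

    git clone ssh://alice@gerrit.example.com:29418/myproject
    curl -Lo myproject/.git/hooks/commit-msg \
        https://gerrit.example.com/tools/hooks/commit-msg
    chmod +x myproject/.git/hooks/commit-msg

    # every commit created afterwards gains a trailer such as:
    #   Change-Id: Id9bfca21f7697ebf85d6a6fa7bac7de4358d7a43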

What is important is how this metadata in commit messages is used.

The commit message metadata comes into play when a commit is being ingested into an integration request. If a commit message lacks metadata or references an entity that doesn't exist, the receiving system assumes it is new. If the metadata matches an entity on file, the incoming commit is often automatically matched up to an existing commit, even if its Git SHA is different!

This approach of inserting a tracking identifier into commit messages works surprisingly well for tracking the evolution of commits! Even if you amend, reorder, insert, or remove commits, the tool can often figure out what matches up to previous submissions and reconcile state accordingly. Although support for this varies by tool. Mercurial's extension for submitting to Phabricator is smart enough to take the local commit DAG into account and change dependencies of review units in Phabricator to reflect the new DAG shape, for example.

The tracking of commits is another one of those areas where the simpler and more modern features of pull requests often don't work as well as the solutions that came before. Yes, inserting an identifier into commit messages feels hacky and can be brittle at times (some tools don't implement commit rewriting very well and this can lead to a poor user experience). But you can't argue with the results: using explicit, stable identifiers to track commits is far more robust than the heuristics that pull requests rely on. The false negative/positive rate is so much lower. (I know this from first hand experience because we attempted to implement commit tracking heuristics for a code review tool at Mozilla before Phabricator was deployed and there were a surprising number of corner cases we couldn't handle properly. And this was using Mercurial's obsolescence markers, which gave us commit evolution data generated directly by the version control tool! If that didn't work well enough, it's hard to imagine an heuristic that would. We eventually gave up and used stable identifiers in commit messages, which fixed most of the annoying corner cases.)

The use of explicit commit tracking identifiers may not seem like it makes a meaningful difference. But its impact is profound.

The obvious benefit of tracking identifiers is that they allow rewriting commits without confusing the integration request tool. This means that people can perform advanced history rewriting with near impunity as to how it would affect the integration request. I am a heavy history rewriter. I like curating a series of individually high-quality commits that can each stand in isolation. When I submit a series like this to a GitHub pull request and receive feedback on something I need to change, when I enact those changes I have to think will my rewriting history here make re-review harder? (I try to be empathetic with the reviewer and make their life easier whenever possible. I ask what I would appreciate someone doing if I were reviewing their change and tend to do that.) With GitHub pull requests, if I reorder commits or add or remove a commit in the middle of a series, I realize that this may make review comments left on those commits hard to find since GitHub won't be able to sort out the history rewriting. And this may mean those review comments get lost and are ultimately not acted upon, leading to bugs or otherwise deficient changes. This is a textbook example of tooling deficiencies dictating a sub-optimal workflow and outcome: because pull requests don't track commits explicitly, I'm forced to adopt a non-ideal workflow or sacrifice something like commit quality in order to minimize risks that the review tool won't get confused. In general, tools should not externalize these kinds of costs or trade-offs onto users: they should just work and optimize for generally agreed-upon ideal outcomes.

Another benefit to tracking identifiers is that they enable per-commit review to be viable. Once you can track the logical evolution of a single commit, you can start to associate things like review comments with individual commits with a high degree of confidence. With pull requests (as they are implemented today), you can attempt to associate comments with commits. But because you are unable to track commits across rewrites with an acceptably high degree of success, rewritten commits often fall through the cracks, orphaning data like review comments with them. Data loss is bad, so you need a place to collect this orphaned data. The main pull request activity timeline facilitates this function.

But once you can track commits reliably (and tools like Gerrit and Phabricator prove this is possible), you don't have this severe problem of data loss and therefore don't need to worry about finding a place to collect orphaned data! You are then able to create per-commit review units, each as loosely coupled with other commits and an overall series as you want to make it!

It is interesting to note the different approaches taken by different tools here. And it is doubly interesting to contrast what is possible with the review tool itself versus what it does by default!

Let's examine Phabricator. Phabricator's review unit is the Differential revision. (Differential is the name of the code review tool in Phabricator, which is actually a suite of functionality - like GitLab, but not nearly as feature complete.) A Differential revision represents a single diff. Differential revisions can have parent-child relationships with others. Multiple revisions associated like this form a conceptual stack in Phabricator's terminology. Go to https://phab.mercurial-scm.org/D4414 and search for stack to see it in action. (Stack is a somewhat misleading name because the parent-child relationships actually form a DAG and Phabricator is capable of rendering things like multiple children in its graphical view.) Phabricator's official client-side tool for submitting to Phabricator - Arcanist, or arc - collapses all Git commits into a single Differential revision by default.

Phabricator can preserve metadata from the individual commits (it can render at least the commit messages in the web UI so you can see where the Differential revision came from). In other words, by default Arcanist does not construct a separate Differential revision for each commit and therefore does not establish parent-child relationships between them. So there is no stack to render here. To be honest, I'm not sure if modern versions of Arcanist even support doing this. I do know both Mercurial and Mozilla authored custom client side tools for submitting to Phabricator to work around deficiencies like this in Arcanist. Mozilla's may or may not be generally suitable for users outside of Mozilla - I'm not sure.

Another interesting aspect of Phabricator is that there is no concept of an over-arching series. Instead, each Differential revision stands in isolation. They can form parent-child relationships and constitute a stack. But there is no primary UI or APIs for stacks (the last I looked anyway). This may seem radical. You may be asking questions like how do I track the overall state of a series or how do I convey information pertinent to the series as a whole. These are good questions. But without diving into them, the answer is that as radical as it sounds to not have an overall tracking entity for a series of Differential revisions, it does work. And having used this workflow with the Mercurial Project for a few years, I can say I'm not missing the functionality that much.

Gerrit is also worth examining. Like Phabricator, Gerrit uses an identifier in commit messages to track the commit. But whereas Phabricator rewrites commit messages at initial submission time to contain the URL that was created as part of that submission, Gerrit peppers the commit message with a unique identifier at commit creation time. The server then maintains a mapping of commit identifier to review unit. Implementation details aside, the end result is similar: individual commits can be tracked more easily.
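To make this concrete, here is roughly what those identifiers look like. A Phabricator-tracked commit ends up with a trailer like:

Differential Revision: https://phab.mercurial-scm.org/D4414

while a Gerrit-tracked commit carries a trailer like:

Change-Id: I7f3c5e2a9d4b1c8e6f0a2b3c4d5e6f7a8b9c0d1e

(That Change-Id value is made up for illustration.) The review tool keys off this line rather than the commit SHA, so the commit can be amended, reordered, or rebased and still map back to the same review unit.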

What distinguishes Gerrit from Phabricator is that Gerrit does have a stronger grouping around multiple commits. Gerrit will track when commits are submitted together and will render both a relation chain and submitted together list automatically. While it lacks the visual beauty of Phabricator's implementation, it is effective and is shown in the UI by default, unlike Phabricator.

Another difference from Phabricator is that Gerrit uses per-commit review by default. Whereas you need a non-official client for Phabricator to submit a series of commits to constitute a linked chain, Gerrit does this by default. And as far as I can tell, there's no way to tell Gerrit to squash your local commits down to a single diff for review: if you want a single review to appear, you must first squash commits locally then push the squashed commit. (More on this topic later in the post.)
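For illustration, that local squash-then-push flow against Gerrit could look like this (assuming your work sits on top of origin/master):

$ git rebase -i origin/master            # mark all but the first commit as squash/fixup
$ git push origin HEAD:refs/for/master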

A secondary benefit of per-commit review is that this model enables incremental integration workflows, where some commits in a series or set can integrate before others, without having to wait for the entire batch. Incremental integration of commits can drastically speed up certain workflows, as commits can integrate as soon as they are ready and no later. The benefits of this model can be incredible. But actually deploying this workflow can be tricky. One problem is that your version control tool may get confused when you rebase or merge partially landed state. Another problem is that it can increase the overall change rate of the repository, which may strain systems from version control to CI to deployment mechanisms. Another potential problem involves distinguishing review sign-off from integration sign-off. Many tools/workflows conflate I sign off on this change and I sign off on landing this change. While they are effectively identical in many cases, there are some valid cases where you want to track these distinctly. And adopting a workflow where commits can integrate incrementally will expose these corner cases. So before you go down this path, you want to be thinking about who integrates commits and when they are integrated. (You should probably be thinking about this anyway because it is important.)

Designing a Better Integration Request

Having described some problems with pull requests and alternative ways of solving the general problem of integration requests, it is time to tackle the million dollar question: how would we design a better integration request? (When you factor in the time people spend in pull requests and the cost of bugs and low quality changes that slip through due to the design of existing tooling, improving integration requests industry-wide would be worth a lot more than $1M.)

As a reminder, the pull request is fundamentally a nice UI and set of features built around the common Git feature branch workflow. This property has been preserved from the earliest days of pull requests in 2007-2008 and has been copied by vendors like Bitbucket and GitLab in the years since. In my mind, pull requests are ripe for overhaul.

Replace Forks

The first change I would make to pull requests is to move away from forks being a required part of the workflow. This may seem radical. But it isn't!

A fork on services like GitHub is a fully fledged project - just like the canonical project it was forked from. It has its own issues, wiki, releases, pull requests, etc. Now, show of hands: how often do you use these features on a fork? Me neither. In the overwhelming majority of cases, a fork exists solely as a vehicle to initiate a pull request against the repository it was forked from. It provides little to no additional meaningful functionality. Now, I'm not saying forks don't serve a purpose - they certainly do! But in the case of someone wanting to propose a change to a repository, a fork is not strictly required and its existence is imposed on us by the current implementation of pull requests.

I said impose in the previous sentence because forks introduce overhead and confusion. The existence of a fork may confuse someone as to where a canonical project lives. Forks also add overhead in the version control tool. Their existence forces the user to manage an additional Git remote and branches. It forces people to remember to keep their branches in sync on their fork. As if remembering to keep your local repository in sync wasn't hard enough! And if pushing to a fork, you need to re-push data that was already pushed to the canonical repository, even though that data already exists on the server (just in a different view of the Git repository). (I believe Git is working on wire protocol improvements to mitigate this.)

When merely used as a vehicle to initiate integration requests, I do not believe forks offer enough value to justify their existence. Should forks exist? Yes. Should people be forced to use them in order to contribute changes? No. (Valid use cases for a fork would be performing a community splinter of a project or creating an independent entity for reasons such as better guarantees of data availability and integrity.)

Forks are essentially a veneer on top of a server-side git clone. And the reason why a separate Git repository is used at all is probably because the earliest versions of GitHub were just a pile of abstractions over git commands. The service took off in popularity, people copied its features almost verbatim, and nobody ever looked back and thought why are we doing things like this in the first place.

To answer what we would replace forks with, we have to go back to first principles and ask what we are trying to do. And that is to propose a unit of change against an existing project. And for version control tools, all you need to propose a change is a patch/commit. So to replace forks, we just need an alternate mechanism to submit patches/commits to an existing project.

My preferred alternative to forks is to use git push directly to the canonical repository. This could be implemented like Gerrit where you push to a special ref. e.g. git push origin HEAD:refs/for/master. Or - and this is my preferred solution - version control servers could grow more smarts about how pushes work - possibly even changing what commands like git push do if the server is operating in special modes.

One idea would be for the Git server to expose different refs namespaces depending on the authenticated user. For example, I'm indygreg on GitHub. If I wanted to propose a change to a project - let's say python/cpython - I would git clone git@github.com:python/cpython. I would create a branch - say indygreg/proposed-change. I would then git push origin indygreg/proposed-change and because the branch prefix matches my authenticated username, the server lets it through. I can then open a pull request without a fork! (Using branch prefixes is less than ideal, but it should be relatively easy to implement on the server. A better approach would rely on remapping Git ref names. But this may require a bit more configuration with current versions of Git than users are willing to stomach. An even better solution would be for Git to grow some functionality to make this easier. e.g. git push --workspace origin proposed-change would push proposed-change to a workspace on the origin remote, which Git would know how to translate to a proper remote ref update.)
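As a sketch of how a server could enforce that branch-prefix rule, here is a minimal Git update hook. The AUTH_USER environment variable is an assumption for illustration - a real hosting service would supply the authenticated username through whatever mechanism it already has - and a real deployment would also need rules for maintainers pushing to shared branches:

#!/bin/sh
# update hook: called once per pushed ref with <refname> <old-sha> <new-sha>
# AUTH_USER is hypothetical: assume the host exports the authenticated username.
refname="$1"

case "$refname" in
    refs/heads/"$AUTH_USER"/*)
        # The push targets a branch under the user's own prefix: allow it.
        exit 0
        ;;
    *)
        echo "push to $refname denied: push to refs/heads/$AUTH_USER/<branch> instead" >&2
        exit 1
        ;;
esac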

Another idea would be for the version control server to invent a new concept for exchanging commits - one based on sets of commits instead of DAG synchronization. Essentially, instead of doing a complicated discovery dance to synchronize commits with the underlying Git repository, the server would ingest and expose representations of sets of commits stored next to - but not within - the repository itself. This way you are not scaling the repository DAG to infinite heads - which is a hard problem! A concrete implementation of this might have the client run a git push --workspace origin proposed-change to tell the remote server to store your proposed-change branch in your personal workspace (sorry for reusing the term from the previous paragraph). The Git server would receive your commits, generate a standalone blob to hold them, save that blob to a key-value store like S3, then update a mapping of which commits/branches are in which blobs in a data store such as a relational database somewhere. This would effectively segment the core project data from the more transient branch data, keeping the core repository clean and pure. It allows the server to lean on easier-to-scale data stores such as key-value blob stores and relational databases instead of the version control tool.

I know this idea is feasible because Facebook implemented it for Mercurial. The infinitepush extension essentially siphons Mercurial bundles (standalone files holding commit data) off to a blob store when pushes come in over the wire. At hg pull time, if a requested revision is not present in the repository, the server asks the database-backed blob index if the revision exists anywhere. If it does, the blob/bundle is fetched, dynamically overlaid onto the repository in memory, and served to the client. While the infinitepush extension in the official Mercurial project is somewhat lacking (through no fault of Facebook's), the core idea is solid and I wish someone would spend the time to flesh out the design a bit more, because it really could lead to logically scaling repositories to infinite DAG heads without the complexities of actually scaling DAG algorithms, repository storage, and version control tool algorithms to infinite heads. Getting back to the subject of integration requests, one could imagine having a target for workspace pushes. For example, git push --workspace=review origin would push to the review workspace, which would automatically initiate a code review.

Astute readers of this blog may find these ideas familiar. I proposed user namespaces in my /blog/2017/12/11/high-level-problems-with-git-and-how-to-fix-them/ post a few years ago. Read that post for more on the implications of doing away with forks.

Could forks be done away with as a requirement to submit pull requests? Yes! Gerrit's git push origin HEAD:refs/for/master mechanism proves it. Is Gerrit's approach too much magic or confusing for normal users? I'm not sure. Could Git grow features to make the user experience much better so users don't need to be burdened with complexity or magic and could simply run commands like git submit --for review? Definitely!

Shift Focus From Branches to Individual Commits

My ideal integration request revolves around individual commits, not branches. While the client may submit a branch to initiate or update an integration request, the integration request is composed of a set of loosely coupled commits, where parent-child relationships can exist to express a dependency between commits. Each commit is evaluated individually, although someone may need to inspect multiple commits to gain a full understanding of the proposed change. And some UI enabling operations against a group of related commits (such as mass deleting abandoned commits) may be warranted.

In this world, the branch would not matter. Instead, commits are king. Because we would be abandoning the branch name as a tracker for the integration request, we would need something to replace it, otherwise we would have no way of knowing how to update an existing integration request! We should do what tools like Phabricator, Gerrit, and Review Board do and add a persistent identifier to commits that survives history rewriting. (Branch-based pull requests should do this anyway so history rewrites don't confuse the review tool and e.g. cause comments to get orphaned - see above.)

It's worth noting that a commit-centric integration request model does not imply that everyone is writing or reviewing series of smaller commits! While titans of industry and I strongly encourage the authorship of smaller commits, commit-centric integration requests don't intrinsically force you to do so. This is because commit-centric integration requests aren't forcing you to change your local workflow! If you are the type of person who doesn't want to curate a ton of small, good-in-isolation commits (it does take a bit more work, after all), nobody would force you to do so. Instead, if this is your commit authorship pattern, the submission of the proposed change could squash these commits together as part of the submission, optionally rewriting your local history in the process. If you want to keep dozens of fixup commits around in your local history, that's fine: just have the tooling collapse them all together on submission. While I don't think those fixup commits are that valuable and they shouldn't be seen by reviewers, if we wanted, we could have tools continue to submit them and make them visible (like they are in e.g. GitHub pull requests today). But they wouldn't be the focus of review (again, like GitHub pull requests today). Making integration requests commit-centric doesn't force people to adopt a different commit authorship workflow. But it does enable projects that wish to adopt more mature commit hygiene to do so. That being said, how tools are implemented can impose restrictions. But there is nothing about commit-centric review that fundamentally prohibits the use of fixup commits in local workflows.

While I should create a dedicated post espousing the virtues of commit-centric workflows, I'll state my case by proxy by noting that some projects aren't using modern pull requests precisely because commit-centric workflows are not viable there. When I was at Mozilla, one of the blockers to moving to GitHub was that the pull request review tooling wasn't compatible with our world view that review units should be small. (This view is generally shared by Google, Facebook, and some prominent open source projects, among others.) And for reasons outlined earlier in this post, I think that as long as pull requests revolve around branches / merge diffs and aren't robust in the face of history rewriting (due to the lack of robust commit tracking), projects that insist on more refined practices will continue to eschew pull requests. Again, a link between review size and quality has been established. And better quality - along with its long-term effect of lowering development costs due to fewer bugs - can tip the scales in its favor, even against all the benefits you receive when using a product like GitHub, GitLab, or Bitbucket.

The Best of What's Around

Aspects of a better integration request exist in tools today. Unfortunately, many of these features are not present on pull requests as implemented by GitHub, GitLab, Bitbucket, etc. So to improve the pull request, these products will need to borrow ideas from other tools.

Integration requests not built around Git branches (Gerrit, Phabricator, Review Board, etc) use identifiers in commit messages to track commits. This makes it possible to track commits across history rewrites. There are compelling advantages to this model. Robust commit tracking is a requirement for commit-centric workflows. And it would even improve the functionality of branch-based pull requests. A well-designed integration request would have a robust commit tracking mechanism.

Gerrit has the best-in-class experience for commit-centric workflows. It is the only popular implementation of integration requests I'm aware of that supports and caters to this workflow by default. In fact, I don't think you can change this! (This behavior is user hostile in some cases since it forces users to know how to rewrite commits, which is often perilous in Git land. It would be nice if you could have Gerrit squash commits into the same review unit automatically on the server. But I understand the unwillingness to implement this feature because this has its own set of challenges around commit tracking, which I won't bore you with.) Gerrit also shows groups of related commits front and center when viewing a proposed change.

Phabricator is the only other tool I know of where one can achieve a reasonable commit-centric workflow without the pitfalls of orphaned comments, context overload, etc mentioned earlier in this post. But this requires non-standard submission tooling and commit series aren't featured prominently in the web UI. So Phabricator's implementation is not as solid as Gerrit's.

Another Gerrit feature worth lauding is the submission mechanism. You simply git push to a special ref. That's it. There's no fork to create. No need to create a Git branch. No need to create a separate pull request after the push. Gerrit just takes the commits you pushed and turns them into a request for review. And it doesn't require any additional client-side tooling!

Using a single common git command to submit and update an integration request is simpler and arguably more intuitive than other tools. Is Gerrit's submission perfect? No. The git push origin HEAD:refs/for/master syntax is not intuitive. And overloading submission options by effectively encoding URL parameters on the ref name is a gross - albeit effective - hack. But users will quickly learn the one-liners or create more intuitive aliases.
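Something as simple as a git alias papers over most of that awkwardness. For example (assuming the target branch is master):

$ git config alias.submit 'push origin HEAD:refs/for/master'
$ git submit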

The elegance of using just a git push to initiate an integration request puts Gerrit in a league of its own. I would be ecstatic if the GitHubs of the world reduced the complexity of submitting pull requests to simply cloning the canonical repository, creating some commits, and running a git command. The future of submitting integration requests hopefully looks more like Gerrit than the alternatives.

What Needs to Be Built

Some aspects of the better integration request don't yet exist or need considerable work before I consider them viable.

For tools which leverage the native version control tool for submission (e.g. via git push), there needs to be some work to support submission via a more generic HTTP endpoint. I'm fine with leveraging git push as a submission mechanism because it makes the end-user experience so turnkey. But making it the only submission mechanism is unfortunate. There is some support for this: I believe you can cobble together a pull request from scratch via GitHub's APIs, for example. But it isn't as simple as submit a patch to an endpoint, which it arguably should be. Even Gerrit's otherwise robust HTTP API does not seem to allow creating new commits/diffs. This limitation not only excludes non-Git tools, but also prevents other tooling from submitting without using Git. For example, you may want to write a bot that proposes automated changes, and it is much easier to produce a diff than to use git, since the former does not require a filesystem (this matters in serverless environments, for example).
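To illustrate what submit a patch to an endpoint could look like, here is a hypothetical interaction. The URL and the endpoint are inventions for illustration; only git format-patch and curl are real:

$ git format-patch -1 --stdout \
    | curl --data-binary @- \
           -H 'Content-Type: text/x-patch' \
           https://code.example.com/api/myproject/integration-requests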

A larger issue with many implementations is the over-reliance on Git for server storage. This is most pronounced in Gerrit, where not only are your git pushes stored in a Git repository on the Gerrit server, but every code review comment and reply is stored in Git as well! Git is a generic key-value store and you can store any data you want in it if you shoehorn it properly. And it is cool that all your Gerrit data can be replicated via git clone - this pretty much eliminates the we took a decentralized tool and centralized it via GitHub series of arguments. But if you apply this store everything in Git approach at scale, it means you will be running a Git server at scale. And not just any Git server - a write-heavy Git server! And if you have thousands of developers possibly all working out of the same repository, then you are looking at potentially millions of new Git refs per year. While the Git, Gerrit, and JGit people have done some fantastic work making these tools scale, I'd feel much better if we eschewed the make Git scale to infinite pushes and refs problem and used a more scalable approach, like an HTTP ingestion endpoint which writes data to key-value stores or relational databases. In other words, use of a version control tool for servicing integration requests at scale is a self-imposed footgun and could be avoided.

Conclusion

Congratulations on making it through my brain dump! As massive as this wall of text is, there are still plenty of topics I could have covered but didn't. This includes the more specific topic of code review and the various features it entails. I also largely ignored some general topics, like the value an integration request provides across the overall development lifecycle: integration requests are more than just code review - they serve as a nexus to track the evolution of a change over time.

Hopefully this post gave you some idea of the structural issues at play with pull requests and integration requests. And if you are someone in a position to design or implement a better integration request or tooling around them (including in version control tools themselves), hopefully it gave you some good ideas on where to go next.


High-level Problems with Git and How to Fix Them

December 11, 2017 at 10:30 AM | categories: Git, Mercurial, Mozilla

I have a... complicated relationship with Git.

When Git first came onto the scene in the mid 2000's, I was initially skeptical because of its horrible user interface. But once I learned it, I appreciated its speed and features - especially the ease with which you could create feature branches, merge, and even create commits offline (which was a big deal in the era when Subversion was the dominant version control tool in open source and you needed to speak with a server in order to commit code). When I started using Git day-to-day, it was such an obvious improvement over what I was using before (mainly Subversion and even CVS).

When I started working for Mozilla in 2011, I was exposed to the Mercurial version control tool, which then - and still today - hosts the canonical repository for Firefox. I didn't like Mercurial initially. Actually, I despised it. I thought it was slow and its features lacking. And I frequently encountered repository corruption.

My first experience learning the internals of both Git and Mercurial came when I found myself hacking on hg-git - a tool that allows you to convert Git and Mercurial repositories to/from each other. I was hacking on hg-git so I could improve the performance of converting Mercurial repositories to Git repositories. And I was doing that because I wanted to use Git - not Mercurial - to hack on Firefox. I was trying to enable an unofficial Git mirror of the Firefox repository to synchronize faster so it would be more usable. The ulterior motive was to demonstrate that Git is a superior version control tool and that Firefox should switch its canonical version control tool from Mercurial to Git.

In what is a textbook definition of irony, what happened instead was I actually learned how Mercurial worked, interacted with the Mercurial Community, realized that Mozilla's documentation and developer practices were... lacking, and that Mercurial was actually a much, much more pleasant tool to use than Git. It's an old post, but I summarized my conversion four and a half years ago. This started a chain of events that somehow resulted in me contributing a ton of patches to Mercurial, taking stewardship of hg.mozilla.org, and becoming a member of the Mercurial Steering Committee - the governance group for the Mercurial Project.

I've been an advocate of Mercurial over the years. Some would probably say I'm a Mercurial fanboy. I reject that characterization because fanboy has connotations that imply I'm ignorant of realities. I'm well aware of Mercurial's faults and weaknesses. I'm well aware of Mercurial's relative lack of popularity. I'm well aware that this lack of popularity almost certainly turns away contributors to Firefox and other Mozilla projects because people don't want to have to learn a new tool. I'm well aware that there are changes underway to enable Git to scale to very large repositories and that these changes could threaten Mercurial's scalability advantages over Git, making choices to use Mercurial even harder to defend. (As an aside, the party most responsible for pushing Git to adopt architectural changes to enable it to scale these days is Microsoft. Could anyone have foreseen that?!)

I've achieved mastery in both Git and Mercurial. I know their internals and their command line interfaces extremely well. I understand the architecture and principles upon which both are built. I'm also exposed to some very experienced and knowledgeable people in the Mercurial Community. People who have been around version control for much, much longer than me and have knowledge of random version control tools you've probably never heard of. This knowledge and exposure allows me to make connections and see opportunities for version control that quite frankly most do not.

In this post, I'll be talking about some high-level, high-impact problems with Git and possible solutions for them. My primary goal with this post is to foster positive change in Git and the services around it. While I personally prefer Mercurial, improving Git is good for everyone. Put another way, I want my knowledge and perspective from being part of a version control community to be put to good use wherever it can be.

Speaking of Mercurial, as I said, I'm a heavy contributor and am somewhat influential in the Mercurial Community. I want to be clear that my opinions in this post are my own and I'm not speaking on behalf of the Mercurial Project or the larger Mercurial Community. I also don't intend to claim that Mercurial is holier-than-thou. Mercurial has tons of user interface failings and deficiencies. And I'll even admit to being frustrated that some systemic failings in Mercurial have gone unaddressed for as long as they have. But that's for another post. This post is about Git. Let's get started.

The Staging Area

The staging area is a feature that should not be enabled in the default Git configuration.

Most people see version control as an obstacle standing in the way of accomplishing some other task. They just want to save their progress towards some goal. In other words, they want version control to be a save file feature in their workflow.

Unfortunately, modern version control tools don't work that way. For starters, they require people to specify a commit message every time they save. This in and of itself can be annoying. But we generally accept that as the price you pay for version control: that commit message has value to others (or even your future self). So you must record it.

Most people want the barrier to saving changes to be effortless. A commit message is already too annoying for many users! The Git staging area establishes a higher barrier to saving. Instead of just saving your changes, you must first stage your changes to be saved.

If you requested save in your favorite GUI application, text editor, etc and it popped open a select the changes you would like to save dialog, you would rightly think just save all my changes already, dammit. But this is exactly what Git does with its staging area! Git is saying I know all the changes you made: now tell me which changes you'd like to save. To the average user, this is infuriating because it works in contrast to how the save feature works in almost every other application.

There is a counterargument to be made here. You could say that the editor/application/etc is complex - that it has multiple contexts (files) - that each context is independent - and that the user should have full control over which contexts (files) - and even changes within those contexts - to save. I agree: this is a compelling feature. However, it isn't an appropriate default feature. The ability to pick which changes to save is a power-user feature. Most users just want to save all the changes all the time. So that should be the default behavior. And the Git staging area should be an opt-in feature.

If intrinsic workflow warts aren't enough, the Git staging area has a horrible user interface. It is often referred to as the cache for historical reasons. Cache of course means something to anyone who knows anything about computers or programming. And Git's use of cache doesn't at all align with that common definition. Yet the terminology persists in Git. You have to run commands like git diff --cached to examine the state of the staging area. Huh?!

But Git also refers to the staging area as the index. And this terminology also appears in Git commands! git help commit has numerous references to the index. Let's see what git help glossary has to say:

index
    A collection of files with stat information, whose contents are
    stored as objects. The index is a stored version of your working tree.
    Truth be told, it can also contain a second, and even a third
    version of a working tree, which are used when merging.

index entry
    The information regarding a particular file, stored in the index.
    An index entry can be unmerged, if a merge was started, but not
    yet finished (i.e. if the index contains multiple versions of that
    file).

In terms of end-user documentation, this is a train wreck. It tells the lay user absolutely nothing about what the index actually is. Instead, it casually throws out references to stat information (which requires the user to know what the stat() function call and struct are) and objects (a Git term for a piece of data stored by Git). It even undermines its own credibility with that truth be told sentence. This definition is so bad that it would probably improve user understanding if it were deleted!

Of course, git help index says No manual entry for gitindex. So there is literally no hope for you to get a concise, understandable definition of the index. Instead, it is one of those concepts that you only sort of absorb from interacting with it all the time. Oh, when I git add something it gets into this state where git commit will actually save it.

And even if you know what the Git staging area/index/cached is, it can still confound you. Do you know the interaction between uncommitted changes in the staging area and working directory when you git rebase? What about git checkout? What about the various git reset invocations? I have a confession: I can't remember all the edge cases either. To play it safe, I try to make sure all my outstanding changes are committed before I run something like git rebase because I know that will be safe.

The Git staging area doesn't have to be this complicated. A re-branding away from index to staging area would go a long way. Promoting git diff --staged (which already exists as a synonym for git diff --cached) as the primary spelling and removing references to the cache from common user commands would make a lot of sense and reduce end-user confusion.

Of course, the Git staging area doesn't really need to exist at all! The staging area is essentially a soft commit. It performs the save progress role - the basic requirement of a version control tool. And in some aspects it is actually a better save progress implementation than a commit because it doesn't require you to type a commit message! Because the staging area is a soft commit, all workflows using it can be modeled as if it were a real commit and the staging area didn't exist at all! For example, instead of git add --interactive + git commit, you can run git commit --interactive. Or if you wish to incrementally add new changes to an in-progress commit, you can run git commit --amend or git commit --amend --interactive or git commit --amend --all. If you actually understand the various modes of git reset, you can use those to uncommit. Of course, the user interface to performing these actions in Git today is a bit convoluted. But if the staging area didn't exist, new high-level commands like git amend and git uncommit could certainly be invented.
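To make the staging area is a soft commit equivalence concrete, here are commands in stock Git today that bypass the staging area entirely:

$ git commit -a -m 'save my changes'    # commit all tracked changes without a separate staging step
$ git commit --interactive              # choose hunks at commit time instead of staging them first
$ git commit -a --amend --no-edit       # fold further changes into the in-progress commit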

To the average user, the staging area is a complicated concept. I'm a power user. I understand its purpose and how to harness its power. Yet when I use Mercurial (which doesn't have a staging area), I don't miss the staging area at all. Instead, I've learned that all operations involving the staging area can be modeled as other fundamental primitives (like commit amend) that you are likely to encounter anyway. The staging area therefore constitutes an unnecessary burden and cognitive load on users. While powerful, its complexity and the confusion it incurs do not justify its existence in the default Git configuration. The staging area is a power-user feature and should be opt-in by default.

Branches and Remotes Management is Complex and Time-Consuming

When I first used Git (coming from CVS and Subversion), I thought branches and remotes were incredible because they enabled new workflows that allowed you to easily track multiple lines of work across many repositories. And ~10 years later, I still believe the workflows they enable are important. However, having amassed a broader perspective, I also believe their implementation is poor and this unnecessarily confuses many users and wastes the time of all users.

My initial zen moment with Git - the time when Git finally clicked for me - was when I understood Git's object model: that Git is just a content-indexed key-value store consisting of different object types (blobs, trees, and commits) that have a particular relationship with each other. Refs are symbolic names pointing to Git commit objects. And Git branches - both local and remote - are just refs having a well-defined naming convention (refs/heads/<name> for local branches and refs/remotes/<remote>/<name> for remote branches). Even tags and notes are defined via refs.
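You can see this mapping directly - every branch, remote-tracking branch, and tag is just a ref pointing at an object:

$ git for-each-ref --format='%(refname) -> %(objectname:short)' refs/heads refs/remotes refs/tags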

Refs are a necessary primitive in Git because the Git storage model is to throw all objects into a single, key-value namespace. Since the store is content indexed and the key name is a cryptographic hash of the object's content (which for all intents and purposes is random gibberish to end-users), the Git store by itself is unable to locate objects. If all you had was the key-value store and you wanted to find all commits, you would need to walk every object in the store and read it to see if it is a commit object. You'd then need to buffer metadata about those objects in memory so you could reassemble them into say a DAG to facilitate looking at commit history. This approach obviously doesn't scale. Refs short-circuit this process by providing pointers to objects of importance. It may help to think of the set of refs as an index into the Git store.

Refs also serve another role: as guards against garbage collection. I won't go into details about loose objects and packfiles, but it's worth noting that Git's key-value store also behaves in ways similar to a generational garbage collector like you would find in programming languages such as Java and Python. The important thing to know is that Git will garbage collect (read: delete) objects that are unused. And the mechanism it uses to determine which objects are unused is to iterate through refs and walk all transitive references from that initial pointer. If there is an object in the store that can't be traced back to a ref, it is unreachable and can be deleted.

Reflogs maintain the history of a value for a ref: for each ref they contain a log of what commit it was pointing to, when that pointer was established, who established it, etc. Reflogs serve two purposes: facilitating undoing a previous action and holding a reference to old data to prevent it from being garbage collected. The two use cases are related: if you don't care about undo, you don't need the old reference to prevent garbage collection.
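To see this in action, git reflog master lists every commit the local master ref has pointed at (most recent first), and git reflog HEAD does the same for HEAD itself:

$ git reflog master
$ git reflog HEAD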

This design of Git's store is actually quite sensible. It's not perfect (nothing is). But it is a solid foundation to build a version control tool (or even other data storage applications) on top of.

The title of this section has to do with sub-optimal branches and remotes management. But I've hardly said anything about branches or remotes! And this leads me to my main complaint about Git's branches and remotes: that they are a very thin veneer over refs. The properties of Git's underlying key-value store unnecessarily bleed into user-facing concepts (like branches and remotes) and therefore dictate sub-optimal practices. This is what's referred to as a leaky abstraction.

I'll give some examples.

As I stated above, many users treat version control as a save file step in their workflow. I believe that any step that interferes with users saving their work is user hostile. This even includes writing a commit message! I already argued that the staging area significantly interferes with this critical task. Git branches do as well.

If we were designing a version control tool from scratch (or if you were a new user to version control), you would probably think that a sane feature/requirement would be to update to any revision and start making changes. In Git speak, this would be something like git checkout b201e96f, make some file changes, git commit. I think that's a pretty basic workflow requirement for a version control tool. And the workflow I suggested is pretty intuitive: choose the thing to start working on, make some changes, then save those changes.

Let's see what happens when we actually do this:

$ git checkout b201e96f
Note: checking out 'b201e96f'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at b201e96f94... Merge branch 'rs/config-write-section-fix' into maint

$ echo 'my change' >> README.md
$ git commit -a -m 'my change'
[detached HEAD aeb0c997ff] my change
 1 file changed, 1 insertion(+)

$ git push indygreg
fatal: You are not currently on a branch.
To push the history leading to the current (detached HEAD)
state now, use

    git push indygreg HEAD:<name-of-remote-branch>

$ git checkout master
Warning: you are leaving 1 commit behind, not connected to
any of your branches:

  aeb0c997ff my change

If you want to keep it by creating a new branch, this may be a good time
to do so with:

 git branch <new-branch-name> aeb0c997ff

Switched to branch 'master'
Your branch is up to date with 'origin/master'.

I know what all these messages mean because I've mastered Git. But if you were a newcomer (or even a seasoned user), you might be very confused. Just so we're on the same page, here is what's happening (along with some commentary).

When I run git checkout b201e96f, Git is trying to tell me that I'm potentially doing something that could result in the loss of my data. A golden rule of version control tools is don't lose the user's data. When I run git checkout, Git should be stating the risk of data loss very clearly. But instead, the If you want to create a new branch sentence is hiding this fact by instead phrasing things around retaining commits you create rather than the possible loss of data. It's up to the user to make the connection that retaining commits you create actually means don't eat my data. Preventing data loss is critical and Git should not mince words here!

The git commit seems to work like normal. However, since we're in a detached HEAD state (a phrase that is likely gibberish to most users), that commit isn't referred to by any ref, so it can be lost easily. Git should be telling me that I just committed something it may not be able to find in the future. But it doesn't. Again, Git isn't being as protective of my data as it needs to be.

The failure in the git push command is essentially telling me I need to give things a name in order to push. Pushing is effectively remote save. And I'm going to apply my reasoning about version control tools not interfering with save to pushing as well: Git is adding an extra barrier to remote save by refusing to push commits without a branch attached and by doing so is being user hostile.

Finally, we git checkout master to move to another commit. Here, Git is actually doing something halfway reasonable. It is telling me I'm leaving commits behind, which commits those are, and the command to use to keep those commits. The warning is good but not great. I think it needs to be stronger to reflect the risk of data loss if that suggested git command isn't executed. (Of course, the reflog for HEAD will ensure that data isn't immediately deleted. But users shouldn't need to involve reflogs to not lose data that wasn't rewritten.)

The point I want to make is that Git doesn't allow you to just update and save. Because its dumb store requires pointers to relevant commits (refs) and because that requirement isn't abstracted away or paved over by user-friendly features in the frontend, Git is effectively requiring end-users to define names (branches) for all commits. If you fail to define a name, it gets a lot harder to find your commits, exchange them, and Git may delete your data. While it is technically possible to not create branches, the version control tool is essentially unusable without them.

When local branches are exchanged, they appear as remote branches to others. Essentially, you give each instance of the repository a name (the remote). And branches/refs fetched from a named remote appear as a ref in the ref namespace for that remote. e.g. refs/remotes/origin holds refs for the origin remote. (Git allows you to not have to specify the refs/remotes part, so you can refer to e.g. refs/remotes/origin/master as origin/master.)

Again, if you were designing a version control tool from scratch or you were a new Git user, you'd probably think remote refs would make good starting points for work. For example, if you know you should be saving new work on top of the master branch, you might be inclined to begin that work by running git checkout origin/master. But like our specific-commit checkout above:

$ git checkout origin/master
Note: checking out 'origin/master'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at 95ec6b1b33... RelNotes: the eighth batch

This is the same message we got for a direct checkout. But we did supply a ref/remote branch name. What gives? Essentially, Git tries to enforce that the refs/remotes/ namespace is read-only and only updated by operations that exchange data with a remote, namely git fetch, git pull, and git push.

For this to work correctly, you need to create a new local branch (which initially points to the commit that refs/remotes/origin/master points to) and then switch/activate that local branch.
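In practice, that means something like the following, where my-work is whatever local branch name you choose:

$ git checkout -b my-work origin/master    # create and switch to a local branch starting at origin/master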

I could go on talking about all the subtle nuances of how Git branches are managed. But I won't.

If you've used Git, you know you need to use branches. You may or may not recognize just how frequently you have to type a branch name into a git command. I guarantee that if you are familiar with version control tools and workflows that aren't based on having to manage refs to track data, you will find Git's forced usage of refs and branches a bit absurd. I half jokingly refer to Git as Game of Refs. I say that because coming from Mercurial (which doesn't require you to name things), Git workflows feel to me like all I'm doing is typing the names of branches and refs into git commands. I feel like I'm wasting my precious time telling Git the names of things only because this is necessary to placate the leaky abstraction of Git's storage layer which requires references to relevant commits.

Git and version control don't have to be this way.

As I said, my Mercurial workflow doesn't rely on naming things. Unlike Git, Mercurial's store has an explicit (not shared) storage location for commits (changesets in Mercurial parlance). And this data structure is ordered, meaning a changeset always appears after its parent/predecessor. This means that Mercurial can open a single file/index to quickly find all changesets. Because Mercurial doesn't need pointers to commits of relevance, names aren't required.

My Zen of Mercurial moment came when I realized you didn't have to name things in Mercurial. Having used Git before Mercurial, I was conditioned to always be naming things. This is the Git way after all. And, truth be told, it is common to name things in Mercurial as well. Mercurial's named branches were the way to do feature branches in Mercurial for years. Some used the MQ extension (essentially a port of quilt), which also requires naming individual patches. Git users coming to Mercurial were missing Git branches and Mercurial's bookmarks were a poor port of Git branches.

But recently, more and more Mercurial users have been coming to the realization that names aren't really necessary. If the tool doesn't actually require naming things, why force users to name things? As long as users can find the commits they need to find, do you actually need names?

As a demonstration, my Mercurial workflow leans heavily on the hg show work and hg show stack commands. You will need to enable the show extension by putting the following in your hgrc config file to use them:

[extensions]
show =

Running hg show work (I have also set the config commands.show.aliasprefix=s to enable me to type hg swork) finds all in-progress changesets and other likely-relevant changesets (those with names and DAG heads). It prints a concise DAG of those changesets:

hg show work output

And hg show stack shows just the current line of work and its relationship to other important heads:

hg show stack output

Aside from the @ bookmark/name set on that top-most changeset, there are no names! (That @ comes from the remote repository, which has set that name.)

Outside of code archeology workflows, hg show work shows the changesets I care about 95% of the time. With all I care about (my in-progress work and possible rebase targets) rendered concisely, I don't have to name things because I can just find whatever I'm looking for by running hg show work! Yes, you need to run hg show work, visually scan for what you are looking for, and copy a (random) hash fragment into a number of commands. This sounds like a lot of work. But I believe it is far less work than naming things. Only when you practice this workflow do you realize just how much time you actually spend finding and then typing names into hg and - especially - git commands! The ability to just hg update to a changeset and commit without having to name things is just so liberating. It feels like my version control tool is putting up fewer barriers and letting me work quickly.

Another benefit of hg show work and hg show stack is that they present a concise DAG visualization to users. This helps educate users about the underlying shape of repository data. When you see connected nodes on a graph and how they change over time, it makes it a lot easier to understand concepts like merge and rebase.

This nameless workflow may sound radical. But that's because we're all conditioned to naming things. I initially thought it was crazy as well. But once you have a mechanism that gives you rapid access to data you care about (hg show work in Mercurial's case), names become very optional. Now, a pure nameless workflow isn't without its limitations. You want names to identify the main targets for work (e.g. the master branch). And when you exchange work with others, names are easier to work with, especially since names survive rewriting. But in my experience, most of my commits are only exchanged with me (synchronizing my in-progress commits across devices) and with code review tools (which don't really need names and can operate against raw commits). My most frequent use of names comes when I'm in repository maintainer mode and I need to ensure commits have names for others to reference.

Could Git support nameless workflows? In theory it can.

Git needs refs to find relevant commits in its store. And the wire protocol uses refs to exchange data. So refs have to exist for Git to function (assuming Git doesn't radically change its storage and exchange mechanisms to mitigate the need for refs, but that would be a massive change and I don't see this happening).

While there is a fundamental requirement for refs to exist, this doesn't necessarily mean that user-facing names must exist. The reason that we need branches today is because branches are little more than a ref with special behavior. It is theoretically possible to invent a mechanism that transparently maps nameless commits onto refs. For example, you could create a refs/nameless/ namespace that was automatically populated with DAG heads that didn't have names attached. And Git could exchange these refs just like it can branches today. It would be a lot of work to think through all the implications and to design and implement support for nameless development in Git. But I think it is possible.
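As a sketch of the idea (not a proposal for the actual mechanism), a post-commit hook could mirror otherwise-unnamed commits into such a namespace using nothing but existing plumbing; the refs/nameless/ namespace here is hypothetical:

#!/bin/sh
# .git/hooks/post-commit - sketch only
# If HEAD is detached (no branch name), record the new commit under
# refs/nameless/ so it stays reachable without the user naming anything.
if ! git symbolic-ref -q HEAD >/dev/null; then
    commit=$(git rev-parse HEAD)
    git update-ref "refs/nameless/$commit" "$commit"
fi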

I encourage the Git community to investigate supporting nameless workflows. Having adopted this workflow in Mercurial, Git's workflow around naming branches feels heavyweight and restrictive to me. Put another way, nameless commits are actually lighter-weight branches than Git branches! To the common user who just wants version control to be a save feature, requiring names establishes a barrier towards that goal. So removing the naming requirement would make Git simpler and more approachable to new users.

Forks aren't the Model You are Looking For

This section is more about hosted Git services (like GitHub, Bitbucket, and GitLab) than Git itself. But since hosted Git services are synonymous with Git and interaction with a hosted Git services is a regular part of a common Git user's workflow, I feel like I need to cover it. (For what it's worth, my experience at Mozilla tells me that a large percentage of people who say I prefer Git or we should use Git actually mean I like GitHub. Git and GitHub/Bitbucket/GitLab are effectively the same thing in the minds of many and anyone finding themselves discussing version control needs to keep this in mind because Git is more than just the command line tool: it is an ecosystem.)

I'll come right out and say it: I think forks are a relatively poor model for collaborating. They are light years better than what existed before. But they are still so far from the turn-key experience that should be possible. The fork hasn't really changed much since the current implementation of it was made popular by GitHub many years ago. And I view this as a general failure of hosted services to innovate.

So we have a shared understanding, a fork (as implemented on GitHub, Bitbucket, GitLab, etc) is essentially a complete copy of a repository (a git clone if using Git) and a fresh workspace for additional value-added services the hosting provider offers (pull requests, issues, wikis, project tracking, release tracking, etc). If you open the main web page for a fork on these services, it looks just like the main project's. You know it is a fork because there are cosmetics somewhere (typically next to the project/repository name) saying forked from.

Before service providers adopted the fork terminology, fork was used in open source to refer to a splintering of a project. If someone or a group of people didn't like the direction a project was taking, wanted to take over ownership of a project because of stagnation, etc, they would fork it. The fork was based on the original (and there may even be active collaboration between the fork and original), but the intent of the fork was to create distance between the original project and its new incarnation. A new entity that was sufficiently independent of the original.

Forks on service providers mostly retain this old school fork model. The fork gets a new copy of issues, wikis, etc. And anyone who forks establishes what looks like an independent incarnation of a project. It's worth noting that the execution varies by service provider. For example, GitHub won't enable Issues for a fork by default, thereby encouraging people to file issues against the upstream project it was forked from. (This is good default behavior.)

And I know why service providers (initially) implemented things this way: it was easy. If you are building a product, it's simpler to just say a user's version of this project is a git clone and they get a fresh database. On a technical level, this meets the traditional definition of fork. And rather than introduce a new term into the vernacular, they just re-purposed fork (albeit with softer connotations, since the traditional fork commonly implied there was some form of strife precipitating a fork).

To help differentiate flavors of forks, I'm going to define the terms soft fork and hard fork. A soft fork is a fork that exists for purposes of collaboration. The differentiating feature between a soft fork and hard fork is whether the fork is intended to be used as its own project. If it is, it is a hard fork. If not - if all changes are intended to be merged into the upstream project and be consumed from there - it is a soft fork.

I don't have concrete numbers, but I'm willing to wager that the vast majority of forks on Git service providers which have changes are soft forks rather than hard forks. In other words, these forks exist purely as a conduit to collaborate with the canonical/upstream project (or to facilitate a short-lived one-off change).

The current implementation of fork - which borrows a lot from its predecessor of the same name - is a good - but not great - way to facilitate collaboration. It isn't great because it technically resembles what you'd expect to see for hard fork use cases even though it is used predominantly with soft forks. This mismatch creates problems.

If you were to take a step back and invent your own version control hosted service and weren't tainted by exposure to existing services and were willing to think a bit beyond making it a glorified frontend for the git command line interface, you might realize that the problem you are solving - the product you are selling - is collaboration as a service, not a Git hosting service. And if your product is collaboration, then implementing your collaboration model around the hard fork model with strong barriers between the original project and its forks is counterproductive and undermines your own product. But this is how GitHub, Bitbucket, GitLab, and others have implemented their product!

To improve collaboration on hosted version control services, the concept of a fork needs to be significantly curtailed. Replacing it should be a UI and workflow that revolve around the central, canonical repository.

You shouldn't need to create your own clone or fork of a repository in order to contribute. Instead, you should be able to clone the canonical repository. When you create commits, those commits should be stored in - or at least more tightly affiliated with - the original project, not inside a fork.

One potential implementation is doable today. I'm going to call it workspaces. Here's how it would work.

There would exist a namespace for refs that can be controlled by the user. For example, on GitHub (where my username is indygreg), if I wanted to contribute to some random project, I would git push my refs somewhere under refs/users/indygreg/ directly to that project's repository. No forking necessary. If I wanted to contribute to a project, I would just clone its repo then push to my workspace under it. You could do this today by configuring your Git refspec properly. For pushes, it would look something like refs/heads/*:refs/users/indygreg/* (that tells Git to map local refs under refs/heads/ to refs/users/indygreg/ on that remote repository). If this became a popular feature, presumably the Git wire protocol could be taught to advertise it such that Git clients automatically configured themselves to push to user-specific workspaces attached to the original repository.
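
For the curious, here is a minimal sketch of that client-side configuration as it could be done today, assuming a hypothetical server that accepts pushes into a refs/users/ namespace (the project URL is made up; indygreg is just my username):

    # Clone the canonical repository (URL is hypothetical).
    git clone https://github.com/example/project.git
    cd project

    # Map local branches into a per-user workspace namespace on push.
    # A plain `git push origin` now sends refs/heads/<branch> to
    # refs/users/indygreg/<branch> on the upstream repository.
    git config remote.origin.push 'refs/heads/*:refs/users/indygreg/*'

    git checkout -b my-feature
    git push origin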

There are several advantages to such a workspace model. Many of them revolve around eliminating forks.

At initial contribution time, no server-side fork is necessary in order to contribute. You would be able to clone and contribute without waiting for or configuring a fork. Or if you can create commits from the web interface, the clone wouldn't even be necessary! Lowering the barrier to contribution is a good thing, especially if collaboration is the product you are selling.

In the web UI, workspaces would also revolve around the source project and not be off in their own world like forks are today. People could more easily see what others are up to. And fetching their work would require nothing more than typing in their username, as opposed to configuring a whole new remote. This would bring communities closer together and hopefully lead to better collaboration.
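
Fetching a collaborator's in-progress work could then look something like this (again just a sketch; the refs/users/ layout and the username are hypothetical):

    # Pull down everything someuser has pushed to their workspace and
    # map it under a local remote-tracking namespace.
    git fetch origin 'refs/users/someuser/*:refs/remotes/origin/users/someuser/*'

    # Inspect or build on top of their branch.
    git log origin/users/someuser/my-feature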

Not requiring forks also eliminates the need to synchronize your fork with the upstream repository. I don't know about you, but one of the things that bothers me about the Game of Refs that Git imposes is that I have to keep my refs in sync with the upstream refs. When I fetch from origin and pull down a new master branch, I need to git merge that branch into my local master branch. Then I need to push that new master branch to my fork. This is quite tedious. And it is easy to merge the wrong branches and get your branch state out of whack. There are better ways to map remote refs into your local names to make this far less confusing.
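
One partial workaround available today is to skip keeping a local master branch entirely and work directly against the remote-tracking ref (a sketch of a workaround with made-up branch names, not a fix for the underlying model):

    # Refresh remote-tracking refs without touching any local branch.
    git fetch origin

    # Rebase the feature branch directly onto the upstream master; there
    # is no local master to merge or to push back to a fork.
    git rebase origin/master my-feature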

Another win here is not having to push and store data multiple times. When working on a fork (which is a separate repository), after you git fetch changes from upstream, you eventually need to git push those changes into your fork. If you've ever worked on a large repository without a fast Internet connection, you may have been stymied by having to git push large amounts of data to your fork. Wouldn't it be nice if that git push only pushed the data that was truly new and didn't already exist somewhere else on the server? A workspace model where all development occurs in the original repository would fix this. As a bonus, it would make the storage problem on servers easier: you would eliminate thousands of forks, and data duplication across repos/clones becomes much less of a concern because all data lives alongside or in the original repository instead of in a fork.

Another win from workspace-centric development would be the potential to do more user-friendly things after pull/merge requests are incorporated in the official project. For example, the ref in your workspace could be deleted automatically. This would ease the burden on users to clean up after their submissions are accepted. Again, instead of mashing keys to play the Game of Refs, this would all be taken care of for you automatically. (Yes, I know there are scripts and shell aliases to make this more turn-key. But user-friendly behavior shouldn't have to be opt-in: it should be the default.)
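
(For comparison, the manual cleanup users perform today after a merged pull request looks roughly like this; the branch name is made up, and origin is assumed to point at your fork:)

    # Delete the merged branch locally and on the remote fork.
    git branch -d my-feature
    git push origin --delete my-feature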

But workspaces aren't all rainbows and unicorns. There are access control concerns. You probably don't want users able to mutate the workspaces of other users. Or do you? You can make a compelling case that project administrators should have that ability. And what if someone pushes bad or illegal content to a workspace and you receive a cease and desist? Can you take down just the offending workspace while complying with the order? And what happens if the original project is deleted? Do all its workspaces die with it? These are not trivial concerns. But they don't feel impossible to tackle either.

Workspaces are only one potential alternative to forks. And I can come up with multiple implementations of the workspace concept, although many of them are constrained by the current capabilities of the Git wire protocol. But Git is (finally) getting a more extensible wire protocol, so hopefully this will enable nice things.

I challenge Git service providers like GitHub, Bitbucket, and GitLab to think outside the box and implement something better than how forks are implemented today. It will be a large shift. But I think users will appreciate it in the long run.

Conclusion

Git is a ubiquitous version control tool. But it is frequently lampooned for its poor usability and documentation. We even have research papers telling us which parts are bad. Nobody I know has had a pleasant initial experience with Git. And it is clear that few people actually understand Git: most just know the command incantations they need to accomplish a small set of common activities. (If you are such a person, there is nothing to be ashamed about: Git is a hard tool.)

Popular Git-based hosting and collaboration services (such as GitHub, Bitbucket, and GitLab) exist. While they've made strides to make it easier to commit data to a Git repository (I purposefully avoid saying use Git because the most usable tools seem to avoid the git command line interface as much as possible), they are often a thin veneer over Git itself (see forks). And Git is a thin veneer over a content-addressed key-value store (see forced usage of bookmarks).

As an industry, we should be concerned about the lousy usability of Git and the tools and services that surround it. Some may say that Git - with its near monopoly over version control mindshare - is a success. I have a different view: I think it is a failure that a tool with a user experience this bad has achieved the success it has.

The cost of Git's poor usability can be measured in tens if not hundreds of millions of dollars of time people have wasted because they couldn't figure out how to use Git. Git should be viewed as a source of embarrassment, not a success story.

What's really concerning is that the usability problems of Git have been known for years. Yet it is as popular as ever and there have been few substantial usability improvements. We do have some alternative frontends floating around. But these haven't caught on.

I'm at a loss to understand how an open source tool as popular as Git has remained so mediocre for so long. The source code is out there. Anybody can submit a patch to fix it. Why is it that so many people get tripped up by the same poor usability issues years after Git became the common version control tool? It certainly appears that as an industry we have been unable or unwilling to address systemic deficiencies in a critical tool. Why this is, I'm not sure.

Despite my pessimism about Git's usability and its poor track record of being attentive to the needs of people who aren't power users, I'm optimistic that the future will be brighter. While the ~7000 words in this post pale in comparison to the aggregate word count that has been written about Git, hopefully this post strikes a nerve and causes positive change. Just because one generation has toiled with the usability problems of Git doesn't mean the next generation has to suffer through the same. Git can be improved and I encourage that change to happen. The three issues above and their possible solutions would be a good place to start.


Notes from Git Merge 2015

May 12, 2015 at 03:40 PM | categories: Git, Mercurial, Mozilla

Git Merge 2015 was a Git user conference held in Paris on April 8 and 9, 2015.

I'm kind of a version control geek. I'm also responsible for a large part of Mozilla's version control hosting. So, when the videos were made public, you can bet I took interest.

This post contains my notes from a few of the Git Merge talks. I try to separate content/facts from my opinions by isolating my opinions (within parentheses).

Git at Google

Git at Google: Making Big Projects (and everyone else) Happy is from a Googler (Dave Borowitz) who works on JGit for the Git infrastructure team at Google.

"Everybody in this room is going to feel some kind of pain working with Git at scale at some time in their career."

First Git usage at Google in 2008 for Android. 2011 googlesource.com launches.

24,000 total Git repos at Google. 77.1M requests/day. 30-40 TB/day. 2-3 Gbps.

Largest repo is 210GB (not public it appears).

800 repos in AOSP. Google maintains internal fork of all Android repos (so they can throw stuff over the wall). Fresh AOSP tree is 17 GiB. Lots of contracts dictating access.

Chrome repos include Chromium, Blink, Chromium OS. Performed giant Subversion migration. Developers of Chrome very set in their ways. Had many workflows integrated with Subversion web interface. Subversion blame was fast, Git blame slow. Built caching backend for Git blame to make developers happy.

Chromium 2.9 GiB, 3.6M objects, 390k commits. Blink 5.3 GiB, 3.1M objects, 177k commits. They merged them into a monorepo. Mention of Facebook's monorepo talk and Mercurial scaling efforts for a repo larger than the Chromium/Blink monorepo. Benefits to developers for doing atomic refactorings, etc.

"Being big is hard."

AOSP: 1 Gbps -> 2 minutes for 17 GiB. 20 Mbps -> 3 hours. Flaky internet combined with non-resumable clone results in badness. Delta resolution can take a while. Checkout of hundreds of thousands of files can be slow, especially on Windows.

"As tool developers... it's hard to tell people don't check in large binaries, do things this way, ... when all they are trying to do is get their job done." (I couldn't agree more: tools should ideally not impose sub-optimal workflows.)

They prefer scaling pain to supporting multiple tools. (I think this meant don't use multiple VCSs if you can just make one scale.)

Shallow clone beneficial. But some commands don't work. log not very useful.
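
(For reference, a shallow clone is just a clone with truncated history; a minimal example with a made-up URL:)

    # Fetch only the most recent commit instead of the full history.
    git clone --depth 1 https://example.googlesource.com/some/repo

    # History-walking commands such as `git log` only see the truncated
    # history, which is why they are less useful in shallow clones.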

Narrow clones mentioned. Apparently talked about elsewhere at Git Merge not captured on video. Non-trivial problem for Git. "We have no idea when this is going to happen."

Split repos until narrow clone is available. Google wrote repo to manage multiple repos. They view repo and multiple repos as stop-gap until narrow clone is implemented.

git submodule needs some love. Git commands don't handle submodules or multiple repos very well. They'd like to see repo features incorporated into git submodule.
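
(For context, the stock submodule workflow they are contrasting with repo looks roughly like this; the URL and path are made up:)

    # Record a nested repository as a submodule pointer.
    git submodule add https://example.com/lib.git third_party/lib

    # Fresh clones need an extra step to populate submodules; most git
    # commands otherwise treat the submodule as an opaque pointer.
    git submodule update --init --recursive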

Transition to server operation.

Pre-2.0, counting objects was hard. For Linux kernel, 60s 100% CPU time per clone to count objects. "Linux isn't even that big."

Traditionally Git is single homed. Load from users. Load from automation.

Told anecdote about how Google's automation once recloned the repo after a failed Git command. Accidentally made a change one day that caused a command to persistently fail. DoS against server. (We've had this at Mozilla.)

Garbage collection on server is CPU intensive and necessary. Takes cores away from clients.

Reachability bitmaps implemented in JGit, ported to Git 2.0. Counting objects for Linux clones went from 60s CPU to ~100ms.
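
(Server operators can enable the feature themselves with something like the following; how much it helps depends on the shape of the repository:)

    # On the server repository: write reachability bitmaps during repacks.
    git config repack.writeBitmaps true
    git repack -a -d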

Google puts static, pre-generated bundles on a CDN. Client downloads bundle then does incremental fetch. Massive reduction in server load. Bundle files better for users. Resumable.
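
(The underlying mechanism is plain git bundle; a minimal sketch of producing and consuming one, with a made-up URL:)

    # On the server: snapshot all refs into a single, CDN-friendly file.
    git bundle create repo.bundle --all

    # On the client: clone from the bundle, then point origin at the live
    # server and do a cheap incremental fetch for anything newer.
    git clone repo.bundle repo
    cd repo
    git remote set-url origin https://example.googlesource.com/repo
    git fetch origin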

They have ideas for integrating bundles into git fetch, but it's "a little ways off." (This feature is partially implemented in Mercurial 3.4 and we have plans for using it at Mozilla.) It's a feature in repo today.

Shared filesystem would be really nice to spread CPU load. NFS "works." Performance problems with high throughput repositories.

Master-mirror replication can help. Problems with replication lag. Consistency is hard.

Google uses a distributed Git store using Bigtable and GFS built on JGit. Git-aware load balancer. Completely separate pool of garbage collection workers. They do replication to multiple datacenters before pushes. 6 datacenters across world. Some of their stuff is open source. A lot isn't.

Humans have difficulty managing hundreds of repositories. "How do you as a human know what you need to modify?" Monorepos have this problem too. Inherent with lots of content. (Seemed to imply it is worse with multiple repos than with a monorepo.)

Porting changes between forks is hard. e.g. cherry picking between internal and external Android repos.

ACLs are a mess.

Google built Gerrit code review. It does ACLs, auto rebasing, release branch management. It's a swiss army knife. (This aligns with my vision for MozReview and code-centric development.)

Scaling Git at Twitter

Wilhelm Bierbaum from Twitter talks about Scaling Git at Twitter.

"We've decided it's really important to make Twitter a good place to work for developers. Source control is one of those points where we were lacking. We had some performance problems with Git in the past."

Twitter runs a monorepo. Used to be 3 repos. "Working with a single repository is the way they prefer to develop software when developing hundreds of services." They also have a single build system. They have a geo diverse workforce.

They use normal canonical implementation of Git + some optimizations.

Benefits of a monorepo:

Visibility. Easier for people to find code in one repo. Code search tools tailored towards single repos.

Single toolchain: a single set of tools to build, test, and deploy. When improvements to the tools are made, everyone benefits because there is one toolchain.

Easy to share code (particularly generated code like IDL). When operating many small services, services developed together. Code duplication is minimized. Twitter relies on IDL heavily.

Simpler to predict the impact of your changes. Easier to look at a single code base than to understand how multiple code bases interact. Easy to make a change and see what breaks, rather than submit changes to N repos and do testing in N repos.

Makes refactoring less arduous.

Surfaces architecture issues earlier.

Breaking changes easier to coordinate

Drawbacks of monorepos:

Large disk footprint for full history.

Tuning filesystem only goes so far.

Forces some organizations to adopt sparse checkouts and shallow clones (see the sketch after this list).

Submodules aren't good enough to use. add and commit don't recognize submodule boundaries very well and aren't very usable.

"To us, using a tool such as repo that is in effect a secondary version control tool on top of Git does not feel right and doesn't lead to a fluid experience."

Twitter has centralized use of Git. Don't really benefit from distributed version control system. Feature branches. Goal is to live as close to master as possible. Long-running branches discouraged. Fewer conflicts to resolve.

They have project-level ownership system. But any developer can change anything.

They have lots of read-only replicas. Highly available writable server.

They use reference repos heavily so object exchange overhead is manageable.
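
(Reference repositories are a stock Git feature; a minimal sketch using a made-up mirror path and URL:)

    # Keep a local mirror of the big repository on each machine or CI host.
    git clone --mirror https://git.example.com/monorepo.git /opt/mirrors/monorepo.git

    # New clones borrow objects from the local mirror instead of
    # transferring them all over the network again.
    git clone --reference /opt/mirrors/monorepo.git \
        https://git.example.com/monorepo.git work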

Scaling issues with many refs. Partially due to how refs are stored on disk. File locking limits in OS. Commands like status, add, and commit can be slow, even with repo garbage collected and packed. Locking issues with garbage collection.

Incorporated file alteration monitor to make status faster. Reference to Facebook's work on watchman and its Mercurial integration. Significant impact on OS X. "Pretty much all our developers use OS X." (I assume they are using Watchman with Git - I've seen patches for this on the Git mailing list but they haven't been merged into core yet.)

They implemented a custom index format. Adopted faster hashing algorithm that uses native instructions.

Discovery with many refs didn't scale. 13 MB of raw data for refs exchange at Twitter. (!!) Experimenting with clients sending a bloom filter of refs. Hacked it together via HTTP header.

Fetch protocol is expensive. Lots of random I/O. Can take minutes to do incremental fetches. Bitmap indices help, but aren't good enough for them. Since they have a central and well-defined environment, they changed the fetch protocol to work like a journal: send all changed data since the client's last fetch. Server work is essentially a sendfile system call. git push appends push packs to a log-structured journal. On fetch, clients "replay" the transactions from the journal. Similar to MySQL binary log replication. (This is very similar to some of the Mercurial replication work I'm doing at Mozilla. Yay technical validation.) (Append-only files are also how Mercurial's storage model works by default.)

Log-structured data exchange means server side is cheap. They can insert HTTP caches to handle Range header aware requests.

Without this hack, they can't scale their servers.

Initial clone is seeded by BitTorrent.

It sounds like everyone's unmerged feature branches are on the one central repo and get transferred everywhere by default. Their journal fetch can selectively fetch refs so this isn't a huge problem.

They'd like to experiment with sparse repos. But they haven't gotten to that yet. They'd like a better storage abstraction in Git to enable necessary future scalability. They want a network-aware storage backend. Local objects not necessary if the network has them.

They are running a custom Git distribution/fork on clients. But they don't want to maintain it forever. They prefer to send changes upstream.


Code First and the Rise of the DVCS and GitHub

January 10, 2015 at 12:35 PM | categories: Git, Mozilla

The ascendancy of GitHub has very little to do with its namesake tool, Git.

What GitHub did that was so radical for its time and the strategy that GitHub continues to execute so well on today is the approach of putting code first and enabling change to be a frictionless process.

In case you weren't around for the pre-GitHub days or don't remember, they were not pleasant. Tools around code management were a far cry from where they are today (I still argue the tools are pretty bad, but that's for another post). Centralized version control systems were prevalent (CVS and Subversion in open source, Perforce, ClearCase, Team Foundation Server, and others in the corporate world). Tools for looking at and querying code had horrible, ugly interfaces and came out of a previous era of web design and browser capabilities. It felt like a chore to do anything, including committing code. Yes, the world had awesome services like SourceForge, but they weren't the same as GitHub is today.

Before I get to my central thesis, I want to highlight some supporting reasons for GitHub's success. There were two developments in the second half of the 2000s that contributed to the success of GitHub: the rise of the distributed version control system (DVCS) and the rise of the modern web.

While distributed version control systems like Sun WorkShop TeamWare and BitKeeper existed earlier, it wasn't until the second half of the 2000s that DVCSs took off. You can argue part of the reason for this was open source: my recollection is there wasn't a well-known DVCS available as free software before 2005. Speaking of 2005, it was a big year for DVCS projects: Git, Mercurial, and Bazaar all had initial releases. Suddenly, there were old-but-new ideas on how to do source control being exposed to new and willing-to-experiment audiences. DVCSs were a critical leap from traditional version control because they (theoretically) impose fewer process and workflow limitations on users. With traditional version control, you needed to be online to commit, meaning you were managing patches, not commits, in your local development workflow. There were some forms of branching and merging, but they were a far cry from what is available today and were often too complex for mere mortals to use. As more and more people were exposed to distributed version control, they welcomed its less restrictive and more powerful workflows. They realized that source control tools don't have to be so limiting. Distributed version control also promised all kinds of revamped workflows that could be harnessed. There were potential wins all around.

Around the same time that open source DVCS systems were emerging, web browsers were evolving from an application to render static pages to a platform for running web applications. Web sites using JavaScript to dynamically manipulate web page content (DHTML as it was known back then) were starting to hit their stride. I believe it was GMail that turned the most heads as to the full power of the modern web experience, with its novel-for-its-time extreme reliance on XMLHttpRequest for dynamically changing page content. People were realizing that powerful, desktop-like applications could be built for the web and could run everywhere.

GitHub launched in April 2008 standing on the shoulders of both the emerging interest in the Git content tracking tool and the capabilities of modern browsers.

I wasn't an early user of GitHub. My recollection is that GitHub was mostly a Rubyist's playground back then. I wasn't a Ruby programmer, so I had little reason to use GitHub in the early stages. But people did start using GitHub. And in the spirit of Ruby (on Rails), GitHub moved fast, or at least was projecting the notion that they were. While other services built on top of DVCS tools - like Bitbucket - did exist back then, GitHub seemed to have momentum associated with it. (Look at the archives for GitHub's and Bitbucket's respective blogs. GitHub has hundreds of blog entries; Bitbucket numbers in the dozens.) Developers everywhere up until this point had all been dealing with sub-optimal tools and workflows. Some of us realized it. Others hadn't. Many of those who did saw GitHub as a beacon of hope: we have all these new ideas and new potentials with distributed version control and here is a service under active development trying to figure out how to exploit that. Oh, and it's free for open source. Sign me up!

GitHub did capitalize on a market opportunity. They also capitalized on the value of marketing and the perception that they were moving fast and providing features that people - especially in open source - wanted. This captured the early adopters market. But I think what really set GitHub apart and led to the success they are enjoying today is their code first approach and their desire to make contribution easy, and even fun and sociable.

As developers, our job is to solve problems. We often do that by writing and changing code. And this often involves working as part of a team, or collaborating. To collaborate, we need tools. You eventually need some processes. And as I recently blogged, this can lead to process debt and inefficiencies associated with them.

Before GitHub, the process debt for contributing to other projects was high. You often had to subscribe to mailing lists in order to submit patches as emails. Or, you had to create an account on someone's bug tracker or code review tool before you could send patches. Then you had to figure out how to use these tools and any organization or project-specific extensions and workflows attached to them. It was quite involved and a lot could go wrong. Many projects and organizations (like Mozilla) still practice this traditional methodology. Furthermore (and as I've written before), these traditional, single patch/commit-based tools often aren't effective at ensuring the desired output of high quality software.

Before GitHub solved process debt through the commoditization of knowledge that comes with market dominance, they took another approach: emphasizing code first development.

GitHub is all about the code. You load a project page and you see code. You may think a README with basic project information would be the first thing on a project page. But it isn't. Code, like data, is king.

Collaboration and contribution on GitHub revolve around the pull request. It's a way of saying, hey, I made a change, will you take it? There's nothing too novel in the concept of the pull request: it's fundamentally no different than sending emails with patches to a mailing list. But what is so special is GitHub's execution. Gone are the days of configuring and using one-off tools and processes. Instead, we have the friendly confines of a clean and modern web experience. While GitHub is built upon the Git tool, you don't even need to use Git (a tool lampooned for its horrible usability and approachability) to contribute on GitHub! Instead, you can do everything from your browser. That warrants repeating: you don't need to leave your browser to contribute on GitHub. GitHub has essentially reduced process debt to edit a text document territory, and pretty much anybody who has used a computer can do that. This has enabled GitHub to dabble in non-code territory, such as its GitHub and Government initiative to foster community involvement in government. (GitHub is really a platform for easily seeing and changing any content or data. But, please, let me continue using code as a stand-in, since I want to focus on the developer audience.)

GitHub took an overly-complicated and fragmented world of varying contribution processes and made the new world revolve around code and a unified and simple process for change - the pull request.

Yes, there are other reasons for GitHub's success. You can make strong arguments that GitHub has capitalized on the social and psychological aspects of coding and human desire for success and happiness. I agree.

You can also argue GitHub succeeded because of Git. That statement is more or less technically accurate, but I don't think it is a sound argument. Git may have been the most feature complete open source DVCS at the time GitHub came into existence. But that doesn't mean there is something special about Git that no other DVCS has that makes GitHub popular. Had another tool been more feature complete or had the backing of a project as large as Linux at the time of GitHub's launch, we could very well be looking at a successful service built on something that isn't Git. Git had early market advantage and I argue its popularity today - a lot of it via GitHub - is largely a result of its early advantages over competing tools. And I would go so far as to say that when you consider the poor usability of Git and the pain that its users go through when first learning it, more accurate statements would be that GitHub succeeded in spite of Git and Git owes much of its success to GitHub.

When I look back at the rise of GitHub, I see a service that has succeeded by putting people first by allowing them to capitalize on more productive workflows and processes. They've done this by emphasizing code, not process, as the means for change. Organizations and projects should take note.

