Why You Shouldn't Use Git LFS

May 12, 2021 at 10:30 AM | categories: Mercurial, Git

I have long held the opinion that you should avoid Git LFS if possible. Since people keep asking me why, I figured I'd capture my thoughts in a blog post so I have something to refer them to.

Here are my reasons for not using Git LFS.

Git LFS is a Stop Gap Solution

Git LFS was developed outside the official Git project to fulfill a very real market need: Git didn't (and still doesn't) handle large files very well.

I believe it is inevitable that Git will gain better support for handling of large files, as this seems like a critical feature for a popular version control tool.

If you make this long bet, LFS is only an interim solution and its value proposition disappears after Git has better native support for large files.

LFS as a stop gap solution would be tolerable except for the fact that...

Git LFS is a One Way Door

The adoption or removal of Git LFS in a repository is an irreversible decision that requires rewriting history and losing your original commit SHAs.

In some contexts, rewriting history is tolerable. In many others, it is an extremely expensive proposition. My experience maintaining version control in professional contexts aligns with the opinion that rewriting history is expensive and should only be considered a measure of last resort. Maybe if tools made it easier to rewrite history without the negative consequences (e.g. if GitHub redirected references to old SHA-1s in URLs and API calls) I would change my opinion here. Until that day, the drawbacks of losing your original history are just too high for many to stomach.

The reason adoption or removal of LFS is irreversible is due to the way Git LFS works. What LFS does is change the blob content that a Git commit/tree references. Instead of the content itself, it stores a pointer to the content. At checkout and commit time, LFS blobs/records are treated specially via a filtering mechanism in Git (clean and smudge filters) that allows content to be rewritten as it moves between Git's core storage and its materialized representation. (The same filtering mechanism is responsible for normalizing line endings in text files, although that feature is built into the core Git product and doesn't work exactly the same way. The principles are the same.)
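
To make this concrete, here is roughly what the moving pieces look like (the pattern and hash below are illustrative): tracking a path with LFS records a filter rule in .gitattributes, and the blob Git actually stores for a tracked file is a small pointer rather than the content itself.

    # .gitattributes, committed to the repository (e.g. written by "git lfs track")
    *.psd filter=lfs diff=lfs merge=lfs -text

    # The blob Git stores for foo.psd - an LFS pointer, not the file content
    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345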

Since the LFS pointer is part of the Merkle tree that a Git commit derives from, you can't add or remove LFS from a repo without rewriting existing Git commit SHAs.

I want to explicitly call out that even if a rewrite is acceptable in the short term, things may change in the future. If you adopt LFS today, you are committing to a) running an LFS server forever b) incurring a history rewrite in the future in order to remove LFS from your repo, or c) ceasing to provide an LFS server and locking out people from using older Git commits. I don't think any of these are great options: I would prefer if there were a way to offboard from LFS in the future with minimal disruption. This is theoretically possible, but it requires the Git core product to recognize LFS blobs/records natively. There's no guarantee this will happen. So adoption of Git LFS is a one way door that can't be easily reversed.

LFS is More Complexity

LFS is more complex for Git end users.

Git users have to install, configure, and sometimes know about the existence of Git LFS. Version control should just work. Large file handling should just work. End-users shouldn't have to care that large files are handled slightly differently from small files.

The usability of Git LFS is generally pretty good. However, there's an upper limit on that usability as long as LFS exists outside the core Git product. And LFS will likely never be integrated into the core Git product because the Git maintainers know that LFS is only a stop gap solution. They would rather solve large file storage correctly than ~forever carry the legacy baggage of having to support LFS in the core product.

LFS is more complexity for Git server operators as well. Instead of a self-contained Git repository and server to support, you now have to support a likely separate HTTP server to facilitate LFS access. This isn't the hardest thing in the world, especially since we're talking about key-value blob storage, which is arguably a solved problem. But it's another piece of infrastructure to support and secure and it increases the surface area of complexity instead of minimizing it. As a server operator, I would much prefer if the large file storage were integrated into the core Git product and I simply needed to provide some settings for it to just work.

Mercurial Does LFS Slightly Better

Since I'm a maintainer of the Mercurial version control tool, I thought I'd throw out how Mercurial handles large file storage better than Git. Mercurial's large file handling isn't great, but I believe it is strictly better with regards to the trade-offs of adopting large file storage.

In Mercurial, use of LFS is a dynamic feature that server/repo operators can choose to enable or disable whenever they want. When the Mercurial server sends file content to a client, the presence of external/LFS storage is a flag set on that file revision. Essentially, the flag says the data you are receiving is an LFS record, not the file content itself, and the client knows how to resolve that record into content.

Conceptually, this is little different from Git LFS records in terms of content resolution. However, the big difference is that this flag is part of the repository interchange data, not the core repository data as it is with Git. Since the flag isn't part of the Merkle tree used to derive the commit SHA, adding, removing, or altering the content of LFS records doesn't require rewriting commit SHAs. The SHA of the tracked content - the data now stored in LFS - is still part of the Merkle tree, so the integrity of the commit / repository can still be verified.

In Mercurial, the choice of whether to use LFS and what to use LFS for is made by the server operator and settings can change over time. For example, you could start with no use of LFS and then one day decide to use LFS for all file revisions larger than 10 MB. Then a year later you lower that to all revisions larger than 1 MB. Then a year after that Mercurial gains better native support for large files and you decide to stop using LFS altogether.
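
As a rough sketch of what this looks like in practice with Mercurial's lfs extension (the configuration below reflects my understanding of the extension and the threshold is illustrative), the policy lives in server-side configuration rather than in the commits themselves:

    # hgrc on the server (or site-wide configuration)
    [extensions]
    lfs =

    [lfs]
    # store file revisions larger than 10 MB externally; this can later be
    # lowered (e.g. to 1 MB) or removed without rewriting any history
    track = size(">10MB")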

Also in Mercurial, it is possible for clients to push a large file inline as part of the push operation. When the server sees that large file, it can decide: this is a large file; I'm going to add it to the blob store and advertise it as LFS. Because the large file record isn't part of the Merkle tree, you can have nice things like this.

I suspect it is only a matter of time before Git's wire protocol learns the ability to dynamically advertise remote servers for content retrieval and this feature will be leveraged for better large file handling. Until that day, I suppose we're stuck with having to rewrite history with LFS and/or funnel large blobs through Git natively, with all the pain that entails.

Conclusion

This post summarized reasons to avoid Git LFS. Are there justifiable scenarios for using LFS? Absolutely! If you insist on using Git and insist on tracking many large files in version control, you should definitely consider LFS. (Although, if you are a heavy user of large files in version control, I would consider Plastic SCM instead, as they seem to have the most mature solution for large files handling.)

The main point of this post is to highlight some drawbacks with using Git LFS because Git LFS is most definitely not a magic bullet. If you can stomach the short and long term effects of Git LFS adoption, by all means, use Git LFS. But please make an informed decision either way.


Mercurial's Journey to and Reflections on Python 3

January 13, 2020 at 08:45 AM | categories: Python, Mercurial

Mercurial 5.2 was released on November 5, 2019. It is the first version of Mercurial that supports Python 3. This milestone comes nearly 11 years after Python 3.0 was first released on December 3, 2008.

Speaking as a maintainer of Mercurial and an avid user of Python, I feel like the experience of making Mercurial work with Python 3 is worth sharing because there are a number of lessons to be learned.

This post is logically divided into two sections: a mostly factual account of Mercurial's Python 3 porting effort and more opinionated commentary on the transition to Python 3 and the Python language ecosystem as a whole. Those who don't care about the mechanics of porting a large Python project to Python 3 may want to skip the next section or two.

Porting Mercurial to Python 3

Let's start with a brief history lesson of Mercurial's support for Python 3 as told by its own commit history.

The Mercurial version control tool was first released in April 2005 (the same month that Git was initially released). Version 1.0 came out in March 2008. The first reference to Python 3 I found in the code base was in September 2008. Then not much happened for a while until June 2010, when someone authored a bunch of changes to make the Python C extensions start to recognize Python 3. Then things were again quiet for a while until January 2013, when a handful of changes landed to remove the two-argument raise syntax. There were a handful of commits in 2014 but nothing worth calling out.

Mercurial's meaningful journey to Python 3 started in 2015. In code, the work started in April 2015, with effort to make Mercurial's test harness run with Python 3. Part of this was a decision that Python 3.5 (to be released several months later in September 2015) would be the minimum Python 3 version that Mercurial would support.

Once the Mercurial Project decided it wanted to port to Python 3 (as opposed to another language), one of the earliest decisions was how to perform that port. Mercurial's code base was too large to attempt a flag day conversion where there would be a Python 2 version and a Python 3 version and one day everyone would switch from Python 2 to 3. Mercurial needed a way to run the same code (or as much of the same code) on both Python 2 and 3. We would maintain a single code base and users would gradually switch from running with Python 2 to Python 3.

In May 2015, Mercurial dropped support for Python 2.4 and 2.5. Dropping support for these older Python versions was critical, as it was effectively impossible to write Python code that ran on this wide gamut of versions because of incompatibilities in syntax and language features. For example, you needed Python 2.6 to get print() via from __future__ import print_function. The project's late start at a Python 3 port can be significantly attributed to Python 2.4 and 2.5 compatibility holding us back.

The main goal with Mercurial's early porting work was just getting the code base to a point where import mercurial would work. There were a myriad of places where Mercurial used syntax that was invalid on Python 3 and Python 3 couldn't even parse the source code, let alone compile it to bytecode and execute it.

This effort began in earnest in June 2015 with global source code rewrites like using modern octal syntax, modern exception catching syntax (except Exception as e instead of except Exception, e), print() instead of print, and a modern import convention along with the use of from __future__ import absolute_import.
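
In code, those rewrites looked something like the following contrived example (hypothetical code, not an actual Mercurial change); the old Python 2-only spellings are shown in comments.

    from __future__ import absolute_import, print_function

    import os

    mode = 0o644                              # was: mode = 0644
    try:
        os.makedirs('/tmp/example', mode)
    except OSError as e:                      # was: except OSError, e:
        print('cannot create directory:', e)  # was: print 'cannot create directory:', e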

In the early days of the port, our first goal was to get all source code parsing as valid Python 3. The next step was to get all the modules importing cleanly. This entailed fixing code that ran at import time to work on Python 3. Our thinking was that we would need the code base to be import clean on Python 3 before seriously thinking about run-time behavior. In reality, we quickly ported a lot of modules to import cleanly and then moved on to higher-level porting, leaving a long-tail of modules with import failures.

This initial porting effort played out over months. There weren't many people working on it in the early days: a few people would basically hack on Python 3 as a form of itch scratching and most of the project's energy was focused on improving the existing Python 2 based product. You can get a rough idea of the timeline and participation in the early porting effort through the history of test-check-py3-compat.t. We see the test being added in December 2015. By June 2016, most of the code base was ported to our modern import convention and we were ready to move on to more meaningful porting.

One of the biggest early hurdles in our porting effort was how to overcome the string literals type mismatch between Python 2 and 3. In Python 2, a '' string literal is a sequence of bytes. In Python 3, a '' string literal is a sequence of Unicode code points. These are fundamentally different types. And in Mercurial's code base, most of our string types are binary by design: use of a Unicode based str for representing data is flat out wrong for our use case. We knew that Mercurial would need to eventually switch many string literals from '' to b'' to preserve type compatibility. But doing so would be problematic.
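
The mismatch is trivial to demonstrate in an interpreter (an illustration, not Mercurial code):

    # Python 2
    >>> type('')
    <type 'str'>              # a byte string
    >>> '' == b''
    True                      # b'' is merely an alias for ''

    # Python 3
    >>> type('')
    <class 'str'>             # a sequence of Unicode code points
    >>> type(b'')
    <class 'bytes'>
    >>> 'abc' == b'abc'
    False                     # different types never compare equal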

In the early days of Mercurial's Python 3 port in 2015, Mercurial's project maintainer (Matt Mackall) set a ground rule that the Python 3 port shouldn't overly disrupt others: he wanted the Python 3 port to more or less happen in the background and not require every developer to be aware of Python 3's low-level behavior in order to get work done on the existing Python 2 code base. This may seem like a questionable decision (and I probably disagreed with him to some extent at the time because I was doing Python 3 porting work and the decision constrained this work). But it was the correct decision. Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case). What Matt was trying to do was minimize the externalized costs that a Python 3 port would inflict on the project. He correctly recognized that maintaining the existing product and supporting existing users was more important than a long-term bet in its infancy.

This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

In addition, there were some other practical issues with doing a bulk b'' prefix insertion. One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code. That would require manual intervention and would significantly slow down porting. And a sub-issue of adding all the b prefixes and reformatting code is that it would break annotate/blame more than was tolerable. The latter issue was addressed by teaching Mercurial's annotate/blame feature to skip revisions. The project now has a convention of annotating commit messages with # skip-blame <reason> so structure-only changes can easily be ignored when performing an annotate/blame.

A stop-gap solution to the b'' everywhere issue came in July 2016, when I introduced a custom Python module importer that rewrote source code as part of import when running on Python 3. (I have previously blogged about this hack.) What this did was transparently add b'' prefixes to all un-prefixed string literals as well as modify how a few common functions were called so that we wouldn't need to modify source code so things would run natively on Python 3. The source transformer allowed us to have the benefits of progressing in our Python 3 port without having to rewrite tens of thousands of lines of source code. The solution was hacky. But it enabled us to make significant progress on the Python 3 port without externalizing a lot of cost onto others.
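
To give a flavor of the technique, here is a highly simplified, hypothetical sketch of a source-rewriting importer built on the standard importlib and tokenize modules. It is not Mercurial's actual transformer (which did considerably more and carefully limited itself to Mercurial's own modules); it just shows how string literals can be rewritten at import time:

    import importlib.machinery
    import io
    import sys
    import tokenize


    def rewrite_source(source_bytes):
        """Add a b prefix to un-prefixed string literals in Python source."""
        out = []
        for tok in tokenize.tokenize(io.BytesIO(source_bytes).readline):
            if (tok.type == tokenize.STRING
                    and tok.string[:1].lower() not in ('b', 'r', 'u', 'f')):
                tok = (tok.type, 'b' + tok.string)
            out.append(tok[:2])
        return tokenize.untokenize(out)


    class RewritingLoader(importlib.machinery.SourceFileLoader):
        def source_to_code(self, data, path, *, _optimize=-1):
            return compile(rewrite_source(data), path, 'exec',
                           dont_inherit=True)


    def install():
        # A real implementation must restrict this to its own package;
        # rewriting every module on sys.path would break the stdlib.
        hook = importlib.machinery.FileFinder.path_hook(
            (RewritingLoader, importlib.machinery.SOURCE_SUFFIXES))
        sys.path_hooks.insert(0, hook)
        sys.path_importer_cache.clear()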

I thought the source transformer would be relatively short-lived and would be removed shortly after the project inevitably decided to go all in on Python 3. To my surprise, others built additional transforms over the years and the source transformer persisted all the way until October 2019, when I removed it just before the first non-alpha Python 3 compatible version of Mercurial was released.

A common problem Mercurial faced with making the code base dual Python 2/3 native was dealing with standard library differences. Most of the problems stemmed from changes between Python 2.7 and 3.5+. But there were changes between versions of Python 3 that we had to wallpaper over as well. In April 2016, the mercurial.pycompat module was introduced to export aliases or wrappers around standard library functionality to abstract the differences between Python versions. This file grew over time and eventually became Mercurial's version of six. To be honest, I'm not sure if we should have used six from the beginning. six probably would have saved some work. But we had to eventually write a lot of shims for converting between str and bytes and would have needed to invent a pycompat layer in some form anyway. So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)
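
As a purely hypothetical miniature of what a pycompat-style shim looks like (illustrative only; Mercurial's real module is far larger):

    import sys

    ispy3 = sys.version_info[0] >= 3

    if ispy3:
        import queue
        from urllib.parse import quote as urlquote

        def sysstr(s):
            """Return a native str, decoding bytes for str-wanting stdlib APIs."""
            return s.decode('latin-1') if isinstance(s, bytes) else s
    else:
        import Queue as queue
        from urllib import quote as urlquote

        def sysstr(s):
            return s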

Once we had a source transforming module importer and the pycompat compatibility shim, we started to focus in earnest on making core functionality actually work on Python 3. We established a convention of annotating changesets needed for Python 3 with py3, so a commit message search yields a lot of the history. (But it isn't a full history since not every Python 3 oriented change used this convention). We see from that history that after the source importer landed, a lot of porting effort was spent on things very early in the hg process lifetime. This included handling environment variables, loading config files, and argument parsing. We introduced a test-check-py3-commands.t test to track the progress of hg commands working in Python 3. The very early history of that file shows the various error messages changing, as underlying early process functionality was slowly ported to work on Python 3. By December 2016, we had hg version working on Python 3!

With basic hg command dispatch ported to Python 3 at the end of 2016, 2017 represented an inflection point in the Python 3 porting effort. With the early process functionality working, different people could pick up different commands and code paths and start making code work with Python 3. By March 2017, basic repository opening and hg files worked. Shortly thereafter, hg init started working as well. And hg status and hg commit did as well.

Within a few months, enough of Mercurial's functionality was working with Python 3 that we started to track which tests passed on Python 3. The evolution of that list shows a reasonable history of the porting velocity.

In May 2017, we dropped support for Python 2.6. This significantly reduced the complexity of supporting Python 3, as there was tons of functionality in Python 2.7 that made it easier to target both Python 2 and 3 and now our hands were untied to utilize it.

In November 2017, I landed a test harness feature to report exceptions seen during test runs. I later refined the output so the most frequent failures were reported more prominently. This feature greatly enabled our ability to target the most common exceptions, allowing us to write patches to fix the most prevalent issues on Python 3 and uncover previously unknown failures.

By the end of 2017, we had most of the structural pieces in place to complete the port. Essentially all that was required at that point was time and labor. We didn't have a formal mechanism in place to target porting efforts. Instead, people would pick up a component or test that they wanted to hack on and then make incremental changes towards making that work. All the while, we didn't have a strict policy against regressing Python 3, and regressions in Python 3 porting progress were semi-frequent, although we did tend to correct them quickly. Over time, developers saw a flurry of Python 3 patches, slowly grew aware of how to accommodate Python 3, and regressions became less frequent.

As useful as the source-transforming module importer was, it incurred some additional burden for the porting effort. The source transformer effectively converted all un-prefixed string literals ('') to bytes literals (b'') to preserve string type behavior with Python 2. But various aspects of Python 3 didn't like the existence of bytes. Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes. So our pycompat layer grew pretty large to accommodate calling into various standard library functionality. Another side-effect which we didn't initially anticipate was the **kwargs calling convention. Python allows you to use ** with a dict with string keys to turn those keys into named arguments in a function call. But Python 3 requires these dict keys to be str and outright rejects bytes keys, even if the bytes instance is ASCII safe and has the same underlying byte representation as the equivalent str. So we had to invent one support function to convert dict keys from bytes to str for use with **kwargs and another to convert a **kwargs dict from str keys back to bytes keys so we could keep using '' syntax to access keys in our source code! Also on the string type front, we had to sprinkle the codebase with raw string literals (r'') to force the use of str regardless of which Python version you were running on (our source transformer only changed unprefixed string literals, so existing r'' strings would be preserved as str).
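
A minimal sketch of what such conversion helpers can look like (hypothetical code in the spirit of what we wrote, not the actual implementation):

    def strkwargs(dic):
        """Convert bytes keys to str so a dict can be expanded with **."""
        return {k.decode('latin-1'): v for k, v in dic.items()}


    def byteskwargs(dic):
        """Convert str keys captured via **kwargs back to bytes keys."""
        return {k.encode('latin-1'): v for k, v in dic.items()}


    def do_something(**opts):            # hypothetical function
        opts = byteskwargs(opts)         # internal code keeps using b'' keys
        return opts.get(b'verbose', False)


    do_something(**strkwargs({b'verbose': True}))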

Blind transformation of all string literals to bytes was less than ideal and it did impose some unwanted side-effects. But, again, most strings in Mercurial are bytes by design, so we thought it would be easier to byteify all strings and then selectively undo that where native strings were actually warranted (like keys in most dicts) than to take the up-front cost to examine every string and make an intelligent determination as to what type it should be. I go back and forth as to whether this was the correct call. But when you factor in that the source transforming module importer unblocked Python 3 porting at a time in the project's history when there was so much focus on improving the core product, and it did so without externalizing many costs onto the people doing the critical core product work, I think it was the right call.

By mid 2019, the test failures on Python 3 had been whittled down to a reasonable, less daunting number. It felt like victory was within grasp and inevitable. But a few significant issues lingered.

One remaining question was around addressing differences between Python 3 versions. At the time, Python 3.5, 3.6, and 3.7 had been released and 3.8 was scheduled for release by the end of the year. We had a surprising number of issues stemming from differences between Python 3 versions. Many of us were running Python 3.7, so it had the fewest failures. We had to spend extra effort to get Python 3.5 and 3.6 working as well as 3.7. Same for 3.8.

Another task we deferred until the second half of 2019 was standing up robust CI for Python 3. We had some coverage, but it was minimal. Wanting a distraction from PyOxidizer for a bit and wanting to overhaul Mercurial's CI system (which is officially built on Buildbot), I cobbled together a serverless CI system built on top of AWS DynamoDB and S3 for storage, Lambda functions and CloudWatch events for all business logic, and EC2 spot instances for job execution. This CI system executed Python 3.5, 3.6, 3.7, and 3.8 variants of our test harness on Linux and Python 3.7 on Windows. This gave developers insight into version-specific failures. More importantly, it also gave insight into Windows failures, which was previously not well tested. It was discovered that Python 3 on Windows was lagging significantly behind POSIX.

By the time of the Mercurial developer meetup in October 2019, nearly all tests were passing on POSIX platforms and we were confident that we could declare Python 3 support as at least beta quality for the Mercurial 5.2 release, planned for early November.

One of our blockers for ripping off the alpha label on Python 3 support was removing our source-transforming module importer. It had performance implications and it wasn't something we wanted to ship because it felt too hacky. That removal was itself blocked on automatically formatting our source tree with black: without the transformer, we'd have to rewrite a lot of source code to apply the changes it had been performing, which would mean rewrapping many lines, and we wanted black to handle that tedious reformatting for us automatically rather than doing it by hand. And reformatting the source tree with black was in turn blocked on a specific feature landing in black! (We did not agree with black's behavior of unwrapping comma-delimited lists of items if they could fit on a single line. So one of our core contributors wrote a patch to black that changed its behavior so that a trailing , in a list of items would force the items to be formatted on multiple lines. I personally find the multiple-line formatting much easier to read. And the behavior is arguably better for code review and annotation, which are line based.) Once this feature landed in black, we reformatted our source tree and started ripping out the source transformations, starting by inserting b'' prefixes everywhere. By late October, the source transformer was no more and we were ready to release beta quality support for Python 3 (at least on UNIX-like platforms).

Having described a mostly factual overview of Mercurial's port to Python 3, it is now time to shift gears to the speculative and opinionated parts of this post. I want to underscore that the opinions reflected here are my own and do not reflect the overall Mercurial Project or even a consensus within it.

The Future of Python 3 and Mercurial

Mercurial's port to Python 3 is still ongoing. While we've shipped Python 3 support and the test harness is clean on Python 3, I view shipping as only a milestone - arguably the most important one - in a longer journey. There's still a lot of work to do.

It is now 2020 and Python 2 support is officially dead from the perspective of the Python language maintainers. Linux distributions are starting to rip out Python 2. Packages are dropping Python 2 support in new versions. The world is moving to Python 3 only. But Mercurial still officially supports Python 2. And it is yet to be determined how long we will retain support for Python 2 in the code base. We've only had one release supporting Python 3. Our users still need to port their extensions (implemented in Python). Our users still need to start widely using Mercurial with Python 3. Even our own developers need to switch to Python 3 (old habits are hard to break).

I anticipate a long tail of random bugs in Mercurial on Python 3. While the tests may pass, our code coverage is not 100%. And even if it were, Python is a dynamic language and there are tons of invariants that aren't caught at compile time and can only be discovered at run time. These invariants cannot all be detected by tests, no matter how good your test coverage is. This is a feature/limitation of dynamic languages. Our users will likely be finding a long tail of miscellaneous bugs on Python 3 for years.

At present, our code base is littered with tons of random hacks to bridge the gap between Python 2 and 3. Once Python 2 support is dropped, we'll need to remove these hacks and make the source tree Python 3 native, with minimal shims to wallpaper over differences in Python 3 versions. Removing this Python version bridge code will likely require hundreds of commits and will be a non-trivial effort. It's likely to be deemed a low priority (it is glorified busy work after all), and code for the express purpose of supporting Python 2 will likely linger for years.

We are also still shoring up our packaging and distribution story on Python 3. This is easier on some platforms than others. I created PyOxidizer partially because of the poor experience I had with Python application packaging and distribution through the Mercurial Project. The Mercurial Project has already signed off on using PyOxidizer for distributing Mercurial in the future. So look for an oxidized Mercurial distribution in the near future! (You could argue PyOxidizer is an epic yak shave to better support Mercurial. But that's for another post.)

Then there's Windows support. A Python 3 powered Mercurial on Windows still has a handful of known issues. It may require a few more releases before we consider Python 3 on Windows to be stable.

Because we're still on a code base that must support Python 2, our adoption of Python 3 features is very limited. The only Python 3 feature that Mercurial developers seem to almost universally get excited about is type annotations. We already have some people playing around with pytype using comment-based annotations and pytype has already caught a few bugs. We're eager to go all in on type annotations and uncover lots of dynamic typing bugs and poorly implemented APIs. Beyond type annotations, I can't name any feature that people are screaming to adopt and which makes a lot of sense for Mercurial. There's a long tail of minor features I'm sure will get utilized. But none of the marquee features that define major language releases seem that interesting to us. Time will tell.

Commentary on Python 3

Having described Mercurial's ongoing journey to Python 3, I now want to focus more on Python itself. Again, the opinions here are my own and don't reflect those of the Mercurial Project.

Succinctly, my experience porting Mercurial and other projects to Python 3 has significantly soured my perceptions of Python. As much as I have historically loved Python - from the language to the welcoming community - I am still struggling to understand how Python could manage to inflict so much hardship on the community by choosing the transition plan that it did. I believe Python's choices represent a terrific example of what not to do when managing a large project or ecosystem. Maintainers of other widely deployed systems would benefit from taking the time to understand and reflect on Python's missteps.

Python 3.0 was released on December 3, 2008. And it took the better part of a decade for the community to embrace it. This should be universally recognized as a failure. While hindsight is 20/20, many of the issues with Python 3 were obvious at the time and could have been mitigated had the language maintainers been more accommodating - and dare I say empathetic - to its users.

Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility. In the early years of Python 3, the attitude of Python's maintainers was Python 3 is a new, better language: you should target it explicitly. There were some tools and methods to ease the transition. But nothing super polished, especially in the early years. Adoption of Python 3 in the overall community was slow. Python developers in the wild justifiably complained that the value proposition of Python 3 was too weak to justify porting effort. Not helping was that the early advice for targeting Python 3 was to rewrite the source code to become Python 3 native. This is in contrast with using the same source to run on both Python 2 and 3. For library and application maintainers, this potentially meant maintaining separate versions of your code or forcing end-users to make a giant leap, which would realistically orphan users on an old version, fragmenting your user base. Neither of those were great alternatives, so you can understand why many projects didn't bite.

For many projects of non-trivial size, flag day transitions from Python 2 to 3 were simply not viable: the pathway to Python 3 was to make code dual Python 2/3 compatible and gradually switch over the runtime to Python 3. But initial versions of Python 3 made this effectively impossible! Let me give a few specific examples.

In Python 2, a string literal '' is effectively an array of bytes. In Python 3, it is a series of Unicode code points - a fundamentally different type! In Python 2, you could write b'' to be explicit that a string literal was bytes or you could write u'' to indicate a Unicode literal, mimicking Python 3's behavior. In Python 3, you could write b'' to create a bytes instance. But for whatever reason, Python 3 initially removed the u'' syntax, meaning there wasn't an easy way to explicitly denote the type of each string literal so that it was consistent between Python 2 and 3! Python 3.3 (released September 2012) restored u'' support, making it more viable to write Python source code that worked on both Python 2 and 3. For nearly 4 years, Python 3 took away the consistent syntax for denoting bytes/Unicode string literals.

Another removed feature was % formatting of byte strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015. Fun fact: the lack of this feature was once considered a blocker for Mercurial moving to Python 3, because Mercurial uses bytes almost universally, which meant that nearly every use of % would have to be changed to something else. And to this day, Python 3's bytes still doesn't have a format() method, so the alternative was effectively string concatenation, which is a massive step backwards from the expressiveness of % formatting.
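
To make the gap concrete (a trivial illustration, not Mercurial code):

    node = b'a91f2b6c'
    # Works on Python 2 and on Python 3.5+; raises TypeError on Python 3.0-3.4.
    line = b'changeset: %s\n' % node
    # bytes still has no format() method, so the pre-3.5 fallback was
    # concatenation:
    line = b'changeset: ' + node + b'\n'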

The initial approach of Python 3 mirrors a folly that many developers and projects make: attempting a rewrite instead of performing incremental evolution. For established projects, large scale rewrites often go poorly. And Python 3 is no exception. Yes, from a code level, CPython (and likely other Python implementations) were incremental changes over Python 2 using the same code base. But from a language and standard library level, the differences in Python 3 were significant enough that I - and even Python's core maintainers - considered it a new language, and therefore a rewrite. When your random project attempts a rewrite and fails, the blast radius of that is often contained to that project. Maybe you don't publish a new release as soon as you otherwise would. But when you are powering an ecosystem, the ripple effects from a failed rewrite percolate throughout that ecosystem and last for years and have many second order effects. We see this with Python 3, where poor choices made in the late 2000s are inflicting significant hardship still in 2020.

From the restrained early adoption of Python 3, it is obvious that the Python ecosystem overwhelmingly rejected the initial boil the ocean approach of Python 3. Python's maintainers eventually got the message and started restoring features like u'' and bytes % formatting back into the language to placate the community. All the while, Python 3 had been accumulating new features, and the cumulative sum of those features was compelling enough to win over users.

For many projects (including Mercurial), Python 3.4/3.5 was the first viable porting target for Python 3. Python 3.5 was released in September 2015, almost 7 years after Python 3.0 was released in December 2008. Seven. Years. An ecosystem that falters for that long is generally not healthy. What may have saved Python from total collapse here is that Python 2 was still going strong and people were generally happy with it. I really do think Python dodged a bullet here, because there was a massive window where the language could have hemorrhaged a critical amount of its user base and been relegated to an afterthought. One could draw an analogy to Perl, which lost out to PHP, Python, and Ruby, and whose fall from grace aligned with a lengthy transition from Perl 5 to 6.

If you look back at the early history of Python 3, I think you are forced to conclude that Python effectively kneecapped itself for 5-7 years through questionable implementation choices that prevented users from making incremental transitions between the major language versions. 2008 to 2013-2015 should be known as the lost years of Python because so much opportunity and energy was squandered. Yes, Python is still healthy today and Python 3 is (finally) being adopted at scale. But had earlier versions of Python 3 been more empathetic towards Python 2 users porting to it, Python and Python 3 in 2020 would be even stronger than they are. The community was artificially hindered for years. And we won't know until 2023-2025 what things could have looked like in 2020 had the Python core language team spent more time paving a smoother road between the major language versions.

To be clear, I do think Python 3 is generally a better language than Python 2. It has fewer warts, more compelling features, and better performance (except for startup time, which is still slower than Python 2). I am ecstatic the community is finally rallying around Python 3! For my Python coding, it has reached the point where I curse under my breath when I need to support Python 2 or even older versions of Python 3, like 3.5 or 3.6: I just wish the world would move on and adopt the future already!

But I would be remiss if I failed to mention some of my gripes with Python 3 beyond the transition shenanigans.

Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode. In Python 2, the default string type was backed by bytes. In Python 3, the default string type is backed by Unicode code points. As part of that transition, large parts of the standard library now operate in the Unicode space instead of the domain of bytes. I understand why Python does this: they want strings to be Unicode and don't want users to have to spend that much energy thinking about when to use str versus bytes. This approach is admirable and somewhat defensible because it takes a stand on a solution that is arguably good enough for most users. However, the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications (like version control tools).

There are a myriad of places in Python's standard library where Python insists on using the Unicode-backed str type and rejects bytes. For example, various networking modules refuse to accept bytes for hostnames or URLs. HTTP libraries won't accept bytes for HTTP header names or values. Functions that are proxies to POSIX-defined functions won't accept bytes even though the POSIX function it calls into is using char * and isn't Unicode aware. Then there's filename handling, where Python assumes the existence of a global encoding for filenames and uses this encoding to convert between str and bytes. And it does this despite POSIX filesystem paths being a bag of bytes where the only rules are that \0 terminates the filename and / is special.
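
To illustrate the filename issue with a small example (behavior here assumes a UTF-8 POSIX locale; the exact results depend on the active filesystem encoding and error handler): Python round-trips arbitrary path bytes through str using the surrogateescape error handler, which preserves the bytes but leaves you holding str values that are not always safe to print or re-encode.

    import os

    raw = b'caf\xe9.txt'                # a perfectly legal POSIX filename
    name = os.fsdecode(raw)             # e.g. 'caf\udce9.txt' under UTF-8
    assert os.fsencode(name) == raw     # the original bytes survive the round trip
    # but the intermediate str can bite you:
    # name.encode('utf-8') raises UnicodeEncodeError because of the surrogate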

In cases like Python refusing to accept bytes for things like HTTP header names (which will just be spit out over the wire as bytes), Python's pendulum has swung too far towards Unicode only. In my opinion, Python needs to be more accommodating and allow bytes when it makes sense. I hope the pendulum knocks some sense into people when it swings back towards a more reasonable solution that better acknowledges the realities of the world we live in.

For areas like filename handling, the world is more complicated. Python is effectively an abstraction layer over the operating system APIs exposing this functionality. And there is often an impedance mismatch between operating systems. For example, POSIX (Linux) tends to use char * for everything and doesn't care about encoding and Windows tends to use 16 bit character types where the encoding is... a can of worms.

The reality here is that it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality. But Python 3 attempts to do it anyway, making Python 3 unsuitable (or at least highly undesirable) for certain systems level applications that rely on it (like a version control tool).

In fairness to Python, it isn't the only programming language that gets this wrong. The only language I've seen properly implement higher-order abstractions on top of operating system facilities is Rust, whose approach can be generalized as: use Python 3's solution of normalizing to Unicode/UTF-8 by default, but expose escape hatches which allow access to the raw underlying types and APIs used by the operating system for the advanced consumers who require them. For example, Rust's Path type, which represents a filesystem path, allows access to the raw OsStr value used by the operating system, not a normalization of it to bytes or Unicode, which may be lossy. This allows consumers to e.g. create and retrieve OS-native filesystem paths without data loss. This functionality is critical in some domains. Python 3's awareness/insistence that the world is Unicode (which it isn't universally) reduces Python's applicability in these domains.

Speaking of Rust, at the Mercurial developer meetup in October 2019, we were discussing the use of Rust in Mercurial and one of the core maintainers blurted out something along the lines of if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3. As crazy as it initially sounded, I think I agree with that assessment. With the benefit of hindsight, having been a key player in the Python 3 porting effort, seeing all the complications and headaches Python 3 is introducing, and having learned Rust and witnessed its benefits for performance, control, and correctness firsthand, porting to Rust would likely have been the correct move for the project at that point in time. 2020 is not 2014, however, and I'm not sure if I would opt for a rewrite in Rust today. (Most rewrites are follies after all.) But I know one thing: I certainly wouldn't implement a new version control tool in Python 3 and I would probably choose Rust as an implementation language for most new projects in the systems level space or with an expected shelf life of 10+ years. (I really should blog about how awesome Rust is.)

Back to the topic of Python itself, I'm really soured on Python at this point in time. The effort required to port to Python 3 was staggering. For Mercurial, Python 3 introduces a ton of problems and doesn't really solve many. We effectively slogged through mud for several years only to wind up in a state that feels strictly worse than where we started. I'm sure it will be strictly better in a few years. But at that point, we're talking about a 5+ year transition. To call the Python 3 transition disruptive and distracting for the project would be an understatement. As a project maintainer, it's natural to ask what we could have accomplished if we weren't forced to carry out this sideshow.

I can't shake the feeling that a lot of the pain inflicted by the Python 3 transition could have been avoided had Python's language leadership made a different set of decisions and more highly prioritized the transition experience. (Like not initially removing features like u'' and bytes % and not introducing gratuitous backwards compatibility breaks, like with items()/iteritems(). I would have also liked to see a feature like from __future__ - maybe from __past__ - that would make it easier for Python 3 code to target semantics in earlier versions in order to provide a more turnkey on-ramp onto new versions.) I simultaneously see Python 3 losing its position as a justifiable tool in some domains (like systems level tooling) due to ongoing design decisions and poor implementation (like startup overhead problems). (In contrast, I see Rust excelling where Python is faltering and find Rust code surprisingly expressive to write and maintain given how low-level it is, and therefore feel that Rust is a compelling alternative to Python in a surprisingly large number of domains.)

Look, I know it is easy for me to armchair quarterback and critique with the benefit of hindsight/ignorance. I'm sure there is a lot of nuance here. I'm sure there was disagreement within the Python community over a lot of these issues. Maintaining a large and successful programming language and community like Python's is hard and you aren't going to please all the people all the time. And speaking as a maintainer, I have mad respect for the people leading such a large community. But niceties aside, everyone knows the Python 3 transition was rough and could have gone better. It should not have taken 11 years to get to where we are today.

I'd like to encourage the Python Project to conduct a thorough postmortem on the transition to Python 3. Identify what went well, what could have gone better, and what should be done next time such a large language change is wanted. Speaking as a Python user, a maintainer of a Python project, and as someone in industry who is now skeptical about use of Python at work due to risks of potentially company crippling high-effort migrations in the future, a postmortem would help restore my confidence that Python's maintainers learned from the various missteps on the road to Python 3 and these potentially ecosystem crippling mistakes won't be made again.

Python had a wildly successful past few decades. And it can continue to thrive for several more. But the Python 3 migration was painful for all involved. And as much as we need to move on and leave Python 2 behind us, there are some important lessons to be learned. I hope the Python community takes the opportunity to reflect and am confident it will grow stronger by taking the time to do so.


Problems with Pull Requests and How to Fix Them

January 07, 2020 at 12:10 PM | categories: Mercurial, Git

You've probably used or at least heard of pull requests: the pull request is the contribution workflow practiced on and made popular by [code] collaboration sites like GitHub, GitLab, Bitbucket, and others. Someone (optionally) creates a fork, authors some commits, pushes them to a branch, then creates a pull request to track integrating those commits into a target repository and branch. The pull request is then used as a vehicle for code review, tracking automated checks, and discussion until it is ready to be integrated. Integration is usually performed by a project maintainer, often with the click of a merge button on the pull request's web page.

It's worth noting that the term pull request is not universally used: GitLab calls them merge requests, for example. Furthermore, I regard the terms pull request and merge request as poorly named, as they can be conflated with terminology used by your version control tool (e.g. git pull or git merge). And the implementation of a pull or merge request may not even perform a pull or a merge (you can also rebase a pull/merge request, but nobody is calling them rebase requests). A modern day pull request is so much more than a version control tool operation or even a simple request to pull or merge a branch: it is a nexus to track the integration of a proposed change before, during, and after that change is integrated. But alas. Because GitHub coined the term and is the most popular collaboration platform implementing this functionality, I'll refer to this general workflow as implemented on GitHub, GitLab, Bitbucket, and others as pull requests for the remainder of this post.

Pull requests have existed in essentially their current form for over a decade. The core workflow has remained mostly unchanged. What is different is the addition of value-add features, such as integrating status checks like CI results, the ability to rebase or squash commits instead of merging, code review tooling improvements, and lots of UI polish. GitLab deserves a call out here, as their implementation of merge requests tracks so much more than other tools do. (This is a side-effect of GitLab having more built-in features than comparable tools.) I will also give kudos to GitLab for adding new features to pull requests when GitHub was asleep at the wheel as a company a few years ago. (Not having a CEO for clear product/company leadership really showed.) Fortunately, both companies (and others) are now churning out new, useful features at a terrific clip, greatly benefiting the industry!

While I don't have evidence of this, I suspect pull requests (and the forking model used by services that implement them) came into existence when someone thought: how do I design a collaboration web site built on top of Git's new and novel distributed nature and branching features? They then proceeded to invent forking and pull requests. After all, the pull request as implemented by GitHub was initially a veneer over a common Git workflow of create a clone, create a branch, and send it somewhere. Without GitHub, you would run git clone, git branch, then some other command like git request-pull (where have I seen those words before?) to generate/send your branch somewhere. On GitHub, the comparable steps are roughly create a fork, push a branch to your fork, and submit a pull request. Today, you can even do all of this straight from the web interface without having to run git directly! This means that GitHub can conceptually be thought of as a purely server-side abstraction/implementation of the Git feature branch workflow.

At its core, the pull request is fundamentally a nice UI and feature layer built around the common Git feature branch workflow. It was likely initially conceived as polish and value-add features over this historically client-side workflow. And this core property of pull requests from its very first days has been copied by vendors like Bitbucket and GitLab (and in Bitbucket's case it was implemented for Mercurial - not Git - as Bitbucket was initially Mercurial only).

A decade is an eternity in the computer industry. As they say, if you aren't moving forward, you are moving backward. I think it is time for industry to scrutinize the pull request model and to evolve it into something better.

I know what you are thinking: you are thinking that pull requests work great and that they are popular because they are a superior model compared to what came before. These statements - aside from some nuance - are true. But if you live in the version control space (like I do) or are paid to deliver tools and workflows to developers to improve productivity and code/product quality (which I am), the deficiencies in the pull request workflow and in its implementation by vendors like GitHub, GitLab, and Bitbucket are obvious, and the model is begging to be overhauled if not replaced wholesale.

So buckle in: you've started a ten thousand word adventure about everything you didn't think you wanted to know about pull requests!

Problems with Pull Requests

To build a better workflow, we first have to understand what is wrong/sub-optimal with pull requests.

I posit that the foremost goal of a pull request is to foster the incorporation of a high quality and desired change into a target repository with minimal overhead and complexity for submitter, integrator, and everyone in between. Pull requests achieve this goal by fostering collaboration to discuss the change (including code review), tracking automated checks against the change, linking to related issues, etc. In other words, the way I see the world is that a specific vendor's pull request implementation is just that: an implementation detail. And like all implementation details, they should be frequently scrutinized and changed, if necessary.

Let's start dissecting the problems with pull requests by focusing on the size of review units. Research by Google, Microsoft, and others has shown that larger review units yield lower quality reviews and higher defect rates. In Google's words (emphasis mine):

The size distribution of changes is an important factor in the quality of the code review process. Previous studies have found that the number of useful comments decreases and the review latency increases as the size of the change increases. Size also influences developers' perception of the code review process; a survey of Mozilla contributors found that developers feel that size-related factors have the greatest effect on review latency. A correlation between change size and review quality is acknowledged by Google and developers are strongly encouraged to make small, incremental changes (with the exception of large deletions and automated refactoring). These findings and our study support the value of reviewing small changes and the need for research and tools to help developers create such small, self-contained code changes for review.

Succinctly, larger changes result in fewer useful comments during review (meaning quality is undermined) and make reviews take longer (meaning productivity is undermined). Takeaway: if you care about defect rate / quality and/or velocity, you should be authoring and reviewing more, smaller changes as opposed to fewer, larger changes.

I strongly agree with Google's opinion on this matter and wholeheartedly endorse writing more, smaller changes. Having practiced both forms of change authorship, I can say without a doubt that more, smaller changes is superior: superior for authors, superior for code reviewers, and superior for people looking at repository history later. The main downside with this model is that it requires a bit more knowledge of your version control tool to execute. And it requires corresponding tooling to play well with this change authorship model and to introduce as little friction as possible along the way, since the number of interactions with tooling will increase as change size decreases, velocity increases, and there are more distinct units of change being considered for integration.

That last point is important and is germane to this post because the common implementation of pull requests today is not very compatible with the many small changes workflow. As I'll argue, the current implementation of pull requests actively discourages the many smaller changes workflow. And since smaller changes result in higher quality and faster reviews, today's implementations of pull requests are undermining quality and velocity.

I don't mean to pick on them, but since GitHub is the most popular vendor and the one that made pull requests popular, let's use its implementation of pull requests to demonstrate my point.

I posit that in order for us to author more, smaller changes, we must either a) create more, smaller pull requests or b) have pull request reviews put emphasis on the individual commits (as opposed to the overall merge diff). Let's examine these individually.

If we were to author more, smaller pull requests, this would seemingly necessitate the need for dependencies between pull requests in order to maintain velocity. And dependencies between pull requests adds a potentially prohibitive amount of overhead. Let me explain. We don't want to sacrifice the overall rate at which authors and maintainers are able to integrate proposed changes. If we were to split existing proposed changes into more, smaller pull requests, we would have a lot more pull requests. Without dependencies between them, authors could wait for each pull request to be integrated before sending the next one. But this would incur more round trips between author and integrator and would almost certainly slow down the overall process. That's not desirable. The obvious mitigation to that is to allow multiple, related pull requests in flight simultaneously. But this would necessitate the invention of dependencies between pull requests in order to track relationships so one pull request doesn't integrate before another it logically depends on. This is certainly technically doable. But it imposes considerable overhead of its own. How do you define dependencies? Are dependencies automatically detected or updated based on commits in a DAG? If yes, what happens when you force push and it is ambiguous whether a new commit is a logically new commit or a successor of a previous one? If no, do you really want to impose additional hurdles on submitters to define dependencies between every pull request? In the extreme case of one pull request per commit, do you make someone submitting a series of say twenty commits and pull requests really annotate nineteen dependencies? That's crazy!

There's another, more practical issue at play: the interplay between Git branches and pull requests. As implemented on GitHub, a pull request is tracked by a Git branch. If we have N inter-dependent pull requests, that means N Git branches. In the worst case, we have one Git branch for every Git commit. Managing N in-flight Git branches would be absurd. It would impose considerable overhead on pull request submitters. It would perfectly highlight the inefficiency in Git's game of refs branch management that I blogged about two years ago. (Succinctly, once you are accustomed to workflows - like Mercurial's - which don't require you to name commits or branches, Git's forced naming of branches and all the commands requiring those branch names feels grossly inefficient and a mountain of overhead.) Some tooling could certainly be implemented to enable efficient submission of pull requests. (See ghstack for an example.) But I think the interplay between Git branches and GitHub pull requests is sufficiently complex that the tooling and workflow would be intractable for anything but the most trivial and best-case scenarios. Keep in mind that any sufficiently user-friendly solution to this problem would also entail improving git rebase so it moves branches on rewritten ancestor commits instead of leaving them on the old versions of commits. (Seriously, someone should implement this feature: it arguably makes sense as the default behavior for local branches.) In other words, I don't think you can implement the multiple pull request model reliably and without causing excessive burden on people without fundamentally changing the requirement that a pull request be a Git branch. (I'd love to be proven wrong.)
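To make that branch-management overhead concrete, here is a rough sketch of what a three-commit series submitted as three stacked pull request branches looks like with stock Git. The branch names are made up, and the reflog @{1} shorthand assumes each branch's previous tip is still its most recent reflog entry:

    # Hypothetical three-commit series as three stacked pull request branches.
    git checkout -b feature-part1 origin/master
    # ...author commit 1...
    git checkout -b feature-part2 feature-part1
    # ...author commit 2...
    git checkout -b feature-part3 feature-part2
    # ...author commit 3...
    git push origin feature-part1 feature-part2 feature-part3

    # If review feedback amends commit 1, every descendant branch must be
    # manually transplanted onto the rewritten parent and force pushed.
    git checkout feature-part2
    git rebase --onto feature-part1 feature-part1@{1}
    git checkout feature-part3
    git rebase --onto feature-part2 feature-part2@{1}
    git push --force-with-lease origin feature-part1 feature-part2 feature-part3

None of this bookkeeping is conceptually hard, but it is pure overhead that scales linearly with the number of in-flight pull requests.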

Therefore, I don't think the more, smaller changes workflow can be easily practiced with multiple pull requests using the common GitHub model without effectively moving the definition of a pull request away from equivalence with a Git branch (more on this later). And I also don't mean to imply that dependencies between pull requests can't be implemented: they can and GitLab is evidence. But GitLab's implementation is somewhat simple and crude (possibly because doing anything more complicated is really hard as I speculate).

So without fundamentally changing the relationship between a pull request and a branch, that leaves us with our alternative of pull requests putting more emphasis on the individual changes rather than the merge diff. Let's talk about that now.

Pull requests have historically placed emphasis on the merge diff. That is, GitHub (or another provider) takes the Git branch you have submitted, runs a git merge against the target branch behind the scenes, and displays that diff front and center for review as the main proposed unit of change: if you click the Files changed tab to commence review, you are seeing this overall merge diff. You can click on the Commits tab then select an individual commit to review just that commit. Or you can use the dropdown on the Files changed tab to select an individual commit to review it. These (relatively new) features are a very welcome improvement and do facilitate performing a commit-by-commit review, which is a requirement to realize the benefits of a more, smaller changes workflow. Unfortunately, they are far from sufficient to fully realize the benefits of that workflow.

Defaults matter and GitHub's default is to show the merge diff when conducting review. (I bet a large percentage of users don't even know it is possible to review individual commits.) Since larger changes result in a higher defect rate and slower review, GitHub's default of showing the merge diff effectively means GitHub is defaulting to lower quality, longer-lasting reviews. (I suppose this is good for engagement numbers, as it inflates service usage both immediately and in the long-term due to subsequent bugs driving further usage. But I sincerely hope no product manager is thinking let's design a product that undermines quality to drive engagement.)

Unfortunately, a trivial change of the default to show individual commits instead of the merge diff is not so simple, as many authors and projects don't practice clean commit authorship practices, where individual commits are authored such that they can be reviewed in isolation.

(One way of classifying commit authorship styles is by whether a series of commits is authored such that each commit is good in isolation or whether the effect of applying the overall series is what matters. A handful of mature projects - like the Linux kernel, Firefox, Chrome, Git, and Mercurial - practice the series of individually-good commits model, which I'll call a commit-centric workflow. I would wager the majority of projects on GitHub and similar services practice the we only care about the final result of the series of commits model. A litmus test for practicing the latter model is whether pull requests contain commits like fixup foo or if subsequent revisions to pull requests create new commits instead of amending existing ones. I'm a strong proponent of a clean commit history where each commit in the final repository history stands as good in isolation. But I tend to favor more grown-up software development practices and am a version control guru. That being said, the subject/debate is fodder for another post.)

If GitHub (or someone else) switched the pull request default to a per-commit review without otherwise changing the relationship between a pull request and a Git branch, that would force a lot of less experienced users to familiarize themselves with history rewriting in Git. This would impose considerable pain and suffering on pull request authors, which would in turn upset users, hurt engagement, etc. Therefore, I don't think this is a feasible change to the global default. Maybe if Git's user experience for history rewriting were better or we didn't have a decade of behavior to undo we'd be in a better position... But pull request implementations don't need to make a global change: they could right the ship by offering projects that practice clean commit hygiene an option to change the review default so it emphasizes individual commits instead of the merge diff. This would go a long way towards encouraging authoring and reviewing individual commits, which should have positive benefits on review velocity and code quality outcomes.

But even if these services did emphasize individual commits by default in pull request reviews, there's still a handful of significant deficiencies that would undermine the more, smaller changes workflow that we desire.

While it is possible to review individual commits, all the review comments are still funneled into a single per pull request timeline view of activity. If submitter and reviewer make the effort to craft and subsequently review individual commits, your reward is that all the feedback for the discrete units of change gets lumped together into one massive pile of feedback for the pull request as a whole. This unified pile of feedback (currently) does a poor job of identifying which commit it applies to and gives the author little assistance in knowing which commits need amending to address the feedback. This undermines the value of commit-centric workflows and effectively pushes commit authors towards the fixup style of commit authorship. In order to execute per-commit review effectively, review comments and discussion need to be bucketed by commit and not combined into a unified pull request timeline. This would be a massive change to the pull request user interface and would be a daunting undertaking, so it is understandable why it hasn't happened yet. And such an undertaking would also require addressing subtly complex issues like how to preserve reviews in the face of force pushes. Today, GitHub's review comments can lose context when force pushes occur. Things are better than they used to be, when review comments left on individual commits would flat out be deleted (yes: GitHub really did effectively lose code review comments for several years.) But even with tooling improvements, problems still remain and should adoption of commit-level review tracking occur, these technical problems would likely need resolution to appease users of this workflow.

Even if GitHub (or someone else) implements robust per-commit review for pull requests, there's still a problem with velocity. And that problem is that if the pull request is your unit of integration (read: merging), then you have to wait until every commit is reviewed before integration can occur. This may sound tolerable (it's what we practice today after all). But I argue this is less optimal than a world where a change integrates as soon as it is ready to, without having to wait for the changes after it. As an author and maintainer, if I see a change that is ready to integrate, I prefer to integrate it as soon as possible, without delay. The longer a ready-to-integrate change lingers, the longer it is susceptible to bit rot (when the change is no longer valid/good due to other changes in the system). Integrating a judged-good change sooner also reduces the time to meaningful feedback: if there is a fundamental problem early in a series of changes that isn't caught before integration, integrating earlier changes without waiting for the ones following will expose the problem sooner. This minimizes deltas in changed systems (often making regression hunting easier), often minimizes the blast radius if something goes wrong, and gives the author more time and less pressure to amend subsequent commits that haven't been integrated yet. And in addition to all of this, integrating more often just feels better. The Progress Principle states that people feel better and perform better work when they are making continuous progress. But setbacks more than offset the power of small wins. While I'm not aware of any explicit research in this area, my interpretation of the Progress Principle as applied to change authorship and project maintenance (which is supported by anecdotal observation) is that a steady stream of integrated changes feels a hell of a lot better than a single monolithic change lingering in review purgatory for what can often seem like an eternity. While you need to be cognizant not to confuse movement with meaningful progress, I think there is real power to the Progress Principle and that we should aim to incorporate changes as soon as they are ready and not any later. Applied to version control and code review, this means integrating a commit as soon as author, reviewer, and our machine overlords reporting status checks all agree it is ready, without having to wait for a larger unit of work, like the pull request. Succinctly, move forward as soon as you are able to!

This desire to integrate faster has significant implications for pull requests. Again, looking at GitHub's implementation of pull requests, I don't see how today's pull requests could adapt to this desired end state without significant structural changes. For starters, review must grow the ability to track per-commit state otherwise integrating individual commits without the entirety of the parts makes little sense. But this entails all the complexity I described above. Then there's the problem of Git branches effectively defining a pull request. What happens when some commits in a pull request are integrated and the author rebases or merges their local branch against their new changes? This may or may not just work. And when it doesn't just work, the author can easily find themselves in merge conflict hell, where one commit after the other fails to apply cleanly and their carefully curated stack of commits quickly becomes a liability and impediment to forward progress. (As an aside, the Mercurial version control tool has a concept called changeset evolution where it tracks which commits - changesets in Mercurial parlance - have been rewritten as other commits and gracefully reacts in situations like a rebase. For example, if you have commits X and Y and X is integrated via a rebase as X', an hg rebase of Y onto X' will see that X was rewritten as X' and skip attempting to rebase X because it is already applied! This cleanly sidesteps a lot of the problems with history rewriting - like merge conflicts - and can make the end-user experience much more pleasant as a result.) While it is certainly possible to integrate changes as soon as they are ready with a pull request workflow, I think that it is awkward and that by the time you've made enough changes to accommodate the workflow, very little is left of the pull request workflow as we know it and it is effectively a different workflow altogether.

The above arguments overly hinge on the assumption that more smaller changes is superior for quality and/or velocity and that we should design workflows around this assertion. While I strongly believe in the merits of smaller units of change, others may disagree. (If you do disagree, you should ask yourself whether you believe the converse: that larger units of change are better for quality and velocity. I suspect most people can't justify this. But I do believe there is merit to the argument that smaller units of change impose additional per-unit costs or have second order effects that undermine their touted quality or velocity benefits.)

But even if you don't buy into the change size arguments, there's still a very valid reason why we should think beyond pull requests as they are implemented today: tool scalability.

The implementation of pull requests today is strongly coupled with how Git works out of the box. A pull request is initiated from a Git branch pushed to a remote Git repository. When the pull request is created, the server creates a Git branch/ref referring to that pull request's head commits. On GitHub, these refs are named pull/ID/head (you can fetch these from the remote Git repository but they are not fetched by default). Also when a pull request is created or updated, a git merge is performed to produce a diff for review. On GitHub, the resulting merge commit is saved and pointed to on the open pull request via a pull/ID/merge ref, which can also be fetched locally. (The merge ref is deleted when the pull request is closed.)
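To see these hidden refs in action, you can fetch them explicitly from a GitHub remote (the pull request number here is made up):

    # Fetch the head and merge refs for pull request #123.
    git fetch origin pull/123/head:pr-123-head
    git fetch origin pull/123/merge:pr-123-merge   # only exists while the PR is open
    git log --oneline pr-123-head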

Herein resides our scalability problem: unbound growth of Git refs and an ever-increasing rate of change for a growing project. Each Git ref adds overhead to graph walking operations and data exchange. While the involved operations are continuously getting optimized (often through the use of more advanced data structures or algorithms), there are intrinsic scaling challenges with this unbound growth that - speaking as a version control tool maintainer - I want no part of. Are technical solutions enabling things to scale to millions of Git refs viable? Yes'ish. But it requires high-effort solutions like JGit's Reftable, which required ~90 review rounds spanning ~4 months to land. And that's after the design of the feature was first proposed at least as far back as July 2017. Don't get me wrong: am I glad Reftable exists? Yes. It is a fantastic solution to a real problem and reading how it works will probably make you a better engineer. But simultaneously, it is a solution to a problem that does not need to exist. There is a space for scaling graph data structures and algorithms to millions or even billions of nodes, edges, and paths: your version control tool should not be it. Millions or billions of commits and files: that's fine. But scaling the number of distinct paths through that graph by introducing millions of DAG heads is insane given how much complexity it introduces in random areas of the tool. In my opinion it requires unjustifiably large amounts of investment to make work at scale. As an engineer, my inclination when posed with problems like these is to avoid them in the first place. The easiest problems to solve are those you don't have.

Unfortunately, the tight coupling of pull requests to Git branches/refs introduces unbound growth and a myriad of problems associated with it. Most projects may not grow to a size that experiences these problems. But as someone who has experience with this problem space at multiple companies, I can tell you the problem is very real and the performance and scalability issues it creates undermine the viability of using today's implementation of pull requests once you've reached a certain scale. Since we can likely fix the underlying scaling issues with Git, I don't think the explosion of Git refs is a long-term deal breaker for scaling pull requests. But it is today and will remain so until Git and the tools built on top of it improve.

In summary, some high-level problems with pull requests are as follows:

  • Review of merge diff by default encourages larger units of review, which undermines quality and velocity outcomes.
  • Inability to incrementally integrate commits within a pull request, which slows down velocity, time to meaningful feedback, and can lower morale.
  • Tight coupling of pull requests with Git branches adds rigidity to workflows and shoehorns users into less flexible and less desirable workflows.
  • Deficiencies in the Git user experience - particularly around what happens when rewrites (including rebase) occur - significantly curtail what workflows can be safely practiced with pull requests.
  • Tight coupling of pull requests with Git branches can lead to performance issues at scale.

We can invert that language to arrive at a set of more ideal outcomes:

  • Review experience is optimized for individual commits - not the merge diff - so review units are smaller and quality and velocity outcomes are improved.
  • Ability to incrementally integrate individual commits from a larger set so ready-to-go changes are incorporated sooner, improving velocity, time to meaningful feedback, and morale.
  • How you use Git branches does not impose significant restrictions on handling of pull requests.
  • You can use your version control tool how you want, without having to worry about your workflow being shoehorned by how pull requests work.
  • Pull request server can scale to the most demanding use cases with relative ease.

Let's talk about how we could achieve these more desirable outcomes.

Exploring Alternative Models

A pull request is merely an implementation pattern for the general problem space of integrating a proposed change. There are other patterns used by other tools. Before I describe them, I want to coin the term integration request to refer to the generic concept of requesting some change being integrated elsewhere. GitHub pull requests and GitLab merge requests are implementations of integration requests, for example.

Rather than describe alternative tools in detail, I will outline the key areas where different tools differ from pull requests and assess the benefits and drawbacks to the different approaches.

Use of the VCS for Data Exchange

One can classify implementations of integration requests by how they utilize the underlying version control tools.

Before Git and GitHub came along, you were probably running a centralized version control tool which didn't support offline commits or feature branches (e.g. CVS or Subversion). In this world, the common mechanism for integration requests was exchanging diffs or patches through various media - email, post to a web service of a code review tool, etc. Your version control tool didn't speak directly to a VCS server to initiate an integration request. Instead, you would run a command which would export a text-based representation of the change and then send it somewhere.

Today, we can classify integration requests by whether or not they speak the version control tool's native protocol for exchanging data or whether they exchange patches through some other mechanism. Pull requests speak the VCS native protocol. Tools like Review Board and Phabricator exchange patches via custom HTTP web services. Typically, tools using non-native exchange will require additional client-side configuration, including potentially the installation of a custom tool (e.g. RBTools for Review Board or Arcanist for Phabricator). Although modern version control tools sometimes have this functionality built-in. (e.g. Git and Mercurial fulfill Zawinski's law and Mercurial has a Phabricator extension in its official distribution).
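For a feel of the difference, here are illustrative submission commands for each style; server URLs and configuration are omitted and exact invocations may vary by version:

    # Patch exchange via custom client-side tooling:
    rbt post    # Review Board: upload the local change as a review request
    arc diff    # Phabricator: create/update a Differential revision from local commits

    # Versus the VCS-native approach, where the push itself is the submission:
    git push origin my-feature-branch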

An interesting outlier is Gerrit, which ingests its integration requests via git push. (See the docs.) But the way Gerrit's ingestion via git push works is fundamentally different from how pull requests work! With pull requests, you are pushing your local branch to a remote branch and a pull request is built around that remote branch. With Gerrit, your push command is like git push gerrit HEAD:refs/for/master. For the non-gurus, that HEAD:refs/for/master syntax means push the HEAD commit (effectively the commit corresponding to the working directory) to the refs/for/master ref on the gerrit remote (the SOURCE:DEST syntax specifies a mapping of local revision identifier to remote ref). The wizard behind the curtain here is that Gerrit runs a special Git server that implements non-standard behavior for the refs/for/* refs. When you push to refs/for/master, Gerrit receives your Git push like a normal Git server would. But instead of writing a ref named refs/for/master, it takes the incoming commits and ingests them into a code review request! Gerrit will create Git refs for the pushed commits. But it mainly does that for its internal tracking (Gerrit stores all its data in Git - from Git data to review comments). And if that functionality isn't too magical for you, you can also pass parameters to Gerrit via the ref name! e.g. git push gerrit HEAD:refs/for/master%private will create a private review request that requires special permissions to see. (It is debatable whether overloading the ref name for additional functionality is a good user experience for average users. But you can't argue that this isn't a cool hack!)

On the surface, it may seem like using the version control tool's native data exchange is a superior workflow because it is more native and more modern. (Emailing patches is so old school.) Gone are the days of having to configure client-side tools to export and submit patches. Instead, you run git push and your changes can be turned into an integration request automatically or with a few mouse clicks. And from a technical level, this exchange methodology is likely safer, as round-tripping a text-based representation of a change without data loss is surprisingly finicky. (e.g. JSON's lack of lossless binary data exchange without encoding to e.g. base64 first often means that many services exchanging text-based patches are lossy, especially in the presence of content which doesn't conform to UTF-8, which can be commonplace in tests. You would be surprised how many tools experience data loss when converting version control commits/diffs to text. But I digress). Having Git's wire protocol exchange binary data is safer than exchanging text patches and probably easier to use since it doesn't require any additional client-side configuration.

But despite being more native, modern, and arguably robust, exchange via the version control tool may not be better.

For starters, use of the version control tool's native wire protocol inhibits use of arbitrary version control tools on the client. When your integration request requires the use of a version control tool's wire protocol, the client likely needs to be running that version control tool. With other approaches like exchange of text based patches, the client could be running any software it wanted: as long as it could spit out a patch or API request in the format the server needed, an integration request could be created! This meant there was less potential for lock-in, as people could use their own tools on their machines if they wanted and they (hopefully) wouldn't be inflicting their choice on others. Case in point, a majority of Firefox developers use Mercurial - the VCS of the canonical repository - but a large number use Git on the client. Because Firefox is using Phabricator (Review Board and Bugzilla before that) for code review and because Phabricator ingests text-based patches, the choice of the VCS on the client doesn't matter that much and the choice of the server VCS can be made without inciting a holy war among developers who would be forced to use a tool they don't prefer. Yes, there are good reasons for using a consistent tool (including organizational overhead) and sometimes mandates for tool use are justified. But in many cases (such as random open source contributions), it probably doesn't or shouldn't matter. And in cases like Git and Mercurial, where tools like the fantastic git-cinnabar make it possible to easily convert between the repositories without data loss and acceptable overhead, adoption of the version control tool's native wire protocol can exclude or inhibit the productivity of contributors since it can mandate use of specific, undesired tooling.

Another issue with using the version control tool's wire protocol is that it often forces or strongly encourages you to work a certain way. Take GitHub pull requests for example. The pull request is defined around the remote Git branch that you git push. If you want to update that branch, you need to know its name. So that requires some overhead to either create and track that branch or find its name when you want to update it. Contrast with Gerrit, where you don't have an explicit remote branch you push to: you simply git push gerrit HEAD:refs/for/master and it figures things out automatically (more on this later). With Gerrit, I don't have to create a local Git branch to initiate an integration request. With pull requests, I'm compelled to. And this can undermine my productivity by compelling me to practice less-efficient workflows!
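A minimal comparison of the two submission styles (remote and branch names are illustrative):

    # GitHub-style: the remote branch name is the identity of the pull request,
    # so you must pick a name and remember it for every update.
    git push origin HEAD:my-feature-branch

    # Gerrit-style: no named branch needed; the target ref is always the same.
    git push gerrit HEAD:refs/for/master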

Our final point of comparison involves scalability. When you use the version control tool wire protocol as part of integration requests, you have introduced the problem of scaling your version control server. Take it from someone who has had multiple jobs involving scaling version control servers and who is intimately aware of the low-level details of both the Git and Mercurial wire protocols: you don't want to be in the business of scaling a version control server. The wire protocols for both Git and Mercurial were designed in a now-ancient era of computing and weren't designed by network protocol experts. They are fundamentally difficult to scale at just the wire protocol level. I've heard stories that at one time, the most expensive single server at Google was their Perforce or Perforce-derived server (this was several years ago - Google has since moved on to a better architecture).

The poor network protocols of version control tools have many side-effects, including the inability or sheer difficulty of using distributed storage on the server. So in order to scale compute horizontally, you need to invest in expensive network storage solutions or devise a replication and synchronization strategy. And take it from someone who worked on data synchronization products (outside of the source control space) at three companies: this is a problem you don't want to solve yourself. Data synchronization is intrinsically difficult and rife with difficult trade-offs. It's almost always a problem best avoided if you have a choice in the matter.

If creating Git refs is part of creating an integration request, you've introduced a scaling challenge with the number of Git refs. Do these Git refs live forever? What happens when you have thousands of developers - possibly all working in the same repository - and the number of refs or ref mutations grows to the hundreds of thousands or millions per year?

Can your version control server handle ingesting a push every second or two with reasonable performance? Unless you are Google, Facebook, or a handful of other companies I'm aware of, it can't. And before you cry that I'm talking about problems that only plague the 0.01% of companies out there, I can name a handful of companies under 10% the size of these behemoths where this is a problem for them. And I also guarantee that many people don't have client-side metrics for their git push P99 times or reliability and don't even realize there is a problem! Scaling version control is probably not a core part of your company's business. Unfortunately, it all too often becomes something companies have to allocate resources for because of poorly designed or utilized tools.

Contrast the challenges of scaling integration requests with a native version control server versus just exchanging patches. With the more primitive approach, you are probably sending the patch over HTTP to a web service. And with tools like Phabricator and Review Board, that patch gets turned into rows in a relational database. I guarantee it will be easier to scale an HTTP web service fronting a relational database than it will be your version control server. If nothing else, it should be easier to manage and debug, as there are tons more experts in these domains than in the version control server domain!

Yes, it is true that many will not hit the scaling limits of the version control server. And some nifty solutions for scaling do exist. But large segments of this problem space - including the version control tool maintainers having to support crazy scaling vectors in their tools - could be avoided completely if integration requests didn't lean so heavily on the version control tool's default mode of operation. Unfortunately, solutions like GitHub pull requests and Gerrit's use of Git refs for storing everything exert a lot of pressure on scaling the version control server and make this a very real problem once you reach a certain scale.

Hopefully the above paragraphs enlightened you to some of the implications that the choice of a data exchange mechanism has on integration requests! Let's move on to another point of comparison.

Commit Tracking

One can classify implementations of integration requests by how they track commits through their integration lifecycle. What I mean by this is how the integration request follows the same logical change as it evolves. For example, if you submit a commit then amend it, how does the system know that the commit evolved from commit X to X'?

Pull requests don't track commits directly. Instead, a commit is part of a Git branch and that branch is tracked as the entity the pull request is built around. The review interface presents the merge diff front and center. It is possible to view individual commits. But as far as I know, none of these tools have smarts to explicitly track or map commits across new submissions. Instead, they simply assume that the commit order will be the same. If commits are reordered or added or removed in the middle of an existing series, the tool can get confused quite easily. (With GitHub, it was once possible for a review comment left on a commit to disappear entirely. The behavior has since been corrected and if GitHub doesn't know where to print a comment from a previous commit, it renders it as part of the pull request's timeline view.)

If all you are familiar with is pull requests, you may not realize there are alternatives to commit tracking! In fact, the most common alternative (which isn't do nothing) predates pull requests entirely and is still practiced by various tools today.

The way that Gerrit, Phabricator, and Review Board work is the commit message contains a unique token identifying the integration request for that commit. e.g. a commit message for a Phabricator review will contain the line Differential Revision: https://phab.mercurial-scm.org/D7543. Gerrit will have something like Change-Id: Id9bfca21f7697ebf85d6a6fa7bac7de4358d7a43.

The way this annotation appears in the commit message differs by tool. Gerrit's web UI advertises a shell one-liner to clone repositories which not only performs a git clone but also uses curl to download a shell script from the Gerrit server and install it as Git's commit-msg hook in the newly-cloned repository. This Git hook will ensure that any newly-created commit has a Change-Id: XXX line containing a randomly generated, hopefully unique identifier. Phabricator and Review Board leverage client-side tooling to rewrite commit messages after submission to their respective tool so the commit message contains the URL of the code review. One can debate which approach is better - they each have advantages and drawbacks. Fortunately, this debate is not germane to this post, so we won't cover it here.
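For the curious, a Gerrit hook installation roughly looks like the following; the host name is hypothetical and the exact one-liner advertised by your Gerrit instance may differ:

    # Clone and install Gerrit's commit-msg hook (hypothetical host).
    git clone https://gerrit.example.com/myproject
    curl -Lo myproject/.git/hooks/commit-msg https://gerrit.example.com/tools/hooks/commit-msg
    chmod +x myproject/.git/hooks/commit-msg
    # Every new commit made in this clone now ends with a trailer like:
    #   Change-Id: I8f3a6c2b9d4e...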

What is important is how this metadata in commit messages is used.

The commit message metadata comes into play when a commit is being ingested into an integration request. If a commit message lacks metadata or references an entity that doesn't exist, the receiving system assumes it is new. If the metadata matches an entity on file, the incoming commit is often automatically matched up to an existing commit, even if its Git SHA is different!

This approach of inserting a tracking identifier into commit messages works surprisingly well for tracking the evolution of commits! Even if you amend, reorder, insert, or remove commits, the tool can often figure out what matches up to previous submissions and reconcile state accordingly. Although support for this varies by tool. Mercurial's extension for submitting to Phabricator is smart enough to take the local commit DAG into account and change dependencies of review units in Phabricator to reflect the new DAG shape, for example.

The tracking of commits is another one of those areas where the simpler and more modern features of pull requests often don't work as well as the solutions that came before. Yes, inserting an identifier into commit messages feels hacky and can be brittle at times (some tools don't implement commit rewriting very well and this can lead to a poor user experience). But you can't argue with the results: using explicit, stable identifiers to track commits is far more robust than the heuristics that pull requests rely on. The false negative/positive rate is so much lower. (I know this from first hand experience because we attempted to implement commit tracking heuristics for a code review tool at Mozilla before Phabricator was deployed and there were a surprising number of corner cases we couldn't handle properly. And this was using Mercurial's obsolescence markers, which gave us commit evolution data generated directly by the version control tool! If that didn't work well enough, it's hard to imagine an heuristic that would. We eventually gave up and used stable identifiers in commit messages, which fixed most of the annoying corner cases.)

The use of explicit commit tracking identifiers may not seem like it makes a meaningful difference. But its impact is profound.

The obvious benefit of tracking identifiers is that they allow rewriting commits without confusing the integration request tool. This means that people can perform advanced history rewriting with near impunity as to how it would affect the integration request. I am a heavy history rewriter. I like curating a series of individually high-quality commits that can each stand in isolation. When I submit a series like this to a GitHub pull request and receive feedback on something I need to change, when I enact those changes I have to think will my rewriting history here make re-review harder? (I try to be empathetic with the reviewer and make their life easier whenever possible. I ask what I would appreciate someone doing if I were reviewing their change and tend to do that.) With GitHub pull requests, if I reorder commits or add or remove a commit in the middle of a series, I realize that this may make review comments left on those commits hard to find since GitHub won't be able to sort out the history rewriting. And this may mean those review comments get lost and are ultimately not acted upon, leading to bugs or otherwise deficient changes. This is a textbook example of tooling deficiencies dictating a sub-optimal workflow and outcome: because pull requests don't track commits explicitly, I'm forced to adopt a non-ideal workflow or sacrifice something like commit quality in order to minimize risks that the review tool won't get confused. In general, tools should not externalize these kinds of costs or trade-offs onto users: they should just work and optimize for generally agreed-upon ideal outcomes.

Another benefit to tracking identifiers is that they enable per-commit review to be viable. Once you can track the logical evolution of a single commit, you can start to associate things like review comments with individual commits with a high degree of confidence. With pull requests (as they are implemented today), you can attempt to associate comments with commits. But because you are unable to track commits across rewrites with an acceptably high degree of success, rewritten commits often fall through the cracks, orphaning data like review comments with them. Data loss is bad, so you need a place to collect this orphaned data. The main pull request activity timeline facilitates this function.

But once you can track commits reliably (and tools like Gerrit and Phabricator prove this is possible), you don't have this severe problem of data loss and therefore don't need to worry about finding a place to collect orphaned data! You are then able to create per-commit review units, each as loosely coupled with other commits and an overall series as you want to make it!

It is interesting to note the different approaches taken by different tools here. And it is doubly interesting to note the difference between what behavior is possible with the review tool itself and what it does by default!

Let's examine Phabricator. Phabricator's review unit is the Differential revision. (Differential is the name of the code review tool in Phabricator, which is actually a suite of functionality - like GitLab, but not nearly as feature complete.) A Differential revision represents a single diff. Differential revisions can have parent-child relationships with others. Multiple revisions associated like this form a conceptual stack in Phabricator's terminology. Go to https://phab.mercurial-scm.org/D4414 and search for stack to see it in action. (Stack is a somewhat misleading name because the parent-child relationships actually form a DAG and Phabricator is capable of rendering things like multiple children in its graphical view.) Phabricator's official client-side tool for submitting to Phabricator - Arcanist or arc - has default behavior of collapsing all Git commits into a single Differential revision.

Phabricator can preserve metadata from the individual commits (it can render at least the commit messages in the web UI so you can see where the Differential revision came from). But by default Arcanist does not construct a separate Differential revision for each commit and therefore does not construct parent-child relationships between them. So there is no stack to render here. To be honest, I'm not sure if modern versions of Arcanist even support doing this. I do know both Mercurial and Mozilla authored custom client-side tools for submitting to Phabricator to work around deficiencies like this in Arcanist. Mozilla's may or may not be generally suitable for users outside of Mozilla - I'm not sure.
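To illustrate the difference in defaults (exact invocations vary by tool version and configuration, so treat these as sketches):

    # Arcanist's default: the whole commit range becomes ONE Differential revision.
    arc diff origin/master

    # Mercurial's phabricator extension: one Differential revision per commit,
    # linked with parent-child relationships.
    hg phabsend -r .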

Another interesting aspect of Phabricator is that there is no concept of an over-arching series. Instead, each Differential revision stands in isolation. They can form parent-child relationships and constitute a stack. But there is no primary UI or APIs for stacks (the last I looked anyway). This may seem radical. You may be asking questions like how do I track the overall state of a series or how do I convey information pertinent to the series as a whole. These are good questions. But without diving into them, the answer is that as radical as it sounds to not have an overall tracking entity for a series of Differential revisions, it does work. And having used this workflow with the Mercurial Project for a few years, I can say I'm not missing the functionality that much.

Gerrit is also worth examining. Like Phabricator, Gerrit uses an identifier in commit messages to track the commit. But whereas Phabricator rewrites commit messages at initial submission time to contain the URL that was created as part of that submission, Gerrit peppers the commit message with a unique identifier at commit creation time. The server then maintains a mapping of commit identifier to review unit. Implementation details aside, the end result is similar: individual commits can be tracked more easily.

What distinguishes Gerrit from Phabricator is that Gerrit does have a stronger grouping around multiple commits. Gerrit will track when commits are submitted together and will render both a relation chain and submitted together list automatically. While it lacks the visual beauty of Phabricator's implementation, it is effective and is shown in the UI by default, unlike Phabricator.

Another difference from Phabricator is that Gerrit uses per-commit review by default. Whereas you need a non-official client for Phabricator to submit a series of commits to constitute a linked chain, Gerrit does this by default. And as far as I can tell, there's no way to tell Gerrit to squash your local commits down to a single diff for review: if you want a single review to appear, you must first squash commits locally then push the squashed commit. (More on this topic later in the post.)

A secondary benefit of per-commit review is that this model enables incremental integration workflows, where some commits in a series or set can integrate before others, without having to wait for the entire batch. Incremental integration of commits can drastically speed up certain workflows, as commits can integrate as soon as they are ready and not any later. The benefits of this model can be incredible. But actually deploying this workflow can be tricky. One problem is that your version control tool may get confused when you rebase or merge partially landed state. Another problem is it can increase the overall change rate of the repository, which may strain systems from version control to CI to deployment mechanisms. Another potential problem involves distinguishing review sign-off from integration sign-off. Many tools/workflows conflate I sign off on this change and I sign off on landing this change. While they are effectively identical in many cases, there are some valid cases where you want to track these distinctly. And adopting a workflow where commits can integrate incrementally will expose these corner cases. So before you go down this path, you want to be thinking about who integrates commits and when they are integrated. (You should probably be thinking about this anyway because it is important.)

Designing a Better Integration Request

Having described some problems with pull requests and alternate ways of going about solving the general problem of integration requests, it is time to address the million dollar problem: designing a better integration request. (When you factor in the time people spend in pull requests and the cost of bugs / low quality changes that slip through due to the design of existing tooling, improving integration requests industry wide would be a lot more valuable than $1M.)

As a reminder, the pull request is fundamentally a nice UI and set of features built around the common Git feature branch workflow. This property is preserved from the earliest days of pull requests in 2007-2008 and has been copied by vendors like Bitbucket and GitLab in the years since. In my mind, pull requests should be ripe for overhaul.

Replace Forks

The first change I would make to pull requests is to move away from forks being a required part of the workflow. This may seem radical. But it isn't!

A fork on services like GitHub is a fully fledged project - just like the canonical project it was forked from. It has its own issues, wiki, releases, pull requests, etc. Now show of hands: how often do you use these features on a fork? Me neither. In the overwhelming majority of cases, a fork exists solely as a vehicle to initiate a pull request against the repository it was forked from. It serves little to no additional meaningful functionality. Now, I'm not saying forks don't serve a purpose - they certainly do! But in the case of someone wanting to propose a change to a repository, a fork is not strictly required and its existence is imposed on us by the current implementation of pull requests.

I said impose in the previous sentence because forks introduce overhead and confusion. The existence of a fork may confuse someone as to where a canonical project lives. Forks also add overhead in the version control tool. Their existence forces the user to manage an additional Git remote and branches. It forces people to remember to keep their branches in sync on their fork. As if remembering to keep your local repository in sync wasn't hard enough! And if pushing to a fork, you need to re-push data that was already pushed to the canonical repository, even though that data already exists on the server (just in a different view of the Git repository). (I believe Git is working on wire protocol improvements to mitigate this.)

When merely used as a vehicle to initiate integration requests, I do not believe forks offer enough value to justify their existence. Should forks exist? Yes. Should people be forced to use them in order to contribute changes? No. (Valid use cases for a fork would be to perform a community splinter of a project and to create an independent entity for reasons such as better guarantees of data availability and integrity.)

Forks are essentially a veneer on top of a server-side git clone. And the reason why a separate Git repository is used at all is probably because the earliest versions of GitHub were just a pile of abstractions over git commands. The service took off in popularity, people copied its features almost verbatim, and nobody ever looked back and thought why are we doing things like this in the first place.

To answer what we would replace forks with, we have to go back to first principles and ask what we are trying to do. And that is to propose a unit of change against an existing project. And for version control tools, all you need to propose a change is a patch/commit. So to replace forks, we just need an alternate mechanism to submit patches/commits to an existing project.

My preferred alternative to forks is to use git push directly to the canonical repository. This could be implemented like Gerrit where you push to a special ref. e.g. git push origin HEAD:refs/for/master. Or - and this is my preferred solution - version control servers could grow more smarts about how pushes work - possibly even changing what commands like git push do if the server is operating in special modes.

One idea would be for the Git server to expose different refs namespaces depending on the authenticated user. For example, I'm indygreg on GitHub. If I wanted to propose a change to a project - let's say python/cpython - I would git clone git@github.com:python/cpython. I would create a branch - say indygreg/proposed-change. I would then git push origin indygreg/proposed-change and because the branch prefix matches my authenticated username, the server lets it through. I can then open a pull request without a fork! (Using branch prefixes is less than ideal, but it should be relatively easy to implement on the server. A better approach would rely on remapping Git ref names. But this may require a bit more configuration with current versions of Git than users are willing to stomach. An even better solution would be for Git to grow some functionality to make this easier. e.g. git push --workspace origin proposed-change would push proposed-change to a workspace on the origin remote, which Git would know how to translate to a proper remote ref update.)
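A sketch of what this could look like for an end user, assuming a server that implemented such user-prefixed branch namespaces (nothing like this exists on GitHub today):

    # Hypothetical: push a proposal branch directly to the canonical repository.
    git clone git@github.com:python/cpython.git
    cd cpython
    git checkout -b indygreg/proposed-change
    # ...author commits...
    # Accepted because the branch prefix matches the authenticated user.
    git push origin indygreg/proposed-change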

Another idea would be for the version control server to invent a new concept for exchanging commits - one based on sets of commits instead of DAG synchronization. Essentially, instead of doing a complicated discovery dance to synchronize commits with the underlying Git repository, the server would ingest and expose representations of sets of commits stored next to - but not within - the repository itself. This way you are not scaling the repository DAG to infinite heads - which is a hard problem! A concrete implementation of this might have the client run a git push --workspace origin proposed-change to tell the remote server to store your proposed-change branch in your personal workspace (sorry for reusing the term from the previous paragraph). The Git server would receive your commits, generate a standalone blob to hold them, save that blob to a key-value store like S3, then update a mapping of which commits/branches are in which blobs in a data store such as a relational database somewhere. This would effectively segment the core project data from the more transient branch data, keeping the core repository clean and pure. It allows the server to lean on easier-to-scale data stores such as key-value blob stores and relational databases instead of the version control tool. I know this idea is feasible because Facebook implemented it for Mercurial. The infinitepush extension essentially siphons Mercurial bundles (standalone files holding commit data) off to a blob store when pushes come in over the wire. At hg pull time, if a requested revision is not present in the repository, the server asks the database-backed blob index if the revision exists anywhere. If it does, the blob/bundle is fetched, dynamically overlayed onto the repository in memory, and served to the client. While the infinitepush extension in the official Mercurial project is somewhat lacking (through no fault of Facebook's), the core idea is solid and I wish someone would spend the time to flesh out the design a bit more because it really could lead to logically scaling repositories to infinite DAG heads without the complexities of actually scaling DAG algorithms, repository storage, and version control tool algorithms to infinite heads. Getting back to the subject of integration requests, one could imagine having a target for workspace pushes. For example, git push --workspace=review origin would push to the review workspace, which would automatically initiate a code review.
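Restating the proposed interface as commands, here is what the hypothetical workspace push might look like; neither flag exists in Git today:

    # Hypothetical commands - not implemented in Git.
    git push --workspace origin proposed-change          # store commits in a per-user workspace, outside the repo's refs
    git push --workspace=review origin proposed-change   # same, but also open an integration request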

Astute readers of this blog may find these ideas familiar. I proposed user namespaces in my /blog/2017/12/11/high-level-problems-with-git-and-how-to-fix-them/ post a few years ago. So read there for more on implications of doing away with forks.

Could forks be done away with as a requirement to submit pull requests? Yes! Gerrit's git push origin HEAD:refs/for/master mechanism proves it. Is Gerrit's approach too much magic or confusing for normal users? I'm not sure. Could Git grow features to make the user experience much better so users don't need to be burdened with complexity or magic and could simply run commands like git submit --for review? Definitely!

Shift Focus From Branches to Individual Commits

My ideal integration request revolves around individual commits, not branches. While the client may submit a branch to initiate or update an integration request, the integration request is composed of a set of loosely coupled commits, where parent-child relationships can exist to express a dependency between commits. Each commit is evaluated individually. Although someone may need to inspect multiple commits to gain a full understanding of the proposed change. And some UI enabling operations against a group of related commits (such as mass deleting abandoned commits) may be warranted.

In this world, the branch would not matter. Instead, commits are king. Because we would be abandoning the branch name as a tracker for the integration request, we would need something to replace it, otherwise we have no way of knowing how to update an existing integration request! We should do what tools like Phabricator, Gerrit, and Review Board do and add a persistent identifier to commits which survive history rewriting. (Branch-based pull requests should do this anyway so history rewrites don't confuse the review tool and e.g. cause comments to get orphaned - see above.)

It's worth noting that a commit-centric integration request model does not imply that everyone is writing or reviewing series of smaller commits! While titans of industry and I strongly encourage the authorship of smaller commits, commit-centric integration requests don't intrinsically force you to do so. This is because commit-centric integration requests aren't forcing you to change your local workflow! If you are the type of person who doesn't want to curate a ton of small, good-in-isolation commits (it does take a bit more work after all), nobody would be forcing you to do so. Instead, if this is your commit authorship pattern, the submission of the proposed change could squash these commits together as part of the submission, optionally rewriting your local history in the process. If you want to keep dozens of fixup commits around in your local history, that's fine: just have the tooling collapse them all together on submission. While I don't think those fixup commits are that valuable and shouldn't be seen by reviewers, if we wanted, we could have tools continue to submit them and make them visible (like they are in e.g. GitHub pull requests today). But they wouldn't be the focus of review (again like GitHub pull requests today). Making integration requests commit-centric doesn't force people to adopt a different commit authorship workflow. But it does enable projects that wish to adopt more mature commit hygiene to do so. That being said, how tools are implemented can impose restrictions. But there's nothing about commit-centric review that fundamentally prohibits the use of fixup commits in local workflows.
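As a sketch of how tooling can bridge the two styles today, Git's own fixup/autosquash machinery lets an author keep messy local history and still submit a clean series (the target commit and upstream branch are placeholders):

    # Record review feedback as fixup commits against the commit being amended.
    git commit --fixup=<sha-of-target-commit>
    # ...repeat as needed...
    # Fold the fixups into their targets before (re)submitting the series.
    git rebase -i --autosquash origin/master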

While I should create a dedicated post espousing the virtues of commit-centric workflows, I'll state my case through proxy by noting that some projects aren't using modern pull requests precisely because commit-centric workflows are not viable. When I was at Mozilla, one of the blockers to moving to GitHub was the pull request review tooling wasn't compatible with our world view that review units should be small. (This view is generally shared by Google, Facebook, and some prominent open source projects, among others.) And for reasons outlined earlier in this post, I think that as long as pull requests revolve around branches / merge diffs and aren't robust in the face of history rewriting (due to the lack of robust commit tracking), projects that insist on more refined practices will continue to eschew pull requests. Again, a link between review size and quality has been established. And better quality - along with its long-term effect of lowering development costs due to fewer bugs - can tip the scales in its favor, even against all the benefits you receive when using a product like GitHub, GitLab, or Bitbucket.

The Best of What's Around

Aspects of a better integration request exist in tools today. Unfortunately, many of these features are not present on pull requests as implemented by GitHub, GitLab, Bitbucket, etc. So to improve the pull request, these products will need to borrow ideas from other tools.

Integration requests not built around Git branches (Gerrit, Phabricator, Review Board, etc) use identifiers in commit messages to track commits. This helps track commits across rewrites and new submissions. There are compelling advantages to this model. Robust commit tracking is a requirement for commit-centric workflows. And it would even improve the functionality of branch-based pull requests. A well-designed integration request would have a robust commit tracking mechanism.

Gerrit has the best-in-class experience for commit-centric workflows. It is the only popular implementation of integration requests I'm aware of that supports and caters to this workflow by default. In fact, I don't think you can change this! (This behavior is user hostile in some cases since it forces users to know how to rewrite commits, which is often perilous in Git land. It would be nice if you could have Gerrit squash commits into the same review unit automatically on the server. But I understand the unwillingness to implement this feature because this has its own set of challenges around commit tracking, which I won't bore you with.) Gerrit also shows groups of related commits front and center when viewing a proposed change.

Phabricator is the only other tool I know of where one can achieve a reasonable commit-centric workflow without the pitfalls of orphaned comments, context overload, etc mentioned earlier in this post. But this requires non-standard submission tooling and commit series aren't featured prominently in the web UI. So Phabricator's implementation is not as solid as Gerrit's.

Another Gerrit feature worth lauding is the submission mechanism. You simply git push to a special ref. That's it. There's no fork to create. No need to create a Git branch. No need to create a separate pull request after the push. Gerrit just takes the commits you pushed and turns them into a request for review. And it doesn't require any additional client-side tooling!

Using a single common git command to submit and update an integration request is simpler and arguably more intuitive than other tools. Is Gerrit's submission perfect? No. The git push origin HEAD:refs/for/master syntax is not intuitive. And overloading submission options by effectively encoding URL parameters on the ref name is a gross - albeit effective - hack. But users will likely quickly learn the one-liner or create more intuitive aliases.
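
To make this concrete, here is a rough sketch of the day-to-day commands (the remote and branch names are placeholders, and the exact option-encoding syntax varies by Gerrit version):

$ git push origin HEAD:refs/for/master
<Gerrit creates or updates a review for each pushed commit>

$ git push origin HEAD:refs/for/master%topic=static-files
<same push, with a submission option encoded on the ref name>

$ git config alias.submit 'push origin HEAD:refs/for/master'
$ git submit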

The elegance of using just a git push to initiate an integration request puts Gerrit in a league of its own. I would be ecstatic if the GitHubs of the world reduced the complexity of submitting pull requests to simply clone the canonical repository, create some commits, and run a git command. The future of submitting integration requests hopefully looks more like Gerrit than other alternatives.

What Needs to Be Built

Some aspects of the better integration request don't yet exist or need considerable work before I consider them viable.

For tools which leverage the native version control tool for submission (e.g. via git push), there needs to be some work to support submission via a more generic HTTP endpoint. I'm fine with leveraging git push as a submission mechanism because it makes the end-user experience so turnkey. But making it the only submission mechanism is a bit unfortunate. There is some support for this: I believe you can cobble together a pull request from scratch via GitHub's APIs, for example. But it isn't as simple as submitting a patch to an endpoint, which it arguably should be. Even Gerrit's otherwise robust HTTP API does not seem to allow creating new commits/diffs. This limitation not only excludes non-Git version control tools from using these review tools, but also limits other tooling from submitting changes without using Git. For example, you may want to write a bot that proposes automated changes, and it is much easier to produce a diff than to drive git, since the former does not require a filesystem (this matters in serverless environments, for example).
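
To illustrate what I mean, here is a minimal sketch of bot-driven submission against a hypothetical HTTP endpoint. The URL, payload fields, and auth scheme are all invented for illustration; no existing service exposes this exact API:

# Hypothetical sketch: submit a unified diff to an imaginary review endpoint.
# The endpoint URL, JSON fields, and token handling are invented for illustration.
import requests

PATCH = """\
diff --git a/config/vhost.conf b/config/vhost.conf
--- a/config/vhost.conf
+++ b/config/vhost.conf
@@ -1,1 +1,1 @@
-AllowOverride NoneTypo
+AllowOverride None
"""

def submit_patch(base_url, token):
    """POST a patch plus metadata and return the new review's URL."""
    response = requests.post(
        base_url + "/api/integration-requests",
        headers={"Authorization": "Bearer " + token},
        json={
            "target_branch": "master",
            "message": "config: fix AllowOverride typo",
            "patch": PATCH,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["url"]

No checkout, no working directory, no filesystem required - which is exactly why a plain HTTP submission path is attractive for bots and non-Git clients.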

A larger issue with many implementations is the over-reliance on Git for server storage. This is most pronounced in Gerrit, where not only are your git pushes stored in a Git repository on the Gerrit server, but every code review comment and reply is stored in Git as well! Git is a generic key-value store and you can store any data you want in it if you shoehorn it properly. And it is cool that all your Gerrit data can be replicated via git clone - this pretty much eliminates the we took a decentralized tool and centralized it via GitHub series of arguments. But if you apply this store everything in Git approach at scale, it means you will be running a Git server at scale. And not just any Git server - a write-heavy Git server! And if you have thousands of developers possibly all working out of the same repository, then you are looking at potentially millions of new Git refs per year. While the Git, Gerrit, and JGit people have done some fantastic work making these tools scale, I'd feel much better if we eschewed the make Git scale to infinite pushes and refs problem and used a more scalable approach, like an HTTP ingestion endpoint which writes data to key-value stores or relational databases. In other words, use of a version control tool for servicing integration requests at scale is a self-imposed footgun and could be avoided.

Conclusion

Congratulations on making it through my brain dump! As massive as this wall of text is, there are still plenty of topics I could have covered but didn't. This includes the more specific topic of code review and the various features that entails. I also largely ignored some general topics, like the role that an integration request serves in the overall development lifecycle: integration requests are more than just code review - they serve as a nexus to track the evolution of a change over time.

Hopefully this post gave you some insight into the structural issues at play with pull requests and integration requests. And if you are someone in a position to design or implement a better integration request or tooling around them (including in version control tools themselves), hopefully it gave you some good ideas for where to go next.


Absorbing Commit Changes in Mercurial 4.8

November 05, 2018 at 09:25 AM | categories: Mercurial, Mozilla

Every so often a tool you use introduces a feature that is so useful that you can't imagine how things were before that feature existed. The recent 4.8 release of the Mercurial version control tool introduces such a feature: the hg absorb command.

hg absorb is a mechanism to automatically and intelligently incorporate uncommitted changes into prior commits. Think of it as hg histedit or git rebase -i with auto squashing.
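
(For Git users, the closest built-in approximation is probably the fixup/autosquash workflow sketched below. It still requires you to identify the target commit yourself, which is exactly the bookkeeping hg absorb eliminates.)

<save changes to a file>

$ git add <file>
$ git commit --fixup=<sha of the commit to amend>

$ git rebase --interactive --autosquash <base revision>
<git pre-arranges the fixup commit next to its target; accept the plan>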

Imagine you have a set of changes to prior commits in your working directory. hg absorb figures out which changes map to which commits and absorbs each of those changes into the appropriate commit. Using hg absorb, you can replace cumbersome and often merge conflict ridden history editing workflows with a single command that often just works. Read on for more details and examples.

Modern version control workflows often entail having multiple unlanded commits in flight. What this looks like varies heavily by the version control tool, standards and review workflows employed by the specific project/repository, and personal preferences.

A workflow practiced by a lot of projects is to author your commits into a sequence of standalone commits, with each commit representing a discrete, logical unit of work. Each commit is then reviewed/evaluated/tested on its own as part of a larger series. (This workflow is practiced by Firefox, the Git and Mercurial projects, and the Linux Kernel to name a few.)

A common task that arises when working with such a workflow is the need to incorporate changes into an old commit. For example, let's say we have a stack of the following commits:

$ hg show stack
  @  1c114a ansible/hg-web: serve static files as immutable content
  o  d2cf48 ansible/hg-web: synchronize templates earlier
  o  c29f28 ansible/hg-web: convert hgrc to a template
  o  166549 ansible/hg-web: tell hgweb that static files are in /static/
  o  d46d6a ansible/hg-web: serve static template files from httpd
  o  37fdad testing: only print when in verbose mode
 /   (stack base)
o  e44c2e (@) testing: install Mercurial 4.8 final

Contained within this stack are 5 commits changing the way that static files are served by hg.mozilla.org (but that's not important).

Let's say I submit this stack of commits for review. The reviewer spots a problem with the second commit (serve static template files from httpd) and wants me to make a change.

How do you go about making that change?

Again, this depends on the exact tool and workflow you are using.

A common workflow is to not rewrite the existing commits at all: you simply create a new fixup commit on top of the stack, leaving the existing commits as-is. e.g.:

$ hg show stack
  o  deadad fix typo in httpd config
  o  1c114a ansible/hg-web: serve static files as immutable content
  o  d2cf48 ansible/hg-web: synchronize templates earlier
  o  c29f28 ansible/hg-web: convert hgrc to a template
  o  166549 ansible/hg-web: tell hgweb that static files are in /static/
  o  d46d6a ansible/hg-web: serve static template files from httpd
  o  37fdad testing: only print when in verbose mode
 /   (stack base)
o  e44c2e (@) testing: install Mercurial 4.8 final

When the entire series of commits is incorporated into the repository, the end state of the files is the same, so all is well. But this strategy of using fixup commits (while popular - especially with Git-based tooling like GitHub that puts a larger emphasis on the end state of changes rather than the individual commits) isn't practiced by all projects. hg absorb will not help you if this is your workflow.

A popular variation of this fixup commit workflow is to author a new commit then incorporate this commit into a prior commit. This typically involves the following actions:

<save changes to a file>

$ hg commit
<type commit message>

$ hg histedit
<manually choose what actions to perform to what commits>

OR

<save changes to a file>

$ git add <file>
$ git commit
<type commit message>

$ git rebase --interactive
<manually choose what actions to perform to what commits>

Essentially, you produce a new commit. Then you run a history editing command. You then tell that history editing command what to do (e.g. to squash or fold one commit into another), and that command performs the work and produces a set of rewritten commits.

In simple cases, you may make a simple change to a single file. Things are pretty straightforward. You need to know which two commits to squash together. This is often trivial. Although it can be cumbersome if there are several commits and it isn't clear which one should be receiving the new changes.

In more complex cases, you may make multiple modifications to multiple files. You may even want to squash your fixups into separate commits. And for some code reviews, this complex case can be quite common. It isn't uncommon for me to be incorporating dozens of reviewer-suggested changes across several commits!

These complex use cases are where things can get really complicated for version control tool interactions. Let's say we want to make multiple changes to a file and then incorporate those changes into multiple commits. To keep it simple, let's assume 2 modifications in a single file squashing into 2 commits:

<save changes to file>

$ hg commit --interactive
<select changes to commit>
<type commit message>

$ hg commit
<type commit message>

$ hg histedit
<manually choose what actions to perform to what commits>

OR

<save changes to file>

$ git add <file>
$ git add --interactive
<select changes to stage>

$ git commit
<type commit message>

$ git add <file>
$ git commit
<type commit message>

$ git rebase --interactive
<manually choose which actions to perform to what commits>

We can see that the number of actions required by users has already increased substantially. Not captured by the number of lines is the effort that must go into the interactive commands like hg commit --interactive, git add --interactive, hg histedit, and git rebase --interactive. For these commands, users must tell the VCS tool exactly what actions to take. This takes time and requires some cognitive load. This ultimately distracts the user from the task at hand, which is bad for concentration and productivity. The user just wants to amend old commits: telling the VCS tool what actions to take is an obstacle in their way. (A compelling argument can be made that the work required with these workflows to produce a clean history is too much effort and it is easier to make the trade-off favoring simpler workflows versus cleaner history.)

These kinds of squash fixup workflows are what hg absorb is designed to make easier. When using hg absorb, the above workflow can be reduced to:

<save changes to file>

$ hg absorb
<hit y to accept changes>

OR

<save changes to file>

$ hg absorb --apply-changes

Let's assume the following changes are made in the working directory:

$ hg diff
diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -76,7 +76,7 @@ LimitRequestFields 1000
      # Serve static files straight from disk.
      <Directory /repo/hg/htdocs/static/>
          Options FollowSymLinks
 -        AllowOverride NoneTypo
 +        AllowOverride None
          Require all granted
      </Directory>

@@ -86,7 +86,7 @@ LimitRequestFields 1000
      # and URLs are versioned by the v-c-t revision, they are immutable
      # and can be served with aggressive caching settings.
      <Location /static/>
 -        Header set Cache-Control "max-age=31536000, immutable, bad"
 +        Header set Cache-Control "max-age=31536000, immutable"
      </Location>

      #LogLevel debug

That is, we have 2 separate uncommitted changes to ansible/roles/hg-web/templates/vhost.conf.j2.

Here is what happens when we run hg absorb:

$ hg absorb
showing changes for ansible/roles/hg-web/templates/vhost.conf.j2
        @@ -78,1 +78,1 @@
d46d6a7 -        AllowOverride NoneTypo
d46d6a7 +        AllowOverride None
        @@ -88,1 +88,1 @@
1c114a3 -        Header set Cache-Control "max-age=31536000, immutable, bad"
1c114a3 +        Header set Cache-Control "max-age=31536000, immutable"

2 changesets affected
1c114a3 ansible/hg-web: serve static files as immutable content
d46d6a7 ansible/hg-web: serve static template files from httpd
apply changes (yn)?
<press "y">
2 of 2 chunk(s) applied

hg absorb automatically figured out that the 2 separate uncommitted changes mapped to 2 different changesets (Mercurial's term for commit). It printed a summary of which lines would be changed in which changesets and prompted me to accept its plan for how to proceed. The human effort involved is a quick review of the proposed changes and answering a prompt.

At a technical level, hg absorb finds all uncommitted changes and attempts to map each changed line to an unambiguous prior commit. For every change that can be mapped cleanly, the uncommitted changes are absorbed into the appropriate prior commit. Commits impacted by the operation are rebased automatically. If a change cannot be mapped to an unambiguous prior commit, it is left uncommitted and users can fall back to an existing workflow (e.g. using hg histedit).
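
As a toy illustration of that mapping idea (this is not Mercurial's actual implementation; the function and data shapes here are invented), you can think of it as a blame-driven lookup: for each modified line, determine which in-flight commit last touched it, and only absorb when the answer is unambiguous:

# Toy sketch of the line-to-commit mapping idea; not hg absorb's real algorithm.
def plan_absorption(changed_lines, blame, in_flight):
    """changed_lines: line numbers modified in the working directory.
    blame: dict mapping a line number to the commit that last touched it.
    in_flight: set of unlanded commits that are safe to rewrite.
    Returns (plan, leftovers): commits to amend and lines left uncommitted.
    """
    plan, leftovers = {}, []
    for line in changed_lines:
        target = blame.get(line)
        if target in in_flight:
            plan.setdefault(target, []).append(line)
        else:
            # No unambiguous, rewritable target: leave the change uncommitted.
            leftovers.append(line)
    return plan, leftovers

# Example mirroring the diff above: two changed lines, two different commits.
plan, leftovers = plan_absorption(
    changed_lines=[78, 88],
    blame={78: "d46d6a7", 88: "1c114a3"},
    in_flight={"d46d6a7", "1c114a3"},
)
print(plan)       # {'d46d6a7': [78], '1c114a3': [88]}
print(leftovers)  # []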

But wait - there's more!

The automatic rewriting logic of hg absorb is implemented by following the history of lines. This is fundamentally different from the approach taken by hg histedit or git rebase, which tend to rely on merge strategies based on the 3-way merge to derive a new version of a file given multiple input versions. This approach, combined with the fact that hg absorb skips over changes with an ambiguous application commit, means that hg absorb will never encounter merge conflicts! Now, you may be thinking that if you ignore lines with ambiguous application targets, the patch would always apply cleanly using a classical 3-way merge. That sounds logically correct. But it isn't: hg absorb can avoid merge conflicts where the merging performed by hg histedit or git rebase -i would fail.

The above example attempts to exercise such a use case. Focusing on the initial change:

diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -76,7 +76,7 @@ LimitRequestFields 1000
     # Serve static files straight from disk.
     <Directory /repo/hg/htdocs/static/>
         Options FollowSymLinks
-        AllowOverride NoneTypo
+        AllowOverride None
         Require all granted
     </Directory>

This patch needs to be applied against the commit which introduced it. That commit had the following diff:

diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -73,6 +73,15 @@ LimitRequestFields 1000
         {% endfor %}
     </Location>

+    # Serve static files from templates directory straight from disk.
+    <Directory /repo/hg/hg_templates/static/>
+        Options None
+        AllowOverride NoneTypo
+        Require all granted
+    </Directory>
+
+    Alias /static/ /repo/hg/hg_templates/static/
+
     #LogLevel debug
     LogFormat "%h %v %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""
     ErrorLog "/var/log/httpd/hg.mozilla.org/error_log"

But after that commit was another commit with the following change:

diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -73,14 +73,21 @@ LimitRequestFields 1000
         {% endfor %}
     </Location>

-    # Serve static files from templates directory straight from disk.
-    <Directory /repo/hg/hg_templates/static/>
-        Options None
+    # Serve static files straight from disk.
+    <Directory /repo/hg/htdocs/static/>
+        Options FollowSymLinks
         AllowOverride NoneTypo
         Require all granted
     </Directory>

...

When we use hg histedit or git rebase -i to rewrite this history, the VCS would first attempt to re-order commits before squashing 2 commits together. When we attempt to reorder the fixup diff immediately after the commit that introduces it, there is a good chance your VCS tool would encounter a merge conflict. Essentially your VCS is thinking you changed this line but the lines around the change in the final version are different from the lines in the initial version: I don't know if those other lines matter and therefore I don't know what the end state should be, so I'm giving up and letting the user choose for me.

But since hg absorb operates at the line history level, it knows that this individual line wasn't actually changed (even though the lines around it were), assumes there is no conflict, and offers to absorb the change. So not only is hg absorb significantly simpler than today's hg histedit or git rebase -i workflows in terms of VCS command interactions, but it can also avoid time-consuming merge conflict resolution as well!

Another feature of hg absorb is that all the rewriting occurs in memory and the working directory is not touched when running the command. This means that the operation is fast (working directory updates often account for a lot of the execution time of hg histedit or git rebase commands). It also means that tools looking at the last modified time of files (e.g. build systems like GNU Make) won't rebuild extra (unrelated) files that were touched as part of updating the working directory to an old commit in order to apply changes. This makes hg absorb more friendly to edit-compile-test-commit loops and allows developers to be more productive.

And that's hg absorb in a nutshell.

When I first saw a demo of hg absorb at a Mercurial developer meetup, my jaw - along with those all over the room - hit the figurative floor. I thought it was magical and too good to be true. I thought Facebook (the original authors of the feature) were trolling us with an impossible demo. But it was all real. And now hg absorb is available in the core Mercurial distribution for anyone to use.

From my experience, hg absorb just works almost all of the time: I run the command and it maps all of my uncommitted changes to the appropriate commit and there's nothing more for me to do! In a word, it is magical.

To use hg absorb, you'll need to activate the absorb extension. Simply put the following in your hgrc config file:

[extensions]
absorb =

hg absorb is currently an experimental feature. That means there is no commitment to backwards compatibility and some rough edges are expected. I also anticipate new features (such as hg absorb --interactive) will be added before the experimental label is removed. If you encounter problems or want to leave comments, file a bug, make noise in #mercurial on Freenode, or submit a patch. But don't let the experimental label scare you away from using it: hg absorb is being used by some large install bases and also by many of the Mercurial core developers. The experimental label is mainly there because it is a brand new feature in core Mercurial and the experimental label is usually affixed to new features.

If you practice workflows that frequently require amending old commits, I think you'll be shocked at how much easier hg absorb makes these workflows. I think you'll find it to be a game changer: once you use hg absorb, you'll soon wonder how you managed to get work done without it.


Global Kernel Locks in APFS

October 29, 2018 at 02:20 PM | categories: Python, Mercurial, Apple

Over the past several months, a handful of people had been complaining that Mercurial's test harness was executing much slower on Macs. But this slowdown seemingly wasn't occurring on Linux or Windows. And not every Mac user experienced the slowness!

Before jetting off to the Mercurial 4.8 developer meetup in Stockholm a few weeks ago, I sat down with a relatively fresh 6+6 core MacBook Pro and experienced the problem firsthand: on my 4+4 core i7-6700K running Linux, the Mercurial test harness completes in ~12 minutes, but on this MacBook Pro, it was executing in ~38 minutes! On paper, this result doesn't make any sense because there's no way that the MacBook Pro should be ~3x slower than that desktop machine.

Looking at Activity Monitor when running the test harness with 12 tests in parallel revealed something odd: the system was spending ~75% of overall CPU time inside the kernel! When reducing the number of tests that ran in parallel, the percentage of CPU time spent in the kernel decreased and the overall test harness execution time also decreased. This kind of behavior is usually a sign of something very inefficient in kernel land.

I sample profiled all processes on the system when running the Mercurial test harness. Aggregate thread stacks revealed a common pattern: readdir() being in the stack.
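
(For reference, here is roughly how you might capture such a sample for a single test process using macOS's built-in sample tool; the PID and duration are placeholders and the available flags may vary between macOS versions.)

$ sudo sample 12345 10 -file python-sample.txt
<inspect the aggregated call stacks written to python-sample.txt>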

Upon closer examination of the stacks, readdir() calls into apfs_vnop_readdir(), which calls into some functions with bt or btree in their name, which in turn call into lck_mtx_lock(), lck_mtx_lock_grab_mutex() and various other functions with lck_mtx in their name. And the caller of most readdir() calls appeared to be Python 2.7's module importing mechanism (notably import.c:case_ok()).

APFS refers to the Apple File System, which is a filesystem that Apple introduced in 2017 and is the default filesystem for new versions of macOS and iOS. If upgrading an old Mac to a new macOS, its HFS+ filesystems would be automatically converted to APFS.

While the source code for APFS is not available for me to confirm, the profiling results showing excessive time spent in lck_mtx_lock_grab_mutex() combined with the fact that execution time decreases when the parallel process count decreases leads me to the conclusion that APFS obtains a global kernel lock during read-only operations such as readdir(). In other words, APFS slows down when attempting to perform parallel read-only I/O.

This isn't the first time I've encountered such behavior in a filesystem: last year I blogged about very similar behavior in AUFS, which was making Firefox CI significantly slower.

Because Python 2.7's module importing mechanism was triggering the slowness by calling readdir(), I posted to python-dev about the problem, as I thought it was important to notify the larger Python community. After all, this is a generic problem that affects the performance of starting any Python process when running on APFS. i.e. if your build system invokes many Python processes in parallel, you could be impacted by this. As part of obtaining data for that post, I discovered that Python 3.7 does not call readdir() as part of module importing and therefore doesn't exhibit a severe slowdown. (Python's module importing code was rewritten significantly in Python 3 and the fix was likely introduced well before Python 3.7.)

I've produced a gist that can reproduce the problem. The script essentially performs a recursive directory walk. It exercises the opendir(), readdir(), closedir(), and lstat() functions heavily and is essentially a benchmark of the filesystem and filesystem cache's ability to return file metadata.
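
What follows is not the actual gist, just a minimal sketch of the kind of parallel walk it performs (the structure and output format are my own approximation):

#!/usr/bin/env python
# Minimal sketch of a parallel directory walk; not the actual gist.
import multiprocessing
import os
import sys
import time

def walk(path):
    """Recursively lstat() every entry under path, exercising readdir()."""
    count = 0
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            os.lstat(os.path.join(root, name))
            count += 1
    return count

if __name__ == "__main__":
    path, workers = sys.argv[1], int(sys.argv[2])
    start = time.time()
    with multiprocessing.Pool(workers) as pool:
        # Every worker walks the same tree, mimicking parallel read-only I/O.
        pool.map(walk, [path] * workers)
    print("ran %d walks across %d processes in %.3fs"
          % (workers, workers, time.time() - start))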

When you tell it to walk a very large directory tree - say a Firefox version control checkout (which has over 250,000 files) - the excessive time spent in the kernel is very apparent on macOS 10.13 High Sierra:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s

real    2m52.470s
user    1m54.053s
sys    23m42.808s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 523.440s

real    8m43.740s
user    1m13.397s
sys     3m50.687s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 210.487s

real    3m30.731s
user    2m40.216s
sys    33m34.406s

On the same machine upgraded to macOS 10.14 Mojave, we see a bit of a speedup!:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 97.833s

real    1m37.981s
user    1m40.272s
sys    10m49.091s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 461.415s

real    7m41.657s
user    1m05.830s
sys     3m47.041s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 140.474s

real    2m20.727s
user    3m01.048s
sys    17m56.228s

Contrast with my i7-6700K Linux machine backed by EXT4:

$ time ./slow-readdir.py -l 8 ~/src/firefox
ran 8 walks across 8 processes in 6.018s

real    0m6.191s
user    0m29.670s
sys     0m17.838s

$ time ./slow-readdir.py -l 8 -j 1 ~/src/firefox
ran 8 walks across 1 processes in 33.958s

real    0m34.164s
user    0m17.136s
sys     0m13.369s

$ time ./slow-readdir.py -l 12 -j 12 ~/src/firefox
ran 12 walks across 12 processes in 25.465s

real    0m25.640s
user    1m4.801s
sys     1m20.488s

It is apparent that macOS 10.14 Mojave has received performance work relative to macOS 10.13! Overall kernel CPU time when performing parallel directory walks has decreased substantially - to ~50% of original on some invocations! Stacks seem to reveal new code for lock acquisition, so this might indicate generic improvements to the kernel's locking mechanism rather than APFS specific changes. Changes to file metadata caching could also be responsible for performance changes. Although it is difficult to tell without access to the APFS source code. Despite those improvements, APFS is still spending a lot of CPU time in the kernel. And the kernel CPU time is still comparatively very high compared to Linux/EXT4, even for single process operation.

At this time, I haven't conducted a comprehensive analysis of APFS to determine what other filesystem operations seem to acquire global kernel locks: all I know is readdir() does. A casual analysis of profiled stacks when running Mercurial's test harness against Python 3.7 seems to show apfs_* functions still on the stack a lot and that seemingly indicates more APFS slowness under parallel I/O load. But HFS+ exhibited similar problems (it appeared HFS+ used a single I/O thread inside the kernel for many operations, making I/O on macOS pretty bad), so I'm not sure if these could be considered regressions the way readdir()'s new behavior is.

I've reported this issue to Apple at https://bugreport.apple.com/web/?problemID=45648013 and on OpenRadar at https://openradar.appspot.com/radar?id=5025294012383232. I'm told that issues get more attention from Apple when there are many duplicates of the same issue. So please reference this issue if you file your own report.

Now that I've elaborated on the technical details, I'd like to add some personal commentary. While this post is about APFS, this issue of global kernel locks during common I/O operations is not unique to APFS. I already referenced similar issues in AUFS. And I've encountered similar behaviors with Btrfs (although I can't recall exactly which operations). And NTFS has its own bag of problems.

This seeming pattern of global kernel locks for common filesystem operations and slow filesystems is really rubbing me the wrong way. Modern NVMe SSDs are capable of reading and writing well over 2 gigabytes per second and performing hundreds of thousands of I/O operations per second. We even have Intel soon producing persistent solid state storage that plugs into DIMM slots because it is that friggin fast.

Today's storage hardware is capable of ludicrous performance. It is fast enough that you will likely saturate multiple CPU cores processing the read or written data coming from and going to storage - especially if you are using higher-level, non-JITed (read: slower) programming languages (like Python). There has also been a trend of systems gaining CPU cores faster than they gain instructions per second per core. And SSDs only achieve these ridiculous IOPS numbers if many I/O operations are queued and can be more efficiently dispatched within the storage device. What this all means is that it probably makes sense to use parallel I/O across multiple threads in order to extract all potential performance from your persistent storage layer.

It's also worth noting that we now have solid state storage that outperforms (in some dimensions) what DRAM from ~20 years ago was capable of. Put another way, I/O APIs and even some filesystems were designed in an era when RAM was slower than what today's persistent storage is capable of! While I'm no filesystems or kernel expert, it does seem a bit silly to be using APIs and filesystems designed for an era when storage was multiple orders of magnitude slower and systems only had a single CPU core.

My takeaway is I can't help but feel that systems-level software (including the kernel) is severely limiting the performance potential of modern storage devices. If we have e.g. global kernel locks when performing common I/O operations, there's no chance we'll come close to harnessing the full potential of today's storage hardware. Furthermore, the behavior of filesystems is woefully under-documented and software developers have little solid advice for how to achieve optimal I/O performance. As someone who cares about performance, I want to squeeze every iota of potential out of hardware. But the lack of documentation telling me which operations acquire locks, which strategies are best for, say, reading or writing 10,000 files using N threads, etc makes this extremely difficult. And even if this documentation existed, because of differences in behavior across filesystems and operating systems and the difficulty in programmatically determining the characteristics of filesystems at run time, it is practically impossible to design a one-size-fits-all approach to high performance I/O.

The filesystem is a powerful concept. I want to agree with and use the everything is a file philosophy. Unfortunately, filesystems don't appear to be scaling very well to support the potential of modern day storage technology. We're probably at the point where commodity priced solid state storage is far more capable than today's software for the majority of applications. Storage hardware manufacturers will keep producing faster and faster storage and their marketing teams will keep convincing us that we need to buy it. But until software catches up, chances are most of us won't come close to realizing the true potential of modern storage hardware. And that's even true for specialized applications that do employ tricks taking hundreds or thousands of person hours to implement in order to eke out every iota of performance potential. The average software developer and application using filesystems as they were designed to be used has little to no chance of coming close to utilizing the performance potential of modern storage devices. That's really a shame.

