Gregory Szorc's Digital Home

Using Rust to Power Python Importing With oxidized_importer

May 10, 2020 at 01:15 PM | categories: Python, PyOxidizer

I'm pleased to announce the availability of the oxidized_importer Python package, a standalone version of the custom Python module importer used by PyOxidizer. oxidized_importer - a Python extension module implemented in Rust - enables Python applications to start and run quicker by providing an alternate, more efficient mechanism for loading Python resources (such as source and bytecode modules).

Installation instructions and detailed usage information are available in the official documentation. The rest of this post hopefully answers the questions of why are you doing this and why should I care.

In a traditional Python process, Python's module importer inspects the filesystem at run-time to find and load resources like Python source and bytecode modules. It is highly dynamic in nature and relies on the filesystem as a point-in-time source of truth for resource availability.

oxidized_importer takes a different approach to resource loading that is more static in nature and more suitable to application environments (where Python resources aren't changing). Instead of dynamically probing the filesystem for available resources, resources are instead indexed ahead of time. When Python goes to resolve a resource (say it is looking to import a module), oxidized_importer simply needs to perform a lookup in an in-memory data structure to locate said resource. This means oxidized_importer only has marginal reliance on the filesystem, which can make it much faster than Python's traditional importer. (Performance benefits of binaries built with PyOxidizer have already been clearly demonstrated.)

The oxidized_importer Python extension module exposes parts of PyOxidizer's packaging and run-time functionality to Python code, without requiring the full use of PyOxidizer for application packaging. Specifically, oxidized_importer allows you to:

Install a custom, high-performance module importer (OxidizedFinder) to service Python import statements and resource loading (potentially from memory, using zero-copy).
Scan the filesystem for Python resources (source modules, bytecode files, package resources, distribution metadata, etc) and turn them into Python objects, which can be loaded into OxidizedFinder instances.
Serialize Python resource data into an efficient binary data structure for loading into an OxidizedFinder instance. This facilitates producing a standalone resources blob that can be distributed with a Python application which contains all the Python modules, bytecode, etc required to power that application. See the docs on freezing an application with oxidized_importer.

oxidized_importer can be thought of as PyOxidizer-lite: it provides just enough functionality to allow Python application maintainers to leverage some of the technical advancements of PyOxidizer (such as in-memory module imports) without using PyOxidizer for application packaging. oxidized_importer can work with the Python distribution already installed on your system. You just pip install it like any other Python package.

By releasing oxidized_importer as a standalone Python package, my hope is to allow more people to leverage some of the technical achievements and performance benefits coming out of PyOxidizer. I also hope that having more users of PyOxidizer's underlying code will help uncover bugs and conformance issues, raising the quality and viability of the projects.

I would also like to use oxidized_importer as an opportunity to advance the discourse around Python's resource loading mechanism. Filesystem I/O can be extremely slow, especially in mobile and embedded environments. Dynamically probing the filesystem to service module imports can therefore be slow. (The Python standard library has the zipimport module for importing Python resources from a zip file. But in my opinion, we can do much better.) I would like to see Python move towards leveraging immutable, serialized data structures for loading resources as efficiently as possible. After all, Python resources like the Python standard library are likely not changing between Python process invocations. The performance zealot in me cringes thinking of all the overhead that Python's filesystem probing approach incurs - all of the excessive stat() and other filesystem I/O calls that must be performed to answer questions about state that is easily indexed and often doesn't change. oxidized_importer represents my vision for what a high-performance Python resource loader should look like. I hope it can be successful in steering Python towards a better approach for resource loading.

I plan to release oxidized_importer independently from PyOxidizer. While the projects will continue to be developed in the same repository and will leverage the same underlying Rust code, I view them as somewhat independent and serving different audiences.

While oxidized_importer evolved from facilitating PyOxidizer's run-time use cases, I'm not opposed to taking it in new directions. For example, I would entertain implementing Python's dynamic filesystem probing logic in oxidized_importer, allowing it to serve as a functional stand-in for the official importer shipped with the Python standard library. I have little doubt an importer implemented in 100% Rust would outperform the official importer, which is implemented in Python. There's all kinds of possibilities here, such as using a background thread to index sys.path outside the constraints of the GIL. But I don't want to get ahead of myself...

If you are a Python application maintainer and want to make your Python processes execute a bit faster by leveraging a pre-built index of available Python resources and/or taking advantage of in-memory module importing, I highly encourage you to take a look at oxidized_importer!

PyOxidizer 0.7

April 09, 2020 at 09:00 PM | categories: Python, PyOxidizer

I am very pleased to announce the 0.7 release of PyOxidizer, a modern Python application packaging tool.

There are a host of notable new features in this release. You can read all about them in the project history.

I want to use this blog post to call out the more meaningful ones.

I started PyOxidizer as a science experiment of sorts: I sat out to prove the hypothesis that it was possible to produce high performance single file executables embedding Python and all of its resources (Python modules, non-module resource files, compiled extensions, etc). PyOxidizer has achieved this on Windows, Linux, and macOS since its very earliest releases. Hypothesis confirmed!

In order to actually achieve single file executables, you have to fundamentally change aspects of Python's behavior. Some of these changes invalidate deeply rooted assumptions about how Python works, such as the existence of __file__ in modules. As you can imagine, these broken assumptions translated to numerous compatibility issues and PyOxidizer didn't work with many popular Python packages.

With the science experiment phase of PyOxidizer out of the way, I have been making a concerted effort to broaden the user base of PyOxidizer. While single file executables can be an amazing property, it isn't critical for many use cases and the issues it was causing were preventing people from exploring PyOxidizer.

This brings us to what I think are the major new features in PyOxidizer 0.7.

Better Support for Loading Extension Modules

Earlier versions of PyOxidizer insisted that you compile Python (C) extension modules from source and statically link them into a produced binary. This requirement prevented the use of pre-built extension modules (commonly found in Python binary wheels available on PyPI) with PyOxidizer, forcing people to compile them locally. While this often just worked for many extension modules, it frequently failed on complex extension modules and it frequently failed on Windows.

PyOxidizer now supports loading compiled extension modules from standalone files (typically .so or .pyd files, which are actually shared libraries). There are still some sharp edges and known deficiencies. But in many cases, if you tell PyOxidizer to run pip install and package the result, pre-built wheels can be installed and PyOxidizer will pick up the standalone files.

On Windows, PyOxidizer even supports embedding the shared library data into the produced .exe and loading the .pyd/DLL directly from memory.

Loading Resources from the Filesystem

Binaries built with PyOxidizer contain a blob holding an index of available Python resources along with their data.

Earlier versions of PyOxidizer only allowed you to define resources as in-memory. If the resource was defined in this blob, it was imported from memory. Otherwise it wasn't known to PyOxidizer. You could still install files next to the produced binary and tell PyOxidizer to enable Python's default filesystem-based importer. But PyOxidizer didn't explicitly know about these files on the filesystem.

In PyOxidizer 0.7, the blob index of Python resources is able to express different locations for that resource. Currently, a resource can have its data made available in-memory or filesystem-relative. in-memory works as before: the raw data is embedded next to the next in memory and loaded from there (using 0-copy). filesystem-relative encodes a filesystem path to the resource. During packaging, PyOxidizer will place the resource next to the executable (using a typical Python file layout scheme) and store the relative path to that resource in the resources index.

The filesystem-relative resource indexing feature has a few implications for PyOxidizer.

First, it is more standard. When PyOxidizer loads a Python module from the filesystem, it sets __file__, __path__, etc and the module semantics should behave as if the file were imported by Python's standard importer. This means that if a package is having issues with in-memory importing, you can simply fall back to filesystem-relative to get standard Python behavior and everything should just work.

Second, PyOxidizer's filesystem resource loading is faster than Python's! When Python's standard importer goes to import a module, it needs to stat() various paths to first locate the file. It then performs some sanity checking and other minor actions before actually importing the module. All of this has overhead. Since the goal of PyOxidizer is to produce standalone applications and applications should be immutable, PyOxidizer can avoid most of this overhead. PyOxidizer simply tries to open() and read() the relative path baked into the resource index at build time. If that works, the resource is loaded. Else there is a failure. The code path in PyOxidizer to locate a Python resource is effectively a lookup in a Rust HashMap<&str, T>.

I thought it would be interesting to isolate the performance benefits of this new feature. I ran Mercurial's test harness with different variants of hg on Linux on my Ryzen 3950X.

traditional - A hg script with a #!/path/to/python3.7 shebang.
oxidized - A hg executable built with PyOxidizer, without PyOxidizer's custom module importer.
filesystem - A hg executable built with PyOxidizer using the new filesystem-relative resource index.
in-memory - A hg executable built with PyOxidizer with all resources loaded from memory (how PyOxidizer has traditionally worked).

The results are quite clear:

Variant	CPU Time (s)	Delta (s)	% Orig
traditional	11,287	-552	100
oxidized	10,735	-552	95.1
filesystem	10,186	-1,101	90.2
in-memory	9,883	-1,404	87.6

We see a nice win just from using a native executable built with PyOxidizer (traditional to oxidized).

Then from oxidized to filesystem we see another jump of ~5%. This difference is attributed to using PyOxidizer's Rust-powered importer with an index of resources available on the filesystem. In other words, all that work that Python's standard importer is doing to discover files and then operate on them is non-trivial!

Finally, the smaller jump from filesystem to in-memory isolates the benefits of importing resource data from memory instead of involving filesystem I/O. (Filesystems are generally slow.) While I haven't measured explicitly, I hypothesize that macOS and Windows will see a bigger jump between these two variants, as the filesystem performance on these platforms generally isn't as good as it is on Linux.

PyOxidizer's Future

With PyOxidizer now supporting a couple of much-needed features to support a broader set of users, I'm hoping that future releases of PyOxidizer continue to broaden the utility of PyOxidizer.

The over-arching goal of PyOxidizer is to solve large aspects of the Python application packaging and distribution problem. So far a lot of focus has been spent on the former. PyOxidizer in its current form can materialize files on the filesystem that you can copy or package up manually and distribute. But I want these processes to be part of PyOxidizer: I want it to be possible for PyOxidizer to emit a Windows MSI installer, a macOS dmg, a Debian package, etc for a Python application.

In order to support the aforementioned marquee features of this PyOxidizer release, I had to pay down a lot of technical debt in the code base left over from the science experiment phase of PyOxidizer's inception.

In the short term, I plan to continue shoring up the code base and rounding out support for features requested in the issue tracker on GitHub. The next release of PyOxidizer will also likely require Python 3.8, as this will improve run-time control over the embedded Python interpreter and enable PyOxidizer to better support package metadata (importlib.metadata), enabling support for features like entry points.

I've also been thinking about extracting PyOxidizer's custom module importer to be usable as a standalone Python extension module. I think there's some value in publishing a pyoxidizer_importer package on PyPI that you can easily add to your installed packages to speed up Python's standard filesystem importer by a few percent. If nothing else, this may drum up interest in the larger Python community for standardizing a format for serializing Python resources in a single file. Perhaps we can get other Python packaging tools producing the same packed resources data blob that PyOxidizer uses so we can all standardize on a more efficient mechanism for loading Python modules. Time will tell.

Enjoy the new release. File issues at https://github.com/indygreg/PyOxidizer as you encounter them.

Mercurial's Journey to and Reflections on Python 3

January 13, 2020 at 08:45 AM | categories: Python, Mercurial

Mercurial 5.2 was released on November 5, 2019. It is the first version of Mercurial that supports Python 3. This milestone comes nearly 11 years after Python 3.0 was first released on December 3, 2008.

Speaking as a maintainer of Mercurial and an avid user of Python, I feel like the experience of making Mercurial work with Python 3 is worth sharing because there are a number of lessons to be learned.

This post is logically divided into two sections: a mostly factual recount of Mercurial's Python 3 porting effort and a more opinionated commentary of the transition to Python 3 and the Python language ecosystem as a whole. Those who don't care about the mechanics of porting a large Python project to Python 3 may want to skip the next section or two.

Porting Mercurial to Python 3

Let's start with a brief history lesson of Mercurial's support for Python 3 as told by its own commit history.

The Mercurial version control tool was first released in April 2005 (the same month that Git was initially released). Version 1.0 came out in March 2008. The first reference to Python 3 I found in the code base was in September 2008. Then not much happens for a while until June 2010, when someone authors a bunch of changes to make the Python C extensions start to recognize Python 3. Then things were again quiet for a while until January 2013, when a handful of changes landed to remove 2 argument raise. There were a handful of commits in 2014 but nothing worth calling out.

Mercurial's meaningful journey to Python 3 started in 2015. In code, the work started in April 2015, with effort to make Mercurial's test harness run with Python 3. Part of this was a decision that Python 3.5 (to be released several months later in September 2015) would be the minimum Python 3 version that Mercurial would support.

Once the Mercurial Project decided it wanted to port to Python 3 (as opposed to another language), one of the earliest decisions was how to perform that port. Mercurial's code base was too large to attempt a flag day conversion where there would be a Python 2 version and a Python 3 version and one day everyone would switch from Python 2 to 3. Mercurial needed a way to run the same code (or as much of the same code) on both Python 2 and 3. We would maintain a single code base and users would gradually switch from running with Python 2 to Python 3.

In May 2015, Mercurial dropped support for Python 2.4 and 2.5. Dropping support for these older Python versions was critical, as it was effectively impossible to write Python code that ran on this wide gamut of versions because of incompatibilities in syntax and language features. For example, you needed Python 2.6 to get print() via from __future__ import print_function. The project's late start at a Python 3 port can be significantly attributed to Python 2.4 and 2.5 compatibility holding us back.

The main goal with Mercurial's early porting work was just getting the code base to a point where import mercurial would work. There were a myriad of places where Mercurial used syntax that was invalid on Python 3 and Python 3 couldn't even parse the source code, let alone compile it to bytecode and execute it.

This effort began in earnest in June 2015 with global source code rewrites like using modern octal syntax, modern exception catching syntax (except Exception as e instead of except Exception, e), print() instead of print, and a modern import convention along with the use of from __future__ import absolute_import.

In the early days of the port, our first goal was to get all source code parsing as valid Python 3. The next step was to get all the modules importing cleanly. This entailed fixing code that ran at import time to work on Python 3. Our thinking was that we would need the code base to be import clean on Python 3 before seriously thinking about run-time behavior. In reality, we quickly ported a lot of modules to import cleanly and then moved on to higher-level porting, leaving a long-tail of modules with import failures.

This initial porting effort played out over months. There weren't many people working on it in the early days: a few people would basically hack on Python 3 as a form of itch scratching and most of the project's energy was focused on improving the existing Python 2 based product. You can get a rough idea of the timeline and participation in the early porting effort through the history of test-check-py3-compat.t. We see the test being added in December 2015, By June 2016, most of the code base was ported to our modern import convention and we were ready to move on to more meaningful porting.

One of the biggest early hurdles in our porting effort was how to overcome the string literals type mismatch between Python 2 and 3. In Python 2, a '' string literal is a sequence of bytes. In Python 3, a '' string literal is a sequence of Unicode code points. These are fundamentally different types. And in Mercurial's code base, most of our string types are binary by design: use of a Unicode based str for representing data is flat out wrong for our use case. We knew that Mercurial would need to eventually switch many string literals from '' to b'' to preserve type compatibility. But doing so would be problematic.

In the early days of Mercurial's Python 3 port in 2015, Mercurial's project maintainer (Matt Mackall) set a ground rule that the Python 3 port shouldn't overly disrupt others: he wanted the Python 3 port to more or less happen in the background and not require every developer to be aware of Python 3's low-level behavior in order to get work done on the existing Python 2 code base. This may seem like a questionable decision (and I probably disagreed with him to some extent at the time because I was doing Python 3 porting work and the decision constrained this work). But it was the correct decision. Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case). What Matt was trying to do was minimize the externalized costs that a Python 3 port would inflict on the project. He correctly recognized that maintaining the existing product and supporting existing users was more important than a long-term bet in its infancy.

This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

In addition, there were some other practical issues with doing a bulk b'' prefix insertion. One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code. That would require manual intervention and would significantly slow down porting. And a sub-issue of adding all the b prefixes and reformatting code is that it would break annotate/blame more than was tolerable. The latter issue was addressed by teaching Mercurial's annotate/blame feature to skip revisions. The project now has a convention of annotating commit messages with # skip-blame <reason> so structural only changes can easily be ignored when performing an annotate/blame.

A stop-gap solution to the b'' everywhere issue came in July 2016, when I introduced a custom Python module importer that rewrote source code as part of import when running on Python 3. (I have previously blogged about this hack.) What this did was transparently add b'' prefixes to all un-prefixed string literals as well as modify how a few common functions were called so that we wouldn't need to modify source code so things would run natively on Python 3. The source transformer allowed us to have the benefits of progressing in our Python 3 port without having to rewrite tens of thousands of lines of source code. The solution was hacky. But it enabled us to make significant progress on the Python 3 port without externalizing a lot of cost onto others.

I thought the source transformer would be relatively short-lived and would be removed shortly after the project inevitably decided to go all in on Python 3. To my surprise, others built additional transforms over the years and the source transformer persisted all the way until October 2019, when I removed it just before the first non-alpha Python 3 compatible version of Mercurial was released.

A common problem Mercurial faced with making the code base dual Python 2/3 native was dealing with standard library differences. Most of the problems stemmed from changes between Python 2.7 and 3.5+. But there are changes within the versions of Python 3 that we had to wallpaper over as well. In April 2016, the mercurial.pycompat module was introduced to export aliases or wrappers around standard library functionality to abstract the differences between Python versions. This file grew over time and eventually became Mercurial's version of six. To be honest, I'm not sure if we should have used six from the beginning. six probably would have saved some work. But we had to eventually write a lot of shims for converting between str and bytes and would have needed to invent a pycompat layer in some form anyway. So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)

Once we had a source transforming module importer and the pycompat compatibility shim, we started to focus in earnest on making core functionality actually work on Python 3. We established a convention of annotating changesets needed for Python 3 with py3, so a commit message search yields a lot of the history. (But it isn't a full history since not every Python 3 oriented change used this convention). We see from that history that after the source importer landed, a lot of porting effort was spent on things very early in the hg process lifetime. This included handling environment variables, loading config files, and argument parsing. We introduced a test-check-py3-commands.t test to track the progress of hg commands working in Python 3. The very early history of that file shows the various error messages changing, as underlying early process functionality was slowly ported to work on Python 3. By December 2016, we had hg version working on Python 3!

With basic hg command dispatch ported to Python 3 at the end of 2016, 2017 represented an inflection point in the Python 3 porting effort. With the early process functionality working, different people could pick up different commands and code paths and start making code work with Python 3. By March 2017, basic repository opening and hg files worked. Shortly thereafter, hg init started working as well. And hg status and hg commit did as well.

Within a few months, enough of Mercurial's functionality was working with Python 3 that we started to track which tests passed on Python 3. The evolution of this file shows a reasonable history of the porting velocity.

In May 2017, we dropped support for Python 2.6. This significantly reduced the complexity of supporting Python 3, as there was tons of functionality in Python 2.7 that made it easier to target both Python 2 and 3 and now our hands were untied to utilize it.

In November 2017, I landed a test harness feature to report exceptions seen during test runs. I later refined the output so the most frequent failures were reported more prominently. This feature greatly enabled our ability to target the most common exceptions, allowing us to write patches to fix the most prevalent issues on Python 3 and uncover previously unknown failures.

By the end of 2017, we had most of the structural pieces in place to complete the port. Essentially all that was required at that point was time and labor. We didn't have a formal mechanism in place to target porting efforts. Instead, people would pick up a component or test that they wanted to hack on and then make incremental changes towards making that work. All the while, we didn't have a strict policy on not regressing Python 3 and regressions in Python 3 porting progress were semi-frequent. Although we did tend to correct regressions quickly. And over time, developers saw a flurry of Python 3 patches and slowly grew awareness of how to accommodate Python 3, and the number of Python 3 regressions became less frequent.

As useful as the source-transforming module importer was, it incurred some additional burden for the porting effort. The source transformer effectively converted all un-prefixed string literals ('') to bytes literals (b'') to preserve string type behavior with Python 2. But various aspects of Python 3 didn't like the existence of bytes. Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes. So our pycompat layer grew pretty large to accommodate calling into various standard library functionality. Another side-effect which we didn't initially anticipate was the **kwargs calling convention. Python allows you to use ** with a dict with string keys to turn those keys into named arguments in a function call. But Python 3 requires these dict keys to be str and outright rejects bytes keys, even if the bytes instance is ASCII safe and has the same underlying byte representation of the string data as the str instance would. So we had to invent support functions that would convert dict keys from bytes to str for use with **kwargs and another to convert a **kwargs dict from str keys to bytes keys so we could use '' syntax to access keys in our source code! Also on the string type front, we had to sprinkle the codebase with raw string literals (r'') to force the use of str irregardless of which Python version you were running on (our source transformer only changed unprefixed string literals, so existing r'' strings would be preserved as str).

Blind transformation of all string literals to bytes was less than ideal and it did impose some unwanted side-effects. But, again, most strings in Mercurial are bytes by design, so we thought it would be easier to byteify all strings then selectively undo that where native strings were actually warranted (like keys in most dicts) than to take the up-front cost to examine every string and make an intelligent determination as to what type it should be. I go back and forth as to whether this was the correct call. But when you factor in that the source transforming module importer unblocked Python 3 porting at a time in the project's history when there was so much focus on improving the core product and it did so without externalizing many costs onto the people doing the critical core product work, I think it was the right call.

By mid 2019, the number of test failures in Python 3 had been whittled down to a reasonable, less daunting number. It felt like victory was in grasp and inevitable. But a few significant issues lingered.

One remaining question was around addressing differences between Python 3 versions. At the time, Python 3.5, 3.6, and 3.7 were released and 3.8 was scheduled for release by the end of the year. We had a surprising number of issues with differences in Python 3 versions. Many of us were running Python 3.7, so it had the fewest failures. We had to spend extra effort to get Python 3.5 and 3.6 working as well as 3.7. Same for 3.8.

Another task we deferred until the second half of 2019 was standing up robust CI for Python 3. We had some coverage, but it was minimal. Wanting a distraction from PyOxidizer for a bit and wanting to overhaul Mercurial's CI system (which is officially built on Buildbot), I cobbled together a serverless CI system built on top of AWS DynamoDB and S3 for storage, Lambda functions and CloudWatch events for all business logic, and EC2 spot instances for job execution. This CI system executed Python 3.5, 3.6, 3.7, and 3.8 variants of our test harness on Linux and Python 3.7 on Windows. This gave developers insight into version-specific failures. More importantly, it also gave insight into Windows failures, which was previously not well tested. It was discovered that Python 3 on Windows was lagging significantly behind POSIX.

By the time of the Mercurial developer meetup in October 2019, nearly all tests were passing on POSIX platforms and we were confident that we could declare Python 3 support as at least beta quality for the Mercurial 5.2 release, planned for early November.

One of our blockers for ripping off the alpha label on Python 3 support was removing our source-transforming module importer. It had performance implications and it wasn't something we wanted to ship because it felt too hacky. A blocker for this was we wanted to automatically format our source tree with black because if we removed the source transformer, we'd have to rewrite a lot of source code to apply changes the transformer was performing, which would necessitate wrapping a lot of lines, which would involve a lot of manual effort. We wanted to blacken our code base first so that mass rewriting source code wouldn't involve a lot of tedious reformatting since black would handle that for us automatically. And rewriting the source tree with black was blocked on a specific feature landing in black! (We did not agree with black's behavior of unwrapping comma-delimited lists of items if they could fit on a single line. So one of our core contributors wrote a patch to black that changed its behavior so a trailing , in a list of items will force items to be formatted on multiple lines. I personally find the multiple line formatting much easier to read. And the behavior is arguably better for code review and annotation, which is line based.) Once this feature landed in black, we reformatted our source tree and started ripping out the source transformations, starting by inserting b'' literals everywhere. By late October, the source transformer was no more and we were ready to release beta quality support for Python 3 (at least on UNIX-like platforms).

Having described a mostly factual overview of Mercurial's port to Python 3, it is now time to shift gears to the speculative and opinionated parts of this post. I want to underscore that the opinions reflected here are my own and do not reflect the overall Mercurial Project or even a consensus within it.

The Future of Python 3 and Mercurial

Mercurial's port to Python 3 is still ongoing. While we've shipped Python 3 support and the test harness is clean on Python 3, I view shipping as only a milestone - arguably the most important one - in a longer journey. There's still a lot of work to do.

It is now 2020 and Python 2 support is now officially dead from the perspective of the Python language maintainers. Linux distributions are starting to rip out Python 2. Packages are dropping Python 2 support in new versions. The world is moving to Python 3 only. But Mercurial still officially supports Python 2. And it is still yet to be determined how long we will retain support for Python 2 in the code base. We've only had one release supporting Python 3. Our users still need to port their extensions (implemented in Python). Our users still need to start widely using Mercurial with Python 3. Even our own developers need to switch to Python 3 (old habits are hard to break).

I anticipate a long tail of random bugs in Mercurial on Python 3. While the tests may pass, our code coverage is not 100%. And even if it were, Python is a dynamic language and there are tons of invariants that aren't caught at compile time and can only be discovered at run time. These invariants cannot all be detected by tests, no matter how good your test coverage is. This is a feature/limitation of dynamic languages. Our users will likely be finding a long tail of miscellaneous bugs on Python 3 for years.

At present, our code base is littered with tons of random hacks to bridge the gap between Python 2 and 3. Once Python 2 support is dropped, we'll need to remove these hacks and make the source tree Python 3 native, with minimal shims to wallpaper over differences in Python 3 versions. Removing this Python version bridge code will likely require hundreds of commits and will be a non-trivial effort. It's likely to be deemed a low priority (it is glorified busy work after all), and code for the express purpose of supporting Python 2 will likely linger for years.

We are also still shoring up our packaging and distribution story on Python 3. This is easier on some platforms than others. I created PyOxidizer partially because of the poor experience I had with Python application packaging and distribution through the Mercurial Project. The Mercurial Project has already signed off on using PyOxidizer for distributing Mercurial in the future. So look for an oxidized Mercurial distribution in the near future! (You could argue PyOxidizer is an epic yak shave to better support Mercurial. But that's for another post.)

Then there's Windows support. A Python 3 powered Mercurial on Windows still has a handful of known issues. It may require a few more releases before we consider Python 3 on Windows to be stable.

Because we're still on a code base that must support Python 2, our adoption of Python 3 features is very limited. The only Python 3 feature that Mercurial developers seem to almost universally get excited about is type annotations. We already have some people playing around with pytype using comment-based annotations and pytype has already caught a few bugs. We're eager to go all in on type annotations and uncover lots of dynamic typing bugs and poorly implemented APIs. Beyond type annotations, I can't name any feature that people are screaming to adopt and which makes a lot of sense for Mercurial. There's a long tail of minor features I'm sure will get utilized. But none of the marquee features that define major language releases seem that interesting to us. Time will tell.

Commentary on Python 3

Having described Mercurial's ongoing journey to Python 3, I now want to focus more on Python itself. Again, the opinions here are my own and don't reflect those of the Mercurial Project.

Succinctly, my experience porting Mercurial and other projects to Python 3 has significantly soured my perceptions of Python. As much as I have historically loved Python - from the language to the welcoming community - I am still struggling to understand how Python could manage to inflict so much hardship on the community by choosing the transition plan that they did. I believe Python's choices represent a terrific example of what not to do when managing a large project or ecosystem. Maintainers of other largely-deployed systems would benefit from taking the time to understand and reflect on Python's missteps.

Python 3.0 was released on December 3, 2008. And it took the better part of a decade for the community to embrace it. This should be universally recognized as a failure. While hindsight is 20/20, many of the issues with Python 3 were obvious at the time and could have been mitigated had the language maintainers been more accommodating - and dare I say empathetic - to its users.

Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility. In the early years of Python 3, the attitude of Python's maintainers was Python 3 is a new, better language: you should target it explicitly. There were some tools and methods to ease the transition. But nothing super polished, especially in the early years. Adoption of Python 3 in the overall community was slow. Python developers in the wild justifiably complained that the value proposition of Python 3 was too weak to justify porting effort. Not helping was that the early advice for targeting Python 3 was to rewrite the source code to become Python 3 native. This is in contrast with using the same source to run on both Python 2 and 3. For library and application maintainers, this potentially meant maintaining separate versions of your code or forcing end-users to make a giant leap, which would realistically orphan users on an old version, fragmenting your user base. Neither of those were great alternatives, so you can understand why many projects didn't bite.

For many projects of non-trivial size, flag day transitions from Python 2 to 3 were simply not viable: the pathway to Python 3 was to make code dual Python 2/3 compatible and gradually switch over the runtime to Python 3. But initial versions of Python 3 made this effectively impossible! Let me give a few specific examples.

In Python 2, a string literal '' is effectively an array of bytes. In Python 3, it is a series of Unicode code points - a fundamentally different type! In Python 2, you could write b'' to be explicit that a string literal was bytes or you could write u'' to indicate a Unicode literal, mimicking Python 3's behavior. In Python 3, you could write b'' to create a bytes instance. But for whatever reason, Python 3 initially removed the u'' syntax, meaning there wasn't as easy way to explicitly denote the type of each string literal so that it was consistent between Python 2 and 3! Python 3.3 (released September 2012) restored u'' support, making it more viable to write Python source code that worked on both Python 2 and 3. For nearly 4 years, Python 3 took away the consistent syntax for denoting bytes/Unicode string literals.

Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015. Fun fact: the lack of this feature was once considered a blocker for Mercurial moving to Python 3 because Mercurial uses bytes almost universally, which meant that nearly every use of % would have to be changed to something else. And to this day, Python 3's bytes still doesn't have a format() method, so the alternative was effectively string concatenation, which is a massive step backwards from the expressiveness of % formatting.

The initial approach of Python 3 mirrors a folly that many developers and projects make: attempting a rewrite instead of performing incremental evolution. For established projects, large scale rewrites often go poorly. And Python 3 is no exception. Yes, from a code level, CPython (and likely other Python implementations) were incremental changes over Python 2 using the same code base. But from a language and standard library level, the differences in Python 3 were significant enough that I - and even Python's core maintainers - considered it a new language, and therefore a rewrite. When your random project attempts a rewrite and fails, the blast radius of that is often contained to that project. Maybe you don't publish a new release as soon as you otherwise would. But when you are powering an ecosystem, the ripple effects from a failed rewrite percolate throughout that ecosystem and last for years and have many second order effects. We see this with Python 3, where poor choices made in the late 2000s are inflicting significant hardship still in 2020.

From the initial restrained adoption of Python 3, it is obvious that the Python ecosystem overwhelmingly rejected the initial boil the oceans approach of Python 3. Python's maintainers eventually got the message and started restoring features like u'' and bytes % formatting back into the language to placate the community. All the while Python 3 had been accumulating new features and the cumulative sum of those features was compelling enough to win over users.

For many projects (including Mercurial), Python 3.4/3.5 was the first viable porting target for Python 3. Python 3.5 was released in September 2015, almost 7 years after Python 3.0 was released in December 2008. Seven. Years. An ecosystem that falters for that long is generally not healthy. What may have saved Python from total collapse here is that Python 2 was still going strong and people were generally happy with it. I really do think Python dodged a bullet here, because there was a massive window where the language could have hemorrhaged a critical amount of its user base and been relegated to an afterthought. One could draw an analogy to Perl, which lost out to PHP, Python, and Ruby, and whose fall from grace aligned with a lengthy transition from Perl 5 to 6.

If you look back at the early history of Python 3, I think you are forced to conclude that Python effectively kneecapped itself for 5-7 years through questionable implementation choices that prevented users from incurring incremental transitions between the major language versions. 2008 to 2013-2015 should be known as the lost years of Python because so much opportunity and energy was squandered. Yes, Python is still healthy today and Python 3 is (finally) being adopted at scale. But had earlier versions of Python 3 been more empathetic towards Python 2 users porting to it, Python and Python 3 in 2020 would be even stronger than it is. The community was artificially hindered for years. And we won't know until 2023-2025 what things could have looked like in 2020 had the Python core language team spent more time paving a smoother road between the major language versions.

To be clear, I do think Python 3 is generally a better language than Python 2. It has fewer warts, more compelling features, and better performance (except for startup time, which is still slower than Python 2). I am ecstatic the community is finally rallying around Python 3! For my Python coding, it has reached the point where I curse under my breath when I need to support Python 2 or even older versions of Python 3, like 3.5 or 3.6: I just wish the world would move on and adopt the future already!

But I would be remiss if I failed to mention some of my gripes with Python 3 beyond the transition shenanigans.

Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode. In Python 2, the default string type was backed by bytes. In Python 3, the default string type is backed by Unicode code points. As part of that transition, large parts of the standard library now operate in the Unicode space instead of the domain of bytes. I understand why Python does this: they want strings to be Unicode and don't want users to have to spend that much energy thinking about when to use str versus bytes. This approach is admirable and somewhat defensible because it takes a stand on a solution that is arguably good enough for most users. However, the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications (like version control tools).

There are a myriad of places in Python's standard library where Python insists on using the Unicode-backed str type and rejects bytes. For example, various networking modules refuse to accept bytes for hostnames or URLs. HTTP libraries won't accept bytes for HTTP header names or values. Functions that are proxies to POSIX-defined functions won't accept bytes even though the POSIX function it calls into is using char * and isn't Unicode aware. Then there's filename handling, where Python assumes the existence of a global encoding for filenames and uses this encoding to convert between str and bytes. And it does this despite POSIX filesystem paths being a bag of bytes where the only rules are that \0 terminates the filename and / is special.

In cases like Python refusing to accept bytes for things like HTTP header names (which will just be spit out over the wire as bytes), Python's pendulum has swung too far towards Unicode only. In my opinion, Python needs to be more accommodating and allow bytes when it makes sense. I hope the pendulum knocks some sense into people when it swings back towards a more reasonable solution that better acknowledges the realities of the world we live in.

For areas like filename handling, the world is more complicated. Python is effectively an abstraction layer over the operating system APIs exposing this functionality. And there is often an impedance mismatch between operating systems. For example, POSIX (Linux) tends to use char * for everything and doesn't care about encoding and Windows tends to use 16 bit character types where the encoding is... a can of worms.

The reality here is that it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality. But Python 3 attempts to do it anyway, making Python 3 unsuitable (or at least highly undesirable) for certain systems level applications that rely on it (like a version control tool).

In fairness to Python, it isn't the only programming language that gets this wrong. The only language I've seen properly implement higher-order abstractions on top of operating system facilities is Rust, whose approach can be generalized as use Python 3's solution of normalizing to Unicode/UTF-8 by default, but expose escape hatches which allow access to the raw underlying types and APIs used by the operating system for the advanced consumers who require it. For example, Rust's Path type which represents a filesystem path allows access to the raw OsStr value used by the operating system, not a normalization of it to bytes or Unicode, which may be lossy. This allows consumers to e.g. create and retrieve OS-native filesystem paths without data loss. This functionality is critical in some domains. Python 3's awareness/insistence that the world is Unicode (which it isn't universally) reduces Python's applicability in these domains.

Speaking of Rust, at the Mercurial developer meetup in October 2019, we were discussing the use of Rust in Mercurial and one of the core maintainers blurted out something along the lines of if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3. As crazy as it initially sounded, I think I agree with that assessment. With the benefit of hindsight, having been a key player in the Python 3 porting effort, seeing all the complications and headaches Python 3 is introducing, and having learned Rust and witnessed its benefits for performance, control, and correctness firsthand, porting to Rust would likely have been the correct move for the project at that point in time. 2020 is not 2014, however, and I'm not sure if I would opt for a rewrite in Rust today. (Most rewrites are follies after all.) But I know one thing: I certainly wouldn't implement a new version control tool in Python 3 and I would probably choose Rust as an implementation language for most new projects in the systems level space or with an expected shelf life of 10+ years. (I really should blog about how awesome Rust is.)

Back to the topic of Python itself, I'm really soured on Python at this point in time. The effort required to port to Python 3 was staggering. For Mercurial, Python 3 introduces a ton of problems and doesn't really solve many. We effectively sludged through mud for several years only to wind up in a state that feels strictly worse than where we started. I'm sure it will be strictly better in a few years. But at that point, we're talking about a 5+ year transition. To call the Python 3 transition disruptive and distracting for the project would be an understatement. As a project maintainer, it's natural to ask what we could have accomplished if we weren't forced to carry out this sideshow.

I can't shake the feeling that a lot of the pain afflicted by the Python 3 transition could have been avoided had Python's language leadership made a different set of decisions and more highly prioritized the transition experience. (Like not initially removing features like u'' and bytes % and not introducing gratuitous backwards compatibility breaks, like with items()/iteritems(). I would have also liked to see a feature like from __future__ - maybe from __past__ - that would make it easier for Python 3 code to target semantics in earlier versions in order to provide a more turnkey on-ramp onto new versions.) I simultaneously see Python 3 losing its position as a justifiable tool in some domains (like systems level tooling) due to ongoing design decisions and poor implementation (like startup overhead problems). (In contrast, I see Rust excelling where Python is faltering and find Rust code surprisingly expressive to write and maintain given how low-level it is and therefore feel that Rust is a compelling alternative to Python in a surprisingly large number of domains.)

Look, I know it is easy for me to armchair quarterback and critique with the benefit of hindsight/ignorance. I'm sure there is a lot of nuance here. I'm sure there was disagreement within the Python community over a lot of these issues. Maintaining a large and successful programming language and community like Python's is hard and you aren't going to please all the people all the time. And speaking as a maintainer, I have mad respect for the people leading such a large community. But niceties aside, everyone knows the Python 3 transition was rough and could have gone better. It should not have taken 11 years to get to where we are today.

I'd like to encourage the Python Project to conduct a thorough postmortem on the transition to Python 3. Identify what went well, what could have gone better, and what should be done next time such a large language change is wanted. Speaking as a Python user, a maintainer of a Python project, and as someone in industry who is now skeptical about use of Python at work due to risks of potentially company crippling high-effort migrations in the future, a postmortem would help restore my confidence that Python's maintainers learned from the various missteps on the road to Python 3 and these potentially ecosystem crippling mistakes won't be made again.

Python had a wildly successful past few decades. And it can continue to thrive for several more. But the Python 3 migration was painful for all involved. And as much as we need to move on and leave Python 2 behind us, there are some important lessons to be learned. I hope the Python community takes the opportunity to reflect and am confident it will grow stronger by taking the time to do so.

Problems with Pull Requests and How to Fix Them

January 07, 2020 at 12:10 PM | categories: Mercurial, Git

You've probably used or at least heard of pull requests: the pull request is the contribution workflow practiced on and made popular by [code] collaboration sites like GitHub, GitLab, Bitbucket, and others. Someone (optionally) creates a fork, authors some commits, pushes them to a branch, then creates a pull request to track integrating those commits into a target repository and branch. The pull request is then used as a vehicle for code review, tracking automated checks, and discussion until it is ready to be integrated. Integration is usually performed by a project maintainer, often with the click of a merge button on the pull request's web page.

It's worth noting that the term pull request is not universally used: GitLab calls them merge requests for example. Furthermore I regard the terms pull request and merge request to be poorly named, as the terms can be conflated with terminology used by your version control tool (e.g. git pull or git merge. And the implementations of a pull or merge request may not even perform a pull or a merge (you can also rebase a pull/merge request, but nobody is calling them rebase requests). A modern day pull request is so much more than a version control tool operation or even a simple request to pull or merge a branch: it is a nexus to track the integration of a proposed change before during and after that change is integrated. But alas. Because GitHub coined the term and is the most popular collaboration platform implementing this functionality, I'll refer to this general workflow as implemented on GitHub, GitLab, Bitbucket, and others as pull requests for the remainder of this post.

Pull requests have existed in essentially their current form for over a decade. The core workflow has remained mostly unchanged. What is different are the addition of value-add features, such as integrating status checks like CI results, the ability to rebase or squash commits instead of merging, code review tooling improvements, and lots of UI polish. GitLab deserves a call out here, as their implementation of merge requests tracks so much more than other tools do. (This is a side-effect of GitLab having more built-in features than comparable tools.) I will also give kudos to GitLab for adding new features to pull requests when GitHub was asleep at the wheel as a company a few years ago. (Not having a CEO for clear product/company leadership really showed.) Fortunately, both companies (and others) are now churning out new, useful features at a terrific clip, greatly benefiting the industry!

While I don't have evidence of this, I suspect pull requests (and the forking model used by services that implement them) came into existence when someone thought how do I design a collaboration web site built on top of Git's new and novel distributed nature and branching features. They then proceeded to invent forking and pull requests. After all, the pull request as implemented by GitHub was initially a veneer over a common Git workflow of create a clone, create a branch, and send it somewhere. Without GitHub, you would run git clone, git branch, then some other command like git request-pull (where have I seen those words before) to generate/send your branch somewhere. On GitHub, the comparable steps are roughly create a fork, create a branch to your fork, and submit a pull request. Today, you can even do all of this straight from the web interface without having to run git directly! This means that GitHub can conceptually be thought of as a purely server-side abstraction/implementation of the Git feature branch workflow.

At its core, the pull request is fundamentally a nice UI and feature layer built around the common Git feature branch workflow. It was likely initially conceived as polish and value-add features over this historically client-side workflow. And this core property of pull requests from its very first days has been copied by vendors like Bitbucket and GitLab (and in Bitbucket's case it was implemented for Mercurial - not Git - as Bitbucket was initially Mercurial only).

A decade is an eternity in the computer industry. As they say, if you aren't moving forward, you are moving backward. I think it is time for industry to scrutinize the pull request model and to evolve it into something better.

I know what you are thinking: you are thinking that pull requests work great and that they are popular because they are a superior model compared to what came before. These statements - aside from some nuance - are true. But if you live in the version control space (like I do) or are paid to deliver tools and workflows to developers to improve productivity and code/product quality (which I am), the deficiencies in the pull request workflow and implementation of that workflow among vendors like GitHub, GitLab, Bitbucket, etc are obvious and begging to be overhauled if not replaced wholesale.

So buckle in: you've started a ten thousand word adventure about everything you didn't think you wanted to know about pull requests!

Problems with Pull Requests

To build a better workflow, we first have to understand what is wrong/sub-optimal with pull requests.

I posit that the foremost goal of an pull request is to foster the incorporation of a high quality and desired change into a target repository with minimal overhead and complexity for submitter, integrator, and everyone in between. Pull requests achieve this goal by fostering collaboration to discuss the change (including code review), tracking automated checks against the change, linking to related issues, etc. In other words, the way I see the world is that a specific vendor's pull request implementation is just that: an implementation detail. And like all implementation details, they should be frequently scrutinized and changed, if necessary.

Let's start dissecting the problems with pull requests by focusing on the size of review units. Research by Google, Microsoft here, and here, and others has shown an inverse correlation with review unit size and defect rate. In Google's words (emphasis mine):

The size distribution of changes is an important factor in the quality of the code review process. Previous studies have found that the number of useful comments decreases and the review latency increases as the size of the change increases. Size also influences developers' perception of the code review process; a survey of Mozilla contributors found that developers feel that size-related factors have the greatest effect on review latency. A correlation between change size and review quality is acknowledged by Google and developers are strongly encouraged to make small, incremental changes (with the exception of large deletions and automated refactoring). These findings and our study support the value of reviewing small changes and the need for research and tools to help developers create such small, self-contained code changes for review.

Succinctly, larger changes result in fewer useful comments during review (meaning quality is undermined) and make reviews take longer (meaning productivity is undermined). Takeaway: if you care about defect rate / quality and/or velocity, you should be authoring and reviewing more, smaller changes as opposed to fewer, larger changes.

I stronger agree with Google's opinion on this matter and wholeheartedly endorse writing more, smaller changes. Having practiced both forms of change authorship, I can say without a doubt that more, smaller changes is superior: superior for authors, superior for code reviewers, and superior for people looking at repository history later. The main downside with this model is that it requires a bit more knowledge of your version control tool to execute. And, it requires corresponding tooling to play well with this change authorship model and to introduce as little friction as possible along the way since the number of interactions with tooling will increase as change size decreases, velocity increases, and there are more distinct units of change being considered for integration.

That last point is important and is germane to this post because the common implementation of pull requests today is not very compatible with the many small changes workflow. As I'll argue, the current implementation of pull requests actively discourages the many smaller changes workflow. And since smaller changes result in higher quality and faster reviews, today's implementations of pull requests are undermining quality and velocity.

I don't mean to pick on them, but since they are the most popular and the people who made pull requests popular, let's use GitHub's implementation of pull requests to demonstrate my point.

I posit that in order for us to author more, smaller changes, we must either a) create more, smaller pull requests or b) have pull request reviews put emphasis on the individual commits (as opposed to the overall merge diff). Let's examine these individually.

If we were to author more, smaller pull requests, this would seemingly necessitate the need for dependencies between pull requests in order to maintain velocity. And dependencies between pull requests adds a potentially prohibitive amount of overhead. Let me explain. We don't want to sacrifice the overall rate at which authors and maintainers are able to integrate proposed changes. If we were to split existing proposed changes into more, smaller pull requests, we would have a lot more pull requests. Without dependencies between them, authors could wait for each pull request to be integrated before sending the next one. But this would incur more round trips between author and integrator and would almost certainly slow down the overall process. That's not desirable. The obvious mitigation to that is to allow multiple, related pull requests in flight simultaneously. But this would necessitate the invention of dependencies between pull requests in order to track relationships so one pull request doesn't integrate before another it logically depends on. This is certainly technically doable. But it imposes considerable overhead of its own. How do you define dependencies? Are dependencies automatically detected or updated based on commits in a DAG? If yes, what happens when you force push and it is ambiguous whether a new commit is a logically new commit or a successor of a previous one? If no, do you really want to impose additional hurdles on submitters to define dependencies between every pull request? In the extreme case of one pull request per commit, do you make someone submitting a series of say twenty commits and pull requests really annotate nineteen dependencies? That's crazy!

There's another, more practical issue at play: the interplay between Git branches and pull requests. As implemented on GitHub, a pull request is tracked by a Git branch. If we have N inter-dependent pull requests, that means N Git branches. In the worst case, we have one Git branch for every Git commit. Managing N in-flight Git branches would be absurd. It would impose considerable overhead on pull request submitters. It would perfectly highlight the inefficiency in Git's game of refs branch management that I blogged about two years ago. (Succinctly, once you are accustomed to workflows - like Mercurial's - which don't require you to name commits or branches, Git's forced naming of branches and all the commands requiring those branch names feels grossly inefficient and a mountain of overhead.) Some tooling could certainly be implemented to enable efficient submission of pull requests. (See ghstack for an example.) But I think the interplay between Git branches and GitHub pull requests is sufficiently complex that the tooling and workflow would be intractable for anything but the most trivial and best-case scenarios. Keep in mind that any sufficiently user-friendly solution to this problem would also entail improving git rebase so it moves branches on rewritten ancestor commits instead of leaving them on the old versions of commits. (Seriously, someone should implement this feature: it arguably makes sense as the default behavior for local branches.) In other words, I don't think you can implement the multiple pull request model reliably and without causing excessive burden on people without fundamentally changing the requirement that a pull request be a Git branch. (I'd love to be proven wrong.)

Therefore, I don't think the more, smaller changes workflow can be easily practiced with multiple pull requests using the common GitHub model without effectively moving the definition of a pull request away from equivalence with a Git branch (more on this later). And I also don't mean to imply that dependencies between pull requests can't be implemented: they can and GitLab is evidence. But GitLab's implementation is somewhat simple and crude (possibly because doing anything more complicated is really hard as I speculate).

So without fundamentally changing the relationship between a pull request and a branch, that leaves us with our alternative of pull requests putting more emphasis on the individual changes rather than the merge diff. Let's talk about that now.

Pull requests have historically placed emphasis on the merge diff. That is, GitHub (or another provider) takes the Git branch you have submitted, runs a git merge against the target branch behind the scenes, and displays that diff front and center for review as the main proposed unit of change: if you click the Files changed tab to commence review, you are seeing this overall merge diff. You can click on the Commits tab then select an individual commit to review just that commit. Or you can use the dropdown on the Files changed tab to select an individual commit to review it. These (relatively new) features are a very welcome improvement and do facilitate performing a commit-by-commit review, which is a requirement to realize the benefits of a more, smaller changes workflow. Unfortunately, they are far from sufficient to fully realize the benefits of that workflow.

Defaults matter and GitHub's default is to show the merge diff when conducting review. (I bet a large percentage of users don't even know it is possible to review individual commits.) Since larger changes result in a higher defect rate and slower review, GitHub's default of showing the merge diff effectively means GitHub is defaulting to lower quality, longer-lasting reviews. (I suppose this is good for engagement numbers, as it inflates service usage both immediately and in the long-term due to subsequent bugs driving further usage. But I sincerely hope no product manager is thinking let's design a product that undermines quality to drive engagement.)

Unfortunately, a trivial change of the default to show individual commits instead of the merge diff is not so simple, as many authors and projects don't practice clean commit authorship practices, where individual commits are authored such that they can be reviewed in isolation.

(One way of classifying commit authorship styles is by whether a series of commits is authored such that each commit is good in isolation or whether the effect of applying the overall series is what matters. A handful of mature projects - like the Linux kernel, Firefox, Chrome, Git, and Mercurial - practice the series of individually-good commits model, which I'll call a commit-centric workflow. I would wager the majority of projects on GitHub and similar services practice the we only care about the final result of the series of commits model. A litmus test for practicing the latter model is whether pull requests contain commits like fixup foo or if subsequent revisions to pull requests create new commits instead of amending existing ones. I'm a strong proponent of a clean commit history where each commit in the final repository history stands as good in isolation. But I tend to favor more grown-up software development practices and am a version control guru. That being said, the subject/debate is fodder for another post.)

If GitHub (or someone else) switched the pull request default to a per-commit review without otherwise changing the relationship between a pull request and a Git branch, that would force a lot of less experienced users to familiarize themselves with history rewriting in Git. This would impose considerable pain and suffering on pull request authors, which would in turn upset users, hurt engagement, etc. Therefore, I don't think this is a feasible global default that can be changed. Maybe if Git's user experience for history rewriting were better or we didn't have a decade of behavior to undo we'd be in a better position... But pull request implementations don't need to make a global change: they could right the ship by offering projects that practice clean commit practices an option to change the review default so it emphasizes individual commits instead of the merge diff. This would go a long way towards encouraging authoring and reviewing individual commits, which should have positive benefits on review velocity and code quality outcomes.

But even if these services did emphasize individual commits by default in pull request reviews, there's still a handful of significant deficiencies that would undermine the more, smaller changes workflow that we desire.

While it is possible to review individual commits, all the review comments are still funneled into a single per pull request timeline view of activity. If submitter and reviewer make the effort to craft and subsequently review individual commits, your reward is that all the feedback for the discrete units of change gets lumped together into one massive pile of feedback for the pull request as a whole. This unified pile of feedback (currently) does a poor job of identifying which commit it applies to and gives the author little assistance in knowing which commits need amending to address the feedback. This undermines the value of commit-centric workflows and effectively pushes commit authors towards the fixup style of commit authorship. In order to execute per-commit review effectively, review comments and discussion need to be bucketed by commit and not combined into a unified pull request timeline. This would be a massive change to the pull request user interface and would be a daunting undertaking, so it is understandable why it hasn't happened yet. And such an undertaking would also require addressing subtly complex issues like how to preserve reviews in the face of force pushes. Today, GitHub's review comments can lose context when force pushes occur. Things are better than they used to be, when review comments left on individual commits would flat out be deleted (yes: GitHub really did effectively lose code review comments for several years.) But even with tooling improvements, problems still remain and should adoption of commit-level review tracking occur, these technical problems would likely need resolution to appease users of this workflow.

Even if GitHub (or someone else) implements robust per-commit review for pull requests, there's still a problem with velocity. And that problem is that if the pull request is your unit of integration (read: merging), then you have to wait until every commit is reviewed before integration can occur. This may sound tolerable (it's what we practice today after all). But I argue this is less optimal than a world where a change integrates as soon as it is ready to, without having to wait for the changes after it. As an author and maintainer, if I see a change that is ready to integrate, I prefer to integrate it as soon as possible, without delay. The longer a ready-to-integrate change lingers, the longer it is susceptible to bit rot (when the change is no longer valid/good due to other changes in the system). Integrating a judged-good change sooner also reduces the time to meaningful feedback: if there is a fundamental problem early in a series of changes that isn't caught before integration, integrating earlier changes sooner without waiting for the ones following will expose problems sooner. This minimizes deltas in changed systems (often making regression hunting easier), often minimizes the blast radius if something goes wrong, and gives the author more time and less pressure to amend subsequent commits that haven't been integrated yet. And in addition to all of this, integrating more often just feels better. The Progress Principle states that people feel better and perform better work when they are making continuous progress. But setbacks more than offset the power of small wins. While I'm not aware of any explicit research in this area, my interpretation of the Progress Principle to change authorship and project maintenance(which is supported by anecdotal observation) is that a steady stream of integrated changes feels a hell of a lot better than a single monolithic change lingering in review purgatory for what can often seem like an eternity. While you need to be cognizant to not confuse movement with meaningful progress, I think there is real power to the Progress Principle and that we should aim to incorporate changes as soon as they are ready and not any later. Applied to version control and code review, this means integrating a commit as soon as author, reviewer, and our machine overlords reporting status checks all agree it is ready, without having to wait for a larger unit of work, like the pull request. Succinctly, move forward as soon as you are able to!

This desire to integrate faster has significant implications for pull requests. Again, looking at GitHub's implementation of pull requests, I don't see how today's pull requests could adapt to this desired end state without significant structural changes. For starters, review must grow the ability to track per-commit state otherwise integrating individual commits without the entirety of the parts makes little sense. But this entails all the complexity I described above. Then there's the problem of Git branches effectively defining a pull request. What happens when some commits in a pull request are integrated and the author rebases or merges their local branch against their new changes? This may or may not just work. And when it doesn't just work, the author can easily find themselves in merge conflict hell, where one commit after the other fails to apply cleanly and their carefully curated stack of commits quickly becomes a liability and impediment to forward progress. (As an aside, the Mercurial version control tool has a concept called changeset evolution where it tracks which commits - changesets in Mercurial parlance - have been rewritten as other commits and gracefully reacts in situations like a rebase. For example, if you have commits X and Y and X is integrated via a rebase as X', an hg rebase of Y onto X' will see that X was rewritten as X' and skip attempting to rebase X because it is already applied! This cleanly sidesteps a lot of the problems with history rewriting - like merge conflicts - and can make the end-user experience much more pleasant as a result.) While it is certainly possible to integrate changes as soon as they are ready with a pull request workflow, I think that it is awkward and that by the time you've made enough changes to accommodate the workflow, very little is left of the pull request workflow as we know it and it is effectively a different workflow altogether.

The above arguments overly hinge on the assumption that more smaller changes is superior for quality and/or velocity and that we should design workflows around this assertion. While I strongly believe in the merits of smaller units of change, others may disagree. (If you do disagree, you should ask yourself whether you believe the converse: that larger units of change are better for quality and velocity. I suspect most people can't justify this. But I do believe there is merit to the argument that smaller units of change impose additional per-unit costs or have second order effects that undermine their touted quality or velocity benefits.)

But even if you don't buy into the change size arguments, there's still a very valid reason why we should think beyond pull requests as they are implemented today: tool scalability.

The implementation of pull requests today is strongly coupled with how Git works out of the box. A pull request is initiated from a Git branch pushed to a remote Git repository. When the pull request is created, the server creates a Git branch/ref referring to that pull request's head commits. On GitHub, these refs are named pull/ID/head (you can fetch these from the remote Git repository but they are not fetched by default). Also when a pull request is created or updated, a git merge is performed to produce a diff for review. On GitHub, the resulting merge commit is saved and pointed to on the open pull request via a pull/ID/merge ref, which can also be fetched locally. (The merge ref is deleted when the pull request is closed.)

Herein resides our scalability problem: unbound growth of Git refs and ever-increasing rate of a change for a growing project. Each Git ref adds overhead to graph walking operations and data exchange. While involved operations are continuously getting optimized (often through the use of more advanced data structures or algorithms), there are intrinsic scaling challenges with this unbound growth that - speaking as a version control tool maintainer - I want no part of. Are technical solutions enabling things to scale to millions of Git refs viable? Yes'ish. But it requires high-effort solutions like JGit's Reftable, which required ~90 review rounds spanned across ~4 months to land. And that's after design of the feature was first proposed at least as far back as July 2017. Don't get me wrong: am I glad Reftable exists: yes. It is a fantastic solution to a real problem and reading how it works will probably make you a better engineer. But simultaneously, it is a solution to a problem that does not need to exist. There is a space for scaling graph data structures and algorithms to millions or even billions of nodes, edges, and paths: your version control tool should not be it. Millions or billions of commits and files: that's fine. But scaling the number of distinct paths through that graph by introducing millions of DAG heads is insane given how much complexity it introduces in random areas of the tool. In my opinion it requires unjustifiably large amounts of investment to make work at scale. As an engineer, my inclination when posed with problems like these is to avoid them in the first place. The easiest problems to solve are those you don't have.

Unfortunately, the tight coupling of pull requests to Git branches/refs introduces unbound growth and a myriad of problems associated with it. Most projects may not grow to a size that experiences these problems. But as someone who has experience with this problem space at multiple companies, I can tell you the problem is very real and the performance and scalability issues it creates undermines the viability of using today's implementation of pull requests once you've reached a certain scale. Since we can likely fix the underlying scaling issues with Git, I don't think the explosion of Git refs is a long-term deal breaker for scaling pull requests. But it is today and will remain so until Git and the tools built on top of it improve.

In summary, some high-level problems with pull requests are as follows:

Review of merge diff by default encourages larger units of review, which undermines quality and velocity outcomes.
Inability to incrementally integrate commits within a pull request, which slows down velocity, time to meaningful feedback, and can lower morale.
Tight coupling of pull requests with Git branches adds rigidity to workflows and shoehorns into less flexible and less desired workflows.
Deficiencies in the Git user experience - particular around what happens when rewrites (including rebase) occur - significantly curtail what workflows can be safely practiced with pull requests.
Tight coupling of pull requests with Git branches can lead to performance issues at scale.

We can invert language to arrive at a set of more ideal outcomes:

Review experience is optimized for individual commits - not the merge diff - so review units are smaller and quality and velocity outcomes are improved.
Ability to incrementally integrate individual commits from a larger set so ready-to-go changes are incorporated sooner, improving velocity, time to meaningful feedback, and morale.
How you use Git branches does not impose significant restrictions on handling of pull requests.
You can use your version control tool how you want, without having to worry about your workflow being shoehorned by how pull requests work.
Pull request server can scale to the most demanding use cases with relative ease.

Let's talk about how we could achieve these more desirable outcomes.

Exploring Alternative Models

A pull request is merely an implementation pattern for the general problem space of integrating a proposed change. There are other patterns used by other tools. Before I describe them, I want to coin the term integration request to refer to the generic concept of requesting some change being integrated elsewhere. GitHub pull requests and GitLab merge requests are implementations of integration requests, for example.

Rather than describe alternative tools in detail, I will outline the key areas where different tools differ from pull requests and assess the benefits and drawbacks to the different approaches.

Use of the VCS for Data Exchange

One can classify implementations of integration requests by how they utilize the underlying version control tools.

Before Git and GitHub came along, you were probably running a centralized version control tool which didn't support offline commits or feature branches (e.g. CVS or Subversion). In this world, the common mechanism for integration requests was exchanging diffs or patches through various media - email, post to a web service of a code review tool, etc. Your version control tool didn't speak directly to a VCS server to initiate an integration request. Instead, you would run a command which would export a text-based representation of the change and then send it somewhere.

Today, we can classify integration requests by whether or not they speak the version control tool's native protocol for exchanging data or whether they exchange patches through some other mechanism. Pull requests speak the VCS native protocol. Tools like Review Board and Phabricator exchange patches via custom HTTP web services. Typically, tools using non-native exchange will require additional client-side configuration, including potentially the installation of a custom tool (e.g. RBTools for Review Board or Arcanist for Phabricator). Although modern version control tools sometimes have this functionality built-in. (e.g. Git and Mercurial fulfill Zawinski's law and Mercurial has a Phabricator extension in its official distribution).

An interesting outlier is Gerrit, which ingests its integration requests via git push. (See the docs.) But the way Gerrit's ingestion via git push works is fundamentally different from how pull requests work! With pull requests, you are pushing your local branch to a remote branch and a pull request is built around that remote branch. With Gerrit, your push command is like git push gerrit HEAD:refs/for/master. For the non-gurus, that HEAD:refs/for/master syntax means, push the HEAD commit (effectively the commit corresponding to the working directory) to the refs/for/master ref on the gerrit remote (the SOURCE:DEST syntax specifies a mapping of local revision identifier to remote ref). The wizard behind the curtain here is that Gerrit runs a special Git server that implements non-standard behavior for the refs/for/* refs. When you push to refs/for/master, Gerrit receives your Git push like a normal Git server would. But instead of writing a ref named refs/for/master, it takes the incoming commits and ingests them into a code review request! Gerrit will create Git refs for the pushed commits. But it mainly does that for its internal tracking (Gerrit stores all its data in Git - from Git data to review comments). And if that functionality isn't too magical for you, you can also pass parameters to Gerrit via the ref name! e.g. git push gerrit HEAD refs/for/master%private will create a private review request that requires special permissions to see. (It is debatable whether overloading the ref name for additional functionality is a good user experience for average users. But you can't argue that this isn't a cool hack!)

On the surface, it may seem like using the version control tool's native data exchange is a superior workflow because it is more native and more modern. (Emailing patches is so old school.) Gone are the days of having to configure client-side tools to export and submit patches. Instead, you run git push and your changes can be turned into an integration request automatically or with a few mouse clicks. And from a technical level, this exchange methodology is likely safer, as round-tripping a text-based representation of a change without data loss is surprisingly finicky. (e.g. JSON's lack of lossless binary data exchange without encoding to e.g. base64 first often means that many services exchanging text-based patches are lossy, especially in the presence of content which doesn't conform to UTF-8, which can be commonplace in tests. You would be surprised how many tools experience data loss when converting version control commits/diffs to text. But I digress). Having Git's wire protocol exchange binary data is safer than exchanging text patches and probably easier to use since it doesn't require any additional client-side configuration.

But despite being more native, modern, and arguably robust, exchange via the version control tool may not be better.

For starters, use of the version control tool's native wire protocol inhibits use of arbitrary version control tools on the client. When your integration request requires the use of a version control tool's wire protocol, the client likely needs to be running that version control tool. With other approaches like exchange of text based patches, the client could be running any software it wanted: as long as it could spit out a patch or API request in the format the server needed, an integration request could be created! This meant there was less potential for lock-in, as people could use their own tools on their machines if they wanted and they (hopefully) wouldn't be inflicting their choice on others. Case in point, a majority of Firefox developers use Mercurial - the VCS of the canonical repository - but a large number use Git on the client. Because Firefox is using Phabricator (Review Board and Bugzilla before that) for code review and because Phabricator ingests text-based patches, the choice of the VCS on the client doesn't matter that much and the choice of the server VCS can be made without inciting a holy war among developers who would be forced to use a tool they don't prefer. Yes, there are good reasons for using a consistent tool (including organizational overhead) and sometimes mandates for tool use are justified. But in many cases (such as random open source contributions), it probably doesn't or shouldn't matter. And in cases like Git and Mercurial, where tools like the fantastic git-cinnabar make it possible to easily convert between the repositories without data loss and acceptable overhead, adoption of the version control tool's native wire protocol can exclude or inhibit the productivity of contributors since it can mandate use of specific, undesired tooling.

Another issue with using the version control tool's wire protocol is that it often forces or strongly encourages you to work a certain way. Take GitHub pull requests for example. The pull request is defined around the remote Git branch that you git push. If you want to update that branch, you need to know its name. So that requires some overhead to either create and track that branch or find its name when you want to update it. Contrast with Gerrit, where you don't have an explicit remote branch you push to: you simply git push gerrit HEAD:refs/for/master and it figures things out automatically (more on this later). With Gerrit, I don't have to create a local Git branch to initiate an integration request. With pull requests, I'm compelled to. And this can undermine my productivity by compelling me to practice less-efficient workflows!

Our final point of comparison involves scalability. When you use the version control tool wire protocol as part of integration requests, you have introduced the problem of scaling your version control server. Take it from someone who has had multiple jobs involving scaling version control servers and who is intimately aware of the low-level details of both the Git and Mercurial wire protocols: you don't want to be in the business of scaling a version control server. The wire protocols for both Git and Mercurial were designed in a now-ancient era of computing and weren't designed by network protocol experts. They are fundamentally difficult to scale at just the wire protocol level. I've heard stories that at one time, the most expensive single server at Google was their Perforce or Perforce-derived server (this was several years ago - Google has since moved on to a better architecture).

The poor network protocols of version control tools have many side-effects, including the inability or sheer difficulty of using distributed storage on the server. So in order to scale compute horizontally, you need to invest in expensive network storage solutions or devise a replication and synchronization strategy. And take it from someone who worked on data synchronization products (outside of the source control space) at three companies: this is a problem you don't want to solve yourself. Data synchronization is intrinsically difficult and rife with difficult trade-offs. It's almost always a problem best avoided if you have a choice in the matter.

If creating Git refs is part of creating an integration request, you've introduced a scaling challenge with the number of Git refs. Do these Git refs live forever? What happens when you have thousands of developers - possibly all working in the same repository - and the number of refs or ref mutations grows to the hundreds of thousands or millions per year?

Can your version control server handle ingesting a push every second or two with reasonable performance? Unless you are Google, Facebook, or a handful of other companies I'm aware of, it can't. And before you cry that I'm talking about problems that only plague the 0.01% of companies out there, I can name a handful of companies under 10% the size of these behemoths where this is a problem for them. And I also guarantee that many people don't have client-side metrics for their git push P99 times or reliability and don't even realize there is a problem! Scaling version control is probably not a core part of your company's business. Unfortunately, it all too often becomes something companies have to allocate resources for because of poorly designed or utilized tools.

Contrast the challenges of scaling integration requests with a native version control server versus just exchanging patches. With the more primitive approach, you are probably sending the patch over HTTP to a web service. And with tools like Phabricator and Review Board, that patch gets turned into rows in a relational database. I guarantee it will be easier to scale an HTTP web service fronting a relational database than it will be your version control server. If nothing else, it should be easier to manage and debug, as there are tons more experts in these domains than in the version control server domain!

Yes, it is true that many will not hit the scaling limits of the version control server. And some nifty solutions for scaling do exist. But large segments of this problem space - including the version control tool maintainers having to support crazy scaling vectors in their tools - could be avoided completely if integration requests didn't lean so heavily on the version control tools's default mode of operation. Unfortunately, solutions like GitHub pull requests and Gerrit's use of Git refs for storing everything exert a lot of pressure on scaling the version control server and make this a very real problem once you reach a certain scale.

Hopefully the above paragraphs enlightened you to some of the implications that the choice of a data exchange mechanism has on integration requests! Let's move on to another point of comparison.

Commit Tracking

One can classify implementations of integration requests by how they track commits through their integration lifecycle. What I mean by this is how the integration request follows the same logical change as it evolves. For example, if you submit a commit then amend it, how does the system know that the commit evolved from commit X to X'.

Pull requests don't track commits directly. Instead, a commit is part of a Git branch and that branch is tracked as the entity the pull request is built around. The review interface presents the merge diff front and center. It is possible to view individual commits. But as far as I know, none of these tools have smarts to explicitly track or map commits across new submissions. Instead, they simply assume that the commit order will be the same. If commits are reordered or added or removed in the middle of an existing series, the tool can get confused quite easily. (With GitHub, it was once possible for a review comment left on a commit to disappear entirely. The behavior has since been corrected and if GitHub doesn't know where to print a comment from a previous commit, it renders it as part of the pull request's timeline view.)

If all you are familiar with is pull requests, you may not realize there are alternatives to commit tracking! In fact, the most common alternative (which isn't do nothing) predates pull requests entirely and is still practiced by various tools today.

The way that Gerrit, Phabricator, and Review Board work is the commit message contains a unique token identifying the integration request for that commit. e.g. a commit message for a Phabricator review will contain the line Differential Revision: https://phab.mercurial-scm.org/D7543. Gerrit will have something like Change-Id: Id9bfca21f7697ebf85d6a6fa7bac7de4358d7a43.

The way this annotation appears in the commit message differs by tool. Gerrit's web UI advertises a shell one-liner to clone repositories which not only performs a git clone but also uses curl to download a shell script from the Gerrit server and install it as Git's commit-msg hook in the newly-cloned repositories. This Git hook will ensure that any newly-created commit has a Change-ID: XXX line containing a randomly generated, hopefully unique identifier. Phabricator and Review Board leverage client-side tooling to rewrite commit messages after submission to their respective tool so the commit message contains the URL of the code review. One can debate which approach is better - they each have advantages and drawbacks. Fortunately, this debate is not germane to this post, so we won't cover it here.

What is important is how this metadata in commit messages is used.

The commit message metadata comes into play when a commit is being ingested into an integration request. If a commit message lacks metadata or references an entity that doesn't exist, the receiving system assumes it is new. If the metadata matches an entity on file, the incoming commit is often automatically matched up to an existing commit, even if its Git SHA is different!

This approach of inserting a tracking identifier into commit messages works surprisingly well for tracking the evolution of commits! Even if you amend, reorder, insert, or remove commits, the tool can often figure out what matches up to previous submissions and reconcile state accordingly. Although support for this varies by tool. Mercurial's extension for submitting to Phabricator is smart enough to take the local commit DAG into account and change dependencies of review units in Phabricator to reflect the new DAG shape, for example.

The tracking of commits is another one of those areas where the simpler and more modern features of pull requests often don't work as well as the solutions that came before. Yes, inserting an identifier into commit messages feels hacky and can be brittle at times (some tools don't implement commit rewriting very well and this can lead to a poor user experience). But you can't argue with the results: using explicit, stable identifiers to track commits is far more robust than the heuristics that pull requests rely on. The false negative/positive rate is so much lower. (I know this from first hand experience because we attempted to implement commit tracking heuristics for a code review tool at Mozilla before Phabricator was deployed and there were a surprising number of corner cases we couldn't handle properly. And this was using Mercurial's obsolescence markers, which gave us commit evolution data generated directly by the version control tool! If that didn't work well enough, it's hard to imagine an heuristic that would. We eventually gave up and used stable identifiers in commit messages, which fixed most of the annoying corner cases.)

The use of explicit commit tracking identifiers may not seem like it makes a meaningful difference. But it's impact is profound.

The obvious benefit of tracking identifiers is that they allow rewriting commits without confusing the integration request tool. This means that people can perform advanced history rewriting with near impunity as to how it would affect the integration request. I am a heavy history rewriter. I like curating a series of individually high-quality commits that can each stand in isolation. When I submit a series like this to a GitHub pull request and receive feedback on something I need to change, when I enact those changes I have to think will my rewriting history here make re-review harder? (I try to be empathetic with the reviewer and make their life easier whenever possible. I ask what I would appreciate someone doing if I were reviewing their change and tend to do that.) With GitHub pull requests, if I reorder commits or add or remove a commit in the middle of a series, I realize that this may make review comments left on those commits hard to find since GitHub won't be able to sort out the history rewriting. And this may mean those review comments get lost and are ultimately not acted upon, leading to bugs or otherwise deficient changes. This is a textbook example of tooling deficiencies dictating a sub-optimal workflow and outcome: because pull requests don't track commits explicitly, I'm forced to adopt a non-ideal workflow or sacrifice something like commit quality in order to minimize risks that the review tool won't get confused. In general, tools should not externalize these kinds of costs or trade-offs onto users: they should just work and optimize for generally agreed-upon ideal outcomes.

Another benefit to tracking identifiers is that they enable per-commit review to be viable. Once you can track the logical evolution of a single commit, you can start to associate things like review comments with individual commits with a high degree of confidence. With pull requests (as they are implemented today), you can attempt to associate comments with commits. But because you are unable to track commits across rewrites with an acceptably high degree of success, rewritten commits often fall through the cracks, orphaning data like review comments with them. Data loss is bad, so you need a place to collect this orphaned data. The main pull request activity timeline facilitates this function.

But once you can track commits reliably (and tools like Gerrit and Phabricator prove this is possible), you don't have this severe problem of data loss and therefore don't need to worry about finding a place to collect orphaned data! You are then able to create per-commit review units, each as loosely coupled with other commits and an overall series as you want to make it!

It is interesting to note the different approaches in different tools here. it is doubly interesting to note behavior that is possible with the review tool itself and what it does by default!

Let's examine Phabricator. Phabricator's review unit is the Differential revision. (Differential is the name of the code review tool in Phabricator, which is actually a suite of functionality - like GitLab, but not nearly as feature complete.) A Differential revision represents a single diff. Differential revisions can have parent-child relationships with others. Multiple revisions associated like this form a conceptual stack in Phabricator's terminology. Go to https://phab.mercurial-scm.org/D4414 and search for stack to see it in action. (Stack is a bit misleading name because the parent-child relationships actually form a DAG and Phabricator is capable of rendering things like multiple children in its graphical view.) Phabricator's official client side tool for submitting to Phabricator - Arcanist or arc - has default behavior of collapsing all Git commits into a single Differential revision.

Phabricator can preserve metadata from the individual commits (it can render at least the commit messages in the web UI so you can see where the Differential revision came from). In other words, by default Arcanist does not construct multiple Differential revisions for each commit and therefore does not construct parent-child relationships for them. So there is no stack to render here. To be honest, I'm not sure if modern versions of Arcanist even support doing this. I do know both Mercurial and Mozilla authored custom client side tools for submitting to Phabricator to work around deficiencies like this in Arcanist. Mozilla's may or may not be generally suitable for users outside of Mozilla - I'm not sure.

Another interesting aspect of Phabricator is that there is no concept of an over-arching series. Instead, each Differential revision stands in isolation. They can form parent-child relationships and constitute a stack. But there is no primary UI or APIs for stacks (the last I looked anyway). This may seem radical. You may be asking questions like how do I track the overall state of a series or how do I convey information pertinent to the series as a whole. These are good questions. But without diving into them, the answer is that as radical as it sounds to not have an overall tracking entity for a series of Differential revisions, it does work. And having used this workflow with the Mercurial Project for a few years, I can say I'm not missing the functionality that much.

Gerrit is also worth examining. Like Phabricator, Gerrit uses an identifier in commit messages to track the commit. But whereas Phabricator rewrites commit messages at initially submission time to contain the URL that was created as part of that submission, Gerrit peppers the commit message with a unique identifier at commit creation time. The server then maintains a mapping of commit identifier to review unit. Implementation details aside, the end result is similar: individual commits can be tracked more easily.

What distinguishes Gerrit from Phabricator is that Gerrit does have a stronger grouping around multiple commits. Gerrit will track when commits are submitted together and will render both a relation chain and submitted together list automatically. While it lacks the visual beauty of Phabricator's implementation, it is effective and is shown in the UI by default, unlike Phabricator.

Another difference from Phabricator is that Gerrit uses per-commit review by default. Whereas you need a non-official client for Phabricator to submit a series of commits to constitute a linked chain, Gerrit does this by default. And as far as I can tell, there's no way to tell Gerrit to squash your local commits down to a single diff for review: if you want a single review to appear, you must first squash commits locally then push the squashed commit. (More on this topic later in the post.)

A secondary benefit of per-commit review is that this model enables incremental integration workflows, where some commits in a series or set can integrate before others, without having to wait for the entire batch. Incremental integration of commits can drastically speed up certain workflows, as commits can integrate as soon as they are ready and not any longer. The benefits of this model can be incredible. But actually deploying this workflow can be tricky. One problem is that your version control tool may get confused when you rebase or merge partially landed state. Another problem is it can increase the overall change rate of the repository, which may strain systems from version control to CI to deployment mechanisms. Another potential problem involves communicating review sign-off from integration sign-off. Many tools/workflows conflate I sign off on this change and I sign off on landing this change. While they are effectively identical in many cases, there are some valid cases where you want to track these distinctly. And adopting a workflow where commits can integrate incrementally will expose these corner cases. So before you go down this path, you want to be thinking about who integrates commits and when they are integrated. (You should probably be thinking about this anyway because it is important.)

Designing a Better Integration Request

Having described some problems with pull requests and alternate ways of going about solving the general problem of integration requests, it is time to answer the million dollar problem: designing a better integration request. (When you factor in the time people spend in pull requests and the cost of bugs / low quality changes that slip through due to design of existing tooling, improving integration requests industry wide would be a lot more valuable than $1M.)

As a reminder, the pull request is fundamentally a nice UI and set of features built around the common Git feature branch workflow. This property is preserved from the earliest days of pull requests in 2007-2008 and has been copied by vendors like Bitbucket and GitLab in the years since. In my mind, pull requests should be ripe for overhaul.

Replace Forks

The first change I would make to pull requests is to move away from forks being a required part of the workflow. This may seem radical. But it isn't!

A fork on services like GitHub is a fully fledged project - just like the canonical project it was forked from. It has its own issues, wiki, releases, pull requests, etc. Now show of hands: how often do you use these features on a fork? Me neither. In the overwhelming majority of cases, a fork exists solely as a vehicle to initiate a pull request against the repository it was forked from. It serves little to no additional meaningful functionality. Now, I'm not saying forks don't serve a purpose - they certainly do! But in the case of someone wanting to propose a change to a repository, a fork is not strictly required and its existence is imposed on us by the current implementation of pull requests.

I said impose in the previous sentence because forks introduce overhead and confusion. The existence of a fork may confuse someone as to where a canonical project lives. Forks also add overhead in the version control tool. Their existence forces the user to manage an additional Git remote and branches. It forces people to remember to keep their branches in sync on their fork. As if remembering to keep your local repository in sync wasn't hard enough! And if pushing to a fork, you need to re-push data that was already pushed to the canonical repository, even though that data already exists on the server (just in a different view of the Git repository). (I believe Git is working on wire protocol improvements to mitigate this.)

When merely used as a vehicle to initiate integration requests, I do not believe forks offer enough value to justify their existence. Should forks exist: yes. Should people be forced to use them in order to contribute changes, no. (Valid use cases for a fork would be to perform a community splinter of a project and to create an independent entity for reasons such as better guarantees of data availability and integrity.)

Forks are essentially a veneer on top of a server-side git clone. And the reason why a separate Git repository is used at all is probably because the earliest versions of GitHub were just a pile of abstractions over git commands. The service took off in popularity, people copied its features almost verbatim, and nobody ever looked back and thought why are we doing things like this in the first place.

To answer what we would replace forks with, we have to go back to first principles and ask what are we trying to do. And that is propose a unit a change against an existing project. And for version control tools, all you need to propose a change is a patch/commit. So to replace forks, we just need an alternate mechanism to submit patches/commits to an existing project.

My preferred alternative to forks is to use git push directly to the canonical repository. This could be implemented like Gerrit where you push to a special ref. e.g. git push origin HEAD:refs/for/master. Or - and this is my preferred solution - version control servers could grow more smarts about how pushes work - possibly even changing what commands like git push do if the server is operating in special modes.

One idea would be for the Git server to expose different refs namespaces depending on the authenticated user. For example, I'm indygreg on GitHub. If I wanted to propose a change to a project - let's say python/cpython - I would git clone git@github.com:python/cpython. I would create a branch - say indygreg/proposed-change. I would then git push origin indygreg/proposed-change and because the branch prefix matches my authenticated username, the server lets it through. I can then open a pull request without a fork! (Using branch prefixes is less than ideal, but it should be relatively easy to implement on the server. A better approach would rely on remapping Git ref names. But this may require a bit more configuration with current versions of Git than users are willing to stomach. An even better solution would be for Git to grow some functionality to make this easier. e.g. git push --workspace origin proposed-change would push proposed-change to a workspace on the origin remote, which Git would know how to translate to a proper remote ref update.)

Another idea would be for the version control server to invent a new concept for exchanging commits - one based on sets of commits instead of DAG synchronization. Essentially, instead of doing a complicated discovery dance to synchronize commits with the underlying Git repository, the server would ingest and expose representations of sets of commits stored next to - but not within - the repository itself. This way you are not scaling the repository DAG to infinite heads - which is a hard problem! A concrete implementation of this might have the client run a git push --workspace origin proposed-change to tell the remote server to store your proposed-change branch in your personal workspace (sorry for reusing the term from the previous paragraph). The Git server would receive your commits, generate a standalone blob to hold them, save that blob to a key-value store like S3, then update a mapping of which commits/branches are in which blobs in a data store such as a relational database somewhere. This would effectively segment the core project data from the more transient branch data, keeping the core repository clean and pure. It allows the server to lean on easier-to-scale data stores such as key-value blob stores and relational databases instead of the version control tool. I know this idea is feasible because Facebook implemented it for Mercurial. The infinitepush extension essentially siphons Mercurial bundles (standalone files holding commit data) off to a blob store when pushes come in over the wire. At hg pull time, if a requested revision is not present in the repository, the server asks the database-backed blob index if the revision exists anywhere. If it does, the blob/bundle is fetched, dynamically overlayed onto the repository in memory, and served to the client. While the infinitepush extension in the official Mercurial project is somewhat lacking (through no fault of Facebook's), the core idea is solid and I wish someone would spend the time to flush out the design a bit more because it really could lead to logically scaling repositories to infinite DAG heads without the complexities of actually scaling scaling DAG algorithms, repository storage, and version control tool algorithms to infinite heads. Getting back to the subject of integration requests, one could imagine having a target for workspace pushes. For example, git push --workspace=review origin would push to the review workspace, which would automatically initiate a code review.

Astute readers of this blog may find these ideas familiar. I proposed user namespaces in my /blog/2017/12/11/high-level-problems-with-git-and-how-to-fix-them/ post a few years ago. So read there for more on implications of doing away with forks.

Could forks be done away with as a requirement to submit pull requests? Yes! Gerrit's git push origin HEAD:refs/for/master mechanism proves it. Is Gerrit's approach too much magic or confusing for normal users? I'm not sure. Could Git grow features to make the user experience much better so users don't need to be burdened with complexity or magic and could simply run commands like git submit --for review? Definitely!

Shift Focus From Branches to Individual Commits

My ideal integration request revolves around individual commits, not branches. While the client may submit a branch to initiate or update an integration request, the integration request is composed of a set of loosely coupled commits, where parent-child relationships can exist to express a dependency between commits. Each commit is evaluated individually. Although someone may need to inspect multiple commits to gain a full understanding of the proposed change. And some UI enabling operations against a group of related commits (such as mass deleting abandoned commits) may be warranted.

In this world, the branch would not matter. Instead, commits are king. Because we would be abandoning the branch name as a tracker for the integration request, we would need something to replace it, otherwise we have no way of knowing how to update an existing integration request! We should do what tools like Phabricator, Gerrit, and Review Board do and add a persistent identifier to commits which survive history rewriting. (Branch-based pull requests should do this anyway so history rewrites don't confuse the review tool and e.g. cause comments to get orphaned - see above.)

It's worth noting that a commit-centric integration request model does not imply that everyone is writing or reviewing series of smaller commits! While titans of industry and I strongly encourage the authorship of smaller commits, commit-centric integration requests don't intrinsically force you to do so. This is because commit-centric integration requests aren't forcing you to change your local workflow! If you are the type of person who doesn't want to curate a ton of small, good-in-isolation commits (it does take a bit more work after all), nobody would be forcing you to do so. Instead, if this is your commit authorship pattern, the submission of the proposed change could squash these commits together as part of the submission, optionally rewriting your local history in the process. If you want to keep dozens of fixup commits around in your local history, that's fine: just have the tooling collapse them all together on submission. While I don't think those fixup commits are that valuable and shouldn't be seen by reviewers, if we wanted, we could have tools continue to submit them and make them visible (like they are in e.g. GitHub pull requests today). But they wouldn't be the focus of review (again like GitHub pull requests today). Making integration requests commit-centric doesn't force people to adopt a different commit authorship workflow. But it does enable projects that wish to adopt more mature commit hygiene to do so. That being said, hows tools are implemented can impose restrictions. But that's nothing about commit-centric review that fundamentally prohibits the use of fixup commits in local workflows.

While I should create a dedicated post espousing the virtues of commit-centric workflows, I'll state my case through proxy by noting that some projects aren't using modern pull requests precisely because commit-centric workflows are not viable. When I was at Mozilla, one of the blockers to moving to GitHub was the pull request review tooling wasn't compatible with our world view that review units should be small. (This view is generally shared by Google, Facebook, and some prominent open source projects, among others.) And for reasons outlined earlier in this post, I think that as long as pull requests revolve around branches / merge diffs and aren't robust in the face of history rewriting (due to the lack of robust commit tracking), projects that insist on more refined practices will continue to eschew pull requests. Again, a link between review size and quality has been established. And better quality - along with its long-term effect of lowering development costs due to fewer bugs - can tip the scales in its favor, even against all the benefits you receive when using a product like GitHub, GitLab, or Bitbucket.

The Best of What's Around

Aspects of a better integration request exist in tools today. Unfortunately, many of these features are not present on pull requests as implemented by GitHub, GitLab, Bitbucket, etc. So to improve the pull request, these products will need to borrow ideas from other tools.

Integration requests not built around Git branches (Gerrit, Phabricattor, Review Board, etc) use identifiers in commit messages to track commits. This helps tracking commits across changes. There are compelling advantages to this model. Robust commit tracking is a requirement for commit-centric workflows. And it would even improve the functionality of branch-based pull requests. A well-designed integration request would have a robust commit tracking mechanism.

Gerrit has the best-in-class experience for commit-centric workflows. It is the only popular implementation of integration requests I'm aware of that supports and caters to this workflow by default. In fact, I don't think you can change this! (This behavior is user hostile in some cases since it forces users to know how to rewrite commits, which is often perilous in Git land. It would be nice if you could have Gerrit squash commits into the same review unit automatically on the server. But I understand the unwillingness to implement this feature because this has its own set of challenges around commit tracking, which I won't bore you with.) Gerrit also shows groups of related commits front and center when viewing a proposed change.

Phabricator is the only other tool I know of where one can achieve a reasonable commit-centric workflow without the pitfalls of orphaned comments, context overload, etc mentioned earlier in this post. But this requires non-standard submission tooling and commit series aren't featured prominently in the web UI. So Phabricator's implementation is not as solid as Gerrit's.

Another Gerrit feature worth lauding is the submission mechanism. You simply git push to a special ref. That's it. There's no fork to create. No need to create a Git branch. No need to create a separate pull request after the push. Gerrit just takes the commits you pushed and turns them into a request for review. And it doesn't require any additional client-side tooling!

Using a single common git command to submit and update an integration request is simpler and arguably more intuitive than other tools. Is Gerrit's submission perfect? No. The git push origin HEAD:refs/for/master syntax is not intuitive. And overloading submission options by effectively encoding URL parameters on the ref name is a gross - albeit effective - hack. But users will likely quickly learn the one-liner's or create more intuitive aliases.

The elegance of using just a git push to initiate an integration request puts Gerrit in a league of its own. I would be ecstatic if the GitHubs of the world reduced the complexity of submitting pull requests to simply clone the canonical repository, create some commits, and run a git command. The future of submitting integration requests* hopefully looks more like Gerrit than other alternatives.

What Needs Built

Some aspects of the better integration request don't yet exist or need considerable work before I consider them viable.

For tools which leverage the native version control tool for submission (e.g. via git push), there needs to be some work to support submission via a more generic, HTTP endpoint. I'm fine with leveraging git push as a submission mechanism because it makes the end-user experience so turnkey. But making it the only submission mechanism is a bit unfortunate. There is some support for this: I believe you can cobble together a pull request from scratch via GitHub's APIs, for example. But it isn't as simple as submit a patch to an endpoint, which it arguably should be. Even Gerrit's robust HTTP API, does not seem to allow creating new commits/diffs via that API. Anyway, this limitation not only excludes non-Git tools from using these tools, but also limits other tooling from submitting without using Git. For example, you may want to write a bot that proposes automated changes and it is much easier to produce a diff than to use git since the former does not require a filesystem (this matters in serverless environments for example).

A larger issue with many implementations is the over-reliance on Git for server storage. This is most pronounced in Gerrit, where not only are your git pushes stored in a Git repository on the Gerrit server, but every code review comment and reply is stored in Git as well! Git is a generic key-value store and you can store any data you want in it if you shoehorn it properly. And it is cool that all your Gerrit data can be replicated via git clone - this pretty much eliminates the we took a decentralized tool and centralized it via GitHub series of arguments. But if you apply this store everything in Git approach at scale, it means you will be running a Git server at scale. And not just any Git server - a write load heavy Git server! And if you have thousands of developers possibly all working out of the same repository, then you are looking at potentially millions of new Git refs per year. While the Git, Gerrit, and JGit people have done some fantastic work making these tools scale, I'd feel much better if we eschewed the make Git scale to infinite pushes and refs problem and used a more scalable approach, like an HTTP ingestion endpoint which writes data to key-value stores or relational databases. In order words, use of a version control tool for servicing integration requests at scale is a self-imposed footgun and could be avoided.

Conclusion

Congratulations on making it through my brain dump! As massive as the wall of text is, there are still plenty of topics I could have covered but didn't. This includes the more specific topic of code review and the various features that entails. I also largely ignored some general topics like the value that an integration request can serve on the overall development lifecycle: integration requests are more than just code review - they serve as a nexus to track the evolution of a change throughout time.

Hopefully this post gave you some idea at some of the structural issues at play with the integration of pull requests and integration requests. And if you are someone in a position to design or implement a better integration request or tooling around them (including in version control tools themselves), hopefully it gave you some good ideas or where to go next.

C Extension Support in PyOxidizer

June 30, 2019 at 04:40 PM | categories: Python, PyOxidizer

The initial release of PyOxidizer generated a bit of excitement across the Internet! The post was commented on heavily in various forums and my phone was constantly buzzing from all the Twitter activity. There has been a steady stream of GitHub Issues for the project, which I consider a good sign. So thank you everybody for the support and encouragement! And especially thank you to everyone who filed an issue or submitted a pull request!

While I don't usually read the comments, I was looking at various forums posting about PyOxidizer to see what reactions were like. A common negative theme was the lack of C extension support in PyOxidizer. People seemed dismissive of PyOxidizer because it didn't support C extensions. Despite the documentation stating that the feature was planned and that I had an idea for how to implement it, people seemed pessimistic. Perhaps I didn't adequately communicate that making C extensions work is actually a subset of the already-solved single file executable problem and therefore was already a solved problem at the technical level (only the integration with the Python and PyOxidizer build systems was missing). So in my mind C extension support was only a matter of time and the only open question was how many hacks would be needed to make it work, not whether it would work.

Well, I'm pleased to report that the just-released version 0.2 of PyOxidizer supports C extensions on Windows, macOS, and Linux. If you install a Python package through a pip-install-simple, pip-requirements-file, or setup-py-install packaging rule, C extensions will be compiled in a special way that enables them to be embedded in the same binary containing Python itself. I've tested it with the zope.interface, zstandard, and mercurial packages and it seems to work (although Mercurial has other issues that prevent it from being packaged as a PyOxidizer application - but the C extensions do compile).

There are some limitations to the support, however. I'm pretty confident the limitations can be eliminated given enough time. Given how many people were hung up on the lack of C extensions and were seemingly writing off PyOxidizer thinking it was snake oil or something, I wanted to deliver basic C extension support to curtail this line of criticism. Perfect is the enemy of good and hopefully basic C extension support is good enough to ease concerns about PyOxidizer's viability.

Also in PyOxidizer 0.2 are some minor new features, like the --pip-install and --python-code flags to pyoxidizer init. These allow you to generate a pyoxidizer.toml file pre-configured to install some packages from pip and run custom Python code. So now applications can be created and built with a one-liner without having to edit a pyoxidizer.toml file!

The full release notes are available. As always, please keep filing issues. I'm particularly interested in hearing about packages whose C extensions don't work properly.

« Previous Page -- Next Page »