Using Rust to Power Python Importing With oxidized_importer
May 10, 2020 at 01:15 PM | categories: Python, PyOxidizerI'm pleased to announce the availability of the oxidized_importer
Python package, a standalone version of the custom Python module importer
used by PyOxidizer.
oxidized_importer
- a Python extension module implemented in Rust - enables
Python applications to start and run quicker by providing an alternate,
more efficient mechanism for loading Python resources (such as source
and bytecode modules).
Installation instructions and detailed usage information are available in the official documentation. The rest of this post hopefully answers the questions of why are you doing this and why should I care.
In a traditional Python process, Python's module importer inspects the filesystem at run-time to find and load resources like Python source and bytecode modules. It is highly dynamic in nature and relies on the filesystem as a point-in-time source of truth for resource availability.
oxidized_importer
takes a different approach to resource loading that is
more static in nature and more suitable to application environments (where
Python resources aren't changing). Instead of dynamically probing the
filesystem for available resources, resources are instead indexed ahead
of time. When Python goes to resolve a resource (say it is looking to
import
a module), oxidized_importer
simply needs to perform a lookup
in an in-memory data structure to locate said resource. This means
oxidized_importer
only has marginal reliance on the filesystem, which
can make it much faster than Python's traditional importer. (Performance
benefits of binaries built with PyOxidizer have already been
clearly demonstrated.)
The oxidized_importer
Python extension module exposes parts of
PyOxidizer's packaging and run-time functionality to Python code, without
requiring the full use of PyOxidizer for application packaging.
Specifically, oxidized_importer
allows you to:
- Install a custom, high-performance module importer
(OxidizedFinder)
to service Python
import
statements and resource loading (potentially from memory, using zero-copy). - Scan the filesystem for Python resources
(source modules, bytecode files, package resources, distribution metadata,
etc) and turn them into Python objects, which can be loaded into
OxidizedFinder
instances. - Serialize Python resource data into an efficient binary data structure
for loading into an
OxidizedFinder
instance. This facilitates producing a standalone resources blob that can be distributed with a Python application which contains all the Python modules, bytecode, etc required to power that application. See the docs on freezing an application with oxidized_importer.
oxidized_importer
can be thought of as PyOxidizer-lite: it provides just
enough functionality to allow Python application maintainers to leverage some
of the technical advancements of PyOxidizer (such as in-memory module imports)
without using PyOxidizer for application packaging. oxidized_importer
can
work with the Python distribution already installed on your system. You just
pip install
it like any other Python package.
By releasing oxidized_importer
as a standalone Python package, my hope is
to allow more people to leverage some of the technical achievements and
performance benefits coming out of PyOxidizer. I also hope that having more
users of PyOxidizer's underlying code will help uncover bugs and conformance
issues, raising the quality and viability of the projects.
I would also like to use oxidized_importer
as an opportunity to advance the
discourse around Python's resource loading mechanism. Filesystem I/O can be
extremely slow, especially in mobile and embedded environments. Dynamically
probing the filesystem to service module imports can therefore be slow. (The
Python standard library has the zipimport
module for importing Python resources
from a zip file. But in my opinion, we can do much better.) I would like to
see Python move towards leveraging immutable, serialized data structures for
loading resources as efficiently as possible. After all, Python resources
like the Python standard library are likely not changing between Python process
invocations. The performance zealot in me cringes thinking of all the overhead
that Python's filesystem probing approach incurs - all of the excessive stat()
and other filesystem I/O calls that must be performed to answer questions about
state that is easily indexed and often doesn't change. oxidized_importer
represents my vision for what a high-performance Python resource loader should
look like. I hope it can be successful in steering Python towards a better
approach for resource loading.
I plan to release oxidized_importer
independently from PyOxidizer
. While
the projects will continue to be developed in the same
repository and will leverage the
same underlying Rust code, I view them as somewhat independent and serving
different audiences.
While oxidized_importer
evolved from facilitating PyOxidizer's run-time use
cases, I'm not opposed to taking it in new directions. For example, I would
entertain implementing Python's dynamic filesystem probing logic in
oxidized_importer
, allowing it to serve as a functional stand-in for the
official importer shipped with the Python standard library. I have little
doubt an importer implemented in 100% Rust would outperform the official
importer, which is implemented in Python. There's all kinds of possibilities
here, such as using a background thread to index sys.path
outside the
constraints of the GIL. But I don't want to get ahead of myself...
If you are a Python application maintainer and want to make your Python
processes execute a bit faster by leveraging a pre-built index of available
Python resources and/or taking advantage of in-memory module importing,
I highly encourage you to take a look at oxidized_importer
!
PyOxidizer 0.7
April 09, 2020 at 09:00 PM | categories: Python, PyOxidizerI am very pleased to announce the 0.7 release of PyOxidizer, a modern Python application packaging tool.
There are a host of notable new features in this release. You can read all about them in the project history.
I want to use this blog post to call out the more meaningful ones.
I started PyOxidizer as a science experiment of sorts: I sat out to prove the hypothesis that it was possible to produce high performance single file executables embedding Python and all of its resources (Python modules, non-module resource files, compiled extensions, etc). PyOxidizer has achieved this on Windows, Linux, and macOS since its very earliest releases. Hypothesis confirmed!
In order to actually achieve single file executables, you have to
fundamentally change aspects of Python's behavior. Some of these
changes invalidate deeply rooted assumptions about how Python works,
such as the existence of __file__
in modules. As you can imagine,
these broken assumptions translated to numerous compatibility issues
and PyOxidizer didn't work with many popular Python packages.
With the science experiment phase of PyOxidizer out of the way, I have been making a concerted effort to broaden the user base of PyOxidizer. While single file executables can be an amazing property, it isn't critical for many use cases and the issues it was causing were preventing people from exploring PyOxidizer.
This brings us to what I think are the major new features in PyOxidizer 0.7.
Better Support for Loading Extension Modules
Earlier versions of PyOxidizer insisted that you compile Python (C) extension modules from source and statically link them into a produced binary. This requirement prevented the use of pre-built extension modules (commonly found in Python binary wheels available on PyPI) with PyOxidizer, forcing people to compile them locally. While this often just worked for many extension modules, it frequently failed on complex extension modules and it frequently failed on Windows.
PyOxidizer now supports loading compiled extension modules from
standalone files (typically .so
or .pyd
files, which are actually
shared libraries). There are still some sharp edges and known
deficiencies. But in many cases, if you tell PyOxidizer to run
pip install
and package the result, pre-built wheels can be
installed and PyOxidizer will pick up the standalone files.
On Windows, PyOxidizer even supports embedding the shared library
data into the produced .exe
and loading the .pyd
/DLL directly
from memory.
Loading Resources from the Filesystem
Binaries built with PyOxidizer contain a blob holding an index of available Python resources along with their data.
Earlier versions of PyOxidizer only allowed you to define resources as in-memory. If the resource was defined in this blob, it was imported from memory. Otherwise it wasn't known to PyOxidizer. You could still install files next to the produced binary and tell PyOxidizer to enable Python's default filesystem-based importer. But PyOxidizer didn't explicitly know about these files on the filesystem.
In PyOxidizer 0.7, the blob index of Python resources is able to express different locations for that resource. Currently, a resource can have its data made available in-memory or filesystem-relative. in-memory works as before: the raw data is embedded next to the next in memory and loaded from there (using 0-copy). filesystem-relative encodes a filesystem path to the resource. During packaging, PyOxidizer will place the resource next to the executable (using a typical Python file layout scheme) and store the relative path to that resource in the resources index.
The filesystem-relative resource indexing feature has a few implications for PyOxidizer.
First, it is more standard. When PyOxidizer loads a Python
module from the filesystem, it sets __file__
, __path__
,
etc and the module semantics should behave as if the file
were imported by Python's standard importer. This means that
if a package is having issues with in-memory importing, you
can simply fall back to filesystem-relative to get standard
Python behavior and everything should just work.
Second, PyOxidizer's filesystem resource loading is faster
than Python's! When Python's standard importer goes to
import
a module, it needs to stat()
various paths to
first locate the file. It then performs some sanity checking
and other minor actions before actually importing the module.
All of this has overhead. Since the goal of PyOxidizer is
to produce standalone applications and applications should
be immutable, PyOxidizer can avoid most of this overhead.
PyOxidizer simply tries to open()
and read()
the relative
path baked into the resource index at build time. If that
works, the resource is loaded. Else there is a failure.
The code path in PyOxidizer to locate a Python resource
is effectively a lookup in a Rust HashMap<&str, T>
.
I thought it would be interesting to isolate the performance
benefits of this new feature. I ran Mercurial's test harness
with different variants of hg
on Linux on my Ryzen 3950X.
- traditional - A
hg
script with a#!/path/to/python3.7
shebang. - oxidized - A
hg
executable built with PyOxidizer, without PyOxidizer's custom module importer. - filesystem - A
hg
executable built with PyOxidizer using the new filesystem-relative resource index. - in-memory - A
hg
executable built with PyOxidizer with all resources loaded from memory (how PyOxidizer has traditionally worked).
The results are quite clear:
Variant | CPU Time (s) | Delta (s) | % Orig |
---|---|---|---|
traditional | 11,287 | -552 | 100 |
oxidized | 10,735 | -552 | 95.1 |
filesystem | 10,186 | -1,101 | 90.2 |
in-memory | 9,883 | -1,404 | 87.6 |
We see a nice win just from using a native executable built with PyOxidizer (traditional to oxidized).
Then from oxidized to filesystem we see another jump of ~5%. This difference is attributed to using PyOxidizer's Rust-powered importer with an index of resources available on the filesystem. In other words, all that work that Python's standard importer is doing to discover files and then operate on them is non-trivial!
Finally, the smaller jump from filesystem to in-memory isolates the benefits of importing resource data from memory instead of involving filesystem I/O. (Filesystems are generally slow.) While I haven't measured explicitly, I hypothesize that macOS and Windows will see a bigger jump between these two variants, as the filesystem performance on these platforms generally isn't as good as it is on Linux.
PyOxidizer's Future
With PyOxidizer now supporting a couple of much-needed features to support a broader set of users, I'm hoping that future releases of PyOxidizer continue to broaden the utility of PyOxidizer.
The over-arching goal of PyOxidizer is to solve large aspects of the Python application packaging and distribution problem. So far a lot of focus has been spent on the former. PyOxidizer in its current form can materialize files on the filesystem that you can copy or package up manually and distribute. But I want these processes to be part of PyOxidizer: I want it to be possible for PyOxidizer to emit a Windows MSI installer, a macOS dmg, a Debian package, etc for a Python application.
In order to support the aforementioned marquee features of this PyOxidizer release, I had to pay down a lot of technical debt in the code base left over from the science experiment phase of PyOxidizer's inception.
In the short term, I plan to continue shoring up the code base
and rounding out support for features requested in the
issue tracker on GitHub. The next release of PyOxidizer will
also likely require
Python 3.8, as this will improve run-time control over the
embedded Python interpreter and enable PyOxidizer to better
support package metadata (importlib.metadata
), enabling
support for features like entry points.
I've also been thinking about extracting PyOxidizer's custom
module importer to be usable as a standalone Python extension
module. I think there's some value in publishing a
pyoxidizer_importer
package on PyPI that you can easily
add to your installed packages to speed up Python's
standard filesystem importer by a few percent. If nothing else,
this may drum up interest in the larger Python community for
standardizing a format for serializing Python resources in a
single file. Perhaps we can get other Python packaging tools
producing the same
packed resources data
blob that PyOxidizer uses so we can all standardize on a
more efficient mechanism for loading Python modules. Time
will tell.
Enjoy the new release. File issues at https://github.com/indygreg/PyOxidizer as you encounter them.
Mercurial's Journey to and Reflections on Python 3
January 13, 2020 at 08:45 AM | categories: Python, MercurialMercurial 5.2 was released on November 5, 2019. It is the first version of Mercurial that supports Python 3. This milestone comes nearly 11 years after Python 3.0 was first released on December 3, 2008.
Speaking as a maintainer of Mercurial and an avid user of Python, I feel like the experience of making Mercurial work with Python 3 is worth sharing because there are a number of lessons to be learned.
This post is logically divided into two sections: a mostly factual recount of Mercurial's Python 3 porting effort and a more opinionated commentary of the transition to Python 3 and the Python language ecosystem as a whole. Those who don't care about the mechanics of porting a large Python project to Python 3 may want to skip the next section or two.
Porting Mercurial to Python 3
Let's start with a brief history lesson of Mercurial's support for Python 3 as told by its own commit history.
The Mercurial version control tool was first released in April 2005
(the same month that Git was initially released). Version 1.0 came out
in March 2008. The first reference to Python 3 I found in the code base
was in September 2008.
Then not much happens for a while until
June 2010, when
someone authors a bunch of changes to make the Python C extensions
start to recognize Python 3. Then things were again quiet for a while
until January 2013,
when a handful of changes landed to remove 2 argument raise
. There were
a handful of commits in 2014 but nothing worth calling out.
Mercurial's meaningful journey to Python 3 started in 2015. In code, the work started in April 2015, with effort to make Mercurial's test harness run with Python 3. Part of this was a decision that Python 3.5 (to be released several months later in September 2015) would be the minimum Python 3 version that Mercurial would support.
Once the Mercurial Project decided it wanted to port to Python 3 (as opposed to another language), one of the earliest decisions was how to perform that port. Mercurial's code base was too large to attempt a flag day conversion where there would be a Python 2 version and a Python 3 version and one day everyone would switch from Python 2 to 3. Mercurial needed a way to run the same code (or as much of the same code) on both Python 2 and 3. We would maintain a single code base and users would gradually switch from running with Python 2 to Python 3.
In May 2015,
Mercurial dropped support for Python 2.4 and 2.5. Dropping support for
these older Python versions was critical, as it was effectively impossible to
write Python code that ran on this wide gamut of versions because of
incompatibilities in syntax and language features. For example, you needed
Python 2.6 to get print()
via from __future__ import print_function
.
The project's late start at a Python 3 port can be significantly attributed
to Python 2.4 and 2.5 compatibility holding us back.
The main goal with Mercurial's early porting work was just getting the code base
to a point where import mercurial
would work. There were a myriad of places
where Mercurial used syntax that was invalid on Python 3 and Python 3
couldn't even parse the source code, let alone compile it to bytecode and
execute it.
This effort began in earnest in
June 2015
with global source code rewrites like using modern octal syntax,
modern exception catching syntax (except Exception as e
instead of
except Exception, e
), print()
instead of print
, and a
modern import convention
along with the use of from __future__ import absolute_import
.
In the early days of the port, our first goal was to get all source code
parsing as valid Python 3. The next step was to get all the modules import
ing
cleanly. This entailed fixing code that ran at import
time to work on
Python 3. Our thinking was that we would need the code base to be import
clean on Python 3 before seriously thinking about run-time behavior. In reality,
we quickly ported a lot of modules to import
cleanly and then moved on
to higher-level porting, leaving a long-tail of modules with import
failures.
This initial porting effort played out over months. There weren't many people working on it in the early days: a few people would basically hack on Python 3 as a form of itch scratching and most of the project's energy was focused on improving the existing Python 2 based product. You can get a rough idea of the timeline and participation in the early porting effort through the history of test-check-py3-compat.t. We see the test being added in December 2015, By June 2016, most of the code base was ported to our modern import convention and we were ready to move on to more meaningful porting.
One of the biggest early hurdles in our porting effort was how to overcome
the string literals type mismatch between Python 2 and 3. In Python 2, a
''
string literal is a sequence of bytes. In Python 3, a ''
string literal
is a sequence of Unicode code points. These are fundamentally different types.
And in Mercurial's code base, most of our string types are binary by design:
use of a Unicode based str
for representing data is flat out wrong for our use
case. We knew that Mercurial would need to eventually switch many string
literals from ''
to b''
to preserve type compatibility. But doing so would
be problematic.
In the early days of Mercurial's Python 3 port in 2015, Mercurial's project maintainer (Matt Mackall) set a ground rule that the Python 3 port shouldn't overly disrupt others: he wanted the Python 3 port to more or less happen in the background and not require every developer to be aware of Python 3's low-level behavior in order to get work done on the existing Python 2 code base. This may seem like a questionable decision (and I probably disagreed with him to some extent at the time because I was doing Python 3 porting work and the decision constrained this work). But it was the correct decision. Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case). What Matt was trying to do was minimize the externalized costs that a Python 3 port would inflict on the project. He correctly recognized that maintaining the existing product and supporting existing users was more important than a long-term bet in its infancy.
This ground rule meant that a mass insertion of b''
prefixes everywhere
was not desirable, as that would require developers to think about whether
a type was a bytes
or str
, a distinction they didn't have to worry about
on Python 2 because we practically never used the Unicode-based string type in
Mercurial.
In addition, there were some other practical issues with doing a bulk b''
prefix insertion. One was that the added b
characters would cause a lot of lines
to grow beyond our length limits and we'd have to reformat code. That would
require manual intervention and would significantly slow down porting. And
a sub-issue of adding all the b
prefixes and reformatting code is that it would
break annotate/blame more than was tolerable. The latter issue was addressed
by teaching Mercurial's annotate/blame feature to skip revisions. The project
now has a convention of annotating commit messages with # skip-blame <reason>
so structural only changes can easily be ignored when performing an
annotate/blame.
A stop-gap solution to the b''
everywhere issue came in
July 2016, when I
introduced a custom Python module importer that rewrote source code as part
of import
when running on Python 3. (I have
previously blogged
about this hack.) What this did was transparently add b''
prefixes to all
un-prefixed string literals as well as modify how a few common functions were
called so that we wouldn't need to modify source code so things would run natively
on Python 3. The source transformer allowed us to have the benefits of progressing
in our Python 3 port without having to rewrite tens of thousands of lines of
source code. The solution was hacky. But it enabled us to make significant
progress on the Python 3 port without externalizing a lot of cost onto others.
I thought the source transformer would be relatively short-lived and would be removed shortly after the project inevitably decided to go all in on Python 3. To my surprise, others built additional transforms over the years and the source transformer persisted all the way until October 2019, when I removed it just before the first non-alpha Python 3 compatible version of Mercurial was released.
A common problem Mercurial faced with making the code base dual Python 2/3 native
was dealing with standard library differences. Most of the problems stemmed
from changes between Python 2.7 and 3.5+. But there are changes within the
versions of Python 3 that we had to wallpaper over as well. In
April 2016, the
mercurial.pycompat
module was introduced to export aliases or wrappers around
standard library functionality to abstract the differences between Python
versions. This file grew over time
and eventually became
Mercurial's version of six. To be honest, I'm
not sure if we should have used six
from the beginning. six
probably would
have saved some work. But we had to eventually write a lot of shims for
converting between str
and bytes
and would have needed to invent a
pycompat
layer in some form anyway. So I'm not sure six
would have saved
enough effort to justify the baggage of integrating a 3rd party package into
Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers
like Debian get all hot and bothered and end up making questionable patches
to our source code. So we prefer to minimize the surface area for
problems by minimizing dependencies on 3rd party packages.)
Once we had a source transforming module importer and the pycompat
compatibility shim, we started to focus in earnest on making core
functionality actually work on Python 3. We established a convention of
annotating changesets needed for Python 3 with py3
, so a
commit message search
yields a lot of the history. (But it isn't a full history since not every Python 3
oriented change used this convention). We see from that history that after
the source importer landed, a lot of porting effort was spent on things
very early in the hg
process lifetime. This included handling environment
variables, loading config files, and argument parsing. We introduced a
test-check-py3-commands.t
test to track the progress of hg
commands working in Python 3. The very early
history of that file shows the various error messages changing, as underlying
early process functionality was slowly ported to work on Python 3. By
December 2016, we
had hg version
working on Python 3!
With basic hg
command dispatch ported to Python 3 at the end of 2016,
2017 represented an inflection point in the Python 3 porting effort. With the
early process functionality working, different people could pick up different
commands and code paths and start making code work with Python 3. By
March 2017, basic
repository opening and hg files
worked. Shortly thereafter,
hg init started working as well.
And hg status and
hg commit did as well.
Within a few months, enough of Mercurial's functionality was working with Python 3 that we started to track which tests passed on Python 3. The evolution of this file shows a reasonable history of the porting velocity.
In May 2017, we dropped support for Python 2.6. This significantly reduced the complexity of supporting Python 3, as there was tons of functionality in Python 2.7 that made it easier to target both Python 2 and 3 and now our hands were untied to utilize it.
In November 2017, I landed a test harness feature to report exceptions seen during test runs. I later refined the output so the most frequent failures were reported more prominently. This feature greatly enabled our ability to target the most common exceptions, allowing us to write patches to fix the most prevalent issues on Python 3 and uncover previously unknown failures.
By the end of 2017, we had most of the structural pieces in place to complete the port. Essentially all that was required at that point was time and labor. We didn't have a formal mechanism in place to target porting efforts. Instead, people would pick up a component or test that they wanted to hack on and then make incremental changes towards making that work. All the while, we didn't have a strict policy on not regressing Python 3 and regressions in Python 3 porting progress were semi-frequent. Although we did tend to correct regressions quickly. And over time, developers saw a flurry of Python 3 patches and slowly grew awareness of how to accommodate Python 3, and the number of Python 3 regressions became less frequent.
As useful as the source-transforming module importer was, it incurred some
additional burden for the porting effort. The source transformer effectively
converted all un-prefixed string literals (''
) to bytes literals (b''
)
to preserve string type behavior with Python 2. But various aspects of Python
3 didn't like the existence of bytes
. Various standard library functionality
now wanted unicode str
and didn't accept bytes
, even though the Python
2 implementation used the equivalent of bytes
. So our pycompat
layer
grew pretty large to accommodate calling into various standard library
functionality. Another side-effect which we didn't initially anticipate
was the **kwargs
calling convention. Python allows you to use **
with a dict with string keys to turn those keys into named arguments
in a function call. But Python 3 requires these dict
keys to be
str
and outright rejects bytes
keys, even if the bytes
instance
is ASCII safe and has the same underlying byte representation of the
string data as the str
instance would. So we had to invent support
functions that would convert dict
keys from bytes
to str
for
use with **kwargs
and another to convert a **kwargs
dict from
str
keys to bytes
keys so we could use ''
syntax to access keys
in our source code! Also on the string type front, we had to sprinkle
the codebase with raw string literals (r''
) to force the use of
str
irregardless of which Python version you were running on (our
source transformer only changed unprefixed string literals, so existing
r''
strings would be preserved as str
).
Blind transformation of all string literals to bytes
was less than ideal
and it did impose some unwanted side-effects. But, again, most strings
in Mercurial are bytes by design, so we thought it would be easier to
byteify all strings then selectively undo that where native strings
were actually warranted (like keys in most dict
s) than to take the
up-front cost to examine every string and make an intelligent determination
as to what type it should be. I go back and forth as to whether this was the
correct call. But when you factor in that the source transforming
module importer unblocked Python 3 porting at a time in the project's
history when there was so much focus on improving the core product and it
did so without externalizing many costs onto the people doing the critical
core product work, I think it was the right call.
By mid 2019, the number of test failures in Python 3 had been whittled down to a reasonable, less daunting number. It felt like victory was in grasp and inevitable. But a few significant issues lingered.
One remaining question was around addressing differences between Python 3 versions. At the time, Python 3.5, 3.6, and 3.7 were released and 3.8 was scheduled for release by the end of the year. We had a surprising number of issues with differences in Python 3 versions. Many of us were running Python 3.7, so it had the fewest failures. We had to spend extra effort to get Python 3.5 and 3.6 working as well as 3.7. Same for 3.8.
Another task we deferred until the second half of 2019 was standing up robust CI for Python 3. We had some coverage, but it was minimal. Wanting a distraction from PyOxidizer for a bit and wanting to overhaul Mercurial's CI system (which is officially built on Buildbot), I cobbled together a serverless CI system built on top of AWS DynamoDB and S3 for storage, Lambda functions and CloudWatch events for all business logic, and EC2 spot instances for job execution. This CI system executed Python 3.5, 3.6, 3.7, and 3.8 variants of our test harness on Linux and Python 3.7 on Windows. This gave developers insight into version-specific failures. More importantly, it also gave insight into Windows failures, which was previously not well tested. It was discovered that Python 3 on Windows was lagging significantly behind POSIX.
By the time of the Mercurial developer meetup in October 2019, nearly all tests were passing on POSIX platforms and we were confident that we could declare Python 3 support as at least beta quality for the Mercurial 5.2 release, planned for early November.
One of our blockers for ripping off the alpha label on Python 3 support
was removing our source-transforming module importer. It had performance
implications and it wasn't something we wanted to ship because it felt
too hacky. A blocker for this was we wanted to automatically format
our source tree with black
because if we removed the source transformer, we'd have to rewrite
a lot of source code to apply changes the transformer was performing,
which would necessitate wrapping a lot of lines, which would involve a lot
of manual effort. We wanted to blacken our code base first so that
mass rewriting source code wouldn't involve a lot of tedious reformatting
since black
would handle that for us automatically. And rewriting the
source tree with black
was blocked on a specific feature landing in
black
! (We did not agree with black
's behavior of
unwrapping comma-delimited lists of items if they could fit on a single
line. So one of our core contributors wrote a patch to black
that
changed its behavior so a trailing ,
in a list of items will force
items to be formatted on multiple lines. I personally find the multiple line
formatting much easier to read. And the behavior is arguably better for
code review and annotation, which is line based.) Once this feature
landed in black
, we reformatted our source tree and started ripping
out the source transformations, starting by inserting b''
literals
everywhere. By late October, the source transformer was no more and
we were ready to release beta quality support for Python 3 (at least
on UNIX-like platforms).
Having described a mostly factual overview of Mercurial's port to Python 3, it is now time to shift gears to the speculative and opinionated parts of this post. I want to underscore that the opinions reflected here are my own and do not reflect the overall Mercurial Project or even a consensus within it.
The Future of Python 3 and Mercurial
Mercurial's port to Python 3 is still ongoing. While we've shipped Python 3 support and the test harness is clean on Python 3, I view shipping as only a milestone - arguably the most important one - in a longer journey. There's still a lot of work to do.
It is now 2020 and Python 2 support is now officially dead from the perspective of the Python language maintainers. Linux distributions are starting to rip out Python 2. Packages are dropping Python 2 support in new versions. The world is moving to Python 3 only. But Mercurial still officially supports Python 2. And it is still yet to be determined how long we will retain support for Python 2 in the code base. We've only had one release supporting Python 3. Our users still need to port their extensions (implemented in Python). Our users still need to start widely using Mercurial with Python 3. Even our own developers need to switch to Python 3 (old habits are hard to break).
I anticipate a long tail of random bugs in Mercurial on Python 3. While the tests may pass, our code coverage is not 100%. And even if it were, Python is a dynamic language and there are tons of invariants that aren't caught at compile time and can only be discovered at run time. These invariants cannot all be detected by tests, no matter how good your test coverage is. This is a feature/limitation of dynamic languages. Our users will likely be finding a long tail of miscellaneous bugs on Python 3 for years.
At present, our code base is littered with tons of random hacks to bridge the gap between Python 2 and 3. Once Python 2 support is dropped, we'll need to remove these hacks and make the source tree Python 3 native, with minimal shims to wallpaper over differences in Python 3 versions. Removing this Python version bridge code will likely require hundreds of commits and will be a non-trivial effort. It's likely to be deemed a low priority (it is glorified busy work after all), and code for the express purpose of supporting Python 2 will likely linger for years.
We are also still shoring up our packaging and distribution story on Python 3. This is easier on some platforms than others. I created PyOxidizer partially because of the poor experience I had with Python application packaging and distribution through the Mercurial Project. The Mercurial Project has already signed off on using PyOxidizer for distributing Mercurial in the future. So look for an oxidized Mercurial distribution in the near future! (You could argue PyOxidizer is an epic yak shave to better support Mercurial. But that's for another post.)
Then there's Windows support. A Python 3 powered Mercurial on Windows still has a handful of known issues. It may require a few more releases before we consider Python 3 on Windows to be stable.
Because we're still on a code base that must support Python 2, our
adoption of Python 3 features is very limited. The only Python 3
feature that Mercurial developers seem to almost universally get excited
about is type annotations. We already have some people playing around
with pytype
using comment-based annotations and pytype
has already
caught a few bugs. We're eager to go all in on type annotations and
uncover lots of dynamic typing bugs and poorly implemented APIs.
Beyond type annotations, I can't name any feature that people are screaming
to adopt and which makes a lot of sense for Mercurial. There's a long
tail of minor features I'm sure will get utilized. But none of the
marquee features that define major language releases seem that interesting
to us. Time will tell.
Commentary on Python 3
Having described Mercurial's ongoing journey to Python 3, I now want to focus more on Python itself. Again, the opinions here are my own and don't reflect those of the Mercurial Project.
Succinctly, my experience porting Mercurial and other projects to Python 3 has significantly soured my perceptions of Python. As much as I have historically loved Python - from the language to the welcoming community - I am still struggling to understand how Python could manage to inflict so much hardship on the community by choosing the transition plan that they did. I believe Python's choices represent a terrific example of what not to do when managing a large project or ecosystem. Maintainers of other largely-deployed systems would benefit from taking the time to understand and reflect on Python's missteps.
Python 3.0 was released on December 3, 2008. And it took the better part of a decade for the community to embrace it. This should be universally recognized as a failure. While hindsight is 20/20, many of the issues with Python 3 were obvious at the time and could have been mitigated had the language maintainers been more accommodating - and dare I say empathetic - to its users.
Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility. In the early years of Python 3, the attitude of Python's maintainers was Python 3 is a new, better language: you should target it explicitly. There were some tools and methods to ease the transition. But nothing super polished, especially in the early years. Adoption of Python 3 in the overall community was slow. Python developers in the wild justifiably complained that the value proposition of Python 3 was too weak to justify porting effort. Not helping was that the early advice for targeting Python 3 was to rewrite the source code to become Python 3 native. This is in contrast with using the same source to run on both Python 2 and 3. For library and application maintainers, this potentially meant maintaining separate versions of your code or forcing end-users to make a giant leap, which would realistically orphan users on an old version, fragmenting your user base. Neither of those were great alternatives, so you can understand why many projects didn't bite.
For many projects of non-trivial size, flag day transitions from Python 2 to 3 were simply not viable: the pathway to Python 3 was to make code dual Python 2/3 compatible and gradually switch over the runtime to Python 3. But initial versions of Python 3 made this effectively impossible! Let me give a few specific examples.
In Python 2, a string literal ''
is effectively an array of bytes. In
Python 3, it is a series of Unicode code points - a fundamentally different
type! In Python 2, you could write b''
to be explicit that a string literal
was bytes or you could write u''
to indicate a Unicode literal, mimicking
Python 3's behavior. In Python 3, you could write b''
to create a bytes
instance. But for whatever reason, Python 3 initially removed the u''
syntax,
meaning there wasn't as easy way to explicitly denote the type of each
string literal so that it was consistent between Python 2 and 3! Python 3.3
(released September 2012) restored u''
support, making it more viable to
write Python source code that worked on both Python 2 and 3. For nearly 4
years, Python 3 took away the consistent syntax for denoting bytes/Unicode
string literals.
Another feature was %
formatting of strings. Python 2 allowed use of the
%
formatting operator on both its string types. But Python 3 initially
removed the implementation of %
from bytes
. Why, I have no clue. It
is perfectly reasonable to splice byte sequences into a buffer via use of
a formatting string. But the Python language maintainers insisted otherwise.
And it wasn't until the community complained about its absence loudly enough
that this feature was
restored in Python 3.5,
which was released in September 2015. Fun fact: the lack of this feature was
once considered a blocker for Mercurial moving to Python 3 because
Mercurial uses bytes
almost universally, which meant that nearly every use
of %
would have to be changed to something else. And to this day, Python
3's bytes
still doesn't have a format()
method, so the alternative was
effectively string concatenation, which is a massive step backwards from the
expressiveness of %
formatting.
The initial approach of Python 3 mirrors a folly that many developers and projects make: attempting a rewrite instead of performing incremental evolution. For established projects, large scale rewrites often go poorly. And Python 3 is no exception. Yes, from a code level, CPython (and likely other Python implementations) were incremental changes over Python 2 using the same code base. But from a language and standard library level, the differences in Python 3 were significant enough that I - and even Python's core maintainers - considered it a new language, and therefore a rewrite. When your random project attempts a rewrite and fails, the blast radius of that is often contained to that project. Maybe you don't publish a new release as soon as you otherwise would. But when you are powering an ecosystem, the ripple effects from a failed rewrite percolate throughout that ecosystem and last for years and have many second order effects. We see this with Python 3, where poor choices made in the late 2000s are inflicting significant hardship still in 2020.
From the initial restrained adoption of Python 3, it is obvious that the
Python ecosystem overwhelmingly rejected the initial boil the oceans approach
of Python 3. Python's maintainers eventually got the message and started
restoring features like u''
and bytes
%
formatting back into the
language to placate the community. All the while Python 3 had been accumulating
new features and the cumulative sum of those features was compelling enough
to win over users.
For many projects (including Mercurial), Python 3.4/3.5 was the first viable porting target for Python 3. Python 3.5 was released in September 2015, almost 7 years after Python 3.0 was released in December 2008. Seven. Years. An ecosystem that falters for that long is generally not healthy. What may have saved Python from total collapse here is that Python 2 was still going strong and people were generally happy with it. I really do think Python dodged a bullet here, because there was a massive window where the language could have hemorrhaged a critical amount of its user base and been relegated to an afterthought. One could draw an analogy to Perl, which lost out to PHP, Python, and Ruby, and whose fall from grace aligned with a lengthy transition from Perl 5 to 6.
If you look back at the early history of Python 3, I think you are forced to conclude that Python effectively kneecapped itself for 5-7 years through questionable implementation choices that prevented users from incurring incremental transitions between the major language versions. 2008 to 2013-2015 should be known as the lost years of Python because so much opportunity and energy was squandered. Yes, Python is still healthy today and Python 3 is (finally) being adopted at scale. But had earlier versions of Python 3 been more empathetic towards Python 2 users porting to it, Python and Python 3 in 2020 would be even stronger than it is. The community was artificially hindered for years. And we won't know until 2023-2025 what things could have looked like in 2020 had the Python core language team spent more time paving a smoother road between the major language versions.
To be clear, I do think Python 3 is generally a better language than Python 2. It has fewer warts, more compelling features, and better performance (except for startup time, which is still slower than Python 2). I am ecstatic the community is finally rallying around Python 3! For my Python coding, it has reached the point where I curse under my breath when I need to support Python 2 or even older versions of Python 3, like 3.5 or 3.6: I just wish the world would move on and adopt the future already!
But I would be remiss if I failed to mention some of my gripes with Python 3 beyond the transition shenanigans.
Perhaps my least favorite feature of Python 3 is its insistence that the
world is Unicode. In Python 2, the default string type was backed by
bytes. In Python 3, the default string type is backed by Unicode code
points. As part of that transition, large parts of the standard library
now operate in the Unicode space instead of the domain of bytes. I understand
why Python does this: they want strings to be Unicode and don't want
users to have to spend that much energy thinking about when to use
str
versus bytes
. This approach is admirable and somewhat defensible
because it takes a stand on a solution that is arguably good enough for
most users. However, the approach of assuming the world is Unicode is
flat out wrong and has significant implications for systems level
applications (like version control tools).
There are a myriad of places in Python's standard library where Python
insists on using the Unicode-backed str
type and rejects bytes
. For
example, various networking modules refuse to accept bytes
for hostnames
or URLs. HTTP libraries won't accept bytes
for HTTP header names or values.
Functions that are proxies to POSIX-defined functions won't accept bytes
even though the POSIX function it calls into is using char *
and isn't
Unicode aware. Then there's filename handling, where Python assumes the
existence of a global encoding for filenames and uses this encoding to convert
between str
and bytes
. And it does this despite POSIX filesystem paths
being a bag of bytes where the only rules are that \0
terminates the
filename and /
is special.
In cases like Python refusing to accept bytes
for things like HTTP
header names (which will just be spit out over the wire as bytes), Python's
pendulum has swung too far towards Unicode only. In my opinion, Python needs
to be more accommodating and allow bytes
when it makes sense. I hope the
pendulum knocks some sense into people when it swings back towards a more
reasonable solution that better acknowledges the realities of the world we
live in.
For areas like filename handling, the world is more complicated. Python
is effectively an abstraction layer over the operating system APIs exposing
this functionality. And there is often an impedance mismatch between operating
systems. For example, POSIX (Linux) tends to use char *
for everything
and doesn't care about encoding and Windows tends to use 16 bit character
types where the encoding is... a can of worms.
The reality here is that it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality. But Python 3 attempts to do it anyway, making Python 3 unsuitable (or at least highly undesirable) for certain systems level applications that rely on it (like a version control tool).
In fairness to Python, it isn't the only programming language that gets
this wrong. The only language I've seen properly implement higher-order
abstractions on top of operating system facilities is Rust, whose approach can
be generalized as use Python 3's solution of normalizing to Unicode/UTF-8 by
default, but expose escape hatches which allow access to the raw underlying
types and APIs used by the operating system for the advanced consumers who
require it. For example, Rust's Path
type which represents a filesystem path
allows access
to the raw OsStr value
used by the operating system, not a normalization of it to bytes or Unicode,
which may be lossy. This allows consumers to e.g. create and retrieve
OS-native filesystem paths without data loss. This functionality is critical
in some domains. Python 3's awareness/insistence that the world is
Unicode (which it isn't universally) reduces Python's applicability in these
domains.
Speaking of Rust, at the Mercurial developer meetup in October 2019, we were discussing the use of Rust in Mercurial and one of the core maintainers blurted out something along the lines of if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3. As crazy as it initially sounded, I think I agree with that assessment. With the benefit of hindsight, having been a key player in the Python 3 porting effort, seeing all the complications and headaches Python 3 is introducing, and having learned Rust and witnessed its benefits for performance, control, and correctness firsthand, porting to Rust would likely have been the correct move for the project at that point in time. 2020 is not 2014, however, and I'm not sure if I would opt for a rewrite in Rust today. (Most rewrites are follies after all.) But I know one thing: I certainly wouldn't implement a new version control tool in Python 3 and I would probably choose Rust as an implementation language for most new projects in the systems level space or with an expected shelf life of 10+ years. (I really should blog about how awesome Rust is.)
Back to the topic of Python itself, I'm really soured on Python at this point in time. The effort required to port to Python 3 was staggering. For Mercurial, Python 3 introduces a ton of problems and doesn't really solve many. We effectively sludged through mud for several years only to wind up in a state that feels strictly worse than where we started. I'm sure it will be strictly better in a few years. But at that point, we're talking about a 5+ year transition. To call the Python 3 transition disruptive and distracting for the project would be an understatement. As a project maintainer, it's natural to ask what we could have accomplished if we weren't forced to carry out this sideshow.
I can't shake the feeling that a lot of the pain afflicted by the Python 3
transition could have been avoided had Python's language leadership made
a different set of decisions and more highly prioritized the transition
experience. (Like not initially removing features like u''
and bytes %
and not introducing gratuitous backwards compatibility breaks, like with
items()/iteritems()
. I would have also liked to see a feature like
from __future__
- maybe from __past__
- that would make it easier for
Python 3 code to target semantics in earlier versions in order to provide
a more turnkey on-ramp onto new versions.) I simultaneously see Python 3
losing its position as a justifiable tool in some domains (like systems
level tooling) due to ongoing design decisions and poor implementation (like
startup overhead problems). (In contrast, I see Rust excelling where Python
is faltering and find Rust code surprisingly expressive to write and maintain
given how low-level it is and therefore feel that Rust is a compelling
alternative to Python in a surprisingly large number of domains.)
Look, I know it is easy for me to armchair quarterback and critique with the benefit of hindsight/ignorance. I'm sure there is a lot of nuance here. I'm sure there was disagreement within the Python community over a lot of these issues. Maintaining a large and successful programming language and community like Python's is hard and you aren't going to please all the people all the time. And speaking as a maintainer, I have mad respect for the people leading such a large community. But niceties aside, everyone knows the Python 3 transition was rough and could have gone better. It should not have taken 11 years to get to where we are today.
I'd like to encourage the Python Project to conduct a thorough postmortem on the transition to Python 3. Identify what went well, what could have gone better, and what should be done next time such a large language change is wanted. Speaking as a Python user, a maintainer of a Python project, and as someone in industry who is now skeptical about use of Python at work due to risks of potentially company crippling high-effort migrations in the future, a postmortem would help restore my confidence that Python's maintainers learned from the various missteps on the road to Python 3 and these potentially ecosystem crippling mistakes won't be made again.
Python had a wildly successful past few decades. And it can continue to thrive for several more. But the Python 3 migration was painful for all involved. And as much as we need to move on and leave Python 2 behind us, there are some important lessons to be learned. I hope the Python community takes the opportunity to reflect and am confident it will grow stronger by taking the time to do so.
Problems with Pull Requests and How to Fix Them
January 07, 2020 at 12:10 PM | categories: Mercurial, GitYou've probably used or at least heard of pull requests: the pull request is the contribution workflow practiced on and made popular by [code] collaboration sites like GitHub, GitLab, Bitbucket, and others. Someone (optionally) creates a fork, authors some commits, pushes them to a branch, then creates a pull request to track integrating those commits into a target repository and branch. The pull request is then used as a vehicle for code review, tracking automated checks, and discussion until it is ready to be integrated. Integration is usually performed by a project maintainer, often with the click of a merge button on the pull request's web page.
It's worth noting that the term pull request is not universally used:
GitLab calls them merge requests for example. Furthermore I regard the
terms pull request and merge request to be poorly named, as the terms
can be conflated with terminology used by your version control tool (e.g.
git pull
or git merge
. And the implementations of a pull or merge
request may not even perform a pull or a merge (you can also rebase a
pull/merge request, but nobody is calling them rebase requests). A
modern day pull request is so much more than a version control tool
operation or even a simple request to pull or merge a branch: it is a
nexus to track the integration of a proposed change before during and after
that change is integrated. But alas. Because GitHub coined the term and is
the most popular collaboration platform implementing this functionality, I'll
refer to this general workflow as implemented on GitHub, GitLab, Bitbucket,
and others as pull requests for the remainder of this post.
Pull requests have existed in essentially their current form for over a decade. The core workflow has remained mostly unchanged. What is different are the addition of value-add features, such as integrating status checks like CI results, the ability to rebase or squash commits instead of merging, code review tooling improvements, and lots of UI polish. GitLab deserves a call out here, as their implementation of merge requests tracks so much more than other tools do. (This is a side-effect of GitLab having more built-in features than comparable tools.) I will also give kudos to GitLab for adding new features to pull requests when GitHub was asleep at the wheel as a company a few years ago. (Not having a CEO for clear product/company leadership really showed.) Fortunately, both companies (and others) are now churning out new, useful features at a terrific clip, greatly benefiting the industry!
While I don't have evidence of this, I suspect pull requests (and the forking
model used by services that implement them) came into existence when someone
thought how do I design a collaboration web site built on top of Git's new and
novel distributed nature and branching features. They then proceeded to invent
forking and pull requests. After all, the pull request as implemented by
GitHub was initially a veneer over a common Git workflow of create a clone,
create a branch, and send it somewhere. Without GitHub, you would run git
clone
, git branch
, then some other command like
git request-pull (where have I seen
those words before) to generate/send your branch somewhere. On GitHub, the
comparable steps are roughly create a fork, create a branch to your fork,
and submit a pull request. Today, you can even do all of this straight from
the web interface without having to run git
directly! This means that GitHub
can conceptually be thought of as a purely server-side abstraction/implementation
of the Git feature branch workflow.
At its core, the pull request is fundamentally a nice UI and feature layer built around the common Git feature branch workflow. It was likely initially conceived as polish and value-add features over this historically client-side workflow. And this core property of pull requests from its very first days has been copied by vendors like Bitbucket and GitLab (and in Bitbucket's case it was implemented for Mercurial - not Git - as Bitbucket was initially Mercurial only).
A decade is an eternity in the computer industry. As they say, if you aren't moving forward, you are moving backward. I think it is time for industry to scrutinize the pull request model and to evolve it into something better.
I know what you are thinking: you are thinking that pull requests work great and that they are popular because they are a superior model compared to what came before. These statements - aside from some nuance - are true. But if you live in the version control space (like I do) or are paid to deliver tools and workflows to developers to improve productivity and code/product quality (which I am), the deficiencies in the pull request workflow and implementation of that workflow among vendors like GitHub, GitLab, Bitbucket, etc are obvious and begging to be overhauled if not replaced wholesale.
So buckle in: you've started a ten thousand word adventure about everything you didn't think you wanted to know about pull requests!
Problems with Pull Requests
To build a better workflow, we first have to understand what is wrong/sub-optimal with pull requests.
I posit that the foremost goal of an pull request is to foster the incorporation of a high quality and desired change into a target repository with minimal overhead and complexity for submitter, integrator, and everyone in between. Pull requests achieve this goal by fostering collaboration to discuss the change (including code review), tracking automated checks against the change, linking to related issues, etc. In other words, the way I see the world is that a specific vendor's pull request implementation is just that: an implementation detail. And like all implementation details, they should be frequently scrutinized and changed, if necessary.
Let's start dissecting the problems with pull requests by focusing on the size of review units. Research by Google, Microsoft here, and here, and others has shown an inverse correlation with review unit size and defect rate. In Google's words (emphasis mine):
The size distribution of changes is an important factor in the quality of the code review process. Previous studies have found that the number of useful comments decreases and the review latency increases as the size of the change increases. Size also influences developers' perception of the code review process; a survey of Mozilla contributors found that developers feel that size-related factors have the greatest effect on review latency. A correlation between change size and review quality is acknowledged by Google and developers are strongly encouraged to make small, incremental changes (with the exception of large deletions and automated refactoring). These findings and our study support the value of reviewing small changes and the need for research and tools to help developers create such small, self-contained code changes for review.
Succinctly, larger changes result in fewer useful comments during review (meaning quality is undermined) and make reviews take longer (meaning productivity is undermined). Takeaway: if you care about defect rate / quality and/or velocity, you should be authoring and reviewing more, smaller changes as opposed to fewer, larger changes.
I stronger agree with Google's opinion on this matter and wholeheartedly endorse writing more, smaller changes. Having practiced both forms of change authorship, I can say without a doubt that more, smaller changes is superior: superior for authors, superior for code reviewers, and superior for people looking at repository history later. The main downside with this model is that it requires a bit more knowledge of your version control tool to execute. And, it requires corresponding tooling to play well with this change authorship model and to introduce as little friction as possible along the way since the number of interactions with tooling will increase as change size decreases, velocity increases, and there are more distinct units of change being considered for integration.
That last point is important and is germane to this post because the common implementation of pull requests today is not very compatible with the many small changes workflow. As I'll argue, the current implementation of pull requests actively discourages the many smaller changes workflow. And since smaller changes result in higher quality and faster reviews, today's implementations of pull requests are undermining quality and velocity.
I don't mean to pick on them, but since they are the most popular and the people who made pull requests popular, let's use GitHub's implementation of pull requests to demonstrate my point.
I posit that in order for us to author more, smaller changes, we must either a) create more, smaller pull requests or b) have pull request reviews put emphasis on the individual commits (as opposed to the overall merge diff). Let's examine these individually.
If we were to author more, smaller pull requests, this would seemingly necessitate the need for dependencies between pull requests in order to maintain velocity. And dependencies between pull requests adds a potentially prohibitive amount of overhead. Let me explain. We don't want to sacrifice the overall rate at which authors and maintainers are able to integrate proposed changes. If we were to split existing proposed changes into more, smaller pull requests, we would have a lot more pull requests. Without dependencies between them, authors could wait for each pull request to be integrated before sending the next one. But this would incur more round trips between author and integrator and would almost certainly slow down the overall process. That's not desirable. The obvious mitigation to that is to allow multiple, related pull requests in flight simultaneously. But this would necessitate the invention of dependencies between pull requests in order to track relationships so one pull request doesn't integrate before another it logically depends on. This is certainly technically doable. But it imposes considerable overhead of its own. How do you define dependencies? Are dependencies automatically detected or updated based on commits in a DAG? If yes, what happens when you force push and it is ambiguous whether a new commit is a logically new commit or a successor of a previous one? If no, do you really want to impose additional hurdles on submitters to define dependencies between every pull request? In the extreme case of one pull request per commit, do you make someone submitting a series of say twenty commits and pull requests really annotate nineteen dependencies? That's crazy!
There's another, more practical issue at play: the interplay between
Git branches and pull requests. As implemented on GitHub, a pull request
is tracked by a Git branch. If we have N inter-dependent pull requests,
that means N Git branches. In the worst case, we have one Git branch for
every Git commit. Managing N in-flight Git branches would be absurd.
It would impose considerable overhead on pull request submitters. It
would perfectly highlight the inefficiency in Git's
game of refs
branch management that I blogged about two years ago. (Succinctly,
once you are accustomed to workflows - like Mercurial's - which don't
require you to name commits or branches, Git's forced naming of branches
and all the commands requiring those branch names feels grossly inefficient
and a mountain of overhead.) Some tooling could certainly be implemented
to enable efficient submission of pull requests. (See
ghstack for an example.) But I
think the interplay between Git branches and GitHub pull requests is
sufficiently complex that the tooling and workflow would be intractable
for anything but the most trivial and best-case scenarios. Keep in mind
that any sufficiently user-friendly solution to this problem would
also entail improving git rebase
so it moves branches on rewritten
ancestor commits instead of leaving them on the old versions of
commits. (Seriously, someone should implement this feature: it arguably
makes sense as the default behavior for local branches.) In other
words, I don't think you can implement the multiple pull request model
reliably and without causing excessive burden on people without
fundamentally changing the requirement that a pull request be a Git
branch. (I'd love to be proven wrong.)
Therefore, I don't think the more, smaller changes workflow can be easily practiced with multiple pull requests using the common GitHub model without effectively moving the definition of a pull request away from equivalence with a Git branch (more on this later). And I also don't mean to imply that dependencies between pull requests can't be implemented: they can and GitLab is evidence. But GitLab's implementation is somewhat simple and crude (possibly because doing anything more complicated is really hard as I speculate).
So without fundamentally changing the relationship between a pull request and a branch, that leaves us with our alternative of pull requests putting more emphasis on the individual changes rather than the merge diff. Let's talk about that now.
Pull requests have historically placed emphasis on the merge diff.
That is, GitHub (or another provider) takes the Git branch you have
submitted, runs a git merge
against the target branch behind the
scenes, and displays that diff front and center for review as the main
proposed unit of change: if you click the Files changed tab to
commence review, you are seeing this overall merge diff. You can
click on the Commits tab then select an individual commit to review
just that commit. Or you can use the dropdown on the Files changed
tab to select an individual commit to review it. These (relatively
new) features are a very welcome improvement and do facilitate
performing a commit-by-commit review, which is a requirement to realize
the benefits of a more, smaller changes workflow. Unfortunately, they
are far from sufficient to fully realize the benefits of that workflow.
Defaults matter and GitHub's default is to show the merge diff when conducting review. (I bet a large percentage of users don't even know it is possible to review individual commits.) Since larger changes result in a higher defect rate and slower review, GitHub's default of showing the merge diff effectively means GitHub is defaulting to lower quality, longer-lasting reviews. (I suppose this is good for engagement numbers, as it inflates service usage both immediately and in the long-term due to subsequent bugs driving further usage. But I sincerely hope no product manager is thinking let's design a product that undermines quality to drive engagement.)
Unfortunately, a trivial change of the default to show individual commits instead of the merge diff is not so simple, as many authors and projects don't practice clean commit authorship practices, where individual commits are authored such that they can be reviewed in isolation.
(One way of classifying commit authorship styles is by whether a series of commits is authored such that each commit is good in isolation or whether the effect of applying the overall series is what matters. A handful of mature projects - like the Linux kernel, Firefox, Chrome, Git, and Mercurial - practice the series of individually-good commits model, which I'll call a commit-centric workflow. I would wager the majority of projects on GitHub and similar services practice the we only care about the final result of the series of commits model. A litmus test for practicing the latter model is whether pull requests contain commits like fixup foo or if subsequent revisions to pull requests create new commits instead of amending existing ones. I'm a strong proponent of a clean commit history where each commit in the final repository history stands as good in isolation. But I tend to favor more grown-up software development practices and am a version control guru. That being said, the subject/debate is fodder for another post.)
If GitHub (or someone else) switched the pull request default to a per-commit review without otherwise changing the relationship between a pull request and a Git branch, that would force a lot of less experienced users to familiarize themselves with history rewriting in Git. This would impose considerable pain and suffering on pull request authors, which would in turn upset users, hurt engagement, etc. Therefore, I don't think this is a feasible global default that can be changed. Maybe if Git's user experience for history rewriting were better or we didn't have a decade of behavior to undo we'd be in a better position... But pull request implementations don't need to make a global change: they could right the ship by offering projects that practice clean commit practices an option to change the review default so it emphasizes individual commits instead of the merge diff. This would go a long way towards encouraging authoring and reviewing individual commits, which should have positive benefits on review velocity and code quality outcomes.
But even if these services did emphasize individual commits by default in pull request reviews, there's still a handful of significant deficiencies that would undermine the more, smaller changes workflow that we desire.
While it is possible to review individual commits, all the review comments are still funneled into a single per pull request timeline view of activity. If submitter and reviewer make the effort to craft and subsequently review individual commits, your reward is that all the feedback for the discrete units of change gets lumped together into one massive pile of feedback for the pull request as a whole. This unified pile of feedback (currently) does a poor job of identifying which commit it applies to and gives the author little assistance in knowing which commits need amending to address the feedback. This undermines the value of commit-centric workflows and effectively pushes commit authors towards the fixup style of commit authorship. In order to execute per-commit review effectively, review comments and discussion need to be bucketed by commit and not combined into a unified pull request timeline. This would be a massive change to the pull request user interface and would be a daunting undertaking, so it is understandable why it hasn't happened yet. And such an undertaking would also require addressing subtly complex issues like how to preserve reviews in the face of force pushes. Today, GitHub's review comments can lose context when force pushes occur. Things are better than they used to be, when review comments left on individual commits would flat out be deleted (yes: GitHub really did effectively lose code review comments for several years.) But even with tooling improvements, problems still remain and should adoption of commit-level review tracking occur, these technical problems would likely need resolution to appease users of this workflow.
Even if GitHub (or someone else) implements robust per-commit review for pull requests, there's still a problem with velocity. And that problem is that if the pull request is your unit of integration (read: merging), then you have to wait until every commit is reviewed before integration can occur. This may sound tolerable (it's what we practice today after all). But I argue this is less optimal than a world where a change integrates as soon as it is ready to, without having to wait for the changes after it. As an author and maintainer, if I see a change that is ready to integrate, I prefer to integrate it as soon as possible, without delay. The longer a ready-to-integrate change lingers, the longer it is susceptible to bit rot (when the change is no longer valid/good due to other changes in the system). Integrating a judged-good change sooner also reduces the time to meaningful feedback: if there is a fundamental problem early in a series of changes that isn't caught before integration, integrating earlier changes sooner without waiting for the ones following will expose problems sooner. This minimizes deltas in changed systems (often making regression hunting easier), often minimizes the blast radius if something goes wrong, and gives the author more time and less pressure to amend subsequent commits that haven't been integrated yet. And in addition to all of this, integrating more often just feels better. The Progress Principle states that people feel better and perform better work when they are making continuous progress. But setbacks more than offset the power of small wins. While I'm not aware of any explicit research in this area, my interpretation of the Progress Principle to change authorship and project maintenance(which is supported by anecdotal observation) is that a steady stream of integrated changes feels a hell of a lot better than a single monolithic change lingering in review purgatory for what can often seem like an eternity. While you need to be cognizant to not confuse movement with meaningful progress, I think there is real power to the Progress Principle and that we should aim to incorporate changes as soon as they are ready and not any later. Applied to version control and code review, this means integrating a commit as soon as author, reviewer, and our machine overlords reporting status checks all agree it is ready, without having to wait for a larger unit of work, like the pull request. Succinctly, move forward as soon as you are able to!
This desire to integrate faster has significant implications for
pull requests. Again, looking at GitHub's implementation of pull
requests, I don't see how today's pull requests could adapt to this
desired end state without significant structural changes. For starters,
review must grow the ability to track per-commit state otherwise
integrating individual commits without the entirety of the parts makes
little sense. But this entails all the complexity I described above.
Then there's the problem of Git branches effectively defining a pull
request. What happens when some commits in a pull request are integrated
and the author rebases or merges their local branch against their new
changes? This may or may not just work. And when it doesn't just work,
the author can easily find themselves in merge conflict hell, where
one commit after the other fails to apply cleanly and their carefully
curated stack of commits quickly becomes a liability and impediment to
forward progress. (As an aside, the Mercurial version control tool has
a concept called changeset evolution where it tracks which commits -
changesets in Mercurial parlance - have been rewritten as other commits
and gracefully reacts in situations like a rebase. For example, if you
have commits X
and Y
and X
is integrated via a rebase as X'
, an
hg rebase
of Y
onto X'
will see that X
was rewritten as X'
and
skip attempting to rebase X
because it is already applied! This cleanly
sidesteps a lot of the problems with history rewriting - like merge
conflicts - and can make the end-user experience much more pleasant as
a result.) While it is certainly possible to integrate changes as soon
as they are ready with a pull request workflow, I think that it is
awkward and that by the time you've made enough changes to accommodate
the workflow, very little is left of the pull request workflow as we
know it and it is effectively a different workflow altogether.
The above arguments overly hinge on the assumption that more smaller changes is superior for quality and/or velocity and that we should design workflows around this assertion. While I strongly believe in the merits of smaller units of change, others may disagree. (If you do disagree, you should ask yourself whether you believe the converse: that larger units of change are better for quality and velocity. I suspect most people can't justify this. But I do believe there is merit to the argument that smaller units of change impose additional per-unit costs or have second order effects that undermine their touted quality or velocity benefits.)
But even if you don't buy into the change size arguments, there's still a very valid reason why we should think beyond pull requests as they are implemented today: tool scalability.
The implementation of pull requests today is strongly coupled with
how Git works out of the box. A pull request is initiated from a Git
branch pushed to a remote Git repository. When the pull request is
created, the server creates a Git branch/ref referring to that pull
request's head commits. On GitHub, these refs are
named
pull/ID/head
(you can fetch these from the remote Git repository but
they are not fetched by default). Also when a pull request is created
or updated, a git merge
is performed to produce a diff for review.
On GitHub, the resulting merge commit is saved and pointed to on
the open pull request via a pull/ID/merge
ref, which can also be
fetched locally. (The merge ref is deleted when the pull request is
closed.)
Herein resides our scalability problem: unbound growth of Git refs and ever-increasing rate of a change for a growing project. Each Git ref adds overhead to graph walking operations and data exchange. While involved operations are continuously getting optimized (often through the use of more advanced data structures or algorithms), there are intrinsic scaling challenges with this unbound growth that - speaking as a version control tool maintainer - I want no part of. Are technical solutions enabling things to scale to millions of Git refs viable? Yes'ish. But it requires high-effort solutions like JGit's Reftable, which required ~90 review rounds spanned across ~4 months to land. And that's after design of the feature was first proposed at least as far back as July 2017. Don't get me wrong: am I glad Reftable exists: yes. It is a fantastic solution to a real problem and reading how it works will probably make you a better engineer. But simultaneously, it is a solution to a problem that does not need to exist. There is a space for scaling graph data structures and algorithms to millions or even billions of nodes, edges, and paths: your version control tool should not be it. Millions or billions of commits and files: that's fine. But scaling the number of distinct paths through that graph by introducing millions of DAG heads is insane given how much complexity it introduces in random areas of the tool. In my opinion it requires unjustifiably large amounts of investment to make work at scale. As an engineer, my inclination when posed with problems like these is to avoid them in the first place. The easiest problems to solve are those you don't have.
Unfortunately, the tight coupling of pull requests to Git branches/refs introduces unbound growth and a myriad of problems associated with it. Most projects may not grow to a size that experiences these problems. But as someone who has experience with this problem space at multiple companies, I can tell you the problem is very real and the performance and scalability issues it creates undermines the viability of using today's implementation of pull requests once you've reached a certain scale. Since we can likely fix the underlying scaling issues with Git, I don't think the explosion of Git refs is a long-term deal breaker for scaling pull requests. But it is today and will remain so until Git and the tools built on top of it improve.
In summary, some high-level problems with pull requests are as follows:
- Review of merge diff by default encourages larger units of review, which undermines quality and velocity outcomes.
- Inability to incrementally integrate commits within a pull request, which slows down velocity, time to meaningful feedback, and can lower morale.
- Tight coupling of pull requests with Git branches adds rigidity to workflows and shoehorns into less flexible and less desired workflows.
- Deficiencies in the Git user experience - particular around what happens when rewrites (including rebase) occur - significantly curtail what workflows can be safely practiced with pull requests.
- Tight coupling of pull requests with Git branches can lead to performance issues at scale.
We can invert language to arrive at a set of more ideal outcomes:
- Review experience is optimized for individual commits - not the merge diff - so review units are smaller and quality and velocity outcomes are improved.
- Ability to incrementally integrate individual commits from a larger set so ready-to-go changes are incorporated sooner, improving velocity, time to meaningful feedback, and morale.
- How you use Git branches does not impose significant restrictions on handling of pull requests.
- You can use your version control tool how you want, without having to worry about your workflow being shoehorned by how pull requests work.
- Pull request server can scale to the most demanding use cases with relative ease.
Let's talk about how we could achieve these more desirable outcomes.
Exploring Alternative Models
A pull request is merely an implementation pattern for the general problem space of integrating a proposed change. There are other patterns used by other tools. Before I describe them, I want to coin the term integration request to refer to the generic concept of requesting some change being integrated elsewhere. GitHub pull requests and GitLab merge requests are implementations of integration requests, for example.
Rather than describe alternative tools in detail, I will outline the key areas where different tools differ from pull requests and assess the benefits and drawbacks to the different approaches.
Use of the VCS for Data Exchange
One can classify implementations of integration requests by how they utilize the underlying version control tools.
Before Git and GitHub came along, you were probably running a centralized version control tool which didn't support offline commits or feature branches (e.g. CVS or Subversion). In this world, the common mechanism for integration requests was exchanging diffs or patches through various media - email, post to a web service of a code review tool, etc. Your version control tool didn't speak directly to a VCS server to initiate an integration request. Instead, you would run a command which would export a text-based representation of the change and then send it somewhere.
Today, we can classify integration requests by whether or not they speak
the version control tool's native protocol for exchanging data or whether
they exchange patches through some other mechanism. Pull requests speak the
VCS native protocol. Tools like Review Board and Phabricator exchange patches
via custom HTTP web services. Typically, tools using non-native exchange
will require additional client-side configuration, including potentially the
installation of a custom tool (e.g. RBTools
for Review Board or Arcanist
for Phabricator). Although modern version control tools sometimes have
this functionality built-in. (e.g. Git and Mercurial fulfill
Zawinski's law and
Mercurial has a Phabricator extension in its official distribution).
An interesting outlier is Gerrit, which ingests its integration requests via
git push
. (See
the docs.)
But the way Gerrit's ingestion via git push
works is fundamentally different
from how pull requests work! With pull requests, you are pushing your local
branch to a remote branch and a pull request is built around that remote branch.
With Gerrit, your push command is like git push gerrit HEAD:refs/for/master
.
For the non-gurus, that HEAD:refs/for/master
syntax means, push the HEAD
commit (effectively the commit corresponding to the working directory) to the
refs/for/master
ref on the gerrit
remote (the SOURCE:DEST
syntax specifies a mapping of local revision identifier to remote ref). The
wizard behind the curtain here is that Gerrit runs a special Git server that
implements non-standard behavior for the refs/for/*
refs. When you push to
refs/for/master
, Gerrit receives your Git push like a normal Git server would.
But instead of writing a ref named refs/for/master
, it takes the incoming commits
and ingests them into a code review request! Gerrit will create Git refs for the
pushed commits. But it mainly does that for its internal tracking (Gerrit stores
all its data in Git - from Git data to review comments). And if
that functionality isn't too magical for you, you can also pass parameters to
Gerrit via the ref name! e.g. git push gerrit HEAD refs/for/master%private
will create a private review request that requires special permissions to see.
(It is debatable whether overloading the ref name for additional functionality
is a good user experience for average users. But you can't argue that this isn't
a cool hack!)
On the surface, it may seem like using the version control tool's native data
exchange is a superior workflow because it is more native and more modern.
(Emailing patches is so old school.) Gone are the days of having to configure
client-side tools to export and submit patches. Instead, you run git push
and
your changes can be turned into an integration request automatically or with
a few mouse clicks. And from a technical level, this exchange methodology is
likely safer, as round-tripping a text-based representation of a change without
data loss is surprisingly finicky. (e.g. JSON's lack of lossless binary data
exchange without encoding to e.g. base64 first often means that many services
exchanging text-based patches are lossy, especially in the presence of content
which doesn't conform to UTF-8, which can be commonplace in tests. You would be
surprised how many tools experience data loss when converting version control
commits/diffs to text. But I digress). Having Git's wire protocol exchange
binary data is safer than exchanging text patches and probably easier to use
since it doesn't require any additional client-side configuration.
But despite being more native, modern, and arguably robust, exchange via the version control tool may not be better.
For starters, use of the version control tool's native wire protocol inhibits use of arbitrary version control tools on the client. When your integration request requires the use of a version control tool's wire protocol, the client likely needs to be running that version control tool. With other approaches like exchange of text based patches, the client could be running any software it wanted: as long as it could spit out a patch or API request in the format the server needed, an integration request could be created! This meant there was less potential for lock-in, as people could use their own tools on their machines if they wanted and they (hopefully) wouldn't be inflicting their choice on others. Case in point, a majority of Firefox developers use Mercurial - the VCS of the canonical repository - but a large number use Git on the client. Because Firefox is using Phabricator (Review Board and Bugzilla before that) for code review and because Phabricator ingests text-based patches, the choice of the VCS on the client doesn't matter that much and the choice of the server VCS can be made without inciting a holy war among developers who would be forced to use a tool they don't prefer. Yes, there are good reasons for using a consistent tool (including organizational overhead) and sometimes mandates for tool use are justified. But in many cases (such as random open source contributions), it probably doesn't or shouldn't matter. And in cases like Git and Mercurial, where tools like the fantastic git-cinnabar make it possible to easily convert between the repositories without data loss and acceptable overhead, adoption of the version control tool's native wire protocol can exclude or inhibit the productivity of contributors since it can mandate use of specific, undesired tooling.
Another issue with using the version control tool's wire protocol is that it
often forces or strongly encourages you to work a certain way. Take GitHub
pull requests for example. The pull request is defined around the remote Git
branch that you git push
. If you want to update that branch, you need to know
its name. So that requires some overhead to either create and track that branch
or find its name when you want to update it. Contrast with Gerrit, where you
don't have an explicit remote branch you push to: you simply git push gerrit
HEAD:refs/for/master
and it figures things out automatically (more on this
later). With Gerrit, I don't have to create a local Git branch to initiate an
integration request. With pull requests, I'm compelled to. And this can undermine
my productivity by compelling me to practice less-efficient workflows!
Our final point of comparison involves scalability. When you use the version control tool wire protocol as part of integration requests, you have introduced the problem of scaling your version control server. Take it from someone who has had multiple jobs involving scaling version control servers and who is intimately aware of the low-level details of both the Git and Mercurial wire protocols: you don't want to be in the business of scaling a version control server. The wire protocols for both Git and Mercurial were designed in a now-ancient era of computing and weren't designed by network protocol experts. They are fundamentally difficult to scale at just the wire protocol level. I've heard stories that at one time, the most expensive single server at Google was their Perforce or Perforce-derived server (this was several years ago - Google has since moved on to a better architecture).
The poor network protocols of version control tools have many side-effects, including the inability or sheer difficulty of using distributed storage on the server. So in order to scale compute horizontally, you need to invest in expensive network storage solutions or devise a replication and synchronization strategy. And take it from someone who worked on data synchronization products (outside of the source control space) at three companies: this is a problem you don't want to solve yourself. Data synchronization is intrinsically difficult and rife with difficult trade-offs. It's almost always a problem best avoided if you have a choice in the matter.
If creating Git refs is part of creating an integration request, you've introduced a scaling challenge with the number of Git refs. Do these Git refs live forever? What happens when you have thousands of developers - possibly all working in the same repository - and the number of refs or ref mutations grows to the hundreds of thousands or millions per year?
Can your version control server handle ingesting a push every second or two
with reasonable performance? Unless you are Google, Facebook, or a handful of
other companies I'm aware of, it can't. And before you cry that I'm talking
about problems that only plague the 0.01% of companies out there, I can name a
handful of companies under 10% the size of these behemoths where this is a
problem for them. And I also guarantee that many people don't have client-side
metrics for their git push
P99 times or reliability and don't even realize
there is a problem! Scaling version control is probably not a core part of your
company's business. Unfortunately, it all too often becomes something companies
have to allocate resources for because of poorly designed or utilized tools.
Contrast the challenges of scaling integration requests with a native version control server versus just exchanging patches. With the more primitive approach, you are probably sending the patch over HTTP to a web service. And with tools like Phabricator and Review Board, that patch gets turned into rows in a relational database. I guarantee it will be easier to scale an HTTP web service fronting a relational database than it will be your version control server. If nothing else, it should be easier to manage and debug, as there are tons more experts in these domains than in the version control server domain!
Yes, it is true that many will not hit the scaling limits of the version control server. And some nifty solutions for scaling do exist. But large segments of this problem space - including the version control tool maintainers having to support crazy scaling vectors in their tools - could be avoided completely if integration requests didn't lean so heavily on the version control tools's default mode of operation. Unfortunately, solutions like GitHub pull requests and Gerrit's use of Git refs for storing everything exert a lot of pressure on scaling the version control server and make this a very real problem once you reach a certain scale.
Hopefully the above paragraphs enlightened you to some of the implications that the choice of a data exchange mechanism has on integration requests! Let's move on to another point of comparison.
Commit Tracking
One can classify implementations of integration requests by how they track
commits through their integration lifecycle. What I mean by this is how the
integration request follows the same logical change as it evolves. For
example, if you submit a commit then amend it, how does the system know that the
commit evolved from commit X
to X'
.
Pull requests don't track commits directly. Instead, a commit is part of a Git branch and that branch is tracked as the entity the pull request is built around. The review interface presents the merge diff front and center. It is possible to view individual commits. But as far as I know, none of these tools have smarts to explicitly track or map commits across new submissions. Instead, they simply assume that the commit order will be the same. If commits are reordered or added or removed in the middle of an existing series, the tool can get confused quite easily. (With GitHub, it was once possible for a review comment left on a commit to disappear entirely. The behavior has since been corrected and if GitHub doesn't know where to print a comment from a previous commit, it renders it as part of the pull request's timeline view.)
If all you are familiar with is pull requests, you may not realize there are alternatives to commit tracking! In fact, the most common alternative (which isn't do nothing) predates pull requests entirely and is still practiced by various tools today.
The way that Gerrit, Phabricator, and Review Board work is the commit message
contains a unique token identifying the integration request for that commit.
e.g. a commit message for a Phabricator review will contain the line
Differential Revision: https://phab.mercurial-scm.org/D7543
. Gerrit will
have something like Change-Id: Id9bfca21f7697ebf85d6a6fa7bac7de4358d7a43
.
The way this annotation appears in the commit message differs by tool.
Gerrit's web UI advertises a shell one-liner to clone repositories which
not only performs a git clone
but also uses curl
to download a shell
script from the Gerrit server and install it as Git's commit-msg
hook in the
newly-cloned repositories. This Git hook will ensure that any newly-created
commit has a Change-ID: XXX
line containing a randomly generated, hopefully
unique identifier. Phabricator and Review Board leverage client-side tooling
to rewrite commit messages after submission to their respective tool so
the commit message contains the URL of the code review. One can debate
which approach is better - they each have advantages and drawbacks.
Fortunately, this debate is not germane to this post, so we won't cover it here.
What is important is how this metadata in commit messages is used.
The commit message metadata comes into play when a commit is being ingested into an integration request. If a commit message lacks metadata or references an entity that doesn't exist, the receiving system assumes it is new. If the metadata matches an entity on file, the incoming commit is often automatically matched up to an existing commit, even if its Git SHA is different!
This approach of inserting a tracking identifier into commit messages works surprisingly well for tracking the evolution of commits! Even if you amend, reorder, insert, or remove commits, the tool can often figure out what matches up to previous submissions and reconcile state accordingly. Although support for this varies by tool. Mercurial's extension for submitting to Phabricator is smart enough to take the local commit DAG into account and change dependencies of review units in Phabricator to reflect the new DAG shape, for example.
The tracking of commits is another one of those areas where the simpler and more modern features of pull requests often don't work as well as the solutions that came before. Yes, inserting an identifier into commit messages feels hacky and can be brittle at times (some tools don't implement commit rewriting very well and this can lead to a poor user experience). But you can't argue with the results: using explicit, stable identifiers to track commits is far more robust than the heuristics that pull requests rely on. The false negative/positive rate is so much lower. (I know this from first hand experience because we attempted to implement commit tracking heuristics for a code review tool at Mozilla before Phabricator was deployed and there were a surprising number of corner cases we couldn't handle properly. And this was using Mercurial's obsolescence markers, which gave us commit evolution data generated directly by the version control tool! If that didn't work well enough, it's hard to imagine an heuristic that would. We eventually gave up and used stable identifiers in commit messages, which fixed most of the annoying corner cases.)
The use of explicit commit tracking identifiers may not seem like it makes a meaningful difference. But it's impact is profound.
The obvious benefit of tracking identifiers is that they allow rewriting commits without confusing the integration request tool. This means that people can perform advanced history rewriting with near impunity as to how it would affect the integration request. I am a heavy history rewriter. I like curating a series of individually high-quality commits that can each stand in isolation. When I submit a series like this to a GitHub pull request and receive feedback on something I need to change, when I enact those changes I have to think will my rewriting history here make re-review harder? (I try to be empathetic with the reviewer and make their life easier whenever possible. I ask what I would appreciate someone doing if I were reviewing their change and tend to do that.) With GitHub pull requests, if I reorder commits or add or remove a commit in the middle of a series, I realize that this may make review comments left on those commits hard to find since GitHub won't be able to sort out the history rewriting. And this may mean those review comments get lost and are ultimately not acted upon, leading to bugs or otherwise deficient changes. This is a textbook example of tooling deficiencies dictating a sub-optimal workflow and outcome: because pull requests don't track commits explicitly, I'm forced to adopt a non-ideal workflow or sacrifice something like commit quality in order to minimize risks that the review tool won't get confused. In general, tools should not externalize these kinds of costs or trade-offs onto users: they should just work and optimize for generally agreed-upon ideal outcomes.
Another benefit to tracking identifiers is that they enable per-commit review to be viable. Once you can track the logical evolution of a single commit, you can start to associate things like review comments with individual commits with a high degree of confidence. With pull requests (as they are implemented today), you can attempt to associate comments with commits. But because you are unable to track commits across rewrites with an acceptably high degree of success, rewritten commits often fall through the cracks, orphaning data like review comments with them. Data loss is bad, so you need a place to collect this orphaned data. The main pull request activity timeline facilitates this function.
But once you can track commits reliably (and tools like Gerrit and Phabricator prove this is possible), you don't have this severe problem of data loss and therefore don't need to worry about finding a place to collect orphaned data! You are then able to create per-commit review units, each as loosely coupled with other commits and an overall series as you want to make it!
It is interesting to note the different approaches in different tools here. it is doubly interesting to note behavior that is possible with the review tool itself and what it does by default!
Let's examine Phabricator. Phabricator's review unit is the
Differential revision. (Differential is the name of the code review tool
in Phabricator, which is actually a suite of functionality - like GitLab,
but not nearly as feature complete.) A Differential revision represents a
single diff. Differential revisions can have parent-child relationships
with others. Multiple revisions associated like this form a conceptual
stack in Phabricator's terminology. Go to
https://phab.mercurial-scm.org/D4414
and search for stack to see it in action. (Stack is a bit misleading name
because the parent-child relationships actually form a DAG and Phabricator is
capable of rendering things like multiple children in its graphical view.)
Phabricator's official client side tool for
submitting to Phabricator - Arcanist or arc
- has default behavior
of collapsing all Git commits into a single Differential revision.
Phabricator can preserve metadata from the individual commits (it can render at least the commit messages in the web UI so you can see where the Differential revision came from). In other words, by default Arcanist does not construct multiple Differential revisions for each commit and therefore does not construct parent-child relationships for them. So there is no stack to render here. To be honest, I'm not sure if modern versions of Arcanist even support doing this. I do know both Mercurial and Mozilla authored custom client side tools for submitting to Phabricator to work around deficiencies like this in Arcanist. Mozilla's may or may not be generally suitable for users outside of Mozilla - I'm not sure.
Another interesting aspect of Phabricator is that there is no concept of an over-arching series. Instead, each Differential revision stands in isolation. They can form parent-child relationships and constitute a stack. But there is no primary UI or APIs for stacks (the last I looked anyway). This may seem radical. You may be asking questions like how do I track the overall state of a series or how do I convey information pertinent to the series as a whole. These are good questions. But without diving into them, the answer is that as radical as it sounds to not have an overall tracking entity for a series of Differential revisions, it does work. And having used this workflow with the Mercurial Project for a few years, I can say I'm not missing the functionality that much.
Gerrit is also worth examining. Like Phabricator, Gerrit uses an identifier in commit messages to track the commit. But whereas Phabricator rewrites commit messages at initially submission time to contain the URL that was created as part of that submission, Gerrit peppers the commit message with a unique identifier at commit creation time. The server then maintains a mapping of commit identifier to review unit. Implementation details aside, the end result is similar: individual commits can be tracked more easily.
What distinguishes Gerrit from Phabricator is that Gerrit does have a stronger grouping around multiple commits. Gerrit will track when commits are submitted together and will render both a relation chain and submitted together list automatically. While it lacks the visual beauty of Phabricator's implementation, it is effective and is shown in the UI by default, unlike Phabricator.
Another difference from Phabricator is that Gerrit uses per-commit review by default. Whereas you need a non-official client for Phabricator to submit a series of commits to constitute a linked chain, Gerrit does this by default. And as far as I can tell, there's no way to tell Gerrit to squash your local commits down to a single diff for review: if you want a single review to appear, you must first squash commits locally then push the squashed commit. (More on this topic later in the post.)
A secondary benefit of per-commit review is that this model enables incremental integration workflows, where some commits in a series or set can integrate before others, without having to wait for the entire batch. Incremental integration of commits can drastically speed up certain workflows, as commits can integrate as soon as they are ready and not any longer. The benefits of this model can be incredible. But actually deploying this workflow can be tricky. One problem is that your version control tool may get confused when you rebase or merge partially landed state. Another problem is it can increase the overall change rate of the repository, which may strain systems from version control to CI to deployment mechanisms. Another potential problem involves communicating review sign-off from integration sign-off. Many tools/workflows conflate I sign off on this change and I sign off on landing this change. While they are effectively identical in many cases, there are some valid cases where you want to track these distinctly. And adopting a workflow where commits can integrate incrementally will expose these corner cases. So before you go down this path, you want to be thinking about who integrates commits and when they are integrated. (You should probably be thinking about this anyway because it is important.)
Designing a Better Integration Request
Having described some problems with pull requests and alternate ways of going about solving the general problem of integration requests, it is time to answer the million dollar problem: designing a better integration request. (When you factor in the time people spend in pull requests and the cost of bugs / low quality changes that slip through due to design of existing tooling, improving integration requests industry wide would be a lot more valuable than $1M.)
As a reminder, the pull request is fundamentally a nice UI and set of features built around the common Git feature branch workflow. This property is preserved from the earliest days of pull requests in 2007-2008 and has been copied by vendors like Bitbucket and GitLab in the years since. In my mind, pull requests should be ripe for overhaul.
Replace Forks
The first change I would make to pull requests is to move away from forks being a required part of the workflow. This may seem radical. But it isn't!
A fork on services like GitHub is a fully fledged project - just like the canonical project it was forked from. It has its own issues, wiki, releases, pull requests, etc. Now show of hands: how often do you use these features on a fork? Me neither. In the overwhelming majority of cases, a fork exists solely as a vehicle to initiate a pull request against the repository it was forked from. It serves little to no additional meaningful functionality. Now, I'm not saying forks don't serve a purpose - they certainly do! But in the case of someone wanting to propose a change to a repository, a fork is not strictly required and its existence is imposed on us by the current implementation of pull requests.
I said impose in the previous sentence because forks introduce overhead and confusion. The existence of a fork may confuse someone as to where a canonical project lives. Forks also add overhead in the version control tool. Their existence forces the user to manage an additional Git remote and branches. It forces people to remember to keep their branches in sync on their fork. As if remembering to keep your local repository in sync wasn't hard enough! And if pushing to a fork, you need to re-push data that was already pushed to the canonical repository, even though that data already exists on the server (just in a different view of the Git repository). (I believe Git is working on wire protocol improvements to mitigate this.)
When merely used as a vehicle to initiate integration requests, I do not believe forks offer enough value to justify their existence. Should forks exist: yes. Should people be forced to use them in order to contribute changes, no. (Valid use cases for a fork would be to perform a community splinter of a project and to create an independent entity for reasons such as better guarantees of data availability and integrity.)
Forks are essentially a veneer on top of a server-side git clone
. And the
reason why a separate Git repository is used at all is probably because the
earliest versions of GitHub were just a pile of abstractions over git
commands. The service took off in popularity, people copied its features
almost verbatim, and nobody ever looked back and thought why are we doing
things like this in the first place.
To answer what we would replace forks with, we have to go back to first principles and ask what are we trying to do. And that is propose a unit a change against an existing project. And for version control tools, all you need to propose a change is a patch/commit. So to replace forks, we just need an alternate mechanism to submit patches/commits to an existing project.
My preferred alternative to forks is to use git push
directly to the
canonical repository. This could be implemented like Gerrit where you push
to a special ref. e.g. git push origin HEAD:refs/for/master
. Or - and this is
my preferred solution - version control servers could grow more smarts about
how pushes work - possibly even changing what commands like git push
do if
the server is operating in special modes.
One idea would be for the Git server to expose different refs namespaces
depending on the authenticated user. For example, I'm indygreg
on GitHub.
If I wanted to propose a change to a project - let's say python/cpython
-
I would git clone git@github.com:python/cpython
. I would create a branch -
say indygreg/proposed-change
. I would then
git push origin indygreg/proposed-change
and because the branch prefix matches
my authenticated username, the server lets it through. I can then open a
pull request without a fork! (Using branch prefixes is less than ideal, but
it should be relatively easy to implement on the server. A better approach
would rely on remapping Git ref names. But this may require a bit more
configuration with current versions of Git than users are willing to stomach.
An even better solution would be for Git to grow some functionality to make
this easier. e.g. git push --workspace origin proposed-change
would push
proposed-change
to a workspace on the origin
remote, which Git would
know how to translate to a proper remote ref update.)
Another idea would be for the version control server to invent a new concept
for exchanging commits - one based on sets of commits instead of DAG
synchronization. Essentially, instead of doing a complicated discovery
dance to synchronize commits with the underlying Git repository, the server
would ingest and expose representations of sets of commits stored next
to - but not within - the repository itself. This way you are not scaling the
repository DAG to infinite heads - which is a hard problem! A concrete
implementation of this might have the client run a git push --workspace
origin proposed-change
to tell the remote server to store your
proposed-change
branch in your personal workspace (sorry for reusing the term
from the previous paragraph). The Git server would receive your commits,
generate a standalone blob to hold them, save that blob to a key-value
store like S3, then update a mapping of which commits/branches are in
which blobs in a data store such as a relational database somewhere. This
would effectively segment the core project data from the more transient
branch data, keeping the core repository clean and pure. It allows the server
to lean on easier-to-scale data stores such as key-value blob stores and
relational databases instead of the version control tool. I know this idea
is feasible because Facebook implemented it for Mercurial. The infinitepush
extension essentially siphons Mercurial bundles (standalone files holding
commit data) off to a blob store when pushes come in over the wire. At
hg pull
time, if a requested revision is not present in the repository, the
server asks the database-backed blob index if the revision exists anywhere.
If it does, the blob/bundle is fetched, dynamically overlayed onto the
repository in memory, and served to the client. While the infinitepush
extension in the official Mercurial project is somewhat lacking (through no
fault of Facebook's), the core idea is solid and I wish someone would spend
the time to flush out the design a bit more because it really could lead
to logically scaling repositories to infinite DAG heads without the
complexities of actually scaling scaling DAG algorithms, repository storage,
and version control tool algorithms to infinite heads. Getting back to the
subject of integration requests, one could imagine having a target for
workspace pushes. For example, git push --workspace=review origin
would
push to the review
workspace, which would automatically initiate a code
review.
Astute readers of this blog may find these ideas familiar. I proposed user namespaces in my /blog/2017/12/11/high-level-problems-with-git-and-how-to-fix-them/ post a few years ago. So read there for more on implications of doing away with forks.
Could forks be done away with as a requirement to submit pull requests?
Yes! Gerrit's git push origin HEAD:refs/for/master
mechanism proves it.
Is Gerrit's approach too much magic or confusing for normal users? I'm not
sure. Could Git grow features to make the user experience much better so
users don't need to be burdened with complexity or magic and could
simply run commands like git submit --for review
? Definitely!
Shift Focus From Branches to Individual Commits
My ideal integration request revolves around individual commits, not branches. While the client may submit a branch to initiate or update an integration request, the integration request is composed of a set of loosely coupled commits, where parent-child relationships can exist to express a dependency between commits. Each commit is evaluated individually. Although someone may need to inspect multiple commits to gain a full understanding of the proposed change. And some UI enabling operations against a group of related commits (such as mass deleting abandoned commits) may be warranted.
In this world, the branch would not matter. Instead, commits are king. Because we would be abandoning the branch name as a tracker for the integration request, we would need something to replace it, otherwise we have no way of knowing how to update an existing integration request! We should do what tools like Phabricator, Gerrit, and Review Board do and add a persistent identifier to commits which survive history rewriting. (Branch-based pull requests should do this anyway so history rewrites don't confuse the review tool and e.g. cause comments to get orphaned - see above.)
It's worth noting that a commit-centric integration request model does not imply that everyone is writing or reviewing series of smaller commits! While titans of industry and I strongly encourage the authorship of smaller commits, commit-centric integration requests don't intrinsically force you to do so. This is because commit-centric integration requests aren't forcing you to change your local workflow! If you are the type of person who doesn't want to curate a ton of small, good-in-isolation commits (it does take a bit more work after all), nobody would be forcing you to do so. Instead, if this is your commit authorship pattern, the submission of the proposed change could squash these commits together as part of the submission, optionally rewriting your local history in the process. If you want to keep dozens of fixup commits around in your local history, that's fine: just have the tooling collapse them all together on submission. While I don't think those fixup commits are that valuable and shouldn't be seen by reviewers, if we wanted, we could have tools continue to submit them and make them visible (like they are in e.g. GitHub pull requests today). But they wouldn't be the focus of review (again like GitHub pull requests today). Making integration requests commit-centric doesn't force people to adopt a different commit authorship workflow. But it does enable projects that wish to adopt more mature commit hygiene to do so. That being said, hows tools are implemented can impose restrictions. But that's nothing about commit-centric review that fundamentally prohibits the use of fixup commits in local workflows.
While I should create a dedicated post espousing the virtues of commit-centric workflows, I'll state my case through proxy by noting that some projects aren't using modern pull requests precisely because commit-centric workflows are not viable. When I was at Mozilla, one of the blockers to moving to GitHub was the pull request review tooling wasn't compatible with our world view that review units should be small. (This view is generally shared by Google, Facebook, and some prominent open source projects, among others.) And for reasons outlined earlier in this post, I think that as long as pull requests revolve around branches / merge diffs and aren't robust in the face of history rewriting (due to the lack of robust commit tracking), projects that insist on more refined practices will continue to eschew pull requests. Again, a link between review size and quality has been established. And better quality - along with its long-term effect of lowering development costs due to fewer bugs - can tip the scales in its favor, even against all the benefits you receive when using a product like GitHub, GitLab, or Bitbucket.
The Best of What's Around
Aspects of a better integration request exist in tools today. Unfortunately, many of these features are not present on pull requests as implemented by GitHub, GitLab, Bitbucket, etc. So to improve the pull request, these products will need to borrow ideas from other tools.
Integration requests not built around Git branches (Gerrit, Phabricattor, Review Board, etc) use identifiers in commit messages to track commits. This helps tracking commits across changes. There are compelling advantages to this model. Robust commit tracking is a requirement for commit-centric workflows. And it would even improve the functionality of branch-based pull requests. A well-designed integration request would have a robust commit tracking mechanism.
Gerrit has the best-in-class experience for commit-centric workflows. It is the only popular implementation of integration requests I'm aware of that supports and caters to this workflow by default. In fact, I don't think you can change this! (This behavior is user hostile in some cases since it forces users to know how to rewrite commits, which is often perilous in Git land. It would be nice if you could have Gerrit squash commits into the same review unit automatically on the server. But I understand the unwillingness to implement this feature because this has its own set of challenges around commit tracking, which I won't bore you with.) Gerrit also shows groups of related commits front and center when viewing a proposed change.
Phabricator is the only other tool I know of where one can achieve a reasonable commit-centric workflow without the pitfalls of orphaned comments, context overload, etc mentioned earlier in this post. But this requires non-standard submission tooling and commit series aren't featured prominently in the web UI. So Phabricator's implementation is not as solid as Gerrit's.
Another Gerrit feature worth lauding is the submission mechanism. You simply
git push
to a special ref. That's it. There's no fork to create. No need to
create a Git branch. No need to create a separate pull request after the push.
Gerrit just takes the commits you pushed and turns them into a request for
review. And it doesn't require any additional client-side tooling!
Using a single common git
command to submit and update an integration
request is simpler and arguably more intuitive than other tools. Is Gerrit's
submission perfect? No. The git push origin HEAD:refs/for/master
syntax is
not intuitive. And overloading submission options by effectively encoding URL
parameters on the ref name is a gross - albeit effective - hack. But users will
likely quickly learn the one-liner's or create more intuitive aliases.
The elegance of using just a git push
to initiate an integration request
puts Gerrit in a league of its own. I would be ecstatic if the GitHubs of the
world reduced the complexity of submitting pull requests to simply clone the
canonical repository, create some commits, and run a git
command. The future
of submitting integration requests* hopefully looks more like Gerrit than
other alternatives.
What Needs Built
Some aspects of the better integration request don't yet exist or need considerable work before I consider them viable.
For tools which leverage the native version control tool for submission
(e.g. via git push
), there needs to be some work to support submission
via a more generic, HTTP endpoint. I'm fine with leveraging git push
as a
submission mechanism because it makes the end-user experience so
turnkey. But making it the only submission mechanism is a bit unfortunate.
There is some support for this: I believe you can cobble together a pull
request from scratch via GitHub's APIs, for example. But it isn't as simple
as submit a patch to an endpoint, which it arguably should be. Even
Gerrit's
robust HTTP API,
does not seem to allow creating new commits/diffs via that API. Anyway, this
limitation not only excludes non-Git tools from using these tools, but also
limits other tooling from submitting without using Git. For example, you may
want to write a bot that proposes automated changes and it is much easier
to produce a diff than to use git
since the former does not require a
filesystem (this matters in serverless environments for example).
A larger issue with many implementations is the over-reliance on Git for server
storage. This is most pronounced in Gerrit, where not only are your git push
es
stored in a Git repository on the Gerrit server, but every code review comment
and reply is stored in Git as well! Git is a generic key-value store and you
can store any data you want in it if you shoehorn it properly. And it is cool
that all your Gerrit data can be replicated via git clone
- this pretty much
eliminates the we took a decentralized tool and centralized it via GitHub
series of arguments. But if you apply this store everything in Git approach
at scale, it means you will be running a Git server at scale. And not just any
Git server - a write load heavy Git server! And if you have thousands of
developers possibly all working out of the same repository, then you are looking
at potentially millions of new Git refs per year. While the Git, Gerrit, and
JGit people have done some fantastic work making these tools scale, I'd feel
much better if we eschewed the make Git scale to infinite pushes and refs
problem and used a more scalable approach, like an HTTP ingestion endpoint
which writes data to key-value stores or relational databases. In order words,
use of a version control tool for servicing integration requests at scale
is a self-imposed footgun and could be avoided.
Conclusion
Congratulations on making it through my brain dump! As massive as the wall of text is, there are still plenty of topics I could have covered but didn't. This includes the more specific topic of code review and the various features that entails. I also largely ignored some general topics like the value that an integration request can serve on the overall development lifecycle: integration requests are more than just code review - they serve as a nexus to track the evolution of a change throughout time.
Hopefully this post gave you some idea at some of the structural issues at play with the integration of pull requests and integration requests. And if you are someone in a position to design or implement a better integration request or tooling around them (including in version control tools themselves), hopefully it gave you some good ideas or where to go next.
C Extension Support in PyOxidizer
June 30, 2019 at 04:40 PM | categories: Python, PyOxidizerThe initial release of PyOxidizer generated a bit of excitement across the Internet! The post was commented on heavily in various forums and my phone was constantly buzzing from all the Twitter activity. There has been a steady stream of GitHub Issues for the project, which I consider a good sign. So thank you everybody for the support and encouragement! And especially thank you to everyone who filed an issue or submitted a pull request!
While I don't usually read the comments, I was looking at various forums posting about PyOxidizer to see what reactions were like. A common negative theme was the lack of C extension support in PyOxidizer. People seemed dismissive of PyOxidizer because it didn't support C extensions. Despite the documentation stating that the feature was planned and that I had an idea for how to implement it, people seemed pessimistic. Perhaps I didn't adequately communicate that making C extensions work is actually a subset of the already-solved single file executable problem and therefore was already a solved problem at the technical level (only the integration with the Python and PyOxidizer build systems was missing). So in my mind C extension support was only a matter of time and the only open question was how many hacks would be needed to make it work, not whether it would work.
Well, I'm pleased to report that the just-released version 0.2 of PyOxidizer
supports C extensions on Windows, macOS, and Linux. If you install a Python
package through a pip-install-simple
, pip-requirements-file
, or
setup-py-install
packaging rule, C extensions will be compiled in a special
way that enables them to be embedded in the same binary containing Python
itself. I've tested it with the zope.interface
, zstandard
, and
mercurial
packages and it seems to work (although Mercurial has other
issues that prevent it from being packaged as a PyOxidizer application - but
the C extensions do compile).
There are some limitations to the support, however. I'm pretty confident the limitations can be eliminated given enough time. Given how many people were hung up on the lack of C extensions and were seemingly writing off PyOxidizer thinking it was snake oil or something, I wanted to deliver basic C extension support to curtail this line of criticism. Perfect is the enemy of good and hopefully basic C extension support is good enough to ease concerns about PyOxidizer's viability.
Also in PyOxidizer 0.2 are some minor new features, like the --pip-install
and
--python-code
flags to pyoxidizer init
. These allow you to generate a
pyoxidizer.toml
file pre-configured to install some packages from pip and run
custom Python code. So now applications can be created and built with a one-liner
without having to edit a pyoxidizer.toml
file!
The full release notes are available. As always, please keep filing issues. I'm particularly interested in hearing about packages whose C extensions don't work properly.
« Previous Page -- Next Page »