Gregory Szorc's Digital Home

Using Rust to Power Python Importing With oxidized_importer

May 10, 2020 at 01:15 PM | categories: Python, PyOxidizer

I'm pleased to announce the availability of the oxidized_importer Python package, a standalone version of the custom Python module importer used by PyOxidizer. oxidized_importer - a Python extension module implemented in Rust - enables Python applications to start and run quicker by providing an alternate, more efficient mechanism for loading Python resources (such as source and bytecode modules).

Installation instructions and detailed usage information are available in the official documentation. The rest of this post hopefully answers the questions "why are you doing this?" and "why should I care?"

In a traditional Python process, Python's module importer inspects the filesystem at run-time to find and load resources like Python source and bytecode modules. It is highly dynamic in nature and relies on the filesystem as a point-in-time source of truth for resource availability.

oxidized_importer takes a different approach to resource loading that is more static in nature and more suitable to application environments (where Python resources aren't changing). Instead of dynamically probing the filesystem for available resources, resources are instead indexed ahead of time. When Python goes to resolve a resource (say it is looking to import a module), oxidized_importer simply needs to perform a lookup in an in-memory data structure to locate said resource. This means oxidized_importer only has marginal reliance on the filesystem, which can make it much faster than Python's traditional importer. (Performance benefits of binaries built with PyOxidizer have already been clearly demonstrated.)
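To make the idea of an index lookup concrete, here is a minimal sketch of an in-memory importer written in pure Python against the standard importlib interfaces. It is not oxidized_importer's implementation (which is written in Rust); the class name InMemoryFinder, the module name hello_mem, and the dict-based index are all assumptions made for illustration.

```python
import importlib.abc
import importlib.util
import sys


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve imports from a prebuilt dict instead of probing the filesystem."""

    def __init__(self, sources):
        # Hypothetical index: module name -> source code held in memory.
        self._sources = dict(sources)

    def find_spec(self, fullname, path=None, target=None):
        # Resolving a module is a single dict lookup; no directories are scanned.
        if fullname not in self._sources:
            return None
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        # Returning None asks Python to create the module object as usual.
        return None

    def exec_module(self, module):
        name = module.__name__
        code = compile(self._sources[name], "<memory:%s>" % name, "exec")
        exec(code, module.__dict__)


finder = InMemoryFinder({"hello_mem": "GREETING = 'hello from memory'\n"})
# Put the finder first so it is consulted before the path-based importer.
sys.meta_path.insert(0, finder)

import hello_mem

print(hello_mem.GREETING)
```

Modules the index does not know about fall through to the remaining entries on sys.meta_path, so the standard importer keeps working for everything else.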

The oxidized_importer Python extension module exposes parts of PyOxidizer's packaging and run-time functionality to Python code, without requiring the full use of PyOxidizer for application packaging. Specifically, oxidized_importer allows you to:

Install a custom, high-performance module importer (OxidizedFinder) to service Python import statements and resource loading (potentially from memory, using zero-copy).
Scan the filesystem for Python resources (source modules, bytecode files, package resources, distribution metadata, etc) and turn them into Python objects, which can be loaded into OxidizedFinder instances.
Serialize Python resource data into an efficient binary data structure for loading into an OxidizedFinder instance. This facilitates producing a standalone resources blob that can be distributed with a Python application which contains all the Python modules, bytecode, etc required to power that application. See the docs on freezing an application with oxidized_importer.

oxidized_importer can be thought of as PyOxidizer-lite: it provides just enough functionality to allow Python application maintainers to leverage some of the technical advancements of PyOxidizer (such as in-memory module imports) without using PyOxidizer for application packaging. oxidized_importer can work with the Python distribution already installed on your system. You just pip install it like any other Python package.

By releasing oxidized_importer as a standalone Python package, my hope is to allow more people to leverage some of the technical achievements and performance benefits coming out of PyOxidizer. I also hope that having more users of PyOxidizer's underlying code will help uncover bugs and conformance issues, raising the quality and viability of the projects.

I would also like to use oxidized_importer as an opportunity to advance the discourse around Python's resource loading mechanism. Filesystem I/O can be extremely slow, especially in mobile and embedded environments. Dynamically probing the filesystem to service module imports can therefore be slow. (The Python standard library has the zipimport module for importing Python resources from a zip file. But in my opinion, we can do much better.) I would like to see Python move towards leveraging immutable, serialized data structures for loading resources as efficiently as possible. After all, Python resources like the Python standard library are likely not changing between Python process invocations. The performance zealot in me cringes thinking of all the overhead that Python's filesystem probing approach incurs - all of the excessive stat() and other filesystem I/O calls that must be performed to answer questions about state that is easily indexed and often doesn't change. oxidized_importer represents my vision for what a high-performance Python resource loader should look like. I hope it can be successful in steering Python towards a better approach for resource loading.

I plan to release oxidized_importer independently from PyOxidizer. While the projects will continue to be developed in the same repository and will leverage the same underlying Rust code, I view them as somewhat independent and serving different audiences.

While oxidized_importer evolved from facilitating PyOxidizer's run-time use cases, I'm not opposed to taking it in new directions. For example, I would entertain implementing Python's dynamic filesystem probing logic in oxidized_importer, allowing it to serve as a functional stand-in for the official importer shipped with the Python standard library. I have little doubt an importer implemented in 100% Rust would outperform the official importer, which is implemented in Python. There's all kinds of possibilities here, such as using a background thread to index sys.path outside the constraints of the GIL. But I don't want to get ahead of myself...

If you are a Python application maintainer and want to make your Python processes execute a bit faster by leveraging a pre-built index of available Python resources and/or taking advantage of in-memory module importing, I highly encourage you to take a look at oxidized_importer!

PyOxidizer 0.7

April 09, 2020 at 09:00 PM | categories: Python, PyOxidizer

I am very pleased to announce the 0.7 release of PyOxidizer, a modern Python application packaging tool.

There are a host of notable new features in this release. You can read all about them in the project history.

I want to use this blog post to call out the more meaningful ones.

I started PyOxidizer as a science experiment of sorts: I set out to prove the hypothesis that it was possible to produce high performance single file executables embedding Python and all of its resources (Python modules, non-module resource files, compiled extensions, etc). PyOxidizer has achieved this on Windows, Linux, and macOS since its very earliest releases. Hypothesis confirmed!

In order to actually achieve single file executables, you have to fundamentally change aspects of Python's behavior. Some of these changes invalidate deeply rooted assumptions about how Python works, such as the existence of __file__ in modules. As you can imagine, these broken assumptions translated to numerous compatibility issues and PyOxidizer didn't work with many popular Python packages.

With the science experiment phase of PyOxidizer out of the way, I have been making a concerted effort to broaden the user base of PyOxidizer. While single file executables can be an amazing property, it isn't critical for many use cases and the issues it was causing were preventing people from exploring PyOxidizer.

This brings us to what I think are the major new features in PyOxidizer 0.7.

Better Support for Loading Extension Modules

Earlier versions of PyOxidizer insisted that you compile Python (C) extension modules from source and statically link them into a produced binary. This requirement prevented the use of pre-built extension modules (commonly found in Python binary wheels available on PyPI) with PyOxidizer, forcing people to compile them locally. While this often just worked for many extension modules, it frequently failed on complex extension modules and it frequently failed on Windows.

PyOxidizer now supports loading compiled extension modules from standalone files (typically .so or .pyd files, which are actually shared libraries). There are still some sharp edges and known deficiencies. But in many cases, if you tell PyOxidizer to run pip install and package the result, pre-built wheels can be installed and PyOxidizer will pick up the standalone files.

On Windows, PyOxidizer even supports embedding the shared library data into the produced .exe and loading the .pyd/DLL directly from memory.

Loading Resources from the Filesystem

Binaries built with PyOxidizer contain a blob holding an index of available Python resources along with their data.

Earlier versions of PyOxidizer only allowed you to define resources as in-memory. If the resource was defined in this blob, it was imported from memory. Otherwise it wasn't known to PyOxidizer. You could still install files next to the produced binary and tell PyOxidizer to enable Python's default filesystem-based importer. But PyOxidizer didn't explicitly know about these files on the filesystem.

In PyOxidizer 0.7, the blob index of Python resources is able to express different locations for each resource. Currently, a resource can have its data made available in-memory or filesystem-relative. in-memory works as before: the raw data is embedded next to the index in memory and loaded from there (using 0-copy). filesystem-relative encodes a filesystem path to the resource. During packaging, PyOxidizer will place the resource next to the executable (using a typical Python file layout scheme) and store the relative path to that resource in the resources index.

The filesystem-relative resource indexing feature has a few implications for PyOxidizer.

First, it is more standard. When PyOxidizer loads a Python module from the filesystem, it sets __file__, __path__, etc and the module semantics should behave as if the file were imported by Python's standard importer. This means that if a package is having issues with in-memory importing, you can simply fall back to filesystem-relative to get standard Python behavior and everything should just work.

Second, PyOxidizer's filesystem resource loading is faster than Python's! When Python's standard importer goes to import a module, it needs to stat() various paths to first locate the file. It then performs some sanity checking and other minor actions before actually importing the module. All of this has overhead. Since the goal of PyOxidizer is to produce standalone applications and applications should be immutable, PyOxidizer can avoid most of this overhead. PyOxidizer simply tries to open() and read() the relative path baked into the resource index at build time. If that works, the resource is loaded. Else there is a failure. The code path in PyOxidizer to locate a Python resource is effectively a lookup in a Rust HashMap<&str, T>.
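The lookup described above can be approximated in pure Python. The following is only a sketch of the technique, not PyOxidizer's Rust code: the IndexedPathFinder class, the indexed_mod module, and the temporary directory standing in for "next to the executable" are assumptions for illustration. The finder maps a module name to a relative path recorded ahead of time and hands that exact path to the standard source loader, so no candidate directories are probed.

```python
import importlib.abc
import importlib.util
import os
import sys
import tempfile


class IndexedPathFinder(importlib.abc.MetaPathFinder):
    """Resolve modules through a prebuilt name -> relative path index."""

    def __init__(self, root, index):
        self._root = root
        self._index = dict(index)

    def find_spec(self, fullname, path=None, target=None):
        relpath = self._index.get(fullname)
        if relpath is None:
            return None
        # The location is already known, so there is no search over sys.path.
        return importlib.util.spec_from_file_location(
            fullname, os.path.join(self._root, relpath)
        )


# Stand-in for an application directory produced at packaging time.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "lib"))
with open(os.path.join(root, "lib", "indexed_mod.py"), "w") as fh:
    fh.write("VALUE = 42\n")

sys.meta_path.insert(0, IndexedPathFinder(root, {"indexed_mod": "lib/indexed_mod.py"}))

import indexed_mod

# Because the module came from a real file, __file__ is set as usual.
print(indexed_mod.VALUE, indexed_mod.__file__)
```

Note that the standard source loader used here still performs its own checks on the file (for example, for bytecode caching), so this sketch only removes the search step, not every filesystem call.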

I thought it would be interesting to isolate the performance benefits of this new feature. I ran Mercurial's test harness with different variants of hg on Linux on my Ryzen 3950X.

traditional - A hg script with a #!/path/to/python3.7 shebang.
oxidized - A hg executable built with PyOxidizer, without PyOxidizer's custom module importer.
filesystem - A hg executable built with PyOxidizer using the new filesystem-relative resource index.
in-memory - A hg executable built with PyOxidizer with all resources loaded from memory (how PyOxidizer has traditionally worked).

The results are quite clear:

Variant	CPU Time (s)	Delta (s)	% Orig
traditional	11,287	0	100
oxidized	10,735	-552	95.1
filesystem	10,186	-1,101	90.2
in-memory	9,883	-1,404	87.6

We see a nice win just from using a native executable built with PyOxidizer (traditional to oxidized).

Then from oxidized to filesystem we see another jump of ~5%. This difference is attributed to using PyOxidizer's Rust-powered importer with an index of resources available on the filesystem. In other words, all that work that Python's standard importer is doing to discover files and then operate on them is non-trivial!

Finally, the smaller jump from filesystem to in-memory isolates the benefits of importing resource data from memory instead of involving filesystem I/O. (Filesystems are generally slow.) While I haven't measured explicitly, I hypothesize that macOS and Windows will see a bigger jump between these two variants, as the filesystem performance on these platforms generally isn't as good as it is on Linux.
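For readers who want to check the arithmetic, the Delta and % Orig columns can be recomputed from the CPU times alone. This is just a small sketch using the four CPU time figures from the table, with the traditional variant as the baseline.

```python
# CPU times in seconds, taken from the table above.
cpu_seconds = {
    "traditional": 11287,
    "oxidized": 10735,
    "filesystem": 10186,
    "in-memory": 9883,
}

baseline = cpu_seconds["traditional"]

rows = []
for variant, seconds in cpu_seconds.items():
    delta = seconds - baseline  # negative means less CPU time than baseline
    percent = round(100.0 * seconds / baseline, 1)
    rows.append((variant, seconds, delta, percent))
    print(variant, seconds, delta, percent)
```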

PyOxidizer's Future

With PyOxidizer now supporting a couple of much-needed features to support a broader set of users, I'm hoping that future releases of PyOxidizer continue to broaden the utility of PyOxidizer.

The over-arching goal of PyOxidizer is to solve large aspects of the Python application packaging and distribution problem. So far a lot of focus has been spent on the former. PyOxidizer in its current form can materialize files on the filesystem that you can copy or package up manually and distribute. But I want these processes to be part of PyOxidizer: I want it to be possible for PyOxidizer to emit a Windows MSI installer, a macOS dmg, a Debian package, etc for a Python application.

In order to support the aforementioned marquee features of this PyOxidizer release, I had to pay down a lot of technical debt in the code base left over from the science experiment phase of PyOxidizer's inception.

In the short term, I plan to continue shoring up the code base and rounding out support for features requested in the issue tracker on GitHub. The next release of PyOxidizer will also likely require Python 3.8, as this will improve run-time control over the embedded Python interpreter and enable PyOxidizer to better support package metadata (importlib.metadata), enabling support for features like entry points.

I've also been thinking about extracting PyOxidizer's custom module importer to be usable as a standalone Python extension module. I think there's some value in publishing a pyoxidizer_importer package on PyPI that you can easily add to your installed packages to speed up Python's standard filesystem importer by a few percent. If nothing else, this may drum up interest in the larger Python community for standardizing a format for serializing Python resources in a single file. Perhaps we can get other Python packaging tools producing the same packed resources data blob that PyOxidizer uses so we can all standardize on a more efficient mechanism for loading Python modules. Time will tell.

Enjoy the new release. File issues at https://github.com/indygreg/PyOxidizer as you encounter them.

Mercurial's Journey to and Reflections on Python 3

January 13, 2020 at 08:45 AM | categories: Python, Mercurial

Mercurial 5.2 was released on November 5, 2019. It is the first version of Mercurial that supports Python 3. This milestone comes nearly 11 years after Python 3.0 was first released on December 3, 2008.

Speaking as a maintainer of Mercurial and an avid user of Python, I feel like the experience of making Mercurial work with Python 3 is worth sharing because there are a number of lessons to be learned.

This post is logically divided into two sections: a mostly factual recount of Mercurial's Python 3 porting effort and a more opinionated commentary of the transition to Python 3 and the Python language ecosystem as a whole. Those who don't care about the mechanics of porting a large Python project to Python 3 may want to skip the next section or two.

Porting Mercurial to Python 3

Let's start with a brief history lesson of Mercurial's support for Python 3 as told by its own commit history.

The Mercurial version control tool was first released in April 2005 (the same month that Git was initially released). Version 1.0 came out in March 2008. The first reference to Python 3 I found in the code base was in September 2008. Then not much happened for a while until June 2010, when someone authored a bunch of changes to make the Python C extensions start to recognize Python 3. Then things were again quiet for a while until January 2013, when a handful of changes landed to remove 2 argument raise. There were a handful of commits in 2014 but nothing worth calling out.

Mercurial's meaningful journey to Python 3 started in 2015. In code, the work started in April 2015, with effort to make Mercurial's test harness run with Python 3. Part of this was a decision that Python 3.5 (to be released several months later in September 2015) would be the minimum Python 3 version that Mercurial would support.

Once the Mercurial Project decided it wanted to port to Python 3 (as opposed to another language), one of the earliest decisions was how to perform that port. Mercurial's code base was too large to attempt a flag day conversion where there would be a Python 2 version and a Python 3 version and one day everyone would switch from Python 2 to 3. Mercurial needed a way to run the same code (or as much of the same code as possible) on both Python 2 and 3. We would maintain a single code base and users would gradually switch from running with Python 2 to Python 3.

In May 2015, Mercurial dropped support for Python 2.4 and 2.5. Dropping support for these older Python versions was critical, as it was effectively impossible to write Python code that ran on this wide gamut of versions because of incompatibilities in syntax and language features. For example, you needed Python 2.6 to get print() via from __future__ import print_function. The project's late start at a Python 3 port can be significantly attributed to Python 2.4 and 2.5 compatibility holding us back.

The main goal with Mercurial's early porting work was just getting the code base to a point where import mercurial would work. There were a myriad of places where Mercurial used syntax that was invalid on Python 3 and Python 3 couldn't even parse the source code, let alone compile it to bytecode and execute it.

This effort began in earnest in June 2015 with global source code rewrites like using modern octal syntax, modern exception catching syntax (except Exception as e instead of except Exception, e), print() instead of print, and a modern import convention along with the use of from __future__ import absolute_import.
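To illustrate why these rewrites had to come first, here is a small sketch (meant to be run on Python 3) that asks the interpreter whether it can parse a few snippets. The helper name parses_on_this_python and the snippets themselves are made up for illustration; they are not taken from Mercurial's source or its test-check-py3-compat.t test.

```python
import ast


def parses_on_this_python(source):
    """Return True if the running interpreter can parse the given source text."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# Python 2 era spellings that Python 3 refuses to parse.
legacy_snippets = [
    "print 'hello'\n",
    "mode = 0644\n",
    "raise ValueError, 'bad input'\n",
]

# Spellings accepted by both Python 2.6+ and Python 3.
portable_snippets = [
    "from __future__ import absolute_import, print_function\nprint('hello')\n",
    "mode = 0o644\n",
    "raise ValueError('bad input')\n",
    "try:\n    pass\nexcept Exception as e:\n    pass\n",
]

for snippet in legacy_snippets + portable_snippets:
    print(parses_on_this_python(snippet), repr(snippet))
```

A single unparseable statement is enough to make a whole module fail to import, which is why parse-level cleanliness was the natural first milestone.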

In the early days of the port, our first goal was to get all source code parsing as valid Python 3. The next step was to get all the modules importing cleanly. This entailed fixing code that ran at import time to work on Python 3. Our thinking was that we would need the code base to be import clean on Python 3 before seriously thinking about run-time behavior. In reality, we quickly ported a lot of modules to import cleanly and then moved on to higher-level porting, leaving a long-tail of modules with import failures.

This initial porting effort played out over months. There weren't many people working on it in the early days: a few people would basically hack on Python 3 as a form of itch scratching and most of the project's energy was focused on improving the existing Python 2 based product. You can get a rough idea of the timeline and participation in the early porting effort through the history of test-check-py3-compat.t. We see the test being added in December 2015. By June 2016, most of the code base was ported to our modern import convention and we were ready to move on to more meaningful porting.

One of the biggest early hurdles in our porting effort was how to overcome the string literals type mismatch between Python 2 and 3. In Python 2, a '' string literal is a sequence of bytes. In Python 3, a '' string literal is a sequence of Unicode code points. These are fundamentally different types. And in Mercurial's code base, most of our string types are binary by design: use of a Unicode based str for representing data is flat out wrong for our use case. We knew that Mercurial would need to eventually switch many string literals from '' to b'' to preserve type compatibility. But doing so would be problematic.

In the early days of Mercurial's Python 3 port in 2015, Mercurial's project maintainer (Matt Mackall) set a ground rule that the Python 3 port shouldn't overly disrupt others: he wanted the Python 3 port to more or less happen in the background and not require every developer to be aware of Python 3's low-level behavior in order to get work done on the existing Python 2 code base. This may seem like a questionable decision (and I probably disagreed with him to some extent at the time because I was doing Python 3 porting work and the decision constrained this work). But it was the correct decision. Matt knew that it would be years before the Python 3 port was either necessary or resulted in a meaningful return on investment (the value proposition of Python 3 has always been weak to Mercurial because Python 3 doesn't demonstrate a compelling advantage over Python 2 for our use case). What Matt was trying to do was minimize the externalized costs that a Python 3 port would inflict on the project. He correctly recognized that maintaining the existing product and supporting existing users was more important than a long-term bet in its infancy.

This ground rule meant that a mass insertion of b'' prefixes everywhere was not desirable, as that would require developers to think about whether a type was a bytes or str, a distinction they didn't have to worry about on Python 2 because we practically never used the Unicode-based string type in Mercurial.

In addition, there were some other practical issues with doing a bulk b'' prefix insertion. One was that the added b characters would cause a lot of lines to grow beyond our length limits and we'd have to reformat code. That would require manual intervention and would significantly slow down porting. And a sub-issue of adding all the b prefixes and reformatting code is that it would break annotate/blame more than was tolerable. The latter issue was addressed by teaching Mercurial's annotate/blame feature to skip revisions. The project now has a convention of annotating commit messages with # skip-blame <reason> so structural only changes can easily be ignored when performing an annotate/blame.

A stop-gap solution to the b'' everywhere issue came in July 2016, when I introduced a custom Python module importer that rewrote source code as part of import when running on Python 3. (I have previously blogged about this hack.) What this did was transparently add b'' prefixes to all un-prefixed string literals as well as modify how a few common functions were called so that we wouldn't need to modify source code so things would run natively on Python 3. The source transformer allowed us to have the benefits of progressing in our Python 3 port without having to rewrite tens of thousands of lines of source code. The solution was hacky. But it enabled us to make significant progress on the Python 3 port without externalizing a lot of cost onto others.
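The core of such a transform can be sketched with the standard tokenize module. This is a simplified illustration rather than Mercurial's actual transformer: the function name add_bytes_prefixes is hypothetical, the sketch only handles the b'' prefixing (not the function call rewrites), and it leaves out the import hook that would apply it at module load time.

```python
import io
import tokenize


def add_bytes_prefixes(source):
    """Return source with a b prefix added to every un-prefixed string literal."""
    lines = source.split("\n")
    # Record where each un-prefixed string literal starts. A literal that
    # already has a prefix (r'', b'', u'') does not begin with a quote
    # character and is therefore left alone.
    starts = [
        tok.start
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.STRING and tok.string[0] in "'\""
    ]
    # Insert from the end so that earlier column offsets stay valid.
    for row, col in reversed(starts):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + "b" + line[col:]
    return "\n".join(lines)


original = "name = 'hg'\nraw = r'\\d+'\nalready = b'x'\n"
transformed = add_bytes_prefixes(original)
print(transformed)

# Executing the rewritten source shows the resulting types.
namespace = {}
exec(transformed, namespace)
print(type(namespace["name"]), type(namespace["raw"]))
```

A blanket rewrite like this also turns docstrings and dict keys into bytes, which hints at why a transform of this kind needs follow-up work wherever native strings are required.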

I thought the source transformer would be relatively short-lived and would be removed shortly after the project inevitably decided to go all in on Python 3. To my surprise, others built additional transforms over the years and the source transformer persisted all the way until October 2019, when I removed it just before the first non-alpha Python 3 compatible version of Mercurial was released.

A common problem Mercurial faced with making the code base dual Python 2/3 native was dealing with standard library differences. Most of the problems stemmed from changes between Python 2.7 and 3.5+. But there are changes within the versions of Python 3 that we had to wallpaper over as well. In April 2016, the mercurial.pycompat module was introduced to export aliases or wrappers around standard library functionality to abstract the differences between Python versions. This file grew over time and eventually became Mercurial's version of six. To be honest, I'm not sure if we should have used six from the beginning. six probably would have saved some work. But we had to eventually write a lot of shims for converting between str and bytes and would have needed to invent a pycompat layer in some form anyway. So I'm not sure six would have saved enough effort to justify the baggage of integrating a 3rd party package into Mercurial. (When Mercurial accepts a 3rd party package, downstream packagers like Debian get all hot and bothered and end up making questionable patches to our source code. So we prefer to minimize the surface area for problems by minimizing dependencies on 3rd party packages.)
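As a rough idea of what a compatibility module of this kind looks like, here is a tiny sketch. The names ispy3 and sysstr and the particular aliases chosen are assumptions for illustration; the real mercurial.pycompat module is far larger and may differ in its details.

```python
import sys

ispy3 = sys.version_info[0] >= 3

if ispy3:
    import queue  # named Queue on Python 2
    from urllib.parse import urlparse  # lived in the urlparse module on Python 2

    def sysstr(value):
        """Return the native str type, decoding bytes if necessary."""
        if isinstance(value, bytes):
            return value.decode("latin-1")
        return value
else:
    import Queue as queue
    from urlparse import urlparse

    def sysstr(value):
        # On Python 2 the native str type is already a byte string.
        return value

# Callers use one spelling regardless of the interpreter version.
work = queue.Queue()
work.put(sysstr(b"status"))
print(work.get(), urlparse("https://example.com/repo").netloc)
```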

Once we had a source transforming module importer and the pycompat compatibility shim, we started to focus in earnest on making core functionality actually work on Python 3. We established a convention of annotating changesets needed for Python 3 with py3, so a commit message search yields a lot of the history. (But it isn't a full history since not every Python 3 oriented change used this convention). We see from that history that after the source importer landed, a lot of porting effort was spent on things very early in the hg process lifetime. This included handling environment variables, loading config files, and argument parsing. We introduced a test-check-py3-commands.t test to track the progress of hg commands working in Python 3. The very early history of that file shows the various error messages changing, as underlying early process functionality was slowly ported to work on Python 3. By December 2016, we had hg version working on Python 3!

With basic hg command dispatch ported to Python 3 at the end of 2016, 2017 represented an inflection point in the Python 3 porting effort. With the early process functionality working, different people could pick up different commands and code paths and start making code work with Python 3. By March 2017, basic repository opening and hg files worked. Shortly thereafter, hg init started working as well. And hg status and hg commit did as well.

Within a few months, enough of Mercurial's functionality was working with Python 3 that we started to track which tests passed on Python 3. The evolution of this file shows a reasonable history of the porting velocity.

In May 2017, we dropped support for Python 2.6. This significantly reduced the complexity of supporting Python 3, as there was tons of functionality in Python 2.7 that made it easier to target both Python 2 and 3 and now our hands were untied to utilize it.

In November 2017, I landed a test harness feature to report exceptions seen during test runs. I later refined the output so the most frequent failures were reported more prominently. This feature greatly enabled our ability to target the most common exceptions, allowing us to write patches to fix the most prevalent issues on Python 3 and uncover previously unknown failures.

By the end of 2017, we had most of the structural pieces in place to complete the port. Essentially all that was required at that point was time and labor. We didn't have a formal mechanism in place to target porting efforts. Instead, people would pick up a component or test that they wanted to hack on and then make incremental changes towards making that work. All the while, we didn't have a strict policy on not regressing Python 3 and regressions in Python 3 porting progress were semi-frequent. Although we did tend to correct regressions quickly. And over time, developers saw a flurry of Python 3 patches and slowly grew awareness of how to accommodate Python 3, and the number of Python 3 regressions became less frequent.

As useful as the source-transforming module importer was, it incurred some additional burden for the porting effort. The source transformer effectively converted all un-prefixed string literals ('') to bytes literals (b'') to preserve string type behavior with Python 2. But various aspects of Python 3 didn't like the existence of bytes. Various standard library functionality now wanted unicode str and didn't accept bytes, even though the Python 2 implementation used the equivalent of bytes. So our pycompat layer grew pretty large to accommodate calling into various standard library functionality. Another side-effect which we didn't initially anticipate was the **kwargs calling convention. Python allows you to use ** with a dict with string keys to turn those keys into named arguments in a function call. But Python 3 requires these dict keys to be str and outright rejects bytes keys, even if the bytes instance is ASCII safe and has the same underlying byte representation of the string data as the str instance would. So we had to invent support functions that would convert dict keys from bytes to str for use with **kwargs and another to convert a **kwargs dict from str keys to bytes keys so we could use '' syntax to access keys in our source code! Also on the string type front, we had to sprinkle the codebase with raw string literals (r'') to force the use of str regardless of which Python version you were running on (our source transformer only changed unprefixed string literals, so existing r'' strings would be preserved as str).
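The **kwargs problem and the shape of the two support functions can be sketched as follows. The helper names strkwargs and byteskwargs, the commit function, and the choice of latin-1 as the encoding are assumptions for this illustration and are not guaranteed to match Mercurial's code.

```python
def strkwargs(dic):
    """Convert bytes keys to str so the dict can be passed via **kwargs."""
    return {key.decode("latin-1"): value for key, value in dic.items()}


def byteskwargs(dic):
    """Convert str keys back to bytes so code can index with b'' literals."""
    return {key.encode("latin-1"): value for key, value in dic.items()}


def commit(**opts):
    # Inside the function, the code base wants to keep using bytes keys.
    opts = byteskwargs(opts)
    return opts[b"message"]


options = {b"message": b"initial import"}

# Python 3 rejects bytes keys in a ** expansion with a TypeError.
try:
    commit(**options)
    rejected = False
except TypeError:
    rejected = True

result = commit(**strkwargs(options))
print(rejected, result)
```

The round trip is lossless for any byte value because latin-1 maps each byte to exactly one code point.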

Blind transformation of all string literals to bytes was less than ideal and it did impose some unwanted side-effects. But, again, most strings in Mercurial are bytes by design, so we thought it would be easier to byteify all strings and then selectively undo that where native strings were actually warranted (like keys in most dicts) than to take the up-front cost to examine every string and make an intelligent determination as to what type it should be. I go back and forth as to whether this was the correct call. But when you factor in that the source transforming module importer unblocked Python 3 porting at a time in the project's history when there was so much focus on improving the core product and it did so without externalizing many costs onto the people doing the critical core product work, I think it was the right call.

By mid 2019, the number of test failures in Python 3 had been whittled down to a reasonable, less daunting number. It felt like victory was in grasp and inevitable. But a few significant issues lingered.

One remaining question was around addressing differences between Python 3 versions. At the time, Python 3.5, 3.6, and 3.7 were released and 3.8 was scheduled for release by the end of the year. We had a surprising number of issues with differences in Python 3 versions. Many of us were running Python 3.7, so it had the fewest failures. We had to spend extra effort to get Python 3.5 and 3.6 working as well as 3.7. Same for 3.8.

Another task we deferred until the second half of 2019 was standing up robust CI for Python 3. We had some coverage, but it was minimal. Wanting a distraction from PyOxidizer for a bit and wanting to overhaul Mercurial's CI system (which is officially built on Buildbot), I cobbled together a serverless CI system built on top of AWS DynamoDB and S3 for storage, Lambda functions and CloudWatch events for all business logic, and EC2 spot instances for job execution. This CI system executed Python 3.5, 3.6, 3.7, and 3.8 variants of our test harness on Linux and Python 3.7 on Windows. This gave developers insight into version-specific failures. More importantly, it also gave insight into failures on Windows, a platform that was previously not well tested. We discovered that Python 3 support on Windows was lagging significantly behind POSIX.

By the time of the Mercurial developer meetup in October 2019, nearly all tests were passing on POSIX platforms and we were confident that we could declare Python 3 support as at least beta quality for the Mercurial 5.2 release, planned for early November.

One of our blockers for ripping off the alpha label on Python 3 support was removing our source-transforming module importer. It had performance implications and it wasn't something we wanted to ship because it felt too hacky. A blocker for this was we wanted to automatically format our source tree with black because if we removed the source transformer, we'd have to rewrite a lot of source code to apply changes the transformer was performing, which would necessitate wrapping a lot of lines, which would involve a lot of manual effort. We wanted to blacken our code base first so that mass rewriting source code wouldn't involve a lot of tedious reformatting since black would handle that for us automatically. And rewriting the source tree with black was blocked on a specific feature landing in black! (We did not agree with black's behavior of unwrapping comma-delimited lists of items if they could fit on a single line. So one of our core contributors wrote a patch to black that changed its behavior so a trailing , in a list of items will force items to be formatted on multiple lines. I personally find the multiple line formatting much easier to read. And the behavior is arguably better for code review and annotation, which is line based.) Once this feature landed in black, we reformatted our source tree and started ripping out the source transformations, starting by inserting b'' literals everywhere. By late October, the source transformer was no more and we were ready to release beta quality support for Python 3 (at least on UNIX-like platforms).

Having described a mostly factual overview of Mercurial's port to Python 3, it is now time to shift gears to the speculative and opinionated parts of this post. I want to underscore that the opinions reflected here are my own and do not reflect the overall Mercurial Project or even a consensus within it.

The Future of Python 3 and Mercurial

Mercurial's port to Python 3 is still ongoing. While we've shipped Python 3 support and the test harness is clean on Python 3, I view shipping as only a milestone - arguably the most important one - in a longer journey. There's still a lot of work to do.

It is now 2020 and Python 2 support is now officially dead from the perspective of the Python language maintainers. Linux distributions are starting to rip out Python 2. Packages are dropping Python 2 support in new versions. The world is moving to Python 3 only. But Mercurial still officially supports Python 2. And it is still yet to be determined how long we will retain support for Python 2 in the code base. We've only had one release supporting Python 3. Our users still need to port their extensions (implemented in Python). Our users still need to start widely using Mercurial with Python 3. Even our own developers need to switch to Python 3 (old habits are hard to break).

I anticipate a long tail of random bugs in Mercurial on Python 3. While the tests may pass, our code coverage is not 100%. And even if it were, Python is a dynamic language and there are tons of invariants that aren't caught at compile time and can only be discovered at run time. These invariants cannot all be detected by tests, no matter how good your test coverage is. This is a feature/limitation of dynamic languages. Our users will likely be finding a long tail of miscellaneous bugs on Python 3 for years.

At present, our code base is littered with tons of random hacks to bridge the gap between Python 2 and 3. Once Python 2 support is dropped, we'll need to remove these hacks and make the source tree Python 3 native, with minimal shims to wallpaper over differences in Python 3 versions. Removing this Python version bridge code will likely require hundreds of commits and will be a non-trivial effort. It's likely to be deemed a low priority (it is glorified busy work after all), and code for the express purpose of supporting Python 2 will likely linger for years.

We are also still shoring up our packaging and distribution story on Python 3. This is easier on some platforms than others. I created PyOxidizer partially because of the poor experience I had with Python application packaging and distribution through the Mercurial Project. The Mercurial Project has already signed off on using PyOxidizer for distributing Mercurial in the future. So look for an oxidized Mercurial distribution in the near future! (You could argue PyOxidizer is an epic yak shave to better support Mercurial. But that's for another post.)

Then there's Windows support. A Python 3 powered Mercurial on Windows still has a handful of known issues. It may require a few more releases before we consider Python 3 on Windows to be stable.

Because we're still on a code base that must support Python 2, our adoption of Python 3 features is very limited. The only Python 3 feature that Mercurial developers seem to almost universally get excited about is type annotations. We already have some people playing around with pytype using comment-based annotations and pytype has already caught a few bugs. We're eager to go all in on type annotations and uncover lots of dynamic typing bugs and poorly implemented APIs. Beyond type annotations, I can't name any feature that people are screaming to adopt and which makes a lot of sense for Mercurial. There's a long tail of minor features I'm sure will get utilized. But none of the marquee features that define major language releases seem that interesting to us. Time will tell.
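To make the comment-based annotations mentioned above concrete, here is a minimal sketch of the PEP 484 type comment style, which stays valid syntax on Python 2 because the annotation lives in a comment. The function node_length is a hypothetical example and is not taken from Mercurial's code base.

```python
def node_length(node):
    # type: (bytes) -> int
    """Return the length of a binary node identifier."""
    return len(node)


# The type comment is invisible at run time: no annotations are recorded.
runtime_annotations = node_length.__annotations__
length = node_length(b"\x00" * 20)
```

A type checker that understands type comments can flag a call such as node_length(u"abc"), while the interpreter itself simply ignores the comment.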

Commentary on Python 3

Having described Mercurial's ongoing journey to Python 3, I now want to focus more on Python itself. Again, the opinions here are my own and don't reflect those of the Mercurial Project.

Succinctly, my experience porting Mercurial and other projects to Python 3 has significantly soured my perceptions of Python. As much as I have historically loved Python - from the language to the welcoming community - I am still struggling to understand how Python could manage to inflict so much hardship on the community by choosing the transition plan that they did. I believe Python's choices represent a terrific example of what not to do when managing a large project or ecosystem. Maintainers of other largely-deployed systems would benefit from taking the time to understand and reflect on Python's missteps.

Python 3.0 was released on December 3, 2008. And it took the better part of a decade for the community to embrace it. This should be universally recognized as a failure. While hindsight is 20/20, many of the issues with Python 3 were obvious at the time and could have been mitigated had the language maintainers been more accommodating - and dare I say empathetic - to its users.

Initially, Python 3 had a rather cavalier attitude towards backwards and forwards compatibility. In the early years of Python 3, the attitude of Python's maintainers was Python 3 is a new, better language: you should target it explicitly. There were some tools and methods to ease the transition. But nothing super polished, especially in the early years. Adoption of Python 3 in the overall community was slow. Python developers in the wild justifiably complained that the value proposition of Python 3 was too weak to justify porting effort. Not helping was that the early advice for targeting Python 3 was to rewrite the source code to become Python 3 native. This is in contrast with using the same source to run on both Python 2 and 3. For library and application maintainers, this potentially meant maintaining separate versions of your code or forcing end-users to make a giant leap, which would realistically orphan users on an old version, fragmenting your user base. Neither of those were great alternatives, so you can understand why many projects didn't bite.

For many projects of non-trivial size, flag day transitions from Python 2 to 3 were simply not viable: the pathway to Python 3 was to make code dual Python 2/3 compatible and gradually switch over the runtime to Python 3. But initial versions of Python 3 made this effectively impossible! Let me give a few specific examples.

In Python 2, a string literal '' is effectively an array of bytes. In Python 3, it is a series of Unicode code points - a fundamentally different type! In Python 2, you could write b'' to be explicit that a string literal was bytes or you could write u'' to indicate a Unicode literal, mimicking Python 3's behavior. In Python 3, you could write b'' to create a bytes instance. But for whatever reason, Python 3 initially removed the u'' syntax, meaning there wasn't an easy way to explicitly denote the type of each string literal so that it was consistent between Python 2 and 3! Python 3.3 (released September 2012) restored u'' support, making it more viable to write Python source code that worked on both Python 2 and 3. For nearly 4 years, Python 3 took away the consistent syntax for denoting bytes/Unicode string literals.
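As a small illustration of the literal prefixes discussed above, the sketch below uses explicit b'' and u'' literals, which parse on both Python 2.7 and Python 3.3+. The variable and function names are made up for the example, and the values in the comments describe Python 3 behavior.

```python
# Prefixed literals have an unambiguous type on Python 2.7 and Python 3.3+.
raw = b"abc"    # bytes on Python 3 (an alias of str on Python 2)
text = u"abc"   # str on Python 3 (unicode on Python 2)


def literal_kinds():
    """Return the type names of the bytes literal and the Unicode literal."""
    return type(raw).__name__, type(text).__name__


# On Python 3, indexing shows that the element types differ as well.
first_byte = raw[0]    # an int on Python 3
first_char = text[0]   # a one-character str
```

On Python 3 the two values never compare equal, which is why code that mixed them freely on Python 2 needs every literal classified during a port.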

Another feature was % formatting of strings. Python 2 allowed use of the % formatting operator on both its string types. But Python 3 initially removed the implementation of % from bytes. Why, I have no clue. It is perfectly reasonable to splice byte sequences into a buffer via use of a formatting string. But the Python language maintainers insisted otherwise. And it wasn't until the community complained about its absence loudly enough that this feature was restored in Python 3.5, which was released in September 2015. Fun fact: the lack of this feature was once considered a blocker for Mercurial moving to Python 3 because Mercurial uses bytes almost universally, which meant that nearly every use of % would have to be changed to something else. And to this day, Python 3's bytes still doesn't have a format() method, so the alternative was effectively string concatenation, which is a massive step backwards from the expressiveness of % formatting.
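For a sense of what this looks like in practice, here is a minimal sketch comparing bytes % formatting (available since Python 3.5 via PEP 461) with the concatenation fallback. The helper names format_header and format_header_concat are hypothetical and exist only for this example.

```python
def format_header(name, value):
    """Build a header line with bytes % formatting (Python 3.5+)."""
    return b"%s: %s\r\n" % (name, value)


def format_header_concat(name, value):
    """Build the same line by concatenation, the fallback on Python 3.0-3.4."""
    return name + b": " + value + b"\r\n"


line = format_header(b"Content-Length", b"42")

# bytes gained % formatting back, but it still has no format() method.
has_format_method = hasattr(bytes, "format")
```

With only a couple of fields the two versions look similar, but the gap widens quickly as the number of interpolated values grows.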

The initial approach of Python 3 mirrors a folly that many developers and projects make: attempting a rewrite instead of performing incremental evolution. For established projects, large scale rewrites often go poorly. And Python 3 is no exception. Yes, from a code level, CPython (and likely other Python implementations) were incremental changes over Python 2 using the same code base. But from a language and standard library level, the differences in Python 3 were significant enough that I - and even Python's core maintainers - considered it a new language, and therefore a rewrite. When your random project attempts a rewrite and fails, the blast radius of that is often contained to that project. Maybe you don't publish a new release as soon as you otherwise would. But when you are powering an ecosystem, the ripple effects from a failed rewrite percolate throughout that ecosystem and last for years and have many second order effects. We see this with Python 3, where poor choices made in the late 2000s are inflicting significant hardship still in 2020.

From the initial restrained adoption of Python 3, it is obvious that the Python ecosystem overwhelmingly rejected the initial boil the oceans approach of Python 3. Python's maintainers eventually got the message and started restoring features like u'' and bytes % formatting back into the language to placate the community. All the while Python 3 had been accumulating new features and the cumulative sum of those features was compelling enough to win over users.

For many projects (including Mercurial), Python 3.4/3.5 was the first viable porting target for Python 3. Python 3.5 was released in September 2015, almost 7 years after Python 3.0 was released in December 2008. Seven. Years. An ecosystem that falters for that long is generally not healthy. What may have saved Python from total collapse here is that Python 2 was still going strong and people were generally happy with it. I really do think Python dodged a bullet here, because there was a massive window where the language could have hemorrhaged a critical amount of its user base and been relegated to an afterthought. One could draw an analogy to Perl, which lost out to PHP, Python, and Ruby, and whose fall from grace aligned with a lengthy transition from Perl 5 to 6.

If you look back at the early history of Python 3, I think you are forced to conclude that Python effectively kneecapped itself for 5-7 years through questionable implementation choices that prevented users from making incremental transitions between the major language versions. 2008 to 2013-2015 should be known as the lost years of Python because so much opportunity and energy was squandered. Yes, Python is still healthy today and Python 3 is (finally) being adopted at scale. But had earlier versions of Python 3 been more empathetic towards Python 2 users porting to it, Python and Python 3 in 2020 would be even stronger than they are. The community was artificially hindered for years. And we won't know until 2023-2025 what things could have looked like in 2020 had the Python core language team spent more time paving a smoother road between the major language versions.

To be clear, I do think Python 3 is generally a better language than Python 2. It has fewer warts, more compelling features, and better performance (except for startup time, which is still slower than Python 2). I am ecstatic the community is finally rallying around Python 3! For my Python coding, it has reached the point where I curse under my breath when I need to support Python 2 or even older versions of Python 3, like 3.5 or 3.6: I just wish the world would move on and adopt the future already!

But I would be remiss if I failed to mention some of my gripes with Python 3 beyond the transition shenanigans.

Perhaps my least favorite feature of Python 3 is its insistence that the world is Unicode. In Python 2, the default string type was backed by bytes. In Python 3, the default string type is backed by Unicode code points. As part of that transition, large parts of the standard library now operate in the Unicode space instead of the domain of bytes. I understand why Python does this: they want strings to be Unicode and don't want users to have to spend that much energy thinking about when to use str versus bytes. This approach is admirable and somewhat defensible because it takes a stand on a solution that is arguably good enough for most users. However, the approach of assuming the world is Unicode is flat out wrong and has significant implications for systems level applications (like version control tools).

There are a myriad of places in Python's standard library where Python insists on using the Unicode-backed str type and rejects bytes. For example, various networking modules refuse to accept bytes for hostnames or URLs. HTTP libraries won't accept bytes for HTTP header names or values. Functions that are proxies to POSIX-defined functions won't accept bytes even though the POSIX functions they call into use char * and aren't Unicode aware. Then there's filename handling, where Python assumes the existence of a global encoding for filenames and uses this encoding to convert between str and bytes. And it does this despite POSIX filesystem paths being a bag of bytes where the only rules are that \0 terminates the filename and / is special.
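To illustrate the str/bytes conversion for filenames, here is a rough sketch of what happens to a name that is not valid in the assumed encoding. It hard-codes UTF-8 with the surrogateescape error handler (PEP 383) so it behaves the same everywhere; on POSIX, os.fsdecode and os.fsencode apply the same handler using the filesystem encoding. The file name and helper names are invented for the example.

```python
raw_name = b"report-\xff.txt"  # a legal POSIX filename that is not valid UTF-8


def to_text(name):
    """Decode a filename, smuggling undecodable bytes through as lone surrogates."""
    return name.decode("utf-8", "surrogateescape")


def to_bytes(text):
    """Encode a filename, turning escaped surrogates back into the original bytes."""
    return text.encode("utf-8", "surrogateescape")


as_text = to_text(raw_name)
round_tripped = to_bytes(as_text)

# The str form is not real Unicode text: a strict encode rejects it.
try:
    as_text.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False
```

The bytes survive the round trip, but only as long as every layer that touches the str value knows about the escape convention. Printing, logging, or serializing such a name with a strict codec fails.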

In cases like Python refusing to accept bytes for things like HTTP header names (which will just be spit out over the wire as bytes), Python's pendulum has swung too far towards Unicode only. In my opinion, Python needs to be more accommodating and allow bytes when it makes sense. I hope the pendulum knocks some sense into people when it swings back towards a more reasonable solution that better acknowledges the realities of the world we live in.

For areas like filename handling, the world is more complicated. Python is effectively an abstraction layer over the operating system APIs exposing this functionality. And there is often an impedance mismatch between operating systems. For example, POSIX (Linux) tends to use char * for everything and doesn't care about encoding and Windows tends to use 16 bit character types where the encoding is... a can of worms.

The reality here is that it is impossible to abstract over differences between operating system behavior without compromises that can result in data loss, outright wrong behavior, or loss of functionality. But Python 3 attempts to do it anyway, making Python 3 unsuitable (or at least highly undesirable) for certain systems level applications that rely on it (like a version control tool).

In fairness to Python, it isn't the only programming language that gets this wrong. The only language I've seen properly implement higher-order abstractions on top of operating system facilities is Rust, whose approach can be generalized as use Python 3's solution of normalizing to Unicode/UTF-8 by default, but expose escape hatches which allow access to the raw underlying types and APIs used by the operating system for the advanced consumers who require it. For example, Rust's Path type which represents a filesystem path allows access to the raw OsStr value used by the operating system, not a normalization of it to bytes or Unicode, which may be lossy. This allows consumers to e.g. create and retrieve OS-native filesystem paths without data loss. This functionality is critical in some domains. Python 3's awareness/insistence that the world is Unicode (which it isn't universally) reduces Python's applicability in these domains.

Speaking of Rust, at the Mercurial developer meetup in October 2019, we were discussing the use of Rust in Mercurial and one of the core maintainers blurted out something along the lines of if Rust were at its current state 5 years ago, Mercurial would have likely ported from Python 2 to Rust instead of Python 3. As crazy as it initially sounded, I think I agree with that assessment. With the benefit of hindsight, having been a key player in the Python 3 porting effort, seeing all the complications and headaches Python 3 is introducing, and having learned Rust and witnessed its benefits for performance, control, and correctness firsthand, porting to Rust would likely have been the correct move for the project at that point in time. 2020 is not 2014, however, and I'm not sure if I would opt for a rewrite in Rust today. (Most rewrites are follies after all.) But I know one thing: I certainly wouldn't implement a new version control tool in Python 3 and I would probably choose Rust as an implementation language for most new projects in the systems level space or with an expected shelf life of 10+ years. (I really should blog about how awesome Rust is.)

Back to the topic of Python itself, I'm really soured on Python at this point in time. The effort required to port to Python 3 was staggering. For Mercurial, Python 3 introduces a ton of problems and doesn't really solve many. We effectively slogged through mud for several years only to wind up in a state that feels strictly worse than where we started. I'm sure it will be strictly better in a few years. But at that point, we're talking about a 5+ year transition. To call the Python 3 transition disruptive and distracting for the project would be an understatement. As a project maintainer, it's natural to ask what we could have accomplished if we weren't forced to carry out this sideshow.

I can't shake the feeling that a lot of the pain inflicted by the Python 3 transition could have been avoided had Python's language leadership made a different set of decisions and more highly prioritized the transition experience. (Like not initially removing features like u'' and bytes % and not introducing gratuitous backwards compatibility breaks, like with items()/iteritems(). I would have also liked to see a feature like from __future__ - maybe from __past__ - that would make it easier for Python 3 code to target semantics in earlier versions in order to provide a more turnkey on-ramp onto new versions.) I simultaneously see Python 3 losing its position as a justifiable tool in some domains (like systems level tooling) due to ongoing design decisions and poor implementation (like startup overhead problems). (In contrast, I see Rust excelling where Python is faltering and find Rust code surprisingly expressive to write and maintain given how low-level it is and therefore feel that Rust is a compelling alternative to Python in a surprisingly large number of domains.)
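The items()/iteritems() break is a good example of the bridge code a dual-version code base accumulates. Below is a minimal sketch of such a shim; the iteritems helper is hypothetical here (compatibility libraries such as six ship a similar function) and is not Mercurial's actual implementation.

```python
import sys

if sys.version_info[0] >= 3:
    def iteritems(mapping):
        """Iterate over (key, value) pairs without building a list."""
        return iter(mapping.items())
else:
    def iteritems(mapping):
        """Python 2 branch: dict.items() would copy, so use iteritems()."""
        return mapping.iteritems()


pairs = sorted(iteritems({"a": 1, "b": 2}))
```

Every call site then has to go through the shim instead of a dict method, which is the kind of code that must later be unwound once Python 2 support is dropped.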

Look, I know it is easy for me to armchair quarterback and critique with the benefit of hindsight/ignorance. I'm sure there is a lot of nuance here. I'm sure there was disagreement within the Python community over a lot of these issues. Maintaining a large and successful programming language and community like Python's is hard and you aren't going to please all the people all the time. And speaking as a maintainer, I have mad respect for the people leading such a large community. But niceties aside, everyone knows the Python 3 transition was rough and could have gone better. It should not have taken 11 years to get to where we are today.

I'd like to encourage the Python Project to conduct a thorough postmortem on the transition to Python 3. Identify what went well, what could have gone better, and what should be done next time such a large language change is wanted. Speaking as a Python user, a maintainer of a Python project, and as someone in industry who is now skeptical about use of Python at work due to risks of potentially company crippling high-effort migrations in the future, a postmortem would help restore my confidence that Python's maintainers learned from the various missteps on the road to Python 3 and these potentially ecosystem crippling mistakes won't be made again.

Python had a wildly successful past few decades. And it can continue to thrive for several more. But the Python 3 migration was painful for all involved. And as much as we need to move on and leave Python 2 behind us, there are some important lessons to be learned. I hope the Python community takes the opportunity to reflect and am confident it will grow stronger by taking the time to do so.

C Extension Support in PyOxidizer

June 30, 2019 at 04:40 PM | categories: Python, PyOxidizer

The initial release of PyOxidizer generated a bit of excitement across the Internet! The post was commented on heavily in various forums and my phone was constantly buzzing from all the Twitter activity. There has been a steady stream of GitHub Issues for the project, which I consider a good sign. So thank you everybody for the support and encouragement! And especially thank you to everyone who filed an issue or submitted a pull request!

While I don't usually read the comments, I was looking at various forums posting about PyOxidizer to see what reactions were like. A common negative theme was the lack of C extension support in PyOxidizer. People seemed dismissive of PyOxidizer because it didn't support C extensions. Despite the documentation stating that the feature was planned and that I had an idea for how to implement it, people seemed pessimistic. Perhaps I didn't adequately communicate that making C extensions work is actually a subset of the already-solved single file executable problem and therefore was already a solved problem at the technical level (only the integration with the Python and PyOxidizer build systems was missing). So in my mind C extension support was only a matter of time and the only open question was how many hacks would be needed to make it work, not whether it would work.

Well, I'm pleased to report that the just-released version 0.2 of PyOxidizer supports C extensions on Windows, macOS, and Linux. If you install a Python package through a pip-install-simple, pip-requirements-file, or setup-py-install packaging rule, C extensions will be compiled in a special way that enables them to be embedded in the same binary containing Python itself. I've tested it with the zope.interface, zstandard, and mercurial packages and it seems to work (although Mercurial has other issues that prevent it from being packaged as a PyOxidizer application - but the C extensions do compile).

There are some limitations to the support, however. I'm pretty confident the limitations can be eliminated given enough time. Given how many people were hung up on the lack of C extensions and were seemingly writing off PyOxidizer thinking it was snake oil or something, I wanted to deliver basic C extension support to curtail this line of criticism. Perfect is the enemy of good and hopefully basic C extension support is good enough to ease concerns about PyOxidizer's viability.

Also in PyOxidizer 0.2 are some minor new features, like the --pip-install and --python-code flags to pyoxidizer init. These allow you to generate a pyoxidizer.toml file pre-configured to install some packages from pip and run custom Python code. So now applications can be created and built with a one-liner without having to edit a pyoxidizer.toml file!

The full release notes are available. As always, please keep filing issues. I'm particularly interested in hearing about packages whose C extensions don't work properly.

Building Standalone Python Applications with PyOxidizer

June 24, 2019 at 09:00 AM | categories: Python, PyOxidizer, Rust

Python application distribution is generally considered an unsolved problem. At their PyCon 2019 keynote talk, Russell Keith-Magee identified code distribution as a potential black swan - an existential threat for longevity - for Python. In their words, Python hasn't ever had a consistent story for how I give my code to someone else, especially if that someone else isn't a developer and just wants to use my application. I completely agree. And I want to add my opinion that unless your target user is a Python developer, they shouldn't need to know anything about Python packaging, Python itself, or even the existence of Python in order to use your application. (And you can replace Python in the previous sentence with any programming language or software technology: most end-users don't care about the technical implementation, they just want to get stuff done.)

Today, I'm excited to announce the first release of PyOxidizer (project, documentation), an open source utility that aims to solve the Python application distribution problem! (The installation instructions are in the docs.)

Standalone Single File, No Dependencies Executable Python Applications

PyOxidizer's marquee feature is that it can produce a single file executable containing a fully-featured Python interpreter, its extensions, standard library, and your application's modules and resources. In other words, you can have a single .exe providing your application. And unlike other tools in this space which tend to be operating system specific, PyOxidizer works across platforms (currently Windows, macOS, and Linux - the most popular platforms for Python today). Executables built with PyOxidizer have minimal dependencies on the host environment and do not do anything complicated at run-time. I believe PyOxidizer is the only open source tool to have all these attributes.

On Linux, it is possible to build a fully statically linked executable. You can drop this executable into a chroot or container where it is the only file and it will just work. On macOS and Windows, the only library dependencies are on always-present or extremely common libraries. More details are in the docs.

At execution time, binaries built with PyOxidizer do not do anything special to run the Python interpreter. (Other tools in this space do things like create a temporary directory or SquashFS filesystem and extract Python to it.) PyOxidizer loads everything from memory and there is no explicit I/O being performed. When you import a Python module, the bytecode for that module is being loaded from a memory address in the executable using zero-copy. This makes PyOxidizer executables faster to start and import - faster than a python executable itself!
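The general idea of importing from memory can be sketched in pure Python with a custom meta path importer. This is only an illustration of the technique using the standard importlib hooks: PyOxidizer's real importer is implemented in Rust and loads bytecode embedded in the executable, whereas this sketch compiles source held in a dict. The module name hello_from_memory and the IN_MEMORY_MODULES index are invented for the example.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory index mapping module names to their source code.
IN_MEMORY_MODULES = {
    "hello_from_memory": "def greet():\n    return 'hello from memory'\n",
}


class InMemoryImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Find and load modules from a dict instead of probing the filesystem."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in IN_MEMORY_MODULES:
            return None  # defer to the next finder on sys.meta_path
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # let the import system create a default module object

    def exec_module(self, module):
        source = IN_MEMORY_MODULES[module.__name__]
        code = compile(source, "<memory:%s>" % module.__name__, "exec")
        exec(code, module.__dict__)


# Register ahead of the default finders so the lookup never touches disk.
sys.meta_path.insert(0, InMemoryImporter())

import hello_from_memory

greeting = hello_from_memory.greet()
```

Resolving the module is a single dict lookup, which hints at why an importer backed by a pre-built index can avoid the filesystem stat calls a traditional import performs.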

Current Release and Future Roadmap

Today's release of PyOxidizer is just the first release milestone in what I envision is a long and successful project history. While my over-arching goal with PyOxidizer is to solve vast swaths of the Python application distribution problem, I want to be clear that this first release comes nowhere close to doing so. I toiled over which features must be in the initial release. I ultimately decided that PyOxidizer's current functionality is extremely valuable to some audiences and that the project has matured to the point where more eyeballs and users would substantially help its development. (I could definitely use some help prioritizing which features to work on and for that I need users and user feedback.)

In today's release, PyOxidizer is good at producing executables embedding Python. It doesn't yet venture too far into the distribution part of the problem (I want it to be trivial to produce MSI installers, DMG images, deb/rpm packages, etc). But on Linux, this is already a huge step forward because PyOxidizer makes it easy (hopefully!) to produce binaries that should just work on other machines. (Anyone who has attempted to distribute Linux applications will tell you how painful this problem can be.)

Despite its limitations, I believe today's release of PyOxidizer to be a viable tool for some applications. And I believe PyOxidizer can start to replace existing tools in this space. (See the Comparisons to Other Tools document for how PyOxidizer compares to other Python packaging and distribution tools.)

Using today's release of PyOxidizer, larger user-facing applications using Python (like Dropbox, Kodi, MusicBrainz Picard, etc) could use PyOxidizer to produce self-contained executables. This would likely cut down on installer size, decrease install/update time (fewer files means faster operations), and hopefully make packaging simpler for application maintainers. Maintainers of Python utilities could produce self-contained executables, making their utilities faster to start and easier to package and distribute.

New Possibilities and Reliability for Python

By enabling support for self-contained, single file Python applications, PyOxidizer opens exciting new doors for Python. Because Python has historically required an explicit, separate runtime not part of the executable, Python was not viable (or was a hindrance) in many domains. For example, if you wanted to use Python to bootstrap a fresh server or empty container environment, you had a chicken-and-egg problem because you needed to install Python before you could use it.

Let's take Ansible for example. One of Ansible's features is that it remotes into a machine and runs things. The way it does this is it dynamically generates Python scripts locally, uploads them to the remote machine, and tells the remote to execute them. Those Python scripts require the existence of a Python interpreter on the remote machine. This means you need to install Python on a machine before you can control it with Ansible. Furthermore, because the remote's Python isn't under Ansible's control, you can assume very little about its behavior and capabilities, making interaction a bit brittle.

Using PyOxidizer, projects like Ansible could produce a self-contained executable containing a Python interpreter. They could transfer that single binary to the remote machine and execute it, instantly giving the remote machine access to a fully-featured and modern Python interpreter. From there, the sky is the limit. In Ansible's case, the executable could contain the full Ansible runtime, along with any 3rd party Python packages they wanted to leverage. This would allow execution to occur (possibly mostly independently) on the remote machine. This architecture is simpler, scales better, would likely result in faster operations, and would probably improve the quality of life for everyone involved, from application developers to its end users.

Self-contained Python applications built with PyOxidizer essentially solve the Python interpreter bootstrapping and reliability problems. By providing a Python interpreter and a known set of Python modules, you provide a highly deterministic and reliable execution environment for your application. You don't need to fret about which version of Python is installed: you know which version of Python you are using. You don't need to worry about which Python packages are installed: you control explicitly which packages are available. You don't need to worry about whether you are running in a virtualenv, what sys.path is set to, whether .pth files come into play, whether various PYTHON* environment variables can mess up your application, whether some Linux distribution packaged Python differently, what to put in your script's shebang, etc: executables built with PyOxidizer behave as you have instructed them to because they are compiled that way.

All of the concerns in the previous paragraph contribute to a larger problem in the eyes of application maintainers that can be summarized as Python isn't reliable. And because Python isn't reliable, many people reach the conclusion that Python shouldn't be used (this is the black swan that was referred to earlier). With PyOxidizer, the Python environment is isolated and highly deterministic making the reliability problem largely go away. This makes Python a more viable technology choice. And it enables application maintainers to aggressively adopt modern Python versions, utilize third party packages fearlessly, and spend far less time chasing an extremely long tail of issues related to Python environment variance. Succinctly, application developers can focus on building great applications instead of toiling with Python environment problems.

Project Status

PyOxidizer is still in its relative infancy. While it is far from feature complete, I'm mentally committed to working on the remaining major functionality. The Status document lists major missing functionality, lesser missing functionality, and potential future value-add functionality.

I want PyOxidizer to provide a Python application packaging and distribution experience that just works with minimal cognitive effort from Python application maintainers. I have spent a lot of effort documenting PyOxidizer. I care passionately about user experience and want everything about PyOxidizer to be simple and frustration free. I know things aren't there yet. The problems that PyOxidizer is attempting to solve are hard (that's a reason nobody has solved them well yet). I know there are details floating around in my head that haven't been added to the documentation yet. I know there are missing features and bugs in PyOxidizer. I know there are Packaging Pitfalls yet to be discovered.

This is where you come in.

I need your help to make PyOxidizer great. I encourage Python application maintainers reading this to head over to Getting Started and the Packaging User Guide and try to package your applications with PyOxidizer. If things don't work, let me know by filing an issue. If you are confused by lack of or unclear documentation, file an issue. If something frustrates you, file an issue. If you want to suggest I work on a certain feature or fix a bug, file an issue! Tweet to @indygreg to engage with me there. Join the pyoxidizer-users mailing list. While I feel PyOxidizer is usable today (that's why I'm announcing it), I need your feedback to help guide future prioritization.

Finally, I know PyOxidizer has significant implications for some companies and projects that use Python. While I'm not looking to enrich myself or make my livelihood from PyOxidizer, if PyOxidizer is useful to you and you'd like to send money my way as appreciation, you can do so on Patreon or PayPal. If not, that's totally fine: I wouldn't be making PyOxidizer open source if I didn't want to share it with the world for free! And I am financially well off, too. I just feel like there should be more financial contribution to open source because it would improve the health of the ecosystem and I can help achieve that end by advocating for it and giving myself.

Leveraging Rust

The oxidize part of PyOxidizer comes from Rust. (See the Wikipedia Rust article - for the chemical, not the programming language - to understand where oxidize comes from.) The build time packaging and building functionality is implemented in Rust. And the binary that embeds and controls the Python interpreter in built applications is Rust code. Rationale for these decisions is explained in the FAQ.

This is my first non-toy project using Rust and I have to say that Rust is... incredible! I may have to author a dedicated blog post extolling the virtues of Rust. In short, Rust is now my go-to language for systems level projects. Unless you need the target platform versatility, I don't think C or C++ are defensible choices in 2019 given their security deficiencies. Languages like Go, Java, and various JVM or CLR languages are acceptable if you can tolerate having a garbage collector and/or a larger runtime. But what makes Rust superior in my mind is the ability for the compiler to prevent large classes of software bugs (especially those that turn into CVEs) and inefficiencies that have plagued our industry for decades. Rust is the first programming language I've used where I feel like the language itself, the compiler, the tools around it (cargo, rustfmt, clippy, rustup, etc), and the community surrounding it all actually care about and assist me with writing high quality software. Nothing else I've used comes even close.

What has surprised me most about Rust is how high level it feels for a systems level language that isn't garbage collected. When you program lower-level languages like C or C++, compared to a higher level language like Python, you have to type a lot more and be more explicit in nearly everything you do. While Rust is certainly not as expressive or compact as say Python, it is far, far closer to Python than I was expecting it to be. Yes, you do have to type more and think more about your code to appease the Rust compiler's constraints. But the return on that investment is the compiler preventing entire classes of bugs and C/C++ levels of performance. When I started PyOxidizer, the build time logic was implemented in Python and only the run-time pieces were in Rust. After learning a bit more Rust and realizing the obvious code quality benefits, I ditched Python and adopted Rust for the build time logic. And as the code base has grown and gone through various refactorings, I am so glad I did so! The Rust compiler has caught dozens of would-be bugs in Python. Granted, many of these can be attributed to having strong typing and compile time type checking and Rust is little different than say Java on this front. But a significant number of prevented bugs covered invariants in the code because of the way Rust's type system often intersects with control flow. For example, match arms must be exhaustive, so you can't have unhandled values/types, and unchecked Result instances result in a compiler warning. And clippy has been just fantastic helping to guide me towards writing more acceptable code following community accepted best practices.

Even though PyOxidizer is implemented in Rust, most end-users shouldn't have to care (beyond having to install a Rust compiler and build PyOxidizer from source). The existence of Rust should be abstracted away from Python packagers. I did this on purpose because I believe that users of an application shouldn't have to care about the technical implementation of that application. It is a bit unfortunate that I force users to install Rust before using PyOxidizer, but in my defense the target audience is technically savvy developers, bootstrapping Rust is easy, and PyOxidizer is young, so I think it is acceptable for now. If people get hung up on it, I can provide pre-compiled pyoxidizer executables.

But if you do know Rust, PyOxidizer being implemented in Rust opens up some exciting possibilities!

One exciting possibility with PyOxidizer is the ability to add Rust code to your Python application. PyOxidizer works by generating a default Rust application (main.rs) that simply instantiates and runs an embedded Python interpreter then exits. It essentially does what python or a Python script would do. The key takeaway here is your Python application is technically a Rust application (in the same way that python is technically a C application). And being a Rust application means you can add Rust code to that application. You can modify the autogenerated main.rs to do things before, during, and after the embedded Python interpreter runs. It's a regular Rust program and can do anything that Rust programs can do!

Another possibility - and a variant of the above - is embedding Python in existing Rust projects. PyOxidizer's mechanism for embedding a Python interpreter is implemented as a standalone Rust crate. Add the pyembed crate to an existing Rust project and, a little build system magic later, your Rust project can embed and run a Python interpreter!

There's a lot of potential for hybrid Rust + Python programs. And I am very excited about the possibilities.

If you are a Rust programmer, PyOxidizer allows you to easily embed Python in your Rust application. If you are a Python programmer, PyOxidizer allows you to easily leverage Rust in your Python application. In short, the package ecosystem of the other becomes available to you. And if you aren't familiar with Rust, there are some potentially crazy possibilities. For example, Alacritty is a GPU accelerated terminal emulator written in Rust and Servo is an entire web browser engine written in Rust. With PyOxidizer, you could integrate a terminal emulator or browser engine as part of your Python application if you really wanted to. And, yes, Rust's packaging tools are so good that stuff like this tends to just work. As a concrete example, the pyoxidizer CLI tool contains libgit2 for performing in-process interactions with Git repositories. Adding this required a single line change to a Cargo.toml file and it just worked on Linux, macOS, and Windows. Stuff like this often takes hours to days to integrate in C/C++. It is quite ridiculous how easy it is to add (complex) components to Rust projects!

For years, Python projects have implemented extensions in C to realize performance wins. If your Python application is a Rust executable, then implementing this functionality in Rust (rather than C) seems rational. So we may see oxidized Python applications have their performance critical pieces slowly be rewritten in Rust. (Honestly, the Rust crates to interface between Rust and the CPython API still leave a bit to be desired, so the experience of writing this Rust code still isn't great. But things will certainly improve over time.)

This type of inside-out split language work has been practiced in Python for years. What PyOxidizer brings to the table is the ability to more easily port code outside-in. For example, you could implement performance-critical, early application logic such as config file parsing and command line argument parsing in Rust. You could then have Rust service some application functionality without Python. Why would you want this? Performance is a valid reason. Starting a Python interpreter, importing modules, and running code can consume several dozen or even hundreds of milliseconds. If you are writing performance sensitive applications, the existence of any Python can add enough latency that people no longer perceive the interaction as instantaneous. This added latency can make Python totally inappropriate for some contexts, such as for programs that run as part of populating your shell's prompt. Writing such code in Rust instead of Python dramatically increases the probability that the code is fast and likely delivers stronger correctness guarantees courtesy of Rust's compile time validation as well!
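
The interpreter startup cost is easy to measure on your own machine. The sketch below times how long it takes to spawn the current Python interpreter, run an empty program, and exit. The function name interpreter_startup_seconds is just for this illustration, and the number it reports includes process creation overhead, so treat it as a rough floor rather than a precise benchmark.

```python
import subprocess
import sys
import time


def interpreter_startup_seconds(runs=5):
    """Return the fastest observed wall time to run `python -c pass`."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Spawn a fresh interpreter that does nothing and exits.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
        timings.append(time.perf_counter() - start)
    # The minimum is the least noisy estimate of the fixed startup cost.
    return min(timings)


if __name__ == "__main__":
    print(f"interpreter startup: {interpreter_startup_seconds() * 1000:.1f} ms")
```

Replacing "pass" with the imports your application performs at startup gives a feel for how much of a latency budget is spent before any real work happens.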

An extreme practice of outside-in porting of Python to Rust would be to incrementally rewrite an entire Python application in Rust. Rust's ergonomics are exceptional and I do think we'll see people choose Rust where they previously would have chosen Python. I've done this myself with PyOxidizer and feel it is a very defensible decision to reach! I feel a bit conflicted releasing a tool which may undermine Python's popularity by encouraging use of Rust over Python. But at the end of the day, PyOxidizer increases the utility of both Python and Rust by giving each readier access to the other, and PyOxidizer improves the overall utility of Python by improving the application distribution story. I have no doubt PyOxidizer is a net benefit for the Python ecosystem, even if it does help usher in more people choosing Rust over Python. If I have an ulterior motive in developing PyOxidizer, it is to enable Mercurial's official distribution to be a Rust executable and for some functionality (like hg status) to be runnable without Python (for performance reasons).

Another possible use of PyOxidizer is as a library. All the build time functionality of PyOxidizer exists in a Rust crate. So, you can add the pyoxidizer crate to your own Rust project and use its code to do things like build a library containing Python, compile Python source modules to bytecode, or walk a directory tree and find Python resources within. The code is still heavily geared towards PyOxidizer and there's no promise of API stability. But this potential for library usage exists and if others want to experiment with building custom Python binaries not using the pyoxidizer CLI tool, using PyOxidizer as a library might save you a lot of time.
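
To illustrate what "walk a directory tree and find Python resources" means, here is a simplified sketch of the idea in Python. This is not the pyoxidizer crate's API (which is Rust). The function find_python_modules is invented for this example, and it only handles .py source files while ignoring __pycache__ directories.

```python
import os
import tempfile


def find_python_modules(root):
    """Yield (module_name, is_package) for each .py file found under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune bytecode caches and visit directories in a stable order.
        dirnames[:] = sorted(d for d in dirnames if d != "__pycache__")
        for filename in sorted(filenames):
            if not filename.endswith(".py"):
                continue
            rel = os.path.relpath(os.path.join(dirpath, filename[:-3]), root)
            parts = rel.split(os.sep)
            is_package = parts[-1] == "__init__"
            if is_package:
                parts.pop()  # pkg/__init__.py names the package "pkg" itself
            if parts:
                yield ".".join(parts), is_package


# Build a tiny throwaway tree to demonstrate the walk.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "pkg"))
    for rel in ("app.py", os.path.join("pkg", "__init__.py"),
                os.path.join("pkg", "util.py"), "README.txt"):
        open(os.path.join(root, rel), "w").close()
    found = list(find_python_modules(root))

print(found)
```

A real resource scanner also has to deal with bytecode files, extension modules, and non-module resource files, which is part of why reusing existing code for this can save time.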

Standalone Python Distributions

One of the most time consuming parts of building PyOxidizer was figuring out how to build self-contained Python distributions. Typically, a Python build consists of a library, shared libraries for various extension modules, shared libraries required by the prior items, and a hodgepodge of other files, such as .py files implementing the Python standard library. The python-build-standalone project was created to automate creating special builds of Python which are self-contained and distributable. This requires doing dirty things with build systems. But I don't want to inflict the details on you here. What I do think is worth mentioning is how those Python distributions are distributed. The output of the build is a tarball containing the Python installation, build artifacts that can be used to link a custom libpython, and a PYTHON.json file describing the contents of the distribution. PyOxidizer reads the PYTHON.json file and learns how it should interact with that distribution. If you produce a Python distribution conforming to the format that python-build-standalone defines, you can use that Python with PyOxidizer.
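
The general pattern of a machine-readable descriptor shipped alongside a distribution can be sketched in a few lines. The keys used below (python_version, python_exe, stdlib_dir) are hypothetical and chosen only for illustration. The real PYTHON.json schema is defined by python-build-standalone and is not reproduced here.

```python
import json

# Hypothetical descriptor contents: these keys are assumptions made for this
# example, not the actual PYTHON.json schema.
EXAMPLE_DESCRIPTOR = """
{
  "python_version": "3.7.3",
  "python_exe": "install/bin/python3",
  "stdlib_dir": "install/lib/python3.7"
}
"""

REQUIRED_KEYS = ("python_version", "python_exe", "stdlib_dir")


def load_descriptor(text):
    """Parse descriptor JSON and fail early if an expected key is missing."""
    info = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in info]
    if missing:
        raise ValueError(f"descriptor is missing keys: {', '.join(missing)}")
    return info


info = load_descriptor(EXAMPLE_DESCRIPTOR)
print(info["python_version"], info["python_exe"])
```

The point of the design is that the consuming tool never has to guess at a distribution's layout: everything it needs to know is declared in one file that can be validated up front.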

While I have no urgency to do so at this time, I could see a future where this Python distribution format is standardized. Then maintainers of various Python distributions (CPython, PyPy, etc) would independently produce their own distributable artifacts conforming to this standard, in turn allowing machine consumers of Python distributions (such as PyOxidizer) to easily consume different Python distributions and do interesting things with them. You could even imagine these Python distribution archives being readily available as packages in your system's package manager and their locations exposed via the sysconfig Python module, making it easy for tools (like PyOxidizer) to find and use them.

Over time, I could see PyOxidizer's functionality rolling up into official packaging tools like pip, which would know how to consume the distribution archives and produce an executable containing a Python interpreter, required Python modules, etc.

Getting PyOxidizer's functionality rolled into official Python packaging tools is likely years away (if it ever happens). But I think standardizing a format that describes a Python distribution and (optionally) contains build artifacts that can be used to repackage it is a prerequisite and would be a good place to start this journey. I would certainly love for Python distributions (like CPython) to be in charge of producing official repackagable distributions because this is not something I want to be in the business of doing long term (I'm lazy, less equipped to make the correct decisions, and there are various trust and security concerns). And while I'm here, I am definitely interested in upstreaming some of the python-build-standalone functionality into the existing CPython build system because coercing CPython's build system to produce distributable binaries is currently a major pain and I'd love to enable others to do this. I just haven't had time, nor do I know if the patches would be well received. If a CPython maintainer wants to get in touch, I'd love to have a conversation!

Conclusion

I started hacking on PyOxidizer in November 2018. After months of chipping away at it, I think I finally have a useful utility for some audiences. There are still a lot of missing features and some rough edges. But the core functionality is there and I'm convinced that PyOxidizer or its underlying technology could be an integral part of solving Python's application distribution black swan problem. I'm particularly proud of the hacks I concocted to coerce Python into importing module bytecode from memory using zero-copy. Those are documented in this blog post and in the pyembed crate docs.
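
To give a feel for what importing bytecode from memory involves, here is a minimal pure Python sketch built on the standard importlib hooks. It is not PyOxidizer's implementation: the real importer is written in Rust and avoids copying the bytecode, whereas this toy version holds marshalled code in a dict and copies freely. The names InMemoryImporter and demo_mod are made up for the example.

```python
import importlib.abc
import importlib.util
import marshal
import sys


class InMemoryImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Toy importer that serves modules from a dict of marshalled code objects."""

    def __init__(self, bytecode_by_name):
        # Maps fully qualified module name -> marshalled code object bytes.
        self._bytecode = dict(bytecode_by_name)

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self._bytecode:
            return None  # Not ours: let the next importer on sys.meta_path try.
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # Fall back to default module object creation.

    def exec_module(self, module):
        # Deserialize the code object and run it in the module's namespace.
        code = marshal.loads(self._bytecode[module.__name__])
        exec(code, module.__dict__)


# Compile a module's source to bytecode entirely in memory, then register it.
source = "GREETING = 'hello from memory'\n"
bytecode = marshal.dumps(compile(source, "<in-memory demo_mod>", "exec"))
sys.meta_path.insert(0, InMemoryImporter({"demo_mod": bytecode}))

import demo_mod  # Resolved by the importer above; no file is read.

print(demo_mod.GREETING)
```

No filesystem access is needed to resolve demo_mod, which is the same property that lets a single executable carry its own modules.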

So what are you waiting for? Head on over to the documentation, install PyOxidizer, and let me know how it goes by filing issues!

I hope you enjoy oxidizing your Python applications!
