Modern CI is Too Complex and Misdirected

April 07, 2021 at 09:00 AM | categories: CI, build system

The state of CI platforms is much stronger than it was just a few years ago. Overall, this is a good thing: access to powerful CI platforms enables software developers and companies to ship more reliable software more frequently, which benefits their users/customers. Centralized CI platforms like GitHub Actions, GitLab Pipelines, and Bitbucket provide benefits of scale, as the Internet serves as a collective information repository for how to use them. Do a search for how to do X on CI platform Y and you'll typically find some code you can copy and paste. Nobody wants to toil with wrangling their CI configuration after all: they just want to ship.

Modern CI Systems are Too Complex

The advancements in CI platforms have come at a cost: increased complexity. And the more I think about it, the more I come around to the belief that modern CI systems are too complex. Let me explain.

At its core, a CI platform is specialized remote code execution as a service (it's a feature, not a CVE!) where the code being executed is in pursuit of building, testing, and shipping software (unless you abuse it to mine cryptocurrency). On top of that core, CI platforms typically throw in a bunch of value-add features to enable you to ship software more easily. There are vastly different approaches and business models here. (I must tip my hat to GitHub Actions leveraging network effects via community-maintained actions: this lowers TCO for GitHub since they don't need to maintain many actions themselves and creates vendor lock-in as users develop a dependence on platform-proprietary actions, all while increasing the value of the platform for end-users - a rare product trifecta.) A common value-add feature of CI platforms is some kind of configuration file (often YAML) which itself offers common functionality, such as configuring the version control checkout and specifying what commands to run. This is where we start to get into problems.

(I'm going to focus on GitHub Actions here, not because it is the worst (far from it), but because it seems to be the most popular, so readers can relate more easily. But my commentary applies to other platforms like GitLab as well.)

The YAML configuration of modern CI platforms is... powerful. Here are features present in GitHub Actions workflow YAML:

  • An embedded templating system that results in the source YAML being expanded into a final YAML document that is actually evaluated. This includes a custom expression mini language.
  • Triggers for when to run jobs.
  • Named variables.
  • Conditional job execution.
  • Dependencies between jobs.
  • Defining Docker-based run-time environments.
  • Encrypted secrets.
  • Steps constituting each job and what actions those steps should take.

If we expand scope slightly to include actions maintained by GitHub, we also have steps/actions for:

  • Performing Git checkouts.
  • Storing artifacts used by workflows/jobs.
  • Caching artifacts used by workflows/jobs.
  • Installing common programming languages and environments (like Java, Node.js, Python, and Ruby).
  • And a whole lot more.

And then of course there are 3rd party Actions. And there are a lot of them!

There's a lot of functionality here and a lot of it is arguably necessary: I'm hard pressed to name a feature to cut. (Although I'm no fan of using YAML as a programming language, I concede its use is a fair compromise compared to forcing people to write code to produce YAML or make equivalent API calls to do what the YAML would do.) All these features seem necessary for a sufficiently powerful CI offering. Nobody would use your offering if it didn't offer turnkey functionality after all.

So what's my complaint?

I posit that a sufficiently complex CI system becomes indistinguishable from a build system. I challenge you: try to convince me or yourself that GitHub Actions, GitLab CI, and other CI systems aren't build systems. The basic primitives are all there. GitHub Actions workflows composed of jobs composed of steps are little different from, say, Makefiles composed of rules composed of commands to execute for each rule, with dependencies gluing everything together. The main difference is the form factor and the execution model (build systems are traditionally local and single machine while CI systems are remote/distributed).

Then there's the converse conjecture: a sufficiently complex build system becomes indistinguishable from a CI system. Earlier I said that CI systems are remote code execution as a service. While build systems are historically things that run locally (and are therefore not a service), modern build systems like Bazel (or Buck or Gradle) are completely different animals. For example, Bazel has remote execution and remote caching as built-in features. Hey - those are built-in features of modern CI systems too! So here's a thought experiment: if I define a build system in Bazel and then define a server-side Git push hook so the remote server triggers Bazel to build, run tests, and post the results somewhere, is that a CI system? I think it is! A crude one, but I think it qualifies as a CI system.
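
To make the thought experiment concrete, here's a minimal sketch of such a push-hook-driven CI in Python. The scratch paths, Bazel targets, and results endpoint are placeholders I made up for illustration:

```python
#!/usr/bin/env python3
# Sketch of the thought experiment above: a server-side Git post-receive hook
# that turns a build system invocation into a crude CI system. Paths, targets,
# and the results URL are hypothetical placeholders.
import json
import subprocess
import sys
import urllib.request

RESULTS_URL = "https://ci-results.example.com/api/report"  # hypothetical endpoint

def main():
    # post-receive receives "<old-rev> <new-rev> <ref>" lines on stdin.
    for line in sys.stdin:
        old_rev, new_rev, ref = line.split()

        # Check out the pushed revision into a scratch working tree.
        subprocess.run(
            ["git", "worktree", "add", "--detach", f"/tmp/ci-{new_rev}", new_rev],
            check=True,
        )

        # Let the build system do the heavy lifting: build and test everything.
        result = subprocess.run(
            ["bazel", "test", "//..."],
            cwd=f"/tmp/ci-{new_rev}",
            capture_output=True,
            text=True,
        )

        # Report the outcome somewhere people can see it.
        report = json.dumps({
            "ref": ref,
            "revision": new_rev,
            "exit_code": result.returncode,
            "log_tail": result.stdout[-4000:],
        }).encode("utf-8")
        urllib.request.urlopen(urllib.request.Request(
            RESULTS_URL, data=report, headers={"Content-Type": "application/json"}
        ))

if __name__ == "__main__":
    main()
```

That's obviously missing a UI, retries, and security controls. But the core loop - run the build system remotely on push and report results - is all there.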

If you squint hard enough, sufficiently complex CI systems and sufficiently complex build systems start to look like the same thing to me. At a very high level, both are providing a pool of servers offering general compute/execute functionality with specialized features in the domain of building/shipping software, like inter-task artifact exchange, caching, dependencies, and a frontend language to define how everything works.

(If you squint really hard you can start to see a value proposition of Kubernetes for even more general compute scheduling, but I'm not going to go that far in this post because it is a much harder point to make and I don't necessarily believe in it myself. But I thought I'd mention it as an interesting thought experiment. But an easier leap to make is to throw batch job execution (as is often found in data warehouses) in with build and CI systems as belonging in the same bucket: batch job execution also tends to have dependencies, exchange of artifacts between jobs, and I think can strongly resemble a CI system and therefore a build system.)

The thing that bugs me about modern CI systems is that I inevitably feel like I'm reinventing a build system and fragmenting build system logic. Your CI configuration inevitably devolves into a bunch of complex YAML with all kinds of caching and dependency optimizations to keep execution time low and reliability high - just like your build system. You find yourself contorting your project's build system to work in the context of CI and vice versa. You end up managing two complex DAGs and platforms/systems instead of one.

Because build systems are more generic than CI systems (I think a sufficiently advanced build system can do a superset of the things that a sufficiently complex CI system can do), CI systems are redundant with sufficiently advanced build systems. So, going beyond the section title: CI systems aren't merely too complex; they shouldn't need to exist as separate systems. Your CI functionality should be an extension of the build system.

In addition to the redundancy argument, I think unified systems are more user friendly. By integrating your CI system into your build system (which by definition can be driven locally as part of regular development workflows), you can expose the full power of the CI system to developers more easily. Think running ad-hoc CI jobs without having to push your changes to a remote server first, just like you can with local builds or tests. This is huge for ergonomics and can drastically compress the cycle time for changes to these systems (which are often brittle to change/test).

Don't get me wrong, aspects of CI systems not traditionally found in build systems (such as centralized results reporting and a UI/API for (re)triggering jobs) absolutely need to exist. Instead, it is the remote compute and work definition aspects that are completely redundant with build systems.

Let's explore the implications of build and CI systems being more of the same.

Modern CI Offerings are Targeting the Wrong Abstraction

If you assume that build and CI systems can be / are more of the same, then it follows that many modern CI offerings like GitHub Actions, GitLab CI, and others are targeting the wrong abstraction: they are defined as domain specific platforms for running CI systems when instead they should take a step back and target the broader general compute platform that is also needed for build systems (and maybe batch job execution, such as what's commonly found in data warehouses/pipelines).

Every CI offering is somewhere different on the spectrum here. I would go so far as to argue that GitHub Actions is more a CI product than a platform. Let me explain.

In my ideal CI platform, I have the ability to schedule an ad-hoc graph of tasks against that platform. I have the ability to hit some APIs with definitions of the tasks I want that platform to run and it accepts them, executes them, uploads artifacts somewhere, reports task results so dependent tasks can execute, etc.

There is a GitHub Actions API that allows you to interact with the service. But the critical feature it lacks is the ability to define ad-hoc units of work: the actual remote execute as a service. Rather, the only way to define units of work is via workflow YAML files checked into your repository. That's so constraining!
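
To make the complaint concrete, here is a purely hypothetical sketch of what scheduling an ad-hoc unit of work could look like if the underlying remote execute service were exposed directly. The endpoint and payload schema are invented; nothing like this exists in the GitHub Actions API today:

```python
# Purely hypothetical: what scheduling an ad-hoc unit of work against a CI
# platform *could* look like if the remote execute service were exposed
# directly. The endpoint and payload schema are invented for illustration.
import json
import urllib.request

task = {
    "image": "ubuntu:22.04",
    "commands": [
        "git clone https://github.com/example/project",
        "cd project && make test",
    ],
    "artifacts": ["project/test-results.xml"],
    "max_run_seconds": 1800,
}

req = urllib.request.Request(
    "https://api.example-ci.com/v1/tasks",          # invented endpoint
    data=json.dumps(task).encode("utf-8"),
    headers={"Authorization": "Bearer <token>", "Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["task_id"])               # poll this for status/results
```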

GitLab Pipelines is a lot better. GitLab Pipelines supports features like parent-child pipelines (dependencies between different pipelines), multi-project pipelines (dependencies between different projects/repos), and dynamic child pipelines (YAML files generated by a pipeline job that define a new pipeline). (I don't believe GitHub Actions supports any of these features.) Dynamic child pipelines are an important feature, as they mostly divorce the checked-in YAML configuration from the remote execute as a service feature. The main missing feature here is a generic API that allows you to achieve this functionality without having to go through a parent pipeline / YAML first. If that API existed, you could build your own build/CI/batch execute system on top of GitLab Pipelines with fewer constraints imposed on you by GitLab Pipelines' opinionated YAML configuration files and the intended use of its creators. (Generally, I think a good litmus test for a well-designed platform or tool is when its authors are surprised by someone's unintended use of it. Of course this knife cuts both ways, as sometimes people do undesirable things, like mine cryptocurrency.)

CI offerings like GitHub Actions and GitLab Pipelines are more products than platforms because they tightly couple an opinionated configuration mechanism (YAML files) and web UI (and corresponding APIs) on top of a theoretically generic remote execute as a service offering. For me to consider these offerings platforms, they need to grow the ability to schedule arbitrary compute via an API, without being constrained by the YAML officially supported out of the box. GitLab is almost there (the critical missing link is an API to schedule an inline-defined pipeline). It is unknown whether GitHub is pursuing - or is even interested in pursuing - this direction. (More on this later.)

Taskcluster: The Most Powerful CI Platform You've Never Heard Of

I wanted to just mention Taskcluster in passing as a counterexample to the CI offerings that GitHub, GitLab, and others are pursuing. But I found myself heaping praises towards it, so you get a full section on Taskcluster. This content isn't critical to the overall post, so feel free to skip. But if you want to know what a CI platform built for engineers looks like or you are a developer of CI platforms and would like to read about some worthwhile ideas to steal, keep reading.

Mozilla's Taskcluster is a generic CI platform originally built for Firefox. At the time it was conceived and initially built out in 2014-2015, there was nothing else quite like it. And I'm still not aware of anything that can match its raw capabilities. There might be something proprietary behind corporate walls. But nothing close to it in the open source domain. And even the proprietary CI platforms I'm aware of often fall short of Taskcluster's feature list.

To my knowledge, Taskcluster is the only publicly available, mega project scale, true CI platform in existence.

Germane to this post, one thing I love about Taskcluster is its core primitives around defining execution units. The core execute primitive in Taskcluster is a task. Tasks are connected together to form a DAG. (This is not unlike how a build system works.)

A task is created by issuing an API request to a queue service. That API request essentially says schedule this unit of work.

Tasks are defined somewhat generically, essentially as units of arbitrary compute along with metadata, such as task dependencies, permissions/scopes that task has, etc. That unit of work has many of the primitives that are familiar to you if you use GitHub Actions, GitLab Pipelines, etc: a list of commands to execute, which Docker image to execute in, paths to files constituting artifacts, retry settings, etc.
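
As a rough sketch (using the third-party Taskcluster Python client; exact fields, worker pools, and required scopes vary by deployment and version), creating a task looks something like this:

```python
# Rough sketch of creating a Taskcluster task by calling the queue service,
# via the third-party `taskcluster` Python client. The root URL, worker pool
# names, docker image, and credentials are placeholders; exact fields and
# required scopes depend on the deployment you are talking to.
import taskcluster

queue = taskcluster.Queue({
    "rootUrl": "https://taskcluster.example.com",   # your Taskcluster deployment
    "credentials": {"clientId": "...", "accessToken": "..."},
})

task_id = taskcluster.slugId()
queue.createTask(task_id, {
    "provisionerId": "my-provisioner",              # which worker pool runs this
    "workerType": "linux-small",
    "created": taskcluster.fromNowJSON("0 seconds"),
    "deadline": taskcluster.fromNowJSON("2 hours"),
    "payload": {
        "image": "ubuntu:22.04",
        "command": ["/bin/bash", "-c", "git clone ... && make test"],
        "maxRunTime": 1800,
    },
    "metadata": {
        "name": "example task",
        "description": "scheduled directly against the queue API",
        "owner": "you@example.com",
        "source": "https://example.com",
    },
})
```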

Taskcluster has features far beyond what is offered by GitHub, GitLab, and others today.

For example, Taskcluster offers an IAM-like scopes feature that moderates access control. Scopes control what actions you can perform, what services you have access to, which runner features you can use (e.g. whether you can use ptrace), which secrets you have access to, and more. As a concrete example, Firefox's Taskcluster settings are such that the cryptographic keys/secrets used to sign Firefox builds are inaccessible to untrusted tasks (like the equivalent of tasks initiated by PRs - the Try Server in Mozilla speak). Taskcluster is the only CI platform I'm aware of that has sufficient protections in place to mitigate the fact that CI platforms are gaping remote code execution as a service risks that can and should keep your internal security and risk teams up at night. Taskcluster's security model makes GitHub Actions, GitLab Pipelines, and other commonly used CI services look like data exfiltration and software supply chain vulnerability factories by comparison.

Taskcluster does support adding a YAML file to your repository to define tasks. However, because there's a generic scheduling API, you don't need to use it and you aren't constrained by its features. You could roll your own configuration/frontend for defining tasks: Taskcluster doesn't care because it is a true platform. In fact, Firefox mostly eschews this Taskcluster YAML, instead building out its own functionality for defining tasks. There's a pile of code checked into the Firefox repository that, when run, derives the thousands of discrete tasks constituting Firefox's build and release DAG and registers the appropriate sub-graph as Taskcluster tasks. (This also happens to be a pile of YAML. But the programming primitives and control flow are largely absent from those YAML files, making them a bit cleaner than the YAML DSLs that GitHub Actions and GitLab CI have evolved into.) This functionality is its own mini build system where the Taskcluster platform is the execution/evaluation mechanism.

Taskcluster's model and capabilities are vastly beyond anything in GitHub Actions or GitLab Pipelines today. There are a lot of great ideas worth copying.

Unfortunately, Taskcluster is very much a power user CI offering. There's no centralized instance that anyone can use (unlike GitHub or GitLab). The learning curve is quite steep. All that power comes at a cost of complexity. I can't in good faith recommend Taskcluster to casual users. But if you want to host your own CI platform, other CI offerings don't quite cut it for you, and you can afford a few people to support your CI platform on an ongoing basis (i.e. your total cost to operate CI including people and machines is >$1M annually), then Taskcluster is worth considering.

Let's get back to the post at hand.

Looking to the Future

In my ideal world there exists a single remote code execution as a service platform purpose-built for servicing both near-real-time and batch/delayed execution. It is probably tailored towards supporting software development, as those domain-specific features set it apart from generic compute as a service tools like Kubernetes, Lambda, and others. But something more generic could potentially work.

The concept of a DAG is strongly baked into the execution model so you can define execution units as a graph, capturing dependencies. Sure, you could define isolated, ad-hoc units of work. But if you wanted to define a set of units, you could do that without having to run a persistent agent to coordinate execution through completion like build systems typically do. (Think of this as uploading your DAG to an execution service.)

In my ideal world, there is a single DAG dictating all build, testing, and release tasks. There is no DAG fragmentation at the build, CI, and other batch execute boundaries. No N+1 system or configuration to manage and no additional platform to maintain, because everything is unified. Economies of scale apply and overall efficiency improves through consolidation.

The platform consists of pools of workers running agents capable of performing work. There are probably pools for near real time / synchronous RPC style invocations and pools for scheduled / delayed / asynchronous execution. You can define your own worker pools and bring your own workers. Advanced customers will likely throw autoscaling groups consisting of highly ephemeral workers (such as EC2 spot instances) at these pools, scaling capacity to meet demand relatively cheaply, terminating workers and machines when capacity is no longer needed to save on billing costs (this is what Firefox's Taskcluster instance has been doing for at least 6 years).

To end-users, a local build consists of driving or scheduling the subset of the complete task graph necessary to produce the build artifacts you need. A CI build/test consists of the subset of the task graph necessary to achieve that (it is probably a superset of the local build graph). Same for releasing.

As for the configuration frontend and how execution units are defined, this platform only needs to provide a single thing: an API that can be used to schedule/execute work. However, for this product offering to be user-friendly, it should offer something like the YAML configuration files CI systems have today. That's fine: many (most?) users will stick to using the simplified YAML interface, just as long as power users have an escape hatch and can use the low-level schedule/execute API to write their own driver. People will write plug-ins for their build systems enabling them to integrate with this platform. Someone will coerce existing extensible build systems like Bazel, Buck, and Gradle to convert nodes in the build graph into compute tasks on this platform. This unlocks the unification of the build and CI systems (and maybe things like data pipelines too).

Finally, because we're talking about a specialized system tailored for software development, we need robust result/reporting APIs and interfaces. What good is all this fancy distributed remote compute if nobody can see what it is doing? This is probably the most specialized service of the bunch, as how you track results is exceptionally domain specific. Power users may want to build their own result tracking service, so keep that in mind. But the platform should provide a generic one (like what GitHub Actions and GitLab Pipelines do today) because it is a massive value add and few will use your product without such a feature.

To be clear, my proposed unified world will not alleviate the CI complexity concerns raised above: sufficiently large build/CI systems will always have an intrinsic complexity to them and will possibly require specialists to maintain. However, because a complex CI system is almost always attached to a complex build system, consolidating build and CI systems reduces the surface area of complexity (you don't have to worry about build/CI interop as much). Lower fragmentation reduces overall complexity and is therefore a net win. (A similar line of thinking applies to justifying monorepositories.)

All of the components for my vision exist in some working form today. Bazel, Gradle Enterprise, and other modern build systems have RPCs for remote execution and/or caching. They are even extensible and you can write your own plugins to change core functionality for how the build system runs (to varying degrees of course). CI offerings like Taskcluster and GitLab Pipelines support scheduling DAGs of tasks (with Taskcluster's support far more suited for the desired end state). There are batch job execution frameworks like Airflow that look an awful lot like domain-specific, specialized versions of Taskcluster. What we don't have is a single product or service with all these features bundled as a cohesive offering.

I'm convinced that building what I'd like to see is not a question of if it can be done but whether we should and who will do it.

And this is where we probably run into problems. I hate to say it, but I'm skeptical this will exist as a widely available service outside a few corporations' walls any time soon. The reason is the total addressable market.

The value of my vision is through unification of discrete systems (build, CI, and maybe some one-offs like data pipelines) that are themselves complex enough that unification is something you'd want to do for business/efficiency reasons. After all, if it isn't complex/inefficient, you probably don't care about making it simpler/faster. Right here we are probably filtering out >90% of the market because their systems just aren't complex enough for this to matter.

This vision requires adoption of a sufficiently advanced build system so it can serve as the brains behind a unified DAG driving remote execute. Some companies and projects will adopt compatible, advanced build systems like Bazel because they have the resources, technical know-how, and efficiency incentives to pull it off. But many won't. The benefit of a more advanced build system over something simpler is often marginal. Factor in that many companies perceive build and CI support as product development overhead and a virtual cost center whose line item needs to be minimized. If you can get by on a less advanced build system that is good enough for a fraction of the cost without excessive hardship, that's the path many companies and projects will follow. Again, people and companies generally don't care about wrangling build and CI systems: they just want to ship.

The total addressable market for this idea seems too small for me to see any major player with the technical know-how to implement and offer such a service in the next few years. After all, we're not even over the hurdle that what I propose (unifying build and CI systems) is a good idea. Having worked in this space for a decade, witnessed the potential of Taskcluster's model, and seen former, present, and potential employers all struggling in this space to varying degrees, I know that this idea would be extremely valuable to some. (For some companies multiple millions of dollars could be saved annually by eliminating redundant human capital maintaining similar systems, reducing machine idle/run costs, and improving turnaround times of critical development loops.) As important as this would be to some companies, my intuition is they represent such a small sliver of the total addressable market that this slice of pie is too small for an existing CI operator like GitHub or GitLab to care about at this time. There are far more lucrative opportunities. (Such as security scanning, as laws/regulation/litigation are finally catching up to the software industry and forcing companies to take security and privacy more seriously, which translates to spending money on security services. This is why GitHub and GitLab have been stumbling over each other to announce new security features over the past 1-2 years.)

I don't think a startup in this area would be a good idea: customer acquisition is too hard. And because much of the core tech already exists in existing tools, there's not much of a moat in the way of proprietary IP to keep copycats with deep pockets at bay. Your best exit here is likely an early acquisition by a Microsoft/GitHub, GitLab, or wannabe player in this space like Amazon/AWS.

Rather, I think our best hope for seeing this vision realized is that an operator of an existing major CI platform (either private or public) who also has major build system or other ad-hoc batch execute challenges will implement it and release it upon the world, either as open source or as a service offering. GitHub, GitLab, and other code hosting providers are the ideal candidates since their community effect could help drive industry adoption. But I'd happily accept pretty much any high quality offering from a reputable company!

I'm not sure when, but my money is on GitHub/Microsoft executing on this vision first. They have a stronger incentive in the form of broader market/product tie-ins (think integrated build and CI in Visual Studio or GitHub Workspaces [for Enterprises]). Furthermore, they'll feel the call from within. Microsoft has some really massive build systems and CI challenges (notably Windows). It is clear that elements of Microsoft are conducting development on GitHub, in the open even (at this point Satya Nadella's Microsoft has frozen over so many levels of hell that Dante's classics need new revisions). Microsoft engineers will feel the pain and limitations of discrete build and CI systems. Eventually there will be calls for at least a build system remote execute service/offering on GitHub. (This would naturally fall under GitHub's existing apparent market strategy of capturing more and more of the software development lifecycle.) My hope is GitHub (or whomever) will implement this as a unified platform/service/product rather than discrete services because as I've argued they are practically the same problem. But a unified offering isn't the path of least resistance, so who knows what will happen.

Conclusion

If I could snap my fingers and move the industry's discrete build, CI, and maybe batch execute (e.g. data pipelines) systems ahead 10 years, I would:

  1. Take Mozilla's Taskcluster and its best-in-class specialized remote execute as a service platform.
  2. Add support for a real-time, synchronous execute API (like Bazel's remote execute API) to supplement the existing batch/asynchronous functionality.
  3. Define Starlark dialects so you can define CI/release-like primitives in build tools like Bazel (see the sketch after this list). (You could also do YAML here. But if your configuration files devolve into a DSL, just use a real programming language already.)
  4. Teach build tools like Bazel to work better with units of work that can take minutes or even hours to run (a synchronous/online driver model, as classically employed by build systems, isn't appropriate for long-running test, release, or data pipeline tasks).
  5. Throw a polished web UI for platform interaction, result reporting, etc on top.
  6. Release it to the world.
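
To illustrate item 3, here is an invented sketch of what a Starlark-flavored dialect for declaring CI/release primitives alongside build targets might look like. None of these rule names exist in Bazel today; names and fields are illustrative only:

```python
# Invented sketch for item 3 above: a Starlark-flavored way to declare CI and
# release primitives next to build targets. None of these rules exist in
# Bazel; this is illustration only, not a real API.

def ci_task(name, command, deps=(), artifacts=(), max_run_time=1800):
    """Declare a unit of remote work as a node in the same DAG as build targets."""
    return {
        "name": name,
        "command": command,
        "deps": list(deps),
        "artifacts": list(artifacts),
        "max_run_time": max_run_time,
    }

# A release pipeline expressed in the same graph language as the build itself.
GRAPH = [
    ci_task("test", "bazel test //...", deps=["//app:app"]),
    ci_task("package", "bazel build //app:release_tarball", deps=["test"],
            artifacts=["bazel-bin/app/release.tar.gz"]),
    ci_task("publish", "upload-release bazel-bin/app/release.tar.gz",
            deps=["package"]),
]
```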

Will this dream become a reality any time soon? Probably not. But I can dream. And maybe I'll have convinced a reader to pursue it.


Surprisingly Slow

April 06, 2021 at 07:00 AM | categories: Programming

I have an affinity for performance optimization and making software as efficient as possible. Over the years, I've encountered specific instances and common patterns that make software or computers slow. In this post, I'll shine a spotlight on some of them.

I'm titling this post Surprisingly Slow because the slowness was either surprising to me or the sub-optimal practices leading to slowness are prevalent enough that I think many programmers would be surprised by their existence.

The sections below are largely independent. So feel free to cherry pick the ones that interest you.

Environment Detection in Build Systems (e.g. configure and cmake)

This is the topic that inspired this post.

Build systems often feature an environment detection / configuration phase before the build phase. In UNIX land, autoconf-generated configure scripts are prevalent. CMake is also popular. These tools run a bunch of code to probe the state of the current system so that the build configuration is appropriate for the current build environment. For example, they'll probe for which compiler to use, its version, and what bugs and capabilities it has.

This environment detection and configuration is a necessary evil because machines and environments often vary substantially and you need to account for those variances.

The problem is that this configuration step often takes longer to run than the build itself! Build systems for small programs or libraries will often spend 10+ seconds running configure and complete the actual compilation and linking in a fraction of that time. In other words, the setup to perform the build takes longer than the build itself!

Depending on how many CPU cores you have, the discrepancy may not be obvious. But I have a 16 core / 32 thread Ryzen 5950X as my primary PC and the relative slowness of the configuration step is painful to observe.

What I find even more shocking is that configuration time often still eclipses actual build time even for large projects. I'm not sure if this is still true, but a few years ago Mozilla observed that building LLVM/Clang on a 96 vCPU EC2 instance resulted in more time spent in cmake/configuring than compiling and linking! And that's a very large C++ project with thousands of source files being compiled!

Build configuration is often a discrete step that executes serially before what most people consider the actual build. To improve efficiency, build configuration needs to be parallelized. Even better, it should be integrated into the main build DAG itself so parts of the build can start running without having to wait for all build configuration. Unfortunately, many common tools performing build configuration can't easily be adapted to this model. So there's not much many of us can do.

Another solution to this problem is avoiding the problem of environment detection in the first place. If you have deterministic and reproducible build environments, you can take a lot of shortcuts to skip environment detection that just isn't needed any more. This is more or less the approach of modern build tools like Bazel. I do wonder how much of the speed gains from tools like Bazel are due to eliminating environment configuration. I suspect it is a lot!

New Process Overhead on Windows

New processes on Windows can't be spawned as quickly as they can on POSIX-based operating systems like Linux. On Windows, assume a new process will take 10-30ms to spawn. On Linux, new processes (often spawned via fork() + exec()) will take single-digit milliseconds, if that.

However, thread creation on Windows is very fast (~dozens of microseconds).

These Stack Overflow posts have some more details.

A few dozen milliseconds is an eternity in CPU time. And it is long enough that it eats into a large percentage of the time budget for people to perceive something as instantaneous. So this may contribute to the perception that Windows is slower than Linux.

If your program architecture consists of spawning new processes left and right (this is common in UNIX land), this can pose performance problems on Windows, as the overhead of new process creation on Windows can really add up:

  • 10ms * 1,000 invocations = 10s
  • 20ms * 10,000 invocations = 200s
  • 30ms * 100,000 invocations = 3,000s

Using the example of configure above, configure files are often shell scripts. And shell scripts often do a lot of their work by spawning other processes like grep, sed, and sort. Even the [ operator could be a new process (seriously: there's probably a /usr/bin/[ executable in your POSIX environment, although [ might be a shell built-in). Command pipe chains (e.g. command | grep | awk) spawn a new process for each stage and can be visually slow to run. Anyway, it is not uncommon for a configure script to spawn thousands of new processes. Assuming 10ms per process, at 1,000 invocations that is 10s of overhead just spawning new processes! This further exacerbates the problem in the previous section!
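
If you want a feel for the per-process overhead on your own machine, a crude measurement like the following (the command choice and iteration count are arbitrary) is enough to see the Windows/Linux difference:

```python
# Measure bare process-creation overhead by spawning a trivial native command
# many times. The command choice is platform dependent; the count is arbitrary.
import subprocess
import sys
import time

cmd = ["cmd", "/c", "exit 0"] if sys.platform == "win32" else ["true"]

N = 200
start = time.perf_counter()
for _ in range(N):
    subprocess.run(cmd, check=True)
elapsed = time.perf_counter() - start
print(f"{N} spawns in {elapsed:.2f}s ({1000 * elapsed / N:.2f} ms per process)")
```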

If your software runs on Windows, consider the impact that relatively slow process spawning will have. Consider a multi-threaded architecture or using longer-lived daemon/background processes instead.

Closing File Handles on Windows

Many years ago I was profiling Mercurial to help improve the working directory checkout speed on Windows, as users were observing that checkout times on Windows were much slower than on Linux, even on the same machine.

I thought I could chalk this up to NTFS versus Linux filesystems or general kernel/OS level efficiency differences. What I actually learned was much more surprising.

When I started profiling Mercurial on Windows, I observed that most I/O APIs were completing in a few dozen microseconds, maybe a single millisecond or two every now and then. Windows/NTFS performance seemed great!

Except for CloseHandle(). These calls were often taking 1-10+ milliseconds to complete. It seemed odd to me that file writes - even sustained file writes that were sufficient to blow past any write buffering capacity - were fast but closes were slow. It was even more perplexing that CloseHandle() was slow even if you were using completion ports (i.e. async I/O). This behavior for completion ports was counter to what the MSDN documentation said should happen (the function should return immediately and its status can be retrieved later).

While I didn't realize it at the time, the cause for this was/is Windows Defender. Windows Defender (and other anti-virus / scanning software) typically works on Windows by installing what's called a filesystem filter driver. This is a kernel driver that essentially hooks itself into the kernel and receives callbacks on I/O and filesystem events. It turns out the close file callback triggers scanning of written data. And this scanning appears to occur synchronously, blocking CloseHandle() from returning. This adds milliseconds of overhead. The net effect is that the performance of file mutation I/O on Windows is drastically reduced by Windows Defender and other A/V scanners.

As far as I can tell, as long as Windows Defender (and presumably other A/V scanners) are running, there's no way to make the Windows I/O APIs consistently fast. You can disable A/V scanning (at your own peril). But the trick that Mercurial employs (which has since been emulated by rustup, among other tools) is to use a thread pool for calling CloseHandle(). Even if you perform all file open and write I/O on a single thread and use a background thread pool only for calling CloseHandle(), you can see a >3x speedup in the time it takes to write files. This optimization should ideally be employed by any software that creates or mutates as few as a few hundred files on Windows. That includes version control tools, installers, and archive extraction tools. Fun fact: rustup can extract tar files on Windows faster than open source and commercial fast extraction/copy tools because it employs this trick and more. I believe rustup on Windows is actually faster at extracting tar archives than it is on Linux!
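
Here's a simplified sketch of the trick in Python: opens and writes happen on the main thread, while the potentially slow closes are handed to a small thread pool so they overlap. Real implementations (Mercurial's included) are more careful about error handling and ordering; the file count and sizes here are arbitrary:

```python
# Simplified sketch of the "close file handles on a background thread pool"
# trick described above. Opens and writes happen on the main thread; the
# potentially slow close (CloseHandle() on Windows) is offloaded so many
# closes overlap. Real implementations care more about error handling.
import concurrent.futures
import os
import tempfile

def write_files(paths_and_data):
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        pending = []
        for path, data in paths_and_data:
            f = open(path, "wb")
            f.write(data)
            # Closing is what triggers the synchronous scan on Windows, so
            # offload it to the pool instead of blocking this thread.
            pending.append(pool.submit(f.close))
        # Surface errors (some, like ENOSPC, are only reported at close time).
        for fut in pending:
            fut.result()

with tempfile.TemporaryDirectory() as d:
    write_files((os.path.join(d, f"out-{i}.bin"), os.urandom(4096)) for i in range(500))
```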

The artificial I/O latency added by scanning software such as Windows Defender is super annoying. But the performance gains from working around it with a background thread pool are often worth the complexity. I have no doubt that if this optimization were baked into popular Windows tools (namely installers), people would be shocked by how much faster things could be.

Writing to Terminals

As a maintainer of Firefox's build system, I fielded a handful of reports from people complaining about builds being slower than their peers on identical hardware. While there are many causes for this, one of the most surprising was the impact the terminal has on build performance.

Writing to the terminal is usually fast. Until it isn't.

What I learned is that writing tons of output or getting clever with writing to the terminal (e.g. writing colors, moving the cursor position to write over existing content) can drastically slow down applications.

Writing to the terminal via stderr/stdout is likely performed via blocking I/O. So if the thing handling your write() (the terminal emulator) doesn't finish its handling promptly, your process just sits around waiting on the terminal to do its thing.

We discovered that different terminals have their own quirks. Historically, the Windows Command Prompt and the built-in Terminal.app on macOS were very slow at handling tons of output. I remember (but can't find the bug or commit to Firefox) when we made the build system quiet by default and that reduced build times by minutes in some configurations.

A few years ago, npm infamously had a performance sucking progress spinner. While I'm not sure how much of this was terminal slowness versus calling progress update code too frequently, the terminal likely played a part because terminals do have a limit to how often they can accept input to draw.

I've found that modern terminals are better about writing a ton of plain text than they were in ~2012, when I was tackling these problems in Firefox's build system. But I would still exercise extreme caution when doing fancy things with the terminal, like coloring text, drawing footers, etc. Always use buffered I/O to minimize the number of write()s actually going to the terminal, flushing as needed (hopefully sparingly). Consider using an async thread for writing to stdout/stderr. Record the total time spent in blocking I/O to stdout/stderr so you can measure terminal I/O latency. And periodically compare the wall time delta between running your program with stdout/stderr connected to a terminal and connected to /dev/null to see if there is a discrepancy worth caring about. Finally, consider throttling writes to the terminal. Instead of writing a footer after every line of output, consider buffering lines for a few milliseconds and emitting all lines plus the new footer in batches. If drawing a progress bar or spinner or something of that nature, I would limit drawing to ~10 Hz to minimize terminal overhead.
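
Here's a small sketch of the buffering/throttling approach, assuming a job that would otherwise emit one write() per line of output:

```python
# Sketch of the buffering/throttling advice above: batch progress output and
# only flush to the terminal ~10 times per second instead of once per line.
import sys
import time

FLUSH_INTERVAL = 0.1  # ~10 Hz; terminals rarely benefit from faster updates

def run_noisy_job(lines):
    buffered = []
    last_flush = time.monotonic()
    for line in lines:
        buffered.append(line)
        now = time.monotonic()
        if now - last_flush >= FLUSH_INTERVAL:
            # One write() for the whole batch, not one per line.
            sys.stdout.write("\n".join(buffered) + "\n")
            sys.stdout.flush()
            buffered.clear()
            last_flush = now
    if buffered:
        sys.stdout.write("\n".join(buffered) + "\n")
        sys.stdout.flush()

run_noisy_job(f"compiling object {i}" for i in range(10000))
```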

Thermal Throttling / ACPI C/P-States / Processor Throttling Behavior

We like to think that a computer and its processors are either on or off. If only things were that simple.

Processors are constantly changing their operating envelope as they are running. The following statements are all true (although not every item applies to all machines or CPU models):

  • The MHz each CPU core is running at can fluctuate wildly from 1 second to the next.
  • CPU cores may go to sleep or enter a very low power mode, even if others are running.
  • Cores may underclock significantly if their temperature goes beyond a threshold. They may refuse to run faster until the temperature drops. Faulty sensors can lead to premature throttling.
  • Cores may only reach their maximum frequency if other cores are also running. The physical proximity of that other core may matter.
  • It could take dozens, hundreds, or even thousands of milliseconds for an idling core to ramp up to its full speed.
  • The behavior of power scaling can vary substantially depending on whether a machine is connected to external power or running off the battery.
  • The behavior of power scaling can vary substantially depending on whether the battery is fully charged or nearly empty.
  • Apple laptops may exhibit thermal throttling when charging from the left side. (Yes, seriously: always charge your MacBook Pro from the right. And if your employees use Apple laptops for CPU heavy tasks, consider an awareness campaign to encourage charging from the right side. Even better, deploy software that checks for left side charging and alert accordingly. Although I have yet to find any software or API to detect this.)
  • A core may slow down in order to process certain instructions (like AVX-512).

Modern CPUs are really dynamic beasts and their operating behavior is often seemingly unpredictable. Furthermore, CPU models can vary from one to the next. For example, an EPYC or Xeon processor will likely behave differently from a Ryzen or Core i7/i9 which will behave differently depending on whether you are running in a desktop or laptop. (I observed a few years ago that Xeon cores won't turbo as easily as consumer grade CPUs.)

Power fluctuations and their impact on performance are one of the reasons why it is extremely difficult to conduct proper benchmarks. When benchmarking, you need to control the power variable or at least report its state so results are qualified appropriately. I am very skeptical of benchmark results that don't report the power configuration/methodology (this is most of them, sadly) and especially of benchmarks conducted on laptops, as battery operated devices are much more susceptible to power throttling than desktops or servers.

I have personally had a MacBook Pro become thermally throttled because an internal screw came loose and blocked a fan from spinning. macOS didn't warn me: all I knew was that my Firefox builds became 2-3x slower for no apparent reason! I have also observed my MacBook Pro becoming hot due to left side charging. Charging from the right magically made things faster.

At Mozilla, when we started rolling out Xeon desktops to employees, we had reports of wildly varying build speeds. On some operating systems (Mozilla had very lax central machine provisioning and allowed people full domain over their company-issued hardware), the default ACPI C/P-State settings were such that CPU cores scaled their frequency differently.

What we observed was the compile phase of the build was fine. But some people were reporting linking times 2-4x longer (dozens of seconds to minutes) than others on equivalent configurations! This was a big deal because the wall time of an incremental/non-full build is dominated by linking time. We eventually discovered that on the slow machines, the CPU core doing the linking was only running at 25-50% of its potential. Think 1.0-1.5 GHz. But if you started additional CPU heavy tasks, that core ramped up. We discovered that different operating systems had different defaults for the ACPI C/P-States. The more conservative settings would result in CPU cores not scaling their frequency unless there was sufficient CPU load to merit it. Changing to more aggressive power settings ensured better and consistent results.

Laptops are highly susceptible to thermal throttling and aggressive power throttling to conserve battery. I hold the general opinion that laptops are just too variable to have reliable performance. Given the choice, I want CPU heavy workloads running in controlled and observed desktops or server environments.

But servers aren't immune: their ACPI C-State and P-State settings can drastically impact performance. Dialing these up to max so all the cores run at full (or are ready to run at full in a few milliseconds) is possible. However, this may greatly increase your power consumption. You can do this on some cloud providers (like AWS) for no additional direct cost to you. However, higher energy consumption is bad for the environment. Data centers already have a carbon footprint about the size of the airline industry (during non-pandemic times) and that footprint is growing. So think about your ethical responsibilities to the environment before having your server fleet consume potentially megawatts more power.

Python, Node.js, Ruby, and other Interpreter Startup Overhead

Complex systems will often execute Python, Node.js, and other interpreters thousands or more times during their execution. For example, the Firefox build system invokes thousands of Python processes performing common tasks, such as wrapping the compiler invocation. And the Mercurial test harness invokes thousands of Python processes by running hg as part of its testing. I've heard of similar stories involving Node.js, Ruby, and other interpreters, often in the context of use in build systems.

An oft-ignored fact about launching a new interpreter process is that each invocation often takes single-digit to dozens of milliseconds just to initialize the interpreter, i.e. the new process spends time at the beginning of execution just getting to the code you are telling it to run. Sometimes the new process overhead is so bad that the slowdown is obvious and rules out the use of a technology. The JVM historically has been notorious for this, which is why use of Java typically entails fewer, longer-running processes rather than many short-lived, domain-limited processes.

I've written about Python's startup overhead before. In 2014 I measured that Mercurial's test harness spends 10-18% of its total CPU time just getting to the point where the interpreter/process can run custom bytecode and 30-38% of its total CPU time getting to the point where Mercurial performs command dispatch (additional time here is mostly module importing overhead).
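
A crude first approximation of interpreter startup overhead on your own machine is to time a do-nothing invocation (the iteration count is arbitrary; -S skips the site module to show how much of the cost is import machinery):

```python
# First approximation of interpreter startup overhead: time a do-nothing
# invocation of the current Python, with and without the site module.
# Numbers vary wildly by machine and interpreter version.
import subprocess
import sys
import time

def time_invocations(args, n=50):
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run(args, check=True)
    return (time.perf_counter() - start) / n

base = time_invocations([sys.executable, "-c", "pass"])
no_site = time_invocations([sys.executable, "-S", "-c", "pass"])  # skip site.py
print(f"startup: {base * 1000:.1f} ms; with -S: {no_site * 1000:.1f} ms")
```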

You may think that a few milliseconds of overhead can't matter that much. But if you multiply by 1,000, 10,000, 100,000 or more, milliseconds matter:

  • 1ms * 1,000 invocations = 1s
  • 10ms * 10,000 invocations = 100s
  • 100ms * 100,000 invocations = 10,000s (2.77 hours)

On Windows, this problem is compounded by its relatively slow new process startup (see the section above).

Programmers need to think long and hard about their process invocation model. Consider using fewer processes and/or consider alternative programming languages that don't have significant startup overhead if this could become a problem (anything that compiles down to assembly is usually fine).

Pretty Much All Storage I/O

Within my general affinity for performance optimization, I have a special affinity for I/O optimization. I think the main reason is that the disconnect between the potential of modern storage devices and what is actually achieved is so wide. On paper, software should be getting ~10x the performance from modern storage devices than what we typically see.

Modern storage devices are absurdly fast. The NVMe storage in my primary PC can sustain reads at >3 GB/s (>6 GB/s sequential), writes at ~1 GB/s (4+ GB/s sequential), can perform >500,000 I/O operations per second, and can service many I/O operations in the ~10 microsecond latency range. Modern NVMe storage is roughly on par with the performance of DDR2 DRAM (launched in 2003) in terms of throughput (latency still trails but ~10us is nothing to scoff at).

For comparison, the 1 TB Western Digital Caviar Black spinning disk I retired from my PC a few weeks ago can only do ~90 MB/s sequential reads and writes, 1-2 MB/s random reads and writes, and ~12 ms access times. I'm unsure what its IOPS figure is, but considering the ~12 ms access times and the physical nature of spinning disks, it can't be more than a few hundred.

Modern NVMe storage is 1.5-3 orders of magnitude faster than the best spinning disks from a little over a decade ago. So why isn't all storage I/O ~instantaneous?

The short answer is that most software fails to utilize the potential of modern storage devices or even worse actively undermines it through bad practices.

For the former, I'll refer you to the excellent Modern Storage is Plenty Fast. It is the APIs That are Bad. tl;dr you can harness the full power of your modern storage device if you bypass the standard OS/kernel I/O primitives and issue I/O operations directly against the device. So, software abstractions in the OS/kernel are eating a lot of potential.

For the software undermining storage device potential aspect, I'll briefly touch on the fsync() POSIX function. By calling this function, you effectively say be sure the state of this file descriptor is persisted to the storage device or I don't want to lose any changes I've made.

Data consistency and durability are important. But the cost to achieving them can be absurdly high. And as it turns out, it is also subtly difficult to do correctly in practice. I'll refer you to Dan Luu's excellent Files are Hard. The papers linked offer a sobering assessment. I'll reinforce the message with PostgreSQL's fsync() surprise, which chronicles how PostgreSQL maintainers learned about how Linux can flat out drop errors when performing device I/O, leading to data corruption. Yikes!

Anyway, about fsync(). The concept of fsync() is sound: ensure this thing is persisted to the storage device. But the implementation is often a pile of inefficiency leading to slowness.

On many Linux filesystems (including ext4), the implementation of fsync() is such that upon calls, all unflushed writes are persisted to storage. So if process A writes out a 1 GB file and process B writes 1 byte to another file and calls fsync() on that single byte write, Linux/ext4 will need to write 1 GB to the storage device, not 1 byte. So on Linux/ext4, all it takes is a random process somewhere to issue fsync() and all dirty page cache entries need to be flushed. On most systems, there's usually something continuously incurring write I/O, so the amount of storage device I/O incurred by fsync() is almost always larger than just the mutated file/directory you actually want persisted.

This behavior can cause a ton of problems. For starters, it artificially increases I/O latency. You'd think that calling fsync() after a minimal change would be ~instantaneous. But if there are lots of dirty pages to be flushed, it could take seconds. At my current employer, we ran into this exact problem with GitHub Enterprise, which has a monolithic architecture. A MySQL database was running off the same ext4 filesystem as the Git repositories. MySQL will call fsync() frequently to ensure transactions and the transaction journal are persisted to storage. But if a Git GC were running and Git just finished writing a multi-gigabyte packfile, MySQL's fsync() would be stuck waiting on Git's large write to finish persisting. This led to slowness of future MySQL transactions and even some application-level timeouts. When people say databases and other stores should be isolated to their own volumes/filesystems, fsync()'s wonky behavior is a big reason why.
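
If you want to observe this effect yourself on Linux/ext4 (results depend heavily on kernel version, mount options, and the storage device), an experiment along these lines will do it:

```python
# Experiment for the behavior described above (Linux/ext4; results depend on
# kernel version, mount options, and the device): dirty a lot of page cache
# with a large un-synced write, then time fsync() of a tiny write to a
# different file on the same filesystem.
import os
import time

# 1. Dirty ~1 GB of page cache without syncing it.
buf = b"\0" * (1024 * 1024)
with open("big.tmp", "wb") as big:
    for _ in range(1024):
        big.write(buf)

# 2. Write 1 byte to another file and fsync it.
fd = os.open("small.tmp", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"x")
start = time.perf_counter()
os.fsync(fd)  # on ext4 (without fast commits) this can wait on the 1 GB flush
print(f"fsync of 1 byte took {time.perf_counter() - start:.3f}s")
os.close(fd)

os.unlink("big.tmp")
os.unlink("small.tmp")
```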

Fortunately, newer versions of Linux/ext4 contain a fast commits feature that changes behavior and enables more granular flushing of fsync() to storage, just like it is documented to do. But as the feature is pretty new, it could take a while to stabilize and make its way to distros. I can't wait for it though!

Another problem with fsync() is that it is called more often than it needs to be. Now, if you have mission critical data and need consistency and durability, you should absolutely be calling fsync() appropriately. But the reality is that many data workloads and machine environments don't actually need strong data guarantees!

Take for example Kubernetes pods or CI runners. Or even servers for a stateless service. Ask yourself, what's the worst that could happen if the machine loses power and there is data loss on the local filesystem? In a lot of scenarios the answer is nothing. You've designed your system to be stateless and fault tolerant. You manage your servers as cattle. You treat local filesystems as ephemeral. So if a machine fails, you provision a new one to replace it. In these scenarios, fsync() buys you little to nothing but can cost you a lot!

The cost of avoidable fsync() calls can be substantial. Combined with the inefficient global flushing behavior of Linux/ext4, it can be a performance sapper, especially on slower storage devices. Fortunately, there are options. Many databases and other popular pieces of software have a way to prevent the issuance of fsync(). If your data is ephemeral, consider disabling fsync() for a likely significant performance boost! For software that doesn't support disabling fsync(), the aptly named eatmydata tool and LD_PRELOAD library can be used to nerf fsync() and similar functionality by intercepting the function calls and turning them into no-ops. Last but not least, for ephemeral machines, consider building a patched Linux kernel that turns fsync() and friends into no-ops. (I'm not sure of anyone who does this. But I've considered it because getting eatmydata to work in places like launched containers can be a bit of a pain.)

I'll close this section with a link to my favorite commit to the Firefox repository: Disable Places during reftests, preventing 50 GB of I/O. While this commit goes beyond disabling fsync(), fsync() (and its Windows equivalent) was responsible for some of the performance loss. Excessive I/O and needless persisting of changes to the device can really sap performance. Storage software usually errs on the side of consistency (this is the correct default in my opinion). Given the costs that consistency imposes, you should seriously consider nerfing the guarantees and speeding up I/O when that option is viable for you.

Data Compression

I could write an entire post on the topic of data compression and its widespread suboptimal use. Here is the concise version.

At its core, data compression is a trade-off between CPU and I/O usage. Typically it involves one of the following scenarios:

  1. I/O (either storage or network) is the bottleneck, so we want to trade more CPU to reduce the amount of I/O.
  2. At rest storage is expensive, so we want to trade more CPU for lower storage utilization/costs.

Since the early days of computing, a maxim has been that storage is slow and expensive compared to CPU. So trading CPU to reduce storage utilization seemed like a solid bet.

Fast forward to 2021.

As I wrote in the previous section, modern storage I/O is absurdly fast. It is also historically cheap.

Networks have also gotten faster. 1 Gbps (125 MB/s) is pretty universal at this point. 2.5 Gbps (312 MB/s) is getting deployed in consumer and office environments. 10 Gbps (1,250 MB/s) is common in data centers. And faster than 10 Gbps is possible.

Meanwhile CPUs have somewhat plateaued in their single core performance in the past decade. We've been stuck at ~4 GHz for years. All of the performance gains from CPUs have come from adding more CPU cores to the package and from instructions per cycle (IPC) efficiency wins (we've also gotten some agonizing security vulnerabilities like Spectre and Meltdown out of this IPC work).

What this all means is that the relative performance difference between CPUs and I/O has compressed significantly (pardon my pun). ~30 years ago, CPUs ran at ~100 MHz and the Internet was using dial-up at say 50 kbps, or 0.05 Mbps, or 6.25 kBps. That's 16,000 cycles per byte. Today, we're at ~4 GHz with say 1 Gbps / 125 MB/s networks. That's 32 cycles per byte, a decrease of 500x. (In fairness, the ratio closes when you consider that we likely have >1 CPU core competing for I/O and factor in IPC gains. But we're still talking about the relative difference between CPU and I/O decreasing by 1-1.5 orders of magnitude.) Years ago, trading CPU to lessen the I/O load was often obviously correct. Today, because of the advancements in I/O performance relative to CPU and a substantially reduced cycles-per-I/O-byte budget, the answer is a lot murkier.

Not helping is the prevalence of ancient compression algorithms. DEFLATE - the algorithm behind the ubiquitous zlib library and gzip data format - is ~30 years old. DEFLATE was designed in an era when computers had like 1 MB RAM and 100 MB hard drives. Different times.

DEFLATE/zlib became very popular in a world where I/O was much slower and compression was often a necessity. Not using compression on a dial-up modem resulted in massive performance differences! And because of its popularity in the early days of the Internet, DEFLATE/zlib is available in the standard library of many programming languages. It seems to be the first compression format people reach for when someone says/thinks add compression.

The ubiquity of zlib is good from a dependency perspective: everyone can read zlib/gzip. But for scenarios where you control the reader and writer, use of zlib in 2021 constitutes negligence because its performance lags contemporary solutions. Modern compression libraries (zstandard is my favorite) can yield substantially faster compression and decompression speeds while delivering better compression ratios on most data sets. My 2017 Better Compression with Zstandard post dives into the numbers. (I've been meaning to revisit that post because zstandard has seen multiple 10+% speedups in subsequent releases, making it even more compelling.) If you don't need the ubiquity of zlib (e.g. you control the writers and readers), there's little reason to use zlib over something more modern. Compared to zlib, modern compression libraries like zstandard are the closest thing to magical pixie dust that you can sprinkle on your software for free performance.

If you are using compression (especially zlib) for real-time compression (sending compressed data somewhere where it will be decompressed immediately), you need to measure the line speed of the compressor and decompressor. Then compare that to the uncompressed line speed. Are you bottlenecked by I/O in the uncompressed case? If not, do you need the bandwidth or I/O capacity being saved by compression? If not, why are you using compression at all? You just measured that all compression did was artificially slow down your software for no reason! Given that zlib compression will often fail to saturate a 1 Gbps link, there's a very real chance your use of compression introduces an artificial CPU bottleneck!
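
If it helps, here is a rough sketch of the kind of measurement I'm talking about, in Python, using the third-party zstandard package. The compression levels are arbitrary illustrative choices and sample.bin is a placeholder standing in for a representative sample of your own data:

    import time
    import zlib

    import zstandard  # third-party package, e.g. "pip install zstandard"

    def measure(label, fn, arg, uncompressed_len, iterations=5):
        # Best-of-N wall clock time, expressed as MB/s of *uncompressed* data,
        # so the numbers are directly comparable to your network/disk line speed.
        best = None
        for _ in range(iterations):
            start = time.perf_counter()
            fn(arg)
            elapsed = time.perf_counter() - start
            best = elapsed if best is None else min(best, elapsed)
        print(f"{label}: {uncompressed_len / best / 1_000_000:,.0f} MB/s")

    # Use a representative sample of the data you actually ship; compression
    # behavior varies wildly between data sets.
    data = open("sample.bin", "rb").read()

    zlib_frame = zlib.compress(data, 6)
    cctx = zstandard.ZstdCompressor(level=3)
    dctx = zstandard.ZstdDecompressor()
    # compress() embeds the content size in the frame header, so decompress()
    # below can size its output buffer without extra hints.
    zstd_frame = cctx.compress(data)

    print(f"zlib ratio: {len(data) / len(zlib_frame):.2f}")
    print(f"zstd ratio: {len(data) / len(zstd_frame):.2f}")

    measure("zlib -6 compress", lambda d: zlib.compress(d, 6), data, len(data))
    measure("zstd -3 compress", cctx.compress, data, len(data))
    measure("zlib decompress", zlib.decompress, zlib_frame, len(data))
    measure("zstd decompress", dctx.decompress, zstd_frame, len(data))

Compare the printed throughput numbers against your uncompressed line speed: if the compressor is the slowest link in the chain, it is the bottleneck.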

If you are using compression (especially zlib) for data archiving (storing compressed data somewhere where it will be decompressed eventually), you need to measure and compare compression ratios and line speeds of different compression formats and their settings. Like the real-time compression scenario, if decompression reduces your line speed from uncompressed, you are artificially slowing down access to your data. Maybe that's justified to save on storage costs. But in many cases, you can swap in a different compression library and get similar or better compression ratios while achieving better (de)compression speeds. Who wouldn't want free performance and storage cost reductions?

As an aside, one of the reasons I love zstandard is it can be tuned from something that is screaming fast (GB/s at compression and decompression ends) to something that is very slow on the compression side but yields terrific compression ratios, while still preserving GB/s decompression speeds. This enables you to use the same format for vastly different use cases. You can also dynamically change the storage characteristics of your data. For example, you can initially write data with a fast setting so you aren't CPU constrained on the writer. Then you can have some batch job come around and recompress your data with more aggressive settings, making it much smaller. It's not like zlib where the range of compression settings goes from kinda slow and not very good compression ratios to pretty slow and still not very good compression ratios.
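
Here's a minimal sketch of that write-fast-then-recompress workflow using the third-party zstandard package (the level choices are arbitrary and the function names are made up for illustration):

    import zstandard

    FAST_LEVEL = 1      # cheap on the write/ingestion path
    ARCHIVE_LEVEL = 19  # expensive; run by a background/batch job later

    def write_hot(payload: bytes) -> bytes:
        # Spend as little CPU as possible when data first arrives.
        return zstandard.ZstdCompressor(level=FAST_LEVEL).compress(payload)

    def recompress_cold(frame: bytes) -> bytes:
        # Later, trade CPU for storage: decompress and recompress with
        # aggressive settings. Decompression stays fast either way.
        original = zstandard.ZstdDecompressor().decompress(frame)
        return zstandard.ZstdCompressor(level=ARCHIVE_LEVEL).compress(original)

Both frames are read by the same decompressor, so readers never need to know or care which level produced the data.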

When you know to look for it, inefficiency due to unjustified use of compression or failure to leverage modern compression libraries is everywhere. Here are some common operations in my daily workflow that are bottlenecked by use of slow compression formats and could be made faster by using a different compression format:

  • Installing apt packages (packages are gzip compressed). (Fun fact: installing apt packages is also subject to fsync() slowness as described above because the package manager will issue an fsync() at least once for each package.)
  • Installing Homebrew packages (packages are gzip compressed).
  • Installing Python packages via pip (source archives are gzip tarballs and wheels are zip files, which use zlib compression).
  • Pushing/pulling Docker images (layers inside Docker images are gzip compressed).
  • Git (wire protocol data exchange and on-disk storage use zlib). (When I added zstandard support to Mercurial, it reduced the transfer size from servers to ~89% of original while using ~60% of the server-side CPU.)

In the corporate world, there are probably multiple petabyte-scale data warehouses, data lakes, data coliseums (I can't keep up with what we're calling them now) storing data in gzip. Dozens of terabytes could likely be shaved by moving to something like zstandard. If they are using LZMA (which has extremely slow decompression speeds), storage costs are low, but data access is extremely slow, making queries against that data slow. I haven't had the opportunity to measure it, but I suspect some of the reputation Hadoop and other Big Data systems have for being slow is because they are CPU constrained by suboptimal use of compression.

My experience is that many programmers don't understand the trade-offs and nuances of compression and/or lack knowledge of more modern, superior compression libraries. Instead, the collective opinion is "compression is good, use [zlib] compression." Like many things in software, the real world is complex and nuanced. The shifting relative power and cost of computer components has moved the pendulum towards compression adding more cost than it saves. And it hasn't helped that industry still widely uses a ~30 year old compression format (DEFLATE/zlib) that is far from ideal for modern computers. If you take the time to measure, I'm sure you'll find many cases where use of compression is either ill-advised or would benefit from a more modern compression library (like zstandard).

x86_64 Binaries in Linux Distribution Packages

Linux distributions often provide pre-built binaries to install via packaging tools (e.g. apt install or yum install).

To keep things simple and to ensure maximum compatibility, these pre-built binaries are built such that they run on as many computers as possible. Currently, many Linux distributions (including RHEL and Debian) maintain binary compatibility with the first x86_64 processor, the AMD K8, launched in 2003. Those processors featured instruction set extensions that were modern at the time, like MMX, 3DNow!, SSE, and SSE2.

What this means is that by default, binaries provided by many Linux distributions won't contain instructions from modern Instruction Set Architectures (ISAs). No SSE4. No AVX. No AVX2. And more. (Well, technically binaries can contain newer instructions. But they likely won't be in default code paths and there will likely be run-time dispatching code to opt into using them.)

Furthermore, C/C++ compilers (like Clang and GCC) also target an ancient x86_64 microarchitecture level by default (this is where the distributions' binary compatibility defaults come from). So if you compile your own code and don't specify settings like -march or -mtune to change the default targeting, your compiled binaries won't leverage SSE4, AVX, etc. You can still use these instructions in run-time dispatched code paths without -march/-mtune overrides. But you have to opt in and add additional code complexity to do that.

Because of the conservative microarchitecture targeting of compilers - and of distribution binaries by extension - nearly 20 years of ISA work and the efficiency gains from more powerful ISAs (like superlinear vectorized instructions) are left on the table. And here I am getting frustrated when my PRs linger unreviewed for more than a day. Imagine what it is like to be an AMD or Intel engineer and have your ISA work take a decade or two to be adopted at scale!

Truth be told, I'm unsure how much performance this ISA backwards compatibility sacrifices. It will vary heavily from workload to workload. But I have no doubt there are some very large data centers running CPU intensive workloads that could see massive efficiency gains by leveraging modern ISAs. If you are running thousands of servers and your CPU load isn't coming from a JIT'ed language like Java (JITs can emit instructions for the machine they are running on because they compile just in time), it might very well be worth compiling CPU heavy packages (and their dependencies, of course) from source targeting a modern microarchitecture level so you don't leave the benefits of modern ISAs on the table. And be forewarned: use of modern ISAs isn't a silver bullet! Some instructions can actually cause the CPU to downclock in order to run them, making code using those instructions fast but other code slow.

Maintaining binary compatibility with a vanishingly small number of ancient CPUs at the expense of performance on modern CPUs seems... questionable. Fortunately, Linux distributions and Clang/GCC are paying attention.

GCC 11 and Clang 12 define x86-64-{v2, v3, v4} architecture levels targeting roughly Nehalem (released 2008), Haswell (released 2013), and AVX-512 capable CPUs (~2015), respectively. So you can add e.g. -march=x86-64-v3 to target Haswell era and newer CPUs and have the compiler emit SSE4, AVX, AVX2, and other modern instructions.

RHEL 9 will be raising its minimum architecture requirement from baseline x86-64 to x86-64-v2, effectively requiring CPUs from 2008+ instead of 2003+.

If you'd like to learn more about this topic, start at this Phoronix article and follow the links to other articles and mailing list discussions.

It's worth noting that at the time I write this, AWS 4th generation EC2 instances (c4, m4, and r4) all support AVX2 and I believe are compatible with GCC/Clang's x86-64-v3 target. And 5th generation Intel instances have AVX-512, presumably making them compatible with x86-64-v4. So even if your distribution targets x86-64-v2, there is still potential free performance from newer ISAs on the table.
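
As a rough way to see where your machines stand, something like the following can approximate which microarchitecture level a Linux host satisfies. The flag groupings below are my approximation of the level definitions, so treat it as a heuristic and consult the psABI spec and your compiler documentation before relying on it:

    # Rough check (Linux only) of which x86-64 microarchitecture level the local
    # CPU satisfies. The flag groupings are approximate, not authoritative.
    LEVELS = {
        "x86-64-v2": {"cx16", "lahf_lm", "popcnt", "ssse3", "sse4_1", "sse4_2"},
        "x86-64-v3": {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe", "xsave"},
        "x86-64-v4": {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
    }

    def cpu_flags():
        # /proc/cpuinfo lists the feature flags the kernel detected for the CPU.
        with open("/proc/cpuinfo") as fh:
            for line in fh:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    def highest_level(flags):
        satisfied = "x86-64 (baseline)"
        for level, required in LEVELS.items():  # dicts preserve insertion order
            if required <= flags:  # subset test: all required flags present
                satisfied = level
            else:
                break
        return satisfied

    print(highest_level(cpu_flags()))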

If I were operating a server fleet consisting of thousands of machines, I would be very tempted to compile all packages from source targeting a modern microarchitecture level. This would be costly in terms of complexity. But for some workloads, the performance gains could be worth the effort. And this conservative targeting approach may provide justification for running modern-optimized Linux distributions or cloud vendor specific Linux distributions (e.g. Amazon Linux). I'm unsure if distributions like Amazon Linux take advantage of this. If not, they should look into it!

Read the next section for an example of where failure to leverage modern ISAs translates to a performance loss.

Many Implementations of Myers Diff and Other Line Based Diffing Algorithms

This one is rather domain specific, but I find it an illustrative example because the behavior is quite counter-intuitive!

Various classes of software need to take two text documents and emit a textual diff of their contents. Think what git diff displays.

There are various algorithms for generating a diff of text. Myers Diff is probably the most famous. The run-time of these algorithms is proportional to the number of input lines and is typically super-linear - think O(n log n) or O(n^2) behavior.

These text-based diffing algorithms often operate at the line level (rather than say the byte or codepoint level) because it drastically limits the search space and minimizes n to keep the algorithm run-time in check.

Over the years, various people have realized that when diffing two text documents, large parts of the inputs are often identical (why would you diff unrelated content after all). So most implementations of diff algorithms have a myriad of optimizations to limit the number of lines compared. Two common optimizations are to identify and exclude the common prefix and suffix of the input.

This is over-simplified, but text-based diffing algorithms often do the following:

  1. Split the input into lines.
  2. Hash each line to facilitate fast line equivalence testing (comparing a u32 or u64 checksum is a ton faster than memcmp() or strcmp()).
  3. Identify and exclude common prefix and suffix lines.
  4. Feed remaining lines into diffing algorithm.

The idea is that steps 1-3 - which should be O(n) - reduce work for an algorithm (step 4) with run-time complexity worse than O(n). Sounds good on paper.
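
To make the pipeline concrete, here is a toy Python sketch of those four steps. It is not how any particular implementation does it: it leans on difflib for step 4, and it assumes line hashes never collide, which a real implementation would have to guard against. Note that only the trimmed middle region gets hashed, on purpose (more on that below):

    import difflib

    def line_diff(a_text: bytes, b_text: bytes):
        # 1. Split the input into lines (staying in the domain of bytes).
        a_lines = a_text.splitlines(keepends=True)
        b_lines = b_text.splitlines(keepends=True)

        # 3. Identify and exclude the common prefix and suffix so the expensive
        #    algorithm only sees lines that might actually differ.
        prefix = 0
        while (prefix < len(a_lines) and prefix < len(b_lines)
               and a_lines[prefix] == b_lines[prefix]):
            prefix += 1
        suffix = 0
        while (suffix < len(a_lines) - prefix and suffix < len(b_lines) - prefix
               and a_lines[-1 - suffix] == b_lines[-1 - suffix]):
            suffix += 1
        a_mid = a_lines[prefix:len(a_lines) - suffix]
        b_mid = b_lines[prefix:len(b_lines) - suffix]

        # 2. Hash each remaining line so equivalence tests inside the diff loop
        #    are integer compares. (A real implementation would double-check
        #    equality on hash matches to guard against collisions.)
        a_hashes = [hash(line) for line in a_mid]
        b_hashes = [hash(line) for line in b_mid]

        # 4. Feed the (hopefully tiny) middle region into the diff algorithm.
        matcher = difflib.SequenceMatcher(a=a_hashes, b=b_hashes, autojunk=False)
        return prefix, suffix, matcher.get_opcodes()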

So what actually happens?

If you profile a number of these diff implementations, you find that steps 1-3 actually take more time than the supposedly slow/expensive algorithm! How can this be?!

One culprit is the line splitting. Even assuming we can use zero-copy references for storing the line contents (as opposed to allocating a new string to hold each parsed line, which can be much less efficient), splitting text into lines can be grossly inefficient!

There are various reasons for this. Maybe you are decoding the text into code points rather than operating in the domain of bytes (you shouldn't need to decode the entire input just to search for newlines). Maybe you are traversing the file one character/byte at a time looking for LF.

An efficient solution to this problem employs vectorized CPU instructions (like AVX/AVX2), which can scan several bytes at a time looking for a sentinel value or matching a byte mask. So instead of ~1 instruction per input byte, you execute ~1/n. Your C runtime library probably has assembly implementations of memchr(), strchr(), and similar functions and automatically chooses the newest/fastest instructions supported by the run-time CPU (glibc does).

In theory, compilers recognize such patterns and emit modern vectorized instructions automagically. In reality, because the default target ISA of compilers is relatively ancient compared to what your CPU is capable of (see previous section), you are stuck with old instructions and linear scanning. Your best bet is to stick with functions in the C runtime that are probably backed by assembly. (Although watch out for function call overhead.)
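
The same principle shows up even in high-level languages. As a rough illustration (Python rather than C, and the exact numbers will vary by machine), compare scanning for newlines one byte at a time against delegating to bytes.find(), which CPython implements with an optimized memchr-style scan as far as I know:

    import time

    def count_newlines_bytewise(data: bytes) -> int:
        # One Python-level loop iteration and comparison per input byte.
        count = 0
        for byte in data:
            if byte == 0x0A:
                count += 1
        return count

    def count_newlines_find(data: bytes) -> int:
        # Let the runtime's optimized search routine scan for the sentinel;
        # Python-level work is proportional to the number of lines, not bytes.
        count = 0
        pos = data.find(b"\n")
        while pos != -1:
            count += 1
            pos = data.find(b"\n", pos + 1)
        return count

    data = (b"x" * 80 + b"\n") * 100_000  # ~8 MB of 81-byte lines

    for fn in (count_newlines_bytewise, count_newlines_find):
        start = time.perf_counter()
        fn(data)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")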

Another culprit causing inefficiency is hashing each line. The hashing is performed to reduce equivalence testing to a u32/u64 compare rather than strcmp(). Many implementations don't seem to give much consideration to the hashing algorithm, using something like crc32 or djb2. An inefficiency here is that many older hashing algorithms operate at the byte level: you feed in 1 byte, update state (XOR is often employed), then feed in the next byte. This is inefficient because it throws away the instruction pipelining and superscalar properties of modern CPUs. A better approach is to use a hashing algorithm that digests 4, 8, or more bytes at a time. Again, this lowers the cost from multiple cycles per byte to a fraction of a cycle per byte.

Another common inefficiency is splitting and hashing the lines in the common prefix and suffix before they are excluded. Using memcmp() (or even better: hand-rolled assembly that gives you the offset of the first divergence) to find the prefix/suffix boundary is more efficient because, again, your C runtime library probably has assembly implementations of memcmp() which can compare input at near native memory speed.

I quite enjoy this example because it demonstrates that something seemingly O(n) can be slower than the O(n log n)/O(n^2) algorithm it feeds. This is because the prefix/suffix trimming often reduces the n of the expensive algorithm to such a small value that its computational cost becomes trivial. Compilers targeting ancient microarchitectures - and thereby failing to leverage the vectorized instructions that unlock superlinear performance - further shift the time towards the O(n) optimizations.

Conclusion

Computers and software can be surprisingly slow for surprising reasons. While this post was long and touched on a number of topics, it only scratched the surface of potential topics. I could easily find another 10 topics to write about. But that will have to be for another post.

Before I go, if you find inaccuracies in this post, please shoot me an email (address in resume in site header) so I can correct the post, as I don't want to unintentionally mislead others.

Also, computers and software are complex. When it comes to performance and optimizations, always be measuring. The issues I described could be manifesting in your software and environments but the effort to address them may not be worth the reward. Computers and software, like life, are full of trade-offs. Performance is just one trade-off. Please don't cargo cult my advice without measuring and applying critical thinking first.


Announcing the 0.9 Release of PyOxidizer

October 18, 2020 at 10:00 PM | categories: Python, PyOxidizer

I have decided to make up for the 6 month lull between PyOxidizer's 0.7 and 0.8 releases by releasing PyOxidizer 0.9 just 1 week after 0.8!

The full 0.9 changelog is found in the docs. First time user? See the Getting Started documentation.

While the 0.9 release is far smaller in terms of features compared to 0.8, it is an important release because of progress closing compatibility gaps.

Build a python Executable

PyOxidizer 0.8 quietly shipped the ability to build executables that behave like python executables via enhancements to the configurability of embedded Python interpreters.

PyOxidizer 0.9 made some minor changes to make this scenario work better and there is even official documentation on how to achieve this. So now you can emit a python executable next to your application's executable. Or you could use PyOxidizer to build a highly portable, self-contained python executable and ship your Python scripts next to it, using PyOxidizer's python in your #!.

Support Packaging Files as Files for Maximum Compatibility

There is a long tail of Python packages that don't just work with PyOxidizer. A subset of these packages don't work because of bugs in how PyOxidizer attempts to classify files as specific types of Python resources.

The way that normal Python works is you materialize a bunch of files on the filesystem and at run-time the filesystem-based importer stat()s a bunch of paths until it finds a candidate file satisfying the import request. This works of course. But it is inefficient. Since PyOxidizer has awareness of every resource being packaged at build time, it attempts to index all known resources and serialize them to an efficient data structure so finding and loading a resource can be extremely quick (effectively just a hashmap lookup in Rust code to resolve the memory address of data).

PyOxidizer's approach does work in the majority of cases. But there are edge cases. For example, NumPy's binary wheels have installed file paths like numpy.libs/libopenblasp-r0-ae94cfde.3.9.dev.so. The numpy.libs directory is not a valid Python package directory since it has a . and since it doesn't have an __init__.py[c] file. This is a case where PyOxidizer's code for turning files into resources is currently confused.

It is tempting to argue that file layouts like NumPy's are wrong. But there doesn't seem to be any formal specification preventing the use of such layouts. The arbiter of truth here is what Python packaging tools accept and the current code for installing wheels gladly accepts file layouts like these. So I've accepted that PyOxidizer is just going to have to support edge cases like this. (I've captured more details about this particular issue in the docs).

Anyway, PyOxidizer 0.9 ships a new, simpler mode for handling files: files mode. In files mode, PyOxidizer disables its code for classifying files as typed Python resources (like module sources and extension modules) and instead treats a file as... a file.

When in files mode, actions that invoke Python packaging tools return files objects instead of classified resources. If you then add these files for packaging, those files are materialized on the filesystem next to your built executable. You can then use Python's standard filesystem importer to load these files at run-time.

This allows you to use PyOxidizer with packages like NumPy that were previously incompatible due to bugs with file/resource classification. In fact, getting NumPy working with PyOxidizer is now in the official documentation!

Files mode is still in its infancy. There already exists code for embedding the data of these files in the produced executable. I plan to eventually teach PyOxidizer's run-time code to extract these embedded files to a temporary directory, SquashFS FUSE filesystem, etc. This is the approach that other Python packaging tools like PyInstaller and XAR use. While it is less efficient, this approach is highly compatible with Python code in the wild since you sidestep issues with __file__ and other assumptions about installed file layouts. So it makes sense for PyOxidizer to support this so you can still achieve the friendliness of a self-contained executable without worrying about compatibility. Look for improvements to files mode in future releases.

And to help debug issues with PyOxidizer's file handling and resource classification, the new pyoxidizer find-resources command can be used to invoke PyOxidizer's code for scanning and classifying files. Hopefully this makes it easier to diagnose bugs in this critical component of PyOxidizer!

Some Important Bug Fixes

PyOxidizer 0.8 shipped with some pretty annoying bugs and behavior quirks.

The ability to set custom sys.path values via Starlark was broken. How I managed to ship that, I'm not sure. But it is fixed in 0.9.

Another bug I can't believe I shipped was the PythonExecutable.read_virtualenv() Starlark method being broken due to a typo. You can read from virtualenvs again in PyOxidizer 0.9.

Another important improvement is in the default Python interpreter configuration. We now automatically initialize Python's locale configuration by default. Without this, the encoding of filesystem paths and sys.argv may not have been correct. If someone passed a non-ASCII argument, the Python str value was likely mangled. PyOxidizer-built binaries should behave reasonably by default now. The issue is a good read if you are interested in the subtle behaviors of how encodings work in Python on different operating systems.

Better Binary Portability Documentation

The documentation on binary portability has been overhauled. Hopefully it is much more clear about the capabilities of PyOxidizer to produce a binary that just works on other machines.

I eventually want to get PyOxidizer to a point where users don't have to think about binary portability. But until PyOxidizer starts generating installers and providing the ability to run builds in deterministic and reproducible environments, it is sadly a problem that is being externalized to end users.

In Conclusion

PyOxidizer 0.9 is a small release representing just 1 week of work. But it contains some notable features that I wanted to get out the door.

As always, please report any issues or feedback in the GitHub issue tracker or the users mailing list.


Announcing the 0.8 Release of PyOxidizer

October 12, 2020 at 12:45 AM | categories: Python, PyOxidizer

I am very excited to announce the 0.8 release of PyOxidizer, a modern Python application packaging tool. You can find the full changelog in the docs. First time user? See the Getting Started documentation.

Foremost, I apologize that this release took so long to publish (0.7 was released on 2020-04-09). I fervently believe that frequent releases are a healthy software development practice. And 6 months between PyOxidizer releases was way too long. Part of the delay was due to world events (it has proven difficult to focus on... anything given a global pandemic, social unrest, and wildfires further undermining any semblance of normalcy in California). Another contributing factor was that I was waiting on a few 3rd party Rust crates to have new versions published to crates.io (you can't publish a crate to crates.io unless all of its dependencies are also published there).

Release delay and general life hardships aside, the 0.8 release is here and it is full of notable improvements!

Python 3.8 and 3.9 Support

PyOxidizer 0.8 now targets Python 3.8 by default, and support for Python 3.9 is available by tweaking configuration files. Previously, only Python 3.7 was supported; this release drops Python 3.7 support. I feel a bit bad about dropping compatibility. But Python 3.8 introduced a new C API for initializing Python interpreters (thank you Victor Stinner!) and this makes PyOxidizer's run-time code for interfacing with Python interpreters vastly simpler. I decided that given the beta nature of PyOxidizer, it wasn't worth maintaining the complexity to continue supporting Python 3.7. I'm optimistic that I'll be able to support Python 3.8 as a baseline for a while.

Better Default Packaging Settings

PyOxidizer started as a science experiment of sorts to see if I could achieve the elusive goal of producing a single file executable providing a Python application. I was successful in proving this hypothesis. But the cost of achieving this outcome was rather high in terms of end-user experience: in order to produce single file executables, you had to break a lot of assumptions about how Python typically works, and this in turn broke a lot of Python code and packages in the wild.

In other words, PyOxidizer's opinionated defaults of producing a single file executable were externalizing hardship on end-users and preventing them from using PyOxidizer.

PyOxidizer 0.8 contains a handful of changes to defaults that should hopefully lessen the friction.

On Windows, the default Python distribution now has a more traditional build configuration (using .pyd extension modules and a pythonXY.dll file). This means that PyOxidizer can consume pre-built extension modules without having to recompile them from source. If you publish a Windows binary wheel on PyPI, in many cases it will just work with PyOxidizer 0.8! (There are some notable exceptions to this, such as numpy, which is doing wonky things with the location of shared libraries in wheels - but I aim to fix this soon.)

Also on Windows, we no longer attempt to embed Python extension modules (.pyd files) and their shared library dependencies in the produced binary and load them from memory by default. This is because PyOxidizer's from-memory library loader didn't work in all cases. For example, some OpenSSL functionality used by the _ssl module in the standard library didn't work, preventing Python from establishing TLS connections. The old mode enabling you to produce a single file executable on Windows is still available. But you have to opt in to it (at the likely cost of more packaging and compatibility pain).

Starlark Configuration Overhaul

PyOxidizer 0.8 contains a ton of changes to its Starlark configuration files. There are so many changes that you may find it easier to port to PyOxidizer 0.8 by creating a new configuration file rather than attempting to port an existing one.

I apologize for this churn and recognize it will be disruptive. However, this churn needed to happen for various reasons.

Much of the old Starlark configuration semantics was rooted in the days when configuration files were static TOML files. Now that configuration files provide the power of a (Python-inspired) programming language, we are free to expose much more flexibility. But that flexibility requires refactoring things so the experience feels more native.

Many changes to Starlark were rooted in necessity. For example, the methods for invoking setup.py or pip install used to live on a Python distribution type and have been moved to a type representing executables. This is because the binary we are targeting influences how packaging actions behave. For example, if the binary only supports loading resources from memory (as opposed to standalone files), we need to know that when invoking the packaging tool so we can produce files (notably Python extension modules) compatible with the destination.

A major change to Starlark in 0.8 is around resource location handling. Before, you could define a static string denoting the resources policy for where things should be placed. And there were 10+ methods for adding different resource types (source, bytecode, extensions, package data) to different load locations (memory, filesystem). This mechanism is vastly simplified and more powerful in PyOxidizer 0.8!

In PyOxidizer 0.8, there is a single add_python_resource() method for adding a resource to a binary and the Starlark objects you add can denote where they should be added by defining attributes on those objects.

Furthermore, you can define a Starlark function that is called when resource objects are created to apply custom packaging rules using custom Starlark code defined in your PyOxidizer config file. So rather than having everyone try to abide by a few pre-canned policies for packaging resources, you can define a proper function in your config file that can be as complex as you want/need it to be! I feel this is vastly simpler and more powerful than implementing a custom DSL in static configuration files (like TOML, JSON, YAML, etc).

While the ability to implement your own arbitrarily complex packaging policies is useful, there is a new PythonPackagingPolicy Starlark type with enough flexibility to suit most needs.

Shipping oxidized_importer

During the development of PyOxidizer 0.8, I broke out the custom Rust-based Python meta-path importer used by PyOxidizer's run-time code into a standalone Python package. This sub-project is called oxidized_importer and I previously blogged about it.

PyOxidizer 0.8 ships oxidized_importer and makes all of its useful APIs available to Python. Read more in the official docs. The new Python APIs should make debugging issues with PyOxidizer-packaged applications vastly simpler: I found them invaluable when tracking down user-reported bugs!

Tons of New Tests and Refactored Code

PyOxidizer was my first non-toy Rust project. And the quality of the Rust code I produced in early versions of PyOxidizer clearly showed it. And when I was in the rapid-prototyping phase of PyOxidizer, I eschewed writing tests in favor of short-term progress.

PyOxidizer 0.8 pays down a ton of technical debt in the code base. Lots of Rust code has been refactored and is using somewhat reasonable practices. I'm not yet a Rust guru. But I'm at the point where I cringe when I look at some of the early code I wrote, which is a good sign. I do have to say that Rust has been a dream to work with during this transition. Despite being a low-level language, my early misuse of Rust did not result in the crashes you would see in C/C++. And Rust's seemingly omniscient compiler, along with IDE tooling that facilitates refactoring, has ensured that code changes aren't accompanied by the subtle, random bugs that would creep in with a dynamic programming language. I really need to write a dedicated post espousing the virtues of Rust...

There are a ton of new tests in PyOxidizer 0.8 and I now feel somewhat confident that the main branch of PyOxidizer should be considered production-ready at any time assuming the tests pass. This will hopefully lead to more rapid releases in the future.

There are now tests for the pyembed Rust crate, which provides the run-time code for PyOxidizer-built binaries. We even have Python-based unit tests for validating the Python-exposed APIs behave as expected. These tests have been invaluable for ensuring that the run-time code works as expected. So now when someone files a bug I can easily write a test to capture it and keep the code working as intended through various refactors.

The packaging-time Rust code has also gained its fair share of tests. We now have fairly comprehensive test coverage around how resources are added/packaged. Python extension modules have proved to be highly nuanced in how they are handled. What helps tremendously with testing extension modules is that we can run tests for extensions that aren't native to the test platform! While not yet exposed/supported by Starlark configuration files, I've taught PyOxidizer's core Rust code to be cross-compiling aware so that we can e.g. test Windows or macOS behavior from Linux. Before, I'd have to test Windows wheel handling on Windows. But after writing a wheel parser in Rust and teaching PyOxidizer to use a different Python distribution for the host architecture than the target architecture, I'm now able to write tests for platform-specific functionality that run on any platform PyOxidizer can run on. This may eventually lead to proper cross-compiling support (at least in some configurations). Time will tell. But the foundation is definitely there!

New Rust Crates

As part of the aforementioned refactoring of PyOxidizer's Rust code, I've been extracting some useful/generic functionality built as part of developing PyOxidizer to their own Rust crates.

As part of this release, I'm publishing the initial 0.1 release of the python-packaging crate (docs). This crate provides pure Rust code for various Python packaging related functionality. This includes:

  • Rust types representing Python resource types (source modules, bytecode modules, extension modules, package resources, etc).
  • Scanning the filesystem for Python resource files.
  • Configuring an embedded Python interpreter.
  • Parsing PKG-INFO and related files.
  • Parsing wheel files.
  • Collecting Python resources and serializing them to a data structure.

The crate is somewhat PyOxidizer centric. But if others are interested in improving its utility, I'll happily accept pull requests!

PyOxidizer's crates footprint now includes:

Major Documentation Updates

I strongly believe that software should be documented thoroughly and I strive for PyOxidizer's documentation to be useful and comprehensive.

There have been a lot of changes to PyOxidizer's documentation since the 0.7 release.

All configuration file documentation has been consolidated.

Likewise, I've attempted to consolidate a lot of the paved road documentation for how to use PyOxidizer in the Packaging User Guide section of the docs.

I'll be honest: since I have so much of PyOxidizer's workings internalized, it can be difficult for me to empathize with PyOxidizer's users. So if you have difficulty with the readability of the documentation, please file an issue and report what is confusing so the documentation can be improved!

Mercurial Shipping With PyOxidizer 0.8

PyOxidizer is arguably an epic yak shave of mine to help the Mercurial version control tool transition to Python 3 and Rust.

I'm pleased to report that Mercurial is now shipping PyOxidizer-built distributions on Windows as of the 5.2.2 release a few days ago! If a complex Python application like Mercurial can be configured to work with PyOxidizer, chances are your Python application will work as well.

What's Next

I view PyOxidizer 0.8 as a pivotal release where PyOxidizer is turning the corner from a prototyping science experiment to something more generally usable. The investments in test coverage and refactoring of the Rust internals are paving the way towards future features and bug fixes.

In upcoming releases, I'd like to close remaining known compatibility gaps with popular Python packages (such as numpy and other packages in the scientific/data space). I have a general idea of what work needs to be done and I've been laying the ground work via various refactorings to execute here.

I want a general theme of future releases to be eliminating reasons why people can't use PyOxidizer. PyOxidizer's historical origin was as a science experiment to see if single file Python applications were possible. It is clear that achieving that goal is fundamentally at odds with compatibility for tons of Python packages in the wild. I'd like to find a way where PyOxidizer can achieve 99% package compatibility by default so new users don't get discouraged. And the subset of users who want single file executables can opt in and spend the additional effort to achieve that.

At some point, I also want to pivot towards producing distributable artifacts (Debian/RPM packages, MSI installers, macOS DMG files, etc). I'm slightly bummed that I haven't made much progress here. But I have a vision in my mind of where I want to go (I'll be making a standalone Rust crate + Starlark dialect to facilitate producing distributable artifacts for any application) and I'm anticipating starting this work in the next few months. In the meantime, PyOxidizer 0.8 should be able to give people a directory tree that they can coerce into distributable artifacts using existing packaging tooling. That's not as turnkey as I would like it to be. But the technical problems around building a distributable Python application binary still need some work, and I view that as the most pressing need for the Python ecosystem. So I'll continue to focus there so there is a solid foundation to build upon.

In conclusion, I hope you enjoy the new release! Please report any issues or feedback in the GitHub issue tracker.


Using Rust to Power Python Importing With oxidized_importer

May 10, 2020 at 01:15 PM | categories: Python, PyOxidizer

I'm pleased to announce the availability of the oxidized_importer Python package, a standalone version of the custom Python module importer used by PyOxidizer. oxidized_importer - a Python extension module implemented in Rust - enables Python applications to start and run quicker by providing an alternate, more efficient mechanism for loading Python resources (such as source and bytecode modules).

Installation instructions and detailed usage information are available in the official documentation. The rest of this post hopefully answers the questions of "why are you doing this" and "why should I care."

In a traditional Python process, Python's module importer inspects the filesystem at run-time to find and load resources like Python source and bytecode modules. It is highly dynamic in nature and relies on the filesystem as a point-in-time source of truth for resource availability.

oxidized_importer takes a different approach to resource loading that is more static in nature and more suitable to application environments (where Python resources aren't changing). Instead of dynamically probing the filesystem for available resources, resources are instead indexed ahead of time. When Python goes to resolve a resource (say it is looking to import a module), oxidized_importer simply needs to perform a lookup in an in-memory data structure to locate said resource. This means oxidized_importer only has marginal reliance on the filesystem, which can make it much faster than Python's traditional importer. (Performance benefits of binaries built with PyOxidizer have already been clearly demonstrated.)
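
To illustrate the difference in approach - and this is a deliberately simplified toy, not oxidized_importer's actual implementation - consider probing the filesystem for every import versus consulting a pre-built, in-memory index:

    import os

    # Traditional approach: probe the filesystem for candidates at import time.
    def find_module_by_probing(name, search_paths):
        for directory in search_paths:
            for candidate in (f"{name}/__init__.py", f"{name}.py"):
                path = os.path.join(directory, candidate)
                if os.path.exists(path):  # a stat() per candidate path
                    return path
        return None

    # Indexed approach: pay the scanning cost once up front, then every lookup
    # is a dictionary probe with no filesystem I/O on the import path.
    def build_index(search_paths):
        index = {}
        for directory in search_paths:
            for entry in os.scandir(directory):
                if entry.name.endswith(".py"):
                    index.setdefault(entry.name[:-3], entry.path)
        return index

    def find_module_by_index(name, index):
        return index.get(name)

The real importer handles packages, bytecode, extension modules, and much more, but the shape of the win is the same: the index is built once, and the hot import path avoids filesystem calls entirely.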

The oxidized_importer Python extension module exposes parts of PyOxidizer's packaging and run-time functionality to Python code, without requiring the full use of PyOxidizer for application packaging. Specifically, oxidized_importer allows you to:

  • Install a custom, high-performance module importer (OxidizedFinder) to service Python import statements and resource loading (potentially from memory, using zero-copy).
  • Scan the filesystem for Python resources (source modules, bytecode files, package resources, distribution metadata, etc) and turn them into Python objects, which can be loaded into OxidizedFinder instances.
  • Serialize Python resource data into an efficient binary data structure for loading into an OxidizedFinder instance. This facilitates producing a standalone resources blob that can be distributed with a Python application which contains all the Python modules, bytecode, etc required to power that application. See the docs on freezing an application with oxidized_importer.

oxidized_importer can be thought of as PyOxidizer-lite: it provides just enough functionality to allow Python application maintainers to leverage some of the technical advancements of PyOxidizer (such as in-memory module imports) without using PyOxidizer for application packaging. oxidized_importer can work with the Python distribution already installed on your system. You just pip install it like any other Python package.
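
Getting started is roughly as simple as the following sketch (based on my reading of the documentation at the time of writing - consult the official docs for the current API and for how to index resources into the finder):

    # pip install oxidized_importer
    import sys

    import oxidized_importer

    # Register the Rust-backed finder ahead of the standard path-based importer
    # so it gets first crack at servicing imports; resources are indexed into
    # it using the APIs described in the documentation.
    finder = oxidized_importer.OxidizedFinder()
    sys.meta_path.insert(0, finder)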

By releasing oxidized_importer as a standalone Python package, my hope is to allow more people to leverage some of the technical achievements and performance benefits coming out of PyOxidizer. I also hope that having more users of PyOxidizer's underlying code will help uncover bugs and conformance issues, raising the quality and viability of the projects.

I would also like to use oxidized_importer as an opportunity to advance the discourse around Python's resource loading mechanism. Filesystem I/O can be extremely slow, especially in mobile and embedded environments. Dynamically probing the filesystem to service module imports can therefore be slow. (The Python standard library has the zipimport module for importing Python resources from a zip file. But in my opinion, we can do much better.) I would like to see Python move towards leveraging immutable, serialized data structures for loading resources as efficiently as possible. After all, Python resources like the Python standard library are likely not changing between Python process invocations. The performance zealot in me cringes thinking of all the overhead that Python's filesystem probing approach incurs - all of the excessive stat() and other filesystem I/O calls that must be performed to answer questions about state that is easily indexed and often doesn't change. oxidized_importer represents my vision for what a high-performance Python resource loader should look like. I hope it can be successful in steering Python towards a better approach for resource loading.

I plan to release oxidized_importer independently from PyOxidizer. While the projects will continue to be developed in the same repository and will leverage the same underlying Rust code, I view them as somewhat independent and serving different audiences.

While oxidized_importer evolved from facilitating PyOxidizer's run-time use cases, I'm not opposed to taking it in new directions. For example, I would entertain implementing Python's dynamic filesystem probing logic in oxidized_importer, allowing it to serve as a functional stand-in for the official importer shipped with the Python standard library. I have little doubt an importer implemented in 100% Rust would outperform the official importer, which is implemented in Python. There's all kinds of possibilities here, such as using a background thread to index sys.path outside the constraints of the GIL. But I don't want to get ahead of myself...

If you are a Python application maintainer and want to make your Python processes execute a bit faster by leveraging a pre-built index of available Python resources and/or taking advantage of in-memory module importing, I highly encourage you to take a look at oxidized_importer!

