What I've Learned About Optimizing Python

January 10, 2019 at 03:00 PM | categories: Python

I've used Python more than any other programming language in the past 4-5 years. Python is the lingua franca for Firefox's build, test, and CI tooling. Mercurial is written in mostly Python. Many of my side-projects are in Python.

Along the way, I've accrued a bit of knowledge about Python performance and how to optimize Python. This post is about sharing that knowledge with the larger community.

My experience with Python is mostly with the CPython interpreter, specifically CPython 2.7. Not all observations apply to all Python distributions or have the same characteristics across Python versions. I'll try to call this out when relevant. And this post is in no way a thorough survey of the Python performance landscape. I mainly want to highlight areas that have particularly plagued me.

Startup and Module Importing Overhead

Starting a Python interpreter and importing Python modules is relatively slow if you care about milliseconds.

If you need to start hundreds or thousands of Python processes as part of a workload, this overhead will amount to several seconds of overhead.

If you use Python to provide CLI tools, the overhead can cause enough lag to be noticeable by people. If you want instantaneous CLI tools, launching a Python interpreter on every invocation will make it very difficult to achieve that with a sufficiently complex tool.

I've written about this problem extensively. My 2014 post on python-dev outlines the problem. Posts in May 2018 and October 2018 restate and refine it.

There's not much you can do to alleviate interpreter startup overhead: fixing this mostly resides with the maintainers of the Python interpreter because they control the code that is taking precious milliseconds to complete. About the best you can do is disable the site import in your shebangs and invocations to avoid some extra Python code running at startup. However, many applications rely on functionality provided by site.py, so use at your own risk.

Related to this is the problem of module importing. What good is a Python interpreter if it doesn't have code to run! And the way code is made available to the interpreter is often through importing modules.

There are multiple steps to importing modules. And there are sources of overhead in each one.

There is overhead in finding modules and reading their data. As I've demonstrated with PyOxidizer, replacing the default find and load a module from the filesystem with an architecturally simpler solution of read the module data from an in-memory data structure makes importing the Python standard library take 70-80% of its original time! Having a single module per filesystem file introduces filesystem overhead and can slow down Python applications in the critical first milliseconds of execution. Solutions like PyOxidizer can mitigate this. And hopefully the Python community sees the overhead in the current approach and considers moving towards module distribution mechanisms that don't rely so much on separate files per module.

Another source of module importing overhead is executing code in that module at import time. Some modules have code in the module scope outside of functions and classes that runs when the module is imported. This code execution can add overhead to importing. A mitigation for this is to not run as much code at import time: only run code as needed. Python 3.7 supports a module __getattr__ that will be called when a module attribute is not found. This can be used to lazily populate module attributes on first access.

Another workaround for module importing slowness is lazy module importing. Instead of actually loading a module when it is imported, you register a custom module importer that returns a stub for that module instead. When that stub is first accessed, it will load the actual module and mutate itself to be that module.

By avoiding the filesystem and module running overhead for unused modules (modules are typically imported globally and then only used by certain functions in a module), you can easily shave dozens of milliseconds from applications importing several dozens of modules.

But lazy module importers are a bit fragile. Lots of modules have a pattern where they try: import foo; except ImportError:. A lazy module importer may never raise ImportError here because to do so, it would need to search the filesystem for a module to know if it exists and searching the filesystem would add overhead, so they don't do it! You work around this by accessing an attribute on the imported module. This forces the ImportError to be raised if the module doesn't exist but undermines the laziness of the module import! This problem is quite nasty. Mercurial's lazy module importer has to maintain a list of modules that are known to not be lazy importable to work around it. Another issue is the from foo import x, y syntax, which also undermines lazy module importing in cases where foo is a module (as opposed to a package) because in order to return a reference to x and y, the module has to be imported.

PyOxidizer, having a fixed set of modules frozen into the binary, can be efficient about raising ImportError. And Python 3.7's module __getattr__ provides additional flexibility for lazy module importers. I hope to integrate a robust lazy module importer into PyOxidizer so these gains are realized automatically.

The best solution to avoiding the interpreter startup and module import overhead problem is to run a persistent Python process. If you run Python in a daemon process (say for a web server), you pretty much get this for free. Mercurial's solution to this is to run a persistent Python process in the background which exposes a command server protocol. hg is aliased to a C (or now Rust) executable which connects to that persistent process and dispatches a command. The command server approach is a lot of work and can be a bit fragile and has security concerns. I'm exploring the idea of shipping a command server with PyOxidizer so executable can easily gain its benefits and the cost to solving the problem only needs to be paid in one central place: the PyOxidizer project.

Function Call Overhead

Function calls in Python are relatively slow. (This observation applies less to PyPy, which can JIT code execution.)

I've seen literally dozens of patches to Mercurial where we inline code or combine Python functions in order to avoid function call overhead. In the current development cycle, some effort was made to reduce the number of functions called when updating progress bars. (We try to use progress bars for any operation that could take a while so users know what is going on.) The old progress bar update code would dispatch to a handful of functions. Caching function call results and avoiding simple lookups via functions shaves dozens to hundreds of milliseconds off execution when we're talking about 1 million executions.

If you have tight loops or recursive functions in Python where hundreds of thousands or more function calls could be in play, you need to be aware of the overhead of calling an individual function, as it can add up quickly! Consider in-lining simple functions and combining functions to avoid the overhead.

Attribute Lookup Overhead

This problem is similar to function call overhead because it can actually be the same problem!

Resolving an attribute in Python can be relatively slow. (Again, this observation applies less to PyPy.)

Again, working around this issue is something we do a lot in Mercurial.

Say you have the following code:

obj = MyObject()
total = 0

for i in len(obj.member):
    total += obj.member[i]

Ignoring that there are better ways to write this example (total = sum(obj.member) should work), as written, the loop here will need to resolve obj.member on every iteration. Python has a relatively complex mechanism for resolving attributes. For simple types, it can be quite fast. But for complex types, that attribute access can silently be invoking __getattr__, __getattribute__, various other dunder methods, and even custom @property functions. What looks like it should be a fast attribute lookup can silently be several function calls, leading to function call overhead! And this overhead can compound if you are doing things like obj.member1.member2.member3 etc.

Each attribute lookup adds overhead. And since nearly everything in Python is a dictionary, it is somewhat accurate to equate each attribute lookup as a dictionary lookup. And we know from basic data structures that dictionary lookups are intrinsically not as fast as having say a pointer. Yes, there are some tricks in CPython to avoid the dictionary lookup overhead. But the general theme I want to get across is that each attribute lookup is a potential performance sink.

For tight loops - especially those over potentially hundreds of thousands of iterations - you can avoid this measurable attribute lookup overhead by aliasing the value to a local. We would write the example above as:

obj = MyObject()
total = 0

member = obj.member
for i in len(member):
    total += member[i]

Of course, this is only safe when the aliased item isn't replaced inside the loop! If that happens, your iterator will hold a reference to the old item and things may blow up.

The same trick can be used when calling a method of an object. Instead of:

obj = MyObject()

for i in range(1000000):
    obj.process(i)

Do the following:

obj = MyObject()
fn = obj.process

for i in range(1000000:)
    fn(i)

It's also worth noting that in cases where the attribute lookup is used to call a method (such as the previous example), Python 3.7 is significantly faster than previous releases. But I'm pretty sure this is due to dispatch overhead to the method function itself, not attribute lookup overhead. So things will be faster yet by avoiding the attribute lookup.

Finally, unless attribute lookup is calling functions to resolve the attribute, attribute lookup is generally less of a problem than function call overhead. And it generally requires eliminating a lot of attribute lookups for you to notice a meaningful improvement. That being said, once you add up all attribute accesses inside a loop, you may be talking about 10 or 20 attributes in the loop alone - before function calls. And loops with only thousands or low tens of thousands of iterations can quickly provide hundreds of thousands or millions of attribute lookups. So be on the lookout!

Object Overhead

From the perspective of the Python interpreter, every value is an object. In CPython, each value is a PyObject struct. Each object managed by the interpreter is on the heap and needs to have its own memory holding its reference count, its type, and other state. Every object is garbage collected. This means that each new object introduces overhead for the reference counting / garbage collection mechanism to process. (Again, PyPy can avoid some of this overhead by being more intelligent about the lifetimes of short-lived values.)

As a general rule of thumb, the more unique Python values/objects you create, the slower things are.

For example, say you are iterating over a collection of 1 million objects. You call a function to process that object into a tuple:

for x in my_collection:
    a, b, c, d, e, f, g, h = process(x)

In this example, process() returns an 8-tuple. It doesn't matter of we destructure the return value or not: this tuple requires the creation of at least 9 Python values: 1 for the tuple itself and 8 for its inner members. OK, in reality there could be fewer values if process() returns a reference to an existing value. Or there could be more if the types aren't simple types and require multiple PyObject to represent. My point is that under the hood the interpreter is having to juggle multiple objects to represent things.

From my experience, this overhead is only relevant for operations that benefit from speedups when implemented in a native language like C or Rust. The reason is the CPython interpreter is just unable to execute bytecode fast enough for object overhead itself to matter. Instead, you will likely hit performance issues with function call overhead, processing overhead, etc long before object overhead. But there are some exceptions to this, such as constructing tuples or dicts with several members.

As a concrete example of this overhead, Mercurial has C code for parsing some of the lower-level data structures. In terms of raw parsing speed, the C code runs on an order of two magnitudes faster than CPython. But once we have that C code create PyObject to present the result, the speedup drops to just a few times faster, if that. In other words, the overhead is coming from creating and managing Python values so they can be used by Python code.

A workaround for this is to produce fewer Python values. If you only need to access a single value, have a function return that single value instead of say a tuple or dict with N values. However, watch out for function call overhead!

When you have a lot of speedup code using the CPython C API and values need to be shared across different modules, pass around Python types that expose data as C structs and have the compiled code access those C structs instead of going through the CPython C API. By avoiding the Python C API for data access, you will be avoiding most of its overhead.

Treating values as data (instead of having functions for accessing everything) is more Pythonic. So another workaround for compiled code is is to lazily create PyObject instances. If you create a custom Python type (PyTypeObject) to represent your complex values, you can define the tp_members and/or tp_getset fields to register custom C functions to resolve the value for an attribute. If you are say writing a parser and you know that consumers will only access a subset of the parsed fields, you can quickly construct a type holding the raw data, return that type, and have the Python attribute lookup call a C function which resolves the PyObject. You can even defer parsing until this function is called, saving additional overhead if a parse is never required! This technique is quite rare (because it requires writing a non-trivial amount of code against the Python C API). But it can result in substantial wins.

Pre-Sizing Collections

This one applies to the CPython C API.

When creating collections like lists or dicts, use e.g. PyList_New() + PyList_SET_ITEM() to populate new collections when their size is known at collection creation time. This will pre-size the collection to have capacity to hold the final number of elements. And it skips checks when inserting elements that the collection is large enough to hold them. When creating collections of thousands of elements, this can save a bit of overhead!

Using Zero-copy in the C API

The CPython C API really likes to make copies of things rather than return references. For example, PyBytes_FromStringAndSize() copies a char* to memory owned by Python. If you are doing this for a large number of values or sufficiently large data, we could be talking about gigabytes of memory I/O and associated allocator overhead.

If writing high-performance code against the C API, you'll want to become familiar with the buffer protocol and related types, like memoryview.

The buffer protocol is implemented by Python types and allows the Python interpreter to cast a type to/from bytes. It essentially allows the interpreter's C code to get a handle on a void* of certain size representing the object. This allows you to associate any address in memory with a PyObject. Many functions operating on binary data transparently accept any object implementing the buffer protocol. And if you are coding against the C API and want to accept any object that can be treated as bytes, you should be using the s*, y* or w* format units when parsing function arguments.

By using the buffer protocol, you give the interpreter the best opportunity possible to be using zero-copy operations and avoiding having to copy bytes around in memory.

By using Python types like memoryview, you are also allowing Python to reference slices of memory by reference instead of by copy.

When you have gigabytes of data flowing through your Python program, astute use of Python types that support zero-copy can make a world of difference on performance. I once measured that python-zstandard was faster than some Python LZ4 bindings (LZ4 should be faster than zstandard) because I made heavy use of the buffer protocol and avoiding excessive memory I/O in python-zstandard!

Conclusion

This post has outlined some of the things I've learned optimizing Python programs over the years. This post is by no means a comprehensive overview of Python performance techniques and gotchas. I recognize that my use of Python is probably more demanding than most and that the recommendations I made are not applicable to many Python programs. You should not mass update your Python code to e.g. inline functions and remove attribute lookups after reading this post. As always, when it comes to performance optimization, measure first and optimize where things are observed to be slow. I highly recommend py-spy for profiling Python applications. That being said, it's hard to attach a time value to low-level activity in the Python interpreter such as calling functions and looking up attributes. So if you e.g. have a loop that you know is tight, experiment with suggestions in this post, and see if you can measure an improvement!

Finally this post should not be interpreted as a dig against Python or its performance properties. Yes, you can make arguments that Python should or shouldn't be used in particular areas because of performance properties. But Python is extremely versatile - especially with PyPy delivering exceptional performance for a dynamic programming language. The performance of Python is probably good enough for most people. For better or worse, I have used Python for uses cases that often feel like outliers across all users. And I wanted to share my experiences such that others know what life at the frontier is like. And maybe, just maybe, I can cause the smart people who actually maintain Python distributions to think about the issues I've had in more detail and provide improvements to mitigate them.


PyOxidizer Support for Windows

January 06, 2019 at 10:00 AM | categories: Python, PyOxidizer, Rust

A few weeks ago I introduced PyOxidizer, a project that aims to make it easier to produce completely self-contained executables embedding a Python interpreter (using Rust). A few days later I observed some PyOxidizer performance benefits.

After a few more hacking sessions, I'm very pleased to report that PyOxidizer is now working on Windows!

I am able to produce a standalone Windows .exe containing a fully featured CPython interpreter, all its library dependencies (OpenSSL, SQLite, liblzma, etc), and a copy of the Python standard library (both source and bytecode data). The binary weighs in at around 25 MB. (It could be smaller if we didn't embed .py source files or stripped some dependencies.) The only DLL dependencies of the exe are vcruntime140.dll and various system DLLs that are always present on Windows.

Like I did for Linux and macOS, I produced a Python script that performs ~500 import statements for the near entirety of the Python standard library. I then ran this script with both the official 64-bit Python distribution and an executable produced with PyOxidizer:

# Official CPython 3.7.2 Windows distribution.
$ time python.exe < import_stdlib.py
real    0m0.475s

# PyOxidizer with non-PGO CPython 3.7.2
$ time target/release/pyapp.exe < import_stdlib.py
real    0m0.347s

Compared to the official CPython distribution, a PyOxidizer executable can import almost the entirety of the Python standard library ~125ms faster - or ~73% of original. In terms of the percentage of speedup, the gains are similar to Linux and macOS. However, there is substantial new process overhead on Windows compared to POSIX architectures. On the same machine, a hello world Python process will execute in ~10ms on Linux and ~40ms on Windows. If we remove the startup overhead, importing the Python standard library runs at ~70% of its original time, making the relative speedup on par with that seen on macOS + APFS.

Windows support is a major milestone for PyOxidizer. And it was the hardest platform to make work. CPython's build system on Windows uses Visual Studio project files. And coercing the build system to produce static libraries was a real pain. Lots of CPython's build tooling assumes Python is built in a very specific manner and multiple changes I made completely break those assumptions. On top of that, it's very easy to encounter problems with symbol name mismatch due to the use of __declspec(dllexport) and __declspec(dllimport). I spent several hours going down a rabbit hole learning how Rust generates symbols on Windows for extern {} items. Unfortunately, we currently have to use a Rust Nightly feature (the static-nobundle linkage kind) to get things to work. But I think there are options to remove that requirement.

Up to this point, my work on PyOxidizer has focused on prototyping the concept. With Windows out of the way and PyOxidizer working on Linux, macOS, and Windows, I have achieved confidence that my vision of a single executable embedding a full-featured Python interpreter is technically viable on major desktop platforms! (BSD people, I care about you too. The solution for Linux should be portable to BSD.) This means I can start focusing on features, usability, and optimization. In other words, I can start building a tool that others will want to use.

As always, you can follow my work on this blog and by following the python-build-standalone and PyOxidizer projects on GitHub.


Faster In-Memory Python Module Importing

December 28, 2018 at 12:40 PM | categories: Python, PyOxidizer, Rust

I recently blogged about distributing standalone Python applications. In that post, I announced PyOxidizer - a tool which leverages Rust to produce standalone executables embedding Python. One of the features of PyOxidizer is the ability to import Python modules embedded within the binary using zero-copy.

I also recently blogged about global kernel locks in APFS, which make filesystem operations slower on macOS. This was the latest wrinkle in a long battle against Python's slow startup times, which I've posted about on the official python-dev mailing list over the years.

Since I announced PyOxidizer a few days ago, I've had some productive holiday hacking sessions!

One of the reached milestones is PyOxidizer now supports macOS.

With that milestone reached, I thought it would be interesting to compare the performance of a PyOxidizer executable versus a standard CPython build.

I produced a Python script that imports almost the entirety of the Python standard library - at least the modules implemented in Python. That's 508 import statements. I then executed this script using a typical python3.7 binary (with the standard library on the filesystem) and PyOxidizer-produced standalone executables with a module importer that loads Python modules from memory using zero copy.

# Homebrew installed CPython 3.7.2

# Cold disk cache.
$ sudo purge
$ time /usr/local/bin/python3.7 < import_stdlib.py
real   0m0.694s
user   0m0.354s
sys    0m0.121s

# Hot disk cache.
$ time /usr/local/bin/python3.7 < import_stdlib.py
real   0m0.319s
user   0m0.263s
sys    0m0.050s

# PyOxidizer with non-PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real   0m0.223s
user   0m0.201s
sys    0m0.017s

# PyOxidizer with PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real   0m0.234s
user   0m0.210s
sys    0m0.019

# PyOxidizer with PTO+LTO CPython 3.7.2
$ sudo purge
$ time target/release/pyapp < import_stdlib.py
real   0m0.442s
user   0m0.252s
sys    0m0.059s

$ time target/release/pyall < import_stdlib.py
real   0m0.221s
user   0m0.197s
sys    0m0.020s

First, the PyOxidizer times are all relatively similar regardless of whether PGO or LTO is used to build CPython. That's not too surprising, as I'm exercising a very limited subset of CPython (and I suspect the benefits of PGO/LTO aren't as pronounced due to the nature of the CPython API).

But the bigger result is the obvious speedup with PyOxidizer and its in-memory importing: PyOxidizer can import almost the entirety of the Python standard library ~100ms faster - or ~70% of original - than a typical standalone CPython install with a hot disk cache! This comes out to ~0.19ms per import statement. If we run purge to clear out the disk cache, the performance delta increases to 252ms, or ~64% of original. All these numbers are on a 2018 6-core 2.9 GHz i9 MacBook Pro, which has a pretty decent SSD.

And on Linux on an i7-6700K running in a Hyper-V VM:

# pyenv installed CPython 3.7.2

# Cold disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real   0m0.405s
user   0m0.165s
sys    0m0.065s

# Hot disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real   0m0.193s
user   0m0.161s
sys    0m0.032s

# PyOxidizer with PGO CPython 3.7.2

# Cold disk cache.
$ time target/release/pyapp < import_stdlib.py
real   0m0.227s
user   0m0.145s
sys    0m0.016s

# Hot disk cache.
$ time target/release/pyapp < import_stdlib.py
real   0m0.152s
user   0m0.136s
sys    0m0.016s

On a hot disk cache, the run-time improvement of PyOxidizer is ~41ms, or ~78% of original. This comes out to ~0.08ms per import statement. When flushing caches by writing 3 to /proc/sys/vm/drop_caches, the delta increases to ~178ms, or ~56% of original.

Using dtruss -c to execute the binaries, the breakdown in system calls occurring >10 times is clear:

# CPython standalone
fstatfs64                                      16
read_nocancel                                  19
ioctl                                          20
getentropy                                     22
pread                                          26
fcntl                                          27
sigaction                                      32
getdirentries64                                34
fcntl_nocancel                                106
mmap                                          114
close_nocancel                                129
open_nocancel                                 130
lseek                                         148
open                                          168
close                                         170
read                                          282
fstat64                                       403
stat64                                        833

# PyOxidizer
lseek                                          10
read                                           12
read_nocancel                                  14
fstat64                                        16
ioctl                                          22
munmap                                         31
stat64                                         33
sysctl                                         33
sigaction                                      36
mmap                                          122
madvise                                       193
getentropy                                    315

PyOxidizer avoids hundreds of open(), close(), read(), fstat64(), and stat64() calls. And by avoiding these calls, PyOxidizer not only avoids the userland-kernel overhead intrinsic to them, but also any additional overhead that APFS is imposing via its global lock(s).

(Why the PyOxidizer binary is making hundreds of calls to getentropy() I'm not sure. It's definitely coming from Python as a side-effect of a module import and it is something I'd like to fix, if possible.)

With this experiment, we finally have the ability to better isolate the impact of filesystem overhead on Python module importing and preliminary results indicate that the overhead is not insignificant - at least on the tested systems (I'll get data for Windows when PyOxidizer supports it). While the test is somewhat contrived (I don't think many applications import the entirety of the Python standard library), some Python applications do import hundreds of modules. And as I've written before, milliseconds matter. This is especially true if you are invoking Python processes hundreds or thousands of times in a build system, when running a test suite, for scripting, etc. Cumulatively you can be importing tens of thousands of modules. So I think shaving even fractions of a millisecond from module importing is important.

It's worth noting that in addition to the system call overhead, CPython's path-based importer runs substantially more Python code than PyOxidizer and this likely contributes several milliseconds of overhead as well. Because PyOxidizer applications are static, the importer can remain simple (finding a module in PyOxidizer is essentially a Rust HashMap<String, Vec<u8> lookup). While it might be useful to isolate the filesystem overhead from Python code overhead, the thing that end-users care about is overall execution time: they don't care where that overhead is coming from. So I think it is fair to compare PyOxidizer - with its intrinsically simpler import model - with what Python typically does (scan sys.path entries and looking for modules on the filesystem).

Another difference is that PyOxidizer is almost completely statically linked. By contrast, a typical CPython install has compiled extension modules as standalone shared libraries and these shared libraries often link against other shared libraries (such as libssl). From dtruss timing information, I don't believe this difference contributes to significant overhead, however.

Finally, I haven't yet optimized PyOxidizer. I still have a few tricks up my sleeve that can likely shave off more overhead from Python startup. But so far the results are looking very promising. I dare say they are looking promising enough that Python distributions themselves might want to look into the area more thoroughly and consider distribution defaults that rely less on the every-Python-module-is-a-separate-file model.

Stay tuned for more PyOxidizer updates in the near future!

(I updated this post a day after initial publication to add measurements for Linux.)


Distributing Standalone Python Applications

December 18, 2018 at 03:35 PM | categories: Python, PyOxidizer, Rust

The Problem

Packaging and application distribution is a hard problem on multiple dimensions. For Python, large aspects of this problem space are more or less solved if you are distributing open source Python libraries and your target audience is developers (use pip and PyPI). But if you are distributing Python applications - standalone executables that use Python - your world can be much more complicated.

One of the primary reasons why distributing Python applications is difficult is because of the complex and often sensitive relationship between a Python application and the environment it runs in.

For starters we have the Python interpreter itself. If your application doesn't distribute the Python interpreter, you are at the whims of the Python interpreter provided by the host machine. You may want to target Python 3.7 only. But because Python 3.5 or 3.6 is the most recent version installed by many Linux distros, you are forced to support older Python versions and all their quirks and lack of features.

Going down the rabbit hole, even the presence of a supposedly compatible version of the Python interpreter isn't a guarantee for success! For example, the Python interpreter could have a built-in extension that links against an old version of a library. Just last week I was encountering weird SQlite bugs in Firefox's automation because Python was using an old version of SQLite with known bugs. Installing a modern SQLite fixed the problems. Or the interpreter could have modifications or extra installed packages interfering with the operation of your application. There are never-ending corner cases. And I can tell you from my experience with having to support the Firefox build system (which uses Python heavily) that you will encounter these corner cases given a broad enough user base.

And even if the Python interpreter on the target machine is fully compatible, getting your code to run on that interpreter could be difficult! Several Python applications leverage compiled extensions linking against Python's C API. Distributing the precompiled form of the extension can be challenging, especially when your code needs to link against 3rd party libraries, which may conflict with something on the target system. And, the precompiled extensions need to be built in a very delicate manner to ensure they can run on as many target machines as possible. But not distributing pre-built binaries requires the end-user be able to compile Python extensions. Not every user has such an environment and forcing this requirement on them is not user friendly.

From an application developer's point of view, distributing a copy of the Python interpreter along with your application is the only reliable way of guaranteeing a more uniform end-user experience. Yes, you will still have variability because every machine is different. But you've eliminated the the Python interpreter from the set of unknowns and that is a huge win. (Unfortunately, distributing a Python interpreter comes with a host of other problems such as size bloat, security/patching concerns, poking the OS packaging bears, etc. But those problems are for another post.)

Existing Solutions

There are tons of existing tools for solving the Python application distribution problem.

The approach that tools like Shiv and PEX take is to leverage Python's built-in support for running zip files. Essentially, if there is a zip file containing a __main__.py file and you execute python file.zip (or have a zip file with a #!/usr/bin/env python shebang), Python can load modules in that zip file and execute an application within. Pretty cool!

This approach works great if your execution environment supports shebangs (Windows doesn't) and the Python interpreter is suitable. But if you need to support Windows or don't have control over the execution environment and can't guarantee the Python interpreter is good, this approach isn't suitable.

As stated above, we want to distribute the Python interpreter with our application to minimize variability. Let's talk about tools that do that.

XAR is a pretty cool offering from Facebook. XAR files are executables that contain SquashFS filesystems. Upon running the executable, SquashFS filesystems are created. For Python applications, the XAR contains a copy of the Python interpreter and all your Python modules. At run-time, these files are extracted to SquashFS filesystems and the Python interpreter is executed. If you squint hard enough, it is kind of like a pre-packaged, executable virtualenv which also contains the Python interpreter.

XARs are pretty cool (and aren't limited to Python). However, because XARs rely on SquashFS, they have a run-time requirement on the target machine. This is great if you only need to support Linux and macOS and your target machines support FUSE and SquashFS. But if you need to support Windows or a general user population without SquashFS support, XARs won't help you.

Zip files and XARs are great for enterprises that have tightly controlled environments. But for a general end-user population, we need something more robust against variance among target machines.

There are a handful of tools for packaging Python applications along with the Python interpreter in more resilient manners.

Nuitka converts Python source to C code then compiles and links that C code against libpython. You can perform a static link and compile everything down to a single executable. If you do the compiling properly, that executable should just work on pretty much every target machine. That's pretty cool and is exactly the kind of solution application distributors are looking for: you can't get much simpler than a self-contained executable! While I'd love to vouch for Nuitka and recommend using it, I haven't used it so can't. And I'll be honest, the prospect of compiling Python source to C code kind of terrifies me. That effectively makes Nuitka a new Python implementation and I'm not sure I can (yet) place the level of trust in Nuitka that I have for e.g. CPython and PyPy.

And that leads us to our final category of tools: freezing your code. There are a handful of tools like PyInstaller which automate the process of building your Python application (often via standard setup.py mechanisms), assembling all the requisite bits of the Python interpreter, and producing an artifact that can be distributed to end users. There are even tools that produce Windows installers, RPMs, DEBs, etc that you can sign and distribute.

These freezing tools are arguably the state of the art for Python application distribution to general user populations. On first glance it seems like all the needed tools are available here. But there are cracks below the surface.

Issues with Freezing

A common problem with freezing is it often relies on the Python interpreter used to build the frozen application. For example, when building a frozen application on Linux, it will bundle the system's Python interpreter with the frozen application. And that interpreter may link against libraries or libc symbol versions not available on all target machines. So, the build environment has to be just right in order for the binaries to run on as many target systems as possible. This isn't an insurmountable problem. But it adds overhead and complexity to application maintainers.

Another limitation is how these frozen applications handle importing Python modules.

Multiple tools take the approach of embedding an archive (usually a zip file) in the executable containing the Python standard library bits not part of libpython. This includes C extensions (compiled to .so or .pyd files) and Python source (.py) or bytecode (.pyc) files. There is typically a step - either at application start time or at module import time - where a file is extracted to the filesystem such that Python's filesystem-based importer can load it from there.

For example, PyInstaller extracts the standard library to a temporary directory at application start time (at least when running in single file mode). This can add significant overhead to the startup time of applications - more than enough to blow through people's ability to perceive something as instantaneous. This is acceptable for long-running applications. But for applications (like CLI tools or support tools for build systems), the overhead can be a non-starter. And, the mere fact that you are doing filesystem write I/O establishes a requirement that the application have write access to the filesystem and that write I/O can perform reasonably well lest application performance suffer. These can be difficult pills to swallow!

Another limitation is that these tools often assume the executable being produced is only a Python application. Sometimes Python is part of a larger application. It would be useful to produce a library that can easily be embedded within a larger application.

Improving the State of the Art

Existing Python application distribution mechanisms don't tick all the requirements boxes for me. We have tools that are suitable for internal distribution in well-defined enterprise environments. And we have tools that target general user populations, albeit with a burden on application maintainers and often come with a performance hit and/or limited flexibility.

I want something that allows me to produce a standalone, single file executable containing a Python interpreter, the Python standard library (or a subset of it), and all the custom code and resources my application needs. That executable should not require any additional library dependencies beyond what is already available on most target machines (e.g. libc). That executable should not require any special filesystem providers (e.g. FUSE/SquashFS) nor should it require filesystem write access nor perform filesystem write I/O at run-time. I should be able to embed a Python interpreter within a larger application, without the overhead of starting the Python interpreter if it isn't needed.

No existing solution ticks all of these boxes.

So I set out to build one.

One problem is producing a Python interpreter that is portable and fully-featured. You can't punt on this problem because if the core Python interpreter isn't produced in just the right way, your application will depend on libraries or symbol versions not available in all environments.

I've created the python-build-standalone project for automating the process of building Python interpreters suitable for use with standalone, distributable Python applications. The project produces (and has available for download) binary artifacts including a pre-compiled Python interpreter and object files used for compiling that interpreter. The Python interpreter is compiled with PGO/LTO using a modern Clang, helping to ensure that Python code runs as fast as it can. All of Python's dependencies are compiled from source with the modern toolchain and everything is aggressively statically linked to avoid external dependencies. The toolchain and pre-built distribution are available for downstream consumers to compile Python extensions with/against.

It's worth noting that use of a modern Clang toolchain is likely sufficiently different from what you use today. When producing manylinux wheels, it is recommended to use the pypa/manylinux Docker images. These Docker images are based on CentOS 5 (for maximum libc and other system library compatibility). While they do install a custom toolchain, Python and any extensions compiled in that environment are compiled with GCC 4.8.2 (as of this writing). That's a GCC from 2013. A lot has changed in compilers since 2013 and building Python and extensions with a compiler released in 2018 should result in various benefits (faster code, better warnings, etc).

If producing custom CPython builds for standalone distribution interests you, you should take a look at how I coerced CPython to statically link all extensions. Spoiler: it involves producing a custom-tailored Modules/Setup.local file that bypasses setup.py, along with some Makefile hacks. Because the build environment is deterministic and isolated in a container, we can get away with some ugly hacks.

A statically linked libpython from which you can produce a standalone binary embedding Python is only the first layer in the onion. The next layer is how to handle the Python standard library.

libpython only contains the code needed to run the core bits of the Python interpreter. If we attempt to run a statically linked python executable without the standard library in the filesystem, things fail pretty fast:

$ rm -rf lib
$ bin/python
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: initfsencoding: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'

Current thread 0x00007fe9a3432740 (most recent call first):
Aborted (core dumped)

I'll spare you the details for the moment, but initializing the CPython interpreter (via Py_Initialize() requires that parts of the Python standard library be available). This means that in order to fulfill our dream of a single file executable, we will need custom code that teaches the embedded Python interpreter to load the standard library from within the binary... somehow.

As far as I know, efficient embedded standard library handling without run-time requirements does not exist in the current Python packaging/distribution ecosystem. So, I had to devise something new.

Enter PyOxidizer. PyOxidizer is a collection of Rust crates that facilitate building an embeddable Python library, which can easily be added to an executable. We need native code to interface with the Python C APIs in order to influence Python interpreter startup. It is 2018 and Rust is a better C/C++, so I chose Rust for this driver functionality instead of C. Plus, Rust's integrated build system makes it easier to automate the integration of the custom Python interpreter files into binaries.

The role of PyOxidizer is to take the pre-built Python interpreter files from python-build-standalone, combine those files with any other Python files needed to run an application, and marry them to a Rust crate. This Rust crate can trivially be turned into a self-contained executable containing a Python application. Or, it can be combined with another Rust project. Or it can be emitted as a library and integrated with a non-Rust application. There's a lot of flexibility by design.

The mechanism I used for embedding the Python standard library into a single file executable without incurring explicit filesystem access at run-time is (I believe) new, novel, and somewhat crazy. Let me explain how it works.

First, there are no .so/.pyd shared library compiled Python extensions to worry about. This is because all compiled extensions are statically linked into the Python interpreter. To the interpreter, they exist as built-in modules. Typically, a CPython build will have some modules like _abc, _io, and sys provided by built-in modules. Modules like _json exist as standalone shared libraries that are loaded on demand. python-build-standalone's modifications to CPython's build system converts all these would-be standalone shared libraries into built-in modules. (Because we distribute the object files that compose the eventual libpython, it is possible to filter out unwanted modules to cut down on binary size if you don't want to ship a fully-featured Python interpreter.) Because there are no standalone shared libraries providing Python modules, we don't have the problem of needing to load a shared library to load a module, which would undermine our goal of no filesystem access to import modules. And that's a good thing, too, because dlopen() requires a path: you can't load a shared library from a memory address. (Fun fact: there are hacks like dlopen_with_offset() that provide an API to load a library from memory, but they require a custom libc. Google uses this approach for their internal single-file Python application solution.)

From the python-build-standalone artifacts, PyOxidizer collects all files belonging to the Python standard library (notably .py and .pyc files). It also collects other source, bytecode, and resource files needed to run a custom application.

The relevant files are assembled and serialized into data structures which contain the names of the resources and their raw content. These data structures are made available to Rust as &'static [u8] variables (essentially a static void* if you don't speak Rust).

Using the rust-cpython crate, PyOxidizer defines a custom Python extension module implemented purely in Rust. When loaded, the module parses the data structures containing available Python resource names and data into HashMap<&str, &[u8]> instances. In other words, it builds a native mapping from resource name to a pointer to its raw data. The Rust-implemented module exports to Python an API for accessing that data. From the Python side, you do the equivalent of MODULES.get_code('foo') to request the bytecode for a named Python module. When called, the Rust code will perform the lookup and return a memoryview instance pointing to the raw data. (The use of &[u8] and memoryview means that embedded resource data is loaded from its static, read-only memory location instead of copied into a data structure managed by Python. This zero copy approach translates to less overhead for importing modules. Although, the memory needs to be paged in by the operating system. So on slow filesystems, reducing I/O and e.g. compressing module data might be a worthwhile optimization. This can be a future feature.)

Making data embedded within a binary available to a Python module is relatively easy. I'm definitely not the first person to come up with this idea. What is hard - and what I might be the first person to actually do - is how you make the Python module importing mechanism load all standard library modules via such a mechanism.

With a custom extension module built-in to the binary exposing module data, it should just be a matter of registering a custom sys.meta_path importer that knows how to load modules from that custom location. This problem turns out to be quite hard!

The initialization of a CPython interpreter is - as I've learned - a bit complex. A CPython interpreter must be initialized via Py_Initialize() before any Python code can run. That means in order to modify sys.meta_path, Py_Initialize() must finish.

A lot of activity occurs under the hood during initialization. Applications embedding Python have very little control over what happens during Py_Initialize(). You can change some superficial things like what filesystem paths to use to bootstrap sys.path and what encodings to use for stdio descriptors. But you can't really influence the core actions that are being performed. And there's no mechanism to directly influence sys.meta_path before an import is performed. (Perhaps there should be?)

During Py_Initialize(), the interpreter needs to configure the encodings for the filesystem and the stdio descriptors. Encodings are loaded from Python modules provided by the standard library. So, during the course of Py_Initialize(), the interpreter needs to import some modules originally backed by .py files. This creates a dilemma: if Py_Initialize() needs to import modules in the standard library, the standard library is backed by memory and isn't available to known importing mechanisms, and there's no opportunity to configure a custom sys.meta_path importer before Py_Initialize() runs, how do you teach the interpreter about your custom module importer and the location of the standard library modules needed by Py_Initialize()?

This is an extremely gnarly problem and it took me some hours and many false leads to come up with a solution.

My first attempt involved the esoteric frozen modules feature. (This work predated the use of a custom data structure and module containing modules data.) The Python interpreter has a const struct _frozen* PyImport_FrozenModules data structure defining an array of frozen modules. A frozen module is defined by its module name and precompiled bytecode data (roughly equivalent to .pyc file content). Partway through Py_Initialize(), the Python interpreter is able to import modules. And one of the built-in importers that is automatically registered knows how to load modules if they are in PyImport_FrozenModules!

I attempted to audit Python interpreter startup and find all modules that were imported during Py_Initialize(). I then defined a custom PyImport_FrozenModules containing these modules. In theory, the import of these modules during Py_Initialize() would be handled by the FrozenImporter and everything would just work: if I were able to get Py_Initialize() to complete, I'd be able to register a custom sys.meta_path importer immediately afterwards and we'd be set.

Things did not go as planned.

FrozenImporter doesn't fully conform to the PEP 451 requirements for setting specific attributes on modules. Without these attributes, the from . import aliases statement in encodings/__init__.py fails because the importer is unable to resolve the relative module name. Derp. One would think CPython's built-in importers would comply with PEP 451 and that all of Python's standard library could be imported as frozen modules. But this is not the case! I was able to hack around this particular failure by using an absolute import. But I hit another failure and did not want to excavate that rabbit hole. Once I realized that FrozenImporter was lacking mandated module attributes, I concluded that attempting to use frozen modules as a general import-from-memory mechanism was not viable. Furthermore, the C code backing FrozenImporter walks the PyImport_FrozenModules array and does a string compare on the module name to find matches. While I didn't benchmark, I was concerned that un-indexed scanning at import time would add considerable overhead when hundreds of modules were in play. (The C code backing BuiltinImporter uses the same approach and I do worry CPython's imports of built-in extension modules is causing measurable overhead.)

With frozen modules off the table, I needed to find another way to inject a custom module importer that was usable during Py_Initialize(). Because we control the source Python interpreter, modifications to the source code or even link-time modifications or run-time hacks like trampolines weren't off the table. But I really wanted things to work out of the box because I don't want to be in the business of maintaining patches to Python interpreters.

My foray into frozen modules enlightened me to the craziness that is the bootstrapping of Python's importing mechanism.

I remember hearing that the Python module importing mechanism used to be written in C and was rewritten in Python. And I knew that the importlib package defined interfaces allowing you to implement your own importers, which could be registered on sys.meta_path. But I didn't know how all of this worked at the interpreter level.

The internal initimport() C function is responsible for initializing the module importing mechanism. It does the equivalent of import _frozen_importlib, but using the PyImport_ImportFrozenModule() API. It then manipulates some symbols and calls _frozen_importlib.install() with references to the sys and imp built-in modules. Later (in initexternalimport()), a _frozen_importlib_external module is imported and has code within it executed.

I was initially very confused by this because - while there are references to _frozen_importlib and _frozen_importlib_external all over the CPython code base, I couldn't figure out where the code for those modules actually lived! Some sleuthing of the build directory eventually revealed that the files Lib/importlib/_bootstrap.py and Lib/importlib/_bootstrap_external.py were frozen to the module names _frozen_importlib and _frozen_importlib_external, respectively.

Essentially what is happening is the bulk of Python's import machinery is implemented in Python (rather than C). But there's a chicken-and-egg problem where you can't run just any Python code (including any import statement) until the interpreter is partially or fully initialized.

When building CPython, the Python source code for importlib._bootstrap and importlib._bootstrap_external are compiled to bytecode. This bytecode is emitted to .h files, where it is exposed as a static char *. This bytecode is eventually referenced by the default PyImport_FrozenModules array, allowing the modules to be imported via the frozen importer's C API, which bypasses the higher-level importing mechanism, allowing it to work before the full importing mechanism is initialized.

initimport() and initexternalimport() both call Python functions in the frozen modules. And we can clearly look at the source of the corresponding modules and see the Python code do things like register the default importers on sys.meta_path.

Whew, that was a long journey into the bowels of CPython's internals. How does all this help with single file Python executables?

Well, the predicament that led us down this rabbit hole was there was no way to register a custom module importer before Py_Initialize() completes and before an import is attempted during said Py_Initialize().

It took me a while, but I finally realized the frozen importlib._bootstrap_external module provided the window I needed! importlib._bootstrap_external/_frozen_importlib_external is always executed during Py_Initialize(). So if you can modify this module's code, you can run arbitrary code during Py_Initialize() and influence Python interpreter configuration. And since _frozen_importlib_external is a frozen module and the PyImport_FrozenModules array is writable and can be modified before Py_Initialize() is called, all one needs to do is replace the _frozen_importlib / _frozen_importlib_external bytecode in PyImport_FrozenModules and you can run arbitrary code during Python interpreter startup, before Py_Initialize() completes and before any standard library imports are performed!

My solution to this problem is to concatenate some custom Python code to importlib/_bootstrap_external.py. This custom code defines a sys.meta_path importer that knows how to use our Rust-backed built-in extension module to find and load module data. It redefines the _install() function so that this custom importer is registered on sys.meta_path when the function is called during Py_Initialize(). The new Python source is compiled to bytecode and the PyImport_FrozenModules array is modified at run-time to point to the modified _frozen_importlib_external implementation. When Py_Initialize() executes its first standard library import, module data is provided by the custom sys.meta_path importer, which grabs it from a Rust extension module, which reads it from a read-only data structure in the executable binary, which is converted to a Python memoryview instance and sent back to Python for processing.

There's a bit of magic happening behind the scenes to make all of this work. PyOxidizer attempts to hide as much of the gory details as possible. From the perspective of an application maintainer, you just need to define a minimal config file and it handles most of the low-level details. And there's even a higher-level Rust API for configuring the embedded Python interpreter, should you need it.

python-build-standalone and PyOxidizer are still in their infancy. They are very much alpha quality. I consider them technology previews more than usable software at this point. But I think enough is there to demonstrate the viability of using Rust as the build system and run-time glue to build and distribute standalone applications embedding Python.

Time will tell if my utopian vision of zero-copy, no explicit filesystem I/O for Python module imports will pan out. Others who have ventured into this space have warned me that lots of Python modules rely on __file__ to derive paths to other resources, which are later stat()d and open()d. __file__ for in-memory modules doesn't exactly make sense and can't be operated on like normal paths/files. I'm not sure what the inevitable struggles to support these modules will lead to. Maybe we'll have to extract things to temporary directories like other standalone Python applications. Maybe PyOxidizer will take off and people will start using the ResourceReader API, which is apparently the proper way to do these things these days. (Caveat: PyOxidizer doesn't yet implement this API but support is planned.) Time will tell. I'm not opposed to gross hacks or writing more code as needed.

Producing highly distributable and performant Python applications has been far too difficult for far too long. My primary goal for PyOxidizer is to lower these barriers. By leveraging Rust, I also hope to bring Python and Rust closer together. I want to enable applications and libraries to effortlessly harness the powers of both of these fantastic programming languages.

Again, PyOxidizer is still in its infancy. I anticipate a significant amount of hacking over the holidays and hope to share updates in the weeks ahead. Until then, please leave comments, watch the project on GitHub, file issues for bugs and feature requests, etc and we'll see where things lead.


Global Kernel Locks in APFS

October 29, 2018 at 02:20 PM | categories: Python, Mercurial, Apple

Over the past several months, a handful of people had been complaining that Mercurial's test harness was executing much slower on Macs. But this slowdown seemingly wasn't occurring on Linux or Windows. And not every Mac user experienced the slowness!

Before jetting off to the Mercurial 4.8 developer meetup in Stockholm a few weeks ago, I sat down with a relatively fresh 6+6 core MacBook Pro and experienced the problem firsthand: on my 4+4 core i7-6700K running Linux, the Mercurial test harness completes in ~12 minutes, but on this MacBook Pro, it was executing in ~38 minutes! On paper, this result doesn't make any sense because there's no way that the MacBook Pro should be ~3x slower than that desktop machine.

Looking at Activity Monitor when running the test harness with 12 tests in parallel revealed something odd: the system was spending ~75% of overall CPU time inside the kernel! When reducing the number of tests that ran in parallel, the percentage of CPU time spent in the kernel decreased and the overall test harness execution time also decreased. This kind of behavior is usually a sign of something very inefficient in kernel land.

I sample profiled all processes on the system when running the Mercurial test harness. Aggregate thread stacks revealed a common pattern: readdir() being in the stack.

Upon closer examination of the stacks, readdir() calls into apfs_vnop_readdir(), which calls into some functions with bt or btree in their name, which call into lck_mtx_lock(), lck_mtx_lock_grab_mutex() and various other functions with lck_mtx in their name. And the caller of most readdir() appeared to be Python 2.7's module importing mechanism (notably import.c:case_ok()).

APFS refers to the Apple File System, which is a filesystem that Apple introduced in 2017 and is the default filesystem for new versions of macOS and iOS. If upgrading an old Mac to a new macOS, its HFS+ filesystems would be automatically converted to APFS.

While the source code for APFS is not available for me to confirm, the profiling results showing excessive time spent in lck_mtx_lock_grab_mutex() combined with the fact that execution time decreases when the parallel process count decreases leads me to the conclusion that APFS obtains a global kernel lock during read-only operations such as readdir(). In other words, APFS slows down when attempting to perform parallel read-only I/O.

This isn't the first time I've encountered such behavior in a filesystem: last year I blogged about very similar behavior in AUFS, which was making Firefox CI significantly slower.

Because Python 2.7's module importing mechanism was triggering the slowness by calling readdir(), I posted to python-dev about the problem, as I thought it was important to notify the larger Python community. After all, this is a generic problem that affects the performance of starting any Python process when running on APFS. i.e. if your build system invokes many Python processes in parallel, you could be impacted by this. As part of obtaining data for that post, I discovered that Python 3.7 does not call readdir() as part of module importing and therefore doesn't exhibit a severe slowdown. (Python's module importing code was rewritten significantly in Python 3 and the fix was likely introduced well before Python 3.7.)

I've produced a gist that can reproduce the problem. The script essentially performs a recursive directory walk. It exercises the opendir(), readdir(), closedir(), and lstat() functions heavily and is essentially a benchmark of the filesystem and filesystem cache's ability to return file metadata.

When you tell it to walk a very large directory tree - say a Firefox version control checkout (which has over 250,000 files) - the excessive time spent in the kernel is very apparent on macOS 10.13 High Sierra:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s

real    2m52.470s
user    1m54.053s
sys    23m42.808s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 523.440s

real    8m43.740s
user    1m13.397s
sys     3m50.687s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 210.487s

real    3m30.731s
user    2m40.216s
sys    33m34.406s

On the same machine upgraded to macOS 10.14 Mojave, we see a bit of a speedup!:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 97.833s

real    1m37.981s
user    1m40.272s
sys    10m49.091s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 461.415s

real    7m41.657s
user    1m05.830s
sys     3m47.041s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 140.474s

real    2m20.727s
user    3m01.048s
sys    17m56.228s

Contrast with my i7-6700K Linux machine backed by EXT4:

$ time ./slow-readdir.py -l 8 ~/src/firefox
ran 8 walks across 8 processes in 6.018s

real    0m6.191s
user    0m29.670s
sys     0m17.838s

$ time ./slow-readdir.py -l 8 -j 1 ~/src/firefox
ran 8 walks across 1 processes in 33.958s

real    0m34.164s
user    0m17.136s
sys     0m13.369s

$ time ./slow-readdir.py -l 12 -j 12 ~/src/firefox
ran 12 walks across 12 processes in 25.465s

real    0m25.640s
user    1m4.801s
sys     1m20.488s

It is apparent that macOS 10.14 Mojave has received performance work relative to macOS 10.13! Overall kernel CPU time when performing parallel directory walks has decreased substantially - to ~50% of original on some invocations! Stacks seem to reveal new code for lock acquisition, so this might indicate generic improvements to the kernel's locking mechanism rather than APFS specific changes. Changes to file metadata caching could also be responsible for performance changes. Although it is difficult to tell without access to the APFS source code. Despite those improvements, APFS is still spending a lot of CPU time in the kernel. And the kernel CPU time is still comparatively very high compared to Linux/EXT4, even for single process operation.

At this time, I haven't conducted a comprehensive analysis of APFS to determine what other filesystem operations seem to acquire global kernel locks: all I know is readdir() does. A casual analysis of profiled stacks when running Mercurial's test harness against Python 3.7 seems to show apfs_* functions still on the stack a lot and that seemingly indicates more APFS slowness under parallel I/O load. But HFS+ exhibited similar problems (it appeared HFS+ used a single I/O thread inside the kernel for many operations, making I/O on macOS pretty bad), so I'm not sure if these could be considered regressions the way readdir()'s new behavior is.

I've reported this issue to Apple at https://bugreport.apple.com/web/?problemID=45648013 and on OpenRadar at https://openradar.appspot.com/radar?id=5025294012383232. I'm told that issues get more attention from Apple when there are many duplicates of the same issue. So please reference this issue if you file your own report.

Now that I've elaborated on the technical details, I'd like to add some personal commentary. While this post is about APFS, this issue of global kernel locks during common I/O operations is not unique to APFS. I already referenced similar issues in AUFS. And I've encountered similar behaviors with Btrfs (although I can't recall exactly which operations). And NTFS has its own bag of problems.

This seeming pattern of global kernel locks for common filesystem operations and slow filesystems is really rubbing me the wrong way. Modern NVMe SSDs are capable of reading and writing well over 2 gigabytes per second and performing hundreds of thousands of I/O operations per second. We even have Intel soon producing persistent solid state storage that plugs into DIMM slots because it is that friggin fast.

Today's storage hardware is capable of ludicrous performance. It is fast enough that you will likely saturate multiple CPU cores processing the read or written data coming from and going to storage - especially if you are using higher-level, non-JITed (read: slower) programming languages (like Python). There has also been a trend that systems are growing more CPU cores faster than they are instructions per second per core. And SSDs only achieve these ridiculous IOPS numbers if many I/O operations are queued and can be more efficiently dispatched within the storage device. What this all means is that it probably makes sense to use parallel I/O across multiple threads in order to extract all potential performance from your persistent storage layer.

It's also worth noting that we now have solid state storage that outperforms (in some dimensions) what DRAM from ~20 years ago was capable of. Put another way I/O APIs and even some filesystems were designed in an era when its RAM was slower than what today's persistent storage is capable of! While I'm no filesystems or kernel expert, it does seem a bit silly to be using APIs and filesystems designed for an era when storage was multiple orders of magnitude slower and systems only had a single CPU core.

My takeaway is I can't help but feel that systems-level software (including the kernel) is severely limiting the performance potential of modern storage devices. If we have e.g. global kernel locks when performing common I/O operations, there's no chance we'll come close to harnessing the full potential of today's storage hardware. Furthermore, the behavior of filesystems is woefully under documented and software developers have little solid advice for how to achieve optimal I/O performance. As someone who cares about performance, I want to squeeze every iota of potential out of hardware. But the lack of documentation telling me which operations acquire locks, which strategies are best for say reading or writing 10,000 files using N threads, etc makes this extremely difficult. And even if this documentation existed, because of differences in behavior across filesystems and operating systems and the difficulty in programmatically determining the characteristics of filesystems at run time, it is practically impossible to design a one size fits all approach to high performance I/O.

The filesystem is a powerful concept. I want to agree and use the everything is a file philosophy. Unfortunately, filesystems don't appear to be scaling very well to support the potential of modern day storage technology. We're probably at the point where commodity priced solid state storage is far more capable than today's software for the majority of applications. Storage hardware manufacturers will keep producing faster and faster storage and their marketing teams will keep convincing us that we need to buy it. But until software catches up, chances are most of us won't come close to realizing the true potential of modern storage hardware. And that's even true for specialized applications that do employ tricks taking hundreds or thousands of person hours to implement in order to eek out every iota of performance potential. The average software developer and application using filesystems as they were designed to be used has little to no chance of coming close to utilizing the performance potential of modern storage devices. That's really a shame.


« Previous Page -- Next Page »