Faster In-Memory Python Module Importing
December 28, 2018 at 12:40 PM | categories: Python, PyOxidizer, Rust

I recently blogged about distributing standalone Python applications. In that post, I announced PyOxidizer - a tool which leverages Rust to produce standalone executables embedding Python. One of the features of PyOxidizer is the ability to import Python modules embedded within the binary using zero-copy.
I also recently blogged about global kernel locks in APFS, which make filesystem operations slower on macOS. This was the latest wrinkle in a long battle against Python's slow startup times, which I've posted about on the official python-dev mailing list over the years.
Since I announced PyOxidizer a few days ago, I've had some productive holiday hacking sessions!
One of the milestones reached is that PyOxidizer now supports macOS.
With that milestone reached, I thought it would be interesting to compare the performance of a PyOxidizer executable versus a standard CPython build.
I produced a Python script
that imports almost the entirety of the Python standard library - at least the
modules implemented in Python. That's 508 import
statements. I then
executed this script using a typical python3.7
binary (with the standard
library on the filesystem) and PyOxidizer-produced standalone executables
with a module importer that loads Python modules from memory using zero copy.
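The script is nothing fancy. If you want to produce a similar one for your own Python build, a rough sketch follows (this is an approximation, not the exact script used for the numbers below; the filtering of private modules is an assumption):

# Emit a guarded "import X" statement for every top-level module found in
# the standard library directory. Submodules are ignored and some modules
# are skipped, so this only approximates the real script.
import pkgutil
import sysconfig

stdlib_dir = sysconfig.get_paths()['stdlib']

for name in sorted(m.name for m in pkgutil.iter_modules([stdlib_dir])):
    if name.startswith('_') or name == 'antigravity':
        continue
    print('try:')
    print('    import %s' % name)
    print('except ImportError:')
    print('    pass')

Redirecting the generated script into a Python binary on stdin - as in the timings below - then measures import overhead in bulk.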
# Homebrew installed CPython 3.7.2
# Cold disk cache.
$ sudo purge
$ time /usr/local/bin/python3.7 < import_stdlib.py
real 0m0.694s
user 0m0.354s
sys 0m0.121s
# Hot disk cache.
$ time /usr/local/bin/python3.7 < import_stdlib.py
real 0m0.319s
user 0m0.263s
sys 0m0.050s
# PyOxidizer with non-PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real 0m0.223s
user 0m0.201s
sys 0m0.017s
# PyOxidizer with PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real 0m0.234s
user 0m0.210s
sys 0m0.019s
# PyOxidizer with PGO+LTO CPython 3.7.2
$ sudo purge
$ time target/release/pyapp < import_stdlib.py
real 0m0.442s
user 0m0.252s
sys 0m0.059s
$ time target/release/pyapp < import_stdlib.py
real 0m0.221s
user 0m0.197s
sys 0m0.020s
First, the PyOxidizer times are all relatively similar regardless of whether PGO or LTO is used to build CPython. That's not too surprising, as I'm exercising a very limited subset of CPython (and I suspect the benefits of PGO/LTO aren't as pronounced due to the nature of the CPython API).
But the bigger result is the obvious speedup with PyOxidizer and its
in-memory importing: PyOxidizer can import almost the entirety of the
Python standard library ~100ms faster - or ~70% of original - than a
typical standalone CPython install with a hot disk cache! This comes
out to ~0.19ms per import
statement. If we run purge
to clear out
the disk cache, the performance delta increases to 252ms, or ~64% of
original. All these numbers are on a 2018 6-core 2.9 GHz i9 MacBook Pro,
which has a pretty decent SSD.
And on Linux on an i7-6700K running in a Hyper-V VM:
# pyenv installed CPython 3.7.2
# Cold disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real 0m0.405s
user 0m0.165s
sys 0m0.065s
# Hot disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real 0m0.193s
user 0m0.161s
sys 0m0.032s
# PyOxidizer with PGO CPython 3.7.2
# Cold disk cache.
$ time target/release/pyapp < import_stdlib.py
real 0m0.227s
user 0m0.145s
sys 0m0.016s
# Hot disk cache.
$ time target/release/pyapp < import_stdlib.py
real 0m0.152s
user 0m0.136s
sys 0m0.016s
On a hot disk cache, the run-time improvement of PyOxidizer is ~41ms, or
~78% of original. This comes out to ~0.08ms per import
statement. When
flushing caches by writing 3
to /proc/sys/vm/drop_caches
, the delta
increases to ~178ms, or ~56% of original.
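As an aside, if you want to see where import time goes on your own system, CPython 3.7 gained a -X importtime option that prints a per-module timing line to stderr as each import completes. It measures something different than the wall-clock comparisons above, but it is a convenient way to spot expensive imports regardless of which importer is in use:

$ python3.7 -X importtime -c 'import asyncio' 2>&1 | tail -n 5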
Using dtruss -c
to execute the binaries, the breakdown in system calls
occurring >10 times is clear:
# CPython standalone
fstatfs64 16
read_nocancel 19
ioctl 20
getentropy 22
pread 26
fcntl 27
sigaction 32
getdirentries64 34
fcntl_nocancel 106
mmap 114
close_nocancel 129
open_nocancel 130
lseek 148
open 168
close 170
read 282
fstat64 403
stat64 833
# PyOxidizer
lseek 10
read 12
read_nocancel 14
fstat64 16
ioctl 22
munmap 31
stat64 33
sysctl 33
sigaction 36
mmap 122
madvise 193
getentropy 315
PyOxidizer avoids hundreds of open()
, close()
, read()
,
fstat64()
, and stat64()
calls. And by avoiding these calls,
PyOxidizer not only avoids the userland-kernel overhead intrinsic to them,
but also any additional overhead that APFS is imposing via its global
lock(s).
(Why the PyOxidizer binary is making hundreds of calls to getentropy()
I'm not sure. It's definitely coming from Python as a side-effect of a
module import and it is something I'd like to fix, if possible.)
With this experiment, we finally have the ability to better isolate the impact of filesystem overhead on Python module importing and preliminary results indicate that the overhead is not insignificant - at least on the tested systems (I'll get data for Windows when PyOxidizer supports it). While the test is somewhat contrived (I don't think many applications import the entirety of the Python standard library), some Python applications do import hundreds of modules. And as I've written before, milliseconds matter. This is especially true if you are invoking Python processes hundreds or thousands of times in a build system, when running a test suite, for scripting, etc. Cumulatively you can be importing tens of thousands of modules. So I think shaving even fractions of a millisecond from module importing is important.
It's worth noting that in addition to the system call overhead, CPython's
path-based importer runs
substantially more
Python code
than PyOxidizer
and this likely contributes several milliseconds of overhead as well. Because
PyOxidizer applications are static, the importer can remain simple (finding a
module in PyOxidizer is essentially a Rust HashMap<String, Vec<u8>>
lookup).
While it might be useful to isolate the filesystem overhead from Python code
overhead, the thing that end-users care about is overall execution time: they
don't care where that overhead is coming from. So I think it is fair to compare
PyOxidizer - with its intrinsically simpler import model - with what Python
typically does (scanning sys.path
entries and looking for modules on the
filesystem).
Another difference is that PyOxidizer is almost completely statically linked.
By contrast, a typical CPython install has compiled extension modules as
standalone shared libraries and these shared libraries often link against
other shared libraries (such as libssl). From dtruss
timing information,
I don't believe this difference contributes to significant overhead, however.
Finally, I haven't yet optimized PyOxidizer. I still have a few tricks up my sleeve that can likely shave off more overhead from Python startup. But so far the results are looking very promising. I dare say they are looking promising enough that Python distributions themselves might want to look into the area more thoroughly and consider distribution defaults that rely less on the every-Python-module-is-a-separate-file model.
Stay tuned for more PyOxidizer updates in the near future!
(I updated this post a day after initial publication to add measurements for Linux.)
Distributing Standalone Python Applications
December 18, 2018 at 03:35 PM | categories: Python, PyOxidizer, Rust

The Problem
Packaging and application distribution is a hard problem on multiple dimensions. For Python, large aspects of this problem space are more or less solved if you are distributing open source Python libraries and your target audience is developers (use pip and PyPI). But if you are distributing Python applications - standalone executables that use Python - your world can be much more complicated.
One of the primary reasons why distributing Python applications is difficult is because of the complex and often sensitive relationship between a Python application and the environment it runs in.
For starters we have the Python interpreter itself. If your application doesn't distribute the Python interpreter, you are at the whims of the Python interpreter provided by the host machine. You may want to target Python 3.7 only. But because Python 3.5 or 3.6 is the most recent version installed by many Linux distros, you are forced to support older Python versions and all their quirks and lack of features.
Going down the rabbit hole, even the presence of a supposedly compatible version of the Python interpreter isn't a guarantee for success! For example, the Python interpreter could have a built-in extension that links against an old version of a library. Just last week I was encountering weird SQLite bugs in Firefox's automation because Python was using an old version of SQLite with known bugs. Installing a modern SQLite fixed the problems. Or the interpreter could have modifications or extra installed packages interfering with the operation of your application. There are never-ending corner cases. And I can tell you from my experience with having to support the Firefox build system (which uses Python heavily) that you will encounter these corner cases given a broad enough user base.
And even if the Python interpreter on the target machine is fully compatible, getting your code to run on that interpreter could be difficult! Several Python applications leverage compiled extensions linking against Python's C API. Distributing the precompiled form of the extension can be challenging, especially when your code needs to link against 3rd party libraries, which may conflict with something on the target system. And, the precompiled extensions need to be built in a very delicate manner to ensure they can run on as many target machines as possible. But not distributing pre-built binaries requires the end-user be able to compile Python extensions. Not every user has such an environment and forcing this requirement on them is not user friendly.
From an application developer's point of view, distributing a copy of the Python interpreter along with your application is the only reliable way of guaranteeing a more uniform end-user experience. Yes, you will still have variability because every machine is different. But you've eliminated the Python interpreter from the set of unknowns and that is a huge win. (Unfortunately, distributing a Python interpreter comes with a host of other problems such as size bloat, security/patching concerns, poking the OS packaging bears, etc. But those problems are for another post.)
Existing Solutions
There are tons of existing tools for solving the Python application distribution problem.
The approach that tools like Shiv
and PEX take is to leverage Python's
built-in support for running zip files. Essentially, if there is a zip
file containing a __main__.py
file and you execute python file.zip
(or have a zip file with a #!/usr/bin/env python
shebang), Python
can load modules in that zip file and execute an application within. Pretty
cool!
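To get a feel for the underlying mechanism (this is not what Shiv or PEX do internally - they layer dependency management and more on top), the standard library's zipapp module can turn a directory containing a __main__.py into such a self-executing archive:

# "myapp" is a placeholder directory containing __main__.py and friends.
# The -p argument writes the shebang line into the generated archive.
$ python3 -m zipapp myapp -o myapp.pyz -p '/usr/bin/env python3'
$ ./myapp.pyz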
This approach works great if your execution environment supports shebangs (Windows doesn't) and the Python interpreter is suitable. But if you need to support Windows or don't have control over the execution environment and can't guarantee the Python interpreter is good, this approach isn't suitable.
As stated above, we want to distribute the Python interpreter with our application to minimize variability. Let's talk about tools that do that.
XAR is a pretty cool offering from Facebook. XAR files are executables that contain SquashFS filesystems. For Python applications, the XAR contains a copy of the Python interpreter and all your Python modules. At run-time, the embedded SquashFS filesystem is mounted, the files within are accessed from that mount, and the Python interpreter is executed. If you squint hard enough, it is kind of like a pre-packaged, executable virtualenv which also contains the Python interpreter.
XARs are pretty cool (and aren't limited to Python). However, because XARs rely on SquashFS, they have a run-time requirement on the target machine. This is great if you only need to support Linux and macOS and your target machines support FUSE and SquashFS. But if you need to support Windows or a general user population without SquashFS support, XARs won't help you.
Zip files and XARs are great for enterprises that have tightly controlled environments. But for a general end-user population, we need something more robust against variance among target machines.
There are a handful of tools for packaging Python applications along with the Python interpreter in more resilient manners.
Nuitka converts Python source to C code then compiles and links that C code against libpython. You can perform a static link and compile everything down to a single executable. If you do the compiling properly, that executable should just work on pretty much every target machine. That's pretty cool and is exactly the kind of solution application distributors are looking for: you can't get much simpler than a self-contained executable! While I'd love to vouch for Nuitka and recommend using it, I haven't used it so can't. And I'll be honest, the prospect of compiling Python source to C code kind of terrifies me. That effectively makes Nuitka a new Python implementation and I'm not sure I can (yet) place the level of trust in Nuitka that I have for e.g. CPython and PyPy.
And that leads us to our final category of tools:
freezing your code. There
are a handful of tools like PyInstaller
which automate the process of building your Python application (often via
standard setup.py
mechanisms), assembling all the requisite bits of
the Python interpreter, and producing an artifact that can be distributed
to end users. There are even tools that produce Windows installers, RPMs,
DEBs, etc that you can sign and distribute.
These freezing tools are arguably the state of the art for Python application distribution to general user populations. On first glance it seems like all the needed tools are available here. But there are cracks below the surface.
Issues with Freezing
A common problem with freezing is it often relies on the Python interpreter used to build the frozen application. For example, when building a frozen application on Linux, it will bundle the system's Python interpreter with the frozen application. And that interpreter may link against libraries or libc symbol versions not available on all target machines. So, the build environment has to be just right in order for the binaries to run on as many target systems as possible. This isn't an insurmountable problem. But it adds overhead and complexity to application maintainers.
Another limitation is how these frozen applications handle importing Python modules.
Multiple tools take the approach of embedding an archive (usually a zip file)
in the executable containing the Python standard library bits not part of
libpython. This includes C extensions (compiled to .so
or .pyd
files)
and Python source (.py
) or bytecode (.pyc
) files. There is typically
a step - either at application start time or at module import time - where a
file is extracted to the filesystem such that Python's filesystem-based
importer can load it from there.
For example, PyInstaller extracts the standard library to a temporary directory at application start time (at least when running in single file mode). This can add significant overhead to the startup time of applications - more than enough to blow through people's ability to perceive something as instantaneous. This is acceptable for long-running applications. But for applications (like CLI tools or support tools for build systems), the overhead can be a non-starter. And, the mere fact that you are doing filesystem write I/O establishes a requirement that the application have write access to the filesystem and that write I/O can perform reasonably well lest application performance suffer. These can be difficult pills to swallow!
Another limitation is that these tools often assume the executable being produced is only a Python application. Sometimes Python is part of a larger application. It would be useful to produce a library that can easily be embedded within a larger application.
Improving the State of the Art
Existing Python application distribution mechanisms don't tick all the requirements boxes for me. We have tools that are suitable for internal distribution in well-defined enterprise environments. And we have tools that target general user populations, albeit with a burden on application maintainers and often come with a performance hit and/or limited flexibility.
I want something that allows me to produce a standalone, single file executable containing a Python interpreter, the Python standard library (or a subset of it), and all the custom code and resources my application needs. That executable should not require any additional library dependencies beyond what is already available on most target machines (e.g. libc). That executable should not require any special filesystem providers (e.g. FUSE/SquashFS) nor should it require filesystem write access nor perform filesystem write I/O at run-time. I should be able to embed a Python interpreter within a larger application, without the overhead of starting the Python interpreter if it isn't needed.
No existing solution ticks all of these boxes.
So I set out to build one.
One problem is producing a Python interpreter that is portable and fully-featured. You can't punt on this problem because if the core Python interpreter isn't produced in just the right way, your application will depend on libraries or symbol versions not available in all environments.
I've created the python-build-standalone project for automating the process of building Python interpreters suitable for use with standalone, distributable Python applications. The project produces (and has available for download) binary artifacts including a pre-compiled Python interpreter and object files used for compiling that interpreter. The Python interpreter is compiled with PGO/LTO using a modern Clang, helping to ensure that Python code runs as fast as it can. All of Python's dependencies are compiled from source with the modern toolchain and everything is aggressively statically linked to avoid external dependencies. The toolchain and pre-built distribution are available for downstream consumers to compile Python extensions with/against.
It's worth noting that use of a modern Clang toolchain is likely sufficiently different from what you use today. When producing manylinux wheels, it is recommended to use the pypa/manylinux Docker images. These Docker images are based on CentOS 5 (for maximum libc and other system library compatibility). While they do install a custom toolchain, Python and any extensions compiled in that environment are compiled with GCC 4.8.2 (as of this writing). That's a GCC from 2013. A lot has changed in compilers since 2013 and building Python and extensions with a compiler released in 2018 should result in various benefits (faster code, better warnings, etc).
If producing custom CPython builds for standalone distribution interests
you, you should take a look at how I coerced CPython to statically link
all extensions. Spoiler: it involves producing a custom-tailored
Modules/Setup.local
file that bypasses setup.py
, along with some
Makefile
hacks. Because the build environment is deterministic and isolated
in a container, we can get away with some ugly hacks.
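For those unfamiliar with it, Modules/Setup.local is a plain text file consumed by CPython's Makefile machinery: each line names an extension module and its source files (plus any compiler and linker flags), and a *static* marker directs everything after it to be compiled directly into libpython. A hypothetical excerpt (the paths and module selection here are illustrative, not what python-build-standalone actually emits):

*static*
_json _json.c
_heapq _heapqmodule.c
_ssl _ssl.c -I/tools/deps/include -L/tools/deps/lib -lssl -lcrypto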
A statically linked libpython
from which you can produce a standalone
binary embedding Python is only the first layer in the onion. The next layer
is how to handle the Python standard library.
libpython
only contains the code needed to run the core bits of the Python
interpreter. If we attempt to run a statically linked python
executable
without the standard library in the filesystem, things fail pretty fast:
$ rm -rf lib
$ bin/python
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: initfsencoding: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'
Current thread 0x00007fe9a3432740 (most recent call first):
Aborted (core dumped)
I'll spare you the details for the moment, but initializing the CPython
interpreter (via Py_Initialize()) requires that parts of the Python
standard library be available. This means that in order to fulfill our dream
of a single file executable, we will need custom code that teaches the
embedded Python interpreter to load the standard library from within the
binary... somehow.
As far as I know, efficient embedded standard library handling without run-time requirements does not exist in the current Python packaging/distribution ecosystem. So, I had to devise something new.
Enter PyOxidizer. PyOxidizer is a collection of Rust crates that facilitate building an embeddable Python library, which can easily be added to an executable. We need native code to interface with the Python C APIs in order to influence Python interpreter startup. It is 2018 and Rust is a better C/C++, so I chose Rust for this driver functionality instead of C. Plus, Rust's integrated build system makes it easier to automate the integration of the custom Python interpreter files into binaries.
The role of PyOxidizer is to take the pre-built Python interpreter files from python-build-standalone, combine those files with any other Python files needed to run an application, and marry them to a Rust crate. This Rust crate can trivially be turned into a self-contained executable containing a Python application. Or, it can be combined with another Rust project. Or it can be emitted as a library and integrated with a non-Rust application. There's a lot of flexibility by design.
The mechanism I used for embedding the Python standard library into a single file executable without incurring explicit filesystem access at run-time is (I believe) new, novel, and somewhat crazy. Let me explain how it works.
First, there are no .so
/.pyd
shared library compiled Python extensions
to worry about. This is because all compiled extensions are statically linked
into the Python interpreter. To the interpreter, they exist as
built-in modules.
Typically, a CPython build will have some modules like _abc
, _io
, and
sys
provided by built-in modules. Modules like _json
exist as standalone
shared libraries that are loaded on demand. python-build-standalone's
modifications to CPython's build system convert all these would-be standalone
shared libraries into built-in modules. (Because we distribute the object
files that compose the eventual libpython
, it is possible to filter out
unwanted modules to cut down on binary size if you don't want to ship a
fully-featured Python interpreter.) Because there are no standalone shared
libraries providing Python modules, we don't have the problem of needing to
load a shared library to load a module, which would undermine our goal of
no filesystem access to import modules. And that's a good thing, too,
because dlopen()
requires a path: you can't load a shared library from
a memory address. (Fun fact: there are hacks like
dlopen_with_offset()
that provide an API to load a library from memory, but they require a custom
libc. Google uses this approach for their internal single-file Python
application solution.)
From the python-build-standalone
artifacts, PyOxidizer collects all files
belonging to the Python standard library (notably .py
and .pyc
files).
It also collects other source, bytecode, and resource files needed to run
a custom application.
The relevant files are assembled and serialized into data structures which
contain the names of the resources and their raw content. These data structures
are made available to Rust as &'static [u8]
variables (essentially a
static void*
if you don't speak Rust).
Using the rust-cpython crate,
PyOxidizer defines a custom Python extension module implemented purely in Rust.
When loaded, the module parses the data structures containing available
Python resource names and data into HashMap<&str, &[u8]>
instances. In other
words, it builds a native mapping from resource name to a pointer to its raw
data. The Rust-implemented module exports to Python an API for accessing that
data. From the Python side, you do the equivalent of MODULES.get_code('foo')
to request the bytecode for a named Python module. When called, the Rust code
will perform the lookup and return a memoryview
instance pointing to the
raw data. (The use of &[u8]
and memoryview
means that embedded resource
data is loaded from its static, read-only memory location instead of copied
into a data structure managed by Python. This zero copy approach translates to
less overhead for importing modules. Although, the memory needs to be paged
in by the operating system. So on slow filesystems, reducing I/O and e.g.
compressing module data might be a worthwhile optimization. This can be a
future feature.)
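On the consuming side, very little Python is needed to take advantage of that zero-copy property, because marshal accepts any buffer-providing object. A sketch (the module name _pyoxidizer_modules and its get_code() API are hypothetical stand-ins for the Rust extension described above):

import marshal

try:
    import _pyoxidizer_modules as MODULES  # hypothetical Rust-backed built-in
except ImportError:
    MODULES = None  # not running inside a PyOxidizer-style binary

def load_module_code(name):
    # get_code() is assumed to return a memoryview over static data baked
    # into the executable; marshal.loads() reads straight from that buffer
    # without an intermediate bytes copy.
    return marshal.loads(MODULES.get_code(name))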
Making data embedded within a binary available to a Python module is relatively easy. I'm definitely not the first person to come up with this idea. What is hard - and what I might be the first person to actually do - is how you make the Python module importing mechanism load all standard library modules via such a mechanism.
With a custom extension module built-in to the binary exposing module data, it should just be a matter of registering a custom sys.meta_path importer that knows how to load modules from that custom location. This problem turns out to be quite hard!
The initialization of a CPython interpreter is - as I've learned - a bit
complex. A CPython interpreter must be initialized via Py_Initialize()
before any Python code can run. That means in order to modify sys.meta_path
,
Py_Initialize()
must finish.
A lot of activity occurs under the hood during initialization. Applications
embedding Python have very little control over what happens during
Py_Initialize()
. You can change some superficial things like what
filesystem paths to use to bootstrap sys.path
and what encodings to use
for stdio descriptors. But you can't really influence the core actions that are
being performed. And there's no mechanism to directly influence
sys.meta_path
before an import
is performed. (Perhaps there should be?)
During Py_Initialize()
, the interpreter needs to configure the encodings
for the filesystem and the stdio descriptors. Encodings are loaded from
Python modules provided by the standard library. So, during the course of
Py_Initialize()
, the interpreter needs to import some modules originally
backed by .py
files. This creates a dilemma: if Py_Initialize()
needs to import
modules in the standard library, the standard library
is backed by memory and isn't available to known importing mechanisms, and
there's no opportunity to configure a custom sys.meta_path
importer
before Py_Initialize()
runs, how do you teach the interpreter about
your custom module importer and the location of the standard library modules
needed by Py_Initialize()
?
This is an extremely gnarly problem and it took me some hours and many false leads to come up with a solution.
My first attempt involved the esoteric
frozen modules
feature. (This work predated the use of a custom data structure and module
containing modules data.) The Python interpreter has a
const struct _frozen* PyImport_FrozenModules
data structure defining an
array of frozen modules. A frozen module is defined by its module
name and precompiled bytecode data (roughly equivalent to .pyc
file
content). Partway through Py_Initialize()
, the Python interpreter is able
to import modules. And one of the built-in importers that is automatically
registered knows how to load modules if they are in PyImport_FrozenModules
!
I attempted to audit Python interpreter startup and find all modules
that were imported during Py_Initialize()
. I then defined a custom
PyImport_FrozenModules
containing these modules. In theory, the import
of these modules during Py_Initialize()
would be handled by the
FrozenImporter
and everything would just work: if I were able to get Py_Initialize()
to
complete, I'd be able to register a custom sys.meta_path
importer
immediately afterwards and we'd be set.
Things did not go as planned.
FrozenImporter
doesn't fully conform to the
PEP 451 requirements for
setting specific attributes on modules. Without these attributes, the
from . import aliases
statement in encodings/__init__.py
fails
because the importer is unable to resolve the relative module name. Derp.
One would think CPython's built-in importers would comply with PEP 451
and that all of Python's standard library could be imported as frozen modules.
But this is not the case! I was able to hack around this particular failure
by using an absolute import. But I hit another failure and did not want to
excavate that rabbit hole. Once I realized that FrozenImporter
was lacking
mandated module attributes, I concluded that attempting to use frozen modules
as a general import-from-memory mechanism was not viable. Furthermore, the
C code backing FrozenImporter
walks the PyImport_FrozenModules
array and
does a string compare on the module name to find matches. While I didn't
benchmark, I was concerned that un-indexed scanning at import time would
add considerable overhead when hundreds of modules were in play. (The C code
backing BuiltinImporter
uses the same approach and I do worry CPython's
imports of built-in extension modules is causing measurable overhead.)
With frozen modules off the table, I needed to find another way to inject
a custom module importer that was usable during Py_Initialize()
. Because
we control the source Python interpreter, modifications to the source code
or even link-time modifications or run-time hacks like trampolines weren't
off the table. But I really wanted things to work out of the box because
I don't want to be in the business of maintaining patches to Python
interpreters.
My foray into frozen modules enlightened me to the craziness that is the bootstrapping of Python's importing mechanism.
I remember hearing that the Python module importing mechanism used to be written in C and was rewritten in Python. And I knew that the importlib package defined interfaces allowing you to implement your own importers, which could be registered on sys.meta_path. But I didn't know how all of this worked at the interpreter level.
The internal initimport()
C function is responsible for initializing the module importing mechanism. It
does the equivalent of import _frozen_importlib
, but using the
PyImport_ImportFrozenModule()
API. It then manipulates some symbols and calls _frozen_importlib._install()
with references to the sys
and imp
built-in modules. Later (in
initexternalimport()
), a _frozen_importlib_external
module is imported
and has code within it executed.
I was initially very confused by this because - while there are references
to _frozen_importlib
and _frozen_importlib_external
all over the
CPython code base - I couldn't figure out where the code for those modules
actually lived! Some sleuthing of the build directory eventually revealed
that the files Lib/importlib/_bootstrap.py
and Lib/importlib/_bootstrap_external.py
were frozen to the module names _frozen_importlib
and
_frozen_importlib_external
, respectively.
Essentially what is happening is the bulk of Python's import machinery is
implemented in Python (rather than C). But there's a chicken-and-egg
problem where you can't run just any Python code (including any import
statement) until the interpreter is partially or fully initialized.
When building CPython, the Python source code for importlib._bootstrap
and importlib._bootstrap_external
is compiled to bytecode. This
bytecode is emitted to .h
files, where it is exposed as a
static char *
. This bytecode is eventually referenced by the
default PyImport_FrozenModules
array, allowing the modules to be
imported via the frozen importer's C API, which bypasses the higher-level
importing mechanism, allowing it to work before the full importing
mechanism is initialized.
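Conceptually, that freezing step is not much more than the following (a simplification - the real freeze tool is a C program that emits the bytes as a C array into an .h file, and the path below assumes you are sitting in a CPython source checkout):

# Compile a module's source to a code object, then marshal it to bytes.
# Those bytes are what a struct _frozen entry ultimately points at.
import marshal

with open('Lib/importlib/_bootstrap_external.py') as fh:
    source = fh.read()

code = compile(source, '<frozen importlib._bootstrap_external>', 'exec')
frozen = marshal.dumps(code)
print('%d bytes of frozen bytecode' % len(frozen))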
initimport()
and initexternalimport()
both call Python functions in
the frozen modules. And we can clearly look at the source of the
corresponding modules and see the Python code do things like
register the default importers on sys.meta_path
.
Whew, that was a long journey into the bowels of CPython's internals. How does all this help with single file Python executables?
Well, the predicament that led us down this rabbit hole was there was no
way to register a custom module importer before Py_Initialize()
completes and before an import
is attempted during said Py_Initialize()
.
It took me a while, but I finally realized the frozen
importlib._bootstrap_external
module provided the window I needed!
importlib._bootstrap_external
/_frozen_importlib_external
is always
executed during Py_Initialize()
. So if you can modify this module's code,
you can run arbitrary code during Py_Initialize()
and influence Python
interpreter configuration. And since _frozen_importlib_external
is a frozen
module and the PyImport_FrozenModules
array is writable and can be modified
before Py_Initialize()
is called, all one needs to do is replace the
_frozen_importlib
/ _frozen_importlib_external
bytecode in
PyImport_FrozenModules
and you can run arbitrary code during Python
interpreter startup, before Py_Initialize()
completes and before any
standard library imports are performed!
My solution to this problem is to concatenate some custom Python code to
importlib/_bootstrap_external.py
. This custom code defines a
sys.meta_path
importer that knows how to use our Rust-backed built-in
extension module to find and load module data. It redefines the _install()
function so that this custom importer is registered on sys.meta_path
when the function is called during Py_Initialize()
. The new Python
source is compiled to bytecode and the PyImport_FrozenModules
array is
modified at run-time to point to the modified _frozen_importlib_external
implementation. When Py_Initialize()
executes its first standard library
import, module data is provided by the custom sys.meta_path
importer,
which grabs it from a Rust extension module, which reads it from a
read-only data structure in the executable binary, which is converted
to a Python memoryview
instance and sent back to Python for processing.
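To make the shape of that appended code concrete, here is a simplified, self-contained sketch of a sys.meta_path importer that serves modules from an in-memory mapping. It is illustrative only: PyOxidizer's real importer pulls its data from the Rust extension module and deals with packages, built-in extension modules, and more.

# A minimal in-memory importer: any module whose marshal'd code object is
# in _MODULES can be imported without touching the filesystem.
import importlib.abc
import importlib.util
import marshal
import sys

_MODULES = {
    # In PyOxidizer this mapping is backed by memoryviews over static data
    # in the binary; here it is an ordinary dict for illustration.
    'hello': marshal.dumps(compile("GREETING = 'hi from memory'", '<memory>', 'exec')),
}

class MemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, fullname, path=None, target=None):
        if fullname not in _MODULES:
            return None
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # defer to the default module creation semantics

    def exec_module(self, module):
        code = marshal.loads(_MODULES[module.__name__])
        exec(code, module.__dict__)

sys.meta_path.insert(0, MemoryFinder())

import hello
print(hello.GREETING)  # -> hi from memory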
There's a bit of magic happening behind the scenes to make all of this work. PyOxidizer attempts to hide as much of the gory details as possible. From the perspective of an application maintainer, you just need to define a minimal config file and it handles most of the low-level details. And there's even a higher-level Rust API for configuring the embedded Python interpreter, should you need it.
python-build-standalone
and PyOxidizer
are still in their infancy.
They are very much alpha quality. I consider them technology previews more
than usable software at this point. But I think enough is there to demonstrate
the viability of using Rust as the build system and run-time glue to build
and distribute standalone applications embedding Python.
Time will tell if my utopian vision of zero-copy, no explicit filesystem
I/O for Python module imports will pan out. Others who have ventured into
this space have warned me that lots of Python modules rely on __file__
to derive paths to other resources, which are later stat()
d and
open()
d. __file__
for in-memory modules doesn't exactly make sense
and can't be operated on like normal paths/files. I'm not sure what the
inevitable struggles to support these modules will lead to. Maybe we'll have
to extract things to temporary directories like other standalone Python
applications. Maybe PyOxidizer
will take off and people will start using
the ResourceReader
API, which is apparently the proper way to do these things these days.
(Caveat: PyOxidizer
doesn't yet implement this API but support is planned.)
Time will tell. I'm not opposed to gross hacks or writing more code as
needed.
Producing highly distributable and performant Python applications has been far too difficult for far too long. My primary goal for PyOxidizer is to lower these barriers. By leveraging Rust, I also hope to bring Python and Rust closer together. I want to enable applications and libraries to effortlessly harness the powers of both of these fantastic programming languages.
Again, PyOxidizer
is still in its infancy. I anticipate a significant amount
of hacking over the holidays and hope to share updates in the weeks ahead. Until
then, please leave comments, watch the project on GitHub,
file issues for bugs and feature requests, etc and we'll see where things lead.
Absorbing Commit Changes in Mercurial 4.8
November 05, 2018 at 09:25 AM | categories: Mercurial, Mozilla

Every so often a tool you use introduces a feature that is so useful
that you can't imagine how things were before that feature existed.
The recent 4.8 release of the
Mercurial version control tool introduces
such a feature: the hg absorb
command.
hg absorb
is a mechanism to automatically and intelligently incorporate
uncommitted changes into prior commits. Think of it as hg histedit
or
git rebase -i
with auto squashing.
Imagine you have a set of changes to prior commits in your working
directory. hg absorb
figures out which changes map to which commits
and absorbs each of those changes into the appropriate commit. Using
hg absorb
, you can replace cumbersome and often merge conflict ridden
history editing workflows with a single command that often just works.
Read on for more details and examples.
Modern version control workflows often entail having multiple unlanded commits in flight. What this looks like varies heavily by the version control tool, standards and review workflows employed by the specific project/repository, and personal preferences.
A workflow practiced by a lot of projects is to author your commits into a sequence of standalone commits, with each commit representing a discrete, logical unit of work. Each commit is then reviewed/evaluated/tested on its own as part of a larger series. (This workflow is practiced by Firefox, the Git and Mercurial projects, and the Linux Kernel to name a few.)
A common task that arises when working with such a workflow is the need to incorporate changes into an old commit. For example, let's say we have a stack of the following commits:
$ hg show stack
@ 1c114a ansible/hg-web: serve static files as immutable content
o d2cf48 ansible/hg-web: synchronize templates earlier
o c29f28 ansible/hg-web: convert hgrc to a template
o 166549 ansible/hg-web: tell hgweb that static files are in /static/
o d46d6a ansible/hg-web: serve static template files from httpd
o 37fdad testing: only print when in verbose mode
/ (stack base)
o e44c2e (@) testing: install Mercurial 4.8 final
Contained within this stack are 5 commits changing the way that static files are served by hg.mozilla.org (but that's not important).
Let's say I submit this stack of commits for review. The reviewer spots a problem with the second commit (serve static template files from httpd) and wants me to make a change.
How do you go about making that change?
Again, this depends on the exact tool and workflow you are using.
A common workflow is to not rewrite the existing commits at all: you simply create a new fixup commit on top of the stack, leaving the existing commits as-is. e.g.:
$ hg show stack
o deadad fix typo in httpd config
o 1c114a ansible/hg-web: serve static files as immutable content
o d2cf48 ansible/hg-web: synchronize templates earlier
o c29f28 ansible/hg-web: convert hgrc to a template
o 166549 ansible/hg-web: tell hgweb that static files are in /static/
o d46d6a ansible/hg-web: serve static template files from httpd
o 37fdad testing: only print when in verbose mode
/ (stack base)
o e44c2e (@) testing: install Mercurial 4.8 final
When the entire series of commits is incorporated into the repository,
the end state of the files is the same, so all is well. But this strategy
of using fixup commits (while popular - especially with Git-based tooling
like GitHub that puts a larger emphasis on the end state of changes rather
than the individual commits) isn't practiced by all projects.
hg absorb
will not help you if this is your workflow.
A popular variation of this fixup commit workflow is to author a new commit then incorporate this commit into a prior commit. This typically involves the following actions:
<save changes to a file>
$ hg commit
<type commit message>
$ hg histedit
<manually choose what actions to perform to what commits>
OR
<save changes to a file>
$ git add <file>
$ git commit
<type commit message>
$ git rebase --interactive
<manually choose what actions to perform to what commits>
Essentially, you produce a new commit. Then you run a history editing command. You then tell that history editing command what to do (e.g. to squash or fold one commit into another), that command performs work and produces a set of rewritten commits.
In simple cases, you may make a simple change to a single file. Things are pretty straightforward. You need to know which two commits to squash together. This is often trivial. Although it can be cumbersome if there are several commits and it isn't clear which one should be receiving the new changes.
In more complex cases, you may make multiple modifications to multiple files. You may even want to squash your fixups into separate commits. And for some code reviews, this complex case can be quite common. It isn't uncommon for me to be incorporating dozens of reviewer-suggested changes across several commits!
These complex use cases are where things can get really complicated for version control tool interactions. Let's say we want to make multiple changes to a file and then incorporate those changes into multiple commits. To keep it simple, let's assume 2 modifications in a single file squashing into 2 commits:
<save changes to file>
$ hg commit --interactive
<select changes to commit>
<type commit message>
$ hg commit
<type commit message>
$ hg histedit
<manually choose what actions to perform to what commits>
OR
<save changes to file>
$ git add <file>
$ git add --interactive
<select changes to stage>
$ git commit
<type commit message>
$ git add <file>
$ git commit
<type commit message>
$ git rebase --interactive
<manually choose which actions to perform to what commits>
We can see that the number of actions required by users has already increased
substantially. Not captured by the number of lines is the effort that must go
into the interactive commands like hg commit --interactive
,
git add --interactive
, hg histedit
, and git rebase --interactive
. For
these commands, users must tell the VCS tool exactly what actions to take.
This takes time and requires some cognitive load. This ultimately distracts
the user from the task at hand, which is bad for concentration and productivity.
The user just wants to amend old commits: telling the VCS tool what actions
to take is an obstacle in their way. (A compelling argument can be made that
the work required with these workflows to produce a clean history is too much
effort and it is easier to make the trade-off favoring simpler workflows
versus cleaner history.)
These kinds of squash fixup workflows are what hg absorb
is designed to
make easier. When using hg absorb
, the above workflow can be reduced to:
<save changes to file>
$ hg absorb
<hit y to accept changes>
OR
<save changes to file>
$ hg absorb --apply-changes
Let's assume the following changes are made in the working directory:
$ hg diff
diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -76,7 +76,7 @@ LimitRequestFields 1000
# Serve static files straight from disk.
<Directory /repo/hg/htdocs/static/>
Options FollowSymLinks
- AllowOverride NoneTypo
+ AllowOverride None
Require all granted
</Directory>
@@ -86,7 +86,7 @@ LimitRequestFields 1000
# and URLs are versioned by the v-c-t revision, they are immutable
# and can be served with aggressive caching settings.
<Location /static/>
- Header set Cache-Control "max-age=31536000, immutable, bad"
+ Header set Cache-Control "max-age=31536000, immutable"
</Location>
#LogLevel debug
That is, we have 2 separate uncommitted changes to
ansible/roles/hg-web/templates/vhost.conf.j2
.
Here is what happens when we run hg absorb
:
$ hg absorb
showing changes for ansible/roles/hg-web/templates/vhost.conf.j2
@@ -78,1 +78,1 @@
d46d6a7 - AllowOverride NoneTypo
d46d6a7 + AllowOverride None
@@ -88,1 +88,1 @@
1c114a3 - Header set Cache-Control "max-age=31536000, immutable, bad"
1c114a3 + Header set Cache-Control "max-age=31536000, immutable"
2 changesets affected
1c114a3 ansible/hg-web: serve static files as immutable content
d46d6a7 ansible/hg-web: serve static template files from httpd
apply changes (yn)?
<press "y">
2 of 2 chunk(s) applied
hg absorb
automatically figured out that the 2 separate uncommitted changes
mapped to 2 different changesets (Mercurial's term for commit). It
printed a summary of what lines would be changed in what changesets and
prompted me to accept its plan for how to proceed. The human effort involved
is a quick review of the proposed changes and answering a prompt.
At a technical level, hg absorb
finds all uncommitted changes and
attempts to map each changed line to an unambiguous prior commit. For
every change that can be mapped cleanly, the uncommitted changes are
absorbed into the appropriate prior commit. Commits impacted by the
operation are rebased automatically. If a change cannot be mapped to an
unambiguous prior commit, it is left uncommitted and users can fall back
to an existing workflow (e.g. using hg histedit
).
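In rough pseudo-Python, that mapping step looks something like this (a conceptual sketch only - the names here are invented and the real implementation in Mercurial's absorb extension is considerably more involved):

# Attribute each modified line to the draft commit that last touched it.
# Hunks with exactly one unambiguous owner get folded into that commit;
# everything else stays uncommitted. annotate() and the hunk objects are
# hypothetical stand-ins for Mercurial internals.
def plan_absorb(working_dir_hunks):
    planned = {}       # commit -> hunks to fold in
    leftovers = []     # hunks we cannot place unambiguously

    for hunk in working_dir_hunks:
        owners = {annotate(line) for line in hunk.old_lines}
        if len(owners) == 1 and None not in owners:
            planned.setdefault(owners.pop(), []).append(hunk)
        else:
            leftovers.append(hunk)

    return planned, leftovers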
But wait - there's more!
The automatic rewriting logic of hg absorb
is implemented by following
the history of lines. This is fundamentally different from the approach
taken by hg histedit
or git rebase
, which tend to rely on merge
strategies based on the
3-way merge
to derive a new version of a file given multiple input versions. This
approach combined with the fact that hg absorb
skips over changes with
an ambiguous application commit means that hg absorb
will never
encounter merge conflicts! Now, you may be thinking if you ignore
lines with ambiguous application targets, the patch would always apply
cleanly using a classical 3-way merge. This statement logically sounds
correct. But it isn't: hg absorb
can avoid merge conflicts when the
merging performed by hg histedit
or git rebase -i
would fail.
The above example attempts to exercise such a use case. Focusing on the initial change:
diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -76,7 +76,7 @@ LimitRequestFields 1000
# Serve static files straight from disk.
<Directory /repo/hg/htdocs/static/>
Options FollowSymLinks
- AllowOverride NoneTypo
+ AllowOverride None
Require all granted
</Directory>
This patch needs to be applied against the commit which introduced it. That commit had the following diff:
diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -73,6 +73,15 @@ LimitRequestFields 1000
{% endfor %}
</Location>
+ # Serve static files from templates directory straight from disk.
+ <Directory /repo/hg/hg_templates/static/>
+ Options None
+ AllowOverride NoneTypo
+ Require all granted
+ </Directory>
+
+ Alias /static/ /repo/hg/hg_templates/static/
+
#LogLevel debug
LogFormat "%h %v %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""
ErrorLog "/var/log/httpd/hg.mozilla.org/error_log"
But after that commit was another commit with the following change:
diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -73,14 +73,21 @@ LimitRequestFields 1000
{% endfor %}
</Location>
- # Serve static files from templates directory straight from disk.
- <Directory /repo/hg/hg_templates/static/>
- Options None
+ # Serve static files straight from disk.
+ <Directory /repo/hg/htdocs/static/>
+ Options FollowSymLinks
AllowOverride NoneTypo
Require all granted
</Directory>
...
When we use hg histedit
or git rebase -i
to rewrite this history, the VCS
would first attempt to re-order commits before squashing 2 commits together.
When we attempt to reorder the fixup diff immediately after the commit that
introduces it, there is a good chance your VCS tool would encounter a merge
conflict. Essentially your VCS is thinking: you changed this line, but the
lines around the change in the final version are different from the lines
in the initial version; I don't know if those other lines matter, and therefore
I don't know what the end state should be, so I'm giving up and letting the
user choose for me.
But since hg absorb
operates at the line history level, it knows that this
individual line wasn't actually changed (even though the lines around it did),
assumes there is no conflict, and offers to absorb the change. So not only
is hg absorb
significantly simpler than today's hg histedit
or
git rebase -i
workflows in terms of VCS command interactions, but it can
also avoid time-consuming merge conflict resolution as well!
Another feature of hg absorb
is that all the rewriting occurs in memory
and the working directory is not touched when running the command. This means
that the operation is fast (working directory updates often account for a lot
of the execution time of hg histedit
or git rebase
commands). It also means
that tools looking at the last modified time of files (e.g. build systems
like GNU Make) won't rebuild extra (unrelated) files that were touched
as part of updating the working directory to an old commit in order to apply
changes. This makes hg absorb
more friendly to edit-compile-test-commit
loops and allows developers to be more productive.
And that's hg absorb
in a nutshell.
When I first saw a demo of hg absorb
at a Mercurial developer meetup, my
jaw - along with those all over the room - hit the figurative floor. I thought
it was magical and too good to be true. I thought Facebook (the original authors
of the feature) were trolling us with an impossible demo. But it was all real.
And now hg absorb
is available in the core Mercurial distribution for anyone
to use.
From my experience, hg absorb
just works almost all of the time: I run
the command and it maps all of my uncommitted changes to the appropriate
commit and there's nothing more for me to do! In a word, it is magical.
To use hg absorb
, you'll need to activate the absorb
extension. Simply
put the following in your hgrc
config file:
[extensions]
absorb =
hg absorb
is currently an experimental feature. That means there is
no commitment to backwards compatibility and some rough edges are
expected. I also anticipate new features (such as hg absorb --interactive
)
will be added before the experimental label is removed. If you encounter
problems or want to leave comments, file a bug,
make noise in #mercurial
on Freenode, or
submit a patch.
But don't let the experimental label scare you away from using it:
hg absorb
is being used by some large install bases and also by many
of the Mercurial core developers. The experimental label is mainly there
because it is a brand new feature in core Mercurial and the experimental
label is usually affixed to new features.
If you practice workflows that frequently require amending old commits, I
think you'll be shocked at how much easier hg absorb
makes these workflows.
I think you'll find it to be a game changer: once you use hg absorb
, you'll
soon wonder how you managed to get work done without it.
Global Kernel Locks in APFS
October 29, 2018 at 02:20 PM | categories: Python, Mercurial, Apple

Over the past several months, a handful of people had been complaining that Mercurial's test harness was executing much slower on Macs. But this slowdown seemingly wasn't occurring on Linux or Windows. And not every Mac user experienced the slowness!
Before jetting off to the Mercurial 4.8 developer meetup in Stockholm a few weeks ago, I sat down with a relatively fresh 6+6 core MacBook Pro and experienced the problem firsthand: on my 4+4 core i7-6700K running Linux, the Mercurial test harness completes in ~12 minutes, but on this MacBook Pro, it was executing in ~38 minutes! On paper, this result doesn't make any sense because there's no way that the MacBook Pro should be ~3x slower than that desktop machine.
Looking at Activity Monitor when running the test harness with 12 tests in parallel revealed something odd: the system was spending ~75% of overall CPU time inside the kernel! When reducing the number of tests that ran in parallel, the percentage of CPU time spent in the kernel decreased and the overall test harness execution time also decreased. This kind of behavior is usually a sign of something very inefficient in kernel land.
I sample profiled all processes on the system when running the Mercurial
test harness. Aggregate thread stacks revealed a common pattern:
readdir()
being in the stack.
Upon closer examination of the stacks, readdir()
calls into
apfs_vnop_readdir()
, which calls into some functions with bt
or
btree
in their name, which call into lck_mtx_lock()
,
lck_mtx_lock_grab_mutex()
and various other functions with
lck_mtx
in their name. And the caller of most readdir()
appeared
to be Python 2.7's module importing mechanism (notably
import.c:case_ok()
).
APFS refers to the Apple File System, which is a filesystem that Apple introduced in 2017 and is the default filesystem for new versions of macOS and iOS. If upgrading an old Mac to a new macOS, its HFS+ filesystems would be automatically converted to APFS.
While the source code for APFS is not available for me to confirm, the
profiling results showing excessive time spent in
lck_mtx_lock_grab_mutex()
combined with the fact that execution time
decreases when the parallel process count decreases leads me to the
conclusion that APFS obtains a global kernel lock during read-only
operations such as readdir()
. In other words, APFS slows down when
attempting to perform parallel read-only I/O.
This isn't the first time I've encountered such behavior in a filesystem: last year I blogged about very similar behavior in AUFS, which was making Firefox CI significantly slower.
Because Python 2.7's module importing mechanism was triggering the
slowness by calling readdir()
, I
posted to python-dev
about the problem, as I thought it was important to notify the larger
Python community. After all, this is a generic problem that affects
the performance of starting any Python process when running on APFS.
i.e. if your build system invokes many Python processes in parallel,
you could be impacted by this. As part of obtaining data for that post, I
discovered that Python 3.7 does not call readdir()
as part of
module importing and therefore doesn't exhibit a severe slowdown. (Python's
module importing code was rewritten significantly in Python 3 and the fix
was likely introduced well before Python 3.7.)
I've produced a gist that can reproduce the problem.
The script essentially performs a recursive directory walk. It exercises
the opendir()
, readdir()
, closedir()
, and lstat()
functions
heavily and is essentially a benchmark of the filesystem and filesystem
cache's ability to return file metadata.
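If you just want the flavor of it without fetching the gist, a minimal stand-in looks like the following (the command-line handling is simplified relative to the real script):

# Walk a directory tree repeatedly across a pool of processes, lstat()ing
# every file, to stress the filesystem's metadata paths. A rough stand-in
# for the slow-readdir.py gist.
import os
import sys
import time
from multiprocessing import Pool

def walk_once(root):
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass
    return count

if __name__ == '__main__':
    root, walks, procs = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    t0 = time.time()
    with Pool(procs) as pool:
        pool.map(walk_once, [root] * walks)
    print('ran %d walks across %d processes in %.3fs'
          % (walks, procs, time.time() - t0))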
When you tell it to walk a very large directory tree - say a Firefox version control checkout (which has over 250,000 files) - the excessive time spent in the kernel is very apparent on macOS 10.13 High Sierra:
$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s
real 2m52.470s
user 1m54.053s
sys 23m42.808s
$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 523.440s
real 8m43.740s
user 1m13.397s
sys 3m50.687s
$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 210.487s
real 3m30.731s
user 2m40.216s
sys 33m34.406s
On the same machine upgraded to macOS 10.14 Mojave, we see a bit of a speedup!:
$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 97.833s
real 1m37.981s
user 1m40.272s
sys 10m49.091s
$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 461.415s
real 7m41.657s
user 1m05.830s
sys 3m47.041s
$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 140.474s
real 2m20.727s
user 3m01.048s
sys 17m56.228s
Contrast with my i7-6700K Linux machine backed by EXT4:
$ time ./slow-readdir.py -l 8 ~/src/firefox
ran 8 walks across 8 processes in 6.018s
real 0m6.191s
user 0m29.670s
sys 0m17.838s
$ time ./slow-readdir.py -l 8 -j 1 ~/src/firefox
ran 8 walks across 1 processes in 33.958s
real 0m34.164s
user 0m17.136s
sys 0m13.369s
$ time ./slow-readdir.py -l 12 -j 12 ~/src/firefox
ran 12 walks across 12 processes in 25.465s
real 0m25.640s
user 1m4.801s
sys 1m20.488s
It is apparent that macOS 10.14 Mojave has received performance work relative to macOS 10.13! Overall kernel CPU time when performing parallel directory walks has decreased substantially - to ~50% of original on some invocations. Stacks seem to reveal new code for lock acquisition, so this might indicate generic improvements to the kernel's locking mechanism rather than APFS-specific changes. Changes to file metadata caching could also be responsible, although it is difficult to tell without access to the APFS source code. Despite those improvements, APFS is still spending a lot of CPU time in the kernel, and that kernel CPU time remains very high compared to Linux/EXT4, even for single-process operation.
At this time, I haven't conducted a comprehensive analysis of APFS to determine what other filesystem operations seem to acquire global kernel locks: all I know is that readdir() does. A casual analysis of profiled stacks when running Mercurial's test harness against Python 3.7 still shows apfs_* functions on the stack a lot, which seemingly indicates more APFS slowness under parallel I/O load. But HFS+ exhibited similar problems (it appeared HFS+ used a single I/O thread inside the kernel for many operations, making I/O on macOS pretty bad), so I'm not sure if these could be considered regressions the way readdir()'s new behavior is.
I've reported this issue to Apple at https://bugreport.apple.com/web/?problemID=45648013 and on OpenRadar at https://openradar.appspot.com/radar?id=5025294012383232. I'm told that issues get more attention from Apple when there are many duplicates of the same issue. So please reference this issue if you file your own report.
Now that I've elaborated on the technical details, I'd like to add some personal commentary. While this post is about APFS, this issue of global kernel locks during common I/O operations is not unique to APFS. I already referenced similar issues in AUFS. And I've encountered similar behaviors with Btrfs (although I can't recall exactly which operations). And NTFS has its own bag of problems.
This seeming pattern of global kernel locks for common filesystem operations - and of slow filesystems more generally - is really rubbing me the wrong way. Modern NVMe SSDs are capable of reading and writing well over 2 gigabytes per second and performing hundreds of thousands of I/O operations per second. We even have Intel soon producing persistent solid state storage that plugs into DIMM slots because it is that friggin fast.
Today's storage hardware is capable of ludicrous performance. It is fast enough that you will likely saturate multiple CPU cores processing the data being read from and written to storage - especially if you are using higher-level, non-JITed (read: slower) programming languages (like Python). There has also been a trend of systems gaining CPU cores faster than they gain instructions per second per core. And SSDs only achieve these ridiculous IOPS numbers if many I/O operations are queued and can be efficiently dispatched within the storage device. What this all means is that it probably makes sense to use parallel I/O across multiple threads in order to extract all potential performance from your persistent storage layer.
It's also worth noting that we now have solid state storage that outperforms (in some dimensions) what DRAM from ~20 years ago was capable of. Put another way, I/O APIs and even some filesystems were designed in an era when a machine's RAM was slower than what today's persistent storage is capable of! While I'm no filesystems or kernel expert, it does seem a bit silly to be using APIs and filesystems designed for an era when storage was multiple orders of magnitude slower and systems only had a single CPU core.
My takeaway is that I can't help but feel systems-level software (including the kernel) is severely limiting the performance potential of modern storage devices. If we have e.g. global kernel locks when performing common I/O operations, there's no chance we'll come close to harnessing the full potential of today's storage hardware. Furthermore, the behavior of filesystems is woefully underdocumented and software developers have little solid advice for how to achieve optimal I/O performance. As someone who cares about performance, I want to squeeze every iota of potential out of hardware. But the lack of documentation telling me which operations acquire locks, which strategies are best for, say, reading or writing 10,000 files using N threads, etc. makes this extremely difficult. And even if this documentation existed, differences in behavior across filesystems and operating systems, plus the difficulty of programmatically determining the characteristics of filesystems at run time, make it practically impossible to design a one-size-fits-all approach to high-performance I/O.
The filesystem is a powerful concept. I want to agree with and use the everything is a file philosophy. Unfortunately, filesystems don't appear to be scaling very well to support the potential of modern day storage technology. We're probably at the point where commodity priced solid state storage is far more capable than today's software for the majority of applications. Storage hardware manufacturers will keep producing faster and faster storage and their marketing teams will keep convincing us that we need to buy it. But until software catches up, chances are most of us won't come close to realizing the true potential of modern storage hardware. And that's even true for specialized applications that do employ tricks taking hundreds or thousands of person hours to implement in order to eke out every iota of performance potential. The average software developer and application using filesystems as they were designed to be used has little to no chance of coming close to utilizing the performance potential of modern storage devices. That's really a shame.
Benefits of Clone Offload on Version Control Hosting
July 27, 2018 at 03:48 PM | categories: Mercurial, Mozilla
Back in 2015, I implemented a feature in Mercurial 3.6 that allows servers to advertise URLs of pre-generated bundle files. When a compatible client performs a hg clone against a repository leveraging this feature, it downloads and applies the bundle from a URL, then goes back to the server and performs the equivalent of an hg pull to obtain the changes to the repository made after the bundle was generated.
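Mechanically, the server side of this feature boils down to publishing a manifest listing bundle URLs and their attributes; compatible clients fetch the manifest, pick a suitable entry, and apply it before pulling the remainder. As a rough illustration only - the URLs and attribute values below are invented, and hg help clonebundles documents the actual format - a server-side manifest could look something like this:
$ cat .hg/clonebundles.manifest
https://cdn.example.com/mozilla-unified/gzip-v2.hg BUNDLESPEC=gzip-v2
https://s3.example.com/mozilla-unified/zstd-v2.hg BUNDLESPEC=zstd-v2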
On hg.mozilla.org, we've been using this feature since 2015. We host bundles in Amazon S3 and make them available via the CloudFront CDN. We perform IP filtering on the server so clients connecting from AWS IPs are served S3 URLs corresponding to the closest region / S3 bucket where bundles are hosted. Most Firefox build and test automation is run out of EC2 and automatically clones high-volume repositories from an S3 bucket hosted in the same AWS region. (Doing an intra-region transfer is very fast and clones can run at >50 MB/s.) Everyone else clones from a CDN. See our official docs for more.
I last reported on this feature in October 2015. Since then, Bitbucket also deployed this feature in early 2017.
I was reminded of this clone bundles feature this week when kernel.org posted Best way to do linux clones for your CI and that post was making the rounds in my version control circles. tl;dr git.kernel.org apparently suffers high load due to high clone volume against the Linux Git repository, and since Git doesn't have an equivalent of clone bundles built in, they are asking people to implement equivalent functionality themselves to mitigate server load.
(A clone bundles feature has been discussed on the Git mailing list before. I remember finding old discussions when I was doing research for Mercurial's feature in 2015. I'm sure the topic has come up since.)
Anyway, I thought I'd provide an update on just how valuable the clone bundles feature is to Mozilla. In doing so, I hope maintainers of other version control tools see the obvious benefits and consider adopting the feature sooner.
In a typical week, hg.mozilla.org is currently serving ~135 TB of data. The overwhelming majority of this data is related to the Mercurial wire protocol (i.e. not HTML / JSON served from the web interface). Of that ~135 TB, ~5 TB is served from the CDN, ~126 TB is served from S3, and ~4 TB is served from the Mercurial servers themselves. In other words, we're offloading ~97% of bytes served from the Mercurial servers to S3 and the CDN.
If we assume this offloaded ~131 TB is equally distributed throughout the week, it comes out to ~1,732 Mbps on average. In reality, most of our load occurs between Sunday evening and early Friday evening, California time. And load is typically concentrated in the ~12 hours when the sun is over Europe and North America (where most of Mozilla's employees are based). So the typical throughput we are offloading is more than 2 Gbps. At a finer granularity, automation tends to perform clones soon after a push is made, so load fluctuates significantly throughout the day, corresponding to when pushes are made.
By volume, most of the data being offloaded is for the mozilla-unified Firefox repository. Without clone bundles and without the special stream clone Mercurial feature (which we also leverage via clone bundles), the servers would be generating and sending ~1,588 MB of zstandard level 3 compressed data for each clone of that repository. Each clone would consume ~280s of CPU time on the server. And at ~195,000 clones per month, that would come out to ~309 TB/mo or ~72 TB/week. In CPU time, that would be ~54.6 million CPU-seconds, or ~21 CPU-months. I will leave it as an exercise to the reader to attach a dollar cost to how much it would take to operate this service without clone bundles. But I will say the total AWS bill for our S3 and CDN hosting for this service is under $50 per month. (It is worth noting that intra-region data transfer from S3 to other AWS services is free. And we are definitely taking advantage of that.)
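For those who want to check the arithmetic, here is a quick back-of-the-envelope sketch in Python using the approximate figures quoted above:
# Rough arithmetic check of the figures above (all inputs are approximate).
offloaded_tb_per_week = 131
seconds_per_week = 7 * 24 * 3600
avg_mbps = offloaded_tb_per_week * 1e12 * 8 / seconds_per_week / 1e6
print(round(avg_mbps))  # ~1733 Mbps averaged across the whole week

clones_per_month = 195_000
mb_per_clone = 1_588
cpu_seconds_per_clone = 280
print(clones_per_month * mb_per_clone / 1e6)                   # ~309.7 TB/month served
print(clones_per_month * cpu_seconds_per_clone / 1e6)          # ~54.6 million CPU-seconds
print(clones_per_month * cpu_seconds_per_clone / 86_400 / 30)  # ~21 CPU-months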
Despite a significant increase in the size of the Firefox repository and clone volume of it since 2015, our servers are still performing less work (in terms of bytes transferred and CPU seconds consumed) than they were in 2015. The ~97% of bytes and millions of CPU seconds offloaded in any given week have given us a lot of breathing room and have saved Mozilla several thousand dollars in hosting costs. The feature has likely helped us avoid many operational incidents due to high server load. It has made Firefox automation faster and more reliable.
Succinctly, Mercurial's clone bundles feature has successfully and largely effortlessly offloaded a ton of load from the hg.mozilla.org Mercurial servers. Other version control tools should implement this feature because it is a game changer for server operators and results in a better client-side experience (eliminates server-side CPU bottleneck and may eliminate network bottleneck due to a geo-local CDN typically being as fast as your Internet pipe). It's a win-win. And a massive win if you are operating at scale.