Faster In-Memory Python Module Importing

December 28, 2018 at 12:40 PM | categories: Python, PyOxidizer, Rust

I recently blogged about distributing standalone Python applications. In that post, I announced PyOxidizer - a tool which leverages Rust to produce standalone executables embedding Python. One of the features of PyOxidizer is the ability to import Python modules embedded within the binary using zero-copy.
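To make the idea concrete, here is a minimal pure-Python sketch of an in-memory module importer: a sys.meta_path finder that serves module code from a dict instead of the filesystem. This is only an illustration of the concept - PyOxidizer's actual importer is implemented in Rust and operates on data embedded in the executable without copying it - and the names MEMORY_MODULES and InMemoryImporter are hypothetical.

import importlib.abc
import importlib.util
import sys

# Hypothetical mapping of module name -> module source held in memory.
MEMORY_MODULES = {
    "hello": "GREETING = 'hello from memory'\n",
}

class InMemoryImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, fullname, path=None, target=None):
        if fullname in MEMORY_MODULES:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # defer to the regular path-based finders

    def exec_module(self, module):
        # No open()/stat() calls: the module body comes straight from memory.
        exec(MEMORY_MODULES[module.__name__], module.__dict__)

sys.meta_path.insert(0, InMemoryImporter())

import hello  # resolved from MEMORY_MODULES, not the filesystem
print(hello.GREETING)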

I also recently blogged about global kernel locks in APFS, which make filesystem operations slower on macOS. This was the latest wrinkle in a long battle against Python's slow startup times, which I've posted about on the official python-dev mailing list over the years.

Since I announced PyOxidizer a few days ago, I've had some productive holiday hacking sessions!

One of the milestones reached is that PyOxidizer now supports macOS.

With that milestone reached, I thought it would be interesting to compare the performance of a PyOxidizer executable versus a standard CPython build.

I produced a Python script that imports almost the entirety of the Python standard library - at least the modules implemented in Python. That's 508 import statements. I then executed this script using a typical python3.7 binary (with the standard library on the filesystem) and PyOxidizer-produced standalone executables with a module importer that loads Python modules from memory using zero copy.
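The exact script isn't reproduced here, but generating something equivalent is straightforward - roughly along these lines (the module filtering below is an approximation, not necessarily what the post's script did):

import pkgutil
import sysconfig

# Emit an import statement for each top-level module shipped in the
# standard library directory (i.e. the modules implemented in Python).
stdlib_dir = sysconfig.get_paths()["stdlib"]

for module in sorted(pkgutil.iter_modules([stdlib_dir]), key=lambda m: m.name):
    # Skip private/internal modules; a real script would also skip
    # platform-specific modules that fail to import.
    if module.name.startswith("_"):
        continue
    print("import " + module.name)

The generated file is then fed to each binary on stdin, as in the sessions below.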

# Homebrew installed CPython 3.7.2

# Cold disk cache.
$ sudo purge
$ time /usr/local/bin/python3.7 < import_stdlib.py
real   0m0.694s
user   0m0.354s
sys    0m0.121s

# Hot disk cache.
$ time /usr/local/bin/python3.7 < import_stdlib.py
real   0m0.319s
user   0m0.263s
sys    0m0.050s

# PyOxidizer with non-PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real   0m0.223s
user   0m0.201s
sys    0m0.017s

# PyOxidizer with PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real   0m0.234s
user   0m0.210s
sys    0m0.019s

# PyOxidizer with PGO+LTO CPython 3.7.2

# Cold disk cache.
$ sudo purge
$ time target/release/pyapp < import_stdlib.py
real   0m0.442s
user   0m0.252s
sys    0m0.059s

# Hot disk cache.
$ time target/release/pyapp < import_stdlib.py
real   0m0.221s
user   0m0.197s
sys    0m0.020s

First, the PyOxidizer times are all relatively similar regardless of whether PGO or LTO is used to build CPython. That's not too surprising, as I'm exercising a very limited subset of CPython (and I suspect the benefits of PGO/LTO aren't as pronounced due to the nature of the CPython API).

But the bigger result is the obvious speedup with PyOxidizer and its in-memory importing: PyOxidizer can import almost the entirety of the Python standard library ~100ms faster than a typical standalone CPython install with a hot disk cache - roughly 70% of the original wall time! This comes out to ~0.19ms per import statement. If we run purge to clear out the disk cache, the delta increases to 252ms, or ~64% of original. All these numbers are from a 2018 6-core 2.9 GHz i9 MacBook Pro, which has a pretty decent SSD.
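For the curious, the per-import figures fall straight out of the wall-clock times above; a quick sanity check:

# Derived from the macOS wall-clock times above.
imports = 508

cpython_hot, pyoxidizer_hot = 0.319, 0.221    # seconds, hot disk cache
cpython_cold, pyoxidizer_cold = 0.694, 0.442  # seconds, cold disk cache

hot_delta = cpython_hot - pyoxidizer_hot
print("hot:  %.0f ms faster, %.0f%% of original, %.2f ms/import"
      % (hot_delta * 1000, 100 * pyoxidizer_hot / cpython_hot, hot_delta * 1000 / imports))

cold_delta = cpython_cold - pyoxidizer_cold
print("cold: %.0f ms faster, %.0f%% of original"
      % (cold_delta * 1000, 100 * pyoxidizer_cold / cpython_cold))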

And on Linux on an i7-6700K running in a Hyper-V VM:

# pyenv installed CPython 3.7.2

# Cold disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real   0m0.405s
user   0m0.165s
sys    0m0.065s

# Hot disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real   0m0.193s
user   0m0.161s
sys    0m0.032s

# PyOxidizer with PGO CPython 3.7.2

# Cold disk cache.
$ time target/release/pyapp < import_stdlib.py
real   0m0.227s
user   0m0.145s
sys    0m0.016s

# Hot disk cache.
$ time target/release/pyapp < import_stdlib.py
real   0m0.152s
user   0m0.136s
sys    0m0.016s

On a hot disk cache, the run-time improvement of PyOxidizer is ~41ms, or ~78% of original. This comes out to ~0.08ms per import statement. When flushing caches by writing 3 to /proc/sys/vm/drop_caches, the delta increases to ~178ms, or ~56% of original.

Executing the binaries under dtruss -c, the breakdown of system calls occurring >10 times is clear:

# CPython standalone
fstatfs64                                      16
read_nocancel                                  19
ioctl                                          20
getentropy                                     22
pread                                          26
fcntl                                          27
sigaction                                      32
getdirentries64                                34
fcntl_nocancel                                106
mmap                                          114
close_nocancel                                129
open_nocancel                                 130
lseek                                         148
open                                          168
close                                         170
read                                          282
fstat64                                       403
stat64                                        833

# PyOxidizer
lseek                                          10
read                                           12
read_nocancel                                  14
fstat64                                        16
ioctl                                          22
munmap                                         31
stat64                                         33
sysctl                                         33
sigaction                                      36
mmap                                          122
madvise                                       193
getentropy                                    315

PyOxidizer avoids hundreds of open(), close(), read(), fstat64(), and stat64() calls. And by avoiding these calls, PyOxidizer not only avoids the userland-kernel overhead intrinsic to them, but also any additional overhead that APFS is imposing via its global lock(s).

(Why the PyOxidizer binary is making hundreds of calls to getentropy() I'm not sure. It's definitely coming from Python as a side-effect of a module import and it is something I'd like to fix, if possible.)

With this experiment, we finally have the ability to better isolate the impact of filesystem overhead on Python module importing and preliminary results indicate that the overhead is not insignificant - at least on the tested systems (I'll get data for Windows when PyOxidizer supports it). While the test is somewhat contrived (I don't think many applications import the entirety of the Python standard library), some Python applications do import hundreds of modules. And as I've written before, milliseconds matter. This is especially true if you are invoking Python processes hundreds or thousands of times in a build system, when running a test suite, for scripting, etc. Cumulatively you can be importing tens of thousands of modules. So I think shaving even fractions of a millisecond from module importing is important.

It's worth noting that in addition to the system call overhead, CPython's path-based importer runs substantially more Python code than PyOxidizer and this likely contributes several milliseconds of overhead as well. Because PyOxidizer applications are static, the importer can remain simple (finding a module in PyOxidizer is essentially a Rust HashMap<String, Vec<u8>> lookup). While it might be useful to isolate the filesystem overhead from Python code overhead, the thing end-users care about is overall execution time: they don't care where that overhead comes from. So I think it is fair to compare PyOxidizer - with its intrinsically simpler import model - with what Python typically does (scanning sys.path entries and looking for modules on the filesystem).
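To illustrate where the stat64/open counts in the dtruss output come from, here is a heavily simplified sketch of what a path-based lookup has to do for every import. The real machinery (importlib's PathFinder/FileFinder) caches directory listings and also probes for extension modules, so treat this as an approximation of the cost model, not CPython's actual algorithm:

import os
import sys

def find_module_on_disk(name):
    # For each sys.path entry, probe a handful of candidate filenames.
    # Every probe is at least one stat()-style system call.
    candidates = (
        os.path.join(name, "__init__.py"),  # package
        name + ".py",                       # pure Python module
    )
    for entry in sys.path:
        for candidate in candidates:
            path = os.path.join(entry, candidate)
            if os.path.exists(path):
                return path
    return None

# An in-memory importer replaces all of this probing with a single
# dictionary lookup - the Python-level analogue of PyOxidizer's
# HashMap<String, Vec<u8>> lookup in Rust.
print(find_module_on_disk("json"))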

Another difference is that PyOxidizer is almost completely statically linked. By contrast, a typical CPython install has compiled extension modules as standalone shared libraries and these shared libraries often link against other shared libraries (such as libssl). From dtruss timing information, I don't believe this difference contributes to significant overhead, however.

Finally, I haven't yet optimized PyOxidizer. I still have a few tricks up my sleeve that can likely shave off more overhead from Python startup. But so far the results are looking very promising. I dare say they are looking promising enough that Python distributions themselves might want to look into the area more thoroughly and consider distribution defaults that rely less on the every-Python-module-is-a-separate-file model.

Stay tuned for more PyOxidizer updates in the near future!

(I updated this post a day after initial publication to add measurements for Linux.)