Faster In-Memory Python Module Importing
December 28, 2018 at 12:40 PM | categories: Python, PyOxidizer, Rust

I recently blogged about distributing standalone Python applications. In that post, I announced PyOxidizer - a tool which leverages Rust to produce standalone executables embedding Python. One of the features of PyOxidizer is the ability to import Python modules embedded within the binary using zero-copy.
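To make "importing from memory" concrete, here is a minimal toy sketch of the mechanism using Python's own import machinery. This is an illustration only - PyOxidizer's actual importer is implemented in Rust and serves embedded module data without copying - and the hello module here is hypothetical:

# Toy in-memory importer: a meta path finder that serves module source
# from a dict instead of the filesystem. Not PyOxidizer's implementation.
import importlib.abc
import importlib.util
import sys

# Hypothetical embedded "archive": module name -> source code.
EMBEDDED_MODULES = {
    "hello": "GREETING = 'hi from memory'\n",
}

class MemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def find_spec(self, fullname, path=None, target=None):
        if fullname in EMBEDDED_MODULES:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # defer to the normal path-based finders

    def exec_module(self, module):
        # Compile and execute the embedded source; no open()/stat() needed.
        source = EMBEDDED_MODULES[module.__name__]
        exec(compile(source, "<memory>", "exec"), module.__dict__)

sys.meta_path.insert(0, MemoryFinder())

import hello  # satisfied from memory, not the filesystem
print(hello.GREETING)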
I also recently blogged about global kernel locks in APFS, which make filesystem operations slower on macOS. This was the latest wrinkle in a long battle against Python's slow startup times, which I've posted about on the official python-dev mailing list over the years.
Since I announced PyOxidizer a few days ago, I've had some productive holiday hacking sessions!
One of the milestones reached is that PyOxidizer now supports macOS.
With that milestone reached, I thought it would be interesting to compare the performance of a PyOxidizer executable versus a standard CPython build.
I produced a Python script that imports almost the entirety of the Python standard library - at least the modules implemented in Python. That's 508 import statements. I then executed this script using a typical python3.7 binary (with the standard library on the filesystem) and PyOxidizer-produced standalone executables with a module importer that loads Python modules from memory using zero copy.
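The script itself is essentially a long list of import statements. Something functionally similar could be generated with a few lines of Python (this is my reconstruction - a sketch, not necessarily how the linked script was produced):

# Reconstruction of an import_stdlib.py generator: emit an import for
# every top-level standard library module implemented in Python.
import os
import sysconfig

stdlib_dir = sysconfig.get_paths()["stdlib"]
names = set()
for entry in os.listdir(stdlib_dir):
    if entry.endswith(".py"):
        names.add(entry[:-3])
    elif os.path.isfile(os.path.join(stdlib_dir, entry, "__init__.py")):
        names.add(entry)  # a package

for name in sorted(names):
    # Skip private modules; a few others (e.g. antigravity) would also
    # need excluding because importing them has side effects.
    if not name.startswith("_"):
        print("import", name)

With a script like that in hand, here are the timings: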
# Homebrew installed CPython 3.7.2
# Cold disk cache.
$ sudo purge
$ time /usr/local/bin/python3.7 < import_stdlib.py
real 0m0.694s
user 0m0.354s
sys 0m0.121s
# Hot disk cache.
$ time /usr/local/bin/python3.7 < import_stdlib.py
real 0m0.319s
user 0m0.263s
sys 0m0.050s
# PyOxidizer with non-PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real 0m0.223s
user 0m0.201s
sys 0m0.017s
# PyOxidizer with PGO/non-LTO CPython 3.7.2
$ time target/release/pyapp < import_stdlib.py
real 0m0.234s
user 0m0.210s
sys 0m0.019s
# PyOxidizer with PGO+LTO CPython 3.7.2
# Cold disk cache.
$ sudo purge
$ time target/release/pyapp < import_stdlib.py
real 0m0.442s
user 0m0.252s
sys 0m0.059s
# Hot disk cache.
$ time target/release/pyapp < import_stdlib.py
real 0m0.221s
user 0m0.197s
sys 0m0.020s
First, the PyOxidizer times are all relatively similar regardless of whether PGO or LTO is used to build CPython. That's not too surprising, as I'm exercising a very limited subset of CPython (and I suspect the benefits of PGO/LTO aren't as pronounced due to the nature of the CPython API).
But the bigger result is the obvious speedup with PyOxidizer and its in-memory importing: PyOxidizer can import almost the entirety of the Python standard library ~100ms faster than a typical standalone CPython install with a hot disk cache - running in ~70% of the original time! This comes out to ~0.19ms per import statement. If we run purge to clear out the disk cache, the performance delta increases to 252ms, or ~64% of original. All these numbers are on a 2018 6-core 2.9 GHz i9 MacBook Pro, which has a pretty decent SSD.
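For the record, the per-import arithmetic is straightforward (a small check using the numbers above):

# Deriving the per-import figures from the macOS wall times above.
imports = 508
print((0.319 - 0.223) / imports * 1000)  # ~0.19ms saved per import (hot cache)
print(0.223 / 0.319)                     # PyOxidizer runs in ~70% of the time
print(0.694 - 0.442)                     # cold cache delta: ~252ms (~64%)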
And on Linux on an i7-6700K running in a Hyper-V VM:
# pyenv installed CPython 3.7.2
# Cold disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real 0m0.405s
user 0m0.165s
sys 0m0.065s
# Hot disk cache.
$ time ~/.pyenv/versions/3.7.2/bin/python < import_stdlib.py
real 0m0.193s
user 0m0.161s
sys 0m0.032s
# PyOxidizer with PGO CPython 3.7.2
# Cold disk cache.
$ time target/release/pyapp < import_stdlib.py
real 0m0.227s
user 0m0.145s
sys 0m0.016s
# Hot disk cache.
$ time target/release/pyapp < import_stdlib.py
real 0m0.152s
user 0m0.136s
sys 0m0.016s
On a hot disk cache, the run-time improvement of PyOxidizer is ~41ms, or ~78% of original. This comes out to ~0.08ms per import statement. When flushing caches by writing 3 to /proc/sys/vm/drop_caches, the delta increases to ~178ms, or ~56% of original.
Using dtruss -c to execute the binaries, the breakdown in system calls occurring >10 times is clear:
# CPython standalone
fstatfs64 16
read_nocancel 19
ioctl 20
getentropy 22
pread 26
fcntl 27
sigaction 32
getdirentries64 34
fcntl_nocancel 106
mmap 114
close_nocancel 129
open_nocancel 130
lseek 148
open 168
close 170
read 282
fstat64 403
stat64 833
# PyOxidizer
lseek 10
read 12
read_nocancel 14
fstat64 16
ioctl 22
munmap 31
stat64 33
sysctl 33
sigaction 36
mmap 122
madvise 193
getentropy 315
PyOxidizer avoids hundreds of open(), close(), read(), fstat64(), and stat64() calls. And by avoiding these calls, PyOxidizer not only avoids the userland-kernel overhead intrinsic to them, but also any additional overhead that APFS is imposing via its global lock(s).
(Why the PyOxidizer binary is making hundreds of calls to getentropy(), I'm not sure. It's definitely coming from Python as a side-effect of a module import, and it is something I'd like to fix, if possible.)
With this experiment, we finally have the ability to better isolate the impact of filesystem overhead on Python module importing and preliminary results indicate that the overhead is not insignificant - at least on the tested systems (I'll get data for Windows when PyOxidizer supports it). While the test is somewhat contrived (I don't think many applications import the entirety of the Python standard library), some Python applications do import hundreds of modules. And as I've written before, milliseconds matter. This is especially true if you are invoking Python processes hundreds or thousands of times in a build system, when running a test suite, for scripting, etc. Cumulatively you can be importing tens of thousands of modules. So I think shaving even fractions of a millisecond from module importing is important.
It's worth noting that in addition to the system call overhead, CPython's path-based importer runs substantially more Python code than PyOxidizer, and this likely contributes several milliseconds of overhead as well. Because PyOxidizer applications are static, the importer can remain simple (finding a module in PyOxidizer is essentially a Rust HashMap<String, Vec<u8>> lookup).
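In Python terms, that lookup is morally equivalent to a dict probe (the table contents here are purely illustrative):

# Python analogue of PyOxidizer's module lookup: a single hash-table
# probe per import instead of a filesystem search.
embedded = {"json": b"<embedded module data>"}
data = embedded.get("json")  # O(1), no system calls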
While it might be useful to isolate the filesystem overhead from Python code overhead, the thing end-users care about is overall execution time: they don't care where that overhead is coming from. So I think it is fair to compare PyOxidizer - with its intrinsically simpler import model - with what Python typically does (scanning sys.path entries and looking for modules on the filesystem).
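As a rough sketch of what that path scan implies, a single import can translate into many filesystem probes (a simplification that ignores finder caches and extension module suffixes):

# Rough illustration of the filesystem probing a path-based import
# implies: several candidate filenames may be checked per sys.path entry.
import os
import sys

def candidate_paths(modname):
    base = modname.replace(".", os.sep)
    for entry in sys.path:
        yield os.path.join(entry or ".", base + ".py")
        yield os.path.join(entry or ".", base, "__init__.py")

checked = 0
for path in candidate_paths("json.decoder"):
    os.path.exists(path)  # roughly one stat() per candidate
    checked += 1
print(checked, "filesystem probes for a single import")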
Another difference is that PyOxidizer is almost completely statically linked. By contrast, a typical CPython install has compiled extension modules as standalone shared libraries, and these shared libraries often link against other shared libraries (such as libssl). From dtruss timing information, I don't believe this difference contributes significant overhead, however.
Finally, I haven't yet optimized PyOxidizer. I still have a few tricks up my sleeve that can likely shave off more overhead from Python startup. But so far the results are looking very promising. I dare say they are looking promising enough that Python distributions themselves might want to look into the area more thoroughly and consider distribution defaults that rely less on the every-Python-module-is-a-separate-file model.
Stay tuned for more PyOxidizer updates in the near future!
(I updated this post a day after initial publication to add measurements for Linux.)