Technical Notes

CPython Initialization

Most code lives in pylifecycle.c.

Call tree with Python 3.7:

``Py_Initialize()``
  ``Py_InitializeEx()``
    ``_Py_InitializeFromConfig(_PyCoreConfig config)``
      ``_Py_InitializeCore(PyInterpreterState, _PyCoreConfig)``
        Sets up allocators.
        ``_Py_InitializeCore_impl(PyInterpreterState, _PyCoreConfig)``
          Does most of the initialization.
          Sets up the runtime, a new interpreter state, the thread state, the GIL, and built-in types.
          Initializes sys module and sets up sys.modules.
          Initializes builtins module.
          ``_PyImport_Init()``
            Copies ``interp->builtins`` to ``interp->builtins_copy``.
          ``_PyImportHooks_Init()``
            Sets up ``sys.meta_path``, ``sys.path_importer_cache``,
            ``sys.path_hooks`` to empty data structures.
          ``initimport()``
            ``PyImport_ImportFrozenModule("_frozen_importlib")``
            ``PyImport_AddModule("_frozen_importlib")``
            ``interp->importlib = importlib``
            ``interp->import_func = interp->builtins.__import__``
            ``PyInit__imp()``
              Initializes ``_imp`` module, which is implemented in C.
            ``sys.modules["_imp"] = imp``
            ``importlib._install(sys, _imp)``
            ``_PyImportZip_Init()``

      ``_Py_InitializeMainInterpreter(interp, _PyMainInterpreterConfig)``
        ``_PySys_EndInit()``
          ``sys.path = XXX``
          ``sys.executable = XXX``
          ``sys.prefix = XXX``
          ``sys.base_prefix = XXX``
          ``sys.exec_prefix = XXX``
          ``sys.base_exec_prefix = XXX``
          ``sys.argv = XXX``
          ``sys.warnoptions = XXX``
          ``sys._xoptions = XXX``
          ``sys.flags = XXX``
          ``sys.dont_write_bytecode = XXX``
        ``initexternalimport()``
          ``interp->importlib._install_external_importers()``
        ``initfsencoding()``
          ``_PyCodec_Lookup(Py_FilesystemDefaultEncoding)``
            ``_PyCodecRegistry_Init()``
              ``interp->codec_search_path = []``
              ``interp->codec_search_cache = {}``
              ``interp->codec_error_registry = {}``
              # This is the first non-frozen import during startup.
              ``PyImport_ImportModuleNoBlock("encodings")``
            ``interp->codec_search_cache[codec_name]``
            ``for p in interp->codec_search_path: p[codec_name]``
        ``initsigs()``
        ``add_main_module()``
          ``PyImport_AddModule("__main__")``
        ``init_sys_streams()``
          ``PyImport_ImportModule("encodings.utf_8")``
          ``PyImport_ImportModule("encodings.latin_1")``
          ``PyImport_ImportModule("io")``
          Consults ``PYTHONIOENCODING`` and gets encoding and error mode.
          Sets up ``sys.__stdin__``, ``sys.__stdout__``, ``sys.__stderr__``.
        Sets warning options.
        Sets ``_PyRuntime.initialized``, which is what ``Py_IsInitialized()``
        returns.
        ``initsite()``
          ``PyImport_ImportModule("site")``

CPython Importing Mechanism

Lib/importlib defines importing mechanisms and is 100% Python.

Programs/_freeze_importlib.c is a program that takes a path to an input .py file and a path to an output .h file. It initializes a Python interpreter and compiles the .py file to marshalled bytecode, then writes out a .h file defining an inline ``const unsigned char _Py_M__importlib[]`` array containing that bytecode.
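As a rough illustration of the transformation (the real tool is C code built against the same interpreter; the file paths and exact array formatting here are assumptions), the equivalent Python looks something like::

   import marshal

   # Compile a bootstrap source file to a code object, marshal it, and emit a
   # C header containing the bytes. Paths and the array name are illustrative.
   with open("Lib/importlib/_bootstrap.py") as fh:
       source = fh.read()

   code = compile(source, "<frozen importlib._bootstrap>", "exec")
   payload = marshal.dumps(code)

   lines = ["const unsigned char _Py_M__importlib[] = {"]
   for i in range(0, len(payload), 16):
       chunk = payload[i:i + 16]
       lines.append("    " + ",".join(str(b) for b in chunk) + ",")
   lines.append("};")

   with open("Python/importlib.h", "w") as fh:
       fh.write("\n".join(lines) + "\n")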

Lib/importlib/_bootstrap_external.py compiled to Python/importlib_external.h with _Py_M__importlib_external[].

Lib/importlib/_bootstrap.py compiled to Python/importlib.h with _Py_M__importlib[].

Python/frozen.c has _PyImport_FrozenModules[] effectively mapping _frozen_importlib to importlib._bootstrap and _frozen_importlib_external to importlib._bootstrap_external.

initimport() calls PyImport_ImportFrozenModule("_frozen_importlib"), effectively import importlib._bootstrap. Module import doesn’t appear to have meaningful side-effects.

importlib._bootstrap.__import__ is installed as interp->import_func.

The C-implemented _imp module is initialized.

importlib._bootstrap._install(sys, _imp) is called. It calls _setup(sys, _imp) and adds BuiltinImporter and FrozenImporter to sys.meta_path.

_setup() defines globals _imp and sys. Populates __name__, __loader__, __package__, __spec__, __path__, __file__, __cached__ on all sys.modules entries. Also loads builtins _thread, _warnings, and _weakref.

Later during interpreter initialization, initexternalimport() effectively calls importlib._bootstrap._install_external_importers(). This runs import _frozen_importlib_external, which is effectively import importlib._bootstrap_external. The module handle is aliased to importlib._bootstrap._bootstrap_external.

importlib._bootstrap_external import doesn’t appear to have significant side-effects.

importlib._bootstrap_external._install() is called with a reference to importlib._bootstrap. _setup() is called.

importlib._bootstrap_external._setup() imports the built-in modules _io, _warnings, builtins, and marshal. Either posix or nt is imported depending on the OS. Various module-level attributes defining the run-time environment are set; on Windows this includes _winreg. SOURCE_SUFFIXES and EXTENSION_SUFFIXES are updated accordingly.

importlib._bootstrap_external._get_supported_file_loaders() returns the various path-based loaders: ExtensionFileLoader configured from _imp.extension_suffixes(), SourceFileLoader configured from SOURCE_SUFFIXES, and SourcelessFileLoader configured from BYTECODE_SUFFIXES.

FileFinder.path_hook() is called with all of these loaders and the result is added to sys.path_hooks. PathFinder is added to sys.meta_path.
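The net effect on the import system can be approximated with public importlib.machinery APIs. This is purely illustrative (running it in a live interpreter would register duplicate hooks); the real wiring happens inside _bootstrap_external._install_external_importers()::

   import sys
   from importlib.machinery import (
       BYTECODE_SUFFIXES,
       EXTENSION_SUFFIXES,
       SOURCE_SUFFIXES,
       ExtensionFileLoader,
       FileFinder,
       PathFinder,
       SourceFileLoader,
       SourcelessFileLoader,
   )

   # Loader/suffix pairs equivalent to _get_supported_file_loaders().
   loaders = [
       (ExtensionFileLoader, EXTENSION_SUFFIXES),
       (SourceFileLoader, SOURCE_SUFFIXES),
       (SourcelessFileLoader, BYTECODE_SUFFIXES),
   ]

   # Roughly what installing the external importers amounts to.
   sys.path_hooks.append(FileFinder.path_hook(*loaders))
   sys.meta_path.append(PathFinder)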

sys.modules After Interpreter Init

==========================  =======  =======================================
Module                      Type     Source
==========================  =======  =======================================
__main__                             add_main_module()
_abc                        builtin  abc
_codecs                     builtin  initfsencoding()
_frozen_importlib           frozen   initimport()
_frozen_importlib_external  frozen   initexternalimport()
_imp                        builtin  initimport()
_io                         builtin  importlib._bootstrap_external._setup()
_signal                     builtin  initsigs()
_thread                     builtin  importlib._bootstrap._setup()
_warnings                   builtin  importlib._bootstrap._setup()
_weakref                    builtin  importlib._bootstrap._setup()
_winreg                     builtin  importlib._bootstrap_external._setup()
abc                         py
builtins                    builtin  _Py_InitializeCore_impl()
codecs                      py       encodings via initfsencoding()
encodings                   py       initfsencoding()
encodings.aliases           py       encodings
encodings.latin_1           py       init_sys_streams()
encodings.utf_8             py       init_sys_streams() + initfsencoding()
io                          py       init_sys_streams()
marshal                     builtin  importlib._bootstrap_external._setup()
nt                          builtin  importlib._bootstrap_external._setup()
posix                       builtin  importlib._bootstrap_external._setup()
readline                    builtin
sys                         builtin  _Py_InitializeCore_impl()
zipimport                   builtin  initimport()
==========================  =======  =======================================
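A rough way to reproduce this table is to ask a freshly initialized interpreter what it has imported. Run with -S so site.py imports don't muddy the list; readline only appears in interactive sessions, so expect small differences::

   # Run as: python3.7 -S thisscript.py
   import sys
   print("\n".join(sorted(sys.modules)))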

Modules Imported by site.py

_collections_abc _sitebuiltins _stat atexit genericpath os os.path posixpath rlcompleter site stat

Random Notes

The frozen importer iterates an array looking for module names. For each item it calls _PyUnicode_EqualToASCIIString(), which also verifies the search name is ASCII. With a large number of frozen modules, performing an O(n) scan for every frozen import could add measurable overhead. A better frozen importer would use a map/hash/dict for lookups. This *may* require CPython API breakage, as the PyImport_FrozenModules data structure is documented as part of the public API and its value can be updated dynamically at run-time.
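The idea, sketched in Python (the names and payloads are made up; CPython's real table lives in C as PyImport_FrozenModules)::

   # What the current importer effectively does: O(n) string comparisons.
   FROZEN = [("_frozen_importlib", b"..."), ("_frozen_importlib_external", b"...")]

   def find_frozen_linear(name):
       for frozen_name, payload in FROZEN:
           if frozen_name == name:
               return payload
       return None

   # A dict-based index makes each lookup O(1), at the cost of keeping the
   # index in sync if the underlying array is swapped out at run-time.
   FROZEN_INDEX = dict(FROZEN)

   def find_frozen_indexed(name):
       return FROZEN_INDEX.get(name)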

importlib._bootstrap cannot call import because the global import hook isn’t registered until after initimport().

importlib._bootstrap_external is the best place to monkeypatch because of the limited run-time functionality available during importlib._bootstrap.

It’s a bit wonky that Py_Initialize() imports modules from the standard library, and it doesn’t appear possible to disable this. If site.py is disabled, the non-extension (pure Python) imports are limited to codecs, encodings, abc, and whatever encodings.* modules are needed by initfsencoding() and init_sys_streams().

An attempt was made to freeze the full set of standard library modules loaded during initialization. However, the built-in extension importer doesn’t set all of the module attributes expected by the module system. The from . import aliases in encodings/__init__.py gets confused without these attributes, and relative imports seemed to have issues as well. One would think it would be possible to run an embedded interpreter with all standard library modules frozen, but this doesn’t work.

Desired Changes from Python to Aid PyOxidizer

As part of implementing PyOxidizer, we’ve encountered numerous shortcomings in Python that have made implementation more difficult. This section attempts to capture those along with our desired outcomes.

General Lack of Clear Specifications

PyOxidizer has had to implement a lot of low-level functionality, notably around interpreter initialization and module/resource importing. We have also had to reinvent aspects of packaging so it can be performed in Rust.

Various Python functionality is not defined in specifications. Rather, it is defined by PEPs plus implementations in code. And when PEPs do exist, there often isn’t a single PEP outlining the clear current state of the world: many PEPs are effectively deltas that build on top of earlier PEPs. Often the only canonical source for how something works is the implementation in code. And when questions of clarification arise, it isn’t clear whether the code or a PEP is wrong, because there often isn’t a single PEP serving as the canonical source of truth.

It would be highly preferable for Python to publish clear specifications for how various mechanisms work. A PEP would then be a diff against a specification (possibly creating a new specification) plus the discussion around it. That way there would be a clear specification to consult as the source of truth for how things should behave.

__file__ Ambiguity

It isn’t clear whether __file__ is actually required, nor what is derived from its existence. It also isn’t clear what __file__ should be set to when there is no concrete filesystem path. Can __file__ be virtual? Can it refer to a binary/archive containing the module?

Semantics of __file__ need more clarification.

importlib.metadata Documentation Deficiencies

See https://bugs.python.org/issue38594.

importlib Resources Directory Ambiguity

See https://bugs.python.org/issue36128, https://gitlab.com/python-devs/importlib_resources/issues/58, and https://gitlab.com/python-devs/importlib_resources/-/issues/90.

Standardizing a Python Distribution Format

PyOxidizer consumes Python distributions and repackages them. e.g. it takes an archive containing a Python executable, standard library, support libraries, etc and transforms them into new binaries or distributable artifacts.

There is no standard for representing a Python distribution. This is something that PyOxidizer has had to invent itself via the python-build-standalone project and its PYTHON.json files.

Should Python have a standardized way of describing Python distribution archives, and should CPython publish such distributions, PyOxidizer would become largely agnostic of the distribution flavor being consumed, and PyOxidizer (and other Python packaging tools) could more easily target other distribution flavors. e.g. you could swap out CPython for PyPy and the tooling largely wouldn’t care.

Ability to Install Meta Path Importers Before Py_Initialize()

Py_Initialize() will import some standard library modules during its execution. It does so using the default importers available to the distribution. This means that standard library modules must come from the filesystem (via the path-based PathFinder), frozen modules, built-in extension modules, or zip files (zipimport, also hooked into the path-based finder).

This restriction prevents importing the entirety of the standard library from the binary containing Python, in effect preventing fully self-contained executables. PyOxidizer works around this by patching the importlib._bootstrap and importlib._bootstrap_external source code, compiling it to bytecode, and making that bytecode available as frozen modules. The patched code (which runs as part of Py_Initialize()) installs a sys.meta_path importer which imports modules from memory. This solution is extremely hacky, but it is necessary to achieve single-file executables with all imports serviced from memory.
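For illustration only, a minimal meta path finder that serves modules from an in-memory store looks roughly like the following. This is a generic sketch, not PyOxidizer's actual implementation; the module store and names are invented::

   import importlib.abc
   import importlib.util
   import sys

   # Hypothetical in-memory store: fully-qualified module name -> source code.
   _SOURCES = {"greetings": "def hello():\n    return 'hello'\n"}

   class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
       def find_spec(self, fullname, path, target=None):
           if fullname in _SOURCES:
               return importlib.util.spec_from_loader(fullname, self)
           return None

       def exec_module(self, module):
           source = _SOURCES[module.__name__]
           exec(compile(source, "<memory>", "exec"), module.__dict__)

   sys.meta_path.insert(0, InMemoryFinder())

   import greetings
   print(greetings.hello())  # -> hello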

In order for this to work, PyOxidizer needs a copy of these importlib modules so it can patch them and compile them to bytecode. This is problematic in some cases because e.g. the Windows embeddable Python distributions ship only the bytecode of these modules in a pythonXY.zip file. So PyOxidizer needs to find the source code from another location when consuming these distributions.

But patching the importlib bootstrap modules is itself hacky. It would be better if PyOxidizer didn’t need to do this at all. This could be achieved by splitting up the interpreter initialization APIs to give embedding applications the opportunity to modify sys.meta_path before any import is performed. It could also be achieved by introducing an initialization config option that injects code at the right point during startup to register the sys.meta_path importer. That could work by importing a named module (presumably serviced by the frozen or built-in importer) and having that module modify sys.meta_path as a side-effect of module evaluation at import time. A variation would be to define a callable in said module to call after the module is imported. Whatever the solution, there needs to be a way to inject a sys.meta_path importer before any import not serviced by the frozen or built-in importers is performed.

Lacking Support for Statically Linked Builds

Python really wants you to use shared libraries: libpython is expected to be a shared library, and extension modules strongly insist on being built and loaded as shared libraries.

On Windows, there is no official Visual Studio project configuration for static builds. Actually achieving one requires a lot of hacks to the build system (see python-build-standalone project).

There is ~0 support for building statically linked extension modules in packaging tools, from the build step itself all the way up to distribution. (PyOxidizer’s approach is to hack distutils to record and save the object files that were compiled and then PyOxidizer manually links these object files into the final binary.)
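An illustrative sketch of that kind of distutils hack (not PyOxidizer's actual code, and distutils itself is deprecated in newer Pythons): wrap the compiler used by build_ext so the paths of the compiled object files are captured::

   from distutils.command.build_ext import build_ext

   class RecordingBuildExt(build_ext):
       def build_extension(self, ext):
           recorded = []
           original_compile = self.compiler.compile

           def compile_and_record(sources, **kwargs):
               objects = original_compile(sources, **kwargs)
               recorded.extend(objects)  # object files for later static linking
               return objects

           self.compiler.compile = compile_and_record
           try:
               super().build_extension(ext)
           finally:
               self.compiler.compile = original_compile

           # A real tool would persist this list (e.g. to a JSON manifest).
           print(ext.name, recorded)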

To achieve a statically linked executable containing libpython and extension modules, you effectively need to build everything from source. And if you want to distribute that executable, you often need to build with special toolchains to ensure binary portability.

There is tons of room for Python to better support static linking. A possible good place to start would be for packaging tools to support building extension modules which don’t rely on a dynamic libpython. If artifacts containing the raw object files designed for static linking were made available on PyPI, PyOxidizer could download pre-built binaries and link them directly into an executable or custom libpython. This would avoid having to recompile said extension modules at repackaging time. The compatibility guarantees would likely look a lot like existing binary wheels.

On a related front, it would be nice if musl libc based binary wheels were standardized. There are some concerns about the performance and compatibility of musl libc when it comes to Python. But musl libc is a valid deploy target nonetheless and it would be nice if Python officially supported it. (FWIW the performance concerns seem to stem from memory allocator performance and PyOxidizer supports using jemalloc as the allocator, bypassing this problem.)

Windows Embeddable Distributions Missing Functionality

The Windows embeddable zip file distributions of CPython are missing certain functionality.

The distributions do not contain source code for Python modules in the standard library. This means PyOxidizer can’t easily bundle sources from these distributions.

The ensurepip module is not present in the distribution. So there is no way to install pip using the distribution itself.

The venv module is also not present in the distribution. So there’s no way to create virtualenvs using the distribution itself.

The Python C development headers are not part of the distribution, so even if you install packaging tools, you can’t build C extensions.

Extension Module / Shared Library Filename Ambiguity

On some platforms, Python extension modules and shared libraries have the same filename extension. e.g. on Linux, both are named foo.so.

PyOxidizer’s packaging functionality needs to classify files as specific resource types (source modules, bytecode modules, resource files, extension modules, shared libraries, etc). Because certain file patterns (like .so) are ambiguous, PyOxidizer cannot perform this classification trivially.

It would be much preferred if there were unique file extensions that distinguished Python extension modules from regular shared libraries.

On Windows, this is already the case with the .pyd extension. However, POSIX architectures aren’t so fortunate.
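The overlap is visible from Python itself; on a Linux CPython 3.7 build the recognized extension module suffixes include a bare .so, which is also the suffix used by ordinary shared libraries::

   import importlib.machinery

   # Prints something like:
   # ['.cpython-37m-x86_64-linux-gnu.so', '.abi3.so', '.so']
   print(importlib.machinery.EXTENSION_SUFFIXES)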

Ambiguous File Classification

This is somewhat related to the previous section but is more generic.

Python’s default path-based importer dynamically looks for presence of various files on the filesystem and loads the first type variant (extension module, bytecode, source, etc) discovered.

PyOxidizer’s importer indexes resources during packaging and its import-time resource resolution is static: the type of resource is baked into the definition of the resource.

These approaches are somewhat at odds with each other. The path-based importer is dynamic in nature: it defers answering questions until a specific resource is requested. PyOxidizer’s importer is static / pre-compiled: it must classify a resource based on its filename/path so it can bake that knowledge into an immutable data structure. It does not have knowledge of what names will be requested at run-time.

Bridging this divide has revealed various ambiguities and corner cases in the filenames of Python resources.

The Python extension module or shared library ambiguity is described above.

There is also an ambiguity with extra files that aren’t part of a known Python package. If you attempt to classify every file in a sys.path directory, it is tempting to classify each file as a Python module (.py, .pyc, or extension module), a package resource (importlib.resources), or package metadata (e.g. .dist-info files accessed via importlib.metadata). However, some files cannot be obviously classified as any of these.

For example, a file foo/libfoo.so without an accompanying foo/__init__.py is ambiguous. We could say it is an extension module (foo.libfoo) due to the extension module / shared library ambiguity. We could also consider it a package resource foo:libfoo.so or "":foo/libfoo.so, although the latter case of using an empty string for the package name doesn’t make much sense. And we arguably shouldn’t consider it a resource of foo because no obvious foo Python package exists!

This is relevant in the real world because various Python packages rely on installing arbitrary files in sys.path directories. For example, numpy installs files like numpy.libs/libz-eb09ad1d.so.1.2.3, where the numpy.libs directory only contains file extensions *.so[.*]. Note that this example is particularly confusing because the directory names in sys.path directories are typically split on . and correspond to Python [sub-]packages.

Because there is no unambiguous way to classify all files in a sys.path directory, and because Python packaging tools allow files that aren’t contained within a known Python package (identified by the presence of an __init__ file/module), tools are forced to introduce a catch-all "other" classification of files. And because a specific file can’t easily be classified as a specific type, this effectively prevents the use of resource loading techniques that don’t involve explicit filesystem I/O, unless significant smarts are applied. I.e. because PyOxidizer cannot easily and unambiguously identify file X as a specific type, it is forced to materialize that file at a similar location on the run-time system.

However, if runtimes like PyOxidizer were able to identify the type of a file from its file extension and/or the presence of other files, they would know exactly how to load/treat the file at run-time without having to resort to heuristics.

This ambiguity effectively means that PyOxidizer needs to:

  • Determine if a file is a shared library or not (because shared libraries are treated specially and we can’t unambiguously identify a shared library from its file extension).

  • Examine symbols within shared libraries to see if a Python extension module is present (via the presence of PyInit_* symbols); a sketch of this check follows this list.

  • Preserve extra files not present in a Python package. (In the case of numpy, there are no obvious links to the shared libraries in the numpy.libs directory: this relative path is encoded within the extension module shared library via e.g. DT_NEEDED.)
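A crude sketch of that symbol check (only illustrative: it scans the raw bytes for the expected exported name; a robust implementation would parse the ELF/Mach-O/PE symbol tables instead)::

   def looks_like_extension_module(path, module_basename):
       # An extension module importable as foo.libfoo must export PyInit_libfoo.
       expected = b"PyInit_" + module_basename.encode("ascii")
       with open(path, "rb") as fh:
           return expected in fh.read()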

The most robust mitigation for this ambiguity would be for all files associated with an installable Python package/distribution to be annotated with their type, and for Python package installers to refuse to process files that aren’t identified. This could be achieved by having a .dist-info/ file annotating the role of each file.

Push Harder for Wheels

Wheels are a superior distribution format for Python packages because they are more static and follow a finite set of rules for how they should be installed. In theory, one could write code to install a wheel in any programming language. Non-wheel distributions are a different matter entirely: a .tar.gz source distribution often relies on running a setup.py file, which requires a Python interpreter.
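Because a wheel is a zip archive with a documented layout (PEP 427), listing and extracting one needs nothing Python-specific. A minimal illustration (the filename is made up, and a real installer does more: RECORD verification, script generation, data directory handling, etc.)::

   import zipfile

   with zipfile.ZipFile("example_pkg-1.0-py3-none-any.whl") as whl:
       for name in whl.namelist():
           # e.g. example_pkg/__init__.py, example_pkg-1.0.dist-info/METADATA
           print(name)
       whl.extractall("unpacked-wheel")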

In the ideal world, PyOxidizer doesn’t care about how a package is built: just the files that comprise the installed package. So wheels are a more desirable distribution format. In fact, PyOxidizer has Rust code for extracting wheels and repackaging their contents: no Python necessary. This means PyOxidizer can do things like download wheels targeting non-native architectures and it just works.

As good as wheels are, they are not universal in Python land. There are tons of packages that don’t offer wheels and only provide the older .tar.gz source distribution format.

We would like to see a concerted effort to push harder for the presence of wheels. For example, PyPI could encourage/nag package maintainers to upload wheels.

No Way to Hook open()

oxidized_importer wants to load Python modules and resource data from memory, without using files.

There is a convention of using virtual paths to express paths within some other entity. e.g. the zip importer uses /path/to/archive.zip/foo.py to refer to the path foo.py within the /path/to/archive.zip zip file. It is also common to use the current executable’s path to refer to entities within the current executable, e.g. /path/to/myapp/foo.py would refer to a foo.py somehow embedded in the /path/to/myapp executable.

These virtual paths are a great idea. You can even implement pathlib.Path around them and provide a custom Path.open() that performs custom I/O. However, it is really easy for these paths to leak and get fed into io.open() or similar APIs that operate on filesystem paths. For example, someone writes open(foo.__path__, "rb") instead of foo.__path__.open("rb"). When this happens, you’ll likely get an I/O error since virtual paths aren’t real filesystem paths.
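A small self-contained illustration of the failure mode (VirtualPath is an invented stand-in, not a real PyOxidizer API)::

   import io

   class VirtualPath:
       """A path-like object whose contents live in memory, not on disk."""

       def __init__(self, display_path, data):
           self._display_path = display_path
           self._data = data

       def __str__(self):
           return self._display_path

       def open(self, mode="rb"):
           return io.BytesIO(self._data)

   p = VirtualPath("/path/to/myapp/foo.py", b"print('hi')\n")
   print(p.open("rb").read())  # works: the object performs its own I/O

   try:
       open(str(p), "rb")  # the leaked path hits the real filesystem API
   except FileNotFoundError:
       print("virtual path leaked into open(): no such file on disk")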

It would be really nice if Python had some abstraction around filesystem I/O that allowed custom paths to be registered. This is what schemes in URIs and URLs are for. e.g. file:///path/to/file. However, schemes aren’t paths per se. So if we want to preserve compatibility with a path based API and allow io.open() to work with virtual paths, we need a mechanism to register a hook to intercept io.open() (and possibly other I/O operations like stat()) so we can plumb in a custom I/O implementation.

PEP 578 almost does this with PyFile_SetOpenCodeHook() and the io.open_code() mechanism. But io.open_code() is only for a limited use case and isn’t generally usable.

sys.executable is a String Instead of List

Python applications often want to invoke a new Python interpreter process. Generally, you use sys.executable to find the filesystem path to the python executable and then run it.

This is all fine for traditional Python interpreter install layouts that have a python executable. However, in embedded contexts, there may not be a python executable. Rather, the application embedding Python may provide a more advanced way to invoke a Python interpreter. e.g. myapp python <interpreter arguments>.

Since sys.executable is a string and is often fed directly into exec() or similar process-spawning APIs, it isn’t possible to express a multi-argument "run a Python interpreter" command through sys.executable.

To do this robustly while maintaining backwards compatibility, we need a new attribute somewhere that defines a list of arguments for invoking a Python interpreter. In traditional Python install environments, this would be [sys.executable].
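A hedged sketch of what this could look like (sys.interpreter_argv does not exist today; the attribute name is invented purely to illustrate the proposal)::

   import subprocess
   import sys

   # In a traditional install this would just be [sys.executable]; an embedding
   # application could set it to e.g. ["/path/to/myapp", "python"].
   interpreter_argv = getattr(sys, "interpreter_argv", [sys.executable])

   subprocess.run(interpreter_argv + ["-c", "print('hello from a child interpreter')"])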

This idea was proposed at https://mail.python.org/archives/list/python-ideas@python.org/thread/O66N56PB4U6AGICGBSRFD2OWA5JWMFC6/#O66N56PB4U6AGICGBSRFD2OWA5JWMFC6.