Technical Notes¶
CPython Initialization¶
Most code lives in ``pylifecycle.c``.
Call tree with Python 3.7:
``Py_Initialize()``
``Py_InitializeEx()``
``_Py_InitializeFromConfig(_PyCoreConfig config)``
``_Py_InitializeCore(PyInterpreterState, _PyCoreConfig)``
Sets up allocators.
``_Py_InitializeCore_impl(PyInterpreterState, _PyCoreConfig)``
Does most of the initialization.
Runtime, new interpreter state, thread state, GIL, built-in types.
Initializes sys module and sets up sys.modules.
Initializes builtins module.
``_PyImport_Init()``
Copies ``interp->builtins`` to ``interp->builtins_copy``.
``_PyImportHooks_Init()``
Sets up ``sys.meta_path``, ``sys.path_importer_cache``,
``sys.path_hooks`` to empty data structures.
``initimport()``
``PyImport_ImportFrozenModule("_frozen_importlib")``
``PyImport_AddModule("_frozen_importlib")``
``interp->importlib = importlib``
``interp->import_func = interp->builtins.__import__``
``PyInit__imp()``
Initializes ``_imp`` module, which is implemented in C.
``sys.modules["_imp"] = imp``
``importlib._install(sys, _imp)``
``_PyImportZip_Init()``
``_Py_InitializeMainInterpreter(interp, _PyMainInterpreterConfig)``
``_PySys_EndInit()``
``sys.path = XXX``
``sys.executable = XXX``
``sys.prefix = XXX``
``sys.base_prefix = XXX``
``sys.exec_prefix = XXX``
``sys.base_exec_prefix = XXX``
``sys.argv = XXX``
``sys.warnoptions = XXX``
``sys._xoptions = XXX``
``sys.flags = XXX``
``sys.dont_write_bytecode = XXX``
``initexternalimport()``
``interp->importlib._install_external_importers()``
``initfsencoding()``
``_PyCodec_Lookup(Py_FilesystemDefaultEncoding)``
``_PyCodecRegistry_Init()``
``interp->codec_search_path = []``
``interp->codec_search_cache = {}``
``interp->codec_error_registry = {}``
# This is the first non-frozen import during startup.
``PyImport_ImportModuleNoBlock("encodings")``
``interp->codec_search_cache[codec_name]``
``for p in interp->codec_search_path: p[codec_name]``
``initsigs()``
``add_main_module()``
``PyImport_AddModule("__main__")``
``init_sys_streams()``
``PyImport_ImportModule("encodings.utf_8")``
``PyImport_ImportModule("encodings.latin_1")``
``PyImport_ImportModule("io")``
Consults ``PYTHONIOENCODING`` and gets encoding and error mode.
Sets up ``sys.__stdin__``, ``sys.__stdout__``, ``sys.__stderr__``.
Sets warning options.
Sets ``_PyRuntime.initialized``, which is what ``Py_IsInitialized()``
returns.
``initsite()``
``PyImport_ImportModule("site")``
CPython Importing Mechanism¶
``Lib/importlib`` defines importing mechanisms and is 100% Python.
``Programs/_freeze_importlib.c`` is a program that takes a path to an input ``.py`` file and path to output ``.h`` file. It initializes a Python interpreter and compiles the ``.py`` file to marshalled bytecode. It writes out a ``.h`` file with an inline ``const unsigned char _Py_M__importlib`` array containing bytecode.
``Lib/importlib/_bootstrap_external.py`` compiled to ``Python/importlib_external.h`` with ``_Py_M__importlib_external[]``.
``Lib/importlib/_bootstrap.py`` compiled to ``Python/importlib.h`` with ``_Py_M__importlib[]``.
``Python/frozen.c`` has ``_PyImport_FrozenModules[]`` effectively mapping ``_frozen_importlib`` to ``importlib._bootstrap`` and ``_frozen_importlib_external`` to ``importlib._bootstrap_external``.
``initimport()`` calls ``PyImport_ImportFrozenModule("_frozen_importlib")``, effectively ``import importlib._bootstrap``. Module import doesn’t appear to have meaningful side-effects.
``importlib._bootstrap.__import__`` is installed as ``interp->import_func``.
C implemented ``_imp`` module is initialized.
``importlib._bootstrap._install(sys, _imp)`` is called. It calls ``_setup(sys, _imp)`` and adds ``BuiltinImporter`` and ``FrozenImporter`` to ``sys.meta_path``.
``_setup()`` defines globals ``_imp`` and ``sys``. Populates ``__name__``, ``__loader__``, ``__package__``, ``__spec__``, ``__path__``, ``__file__``, ``__cached__`` on all ``sys.modules`` entries. Also loads builtins ``_thread``, ``_warnings``, and ``_weakref``.
Later during interpreter initialization, ``initexternalimport()`` effectively calls ``importlib._bootstrap._install_external_importers()``. This runs ``import _frozen_importlib_external``, which is effectively ``import importlib._bootstrap_external``. This module handle is aliased to ``importlib._bootstrap._bootstrap_external``.
``importlib._bootstrap_external`` import doesn’t appear to have significant side-effects.
``importlib._bootstrap_external._install()`` is called with a reference to ``importlib._bootstrap``. ``_setup()`` is called.
``importlib._bootstrap_external._setup()`` imports builtins ``_io``, ``_warnings``, ``builtins``, ``marshal``. Either ``posix`` or ``nt`` is imported depending on OS. Various module-level attributes are set defining the run-time environment. This includes ``_winreg``. ``SOURCE_SUFFIXES`` and ``EXTENSION_SUFFIXES`` are updated accordingly.
``importlib._bootstrap_external._get_supported_file_loaders()`` returns various loaders. ``ExtensionFileLoader`` configured from ``_imp.extension_suffixes()``. ``SourceFileLoader`` configured from ``SOURCE_SUFFIXES``. ``SourcelessFileLoader`` configured from ``BYTECODE_SUFFIXES``.
``FileFinder.path_hook()`` called with all loaders and result added to ``sys.path_hooks``. ``PathFinder`` added to ``sys.meta_path``.
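The end state described above can be observed from a running interpreter. The following is a minimal sketch, assuming a stock CPython where nothing has removed the default finders; it only inspects public attributes and does not reproduce the C-level startup code.

```python
import importlib
import importlib.machinery
import sys

# The frozen bootstrap module is the same object that importlib exposes
# as importlib._bootstrap.
frozen_bootstrap = sys.modules["_frozen_importlib"]
bootstrap_is_aliased = frozen_bootstrap is importlib._bootstrap

# Finders registered by _install() and _install_external_importers().
expected_finders = (
    importlib.machinery.BuiltinImporter,
    importlib.machinery.FrozenImporter,
    importlib.machinery.PathFinder,
)
installed = {finder: finder in sys.meta_path for finder in expected_finders}

for finder, present in installed.items():
    print(f"{finder.__name__}: {'present' if present else 'missing'}")
print("number of sys.path_hooks entries:", len(sys.path_hooks))
```

Applications that embed Python or install their own importers may see additional entries in ``sys.meta_path``.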
``sys.modules`` After Interpreter Init¶
Module | Type | Source |
---|---|---|
| | |
| builtin | |
| builtin | |
| frozen | |
| frozen | |
| builtin | |
| builtin | |
| builtin | |
| builtin | |
| builtin | |
| builtin | |
| builtin | |
| py | |
| builtin | |
| py | |
| py | |
| py | |
| py | |
| py | |
| py | |
| builtin | |
| builtin | |
| builtin | |
| builtin | |
| builtin | |
| builtin | |
Modules Imported by ``site.py``¶
_collections_abc
_sitebuiltins
_stat
atexit
genericpath
os
os.path
posixpath
rlcompleter
site
stat
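A list like the one above can be regenerated by comparing ``sys.modules`` in a fresh interpreter started with and without ``-S``. This is a rough sketch: the helper name ``loaded_modules`` is made up for illustration, and the exact result depends on the Python version, platform, and whether a virtual environment is active.

```python
import subprocess
import sys


def loaded_modules(*interpreter_flags):
    """Return the set of module names loaded by a fresh interpreter."""
    script = "import sys; print('\\n'.join(sorted(sys.modules)))"
    result = subprocess.run(
        [sys.executable, *interpreter_flags, "-c", script],
        check=True,
        capture_output=True,
        text=True,
    )
    return set(result.stdout.split())


with_site = loaded_modules()
without_site = loaded_modules("-S")  # -S skips the import of site

# Modules present only when site.py runs.
imported_by_site = sorted(with_site - without_site)
print(imported_by_site)
```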
Random Notes¶
Frozen importer iterates an array looking for module names. On each item, it calls ``_PyUnicode_EqualToASCIIString()``, which verifies the search name is ASCII. Performing an O(n) scan on every frozen module lookup could add noticeable overhead if there are a large number of frozen modules. A better frozen importer would use a map/hash/dict for lookups. This may require CPython API breakages, as the ``PyImport_FrozenModules`` data structure is documented as part of the public API and its value could be updated dynamically at run-time.
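The difference between the two lookup strategies can be illustrated in Python. This is only a model of the idea: ``FROZEN_TABLE`` is a hypothetical stand-in for the C array, and the entries and placeholder bytecode are invented for the example.

```python
# Hypothetical stand-in for the C array of (name, code) entries.
FROZEN_TABLE = [
    ("_frozen_importlib", b"<bytecode 1>"),
    ("_frozen_importlib_external", b"<bytecode 2>"),
    ("hypothetical_app", b"<bytecode 3>"),
]


def find_frozen_linear(name):
    """O(n): compare the requested name against every entry in order."""
    for entry_name, code in FROZEN_TABLE:
        if entry_name == name:
            return code
    return None


# Build the index once. setdefault() keeps the first entry for a name,
# matching what the linear scan would return for duplicates.
FROZEN_INDEX = {}
for entry_name, code in FROZEN_TABLE:
    FROZEN_INDEX.setdefault(entry_name, code)


def find_frozen_indexed(name):
    """O(1) on average: a single hash lookup."""
    return FROZEN_INDEX.get(name)
```

The index has to be rebuilt whenever the underlying table is replaced, which is the part that conflicts with the table being mutable at run-time.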
``importlib._bootstrap`` cannot call ``import`` because the global import hook isn’t registered until after ``initimport()``.
``importlib._bootstrap_external`` is the best place to monkeypatch because of the limited run-time functionality available during ``importlib._bootstrap``.
It’s a bit wonky that ``Py_Initialize()`` will import modules from the standard library and it doesn’t appear possible to disable this. If ``site.py`` is disabled, non-extension builtins are limited to ``codecs``, ``encodings``, ``abc``, and whatever ``encodings.*`` modules are needed by ``initfsencoding()`` and ``init_sys_streams()``.
An attempt was made to freeze the set of standard library modules loaded during initialization. However, the built-in extension importer doesn’t set all of the module attributes that are expected of the modules system. The ``from . import aliases`` in ``encodings/__init__.py`` is confused without these attributes. And relative imports seemed to have issues as well. One would think it would be possible to run an embedded interpreter with all standard library modules frozen, but this doesn’t work.
Desired Changes from Python to Aid PyOxidizer¶
As part of implementing PyOxidizer, we’ve encountered numerous shortcomings in Python that have made implementation more difficult. This section attempts to capture those along with our desired outcomes.
General Lack of Clear Specifications¶
PyOxidizer has had to implement a lot of low-level functionality, notably around interpreter initialization and module/resource importing. We have also had to reinvent aspects of packaging so it can be performed in Rust.
Various Python functionality is not defined in specifications. Rather, it is defined by PEPs plus implementations in code. And when there are PEPs, often there isn’t a single PEP outlining the current state of the world: many PEPs are phrased as "builds on top of PEP XYZ." Often the only canonical source of how something works is the implementation in code. And when there are questions for clarification, it isn’t clear whether the code or a PEP is wrong because oftentimes there isn’t a single PEP that is the canonical source of truth.
It would be highly preferred for Python to publish clear specifications for how various mechanisms work. A PEP would be a diff to a specification (possibly creating a new specification) and a discussion around it. That way there would be a clear specification that can be consulted as the source of truth for how things should behave.
``__file__`` Ambiguity¶
It isn’t clear whether ``__file__`` is actually required and what all is derived from existence of ``__file__``. It also isn’t clear what ``__file__`` should be set to if it wouldn’t be a concrete filesystem path. Can ``__file__`` be virtual? Can it refer to a binary/archive containing the module?
Semantics of ``__file__`` need more clarification.
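Stock CPython already has modules without ``__file__``: built-in modules such as ``sys`` do not define it. The sketch below shows one defensive way to read the attribute; the helper name ``describe_module`` is an assumption made for this example, not an established API.

```python
import json
import sys


def describe_module(module):
    """Return (name, __file__ or None, spec origin or None) for a module."""
    spec = getattr(module, "__spec__", None)
    return (
        module.__name__,
        getattr(module, "__file__", None),  # absent on built-in modules
        getattr(spec, "origin", None),
    )


print(describe_module(sys))   # built-in module: no __file__
print(describe_module(json))  # normally loaded from the filesystem
```

Code that assumes ``module.__file__`` always exists raises ``AttributeError`` for the first case, which is one practical consequence of the ambiguity.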
``importlib.metadata`` Documentation Deficiencies¶
``importlib`` Resources Directory Ambiguity¶
See https://bugs.python.org/issue36128, https://gitlab.com/python-devs/importlib_resources/issues/58, and https://gitlab.com/python-devs/importlib_resources/-/issues/90.
Standardizing a Python Distribution Format¶
PyOxidizer consumes Python distributions and repackages them. e.g. it takes an archive containing a Python executable, standard library, support libraries, etc and transforms them into new binaries or distributable artifacts.
There is no standard for representing a Python distribution. This is something that PyOxidizer has had to invent itself via the ``python-build-standalone`` project and its ``PYTHON.json`` files.
Should Python have a standardized way of describing Python distribution archives and should CPython distribute said distributions, it would make PyOxidizer largely agnostic of the distributor flavor being consumed and allow PyOxidizer (and other Python packaging tools) to more easily target other distribution flavors. e.g. you could swap out CPython for PyPy and tooling largely wouldn’t care.
Ability to Install Meta Path Importers Before ``Py_Initialize()``¶
``Py_Initialize()`` will import some standard library modules during its execution. It does so using the default meta path importers available to the distribution. This means that standard library modules must come from the filesystem (``PathFinder``), frozen modules, built-in extension modules, or zip files (via ``PathFinder``).
This restriction prevents importing the entirety of the standard library from the binary containing Python, in effect preventing the use of self-contained executables. PyOxidizer works around this by patching the ``importlib._bootstrap`` and ``importlib._bootstrap_external`` source code, compiling that to bytecode, and making said bytecode available as a frozen module. The patched code (which runs as part of ``Py_Initialize()``) installs a ``sys.meta_path`` importer which imports modules from memory. This solution is extremely hacky, but is necessary to achieve single file executables with all imports serviced from memory.
In order for this to work, PyOxidizer needs a copy of these ``importlib`` modules so it can patch them and compile them to bytecode. This is problematic in some cases because e.g. the Windows embeddable Python distributions ship only the bytecode of these modules in a ``pythonXY.zip`` file. So PyOxidizer needs to find the source code from another location when consuming these distributions.
But patching the ``importlib`` bootstrap modules is hacky itself. It would be better if PyOxidizer didn’t need to do this at all. This could be achieved by splitting up the interpreter initialization APIs to give embedding applications the opportunity to muck with ``sys.meta_path`` before any ``import`` is performed. It could also be achieved by introducing an initialization config option to somehow inject code at the right point during startup to register the ``sys.meta_path`` importer. This could be done by importing a named module (presumably serviced by the frozen or built-in importer) and having that module run code to modify ``sys.meta_path`` as a side-effect of module evaluation at import time. A variation would be to define a callable in said module to call after the module is imported. Whatever the solution, there needs to be a way to somehow inject a ``sys.meta_path`` importer before any ``import`` not serviced by the frozen or built-in importers is performed.
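For readers unfamiliar with meta path importers, the following sketch shows the general shape of one that serves modules from memory. It is not PyOxidizer’s implementation: the class name ``InMemoryFinder``, the module name ``hypothetical_mod``, and the ``"memory"`` origin string are all invented for illustration, and it runs after interpreter startup rather than during it.

```python
import importlib.abc
import importlib.util
import sys


class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve modules from a dict mapping module name to source text."""

    def __init__(self, sources):
        self._sources = dict(sources)

    def find_spec(self, fullname, path=None, target=None):
        if fullname not in self._sources:
            return None  # let the next finder on sys.meta_path try
        return importlib.util.spec_from_loader(fullname, self, origin="memory")

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        source = self._sources[module.__name__]
        code = compile(source, f"<memory:{module.__name__}>", "exec")
        exec(code, module.__dict__)


# Register ahead of the default finders, then import without any file I/O.
finder = InMemoryFinder({"hypothetical_mod": "ANSWER = 42\n"})
sys.meta_path.insert(0, finder)

import hypothetical_mod

print(hypothetical_mod.ANSWER)
```

The difficulty described above is that no supported hook lets an embedder run registration code like the last few lines early enough, before ``Py_Initialize()`` performs its own standard library imports.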
Lacking Support for Statically Linked Builds¶
Python really wants you to use shared libraries for ``libpython``, and extension modules seem to strongly insist on this.
On Windows, there is no official Visual Studio project configuration for static builds. Actually achieving one requires a lot of hacks to the build system (see the ``python-build-standalone`` project).
There is essentially no support for building statically linked extension modules in packaging tools, from the build step itself all the way up to distribution. (PyOxidizer’s approach is to hack ``distutils`` to record and save the object files that were compiled and then PyOxidizer manually links these object files into the final binary.)
To achieve a statically linked executable containing ``libpython`` and extension modules, you effectively need to build everything from source. And if you want to distribute that executable, you often need to build with special toolchains to ensure binary portability.
There is plenty of room for Python to better support static linking. A possible good place to start would be for packaging tools to support building extension modules which don’t rely on a dynamic ``libpython``. If artifacts containing the raw object files designed for static linking were made available on PyPI, PyOxidizer could download pre-built binaries and link them directly into an executable or custom ``libpython``. This would avoid having to recompile said extension modules at repackaging time. The compatibility guarantees would likely look a lot like existing binary wheels.
On a related front, it would be nice if musl libc based binary wheels were standardized. There are some concerns about the performance and compatibility of musl libc when it comes to Python. But musl libc is a valid deploy target nonetheless and it would be nice if Python officially supported it. (FWIW the performance concerns seem to stem from memory allocator performance and PyOxidizer supports using jemalloc as the allocator, bypassing this problem.)
Windows Embeddable Distributions Missing Functionality¶
The Windows embeddable zip file distributions of CPython are missing certain functionality.
The distributions do not contain source code for Python modules in the standard library. This means PyOxidizer can’t easily bundle sources from these distributions.
The ``ensurepip`` module is not present in the distribution. So there is no way to install ``pip`` using the distribution itself.
The ``venv`` module is also not present in the distribution. So there’s no way to create virtualenvs using the distribution itself.
The Python C development headers are not part of the distribution, so even if you install packaging tools, you can’t build C extensions.
Ambiguous File Classification¶
This is somewhat related to the previous section but is more generic.
Python’s default path-based importer dynamically looks for presence of various files on the filesystem and loads the first type variant (extension module, bytecode, source, etc) discovered.
PyOxidizer’s importer indexes resources during packaging and its import-time resource resolution is static: the type of resource is baked into the definition of the resource.
These approaches are somewhat at odds with each other. The path-based importer is dynamic in nature: it defers answering questions until a specific resource is requested. PyOxidizer’s importer is static / pre-compiled: it must classify a resource based on its filename/path so it can bake that knowledge into an immutable data structure. It does not have knowledge of what names will be requested at run-time.
Bridging this divide has revealed various ambiguities and corner cases in the filenames of Python resources.
The Python extension module or shared library ambiguity is described above.
There is also an ambiguity with extra files that aren’t part of a known Python package. If you attempt to classify every file in a ``sys.path`` directory, it is tempting to classify a file as a Python module (``.py``, ``.pyc``, or extension module), package resource (``importlib.resources``), or package metadata (e.g. ``.dist-info`` files accessed via ``importlib.metadata``). However, there exists the possibility that a file is not obviously classified as one of these.
For example, a file ``foo/libfoo.so`` without the presence of a ``foo/__init__.py`` file is ambiguous. We could say this is an extension module (``foo.libfoo``) due to the extension module shared library ambiguity. We could also consider this a package resource ``foo:libfoo.so`` or ``"":foo/libfoo.so``. Although the latter case of using an empty string for the package name doesn’t make much sense. And we arguably shouldn’t consider it a resource of ``foo`` because no obvious ``foo`` Python package exists!
This is relevant in the real world because various Python packages rely on installing arbitrary files in ``sys.path`` directories. For example, ``numpy`` installs files like ``numpy.libs/libz-eb09ad1d.so.1.2.3``, where the ``numpy.libs`` directory only contains files with the extensions ``*.so[.*]``. Note that this example is particularly confusing because the directory names in ``sys.path`` directories are typically split on ``.`` and correspond to Python [sub-]packages.
Because there is no unambiguous way to classify all files in a ``sys.path`` directory and because Python packaging tools allow the presence of files not contained within a known Python package (identified by the presence of an ``__init__`` file/module), this externalizes the requirement to introduce an "other" classification of files. And because a specific file can’t easily be classified as a specific type, this effectively prevents the use of resource loading techniques not involving explicit filesystem I/O without significant smarts. I.e. because PyOxidizer cannot easily unambiguously identify file X as a specific type, it is forced to materialize that file at a similar location on the run-time system. However, if runtimes like PyOxidizer were able to identify the type of a file by its file extension and/or presence of other files, it would know exactly how to load/treat the file at run-time without having to resort to heuristics.
This ambiguity effectively means that PyOxidizer needs to:
Determine if a file is a shared library or not (because shared libraries are treated specially and we can’t unambiguously identify a shared library from its file extension).
Examine symbols within shared libraries to see if a Python extension module is present (via presence of ``PyInit_*`` symbols).
Preserve extra files not present in a Python package. (In the case of numpy, there are no obvious links to the shared libraries in the ``numpy.libs`` directory: this relative path is encoded within the extension module shared library via e.g. ``DT_NEEDED``.)
The most robust mitigation to this ambiguity is for all files associated with an installable Python package/distribution to be annotated with their type and for Python package installers to refuse to process files that aren’t identified. This could be achieved by having a ``.dist-info/`` file annotating the role of each file.
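To make the problem concrete, here is a naive suffix-based classifier built on the suffix lists in ``importlib.machinery``. It is a sketch of the heuristic a static indexer might start from, not PyOxidizer’s actual logic; the function name ``classify`` and the category labels are invented for this example.

```python
import importlib.machinery
from pathlib import PurePosixPath


def classify(path):
    """Guess a resource type from the filename suffix alone."""
    name = PurePosixPath(path).name
    if name.endswith(tuple(importlib.machinery.SOURCE_SUFFIXES)):
        return "source"
    if name.endswith(tuple(importlib.machinery.BYTECODE_SUFFIXES)):
        return "bytecode"
    if name.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES)):
        # The suffix cannot tell an extension module from a plain shared
        # library; that requires inspecting the file for PyInit_* symbols.
        return "extension-or-shared-library"
    return "other"


for example in (
    "foo/__init__.py",
    "foo/bar.pyc",
    "foo/libfoo.so",
    "numpy.libs/libz-eb09ad1d.so.1.2.3",
):
    print(example, "->", classify(example))
```

The result for ``foo/libfoo.so`` depends on the platform’s extension suffixes, and the versioned numpy library falls through to ``other``, which is exactly the bucket that forces files to be materialized on the filesystem.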
Push Harder for Wheels¶
Wheels are superior for Python packaging distribution because they are more static and follow a finite set of rules for how they should be installed. In theory, one could write code to install a wheel in any programming language. Non-wheel distributions, however, are a different matter entirely. A ``.tar.gz`` source distribution often relies on running a ``setup.py`` file, which requires a Python interpreter.
In the ideal world, PyOxidizer doesn’t care about how a package is built: just the files that comprise the installed package. So wheels are a more desirable distribution format. In fact, PyOxidizer has Rust code for extracting wheels and repackaging their contents: no Python necessary. This means PyOxidizer can do things like download wheels targeting non-native architectures and it just works.
As good as wheels are, they aren’t universal in Python land. There are tons of packages that don’t have wheel distributions and continue to offer only the older ``.tar.gz`` distribution format.
We would like to see a concerted effort to push harder for the presence of wheels. For example, PyPI could encourage/nag package maintainers to upload wheels.
No Way to Hook ``open()``¶
``oxidized_importer`` wants to load Python modules and resource data from memory, without using files.
There is a convention of using virtual paths to express paths within some other entity. e.g. with the zip importer, ``/path/to/archive.zip/foo.py`` refers to the path ``foo.py`` within the ``/path/to/archive.zip`` zip file. It is also common to use the current executable’s path to refer to entities within the current executable. e.g. ``/path/to/myapp/foo.py`` would refer to a ``foo.py`` somehow embedded in the ``/path/to/myapp`` executable.
These virtual paths are a great idea. You can even implement ``pathlib.Path`` around these paths and have a custom ``Path.open()`` that does custom I/O. However, it is really easy for these paths to leak and to get fed in to ``io.open()`` or similar APIs for operating on filesystem paths. For example, someone does ``open(foo.__path__, "rb")`` instead of ``foo.__path__.open("rb")``. If this happens, you’ll likely get an I/O error since virtual paths aren’t real filesystem paths.
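The leak can be reproduced with the standard library’s zip support. This sketch builds a throwaway archive, then contrasts a plain ``open()`` on the virtual path with ``zipfile.Path``, which plays the role of the custom ``Path.open()`` described above. The file names are arbitrary.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "archive.zip")
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("foo.py", "VALUE = 1\n")

    # A "virtual path": it names a member inside the archive, not a real file.
    virtual_path = os.path.join(archive, "foo.py")

    try:
        with open(virtual_path, "rb") as fh:
            fh.read()
        leaked_open_failed = False
    except OSError:
        # The filesystem knows nothing about archive members.
        leaked_open_failed = True

    # A path object that understands the archive performs the I/O itself.
    with zipfile.ZipFile(archive) as zf:
        via_custom_open = zipfile.Path(zf, "foo.py").read_text(encoding="utf-8")

print(leaked_open_failed, repr(via_custom_open))
```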
It would be really nice if Python had some abstraction around filesystem I/O that allowed custom paths to be registered. This is what schemes in URIs and URLs are for. e.g. ``file:///path/to/file``. However, schemes aren’t paths per se. So if we want to preserve compatibility with a path based API and allow ``io.open()`` to work with virtual paths, we need a mechanism to register a hook to intercept ``io.open()`` (and possibly other I/O operations like ``stat()``) so we can plumb in a custom I/O implementation.
PEP 578 almost does this with ``PyFile_SetOpenCodeHook()`` and the ``io.open_code()`` mechanism. But ``io.open_code()`` is only for a limited use case and isn’t generally usable.
``sys.executable`` is a String Instead of List¶
Python applications often want to invoke a new Python interpreter process. Generally, you use ``sys.executable`` to find the filesystem path to ``python`` then run that executable.
This is all fine for traditional Python interpreter install layouts that have a ``python`` executable. However, in embedded contexts, there may not be a ``python`` executable. Rather, the application embedding Python may provide a more advanced way to invoke a Python interpreter. e.g. ``myapp python <interpreter arguments>``.
Since ``sys.executable`` is a string and is often fed directly into ``exec()``, it isn’t possible to express a multi-argument "run a Python interpreter" value through ``sys.executable``.
To do this robustly while maintaining backwards compatibility, we need a new attribute somewhere that defines a list of arguments for invoking a Python interpreter. In traditional Python install environments, this would be ``[sys.executable]``.
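A caller written against such an attribute might look like the sketch below. The attribute name ``hypothetical_interpreter_argv`` is made up for this example and does not exist in Python; the fallback mirrors the ``[sys.executable]`` default described above.

```python
import subprocess
import sys


def interpreter_argv():
    """Return the argument list used to start a Python interpreter.

    "hypothetical_interpreter_argv" is an invented attribute name; an
    embedding application could set it to e.g. ["/path/to/myapp", "python"].
    """
    return list(getattr(sys, "hypothetical_interpreter_argv", [sys.executable]))


# Callers extend the list instead of assuming a single executable path.
result = subprocess.run(
    [*interpreter_argv(), "-c", "print('hello from child')"],
    check=True,
    capture_output=True,
    text=True,
)
child_output = result.stdout.strip()
print(child_output)
```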
This idea was proposed at https://mail.python.org/archives/list/python-ideas@python.org/thread/O66N56PB4U6AGICGBSRFD2OWA5JWMFC6/#O66N56PB4U6AGICGBSRFD2OWA5JWMFC6.