from __past__ import bytes_literals
March 13, 2017 at 09:55 AM | categories: Python, Mercurial, MozillaLast year, I simultaneously committed one of the ugliest and impressive hacks of my programming career. I haven't had time to write about it. Until now.
In summary, the hack is a
source-transforming module loader
for Python. It can be used by Python 3 to import a Python 2 source
file while translating certain primitives to their Python 3 equivalents.
It is kind of like 2to3
except it executes at run-time during import
. The main goal of the
hack was to facilitate porting Mercurial to Python 3 while deferring
having to make the most invasive - and therefore most annoying -
elements of the port in the canonical source code representation.
For the technically curious, it works as follows.
The hg
Python executable registers a custom
meta path finder
instance. This entity is invoked during import
statements to try
to find the module being imported. It tells a later phase of the
import mechanism how to load that module from wherever it is
(usually a .py
or .pyc
file on disk) to a Python module object.
The custom finder only responds to requests for modules known
to be managed by the Mercurial project. For these modules, it tells
the next stage of the import mechanism to invoke a custom
SourceLoader
instance. Here's where the real magic is: when the custom loader
is invoked, it tokenizes the Python source code using the
tokenize module,
iterates over the token stream, finds specific patterns, and
rewrites them to something more appropriate. It then untokenizes
back to Python source code then falls back to the built-in loader
which does the heavy lifting of compiling the source to Python code
objects. So, we have Python 2 source files on disk that magically get
transformed to be Python compatible when they are loaded by Python 3.
Oh, and there is no performance penalty for the token transformation
on subsequence loads because the transformed bytecode is cached in
the .pyc
file (using a custom header so we know it was transformed
and can be invalidated when the transformation logic changes).
At the time I wrote it, the token stream manipulation converted most
string literals (''
) to bytes literals (b''
). In other words, it
restored the Python 2 behavior of string literals being bytes
and
not unicode
. We jokingly call it
from __past__ import bytes_literals
(a play on Python 2's
from __future__ import unicode_literals
special syntax which
changes string literals from Python 2's str
/bytes
type to
unicode
to match Python 3's behavior).
Since I implemented the first version, others have implemented:
- Automatically inserting
a
from mercurial.pycompat import ...
statement to the top of the source. This statement is the Mercurial equivalent of importing common wrapper types similar to what six provides. - More robust function argument parsing support. (Because going from a token stream to a higher-level primitive like a function call is difficult.)
- Automatically rewriting
.iteritems()
to.items()
.
As one can expect, when I tweeted a link to this commit, many Python developers (including a few CPython core developers) expressed a mix of intrigue and horror. But mostly horror.
I fully concede that what I did here is a gross hack. And, it is the intention of the Mercurial project to undo this hack and perform a proper port once Python 3 support in Mercurial is more mature. But, I want to lay out my defense on why I did this and why the Mercurial project is tolerant of this ugly hack.
Individuals within the Mercurial project have wanted to port to Python
3 for years. Until recently, it hasn't been a project priority
because a port was too much work for too little end-user gain. And, on
the technical front, a port was just not practical until Python 3.5.
(Two main blockers were no u''
literals - restored in Python 3.3 -
and no %
formatting for b''
literals - restored in 3.5. And as I
understand it, senior members of the Mercurial project had to lobby
Python maintainers pretty hard to get features like %
formatting of
b''
literals restored to Python 3.)
Anyway, after a number of failed attempts to initiate the Python 3 port over the years, the Mercurial project started making some positive steps towards Python 3 compatibility, such as switching to absolute imports and addressing syntax issues that allowed modules to be parsed into an AST and even compiled and loadable. These may seem like small steps, but for a larger project, it was a lot of work.
The porting effort hit a large wall when it came time to actually make the AST-valid Python code run on Python 3. Specifically, we had a strings problem.
When you write software that exchanges data between machines - sometimes machines running different operating systems or having different encodings - and there is an expectation that things work the same and data roundtrips accordingly, trying to force text encodings is essentially impossible and inevitably breaks something or someone. It is much easier for Mercurial to operate bytes first and only take text encoding into consideration when absolutely necessary (such as when emitting bytes to the terminal in the wanted encoding or when emitting JSON). That's not to say Mercurial ignores the existence of encodings. Far from it: Mercurial does attempt to normalize some data to Unicode. But it often does so with a special Python type that internally stores the raw byte sequence of the source so that a consumer can choose to operate at the bytes or Unicode level.
Anyway, this means that practically every string variable in Mercurial
is a bytes
type (or something that acts like a bytes
type). And
since string literals in Python 3 are the str
type (which represents
Unicode), that would mean having to prefix almost every ''
string
literal in Mercurial with b''
in order to placate Python 3. Having
to update every occurrence of simple primitives that could be statically
transformed automatically felt like busy work. We wanted to spend time
on the meaningful parts of the Python 3 port so we could find
interesting problems and challenges, not toil with mechanical
conversions that add little to no short-term value while simultaneously
increasing cognitive dissonance and quite possibly increasing the odds
of introducing a bug in Python 2. In other words, why should humans
do the work that machines can do for us? Thus, the source-transforming
module importer was born.
While I concede what Mercurial did is a giant hack, I maintain it was the correct thing to do. It has allowed the Python 3 port to move forward without being blocked on the more tedious and invasive transformations that could introduce subtle bugs (including performance regressions) in Python 2. Perfect is the enemy of good. People time is valuable. The source-transforming module importer allowed us to unblock an important project without sinking a lot of people time into it. I'd make that trade-off again.
While I won't encourage others to take this approach to porting to Python 3, if you want to, Mercurial's source is available under a GPL license and the custom module importer could be adapted to any project with minimal modifications. If someone does extract it as reusable code, please leave a comment and I'll update the post to link to it.