Gregory Szorc's Digital Home | from __past__ import bytes

from past import bytes_literals

March 13, 2017 at 09:55 AM | categories: Python, Mercurial, Mozilla

Last year, I simultaneously committed one of the ugliest and impressive hacks of my programming career. I haven't had time to write about it. Until now.

In summary, the hack is a source-transforming module loader for Python. It can be used by Python 3 to import a Python 2 source file while translating certain primitives to their Python 3 equivalents. It is kind of like 2to3 except it executes at run-time during import. The main goal of the hack was to facilitate porting Mercurial to Python 3 while deferring having to make the most invasive - and therefore most annoying - elements of the port in the canonical source code representation.

For the technically curious, it works as follows.

The hg Python executable registers a custom meta path finder instance. This entity is invoked during import statements to try to find the module being imported. It tells a later phase of the import mechanism how to load that module from wherever it is (usually a .py or .pyc file on disk) to a Python module object. The custom finder only responds to requests for modules known to be managed by the Mercurial project. For these modules, it tells the next stage of the import mechanism to invoke a custom SourceLoader instance. Here's where the real magic is: when the custom loader is invoked, it tokenizes the Python source code using the tokenize module, iterates over the token stream, finds specific patterns, and rewrites them to something more appropriate. It then untokenizes back to Python source code then falls back to the built-in loader which does the heavy lifting of compiling the source to Python code objects. So, we have Python 2 source files on disk that magically get transformed to be Python compatible when they are loaded by Python 3. Oh, and there is no performance penalty for the token transformation on subsequence loads because the transformed bytecode is cached in the .pyc file (using a custom header so we know it was transformed and can be invalidated when the transformation logic changes).

At the time I wrote it, the token stream manipulation converted most string literals ('') to bytes literals (b''). In other words, it restored the Python 2 behavior of string literals being bytes and not unicode. We jokingly call it from __past__ import bytes_literals (a play on Python 2's from __future__ import unicode_literals special syntax which changes string literals from Python 2's str/bytes type to unicode to match Python 3's behavior).

Since I implemented the first version, others have implemented:

Automatically inserting a from mercurial.pycompat import ... statement to the top of the source. This statement is the Mercurial equivalent of importing common wrapper types similar to what six provides.
More robust function argument parsing support. (Because going from a token stream to a higher-level primitive like a function call is difficult.)
Automatically rewriting .iteritems() to .items().

As one can expect, when I tweeted a link to this commit, many Python developers (including a few CPython core developers) expressed a mix of intrigue and horror. But mostly horror.

I fully concede that what I did here is a gross hack. And, it is the intention of the Mercurial project to undo this hack and perform a proper port once Python 3 support in Mercurial is more mature. But, I want to lay out my defense on why I did this and why the Mercurial project is tolerant of this ugly hack.

Individuals within the Mercurial project have wanted to port to Python 3 for years. Until recently, it hasn't been a project priority because a port was too much work for too little end-user gain. And, on the technical front, a port was just not practical until Python 3.5. (Two main blockers were no u'' literals - restored in Python 3.3 - and no % formatting for b'' literals - restored in 3.5. And as I understand it, senior members of the Mercurial project had to lobby Python maintainers pretty hard to get features like % formatting of b'' literals restored to Python 3.)

Anyway, after a number of failed attempts to initiate the Python 3 port over the years, the Mercurial project started making some positive steps towards Python 3 compatibility, such as switching to absolute imports and addressing syntax issues that allowed modules to be parsed into an AST and even compiled and loadable. These may seem like small steps, but for a larger project, it was a lot of work.

The porting effort hit a large wall when it came time to actually make the AST-valid Python code run on Python 3. Specifically, we had a strings problem.

When you write software that exchanges data between machines - sometimes machines running different operating systems or having different encodings - and there is an expectation that things work the same and data roundtrips accordingly, trying to force text encodings is essentially impossible and inevitably breaks something or someone. It is much easier for Mercurial to operate bytes first and only take text encoding into consideration when absolutely necessary (such as when emitting bytes to the terminal in the wanted encoding or when emitting JSON). That's not to say Mercurial ignores the existence of encodings. Far from it: Mercurial does attempt to normalize some data to Unicode. But it often does so with a special Python type that internally stores the raw byte sequence of the source so that a consumer can choose to operate at the bytes or Unicode level.

Anyway, this means that practically every string variable in Mercurial is a bytes type (or something that acts like a bytes type). And since string literals in Python 3 are the str type (which represents Unicode), that would mean having to prefix almost every '' string literal in Mercurial with b'' in order to placate Python 3. Having to update every occurrence of simple primitives that could be statically transformed automatically felt like busy work. We wanted to spend time on the meaningful parts of the Python 3 port so we could find interesting problems and challenges, not toil with mechanical conversions that add little to no short-term value while simultaneously increasing cognitive dissonance and quite possibly increasing the odds of introducing a bug in Python 2. In other words, why should humans do the work that machines can do for us? Thus, the source-transforming module importer was born.

While I concede what Mercurial did is a giant hack, I maintain it was the correct thing to do. It has allowed the Python 3 port to move forward without being blocked on the more tedious and invasive transformations that could introduce subtle bugs (including performance regressions) in Python 2. Perfect is the enemy of good. People time is valuable. The source-transforming module importer allowed us to unblock an important project without sinking a lot of people time into it. I'd make that trade-off again.

While I won't encourage others to take this approach to porting to Python 3, if you want to, Mercurial's source is available under a GPL license and the custom module importer could be adapted to any project with minimal modifications. If someone does extract it as reusable code, please leave a comment and I'll update the post to link to it.

from __past__ import bytes_literals

Categories

from past import bytes_literals