Gregory Szorc's Digital Home | Making hg-git Faster

Making hg-git Faster

April 14, 2013 at 09:45 PM | categories: Mozilla, Mercurial, Git

When enterprising individuals at Mozilla started maintaining a Git mirror of Firefox's main source repository (hosted in Mercurial), they ran into a significant problem: conversion was slow. The initial conversion apparently took over 6 days and used a lot of memory. Furthermore, each subsequent commit took many seconds, even on modern hardware. This meant that the they could only maintain a Git mirror of a few project branches and that updates would be slow. Furthermore, the slowness of the conversion significantly discouraged people from using the tool locally as part of regular development.

I thought this was unacceptable. I wanted to enable people to use their tool of choice (Git) to develop Firefox. So, I did what annoyed engineers do when confronted with an itch: I scratched it.

Diagnosing the Problem

When I started tackling this problem, I had little knowledge of the problem space other than the problem statement: converting from Mercurial to Git is prohibitively slow and that the slow tool was hg-git. My challenge was thus to make hg-git faster.

When confronted with a performance problem, one of the first things you do is identify the source of the bad performance. Then, you need to acertain whether that is something you have the ability to change.

This often starts by answering some high-level questions then drilling down into more detail as necessary. For a long-running system tool like hg-git, I start with the top test: how much CPU, memory, and I/O is the process utilizing?

In the case of hg-git, we were CPU bound. The Python process was consistently pegging a single CPU core while periodically incurring I/O (but not nearly enough to saturate a magnetic disk). This told me a few things. First, I should look for bottlenecks inside Python. Second, I should investigate whether parallel execution would be possible. The latter is especially important these days because the trend in processors is towards more cores rather than higher clock speeds. It's no longer acceptable to let increases in clock speed or cycle efficiency bail you out: if you want a CPU bound process to run as fast as possible, it's often necessary to involve all available CPU cores.

Once I diagnosed CPU as the limiting factor, I pulled out the next tool in the arsenal: a code profiler. I quickly discovered exactly where the conversion was taking the most CPU time. As feared, it was in the export Mercurial changeset to Git commit function. Specifically, profiling had flagged the conversion of Mercurial manifests to Git trees and blobs. Furthermore, most of the time was spent in functions in Mercurial itself (Mercurial is implemented in Python and hg-git calls into it natively) and Dulwich (a pure Python implementation of Git). So, I was either looking at deficiencies or Mercurial and/or Dulwich, a bad conversion algorithm in hg-git, or both. To know which, I would need a better grasp on the internal storage models of Mercurial and Git.

Learning about Mercurial's and Git's internal storage models

To understand why conversion from Mercurial to Git was slow, I needed to understand how each stored data internally. My hope was that if attained better understanding I could apply the knowledge to assess the algorithm hg-git was using and optimize it, hopefully introducing parallel execution along the way.

Git's internals

I already had a fairly good understanding of how Git works internally. And, it's quite simple really. The Git Internals chapter of the Pro Git is extremely useful. While I encourage readers to read all of the Git Objects section, the gist is:

Git's core storage is a key-value data store. Keys are SHA-1 checksums of content. Each entity is storage in a Git object.
A blob is an object holding the raw content of a file.
A tree is an object holding a list of tree entries. Each tree entry defines a blob, another tree object, etc. A tree is essentially a directory listing.
A commit object holds metadata about an individual Git commit. Each commit object refers to a specific tree object.

When you introduce a new file that hasn't been seen before, a new blob is added to storage. That blob is referenced by a tree. When you update a file, a new tree is created referring to the new blob that was created.

Things get a little complicated when you consider directories. If you update the file foo/bar/baz.c, the tree for foo/bar changes (because the SHA-1 of baz.c changed). And, the SHA-1 for the foo/bar tree changes, so the bar entry in foo's tree changes, changing the SHA-1 for the root tree.

That's essentially how Git addresses commits, directories, and files. If you don't grok this, please, please read the aforementioned page on it - it may even help you better grok Git!

Mercurial's internals

Unlike Git, I didn't really have a clue how Mercurial worked internally. So, I needed to do some self-education here.

The best resource for Mercurial's storage model I've found is the Behind the Scenes chapter from Mercurial: The Definitive Guide. The gist is:

History for an individual file is stored in a filelog. Each filelog contains the history of a single file. Each file revision has a hash based on the file contents.
The manifest lists every file, its permissions, and its file revision for each changeset in the repository.
The changelog contains information about each changeset, including the revision of the manifest to use.
Each of these logs contain revisions and you can address an individual revision within the log.

From a high level, Mercurial's storage model is very similar to Git's. They both address files by hashing their content. Where Git uses multiple tree objects to define every file in a commit, Mercurial has a single manifest containing a flat list. Aside from that, the differences are mostly in implementation details. These are important, as we'll soon see.

Analyzing hg-git's conversion algorithm

Armed with knowledge of how Git and Mercurial internally store data, I was ready to analyze how hg-git was performing conversion from Mercurial to Git. Since profiling revealed it was the convert a single changeset into Git commit function that was taking all the time, I started there.

In Python (but not the actual Python), the algorithm was essentially:

def export_changeset_to_git(changeset, git, already_converted):
    """Receives the Mercurial changeset and a handle on a Git object storre."""
    # This is an entity that helps us build Git tree objects from
    # paths and blobs. The logic is at
    # https://github.com/jelmer/dulwich/blob/2a8548be3b1fd4a1ae7d0436dce91611112c47c2/dulwich/index.py#L298
    tree_builder = TreeBuilder()

    for file in changeset.manifest:
        blob_id = already_converted.get(file.id, None)

        if blob_id is None:
            blob = Blob(file.data())
            git.store(blob.id, blob.content)
            already_converted[file.id] = blob.id
            blob_id = blob.id

        tree_builder.add_file(file.path, blob_id, file.mode)

    for tree in tree_builder.all_trees():
        git.store(tree.id, tree.content)

    root_tree = tree_builder.root_tree

    # And proceed to build the Git commit and insert it.

On the face of it, this code doesn't seem too bad. If I were writing the functionality from scratch, I'd likely do something very similar. So, why is it so slow?

As I mentioned earlier, profiling results had identified Mercurial and Dulwich as the hot spots. The Mercurial hotspot was in iteration over the files in the manifest. And the Dulwich offender with Git tree object construction. By why?

First, it turns out that iterating a manifest the way hg-git was isn't exactly performant. I never traced all the gory details, but I'm pretty sure every time it accessed the file context through the change context there was I/O involved. Not good, especially if you may not need the information contained if it was already cached!

Second, it turns out that creating Git tree objects in Dulwich is rather slow. And, the problem is magnified when converting large repositories - like mozilla-central (Firefox's canonical repository).

So, I was faced with a decision: make Mercurial and/or Dulwich faster or change hg-git. Since improving these would have benefits outside of hg-git, I initially went down those roads. However, I eventually abandonded the effort because of effort involved. And, in the case of Dulwich, improving things would likely require rewriting some pieces in C - not something I cared to do nor something that the Dulwich people would likely accept since Dulwich is all about being a pure Python implementation of Git! And in hindsight, this was the right call. Mercurial and Dulwich are fast enough - it's hg-git that was being suboptimal.

I was faced with two problems: don't mass iterate over manifests and don't mass generate Git trees. Both were seemingly impossible to avoid because both are critical to converting a Mercurial changeset to Git.

I thought about this problem for a while. I experimented with numerous micro benchmarks. I engaged the very helpful Mercurial developers on IRC (thanks everyone!). And, I eventually arrived at what I think is an elegant solution.

When I took a step back and looked at the larger problem of exporting Mercurial changesets to Git, I realized it would be beneficial in terms of efficiency for the conversion to be more aware of what had occurred before. Before I came along, hg-git was asking Mercurial for the full state of each changeset for each changeset conversion. When you think about it in low-level operations, this is extremely inefficient. Let's take Git trees as an example.

When you perform a commit, only the trees - and their parents - that had modified files will change. All the other trees will be identical across commits. For large repositories (in terms of files and directories) like mozilla-central, the number of static trees across small commits is quite significant compared to changed trees. The overhead of computing all these trees is not insignificant!

Instead of throwing away all the trees and file context information between changeset exports, what if I preserved it and reused it for the next changeset? I think I'm on to something...

Implementing incremental changeset export

To minimize the work performed when exporting Mercurial changesets to Git, I implemented a standalone class that can emit Mercurial changeset deltas in terms of Git objects. Essentially, it caches a Git tree representation of a Mercurial manifest. When you feed a new Mercurial changeset into it, it asks Mercurial to compare those changesets using the same API used by hg status. This API is efficient and returns the information I care about: the paths that changed. Once we have the changed files, we simply reflect those changes in terms of updating Git trees.

If a file changes or is added, we emit a blob. If a tree changes, we emit the new tree object. When the consumer has finished writing the set of new objects to Git, it asks for the SHA-1 of the root tree. (Up until this point the consumer is not aware of what any of the emitted objects actually are - just that they likely need to be added to storage.) It then uses the SHA-1 of the root tree to construct the commit. Then it moves on to the next changeset.

The impact of this change is significant. On my computer, converting Mercurial's own Mercurial repository Git went from 21:07 to 8:14 on my i7-2600k. mozilla-central is even more drastic. The first 200 commits (the first commit was a large dump from CVS) took 8:17 before and now take 2:32. I don't have exact numbers from newer commits, but I do know they were at least twice as slow as the initial commits and showed an even more drastic speedup.

But I was just getting started.

The initial implementation wasn't very efficient in terms of reducing tree object calculations. I changed that earlier today when I submitted a patch for consideration that only calculates tree changes for trees that actually changed. I also removed some needless sorting on the order of export operations. This second patch reduced conversion of Mercurial's repository down to 5:33. Even more impressive is that mozilla-central's changesets are now exporting almost 4x faster with this patch alone. The first 200 changesets now export in 42s (down from 2:32 which is down from 8:17). This is mostly due to the overhead of reprocessing non-dirty trees on every export.

And I'm not through.

As part of building the standalone incremental changeset exporter, one of the goals in the back of my mind was to eventually have things execute in parallel.

In my personal development branch I have a patch to perform Mercurial changeset export on multiple cores. Essentially hg-git fires up a bunch of worker processes and asks each to export a consecutive range of changesets. The workers writes new Git objects into Git and then tells the coordinator process the root tree SHA-1 corresponding to each Mercurial changeset. The coordinator process then uses these root tree SHA-1's to derive Git commit objects (you can't create the commit object until you know the SHA-1 of the commit's parents).

The blob and tree exporting on separate processes makes Mercurial to Git export scale out to however many cores you feel like throwing at it. When 32 core machines come around, you can convert using all available cores and the speedup should roughly be linear with the number of cores.

I'm still working out some kinks in the multiple processes patch (the multiprocessing module is very difficult to get working on all platforms and I don't want to break hg-git when it lands). But, Ehsan Akhgari has been using it to power the GitHub mirror of mozilla-central for months without issue. (His use of these patches freed up the CPU required to support conversion of more project branches on the Git mirror. And, he's still not using the 4x improvement patch I wrote today - he will shortly - so who knows what improvements will stem from that.)

With all the patches applied, hg-git now feels like a Ferrari when exporting Mercurial changesets to Git. Conversion of Mercurial's repository now takes 1:25 (down from 21:07). Conversion of mozilla-central has gone from 6+ days to about 3 hours! More importantly, ongoing conversions feel somewhat snappy now.

Making Git export even faster

With the patch today, I'd say optimization of exporting Mercurial changesets is nearing its limits. There are a few things I could try that may net another 2 or 3x improvement. But, I think the ~50x improvement I've already attained (at least for mozilla-central) is pretty damn good and good enough for most users. (Part of performance optimization is knowing when is good enough and stopping before you invest excessive time in the long tail.)

There is one giant refactor that could likely net a significant win for Git export. However, it requires optimizing for initial export over recurring incremental export (which is why I have little interest in it). Incremental export incurs a lot of random I/O accessing Mercurial filelogs and extracting specific file revisions as they are needed. An optimal export would iterate over the filelogs and export Git blobs from each filelog in the sequence they occur in within the filelogs. It would cache the file node to blob SHA-1. After all blobs are exported, the mappings would be combined and distributed to all workers. Then, tree export would occur in parallel largely under the existing model modulo blob writing. This would minimize overall I/O and work in Mercurial and would likely be significantly faster. However, it's mostly useful for initial export and IMO not worth implementing. (It's possible to employ a variation for incremental export that iterates over filelogs and exports not-yet-seen revisions. Perhaps I will investigate this some day.)

What about converting Git to Mercurial?

Now that I've tackled Mercurial to Git conversion, it's very tempting to work magic on the inverse: converting Git commits to Mercurial changesets. While I haven't looked at this problem in detail, I already know it will be at least slightly more challenging.

The reason is parallelization. With Mercurial export, I have each child process reading directly from Mercurial and writing directly to Git. There are no locks involved. There is just a coordinator that ensures minimum redundant work among workers. There is some redundant work, sure. But, the alternative would be lots of locking and/or exchange of state across processes - not cheap operations! Furthermore, the writes into Git can occur in any order (since Git is just a key-value store). The only hard requirement is a child commit must come after its parent (because you need the parent commit's SHA-1). And, single-threaded insert of commit objects isn't a big deal because you can crank through hundreds of them per second (it might even be over 1000/s on my machine).

Mercurial's storage implementation does not afford me the same carelessness with regards to writing into storage. Since Mercurial uses shared files for individual file and manifest history, we have a contention problem. We could lock files when writing to them. However, these files (revlogs in Mercurial speak) also use transparent delta compression. You get the best performance/compression when changes are written in the order they actually occured in (at least in the typical case).

To optimally write to Mercurial you need to order inserts. This means parallel reads from Git (in separate worker processes) would be very difficult to implement. Doable, sure, but you're looking at a lot of transferred state and ordering. This likely involves a lot more memory and CPU usage.

The best idea I've come up with so far is a single process that reads off Git commits and iterates trees. It hashes the paths of seen files to a consistent worker process which then pulls the blob from Git's storage and inserts it into the filelog. You don't need to lock filelogs because only one worker owns a specific path. Workers report the blob's corresponding file node to another process which then assembles manifests, writes manifests in order, and finally creates and writes changesets. Unfortunately, the worker processes are just doing blob I/O. There is no parallel processing of Git tree calculation or Mercurial manifests. Given this was a significant source of slowness exporting to Git, I worry the inverse will be true. Although, the problem with Git was tree creation and it was due to the volume. Since there is only 1 manifest per changeset, perhaps it won't be as bad.

While I've brainstormed a solution, I have no concrete plans to work on Git to Mercurial conversion. The impetus for me working on Mercurial to Git speedups was that I and a number of other Mozilla people were personally impacted. If the same is true for Git to Mercurial slowness, I could invest a few hours the next time I'm sick and bored over the weekend.

Conclusion

Converting Mercurial repositories to Git with hg-git is now significantly faster. If you thought it was too slow before, grab the latest code (from either the official repository or my personal branch) and enjoy.