Mercurial, SHA-1, and Trusting Version Control

February 28, 2017 at 12:40 PM | categories: Mercurial, Mozilla

The Internet went crazy on Thursday when Google announced a SHA-1 collision. This has spawned a lot of talk about the impact of SHA-1 in version control. Linus Torvalds (the creator of Git) weighed in on the Git mailing list and on Google+. There are also posts like SHA1 collisions make Git vulnerable to attakcs by third-parties, not just repo maintainers outlining the history of Git and SHA-1. On the Mercurial side, Matt Mackall (the creator of Mercurial) authored a SHA-1 and Mercurial security article. (If you haven't read Matt's article, please do so now before continuing.)

I'd like to contribute my own take on the problem with a slant towards Mercurial and while also comparing Mercurial's exposure to SHA-1 collisions to Git's. Since this is a security topic, I'd like to explicitly state that I'm not a cryptographer. However, I've worked on a number of software components that do security/cryptography (like Firefox Sync) and I'm pretty confident saying that my grasp on cryptographic primitives and security techniques is better than the average developer's.

Let's talk about Mercurial's exposure to SHA-1 collisions on a technical level.

Mercurial, like Git, is vulnerable to SHA-1 collisions. Mercurial is vulnerable because its logical storage mechanism (like Git's) indexes tracked content by SHA-1. If two objects with differing content have the same SHA-1, content under version control could be changed and detecting that would be difficult or impossible. That's obviously bad.

But, Mercurial's exposure to SHA-1 collisions isn't as severe as Git's. To understand why, we have to understand how each stores data.

Git's logical storage model is a content-addressable key-value store. Values (objects in Git parlance) consist of a header identifying the object type (commit, tree, blob, or tag), the size of the data (as a string), and the raw content of the thing being stored. Common content types are file content (blob), a list of files (tree), and a description of a commit (commit). Keys in this blob store are SHA-1 hashes of objects. All Git objects go into a single namespace in the Git repository's store. A beneficial side-effect of this is data de-duplication: if the same file is added to a Git repository, it's blob object will be identical and it will only be stored once by Git. A detrimental side-effect is that hash collisions are possible between any two objects, irregardless of their type or location in the repository.

Mercurial's logical storage model is also content-addressable. However, it is significantly different from Git's approach. Mercurial's logical storage model allocates a separate sub-store for each tracked path. If you run find .hg/store -name '*.i' inside a Mercurial repository, you'll see these files. There is a separate file for each path that has committed data. If you hg add foo.txt and hg commit, there will be a data/foo.txt.i file holding data for foo.txt. There are also special files 00changelog.i and 00manifest.i holding data for commits/changesets and file lists, respectively. Each of these .i files - a revlog - is roughly equivalent to an ordered collection of Git objects for a specific tracked path. This means that Mercurial's store consists of N discrete and independent namespaces for data. Contrast with Git's single namespace.

The benefits and drawbacks are the opposite of those pointed out for Git above: Mercurial doesn't have automatic content-based de-duplication but it does provide some defense against hash collisions. Because each logical path is independent of all others, a Mercurial repository will happily commit two files with different content but same hashes. This is more robust than Git because a hash collision is isolated to a single logical path / revlog. In other words, a random file added to the repository in directory X that has a hash collision with a file in directory Y won't cause problems.

Mercurial also differs significantly from Git in terms of how the hash is obtained. Git's hash is computed from raw content preceded by a header derived directly from the object's role and size. (Takeaway: the header is static and can be derived trivially.) Mercurial's hash is computed from raw content preceded by a header. But that header consists of the 20 byte SHA-1 hash(es) of the parent revisions in the revlog to which the content is being added. This chaining of hashes means that the header is not always static nor always trivially derived. This means that the same content can be stored in the revlog under multiple hashes. It also means that it is possible to store differing content having a hash collision within the same revlog! But only under some conditions - Mercurial will still barf in some scenarios if there is a hash collision within content tracked by the revlog. This is different from Git's behavior, where the same content always results in the same Git object hash. (It's worth noting that a SHA-1 collision on data with a Git object header has not yet been encountered in the wild.)

The takeaway from the above paragraphs is Mercurial's storage model is slightly more robust against hash collisions than Git's because there are multiple, isolated namespaces for storing content and because all hashes are chained to previous content. So, when SHA-1 collisions are more achievable and someone manages to create a collision for a hash used by version control, Mercurial's storage layer will be able to cope with that better than Git's.

But the concern about SHA-1 weakness is more about security than storage robustness. The disaster scenario for version control is that an attacker could replace content under version control, possibly undetected. If one can generate a hash collision, then this is possible. Mercurial's chaining of content provides some defense, but it isn't sufficient.

I agree with Matt Mackall that at the present time there are bigger concerns with content safety than SHA-1 collisions. Namely, if you are an attacker, it is much easier to introduce a subtle bug that contains a security vulnerability than to introduce a SHA-1 collision. It is also much easier to hack the canonical version control server (or any user or automated agent that has permissions to push to the server) and add a bad commit. Many projects don't have adequate defenses to detect such bad commits. Ask yourself: if a bad actor pushed a bad commit to my repository, would it be detected? Keep in mind that spoofing author and committer metadata in commits is trivial. The current state of Mercurial and Git rely primarily on trust - not SHA-1 hashes - as their primary defense against malicious actors.

The desire to move away from SHA-1 has been on the radar of the Mercurial project for years. For 10+ years, the revlog data structure has allocated 32 bytes for hashes while only using 20 bytes for SHA-1. And, the topic of SHA-1 weakness and desire to move to something stronger has come up at the developer sprints for the past several years. However, it has never been pressing enough to act on because there are bigger problems. If it were easy to change, then Mercurial likely would have done it already. But changing is not easy. As soon as you introduce a new hash format in a repository, you've potentially locked out all legacy versions of the Mercurial software from accessing the repository (unless the repository stores multiple hashes and allows legacy clients to access the legacy SHA-1 hashes). There are a number of concerns from legacy compatibility (something Mercurial cares deeply about) to user experience to even performance (SHA-1 hashing even at 1000+MB/s floats to the top of performance profiling for some Mercurial operations). I'm sure the topic will be discussed heavily at the upcoming developers sprint in a few weeks.

While Mercurial should (and will eventually) replace SHA-1, I think the biggest improvement Mercurial (or Git for that matter) can make to repository security is providing a better mechanism for tracking and auditing trust. Existing mechanisms for GPG signing every commit aren't practical or are a non-starer for many workflows. And, they rely on GPG, which has notorious end-user usability problems. (I would prefer my version control tool not subject me to toiling with GPG.) I've thought about this topic considerably, authoring a proposal for easier and more flexible commit signing. There is also a related proposal to establish a cryptographically meaningful chain-of-custody for a patch. There are some good ideas there. But, like all user-facing cryptography, the devil is in the details. There are some hard problems to solve, like how to manage/store public keys that were used for signatures. While there is some prior art in version control tools (see Monotone), it is far from a solved problem. And at the end of the day, you are still left having to trust a set of keys used to produce signatures.

While version control can keep using cryptographically strong hashes to mitigate collisions within its storage layer to prevent content swapping and can employ cryptographic signatures of tracked data, there is still the issue of trust. Version control can give you the tools for establishing and auditing trust. Version control can also provide tools for managing trust relationships. But at the end of the day, the actual act of trusting trust boils down to people making decisions (possibly through corporate or project policies). This will always be a weak link. Therefore, it's what malicious actors will attack. The best your version control tool can do is give its users the capability and tools to run a secure and verifiable repository so that when bad content is inevitably added you can't blame the version control tool for having poor security.