Mercurial, SHA-1, and Trusting Version Control

February 28, 2017 at 12:40 PM | categories: Mercurial, Mozilla | View Comments

The Internet went crazy on Thursday when Google announced a SHA-1 collision. This has spawned a lot of talk about the impact of SHA-1 in version control. Linus Torvalds (the creator of Git) weighed in on the Git mailing list and on Google+. There are also posts like SHA1 collisions make Git vulnerable to attacks by third-parties, not just repo maintainers outlining the history of Git and SHA-1. On the Mercurial side, Matt Mackall (the creator of Mercurial) authored a SHA-1 and Mercurial security article. (If you haven't read Matt's article, please do so now before continuing.)

I'd like to contribute my own take on the problem, with a slant towards Mercurial, while also comparing Mercurial's exposure to SHA-1 collisions to Git's. Since this is a security topic, I'd like to explicitly state that I'm not a cryptographer. However, I've worked on a number of software components that do security/cryptography (like Firefox Sync) and I'm pretty confident saying that my grasp on cryptographic primitives and security techniques is better than the average developer's.

Let's talk about Mercurial's exposure to SHA-1 collisions on a technical level.

Mercurial, like Git, is vulnerable to SHA-1 collisions. Mercurial is vulnerable because its logical storage mechanism (like Git's) indexes tracked content by SHA-1. If two objects with differing content have the same SHA-1, content under version control could be changed and detecting that would be difficult or impossible. That's obviously bad.

But, Mercurial's exposure to SHA-1 collisions isn't as severe as Git's. To understand why, we have to understand how each stores data.

Git's logical storage model is a content-addressable key-value store. Values (objects in Git parlance) consist of a header identifying the object type (commit, tree, blob, or tag), the size of the data (as a string), and the raw content of the thing being stored. Common content types are file content (blob), a list of files (tree), and a description of a commit (commit). Keys in this object store are SHA-1 hashes of objects. All Git objects go into a single namespace in the Git repository's store. A beneficial side-effect of this is data de-duplication: if the same file content is added to a Git repository twice, its blob object will be identical and it will only be stored once by Git. A detrimental side-effect is that hash collisions are possible between any two objects, regardless of their type or location in the repository.
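
To make the object addressing concrete, here is a minimal Python sketch of how a Git blob's key is derived. The "<type> <size>\0" header followed by the raw content is Git's documented object encoding:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Git hashes a header of the form "<type> <size>\0" followed by the raw content.
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # Matches `git hash-object` for the same content.
    print(git_blob_id(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

Note that the header depends only on the object type and size, so the same content always hashes to the same key.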

Mercurial's logical storage model is also content-addressable. However, it is significantly different from Git's approach. Mercurial's logical storage model allocates a separate sub-store for each tracked path. If you run find .hg/store -name '*.i' inside a Mercurial repository, you'll see these files. There is a separate file for each path that has committed data. If you hg add foo.txt and hg commit, there will be a data/foo.txt.i file holding data for foo.txt. There are also special files 00changelog.i and 00manifest.i holding data for commits/changesets and file lists, respectively. Each of these .i files - a revlog - is roughly equivalent to an ordered collection of Git objects for a specific tracked path. This means that Mercurial's store consists of N discrete and independent namespaces for data. Contrast with Git's single namespace.

The benefits and drawbacks are the opposite of those pointed out for Git above: Mercurial doesn't have automatic content-based de-duplication, but it does provide some defense against hash collisions. Because each logical path is independent of all others, a Mercurial repository will happily commit two files with different content but the same hash. This is more robust than Git because a hash collision is isolated to a single logical path / revlog. In other words, a random file added to the repository in directory X that has a hash collision with a file in directory Y won't cause problems.

Mercurial also differs significantly from Git in terms of how the hash is obtained. Git's hash is computed from raw content preceded by a header derived directly from the object's role and size. (Takeaway: the header is static and can be derived trivially.) Mercurial's hash is also computed from raw content preceded by a header. But that header consists of the 20-byte SHA-1 hash(es) of the parent revisions in the revlog to which the content is being added. This chaining of hashes means that the header is neither always static nor always trivially derived. It means that the same content can be stored in the revlog under multiple hashes. It also means that it is possible to store differing content having a hash collision within the same revlog! But only under some conditions - Mercurial will still barf in some scenarios if there is a hash collision within content tracked by the revlog. This is different from Git's behavior, where the same content always results in the same Git object hash. (It's worth noting that a SHA-1 collision on data with a Git object header has not yet been encountered in the wild.)
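
To make the chaining concrete, here is a minimal Python sketch of how a revlog node ID is derived: the two parent nodes (20-byte binary SHA-1s, or the null node when a parent is absent) are sorted and prepended to the content before hashing. The same content hashed under different parents yields different node IDs:

    import hashlib

    NULL_ID = b"\x00" * 20  # the "null" parent used for root revisions

    def revlog_node(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
        # Parents are hashed in sorted order so the result doesn't depend on
        # which parent is listed first.
        a, b = (p1, p2) if p1 < p2 else (p2, p1)
        s = hashlib.sha1(a)
        s.update(b)
        s.update(text)
        return s.digest()

    # The same content produces different nodes under different parents.
    root = revlog_node(b"hello world\n")
    child = revlog_node(b"hello world\n", p1=root)
    assert root != child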

The takeaway from the above paragraphs is that Mercurial's storage model is slightly more robust against hash collisions than Git's, because there are multiple, isolated namespaces for storing content and because all hashes are chained to previous content. So, once SHA-1 collisions become more achievable and someone manages to create a collision for a hash used by version control, Mercurial's storage layer will be able to cope with that better than Git's.

But the concern about SHA-1 weakness is more about security than storage robustness. The disaster scenario for version control is that an attacker could replace content under version control, possibly undetected. If one can generate a hash collision, then this is possible. Mercurial's chaining of content provides some defense, but it isn't sufficient.

I agree with Matt Mackall that at the present time there are bigger concerns with content safety than SHA-1 collisions. Namely, if you are an attacker, it is much easier to introduce a subtle bug that contains a security vulnerability than to introduce a SHA-1 collision. It is also much easier to hack the canonical version control server (or any user or automated agent that has permissions to push to the server) and add a bad commit. Many projects don't have adequate defenses to detect such bad commits. Ask yourself: if a bad actor pushed a bad commit to my repository, would it be detected? Keep in mind that spoofing author and committer metadata in commits is trivial. Mercurial and Git, as they exist today, rely primarily on trust - not SHA-1 hashes - as their defense against malicious actors.

The desire to move away from SHA-1 has been on the radar of the Mercurial project for years. For 10+ years, the revlog data structure has allocated 32 bytes for hashes while only using 20 bytes for SHA-1. And the topic of SHA-1 weakness and the desire to move to something stronger has come up at the developer sprints for the past several years. However, it has never been pressing enough to act on because there are bigger problems. If it were easy to change, then Mercurial likely would have done it already. But changing is not easy. As soon as you introduce a new hash format in a repository, you've potentially locked all legacy versions of the Mercurial software out of the repository (unless the repository stores multiple hashes and allows legacy clients to access the legacy SHA-1 hashes). There are a number of concerns, from legacy compatibility (something Mercurial cares deeply about) to user experience to even performance (SHA-1 hashing, even at 1000+ MB/s, floats to the top of performance profiles for some Mercurial operations). I'm sure the topic will be discussed heavily at the upcoming developer sprint in a few weeks.

While Mercurial should (and will eventually) replace SHA-1, I think the biggest improvement Mercurial (or Git for that matter) can make to repository security is providing a better mechanism for tracking and auditing trust. Existing mechanisms for GPG signing every commit aren't practical or are a non-starter for many workflows. And they rely on GPG, which has notorious end-user usability problems. (I would prefer my version control tool not subject me to toiling with GPG.) I've thought about this topic considerably, authoring a proposal for easier and more flexible commit signing. There is also a related proposal to establish a cryptographically meaningful chain-of-custody for a patch. There are some good ideas there. But, like all user-facing cryptography, the devil is in the details. There are some hard problems to solve, like how to manage/store the public keys that were used for signatures. While there is some prior art in version control tools (see Monotone), it is far from a solved problem. And at the end of the day, you are still left having to trust a set of keys used to produce signatures.

While version control can keep using cryptographically strong hashes to mitigate collisions within its storage layer to prevent content swapping and can employ cryptographic signatures of tracked data, there is still the issue of trust. Version control can give you the tools for establishing and auditing trust. Version control can also provide tools for managing trust relationships. But at the end of the day, the actual act of trusting trust boils down to people making decisions (possibly through corporate or project policies). This will always be a weak link. Therefore, it's what malicious actors will attack. The best your version control tool can do is give its users the capability and tools to run a secure and verifiable repository so that when bad content is inevitably added you can't blame the version control tool for having poor security.

MozReview Git Support and Improved Commit Mapping

February 08, 2016 at 11:05 AM | categories: MozReview, Mozilla | View Comments

MozReview - Mozilla's Review Board based code review tool - now supports ingestion from Git. Previously, it only supported Mercurial.

Instructions for configuring Git with MozReview are available. Because blog posts are not an appropriate medium for documenting systems and processes, I will not say anything more here on how to use Git with MozReview.

Somewhat related to the introduction of Git support is an improved mechanism for mapping commits to existing review requests.

When you submit commits to MozReview, MozReview has to decide how to map those commits to review requests in Review Board. It has to choose whether to recycle an existing review request or create a new one. When recycling, it has to pick an appropriate one. If it chooses incorrectly, wonky things can happen. For example, a review request could switch to tracking a new and completely unrelated commit. That's bad.

Up until today, our commit mapping algorithm was extremely simple. Yet it seemed to work 90% of the time. However, a number of people found the cracks and complained. With Git support coming online, I had a feeling that Git users would find these cracks with higher frequency than Mercurial users due to what I perceive to be variations in the commit workflows of Git versus Mercurial. So, I decided to proactively improve the commit mapping before the Git users had time to complain.

Both the Git and Mercurial MozReview client-side extensions now insert a MozReview-Commit-ID metadata line in commit messages. This line effectively defines a (likely) unique ID that identifies the commit across rewrites. When MozReview maps commits to review requests, it uses this identifier to find matches. What this means is that history rewriting (such as reordering commits) should be handled well by MozReview and should not confuse the commit mapping mechanism.
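
For example, a commit message carrying this metadata line might look something like the following (the bug number, summary, and ID value here are made up for illustration):

    Bug 1234567 - Frobnicate the widget before painting; r=somereviewer

    Longer description of the change goes here.

    MozReview-Commit-ID: 5vSTXMFVZbQ

Because the ID travels with the commit message, it survives rebases, reorderings, and amendments, which is exactly what the mapping relies on.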

I'm not claiming the commit mapping mechanism is perfect. In fact, I know of areas where it can still fall apart. But it is much better than it was before. If you think you found a bug in the commit mapping, don't hesitate to file a bug. Please have it block bug 1243483.

A side-effect of introducing this improved commit mapping is that commit messages will have a MozReview-Commit-ID line in them. This may startle some. Some may complain about the spam. Unfortunately, there's no better alternative. Both Mercurial and Git do support a hidden key-value dictionary for each commit object. In fact, the MozReview Mercurial extension has been storing the very commit IDs that now appear in the commit message in this dictionary for months! Unfortunately, actually using this hidden dictionary for metadata storage is riddled with problems. For example, some Mercurial commands don't preserve all the metadata. And accessing or setting this data from Git is painful. While I wish this metadata (which provides little value to humans) were not located in the commit message where humans could be bothered by it, it's really the only practical place to put it. If people find it super annoying, we could modify Autoland to strip it before landing. Although, I think I like having it preserved because it will enable some useful scenarios down the road, such as better workflows for uplift requests. It's also worth noting that there is precedent for storing unique IDs in commit messages for purposes of commit mapping in the code review tool: Gerrit uses Change-ID lines.
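
For the curious, the hidden dictionary referred to above is Mercurial's per-commit "extras" map. Here is a small sketch using Mercurial's internal Python API (which is unstable and can change between releases) to peek at it:

    from mercurial import hg, ui as uimod

    repo = hg.repository(uimod.ui(), '.')  # repository in the current directory
    ctx = repo['.']                        # working directory parent commit
    # Every commit carries this dictionary; 'branch' is always present.
    print(ctx.extra())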

I hope you enjoy the Git support and the more robust commit to review request mapping mechanism!

Making MozReview Faster by Disregarding RESTful Design

January 13, 2016 at 03:25 PM | categories: MozReview, Mozilla | View Comments

When I first started writing web services, I was a huge RESTful fanboy. The architectural properties - especially the parts related to caching and scalability - really jibed with me. But as I've grown older and gained experience, I've realized that RESTful design, like many aspects of software engineering, is more of a guideline or ideal than a panacea. This post is about one of those experiences.

Review Board's Web API is RESTful. It's actually one of the better examples of a RESTful API I've seen. There is a very clear separation between resources. And everything - and I mean everything - is a resource. Hyperlinks are used for the purposes described in Roy T. Fielding's dissertation. I can tell the people who authored this web API understood RESTful design and they succeeded in transferring that knowledge to a web API.

Mozilla's MozReview code review tool is built on top of Review Board. We've made a number of customizations. The most significant is the ability to submit a series of commits as one logical review series. This occurs as a side-effect of a hg push to the code review repository. Once your changesets are pushed to the remote repository, that server issues a number of Review Board Web API HTTP requests to reviewboard.mozilla.org to create the review requests, assign reviewers, etc. This is mostly all built on the built-in web API endpoints offered by Review Board.

Because Review Board's Web API adheres to RESTful design principles so well, turning a series of commits into a series of review requests takes a lot of HTTP requests. For each commit, we have to perform something like 5 HTTP requests to define the server state. For a series of, say, 10 commits (which isn't uncommon since we try to encourage the use of microcommits), this can add up to dozens of HTTP requests! And that's just counting the HTTP requests to Review Board: because we've integrated Review Board with Bugzilla, events like publishing result in additional RESTful HTTP requests from Review Board to bugzilla.mozilla.org.

At the end of the day, submitting and publishing a series of 10 commits consumes somewhere between 75 and 100 HTTP requests! While the servers are all in close physical proximity (read: low network latencies), we are reusing TCP connections, and each HTTP request completes fairly quickly, the overhead adds up. It's not uncommon for publishing a commit series to take over 30s. This is unacceptable to developers. We want them to publish commits for review as quickly as possible so they can get on with their next task. Humans should not have to wait on machines.

Over in bug 1220468, I implemented a new batch submit web API for Review Board and converted the Mercurial server to call it instead of the classic, RESTful Review Board web APIs. In other words, I threw away the RESTful properties of the web API and implemented a monolithic API doing exactly what we need. The result is a drastic reduction in net HTTP requests. In our tests, submitting a series of 20 commits for review reduced the HTTP request count by 104! Furthermore, the new API endpoint performs all modifications in a single database transaction. Before, each HTTP request was independent and we had bugs where failures in the middle of an HTTP request series left the server in an inconsistent and unexpected state. The new API is significantly faster and more atomic as a bonus. The main reason the new implementation isn't yet nearly instantaneous is that we're still performing several RESTful HTTP requests to Bugzilla from Review Board. But there are plans for Bugzilla to implement the batch APIs we need as well, so stay tuned.
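
To illustrate the shape of the change - not the actual endpoint or payload, which I won't document here - the idea is to collapse the many per-commit requests into a single POST describing the entire series. A hypothetical client sketch in Python:

    import requests

    # Hypothetical endpoint and payload - illustrative only, not MozReview's real API.
    BATCH_URL = "https://reviewboard.example.com/api/extensions/mozreview/batch-submit/"

    series = {
        "identifier": "bug-1220468-series",
        "commits": [
            {"id": "c0ffee0", "message": "part 1: refactor foo", "diff": "<unified diff>"},
            {"id": "c0ffee1", "message": "part 2: add bar", "diff": "<unified diff>"},
        ],
    }

    # One request for the whole series instead of ~5 requests per commit,
    # and the server can apply everything in a single database transaction.
    resp = requests.post(BATCH_URL, json=series, timeout=60)
    resp.raise_for_status()
    print(resp.json())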

(I don't want to blame the Review Board or Bugzilla maintainers for their RESTful web APIs that are giving MozReview a bit of scaling pain. MozReview is definitely abusing them, almost certainly in ways that weren't imagined when they were conceived. To their credit, the maintainers of both products have recognized the limitations in their APIs and are working to address them.)

As much as I still love the properties of RESTful design, there are practical limitations and consequences such as what I described above. The older and more experienced I get, the less patience I have for tolerating architecturally pure implementations that sacrifice important properties, such as ease of use and performance.

It's worth noting that many of the properties of RESTful design are applicable to microservices as well. When you create a new service in a microservices architecture, you are creating more overhead for clients that need to speak to multiple services, making changes less transactional and atomic, and making it difficult to consolidate multiple related requests into a higher-level, simpler, and performant API. I recommend Microservice Trade-Offs for more on this subject.

There is a place in the world for RESTful and microservice architectures. And as someone who does a lot of server-side engineering, I sympathize with wanting scalable, fault-tolerant architectures. But like most complex problems, you need to be cognizant of trade-offs. It is also important to not get too caught up with architectural purity if it is getting in the way of delivering a simple, intuitive, and fast product for your users. So, please, follow me down from the ivory tower. The air was cleaner up there - but that was only because it was so distant from the swamp at the base of the tower that surrounds every software project.

Investing in the Firefox Build System in 2016

January 11, 2016 at 02:20 PM | categories: Mozilla, build system | View Comments

Most of Mozilla gathered in Orlando in December for an all hands meeting. If you attended any of the plenary sessions, you probably heard people like David Bryant and Lawrence Mandel make references to improving the Firefox build system and related tools. Well, the cat is out of the bag: Mozilla will be investing heavily in the Firefox build system and related tooling in 2016!

In the past 4+ years, the Firefox build system has mostly been held together and incrementally improved by a loose coalition of people who cared. We had a period in 2013 where a number of people were making significant updates (this is when moz.build files happened). But for the past 1.5+ years, there hasn't really been a coordinated effort to improve the build system - just a lot of one-off tasks and (much-appreciated) individual heroics. This is changing.

Improving the build system is a high priority for Mozilla in 2016. And investment has already begun. In Orlando, we had a marathon 3 hour meeting planning work for Q1. At least 8 people have committed to projects in Q1.

The focus of work is split between immediate short-term wins and longer-term investments. We also decided to prioritize the Firefox and Fennec developer workflows (over Gecko/Platform) as well as the development experience on Windows. This is because these areas have been under-loved and therefore have more potential for impact.

Here are the projects we're focusing on in Q1:

  • Turnkey artifact-based builds for Firefox and Fennec (download pre-built binaries so you don't have to spend 20 minutes compiling C++)
  • Running tests from the source directory (so you don't have to copy tens of thousands of files to the object directory)
  • Speed up configure / prototype a replacement
  • Telemetry for mach and the build system
  • NSPR, NSS, and (maybe) ICU build system rewrites
  • mach build faster improvements
  • Improvements to build rules used for building binaries (enables non-make build backends)
  • mach command for analyzing C++ dependencies
  • Deploy automated testing for mach bootstrap on TaskCluster

Work has already started on a number of these projects. I'm optimistic 2016 will be a watershed year for the Firefox build system and the improvements will have a drastic positive impact on developer productivity.

hg.mozilla.org replication updates

January 05, 2016 at 03:00 PM | categories: Mercurial, Mozilla | View Comments

A few minutes ago, I formally enabled a new replication system for hg.mozilla.org. For the curious, technical details are available.

This impacts you because pushes to hg.mozilla.org should now be significantly faster. For example, pushes to mozilla-inbound that used to take 15s now take 2s. Pushes to Try that used to take 45s now take 10s. (Yes, the old replication system really added a lot of overhead.) Pushes to hg.mozilla.org are still not as fast as they could be, due to us running the service on 5-year-old hardware (we plan to buy new servers this year) and due to the use of NFS on the server. However, I believe push latency is now reasonable for every repo except Try.

The new replication system opens the door to a number of future improvements. We'd like to stand up mirrors in multiple data centers - perhaps even offices - so clients have the fastest connectivity and so we have a better disaster recovery story. The new replication system facilitates this.

The new replication log is effectively a unified pushlog - something people have wanted for years. While there is not yet a public API for it, one could potentially be exposed, perhaps indirectly via Pulse.

It is now trivial for us to stand up new consumers of the replication log that react to repository events merely milliseconds after they occur. This should eventually result in downstream systems like build automation and conversion to Git repos starting and thus completing faster.

Finally, the new replication system has been running unofficially for a few weeks, so you likely won't notice anything different today (other than removal of some printed messages when you push). What changed today is the new system is enabled by default and we have no plans to continue supporting or operating the legacy system. Good riddance.
