On Monolithic Repositories

September 09, 2014 at 10:00 AM | categories: Git, Mercurial, Mozilla

When companies or organizations deploy version control, they have to make many choices. One of them is how many repositories to create. Your choices are essentially a) a single, monolithic repository that holds everything, b) many separate, smaller repositories that hold the individual parts, or c) something in between.

The prevailing convention today (especially in the open source realm) is to create many separate and loosely coupled repositories, each repository mapping to a specific product or service. That does seem reasonable: if you were organizing files on your filesystem, you would group them by functionality or role (photos, music, documents, etc). And, version control tools are functionally filesystems. So it makes sense to draw repository boundaries at directory/role levels.

Further reinforcing the separate repository convention is the scaling behavior of our version control tools. Git, the popular tool in open source these days, doesn't scale well to very large repositories due to - among other things - not having narrow clones (fetching a subset of files). It scales well enough to the overwhelming majority of projects. But if you are a large organization generating lots of data (read: gigabytes of data over hundreds of thousands of files and commits) for version control, Git is unsuitable in its current form. Other tools (like Mercurial) don't currently fare that much better (although Mercurial has plans to tackle these scaling vectors).

Despite popular convention and even limitations in tools, companies like Google and Facebook opt to run large, monolithic repositories. Google runs Perforce. Facebook is on Mercurial, or at least is in the process of migrating to Mercurial.

Why do these companies run monolithic repositories? In Google's words:

We have a single large depot with almost all of Google's projects on it. This aids agile development and is much loved by our users, since it allows almost anyone to easily view almost any code, allows projects to share code, and allows engineers to move freely from project to project. Documentation and data is stored on the server as well as code.

So, monolithic repositories are all about moving fast and getting things done more efficiently. In other words, monolithic repositories increase developer productivity.

Furthermore, monolithic repositories are more compatible with the ebb and flow of large organizations and large software projects. Components, features, products, and teams come and go, merge and split. The only constant is change. And if you are maintaining separate repositories that attempt to map to this ever-changing organizational topology, you are going to have a bad time. Either you'll be constantly copying, moving, merging, and splitting data and repositories, or your repositories will be organized in an illogical and unintuitive manner. That translates to overhead and lost productivity. I think that monolithic repositories handle the realities of large organizations much better. Want to reflect a big change or reorganization? You can make a single, atomic, history-preserving commit to move things around. I think that's much more manageable, especially when you consider the difficulty and annoyance of history-preserving changes across repositories.
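
To make that concrete, here is a minimal sketch of such a reorganization in a monolithic Mercurial repository (the paths are invented for illustration):

    # Move a project under its new owning team in a single, history-preserving commit.
    hg mv services/payments platform/payments
    hg commit -m "Reorganize: move the payments service under the platform team"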

Naysayers will decry monolithic repositories on principled and practical grounds.

The principled camp will say that separate repositories constitute a loosely coupled (dare I say service oriented) architecture that maps better to how software is consumed, assembled, and deployed and that erecting barriers in the form of separate repositories deliberately enforces this architecture. I agree. However, you can still maintain a loosely coupled architecture with monolithic repositories. The Subversion model of checking out a single tree from a larger repository proves this. Furthermore, I would say architecture decisions should be enforced by people (via code review, etc), not via version control repository topology. I believe this principled argument against monolithic repositories to be rather weak.

The principled camp living in the open source realm may also decry monolithic repositories as an affront to the spirit of open source. They would say that a monolithic repository creates unfairly strong ties to the organization that operates it and creates barriers to forking, etc. This may be true. But monolithic repositories don't intrinsically infringe on the basic software freedoms, organizations do. Therefore, I find this principled argument rather weak.

The practical camp will say that monolithic repositories just don't scale or aren't suitable for general audiences. These concerns are real.

Fully distributed version control systems (every commit on every machine) definitely don't scale past certain limits. Depending on your repository and user base, your scaling limits include disk space (repository data terabytes in size), bandwidth (again, terabytes of repository data), filesystem (hundreds of thousands or millions of files), CPU and memory (operations on large repositories take too many system resources), and many heads/branches (tools like Git and Mercurial don't scale well to tens of thousands of heads/branches). These limitations of fully distributed version control are why distributed tools like Git and Mercurial support a partially-distributed mode that behaves more like the classical server-client model employed by Subversion, Perforce, etc. Git supports shallow clone and sparse checkout. Mercurial supports shallow clone (via remotefilelog) and has planned support for narrow clone and sparse checkout by the end of 2015. Of course, you can avoid the scaling limitations of distributed version control entirely by employing a non-distributed tool, such as Subversion. Many companies continue to reach this conclusion today. However, users accustomed to the distributed workflow would likely be up in arms (they would probably use tools like hgsubversion or git-svn to maintain their workflows). So, while scaling of version control is a real concern, there are solutions and workarounds. They do, however, involve falling back to a partially-distributed model.
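
For illustration, here is roughly what those partially-distributed modes look like from the Git client's side (a sketch only; the repository URL is a placeholder and the exact mechanism varies by Git version):

    # Shallow clone: fetch only recent history instead of every commit.
    git clone --depth=1 https://example.com/huge-repo.git
    cd huge-repo

    # Sparse checkout: only materialize a subtree in the working directory.
    git config core.sparseCheckout true
    echo "browser/" >> .git/info/sparse-checkout
    git read-tree -mu HEAD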

Another concern with monolithic repositories is user access control. You inevitably have code or data that is more sensitive and want to limit who can change or even access it. Separate repositories seem to facilitate a simpler model: per-repository access control. With monolithic repositories, you have to worry about per-directory/subtree permissions, an increased risk of data leaking, etc. This concern is more real with distributed version control, as distributed data and access control aren't naturally compatible. But these issues can be resolved. And if the tooling supports it, there is only a semantic difference between managing access control between repositories versus components of a single repository.
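
As a rough sketch of what per-subtree access control can look like with today's tooling, Mercurial ships an acl extension that a server operator can wire up as a push hook (the subtree path and usernames below are invented):

    # Server-side hgrc sketch using Mercurial's bundled acl extension.
    [extensions]
    acl =

    [hooks]
    pretxnchangegroup.acl = python:hgext.acl.hook

    [acl]
    # Check changesets arriving via serve (http/ssh) or local push.
    sources = serve push

    [acl.deny]
    # Hypothetical rule: these users may not touch the sensitive subtree.
    secret/** = contractor1, contractor2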

When it comes to repository hosting conversations, I agree with Google and Facebook: I prefer monolithic repositories. When I am interacting with version control, I just want to get stuff done. I don't want to waste time dealing with multiple commands to manage multiple repositories. I don't want to waste time or expend cognitive load dealing with submodule, subrepository, or big files management. I don't want to waste time trying to find and reuse code, data, or documentation. I want everything at my fingertips, where it can be easily discovered, inspected, and used. Monolithic repositories facilitate these workflows more than separate repositories and make me more productive as a result.

Now, if only all the tools and processes we use and love would work with monolithic repositories...

Want to read more about monolithic repositories? I highly recommend Advantages of Monolithic Version Control by Dan Luu.


Repository-Centric Development

July 24, 2014 at 08:23 PM | categories: Git, Mercurial, Mozilla

I was editing a wiki page yesterday and I think I coined a new term which I'd like to see enter the common nomenclature: repository-centric development. The term refers to development/version control workflows that place repositories - not patches - first.

When collaborating on version controlled code with modern tools like Git and Mercurial, you essentially have two choices on how to share version control data: patches or repositories.

Patches have been around since the dawn of version control. Everyone knows how they work: your version control system has a copy of the canonical data and it can export a view of a specific change into what's called a patch. A patch is essentially a diff with extra metadata.
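
For example, producing one of those patches from a repository looks something like this (the commit identifiers are placeholders):

    # Mercurial: export a changeset, with its commit metadata, as a patch file.
    hg export -r 3f2e1d0a9b8c > fix-widget.patch

    # Git: write a single commit as an mbox-style patch suitable for email.
    git format-patch -1 9b8c3f2e1d0a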

When distributed version control systems came along, they brought with them an alternative to patch-centric development: repository-centric development. You could still exchange patches if you wanted, but distributed version control allowed you to pull changes directly from multiple repositories. You weren't limited to a single master server (that's what the distributed in distributed version control means). You also didn't have to go through an intermediate transport such as email to exchange patches: you communicated directly with a peer repository instance.

Repository-centric development eliminates the middle man required for patch exchange: instead of exchanging derived data, you exchange the actual data, speaking the repository's native language.
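
In practice, that exchange is just a couple of commands against a peer repository (the URLs and branch name below are placeholders):

    # Mercurial: pull someone else's changesets directly from their repository.
    hg pull https://hg.example.com/alice/project

    # Git: add a peer as a remote, fetch their branches, and inspect the new work.
    git remote add alice https://github.com/alice/project.git
    git fetch alice
    git log master..alice/feature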

One advantage of repository-centric development is it eliminates the problem of patch non-uniformity. Patches come in many different flavors. You have plain diffs. You have diffs with metadata. You have Git style metadata. You have Mercurial style metadata. You can produce patches with various lines of context in the diff. There are different methods for handling binary content. There are different ways to express file adds, removals, and renames. It's all a hot mess. Any system that consumes patches needs to deal with the non-uniformity. Do you think this isn't a problem in the real world? Think again. If you are involved with an open source project that collects patches via email or by uploading patches to a bug tracker, have you ever seen someone accidentally upload a patch in the wrong format? That's patch non-uniformity. New contributors to Firefox do this all the time. I also see it in the Mercurial project. With repository-centric development, patches never enter the picture, so patch non-uniformity is a non-issue. (Don't confuse the superficial formatting of patches with the content, such as an incorrect commit message format.)

Another advantage of repository-centric development is it makes the act of exchanging data easier. Just have two repositories talk to each other. This used to be difficult, but hosting services like GitHub and Bitbucket make this easy. Contrast with patches, which require hooking your version control tool up to wherever those patches are located. The Linux Kernel, like so many other projects, uses email for contributing changes. So now Git, Mercurial, etc all fulfill Zawinski's law. This means your version control tool is talking to your inbox to send and receive code. Firefox development uses Bugzilla to hold patches as attachments. So now your version control tool needs to talk to your issue tracker. (Not the worst idea in the world I will concede.) While, yes, the tools around using email or uploading patches to issue trackers or whatever else you are using to exchange patches exist and can work pretty well, the grim reality is that these tools are all reinventing the wheel of repository exchange and are solving a problem that has already been solved by git push, git fetch, hg pull, hg push, etc. Personally, I would rather hg push to a remote and have tools like issue trackers and mailing lists pull directly from repositories. At least that way they have a direct line into the source of truth and are guaranteed a consistent output format.

Another area where direct exchange is huge is multi-patch commits (branches in Git parlance) or where commit data is otherwise fragmented. When sending patches by email, you need to insert metadata saying which patch comes after which. Then the email import tool needs to reassemble things in the proper order (remember that the typical convention is one email per patch and email can be delivered out of order). Not the most difficult problem in the world to solve. But seriously, it's been solved already by git fetch and hg pull! Things are worse for Bugzilla. There is no bullet-proof way to order patches there. The convention at Mozilla is to add "Part N" strings to commit messages and have the Bugzilla import tool sort on them (I assume that's what it does). But what if you have a logical commit series spread across multiple bugs? How do you reassemble everything into a linear series of commits? You don't, sadly. Just today I wanted to apply a somewhat complicated series of patches to the Firefox build system that I had been asked to review, so I could jump into a debugger, see what was going on, and conduct a more thorough review. There were 4 or 5 patches spread over 3 or 4 bugs. Bugzilla and its patch-centric workflow prevented me from importing the patches. Fortunately, this patch series was pushed to Mozilla's Try server, so I could pull from there. But I haven't always been so fortunate. This limitation means developers have to make sacrifices such as writing fewer, larger patches (which makes code review harder) or involving unrelated parties in the same bug and/or review. In other words, deficient tools are imposing limited workflows. No bueno.

It is a fair criticism to say that not everyone can host a server or that permissions and authorization are hard, although I think those concerns are overblown. If you are a small project, just create a GitHub or Bitbucket account. If you are a larger project, realize that people time is one of your largest expenses and invest in proper and efficient repository hosting (often this can be GitHub) to reduce this waste and keep your developers happier and more efficient.

One of the clearest examples of repository-centric development is GitHub. There are no patches in GitHub. Instead, you git push and git fetch. Want to apply someone else's work? Just add a remote and git fetch! Contrast with first locating patches, hooking up Git to consume them (this part was always confusing to me - do you need to retroactively have them sent to your email inbox so you can import them from there?), and finally actually importing them. Just give me a URL to a repository already. But the benefits of repository-centric development with GitHub don't stop at pushing and pulling. GitHub has built code review functionality into pushes. They call these pull requests. While I have significant issues with GitHub's implementation of pull requests (I need to blog about those some day), I can't deny the utility of the repository-centric workflow and all the benefits around it. Once you switch to GitHub and its repository-centric workflow, you more clearly see how lacking patch-centric development is and quickly lose your desire to go back to the 1990s state of the art for software development.

I hope you now know what repository-centric development is and will join me in championing it over patch-based development.

Mozillians reading this will be very happy to learn that work is under way to shift Firefox's development workflow to a more repository-centric world. Stay tuned.


New Repository for Mozilla Version Control Tools

February 05, 2014 at 07:15 PM | categories: Git, Mercurial, Mozilla

Version control systems can be highly useful tools.

At Mozilla, we've made numerous improvements and customizations to our version control tools. We have custom hooks that run on the server. We have a custom skin for Mercurial's web interface. Mozillians have written a handful of Mercurial extensions to aid with common developer tasks, such as pushing to try, interacting with Bugzilla, making mq more useful, and more.

These have all come into existence in an organic manner, one after the other. Individuals have seen an itch and scratched it. Good for them. Good for Mozilla.

Unfortunately, the amassed collection of tools has become quite large. They have become difficult to discover and keep up to date. Quality and style vary from tool to tool. And each tool has its own process for updates and changes.

I contacted the maintainers of the popular version control tools at Mozilla with a simple proposal: let's maintain all our tools under one repo. This would allow us to increase cohesion, share code, maintain a high quality bar, share best practices, etc. There were no major objections, so we now have a unified repository containing our version control tools!

Currently, we only have a few Mercurial extensions in there. A goal is to accumulate as much of the existing Mercurial infrastructure into that repository as possible. Client code. Server code. All of the code. I want developers to be able to install the same hooks on their clients as what's running on the server: why should your local repo let you commit something that the server will reject? I want developers to be able to reasonably reproduce Mozilla's canonical version control server configuration locally. That way, you can test things locally with a high confidence that your changes will work the same way on production. This allows deployments to move faster and with less friction.
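
As a hedged sketch of what that could look like for an individual developer - the hook script path and name below are hypothetical, standing in for whatever ships in the unified repository - a local clone could enable a commit-time check that mirrors the server's push-time check:

    # In a local repository's .hg/hgrc (illustrative path, not the real hook).
    [hooks]
    pretxncommit.check-commit = /path/to/version-control-tools/hghooks/check-commit.sh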

The immediate emphasis will be on moving extensions into this repo and deprecating the old homes on user repositories. Over time, we'll move into consolidating server code and getting hg.mozilla.org and git.mozilla.org to use this repository. But that's a lower priority: the most important goal right now is to make it easier and friendlier for people to run productivity-enhancing tools.

So, if you see your Mercurial extensions alerting you that they've been moved to a new repository, now you know what's going on.


Aggregating Version Control Info at Mozilla

January 21, 2014 at 10:50 AM | categories: Git, Mercurial, Mozilla, Python

Over the winter break, I set out on an ambitious project to create a service to help developers and others manage the flurry of patches going into Firefox. While the project is far from complete, I'm ready to unleash the first part of the project upon the world.

If you point your browsers to moztree.gregoryszorc.com, you'll hopefully see some documentation about what I've built. Source code is available and free, of course. Patches very welcome.

Essentially, I built a centralized indexing service for version control repositories with Mozilla's extra metadata thrown in. I tell it what repositories to mirror, and it clones everything, fetches data such as the pushlog and Git SHA-1 mappings, and stores everything in a central database. It then exposes this aggregated data through world-readable web services.

Currently, I have the service indexing the popular project branches for Firefox (central, aurora, beta, release, esr, b2g, inbound, fx-team, try, etc). You can view the full list via the web service. As a bonus, I'm also serving these repositories via hg.gregoryszorc.com. My server appears to be significantly faster than hg.mozilla.org. If you want to use it for your daily needs, go for it. I make no SLA guarantees, however.

I'm also using this service as an opportunity to experiment with alternate forms of Mercurial hosting. I have mirrors of mozilla-central and the try repository with generaldelta and lz4 compression enabled. I may blog about what those are eventually. The teaser is that they can make Mercurial perform a lot faster under some conditions. I'm also using ZFS under the hood to manage repositories. Each repository is a ZFS filesystem. This means I can create repository copies on the server (user repositories anyone?) at a nearly free cost. Contrast this to the traditional method of full clones, which take lots of time, memory, CPU, and storage.
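
As a sketch of how that works (the dataset names are invented), creating a copy-on-write copy of a repository on such a setup is just a snapshot plus a clone:

    # Each repository lives in its own ZFS filesystem under a hypothetical pool.
    zfs snapshot tank/hg/mozilla-central@base
    zfs clone tank/hg/mozilla-central@base tank/hg/users/alice/mozilla-central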

Anyway, some things you can do with the existing web service (see the query sketch after the list):

  • Obtain metadata about Mercurial changesets. Example.
  • Look up metadata about Git commits. Example.
  • Obtain a SPORE descriptor describing the web service endpoints. This allows you to auto-generate clients from descriptors. Yay!
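
As a rough idea of how a client talks to the service - the endpoint paths below are made up for illustration; the SPORE descriptor describes the real ones - queries are plain HTTP requests:

    # Hypothetical queries against the aggregation service.
    curl https://moztree.gregoryszorc.com/repositories
    curl https://moztree.gregoryszorc.com/changeset/mozilla-central/<node>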

Obviously, that's not a lot. But adding new endpoints is relatively straightforward. See the source. It's literally as easy as defining a URL mapping and writing a database query.

The performance is also not the best. I just haven't put in the effort to tune things yet. All of the querying hits the database, not Mercurial. So, making things faster should merely be a matter of database and hosting optimization. Patches welcome!

Some ideas that I haven't had time to implement yet:

  • Return changesets in a specific repository
  • Return recently pushed changesets
  • Return pushes for a given user
  • Return commits for a given author
  • Return commits referencing a given bug
  • Obtain TBPL URLs for pushes containing a given changeset
  • Integrate Bugzilla metadata

Once those are in place, I foresee this service powering a number of dashboards. Patches welcome.

Again, this service is only the tip of what's possible. There's a lot that could be built on this service. I have ideas. Others have ideas.

The project includes a Vagrantfile and Puppet manifests for provisioning the server. It's a one-liner to get a development environment up and running. It should be really easy to contribute to this project. Patches welcome.


Importance of Hosting Your Version Control Server

November 13, 2013 at 09:25 AM | categories: Git, Mercurial, Mozilla

The subject of where to host version control repositories comes up a lot at Mozilla. It takes many forms:

  • We should move the Firefox repository to GitHub
  • I should be allowed to commit to GitHub
  • I want the canonical repository to be hosted by Bitbucket

Where Firefox development is concerned, Release Engineering puts its foot down and insists the canonical repository be hosted by Mozilla, under a Mozilla hostname. When that's not possible, they set up a mirror on Mozilla infrastructure.

I think a recent issue with the Jenkins project demonstrates why hosting your own version control server is important. The gist is that someone force pushed to a bunch of repositories hosted on GitHub. They needed to involve GitHub support to recover from the issue. While it appears they largely recovered (and GitHub support deserves kudos - I don't want to take away from their excellence), this problem would have been avoided, or the response time significantly decreased, if the Jenkins people had direct control over the Git server: they could have installed a custom hook that prevented the pushes, or they could have consulted the reflog to see the last pushed revision and simply force pushed back to it. GitHub doesn't have a mechanism for defining pre-* hooks, doesn't allow defining custom hooks (a security and performance issue for them), and doesn't expose the reflog data.
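
To make the hook idea concrete, here is a minimal sketch of a server-side Git pre-receive hook that rejects force pushes. This illustrates the kind of control self-hosting gives you; it is not what Jenkins actually deploys:

    #!/bin/sh
    # pre-receive receives "old new ref" lines on stdin for each updated ref.
    zero=0000000000000000000000000000000000000000
    while read old new ref; do
        # Ignore ref creation and deletion.
        [ "$old" = "$zero" ] && continue
        [ "$new" = "$zero" ] && continue
        # If the old tip is not an ancestor of the new tip, history was rewritten.
        if ! git merge-base --is-ancestor "$old" "$new"; then
            echo "Rejecting non-fast-forward push to $ref" >&2
            exit 1
        fi
    done

Similarly, with reflogs enabled on the server (core.logAllRefUpdates, which is off by default for bare repositories), git reflog on the affected branches would show the last good revisions to force push back to.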

Until repository hosting services expose full repository data (such as reflogs) and allow you to define custom hooks, accidents like these will happen and the recovery time will be longer than if you hosted the repo yourself.

It's possible repository hosting services like GitHub and Bitbucket will expose these features or provide a means to quickly recover. If so, kudos to them. But larger, more advanced projects will likely employ custom hooks, and considering custom hooks are a massive security and performance issue for any hosted service provider, I'm not going to hold my breath for this particular feature to be rolled out any time soon. This is unfortunate, as it makes projects seemingly choose between the lower risk but lower convenience of self-hosting and GitHub's vibrant developer community.

