I recently gained SSH access to Mozilla's Mercurial servers. This allows me to run some custom queries directly against the data. I was interested in some high-level numbers and thought I'd share the results.
hg.mozilla.org hosts a total of 3,445 repositories. Of these, there are 1,223 distinct root commits (i.e. distinct graphs). Altogether, there are 32,123,211 commits. Of those, there are 865,594 distinct commits (not double counting commits that appear in multiple repositories).
We have a high ratio of total commits to distinct commits (about 37:1). This means we have high duplication of data on disk. This basically means a lot of repos are clones/forks of existing ones. No big surprise there.
What is surprising to me is the low number of total distinct commits. I was expecting the number to run into the millions. (Firefox itself accounts for ~240,000 commits.) Perhaps a lot of the data is sitting in Git, Bitbucket, and GitHub. Sounds like a good data mining expedition...
When companies or organizations deploy version control, they have to make many choices. One of them is how many repositories to create. Your choices are essentially a) a single, monolithic repository that holds everything b) many separate, smaller repositories that hold all the individual parts c) something in between.
The prevailing convention today (especially in the open source realm) is to create many separate and loosely coupled repositories, each repository mapping to a specific product or service. That does seem reasonable: if you were organizing files on your filesystem, you would group them by functionality or role (photos, music, documents, etc). And, version control tools are functionally filesystems. So it makes sense to draw repository boundaries at directory/role levels.
Further reinforcing the separate repository convention is the scaling behavior of our version control tools. Git, the popular tool in open source these days, doesn't scale well to very large repositories due to - among other things - not having narrow clones (fetching a subset of files). It scales well enough to the overwhelming majority of projects. But if you are a large organization generating lots of data (read: gigabytes of data over hundreds of thousands of files and commits) for version control, Git is unsuitable in its current form. Other tools (like Mercurial) don't currently fare that much better (although Mercurial has plans to tackle these scaling vectors).
Despite popular convention and even limitations in tools, companies like Google and Facebook opt to run large, monolithic repositories. Google runs Perforce. Facebook is on Mercurial, or at least is in the process of migrating to Mercurial.
Why do these companies run monolithic repositories? In Google's words:
We have a single large depot with almost all of Google's projects on it. This aids agile development and is much loved by our users, since it allows almost anyone to easily view almost any code, allows projects to share code, and allows engineers to move freely from project to project. Documentation and data is stored on the server as well as code.
So, monolithic repositories are all about moving fast and getting things done more efficiently. In other words, monolithic repositories increase developer productivity.
Furthermore, monolithic repositories are also more compatible with the ebb and flow of large organizations and large software projects. Components, features, products, and teams come and go, merge and split. The only constant is change. And if you are maintaining separate repositories that attempt to map to this ever-changing organizational topology, you are going to have a bad time. Either you'll be constantly copying, moving, merging, splitting, etc data and repositories. Or your repositories will be organized in a very non-logical and non-intuitive manner. That translates to overhead and lost productivity. I think that monolithic repositories handle the realities of large organizations much better. Big change or reorganization you want to reflect? You can make a single, atomic, history-preserving commit to move things around. I think that's much more manageable, especially when you consider the difficulty and annoyance of history-preserving changes across repositories.
Naysayers will decry monolithic repositories on principled and practical grounds.
The principled camp will say that separate repositories constitute a loosely coupled (dare I say service oriented) architecture that maps better to how software is consumed, assembled, and deployed and that erecting barriers in the form of separate repositories deliberately enforces this architecture. I agree. However, you can still maintain a loosely coupled architecture with monolithic repositories. The Subversion model of checking out a single tree from a larger repository proves this. Furthermore, I would say architecture decisions should be enforced by people (via code review, etc), not via version control repository topology. I believe this principled argument against monolithic repositories to be rather weak.
The principled camp living in the open source realm may also decry monolithic repositories as an affront to the spirit of open source. They would say that a monolithic repository creates unfairly strong ties to the organization that operates it and creates barriers to forking, etc. This may be true. But monolithic repositories don't intrinsically infringe on the basic software freedoms, organizations do. Therefore, I find this principled argument rather weak.
The practical camp will say that monolithic repositories just don't scale or aren't suitable for general audiences. These concerns are real.
Fully distributed version control systems (every commit on every machine) definitely don't scale past certain limits. Depending on your repository and user base, your scaling limits include disk space (repository data terabytes in size), bandwidth (repository data terabytes in size), filesystem (repository hundreds of thousands or millions of files), CPU and memory (operations on large repositories take too many system resources), and many heads/branches (tools like Git and Mercurial don't scale well to tens of thousands of heads/branches). These limitations with fully distributed version control are why distributed version control tools like Git and Mercurial support a partially-distributed mode that behaves more like your classical server-client model, like those employed by Subversion, Perforce, etc. Git supports shallow clone and sparse checkout. Mercurial supports shallow clone (via remotefilelog) and has planned support for narrow clone and sparse checkout in the next release or two. Of course, you can avoid the scaling limitations of distributed version control by employing a non-distributed tool, such as Subversion. Many companies continue to reach this conclusion today. However, users adapted to the distributed workflow would likely be up in arms (they would probably use tools like hg-subversion or git-svn to maintain their workflows). So, while scaling of version control can be a real concern, there are solutions and workarounds. However, they do involve falling back to a partially-distributed model.
Another concern with monolithic repositories is user access control. You inevitably have code or data that is more sensitive and want to limit who can change or even access it. Separate repositories seem to facilitate a simpler model: per-repository access control. With monolithic repositories, you have to worry about per-directory/subtree permissions, an increased risk of data leaking, etc. This concern is more real with distributed version control, as distributed data and access control aren't naturally compatible. But these issues can be resolved. And if the tooling supports it, there is only a semantic difference between managing access control between repositories versus components of a single repository.
When it comes to repository hosting conversions, I agree with Google and Facebook: I prefer monolithic repositories. When I am interacting with version control, I just want to get stuff done. I don't want to waste time dealing with multiple commands to manage multiple repositories. I don't want to waste time or expend cognitive load dealing with submodule, subrepository, or big files management. I don't want to waste time trying to find and reuse code, data, or documentation. I want everything at my fingertips, where it can be easily discovered, inspected, and used. Monolithic repositories facilitate these workflows more than separate repositories and make me more productive as a result.
Now, if only all the tools and processes we use and love would work with monolithic repositories...
Of of my first tasks in my new role as a Developer Productivity Engineer is to help make Mozilla's Mercurial server better. Many of the awesome things we have planned rely on features in newer versions of Mercurial. It's therefore important for us to upgrade our Mercurial server to a modern version (we are currently running 2.5.4) and to keep our Mercurial server upgraded as time passes.
There are a few reasons why we haven't historically upgraded our Mercurial server. First, as anyone who has maintained high-availability systems will tell you, there is the attitude of if it isn't broken, don't fix it. In other words, Mercurial 2.5.4 is working fine, so why mess with a good thing. This was all fine and dandy - until Mercurial started falling over in the last few weeks.
But the blocker towards upgrading that I want to talk about today is systems verification. There has been extreme caution around upgrading Mercurial at Mozilla because it is a critical piece of Mozilla's infrastructure and if the upgrade were to not go well, the outage would be disastrous for developer productivity and could even jeopardize an emergency Firefox release.
As much as I'd like to say that a modern version of Mercurial on the server would be a drop-in replacement (Mercurial has a great committment to backwards compatibility and has loose coupling between clients and servers such that upgrading servers should not impact clients), there is always a risk that something will change. And that risk is compounded by the amount of custom code we have running on our server.
The way you protect against unexpected changes is testing. In the ideal world, you have a robust test suite that you run against a staging instance of a service to validate that any changes have no impact. In the absence of testing, you are left with fear, uncertainty, and doubt. FUD is an especially horrible philosophy when it comes to managing servers.
Unfortunately, we don't really have a great testing infrastructure for Mozilla's Mercurial server. And I want to change that.
Reproducing the Server Environment
When writing tests, it is important for the thing being tested to be as similar as possible to the real thing. This is why so many people have an aversion to mocking: every time you alter the test environment, you run the risk that those differences from reality will mask changes seen in the real environment.
So, it makes sense that a good first goal for creating a test suite against our Mercurial server should be to reproduce the production server and environment as closely as possible.
I'm currently working on a Vagrant environment that attempts to reproduce the official environment as closely as possible. It starts one virtual machine for the SSH/master server. It starts a separate virtual machine for the hgweb/slave servers. The virtual machines are booting CentOS. This is different than production, where we run RHEL. But they are similar enough (and can share the same packages) that the differences shouldn't matter too much, at least for now.
In production, Mozilla is using Puppet to manage the Mercurial servers. Unfortunately, the actual Puppet configs that Mozilla is running are behind a firewall, mainly for security reasons. This is potentially a huge setback for my reproducibility effort, as I'd like to have my virtual machines use the same exact Puppet configs as whats used in production so the environments match as closely as possible. This would also save me a lot of work from having to reinvent the wheel.
Fortunately, Ben Kero has extracted the Mercurial-relevant Puppet config files into a standalone repository. Apparently that repository gets rolled into the production Puppet configs periodically. So, my virtual machines and production can share the same Mercurial Puppet files. Nice!
It wasn't long after starting to use the standalone Puppet configs that I realized this would be a rabbit hole. This first manifests in the standalone Puppet code referencing things that exist in the hidden Mozilla Puppet files. So the liberation was only partially successful. Sad panda.
So, I'm now in the process of creating a fake Mozilla Puppet environment that mimics the base Mozilla environment (from the closed repo) and am modifying the shared Puppet Mercurial code to work with both versions. This is a royal pain, but it needs to be done if we want to reproduce production and maintain peace of mind that test results reflect reality.
Because reproducing runtime environments is important for reproducing and solving bugs and for testing, I call on the maintainers of Mozilla's closed Puppet repository to liberate it from behind its firewall. I'd like to see a public Puppet configuration tree available for all to use so that anyone anywhere can reproduce the state of a server or service operated by Mozilla to within reasonable approximation. Had this already been done, it would have saved me hours of work. As it stands, I'm reverse engineering systems and trying to cobble together understanding of how the Mozilla Puppet configs work and what parts of them can safely be ignored to reproduce an approximate testing environment.
Along that vein, I finally got access to Mozilla's internal Puppet repository. This took a few meetings and apparently a lot of backroom chatter was generated - "developer's don't normally get access, oh my!" All I wanted was to see how systems are configured so I can help improve them. Instead, getting access felt like pulling teeth. This feels like a major roadblock towards productivity, reproducibility, and testing.
Facebook gives its developers access to most production machines and trusts them to not be stupid. I know we (Mozilla) like to hold ourselves to a high standard of security and privacy. But not giving developers access to the configurations for the systems their code runs on feels like a very silly policy. I hope Mozilla invests in opening up this important code and data, if not to the world, at least to its trusted employees.
Anyway, hopefully I'll soon have a Vagrant environment that allows people to build a standalone instance of Mozilla's Mercurial server. And once that's in place, I can start writing tests that basic services and workflows (including repository syncing) work as expected. Stay tuned.
As of today, I have a new role and title at Mozilla: Developer Productivity Engineer. I'll be reporting to Laura Thomson as a member of the Developer Services team.
I have an immediate goal to make our version control work better. This includes making Try scale and helping out with the deployment of ReviewBoard. After that, I'm not entirely sure. But Autoland and Firefox build system improvements have been discussed.
I'm really excited to be in this new role. If someone were to give me a clean slate and tell me to design my own job role, I think I'd answer with something very similar to the role I am now in. I am passionate about tools and enabling people to become more productive. I have little doubt I'll thrive in this new role.
Are you a Mozillian who uses Mercurial? Do you have a complaint, suggestion, observation, or any other type of feedback you'd like to give to the maintainers of Mercurial? Now's your chance.
There is a large gathering of Mercurial contributors next weekend in Munich. The topics list is already impressive. But Mozilla's delegation (Mike Hommey, Ben Kero, and myself) would love to advance Mozilla's concerns to the wider community.
To leave or vote for feedback, please visit https://hgfeedback.paas.allizom.org/e/august-2014-summit before August 29 so your voice may be heard.
I encourage you to leave feedback about any small, big or small, Mozilla-specific or not. Comparisons to Git, GitHub and other version control tools and services are also welcome.
If you have feedback that can't be captured in that moderator tool, please email me. firstname.lastname@example.org.
« Previous Page -- Next Page »