The ascendancy of GitHub has very little to do with its namesake tool, Git.
What GitHub did that was so radical for its time and the strategy that GitHub continues to execute so well on today is the approach of putting code first and enabling change to be a frictionless process.
In case you weren't around for the pre-GitHub days or don't remember, they were not pleasant. Tools around code management were a far cry from where they are today (I still argue the tools are pretty bad, but that's for another post). Centralized version control systems were prevalent (CVS and Subversion in open source, Perforce, ClearCase, Team Foundation Server, and others in the corporate world). Tools for looking at and querying code had horrible, ugly interfaces and came out of a previous era of web design and browser capabilities. It felt like a chore to do anything, including committing code. Yes, the world had awesome services like SourceForge, but they weren't the same as GitHub is today.
Before I get to my central thesis, I want to highlight some supporting reasons for GitHub's success. There were two developments in the second half of the 2000s the contributed to the success of GitHub: the rises of the distributed version control system (DVCS) and the modern web.
While distributed version control systems like Sun WorkShop TeamWare and BitKeeper existed earlier, it wasn't until the second half of the 2000s that DVCS systems took off. You can argue part of the reason for this was open source: my recollection is there wasn't a well-known DVCS available as free software before 2005. Speaking of 2005, it was a big year for DVCS projects: Git, Mercurial, and Bazaar all had initial releases. Suddenly, there were old-but-new ideas on how to do source control being exposed to new and willing-to-experiment audiences. DVCS were a critical leap from traditional version control because they (theoretically) impose less process and workflow limitations on users. With traditional version control, you needed to be online to commit, meaning you were managing patches, not commits, in your local development workflow. There were some forms of branching and merging, but they were a far cry from what is available today and were often too complex for mere mortals to use. As more and more people were exposed to distributed version control, they welcomed its less-restrictive and more powerful workflows. They realized that source control tools don't have to be so limiting. Distributed version control also promised all kinds of revamped workflows that could be harnessed. There were potential wins all around.
GitHub launched in April 2008 standing on the shoulders of both the emerging interest in the Git content tracking tool and the capabilities of modern browsers.
I wasn't an early user of GitHub. My recollection is that GitHub was mostly a Rubyist's playground back then. I wasn't a Ruby programmer, so I had little reason to use GitHub in the early stages. But people did start using GitHub. And in the spirit of Ruby (on Rails), GitHub moved fast, or at least was projecting the notion that they were. While other services built on top of DVCS tools - like Bitbucket - did exist back then, GitHub seemed to have momentum associated with it. (Look at the archives for GitHub's and Bitbucket's respective blogs. GitHub has hundreds of blog entries; Bitbucket numbers in the dozens.) Developers everywhere up until this point had all been dealing with sub-optimal tools and workflows. Some of us realized it. Others hadn't. Many of those who did saw GitHub as a beacon of hope: we have all these new ideas and new potentials with distributed version control and here is a service under active development trying to figure out how to exploit that. Oh, and it's free for open source. Sign me up!
GitHub did capitalize on a market opportunity. They also capitalized on the value of marketing and the perception that they were moving fast and providing features that people - especially in open source - wanted. This captured the early adopters market. But I think what really set GitHub apart and led to the success they are enjoying today is their code first approach and their desire to make contribution easy, and even fun and sociable.
As developers, our job is to solve problems. We often do that by writing and changing code. And this often involves working as part of a team, or collaborating. To collaborate, we need tools. You eventually need some processes. And as I recently blogged, this can lead to process debt and inefficiencies associated with them.
Before GitHub, the process debt for contributing to other projects was high. You often had to subscribe to mailing lists in order to submit patches as emails. Or, you had to create an account on someone's bug tracker or code review tool before you could send patches. Then you had to figure out how to use these tools and any organization or project-specific extensions and workflows attached to them. It was quite involved and a lot could go wrong. Many projects and organizations (like Mozilla) still practice this traditional methology. Furthermore (and as I've written before), these traditional, single patch/commit-based tools often aren't effective at ensuring the desired output of high quality software.
Before GitHub solved process debt via commoditization of knowledge via market dominance, they took another approach: emphasizing code first development.
GitHub is all about the code. You load a project page and you see code. You may think a README with basic project information would be the first thing on a project page. But it isn't. Code, like data, is king.
Collaboration and contribution on GitHub revolve around the pull request. It's a way of saying, hey, I made a change, will you take it? There's nothing too novel in the concept of the pull request: it's fundamentally no different than sending emails with patches to a mailing list. But what is so special is GitHub's execution. Gone are the days of configuring and using one-off tools and processes. Instead, we have the friendly confines of a clean, friendly, and modern web experience. While GitHub is built upon the Git tool, you don't even need to use Git (a tool lampooned for its horrible usability and approachability) to contribute on GitHub! Instead, you can do everything from your browser. That warrants repeating: you don't need to leave your browser to contribute on GitHub. GitHub has essentially reduced process debt to edit a text document territory, and pretty much anybody who has used a computer can do that. This has enabled GitHub to dabble into non-code territory, such as its GitHub and Government initiative to foster community involvement in government. (GitHub is really a platform for easily seeing and changing any content or data. But, please, let me continue using code as a stand-in, since I want to focus on the developer audience.)
GitHub took an overly-complicated and fragmented world of varying contribution processes and made the new world revolve around code and a unified and simple process for change - the pull request.
Yes, there are other reasons for GitHub's success. You can make strong arguments that GitHub has capitalized on the social and psychological aspects of coding and human desire for success and happiness. I agree.
You can also argue GitHub succeeded because of Git. That statement is more or less technically accurate, but I don't think it is a sound argument. Git may have been the most feature complete open source DVCS at the time GitHub came into existence. But that doesn't mean there is something special about Git that no other DVCS has that makes GitHub popular. Had another tool been more feature complete or had the backing of a project as large as Linux at the time of GitHub's launch, we could very well be looking at a successful service built on something that isn't Git. Git had early market advantage and I argue its popularity today - a lot of it via GitHub - is largely a result of its early advantages over competing tools. And, I would go so far to say that when you consider the poor usability of Git and the pain that its users go through when first learning it, more accurate statements would be that GitHub succeeded in spite of Git and Git owes much of its success to GitHub.
When I look back at the rise of GitHub, I see a service that has succeeded by putting people first by allowing them to capitalize on more productive workflows and processes. They've done this by emphasizing code, not process, as the means for change. Organizations and projects should take note.
When companies or organizations deploy version control, they have to make many choices. One of them is how many repositories to create. Your choices are essentially a) a single, monolithic repository that holds everything b) many separate, smaller repositories that hold all the individual parts c) something in between.
The prevailing convention today (especially in the open source realm) is to create many separate and loosely coupled repositories, each repository mapping to a specific product or service. That does seem reasonable: if you were organizing files on your filesystem, you would group them by functionality or role (photos, music, documents, etc). And, version control tools are functionally filesystems. So it makes sense to draw repository boundaries at directory/role levels.
Further reinforcing the separate repository convention is the scaling behavior of our version control tools. Git, the popular tool in open source these days, doesn't scale well to very large repositories due to - among other things - not having narrow clones (fetching a subset of files). It scales well enough to the overwhelming majority of projects. But if you are a large organization generating lots of data (read: gigabytes of data over hundreds of thousands of files and commits) for version control, Git is unsuitable in its current form. Other tools (like Mercurial) don't currently fare that much better (although Mercurial has plans to tackle these scaling vectors).
Despite popular convention and even limitations in tools, companies like Google and Facebook opt to run large, monolithic repositories. Google runs Perforce. Facebook is on Mercurial, or at least is in the process of migrating to Mercurial.
Why do these companies run monolithic repositories? In Google's words:
We have a single large depot with almost all of Google's projects on it. This aids agile development and is much loved by our users, since it allows almost anyone to easily view almost any code, allows projects to share code, and allows engineers to move freely from project to project. Documentation and data is stored on the server as well as code.
So, monolithic repositories are all about moving fast and getting things done more efficiently. In other words, monolithic repositories increase developer productivity.
Furthermore, monolithic repositories are also more compatible with the ebb and flow of large organizations and large software projects. Components, features, products, and teams come and go, merge and split. The only constant is change. And if you are maintaining separate repositories that attempt to map to this ever-changing organizational topology, you are going to have a bad time. Either you'll be constantly copying, moving, merging, splitting, etc data and repositories. Or your repositories will be organized in a very non-logical and non-intuitive manner. That translates to overhead and lost productivity. I think that monolithic repositories handle the realities of large organizations much better. Big change or reorganization you want to reflect? You can make a single, atomic, history-preserving commit to move things around. I think that's much more manageable, especially when you consider the difficulty and annoyance of history-preserving changes across repositories.
Naysayers will decry monolithic repositories on principled and practical grounds.
The principled camp will say that separate repositories constitute a loosely coupled (dare I say service oriented) architecture that maps better to how software is consumed, assembled, and deployed and that erecting barriers in the form of separate repositories deliberately enforces this architecture. I agree. However, you can still maintain a loosely coupled architecture with monolithic repositories. The Subversion model of checking out a single tree from a larger repository proves this. Furthermore, I would say architecture decisions should be enforced by people (via code review, etc), not via version control repository topology. I believe this principled argument against monolithic repositories to be rather weak.
The principled camp living in the open source realm may also decry monolithic repositories as an affront to the spirit of open source. They would say that a monolithic repository creates unfairly strong ties to the organization that operates it and creates barriers to forking, etc. This may be true. But monolithic repositories don't intrinsically infringe on the basic software freedoms, organizations do. Therefore, I find this principled argument rather weak.
The practical camp will say that monolithic repositories just don't scale or aren't suitable for general audiences. These concerns are real.
Fully distributed version control systems (every commit on every machine) definitely don't scale past certain limits. Depending on your repository and user base, your scaling limits include disk space (repository data terabytes in size), bandwidth (repository data terabytes in size), filesystem (repository hundreds of thousands or millions of files), CPU and memory (operations on large repositories take too many system resources), and many heads/branches (tools like Git and Mercurial don't scale well to tens of thousands of heads/branches). These limitations with fully distributed version control are why distributed version control tools like Git and Mercurial support a partially-distributed mode that behaves more like your classical server-client model, like those employed by Subversion, Perforce, etc. Git supports shallow clone and sparse checkout. Mercurial supports shallow clone (via remotefilelog) and has planned support for narrow clone and sparse checkout in the next release or two. Of course, you can avoid the scaling limitations of distributed version control by employing a non-distributed tool, such as Subversion. Many companies continue to reach this conclusion today. However, users adapted to the distributed workflow would likely be up in arms (they would probably use tools like hg-subversion or git-svn to maintain their workflows). So, while scaling of version control can be a real concern, there are solutions and workarounds. However, they do involve falling back to a partially-distributed model.
Another concern with monolithic repositories is user access control. You inevitably have code or data that is more sensitive and want to limit who can change or even access it. Separate repositories seem to facilitate a simpler model: per-repository access control. With monolithic repositories, you have to worry about per-directory/subtree permissions, an increased risk of data leaking, etc. This concern is more real with distributed version control, as distributed data and access control aren't naturally compatible. But these issues can be resolved. And if the tooling supports it, there is only a semantic difference between managing access control between repositories versus components of a single repository.
When it comes to repository hosting conversions, I agree with Google and Facebook: I prefer monolithic repositories. When I am interacting with version control, I just want to get stuff done. I don't want to waste time dealing with multiple commands to manage multiple repositories. I don't want to waste time or expend cognitive load dealing with submodule, subrepository, or big files management. I don't want to waste time trying to find and reuse code, data, or documentation. I want everything at my fingertips, where it can be easily discovered, inspected, and used. Monolithic repositories facilitate these workflows more than separate repositories and make me more productive as a result.
Now, if only all the tools and processes we use and love would work with monolithic repositories...
I was editing a wiki page yesterday and I think I coined a new term which I'd like to enter the common nomenclature: repository-centric development. The term refers to development/version control workflows that place repositories - not patches - first.
When collaborating on version controlled code with modern tools like Git and Mercurial, you essentially have two choices on how to share version control data: patches or repositories.
Patches have been around since the dawn of version control. Everyone knows how they work: your version control system has a copy of the canonical data and it can export a view of a specific change into what's called a patch. A patch is essentially a diff with extra metadata.
When distributed version control systems came along, they brought with them an alternative to patch-centric development: repository-centric development. You could still exchange patches if you wanted, but distributed version control allowed you to pull changes directly from multiple repositories. You weren't limited to a single master server (that's what the distributed in distributed version control means). You also didn't have to go through an intermediate transport such as email to exchange patches: you communicate directly with a peer repository instance.
Repository-centric development eliminates the middle man required for patch exchange: instead of exchanging derived data, you exchange the actual data, speaking the repository's native language.
One advantage of repository-centric development is it eliminates the problem of patch non-uniformity. Patches come in many different flavors. You have plain diffs. You have diffs with metadata. You have Git style metadata. You have Mercurial style metadata. You can produce patches with various lines of context in the diff. There are different methods for handling binary content. There are different ways to express file adds, removals, and renames. It's all a hot mess. Any system that consumes patches needs to deal with the non-uniformity. Do you think this isn't a problem in the real world? Think again. If you are involved with an open source project that collects patches via email or by uploading patches to a bug tracker, have you ever seen someone accidentally upload a patch in the wrong format? That's patch non-uniformity. New contributors to Firefox do this all the time. I also see it in the Mercurial project. With repository-centric development, patches never enter the picture, so patch non-uniformity is a non-issue. (Don't confuse the superficial formatting of patches with the content, such as an incorrect commit message format.)
Another advantage of repository-centric development is it makes the act of exchanging data easier. Just have two repositories talk to each other. This used to be difficult, but hosting services like GitHub and Bitbucket make this easy. Contrast with patches, which require hooking your version control tool up to wherever those patches are located. The Linux Kernel, like so many other projects, uses email for contributing changes. So now Git, Mercurial, etc all fulfill Zawinski's law. This means your version control tool is talking to your inbox to send and receive code. Firefox development uses Bugzilla to hold patches as attachments. So now your version control tool needs to talk to your issue tracker. (Not the worst idea in the world I will concede.) While, yes, the tools around using email or uploading patches to issue trackers or whatever else you are using to exchange patches exist and can work pretty well, the grim reality is that these tools are all reinventing the wheel of repository exchange and are solving a problem that has already been solved by git push, git fetch, hg pull, hg push, etc. Personally, I would rather hg push to a remote and have tools like issue trackers and mailing lists pull directly from repositories. At least that way they have a direct line into the source of truth and are guaranteed a consistent output format.
Another area where direct exchange is huge is multi-patch commits (branches in Git parlance) or where commit data is fragmented. When pushing patches to email, you need to insert metadata saying which patch comes after which. Then the email import tool needs to reassemble things in the proper order (remember that the typical convention is one email per patch and email can be delivered out of order). Not the most difficult problem in the world to solve. But seriously, it's been solved already by git fetch and hg pull! Things are worse for Bugzilla. There is no bullet-proof way to order patches there. The convention at Mozilla is to add Part N strings to commit messages and have the Bugzilla import tool do a sort (I assume it does that). But what if you have a logical commit series spread across multiple bugs? How do you reassemble everything into a linear series of commits? You don't, sadly. Just today I wanted to apply a somewhat complicated series of patches to the Firefox build system I was asked to review so I could jump into a debugger and see what was going on so I could conduct a more thorough review. There were 4 or 5 patches spread over 3 or 4 bugs. Bugzilla and its patch-centric workflow prevented me from importing the patches. Fortunately, this patch series was pushed to Mozilla's Try server, so I could pull from there. But I haven't always been so fortunate. This limitation means developers have to make sacrifices such as writing fewer, larger patches (this makes code review harder) or involving unrelated parties in the same bug and/or review. In other words, deficient tools are imposing limited workflows. No bueno.
It is a fair criticism to say that not everyone can host a server or that permissions and authorization are hard. Although I think concerns about impact are overblown. If you are a small project, just create a GitHub or Bitbucket account. If you are a larger project, realize that people time is one of your largest expenses and invest in tools like proper and efficient repository hosting (often this can be GitHub) to reduce this waste and keep your developers happier and more efficient.
One of the clearest examples of repository-centric development is GitHub. There are no patches in GitHub. Instead, you git push and git fetch. Want to apply someone else's work? Just add a remote and git fetch! Contrast with first locating patches, hooking up Git to consume them (this part was always confusing to me - do you need to retroactively have them sent to your email inbox so you can import them from there), and finally actually importing them. Just give me a URL to a repository already. But the benefits of repository-centric development with GitHub don't stop at pushing and pulling. GitHub has built code review functionality into pushes. They call these pull requests. While I have significant issues with GitHub's implemention of pull requests (I need to blog about those some day), I can't deny the utility of the repository-centric workflow and all the benefits around it. Once you switch to GitHub and its repository-centric workflow, you more clearly see how lacking patch-centric development is and quickly lose your desire to go back to the 1990's state-of-the-art methods for software development.
I hope you now know what repository-centric development is and will join me in championing it over patch-based development.
Mozillians reading this will be very happy to learn that work is under way to shift Firefox's development workflow to a more repository-centric world. Stay tuned.
Version control systems can be highly useful tools.
At Mozilla, we've made numerous improvements and customizations to our version control tools. We have custom hooks that run on the server. We have a custom skin for Mercurial's web interface. Mozillians have written a handful of Mercurial extensions to aid with common developer tasks, such as pushing to try, interacting with Bugzilla, making mq more useful, and more.
These have all come into existence in an organic manner, one after the other. Individuals have seen an itch and scratched it. Good for them. Good for Mozilla.
Unfortunately, the collection of amassed tools has become quite large. They have become difficult to discover and keep up to date. The consistency in quality and style between the tools varies. Each tool has separate processes for updating and changing.
I contacted the maintainers of the popular version control tools at Mozilla with a simple proposal: let's maintain all our tools under one repo. This would allow us to increase cohesion, share code, maintain a high quality bar, share best practices, etc. There were no major objections, so we now have a unified repository containing our version control tools!
Currently, we only have a few Mercurial extensions in there. A goal is to accumulate as much of the existing Mercurial infrastructure into that repository as possible. Client code. Server code. All of the code. I want developers to be able to install the same hooks on their clients as what's running on the server: why should your local repo let you commit something that the server will reject? I want developers to be able to reasonably reproduce Mozilla's canonical version control server configuration locally. That way, you can test things locally with a high confidence that your changes will work the same way on production. This allows deployments to move faster and with less friction.
The immediate emphasis will be on moving extensions into this repo and deprecating the old homes on user repositories. Over time, we'll move into consolidating server code and getting hg.mozilla.org and git.mozilla.org to use this repository. But that's a lower priority: the most important goal right now is to make it easier and friendlier for people to run productivity-enhancing tools.
So, if you see your Mercurial extensions alerting you that they've been moved to a new repository, now you know what's going on.
Over the winter break, I set out on an ambitious project to create a service to help developers and others manage the flury of patches going into Firefox. While the project is far from complete, I'm ready to unleash the first part of the project upon the world.
Essentially, I built a centralized indexing service for version control repositories with Mozilla's extra metadata thrown in. I tell it what repositories to mirror, and it clones everything, fetches data such as the pushlog and Git SHA-1 mappings, and stores everything in a central database. It then exposes this aggregated data through world-readable web services.
Currently, I have the service indexing the popular project branches for Firefox (central, aurora, beta, release, esr, b2g, inbound, fx-team, try, etc). You can view the full list via the web service. As a bonus, I'm also serving these repositories via hg.gregoryszorc.com. My server appears to be significantly faster than hg.mozilla.org. If you want to use it for your daily needs, go for it. I make no SLA guarantees, however.
I'm also using this service as an opportunity to experiment with alternate forms of Mercurial hosting. I have mirrors of mozilla-central and the try repository with generaldelta and lz4 compression enabled. I may blog about what those are eventually. The teaser is that they can make Mercurial perform a lot faster under some conditions. I'm also using ZFS under the hood to manage repositories. Each repository is a ZFS filesystem. This means I can create repository copies on the server (user repositories anyone?) at a nearly free cost. Contrast this to the traditional method of full clones, which take lots of time, memory, CPU, and storage.
Anyway, some things you can do with the existing web service:
- Obtain metadata about Mercurial changesets. Example.
- Look up metadata about Git commits. Example.
- Obtain a SPORE descriptor describing the web service endpoints. This allows you to auto-generate clients from descriptors. Yay!
Obviously, that's not a lot. But adding new endpoints is relatively straightforward. See the source. It's literally as easy as defining a URL mapping and writing a database query.
The performance is also not the best. I just haven't put in the effort to tune things yet. All of the querying hits the database, not Mercurial. So, making things faster should merely be a matter of database and hosting optimization. Patches welcome!
Some ideas that I haven't had time to implement yet:
- Return changests in a specific repository
- Return recently pushed changesets
- Return pushes for a given user
- Return commits for a given author
- Return commits referencing a given bug
- Obtain TBPL URLs for pushes with changeset
- Integrate bugzilla metadata
Once those are in place, I foresee this service powering a number of dashboards. Patches welcome.
Again, this service is only the tip of what's possible. There's a lot that could be built on this service. I have ideas. Others have ideas.
The project includes a Vagrant file and Puppet manifests for provisioning the server. It's a one-liner to get a development environment up and running. It should be really easy to contribute to this project. Patches welcome.
Next Page »