Notes from Git Merge 2015

May 12, 2015 at 03:40 PM | categories: Git, Mercurial, Mozilla | View Comments

Git Merge 2015 was a Git user conference held in Paris on April 8 and 9, 2015.

I'm kind of a version control geek. I'm also responsible for a large part of Mozilla's version control hosting. So, when the videos were made public, you can bet I took interest.

This post contains my notes from a few of the Git Merge talks. I try to separate content/facts from my opinions by isolating my opinions (within parenthesis).

Git at Google

Git at Google: Making Big Projects (and everyone else) Happy is from a Googler (Dave Borowitz) who works on JGit for the Git infrastructure team at Google.

"Everybody in this room is going to feel some kind of pain working with Git at scale at some time in their career."

First Git usage at Google in 2008 for Android. 2011 googlesource.com launches.

24,000 total Git repos at Google. 77.1M requests/day. 30-40 TB/day. 2-3 Gbps.

Largest repo is 210GB (not public it appears).

800 repos in AOSP. Google maintains internal fork of all Android repos (so they can throw stuff over the wall). Fresh AOSP tree is 17 GiB. Lots of contracts dictating access.

Chrome repos include Chromium, Blink, Chromium OS. Performed giant Subversion migration. Developers of Chrome very set in their ways. Had many workflows integrated with Subversion web interface. Subversion blame was fast, Git blame slow. Built caching backend for Git blame to make developers happy.

Chromium 2.9 GiB, 3.6M objects, 390k commits. Blink 5.3 GiB, 3.1M objects, 177k commits. They merged them into a monorepo. Mention of Facebook's monorepo talk and Mercurial scaling efforts for a repo larger then Chromium/Blink monorepo. Benefits to developers for doing atomic refactorings, etc.

"Being big is hard."

AOSP: 1 Gbps -> 2 minutes for 17 GiB. 20 Mbps -> 3 hours. Flaky internet combined with non-resumable clone results in badness. Delta resolution can take a while. Checkout of hundreds of thousands of files can be slow, especially on Windows.

"As tool developers... it's hard to tell people don't check in large binaries, do things this way, ... when all they are trying to do is get their job done." (I couldn't agree more: tools should ideally not impose sub-optimal workflows.)

They prefer scaling pain to supporting multiple tools. (I think this meant don't use multiple VCSs if you can just make one scale.)

Shallow clone beneficial. But some commands don't work. log not very useful.

Narrow clones mentioned. Apparently talked about elsewhere at Git Merge not captured on video. Non-trivial problem for Git. "We have no idea when this is going to happen."

Split repos until narrow clone is available. Google wrote repo to manage multiple repos. They view repo and multiple repos as stop-gap until narrow clone is implemented.

git submodule needs some love. Git commands don't handle submodules or multiple repos very well. They'd like to see repo features incorporated into git submodule.

Transition to server operation.

Pre-2.0, counting objects was hard. For Linux kernel, 60s 100% CPU time per clone to count objects. "Linux isn't even that big."

Traditionally Git is single homed. Load from users. Load from automation.

Told anecdote about how Google's automation once recloned the repo after a failed Git command. Accidentally made a change one day that caused a command to persistently fail. DoS against server. (We've had this at Mozilla.)

Garbage collection on server is CPU intensive and necessary. Takes cores away from clients.

Reachability bitmaps implemented in JGit, ported to Git 2.0. Counting objects for Linux clones went from 60s CPU to ~100ms.

Google puts static, pre-generated bundles on a CDN. Client downloads bundle then does incremental fetch. Massive reduction in server load. Bundle files better for users. Resumable.

They have ideas for integrating bundles into git fetch, but it's "a little way's off." (This feature is partially implemented in Mercurial 3.4 and we have plans for using it at Mozilla.) It's feature in repo today.

Shared filesystem would be really nice to spread CPU load. NFS "works." Performance problems with high throughput repositories.

Master-mirror replication can help. Problems with replication lag. Consistency is hard.

Google uses a distributed Git store using Bigtable and GFS built on JGit. Git-aware load balancer. Completely separate pool of garbage collection workers. They do replication to multiple datacenters before pushes. 6 datacenters across world. Some of their stuff is open source. A lot isn't.

Humans have difficulty managing hundreds of repositories. "How do you as a human know what you need to modify?" Monorepos have this problem too. Inherent with lots of content. (Seemed to imply it is worse with multiple repos than with a monorepo.)

Porting changes between forks is hard. e.g. cherry picking between internal and external Android repos.

ACLs are a mess.

Google built Gerrit code review. It does ACLs, auto rebasing, release branch management. It's a swiss army knife. (This aligns with my vision for MozReview and code-centric development.)

Scaling Git at Twitter

Wilhelm Bierbaum from Twitter talks about Scaling Git at Twitter.

"We've decided it's really important to make Twitter a good place to work for developers. Source control is one of those points where we were lacking. We had some performance problems with Git in the past."

Twitter runs a monorepo. Used to be 3 repos. "Working with a single repository is the way they prefer to develop software when developing hundreds of services." They also have a single build system. They have a geo diverse workforce.

They use normal canonical implementation of Git + some optimizations.

Benefits of a monorepo:

Visibility. Easier for people to find code in one repo. Code search tools tailored towards single repos.

Single toolchain. single set of tools to build, test, and deploy. When improvements to tools made, everyone benefits because one toolchain.

Easy to share code (particularly generated code like IDL). When operating many small services, services developed together. Code duplication is minimized. Twitter relies on IDL heavily.

Simpler to predict the impact of your changes. Easier to look at single code base then to understand how multiple code bases interact. Easy to make a change and see what breaks rather than submit changes to N repos and do testing in N repos.

Makes refactoring less arduous.

Surfaces architecture issues earlier.

Breaking changes easier to coordinate

Drawbacks of monorepos:

Large disk footprint for full history.

Tuning filesystem only goes so far.

Forces some organizations to adopt sparse checkouts and shallow clones.

Submodules aren't good enough to use. add and commit don't recognize submodule boundaries very well and aren't very usable.

"To us, using a tool such as repo that is in effect a secondary version control tool on top of Git does not feel right and doesn't lead to a fluid experience."

Twitter has centralized use of Git. Don't really benefit from distributed version control system. Feature branches. Goal is to live as close to master as possible. Long-running branches discouraged. Fewer conflicts to resolve.

They have project-level ownership system. But any developer can change anything.

They have lots of read-only replicas. Highly available writable server.

They use reference repos heavily so object exchange overhead is manageable.

Scaling issues with many refs. Partially due to how refs are stored on disk. File locking limits in OS. Commands like status, add, and commit can be slow, even with repo garbage collected and packed. Locking issues with garbage collection.

Incorporated file alteration monitor to make status faster. Reference to Facebook's work on watchman and its Mercurial integration. Significant impact on OS X. "Pretty much all our developers use OS X." (I assume they are using Watchman with Git - I've seen patches for this on the Git mailing list but they haven't been merged into core yet.)

They implemented a custom index format. Adopted faster hashing algorithm that uses native instructions.

Discovery with many refs didn't scale. 13 MB of raw data for refs exchange at Twitter. (!!) Experimenting with clients sending a bloom filter of refs. Hacked it together via HTTP header.

Fetch protocol is expensive. Lots of random I/O. Can take minutes to do incremental fetches. Bitmap indices help, but aren't good enough for them. Since they have central and well-defined environment, they changed fetch protocol to work like a journal: send all changed data since client's last fetch. Server work is essentially a sendfile system call. git push appends push packs to a log-structured journal. On fetch, clients "replay" the transactions from the journal. Similar to MySQL binary log replication. (This is very similar to some of the Mercurial replication work I'm doing at Mozilla. Yay technical validation.) (Append only files are also how Mercurial's storage model works by default.)

Log-structured data exchange means server side is cheap. They can insert HTTP caches to handle Range header aware requests.

Without this hack, they can't scale their servers.

Initial clone is seeded by BitTorrent.

It sounds like everyone's unmerged feature branches are on the one central repo and get transferred everywhere by default. Their journal fetch can selectively fetch refs so this isn't a huge problem.

They'd like to experiment with sparse repos. But they haven't gotten to that yet. They'd like a better storage abstraction in Git to enable necessary future scalability. They want a network-aware storage backend. Local objects not necessary if the network has them.

They are running a custom Git distribution/fork on clients. But they don't want to maintain forever. Prefer to send upstream.

Read and Post Comments

Code First and the Rise of the DVCS and GitHub

January 10, 2015 at 12:35 PM | categories: Git, Mozilla | View Comments

The ascendancy of GitHub has very little to do with its namesake tool, Git.

What GitHub did that was so radical for its time and the strategy that GitHub continues to execute so well on today is the approach of putting code first and enabling change to be a frictionless process.

In case you weren't around for the pre-GitHub days or don't remember, they were not pleasant. Tools around code management were a far cry from where they are today (I still argue the tools are pretty bad, but that's for another post). Centralized version control systems were prevalent (CVS and Subversion in open source, Perforce, ClearCase, Team Foundation Server, and others in the corporate world). Tools for looking at and querying code had horrible, ugly interfaces and came out of a previous era of web design and browser capabilities. It felt like a chore to do anything, including committing code. Yes, the world had awesome services like SourceForge, but they weren't the same as GitHub is today.

Before I get to my central thesis, I want to highlight some supporting reasons for GitHub's success. There were two developments in the second half of the 2000s the contributed to the success of GitHub: the rises of the distributed version control system (DVCS) and the modern web.

While distributed version control systems like Sun WorkShop TeamWare and BitKeeper existed earlier, it wasn't until the second half of the 2000s that DVCS systems took off. You can argue part of the reason for this was open source: my recollection is there wasn't a well-known DVCS available as free software before 2005. Speaking of 2005, it was a big year for DVCS projects: Git, Mercurial, and Bazaar all had initial releases. Suddenly, there were old-but-new ideas on how to do source control being exposed to new and willing-to-experiment audiences. DVCS were a critical leap from traditional version control because they (theoretically) impose less process and workflow limitations on users. With traditional version control, you needed to be online to commit, meaning you were managing patches, not commits, in your local development workflow. There were some forms of branching and merging, but they were a far cry from what is available today and were often too complex for mere mortals to use. As more and more people were exposed to distributed version control, they welcomed its less-restrictive and more powerful workflows. They realized that source control tools don't have to be so limiting. Distributed version control also promised all kinds of revamped workflows that could be harnessed. There were potential wins all around.

Around the same time that open source DVCS systems were emerging, web browsers were evolving from an application to render static pages to a platform for running web applications. Web sites using JavaScript to dynamically manipulate web page content (DHTML as it was known back then) were starting to hit their stride. I believe it was GMail that turned the most heads as to the full power of the modern web experience, with its novel-for-its-time extreme reliance on XMLHttpRequest for dynamically changing page content. People were realizing that powerful, desktop-like applications could be built for the web and could run everywhere.

GitHub launched in April 2008 standing on the shoulders of both the emerging interest in the Git content tracking tool and the capabilities of modern browsers.

I wasn't an early user of GitHub. My recollection is that GitHub was mostly a Rubyist's playground back then. I wasn't a Ruby programmer, so I had little reason to use GitHub in the early stages. But people did start using GitHub. And in the spirit of Ruby (on Rails), GitHub moved fast, or at least was projecting the notion that they were. While other services built on top of DVCS tools - like Bitbucket - did exist back then, GitHub seemed to have momentum associated with it. (Look at the archives for GitHub's and Bitbucket's respective blogs. GitHub has hundreds of blog entries; Bitbucket numbers in the dozens.) Developers everywhere up until this point had all been dealing with sub-optimal tools and workflows. Some of us realized it. Others hadn't. Many of those who did saw GitHub as a beacon of hope: we have all these new ideas and new potentials with distributed version control and here is a service under active development trying to figure out how to exploit that. Oh, and it's free for open source. Sign me up!

GitHub did capitalize on a market opportunity. They also capitalized on the value of marketing and the perception that they were moving fast and providing features that people - especially in open source - wanted. This captured the early adopters market. But I think what really set GitHub apart and led to the success they are enjoying today is their code first approach and their desire to make contribution easy, and even fun and sociable.

As developers, our job is to solve problems. We often do that by writing and changing code. And this often involves working as part of a team, or collaborating. To collaborate, we need tools. You eventually need some processes. And as I recently blogged, this can lead to process debt and inefficiencies associated with them.

Before GitHub, the process debt for contributing to other projects was high. You often had to subscribe to mailing lists in order to submit patches as emails. Or, you had to create an account on someone's bug tracker or code review tool before you could send patches. Then you had to figure out how to use these tools and any organization or project-specific extensions and workflows attached to them. It was quite involved and a lot could go wrong. Many projects and organizations (like Mozilla) still practice this traditional methology. Furthermore (and as I've written before), these traditional, single patch/commit-based tools often aren't effective at ensuring the desired output of high quality software.

Before GitHub solved process debt via commoditization of knowledge via market dominance, they took another approach: emphasizing code first development.

GitHub is all about the code. You load a project page and you see code. You may think a README with basic project information would be the first thing on a project page. But it isn't. Code, like data, is king.

Collaboration and contribution on GitHub revolve around the pull request. It's a way of saying, hey, I made a change, will you take it? There's nothing too novel in the concept of the pull request: it's fundamentally no different than sending emails with patches to a mailing list. But what is so special is GitHub's execution. Gone are the days of configuring and using one-off tools and processes. Instead, we have the friendly confines of a clean, friendly, and modern web experience. While GitHub is built upon the Git tool, you don't even need to use Git (a tool lampooned for its horrible usability and approachability) to contribute on GitHub! Instead, you can do everything from your browser. That warrants repeating: you don't need to leave your browser to contribute on GitHub. GitHub has essentially reduced process debt to edit a text document territory, and pretty much anybody who has used a computer can do that. This has enabled GitHub to dabble into non-code territory, such as its GitHub and Government initiative to foster community involvement in government. (GitHub is really a platform for easily seeing and changing any content or data. But, please, let me continue using code as a stand-in, since I want to focus on the developer audience.)

GitHub took an overly-complicated and fragmented world of varying contribution processes and made the new world revolve around code and a unified and simple process for change - the pull request.

Yes, there are other reasons for GitHub's success. You can make strong arguments that GitHub has capitalized on the social and psychological aspects of coding and human desire for success and happiness. I agree.

You can also argue GitHub succeeded because of Git. That statement is more or less technically accurate, but I don't think it is a sound argument. Git may have been the most feature complete open source DVCS at the time GitHub came into existence. But that doesn't mean there is something special about Git that no other DVCS has that makes GitHub popular. Had another tool been more feature complete or had the backing of a project as large as Linux at the time of GitHub's launch, we could very well be looking at a successful service built on something that isn't Git. Git had early market advantage and I argue its popularity today - a lot of it via GitHub - is largely a result of its early advantages over competing tools. And, I would go so far to say that when you consider the poor usability of Git and the pain that its users go through when first learning it, more accurate statements would be that GitHub succeeded in spite of Git and Git owes much of its success to GitHub.

When I look back at the rise of GitHub, I see a service that has succeeded by putting people first by allowing them to capitalize on more productive workflows and processes. They've done this by emphasizing code, not process, as the means for change. Organizations and projects should take note.

Read and Post Comments

On Monolithic Repositories

September 09, 2014 at 10:00 AM | categories: Git, Mercurial, Mozilla | View Comments

When companies or organizations deploy version control, they have to make many choices. One of them is how many repositories to create. Your choices are essentially a) a single, monolithic repository that holds everything b) many separate, smaller repositories that hold all the individual parts c) something in between.

The prevailing convention today (especially in the open source realm) is to create many separate and loosely coupled repositories, each repository mapping to a specific product or service. That does seem reasonable: if you were organizing files on your filesystem, you would group them by functionality or role (photos, music, documents, etc). And, version control tools are functionally filesystems. So it makes sense to draw repository boundaries at directory/role levels.

Further reinforcing the separate repository convention is the scaling behavior of our version control tools. Git, the popular tool in open source these days, doesn't scale well to very large repositories due to - among other things - not having narrow clones (fetching a subset of files). It scales well enough to the overwhelming majority of projects. But if you are a large organization generating lots of data (read: gigabytes of data over hundreds of thousands of files and commits) for version control, Git is unsuitable in its current form. Other tools (like Mercurial) don't currently fare that much better (although Mercurial has plans to tackle these scaling vectors).

Despite popular convention and even limitations in tools, companies like Google and Facebook opt to run large, monolithic repositories. Google runs Perforce. Facebook is on Mercurial, or at least is in the process of migrating to Mercurial.

Why do these companies run monolithic repositories? In Google's words:

We have a single large depot with almost all of Google's projects on it. This aids agile development and is much loved by our users, since it allows almost anyone to easily view almost any code, allows projects to share code, and allows engineers to move freely from project to project. Documentation and data is stored on the server as well as code.

So, monolithic repositories are all about moving fast and getting things done more efficiently. In other words, monolithic repositories increase developer productivity.

Furthermore, monolithic repositories are also more compatible with the ebb and flow of large organizations and large software projects. Components, features, products, and teams come and go, merge and split. The only constant is change. And if you are maintaining separate repositories that attempt to map to this ever-changing organizational topology, you are going to have a bad time. Either you'll be constantly copying, moving, merging, splitting, etc data and repositories. Or your repositories will be organized in a very non-logical and non-intuitive manner. That translates to overhead and lost productivity. I think that monolithic repositories handle the realities of large organizations much better. Big change or reorganization you want to reflect? You can make a single, atomic, history-preserving commit to move things around. I think that's much more manageable, especially when you consider the difficulty and annoyance of history-preserving changes across repositories.

Naysayers will decry monolithic repositories on principled and practical grounds.

The principled camp will say that separate repositories constitute a loosely coupled (dare I say service oriented) architecture that maps better to how software is consumed, assembled, and deployed and that erecting barriers in the form of separate repositories deliberately enforces this architecture. I agree. However, you can still maintain a loosely coupled architecture with monolithic repositories. The Subversion model of checking out a single tree from a larger repository proves this. Furthermore, I would say architecture decisions should be enforced by people (via code review, etc), not via version control repository topology. I believe this principled argument against monolithic repositories to be rather weak.

The principled camp living in the open source realm may also decry monolithic repositories as an affront to the spirit of open source. They would say that a monolithic repository creates unfairly strong ties to the organization that operates it and creates barriers to forking, etc. This may be true. But monolithic repositories don't intrinsically infringe on the basic software freedoms, organizations do. Therefore, I find this principled argument rather weak.

The practical camp will say that monolithic repositories just don't scale or aren't suitable for general audiences. These concerns are real.

Fully distributed version control systems (every commit on every machine) definitely don't scale past certain limits. Depending on your repository and user base, your scaling limits include disk space (repository data terabytes in size), bandwidth (repository data terabytes in size), filesystem (repository hundreds of thousands or millions of files), CPU and memory (operations on large repositories take too many system resources), and many heads/branches (tools like Git and Mercurial don't scale well to tens of thousands of heads/branches). These limitations with fully distributed version control are why distributed version control tools like Git and Mercurial support a partially-distributed mode that behaves more like your classical server-client model, like those employed by Subversion, Perforce, etc. Git supports shallow clone and sparse checkout. Mercurial supports shallow clone (via remotefilelog) and has planned support for narrow clone and sparse checkout by the end of 2015. Of course, you can avoid the scaling limitations of distributed version control by employing a non-distributed tool, such as Subversion. Many companies continue to reach this conclusion today. However, users adapted to the distributed workflow would likely be up in arms (they would probably use tools like hg-subversion or git-svn to maintain their workflows). So, while scaling of version control can be a real concern, there are solutions and workarounds. However, they do involve falling back to a partially-distributed model.

Another concern with monolithic repositories is user access control. You inevitably have code or data that is more sensitive and want to limit who can change or even access it. Separate repositories seem to facilitate a simpler model: per-repository access control. With monolithic repositories, you have to worry about per-directory/subtree permissions, an increased risk of data leaking, etc. This concern is more real with distributed version control, as distributed data and access control aren't naturally compatible. But these issues can be resolved. And if the tooling supports it, there is only a semantic difference between managing access control between repositories versus components of a single repository.

When it comes to repository hosting conversions, I agree with Google and Facebook: I prefer monolithic repositories. When I am interacting with version control, I just want to get stuff done. I don't want to waste time dealing with multiple commands to manage multiple repositories. I don't want to waste time or expend cognitive load dealing with submodule, subrepository, or big files management. I don't want to waste time trying to find and reuse code, data, or documentation. I want everything at my fingertips, where it can be easily discovered, inspected, and used. Monolithic repositories facilitate these workflows more than separate repositories and make me more productive as a result.

Now, if only all the tools and processes we use and love would work with monolithic repositories...

Want to read more about monolithic repositories? I highly recommend Advantages of Monolithic Version Control by Dan Luu.

Read and Post Comments

Repository-Centric Development

July 24, 2014 at 08:23 PM | categories: Git, Mercurial, Mozilla | View Comments

I was editing a wiki page yesterday and I think I coined a new term which I'd like to enter the common nomenclature: repository-centric development. The term refers to development/version control workflows that place repositories - not patches - first.

When collaborating on version controlled code with modern tools like Git and Mercurial, you essentially have two choices on how to share version control data: patches or repositories.

Patches have been around since the dawn of version control. Everyone knows how they work: your version control system has a copy of the canonical data and it can export a view of a specific change into what's called a patch. A patch is essentially a diff with extra metadata.

When distributed version control systems came along, they brought with them an alternative to patch-centric development: repository-centric development. You could still exchange patches if you wanted, but distributed version control allowed you to pull changes directly from multiple repositories. You weren't limited to a single master server (that's what the distributed in distributed version control means). You also didn't have to go through an intermediate transport such as email to exchange patches: you communicate directly with a peer repository instance.

Repository-centric development eliminates the middle man required for patch exchange: instead of exchanging derived data, you exchange the actual data, speaking the repository's native language.

One advantage of repository-centric development is it eliminates the problem of patch non-uniformity. Patches come in many different flavors. You have plain diffs. You have diffs with metadata. You have Git style metadata. You have Mercurial style metadata. You can produce patches with various lines of context in the diff. There are different methods for handling binary content. There are different ways to express file adds, removals, and renames. It's all a hot mess. Any system that consumes patches needs to deal with the non-uniformity. Do you think this isn't a problem in the real world? Think again. If you are involved with an open source project that collects patches via email or by uploading patches to a bug tracker, have you ever seen someone accidentally upload a patch in the wrong format? That's patch non-uniformity. New contributors to Firefox do this all the time. I also see it in the Mercurial project. With repository-centric development, patches never enter the picture, so patch non-uniformity is a non-issue. (Don't confuse the superficial formatting of patches with the content, such as an incorrect commit message format.)

Another advantage of repository-centric development is it makes the act of exchanging data easier. Just have two repositories talk to each other. This used to be difficult, but hosting services like GitHub and Bitbucket make this easy. Contrast with patches, which require hooking your version control tool up to wherever those patches are located. The Linux Kernel, like so many other projects, uses email for contributing changes. So now Git, Mercurial, etc all fulfill Zawinski's law. This means your version control tool is talking to your inbox to send and receive code. Firefox development uses Bugzilla to hold patches as attachments. So now your version control tool needs to talk to your issue tracker. (Not the worst idea in the world I will concede.) While, yes, the tools around using email or uploading patches to issue trackers or whatever else you are using to exchange patches exist and can work pretty well, the grim reality is that these tools are all reinventing the wheel of repository exchange and are solving a problem that has already been solved by git push, git fetch, hg pull, hg push, etc. Personally, I would rather hg push to a remote and have tools like issue trackers and mailing lists pull directly from repositories. At least that way they have a direct line into the source of truth and are guaranteed a consistent output format.

Another area where direct exchange is huge is multi-patch commits (branches in Git parlance) or where commit data is fragmented. When pushing patches to email, you need to insert metadata saying which patch comes after which. Then the email import tool needs to reassemble things in the proper order (remember that the typical convention is one email per patch and email can be delivered out of order). Not the most difficult problem in the world to solve. But seriously, it's been solved already by git fetch and hg pull! Things are worse for Bugzilla. There is no bullet-proof way to order patches there. The convention at Mozilla is to add Part N strings to commit messages and have the Bugzilla import tool do a sort (I assume it does that). But what if you have a logical commit series spread across multiple bugs? How do you reassemble everything into a linear series of commits? You don't, sadly. Just today I wanted to apply a somewhat complicated series of patches to the Firefox build system I was asked to review so I could jump into a debugger and see what was going on so I could conduct a more thorough review. There were 4 or 5 patches spread over 3 or 4 bugs. Bugzilla and its patch-centric workflow prevented me from importing the patches. Fortunately, this patch series was pushed to Mozilla's Try server, so I could pull from there. But I haven't always been so fortunate. This limitation means developers have to make sacrifices such as writing fewer, larger patches (this makes code review harder) or involving unrelated parties in the same bug and/or review. In other words, deficient tools are imposing limited workflows. No bueno.

It is a fair criticism to say that not everyone can host a server or that permissions and authorization are hard. Although I think concerns about impact are overblown. If you are a small project, just create a GitHub or Bitbucket account. If you are a larger project, realize that people time is one of your largest expenses and invest in tools like proper and efficient repository hosting (often this can be GitHub) to reduce this waste and keep your developers happier and more efficient.

One of the clearest examples of repository-centric development is GitHub. There are no patches in GitHub. Instead, you git push and git fetch. Want to apply someone else's work? Just add a remote and git fetch! Contrast with first locating patches, hooking up Git to consume them (this part was always confusing to me - do you need to retroactively have them sent to your email inbox so you can import them from there), and finally actually importing them. Just give me a URL to a repository already. But the benefits of repository-centric development with GitHub don't stop at pushing and pulling. GitHub has built code review functionality into pushes. They call these pull requests. While I have significant issues with GitHub's implemention of pull requests (I need to blog about those some day), I can't deny the utility of the repository-centric workflow and all the benefits around it. Once you switch to GitHub and its repository-centric workflow, you more clearly see how lacking patch-centric development is and quickly lose your desire to go back to the 1990's state-of-the-art methods for software development.

I hope you now know what repository-centric development is and will join me in championing it over patch-based development.

Mozillians reading this will be very happy to learn that work is under way to shift Firefox's development workflow to a more repository-centric world. Stay tuned.

Read and Post Comments

New Repository for Mozilla Version Control Tools

February 05, 2014 at 07:15 PM | categories: Git, Mercurial, Mozilla | View Comments

Version control systems can be highly useful tools.

At Mozilla, we've made numerous improvements and customizations to our version control tools. We have custom hooks that run on the server. We have a custom skin for Mercurial's web interface. Mozillians have written a handful of Mercurial extensions to aid with common developer tasks, such as pushing to try, interacting with Bugzilla, making mq more useful, and more.

These have all come into existence in an organic manner, one after the other. Individuals have seen an itch and scratched it. Good for them. Good for Mozilla.

Unfortunately, the collection of amassed tools has become quite large. They have become difficult to discover and keep up to date. The consistency in quality and style between the tools varies. Each tool has separate processes for updating and changing.

I contacted the maintainers of the popular version control tools at Mozilla with a simple proposal: let's maintain all our tools under one repo. This would allow us to increase cohesion, share code, maintain a high quality bar, share best practices, etc. There were no major objections, so we now have a unified repository containing our version control tools!

Currently, we only have a few Mercurial extensions in there. A goal is to accumulate as much of the existing Mercurial infrastructure into that repository as possible. Client code. Server code. All of the code. I want developers to be able to install the same hooks on their clients as what's running on the server: why should your local repo let you commit something that the server will reject? I want developers to be able to reasonably reproduce Mozilla's canonical version control server configuration locally. That way, you can test things locally with a high confidence that your changes will work the same way on production. This allows deployments to move faster and with less friction.

The immediate emphasis will be on moving extensions into this repo and deprecating the old homes on user repositories. Over time, we'll move into consolidating server code and getting hg.mozilla.org and git.mozilla.org to use this repository. But that's a lower priority: the most important goal right now is to make it easier and friendlier for people to run productivity-enhancing tools.

So, if you see your Mercurial extensions alerting you that they've been moved to a new repository, now you know what's going on.

Read and Post Comments

Next Page ยป