Gregory Szorc's Digital Home

Notes from Git Merge 2015

May 12, 2015 at 03:40 PM | categories: Git, Mercurial, Mozilla

Git Merge 2015 was a Git user conference held in Paris on April 8 and 9, 2015.

I'm kind of a version control geek. I'm also responsible for a large part of Mozilla's version control hosting. So, when the videos were made public, you can bet I took interest.

This post contains my notes from a few of the Git Merge talks. I try to separate content/facts from my opinions by isolating my opinions (within parenthesis).

Git at Google

Git at Google: Making Big Projects (and everyone else) Happy is from a Googler (Dave Borowitz) who works on JGit for the Git infrastructure team at Google.

"Everybody in this room is going to feel some kind of pain working with Git at scale at some time in their career."

First Git usage at Google in 2008 for Android. 2011 googlesource.com launches.

24,000 total Git repos at Google. 77.1M requests/day. 30-40 TB/day. 2-3 Gbps.

Largest repo is 210GB (not public it appears).

800 repos in AOSP. Google maintains internal fork of all Android repos (so they can throw stuff over the wall). Fresh AOSP tree is 17 GiB. Lots of contracts dictating access.

Chrome repos include Chromium, Blink, Chromium OS. Performed giant Subversion migration. Developers of Chrome very set in their ways. Had many workflows integrated with Subversion web interface. Subversion blame was fast, Git blame slow. Built caching backend for Git blame to make developers happy.

Chromium 2.9 GiB, 3.6M objects, 390k commits. Blink 5.3 GiB, 3.1M objects, 177k commits. They merged them into a monorepo. Mention of Facebook's monorepo talk and Mercurial scaling efforts for a repo larger then Chromium/Blink monorepo. Benefits to developers for doing atomic refactorings, etc.

"Being big is hard."

AOSP: 1 Gbps -> 2 minutes for 17 GiB. 20 Mbps -> 3 hours. Flaky internet combined with non-resumable clone results in badness. Delta resolution can take a while. Checkout of hundreds of thousands of files can be slow, especially on Windows.

"As tool developers... it's hard to tell people don't check in large binaries, do things this way, ... when all they are trying to do is get their job done." (I couldn't agree more: tools should ideally not impose sub-optimal workflows.)

They prefer scaling pain to supporting multiple tools. (I think this meant don't use multiple VCSs if you can just make one scale.)

Shallow clone beneficial. But some commands don't work. log not very useful.

Narrow clones mentioned. Apparently talked about elsewhere at Git Merge not captured on video. Non-trivial problem for Git. "We have no idea when this is going to happen."

Split repos until narrow clone is available. Google wrote repo to manage multiple repos. They view repo and multiple repos as stop-gap until narrow clone is implemented.

git submodule needs some love. Git commands don't handle submodules or multiple repos very well. They'd like to see repo features incorporated into git submodule.

Transition to server operation.

Pre-2.0, counting objects was hard. For Linux kernel, 60s 100% CPU time per clone to count objects. "Linux isn't even that big."

Traditionally Git is single homed. Load from users. Load from automation.

Told anecdote about how Google's automation once recloned the repo after a failed Git command. Accidentally made a change one day that caused a command to persistently fail. DoS against server. (We've had this at Mozilla.)

Garbage collection on server is CPU intensive and necessary. Takes cores away from clients.

Reachability bitmaps implemented in JGit, ported to Git 2.0. Counting objects for Linux clones went from 60s CPU to ~100ms.

Google puts static, pre-generated bundles on a CDN. Client downloads bundle then does incremental fetch. Massive reduction in server load. Bundle files better for users. Resumable.

They have ideas for integrating bundles into git fetch, but it's "a little way's off." (This feature is partially implemented in Mercurial 3.4 and we have plans for using it at Mozilla.) It's feature in repo today.

Shared filesystem would be really nice to spread CPU load. NFS "works." Performance problems with high throughput repositories.

Master-mirror replication can help. Problems with replication lag. Consistency is hard.

Google uses a distributed Git store using Bigtable and GFS built on JGit. Git-aware load balancer. Completely separate pool of garbage collection workers. They do replication to multiple datacenters before pushes. 6 datacenters across world. Some of their stuff is open source. A lot isn't.

Humans have difficulty managing hundreds of repositories. "How do you as a human know what you need to modify?" Monorepos have this problem too. Inherent with lots of content. (Seemed to imply it is worse with multiple repos than with a monorepo.)

Porting changes between forks is hard. e.g. cherry picking between internal and external Android repos.

ACLs are a mess.

Google built Gerrit code review. It does ACLs, auto rebasing, release branch management. It's a swiss army knife. (This aligns with my vision for MozReview and code-centric development.)

Scaling Git at Twitter

Wilhelm Bierbaum from Twitter talks about Scaling Git at Twitter.

"We've decided it's really important to make Twitter a good place to work for developers. Source control is one of those points where we were lacking. We had some performance problems with Git in the past."

Twitter runs a monorepo. Used to be 3 repos. "Working with a single repository is the way they prefer to develop software when developing hundreds of services." They also have a single build system. They have a geo diverse workforce.

They use normal canonical implementation of Git + some optimizations.

Benefits of a monorepo:

Visibility. Easier for people to find code in one repo. Code search tools tailored towards single repos.

Single toolchain. single set of tools to build, test, and deploy. When improvements to tools made, everyone benefits because one toolchain.

Easy to share code (particularly generated code like IDL). When operating many small services, services developed together. Code duplication is minimized. Twitter relies on IDL heavily.

Simpler to predict the impact of your changes. Easier to look at single code base then to understand how multiple code bases interact. Easy to make a change and see what breaks rather than submit changes to N repos and do testing in N repos.

Makes refactoring less arduous.

Surfaces architecture issues earlier.

Breaking changes easier to coordinate

Drawbacks of monorepos:

Large disk footprint for full history.

Tuning filesystem only goes so far.

Forces some organizations to adopt sparse checkouts and shallow clones.

Submodules aren't good enough to use. add and commit don't recognize submodule boundaries very well and aren't very usable.

"To us, using a tool such as repo that is in effect a secondary version control tool on top of Git does not feel right and doesn't lead to a fluid experience."

Twitter has centralized use of Git. Don't really benefit from distributed version control system. Feature branches. Goal is to live as close to master as possible. Long-running branches discouraged. Fewer conflicts to resolve.

They have project-level ownership system. But any developer can change anything.

They have lots of read-only replicas. Highly available writable server.

They use reference repos heavily so object exchange overhead is manageable.

Scaling issues with many refs. Partially due to how refs are stored on disk. File locking limits in OS. Commands like status, add, and commit can be slow, even with repo garbage collected and packed. Locking issues with garbage collection.

Incorporated file alteration monitor to make status faster. Reference to Facebook's work on watchman and its Mercurial integration. Significant impact on OS X. "Pretty much all our developers use OS X." (I assume they are using Watchman with Git - I've seen patches for this on the Git mailing list but they haven't been merged into core yet.)

They implemented a custom index format. Adopted faster hashing algorithm that uses native instructions.

Discovery with many refs didn't scale. 13 MB of raw data for refs exchange at Twitter. (!!) Experimenting with clients sending a bloom filter of refs. Hacked it together via HTTP header.

Fetch protocol is expensive. Lots of random I/O. Can take minutes to do incremental fetches. Bitmap indices help, but aren't good enough for them. Since they have central and well-defined environment, they changed fetch protocol to work like a journal: send all changed data since client's last fetch. Server work is essentially a sendfile system call. git push appends push packs to a log-structured journal. On fetch, clients "replay" the transactions from the journal. Similar to MySQL binary log replication. (This is very similar to some of the Mercurial replication work I'm doing at Mozilla. Yay technical validation.) (Append only files are also how Mercurial's storage model works by default.)

Log-structured data exchange means server side is cheap. They can insert HTTP caches to handle Range header aware requests.

Without this hack, they can't scale their servers.

Initial clone is seeded by BitTorrent.

It sounds like everyone's unmerged feature branches are on the one central repo and get transferred everywhere by default. Their journal fetch can selectively fetch refs so this isn't a huge problem.

They'd like to experiment with sparse repos. But they haven't gotten to that yet. They'd like a better storage abstraction in Git to enable necessary future scalability. They want a network-aware storage backend. Local objects not necessary if the network has them.

They are running a custom Git distribution/fork on clients. But they don't want to maintain forever. Prefer to send upstream.

Dropping Explicit Support for Mercurial 3.0

May 07, 2015 at 04:05 PM | categories: Mercurial, Mozilla

As of a few minutes ago, we explicitly dropped support for Mercurial 3.0 for all the Mercurial code in the version-control-tools repository. File issues in bug 1162304.

Code may still work against Mercurial 3.0. But it isn't supported and could break hard at any time.

Supporting multiple versions of any software carries with it some cost. The people writing tooling around Mercurial are busy. It is a waste of our time to bend over backwards to support old versions of software that all users should have upgraded from months ago. Still using older Mercurial versions means you aren't getting the best performance and may encounter bugs that have since been fixed.

See the Mozilla tailored Mercurial installation instructions for info on how to upgrade to the latest/greatest Mercurial version.

Reporting Mercurial Issues

May 04, 2015 at 01:45 PM | categories: Mercurial, Mozilla

I semi-frequently stumble upon conversations in hallways and on irc.mozilla.org about issues people are having with Mercurial. These conversations periodically involve a legitimate bug with Mercurial. Unfortunately, these conversations frequently end without an actionable result. Unless someone files a bug, pings me, etc, the complaints disappear into ether. That's not good for anyone and only results in bugs living longer than they should.

There are posters around Mozilla offices that say if you see something, file something. This advice does not just apply to Mozilla projects!

If you encounter an issue in Mercurial, please take the time to report it somewhere meaningful. The Reporting Issues with Mercurial page from the Mercurial for Mozillians guide tells you how to do this.

It is OK to complain about something. But if you don't inform someone empowered to do something about it, you are part of the problem without being part of the solution. Please make the incremental effort to be part of the solution.

Mercurial 3.4 Released

May 04, 2015 at 12:40 PM | categories: Mercurial, Mozilla

Mercurial 3.4 was released on May 1 (following Mercurial's time-based schedule of releasing a new version every 3 months).

3.4 is a significant release for a few reasons.

First, the next version of the wire protocol (bundle2) has been marked as non-experimental on servers. This version of the protocol paves over a number of deficiencies in the classic protocol. I won't go into low-level details. But I will say that the protocol enables some rich end-user experiences, such as having the server hand out URLs for pre-generated bundles (e.g. offload clones to S3), atomic push operations, and advanced workflows, such as having the server rebase automatically on push. Of course, you'll need a server running 3.4 to realize the benefits of the new protocol. hg.mozilla.org won't be updated until at least June 1.

Second, Mercurial 3.4 contains improvements to the tags cache to make performance concerns a thing of the past. Due to the structure of the Firefox repositories, the previous implementation of the tags cache could result in pauses of dozens of seconds during certain workflows. The problem should go away with Mercurial 3.4. Please note that on first use of Mercurial 3.4, your repository may perform a one-time upgrade of the tags cache. This will spin a full CPU core and will take up to a few minutes to complete on Firefox repos. Let it run to completion and performance should not be an issue again. I wrote the patches to change the tags cache (with lots of help from Pierre-Yves David, a Mercurial core contributor). So if you find anything wrong, I'm the one to complain to.

Third, the HTTP interface to Mercurial (hgweb) now has JSON output for nearly every endpoint. The implementation isn't yet complete, but it is better than nothing. But, it should be good enough for services to start consuming it. Again, this won't be available on hg.mozilla.org until the server is upgraded on June 1 at the earliest. This is a feature I added to core Mercurial. If you have feature requests, send them my way.

Fourth, a number of performance regressions introduced in Mercurial 3.3 were addressed. These performance issues frequently manifested during hg blame operations. Many Mozillians noticed them on hg.mozilla.org when looking at blame through the web interface.

For a more comprehensive list of changes, see my post about the 3.4 RC and the official release notes.

3.4 was a significant release. There are compelling reasons to upgrade. That being said, there were a lot of changes in 3.4. If you want to wait until 3.4.1 is released (scheduled for June 1) so you don't run into any regressions, nobody can fault you for that.

If you want to upgrade, I recommend reading the Mercurial for Mozillians Installation Page.

Automatically Redirecting Mercurial Pushes

April 30, 2015 at 12:30 PM | categories: Mercurial, Mozilla

Managing URLs in distributed version control tools can be a pain, especially if multiple repositories are involved. For example, with Mozilla's repository-based code review workflow (you push to a special review repository to initiate code review - this is conceptually similar to GitHub pull requests), there exist separate code review repositories for each logical repository. Figuring out how repositories map to each other and setting up remote paths for each new clone can be a pain and time sink.

As of today, we can now do something better.

If you push to ssh://reviewboard-hg.mozilla.org/autoreview, Mercurial will automatically figure out the appropriate review repository and redirect your push automatically. In other words, if we have MozReview set up to review whatever repository you are working on, your push and review request will automatically go through. No need to figure out what the appropriate review repo is or configure repository URLs!

Here's what it looks like:

$ hg push review
pushing to ssh://reviewboard-hg.mozilla.org/autoreview
searching for appropriate review repository
redirecting push to ssh://reviewboard-hg.mozilla.org/version-control-tools/
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files
remote: Trying to insert into pushlog.
remote: Inserted into the pushlog db successfully.
submitting 1 changesets for review

changeset:  11043:b65b087a81be
summary:    mozreview: create per-commit identifiers (bug 1160266)
review:     https://reviewboard.mozilla.org/r/7953 (draft)

review id:  bz://1160266/gps
review url: https://reviewboard.mozilla.org/r/7951 (draft)
(visit review url to publish this review request so others can see it)

Read the full instructions for more details.

This requires an updated version-control-tools repository, which you can get by running mach mercurial-setup from a Firefox repository.

For those that are curious, the autoreview repo/server advertises a list of repository URLs and their root commit SHA-1. The client automatically sends the push to a URL sharing the same root commit. The code is quite simple.

While this is only implemented for MozReview, I could envision us doing something similar for other centralized repository-centric services, such as Try and Autoland. Stay tuned.

« Previous Page -- Next Page »