Notes from Facebook's Developer Infrastructure at Scale F8 Talk

March 28, 2015 at 11:45 AM | categories: Mozilla

Any time Facebook talks about technical matters I tend to listen. They have a track record of demonstrating engineering leadership in several spaces. And, unlike many companies that just talk, Facebook often gives others access to those ideas via source code and healthy open source projects. It's rare to see a company operating on the frontier of the computing field provide so much insight into their inner workings. You can gain so much by riding their coattails and following their lead instead of clinging to and cargo culting from the past.

The Facebook F8 developer conference was this past week. All the talks are now available online. I encourage you to glance through the list of talks and watch whatever is relevant to you. There's a little something for everyone.

Of particular interest to me is the Big Code: Developer Infrastructure at Facebook's Scale talk. This is highly relevant to my job role as Developer Productivity Engineer at Mozilla.

My notes for this talk follow.

"We don't want humans waiting on computers. We want computers waiting on humans." (This is the common theme of the talk.)

In 2005, Facebook was on Subversion. In 2007, they moved to Git. They deployed a bridge so people worked in Git with a distributed workflow but pushed to Subversion under the hood.

New platforms over time. Server code, iOS, Android. One Git repo per platform/project -> 3 Git repos. Initially no code sharing, so no problem. Over time, code sharing between all repos. Lots of code copying and confusion as to what is where and who owns what.

Facebook is mere weeks away from completing their migration to consolidate the big three repos to a Mercurial monorepo. (See also my post about monorepos.)

Reasons:

  1. Easier code sharing.
  2. Easier large-scale changes. Rewrite the universe at once.
  3. Unified set of tooling.

Facebook employees run >1M source control commands per day. >100k commits per week. VCS tool needs to be fast to prevent distractions and context switching, which slow people down.

Facebook implemented sparse checkout and shallow history in Mercurial. Necessary to scale distributed version control to large repos.

Quote from Google: "We're excited about the work Facebook is doing with Mercurial and glad to be collaborating with Facebook on Mercurial development." (Well, I guess the cat is finally out of the bag: Google is working on Mercurial. This was kind of an open secret for months. But I guess now it is official.)

Push-pull-rebase bottleneck: if you rebase and push and someone beats you to it, you have to pull, rebase, and try again. This gets worse as commit rate increases and people do needless legwork. Facebook has moved to server-side rebasing on push to mostly eliminate this pain point. (This is part of a still-experimental feature in Mercurial, which should hopefully lose its experimental flag soon.)
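
To make the race concrete, here is a minimal sketch (in Python, shelling out to stock hg commands) of the client-side retry loop the speakers describe; it assumes the rebase extension is enabled. With server-side rebasing on push, this loop effectively collapses to a single push and the server does the retrying.

```python
import subprocess

def push_with_retry(max_attempts=5):
    """Client-side workflow the talk calls a bottleneck: rebase onto the
    latest upstream head and push; if someone else landed first, the push
    is rejected and you pull, rebase, and try again."""
    for _ in range(max_attempts):
        # Requires the rebase extension: fetch new upstream commits and
        # rebase local work on top of them.
        subprocess.run(["hg", "pull", "--rebase"], check=True)
        if subprocess.run(["hg", "push"]).returncode == 0:
            return True  # we won the race this time
    return False  # still losing the race after several attempts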

Starting at 13:00, we have a speaker change and move away from version control.

IDEs don't scale to Facebook scale. "Developing in Xcode at Facebook is an exercise in frustration." On average 3.5 minutes to open Facebook for iOS in Xcode. 5 minutes on average to index. Pegs the CPU and makes the machine not very responsive. 50 Xcode crashes per day across all Facebook iOS developers.

Facebook measures everything about tools. Mercurial operation times. Xcode times. Build times. Data tells them what tools and workflows need to be worked on.
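
As an illustration only (not Facebook's actual telemetry), a thin wrapper like the following is the kind of thing that captures per-invocation tool timings; the log file name and record fields are assumptions.

```python
import json
import subprocess
import sys
import time

def timed_run(argv, log_path="tool-metrics.jsonl"):
    """Run a developer tool and append how long it took to a local log.
    A real system would ship this record to a metrics pipeline instead."""
    start = time.monotonic()
    result = subprocess.run(argv)
    record = {
        "argv": argv,
        "seconds": round(time.monotonic() - start, 3),
        "exit_code": result.returncode,
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    return result.returncode

if __name__ == "__main__":
    # e.g. `python timed_run.py hg status` to time a Mercurial command
    sys.exit(timed_run(sys.argv[1:]))
```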

Facebook believes IDEs are worth the pain because they make people more productive.

Facebook wants to support all editors and IDEs since people want to use whatever is most comfortable.

React Native changed things. It supported developing on multiple platforms, which no single IDE supports. People launched several editors and tools to do React Native development. People needed 4 windows to do development. That experience was "not acceptable." So they built their own IDE: a set of plugins on top of Atom. Not a fork. They like the hackable and web-y nature of Atom.

The demo showing iOS development looks very nice! Doing Objective-C, JavaScript, simulator integration, and version control in one window!

It can connect to remote servers and transparently save and deploy changes. It can also get real-time compilation errors and hints from the remote server! (Demo was with Hack. Not sure if other languages are supported. Having beefy central servers for e.g. Gecko development would be a fun experiment.)

Starting at 32:00, the presentation shifts to continuous integration.

Number one goal of CI at Facebook is developer efficiency. We don't want developers waiting on computers to build and test diffs.

3 goals for CI:

  1. High-signal feedback. Don't want developers chasing failures that aren't their fault. Wastes time.
  2. Must provide rapid feedback. Developers don't want to wait.
  3. Provide frequent feedback. Developers should know as soon as possible after they did something. (I think this refers to local feedback.)

Sandcastle is their CI system.

Diff lifecycle discussion.

Basic tests and lint run locally. (My understanding from talking with Facebookers is "local" often means on a Facebook server, not a local laptop. The machines at developers' fingertips are often dumb terminals.)

They appear to use code coverage to determine what tests to run. "We're not going to run a test unless your diff might actually have broken it."

They run flaky tests less often.

They run slow tests less often.
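
A rough sketch of how those three policies could combine: use a coverage map to pick only tests that the diff might have broken, then run tests tagged flaky or slow on only a fraction of diffs. The data structures and sample rates here are illustrative assumptions, not Facebook's actual scheduler.

```python
import random

def select_tests(changed_files, coverage_map, flaky_tests, slow_tests,
                 flaky_sample_rate=0.1, slow_sample_rate=0.25):
    """coverage_map: test name -> set of source files it exercises."""
    selected = []
    for test, covered_files in coverage_map.items():
        if not covered_files & set(changed_files):
            continue  # don't run a test the diff can't have broken
        if test in flaky_tests and random.random() > flaky_sample_rate:
            continue  # flaky tests run less often
        if test in slow_tests and random.random() > slow_sample_rate:
            continue  # slow tests run less often
        selected.append(test)
    return selected
```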

Goal is to get feedback to developers in under 10 minutes.

If they run fewer tests and get back to developers quicker, things are less likely to break than if they run more tests but take longer to give feedback.

They also want feedback quickly so reviewers can see results at review time.

They use Web Driver heavily. Love cross-platform nature of Web Driver.
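
For reference, WebDriver is the cross-browser automation protocol; a minimal example with the Python Selenium bindings looks like this (the browser choice, URL, and element are placeholders, not anything from the talk).

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Any WebDriver-capable browser works; Firefox is just an example here.
driver = webdriver.Firefox()
try:
    driver.get("https://example.com/")  # placeholder URL
    heading = driver.find_element(By.TAG_NAME, "h1")
    assert heading.text  # trivial check standing in for a real UI test
finally:
    driver.quit()
```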

In addition to test results, performance and size metrics are reported.

They have a "Ship It" button on the diff.

Landcastle handles landing the diff.

"It is not OK at Facebook to land a diff without using Landcastle." (Read: developers don't push directly to the master repo.)

Once Landcastle lands something, it runs tests again. If an issue is found, a task is filed. A task can be "push blocking." Code won't ship to users until the "push blocking" issue is resolved. (Tweets confirm they do backouts "fairly aggressively." A valid resolution to a push blocking task is to back out. But fixing forward is fine as well.)

After a while, a branch cut occurs. Some cherry-picks go onto release branches.

In addition to diff-based testing, they do continuous testing runs. Much more comprehensive. No time restrictions. Continuous runs on master and release candidate branches. Auto bisect to pin down regressions.
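
Mercurial can drive that kind of automatic bisection itself; here is a sketch using `hg bisect --command` (the revision IDs and test command are placeholders).

```python
import subprocess

def auto_bisect(good_rev, bad_rev, test_cmd):
    """Let hg binary-search for the commit that introduced a regression.
    hg repeatedly updates to a midpoint revision and runs test_cmd, using
    its exit code to mark each revision good or bad."""
    subprocess.run(["hg", "bisect", "--reset"], check=True)
    subprocess.run(["hg", "bisect", "--good", good_rev], check=True)
    subprocess.run(["hg", "bisect", "--bad", bad_rev], check=True)
    subprocess.run(["hg", "bisect", "--command", test_cmd], check=True)

# e.g. auto_bisect("abc123", "tip", "python run_tests.py")  # placeholder values
```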

Sandcastle processes >1000 test results per second. 5 years of machine work per day. Thousands of machines in 5 data centers.

They started with Buildbot. Single master. Hit the scaling limits of a single-threaded, single master: the master could not push work to workers fast enough. Sandcastle uses a distributed queue instead. Workers just pull jobs from the distributed queue.
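
The pull model is the key difference: instead of one master pushing work, every worker asks for its next job when it is free. A toy in-process version, using Python's queue module as a stand-in for a real distributed queue (job names and worker count are made up):

```python
import queue
import threading
import time

job_queue = queue.Queue()  # stand-in for a real distributed queue

def run_job(job):
    # Placeholder for checking out a diff, building, and running tests.
    time.sleep(0.1)
    print("finished", job)

def worker():
    # Pull model: a worker takes the next job whenever it is idle, so no
    # single master has to push work fast enough for every machine.
    while True:
        job = job_queue.get()
        if job is None:  # shutdown sentinel
            break
        run_job(job)
        job_queue.task_done()

if __name__ == "__main__":
    workers = [threading.Thread(target=worker) for _ in range(4)]
    for w in workers:
        w.start()
    for name in ["build", "unit-tests", "lint", "integration"]:
        job_queue.put(name)
    for _ in workers:
        job_queue.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
```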

"High-signal feedback is critical." "Flaky failures erode developer confidence." "We need developers to trust Sandcastle."

Extremely careful separating infra failures from other failures. Developers don't see infra failures. Infra failures only reported to Sandcastle team.

Bots look for flaky tests. Stress test individual tests. Run tests in parallel with themselves. Goal: developers don't see flaky tests.
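
A minimal sketch of that kind of stress testing (the command, run count, and parallelism are illustrative): run a single test many times, in parallel with itself, and flag it as flaky if the results are inconsistent.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def looks_flaky(test_cmd, runs=50, parallelism=8):
    """Run one test repeatedly, several copies at a time; if it sometimes
    passes and sometimes fails, it is flaky and should be quarantined."""
    def run_once(_):
        return subprocess.run(test_cmd, shell=True).returncode == 0
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        passes = sum(pool.map(run_once, range(runs)))
    return 0 < passes < runs  # some passes and some failures => flaky

# e.g. looks_flaky("python -m pytest tests/test_timeline.py")  # placeholder test
```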

There is a "not my fault" button that developers can use to report bad signals.

"Whatever the scale of your engineering organization, developer efficiency is the key thing that your infrastructure teams should be striving for. This is why at Facebook we have some of our top engineers working on developer infrastructure." (Preach it.)

Excellent talk. Mozillians doing infra work or who are in charge of head count for infra work should watch this video.

Update 2015-03-28 21:35 UTC - Clarified some bits in response to new info Tweeted at me. Added link to my monorepos blog post.