Any time Facebook talks about technical matters, I tend to listen. They have a track record of demonstrating engineering leadership in several spaces. And, unlike many companies that just talk, Facebook often gives others access to those ideas via source code and healthy open source projects. It's rare to see a company operating on the frontier of the computing field provide so much insight into their inner workings. You can gain so much by riding their coattails and following their lead instead of clinging to and cargo culting from the past.
The Facebook F8 developer conference was this past week. All the talks are now available online. I encourage you to glance through the list of talks and watch whatever is relevant to you. There really is a little bit for everyone.
Of particular interest to me is the Big Code: Developer Infrastructure at Facebook's Scale talk. This is highly relevant to my job role as Developer Productivity Engineer at Mozilla.
My notes for this talk follow.
"We don't want humans waiting on computers. We want computers waiting on humans." (This is the common theme of the talk.)
In 2005, Facebook was on Subversion. In 2007, they moved to Git and deployed a bridge so people worked in Git with a distributed workflow but pushed to Subversion under the hood.
New platforms over time. Server code, iOS, Android. One Git repo per platform/project -> 3 Git repos. Initially no code sharing, so no problem. Over time, code sharing between all repos. Lots of code copying and confusion as to what is where and who owns what.
Facebook is mere weeks away from completing their migration to consolidate the big three repos to a Mercurial monorepo. (See also my post about monorepos.)
- Easier code sharing.
- Easier large-scale changes. Rewrite the universe at once.
- Unified set of tooling.
Facebook employees run >1M source control commands per day. >100k commits per week. VCS tool needs to be fast to prevent distractions and context switching, which slow people down.
Facebook implemented sparse checkout and shallow history in Mercurial. Necessary to scale distributed version control to large repos.
Quote from Google: "We're excited about the work Facebook is doing with Mercurial and glad to be collaborating with Facebook on Mercurial development." (Well, I guess the cat is finally out of the bag: Google is working on Mercurial. This was kind of an open secret for months. But I guess now it is official.)
Push-pull-rebase bottleneck: if you rebase and push and someone beats you to it, you have to pull, rebase, and try again. This gets worse as commit rate increases and people do needless legwork. Facebook has moved to server-side rebasing on push to mostly eliminate this pain point. (This is part of a still-experimental feature in Mercurial, which should hopefully lose its experimental flag soon.)
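For illustration, here is a minimal Python sketch of the client-side loop that server-side rebasing makes unnecessary (the hg commands are real, though hg rebase requires the bundled rebase extension to be enabled; the retry policy itself is hypothetical):

    import subprocess
    import time

    def push_with_retries(max_attempts=5):
        for attempt in range(max_attempts):
            # Pull down whatever landed since we last looked.
            subprocess.run(["hg", "pull"], check=True)
            # Rebase our draft work onto the new tip. Exits non-zero when
            # there is nothing to rebase, so no check=True here.
            subprocess.run(["hg", "rebase", "-d", "tip"])
            if subprocess.run(["hg", "push"]).returncode == 0:
                return  # We won the race.
            time.sleep(1)  # Someone beat us to the push; go around again.
        raise RuntimeError("lost the push race %d times" % max_attempts)

With server-side rebasing, the server performs the equivalent of the rebase step itself, so the client pushes once and is done.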
Starting at 13:00, we have a speaker change and a move away from version control.
IDEs don't scale to Facebook scale. "Developing in Xcode at Facebook is an exercise in frustration." It takes 3.5 minutes on average to open Facebook for iOS in Xcode and 5 minutes on average to index it. Indexing pegs the CPU and makes the machine not very responsive. There are 50 Xcode crashes per day across all Facebook iOS developers.
Facebook measures everything about tools. Mercurial operation times. Xcode times. Build times. Data tells them what tools and workflows need to be worked on.
Facebook believes IDEs are worth the pain because they make people more productive.
Facebook wants to support all editors and IDEs since people want to use whatever is most comfortable.
React Native changed things. It supported developing on multiple platforms, which no single IDE supports. People launched several editors and tools to do React Native development and needed 4 windows to get work done. That experience was "not acceptable." So they built their own IDE: a set of plugins on top of Atom. Not a fork. They like the hackable and web-y nature of Atom.
It can connect to remote servers and transparently save and deploy changes. It can also get real-time compilation errors and hints from the remote server! (The demo was with Hack. Not sure if other languages are supported. Having beefy central servers for e.g. Gecko development would be a fun experiment.)
Starting at 32:00, the presentation shifts to continuous integration.
Number one goal of CI at Facebook is developer efficiency. We don't want developers waiting on computers to build and test diffs.
3 goals for CI:
- High-signal feedback. Don't want developers chasing failures that aren't their fault. Wastes time.
- Must provide rapid feedback. Developers don't want to wait.
- Provide frequent feedback. Developers should know as soon as possible after they did something. (I think this refers to local feedback.)
Sandcastle is their CI system.
Diff lifecycle discussion.
Basic tests and lint run locally. (My understanding from talking with Facebookers is that "local" often means on a Facebook server, not a local laptop. The machines at developers' fingertips are often dumb terminals.)
They appear to use code coverage to determine what tests to run. "We're not going to run a test unless your diff might actually have broken it."
They run flaky tests less often.
They run slow tests less often.
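Putting those three ideas together, here is a hypothetical selection routine (the coverage map, flakiness rates, and thresholds are illustrative assumptions, not Facebook's actual system):

    def select_tests(changed_files, coverage_map, flaky_rate, slow_seconds,
                     run_number):
        # coverage_map: test name -> set of source files the test executes
        # flaky_rate:   test name -> observed failure rate on unchanged code
        # slow_seconds: test name -> typical runtime in seconds
        # run_number:   increasing counter used to deprioritize tests
        selected = []
        for test, covered_files in coverage_map.items():
            # Don't run a test unless the diff might actually have broken it.
            if not covered_files.intersection(changed_files):
                continue
            # Run flaky tests on only every 10th run, slow ones every 5th.
            if flaky_rate.get(test, 0) > 0.01 and run_number % 10 != 0:
                continue
            if slow_seconds.get(test, 0) > 300 and run_number % 5 != 0:
                continue
            selected.append(test)
        return selected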
Goal is to get feedback to developers in under 10 minutes.
If they run fewer tests and get back to developers quicker, things are less likely to break than if they run more tests but take longer to give feedback.
They also want feedback quickly so reviewers can see results at review time.
They use WebDriver heavily. They love the cross-platform nature of WebDriver.
In addition to test results, performance and size metrics are reported.
They have a "Ship It" button on the diff.
Landcastle handles landing diff.
"It is not OK at Facebook to land a diff without using Landcastle." (Read: developers don't push directly to the master repo.)
Once Landcastle lands something, it runs tests again. If an issue is found, a task is filed. A task can be "push blocking." Code won't ship to users until the "push blocking" issue is resolved. (Tweets confirm they do backouts "fairly aggressively." A valid resolution to a push blocking task is to back out. But fixing forward is fine as well.)
After a while, a branch cut occurs, with some cherry picks onto release branches.
In addition to diff-based testing, they do continuous testing runs. Much more comprehensive. No time restrictions. Continuous runs on master and release candidate branches. Auto bisect to pin down regressions.
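Auto bisection over a linear commit range is a classic binary search. A minimal sketch, assuming an is_good callback that builds and tests a given commit:

    def bisect_regression(commits, is_good):
        # commits is ordered oldest to newest; commits[0] is known good
        # and commits[-1] is known bad.
        good, bad = 0, len(commits) - 1
        while bad - good > 1:
            mid = (good + bad) // 2
            if is_good(commits[mid]):
                good = mid
            else:
                bad = mid
        return commits[bad]  # the first bad commit

Each iteration halves the range, so pinning down a regression among N commits takes about log2(N) build-and-test cycles.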
Sandcastle processes >1000 test results per second. 5 years of machine work per day. Thousands of machines in 5 data centers.
They started with buildbot. Single master. They hit the scaling limits of a single-threaded master: it could not push work to workers fast enough. Sandcastle instead has a distributed queue; workers just pull jobs from it.
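The shape of that design is easy to sketch with Python's standard library standing in for a real distributed queue (an illustration of the pull model, not Sandcastle's implementation):

    import queue
    import threading

    job_queue = queue.Queue()  # stand-in for a real distributed queue

    def worker():
        # Workers pull jobs at their own pace; no master pushes work to them.
        while True:
            job = job_queue.get()
            try:
                job()  # build or test something
            finally:
                job_queue.task_done()

    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()

    job_queue.put(lambda: print("running tests..."))
    job_queue.join()

Because workers pull, adding capacity is just starting more workers; there is no central dispatcher to saturate.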
"High-signal feedback is critical." "Flaky failures erode developer confidence." "We need developers to trust Sandcastle."
Extremely careful separating infra failures from other failures. Developers don't see infra failures. Infra failures only reported to Sandcastle team.
Bots look for flaky tests. Stress test individual tests. Run tests in parallel with themselves. Goal: developers don't see flaky tests.
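A minimal sketch of that kind of stress testing, assuming a test that can be invoked as a subprocess (the command, run count, and parallelism are hypothetical):

    import concurrent.futures
    import subprocess

    def is_flaky(test_command, runs=100, parallelism=8):
        # Run the same test many times, several in parallel with itself.
        def run_once(_):
            return subprocess.run(test_command).returncode == 0

        with concurrent.futures.ThreadPoolExecutor(parallelism) as pool:
            results = list(pool.map(run_once, range(runs)))
        # A mix of passes and failures with no code changes means flaky.
        return 0 < results.count(False) < runs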
There is a "not my fault" button that developers can use to report bad signals.
"Whatever the scale of your engineering organization, developer efficiency is the key thing that your infrastructure teams should be striving for. This is why at Facebook we have some of our top engineers working on developer infrastructure." (Preach it.)
Excellent talk. Mozillians doing infra work or who are in charge of head count for infra work should watch this video.
Update 2015-03-28 21:35 UTC - Clarified some bits in response to new info Tweeted at me. Added link to my monorepos blog post.
It's been a rough week.
The very short summary of events this week is that both the Firefox and Firefox OS release automation have been performing a denial of service attack against hg.mozilla.org.
On the face of it, this is nothing new. The release automation is by far the top consumer of hg.mozilla.org data, requesting several terabytes per day via several million HTTP requests from thousands of machines in multiple data centers. The very nature of their existence makes them a significant denial of service threat.
Lots of things went wrong this week. While a post mortem will shed light on them, many fall under the umbrella of release automation was making more requests than it should have and was doing so in a way that both increased the chances of an outage occurring and increased the chances of a prolonged outage. This resulted in the hg.mozilla.org servers working harder than they ever have. As a result, we have some new high scores to share.
On UTC day March 19, hg.mozilla.org transferred 7.4 TB of data. This is a significant increase from the ~4 TB we expect on a typical weekday. (Even more significant when you consider that most load is generated during peak hours.)
During the 1300 UTC hour of March 17, the cluster received 1,363,628 HTTP requests. No HTTP 503 Service Unavailable errors were encountered in that window! 300,000 to 400,000 requests per hour is typical.
During the 0800 UTC hour of March 19, the cluster transferred 776 GB of repository data. That comes out to roughly 1.72 Gbps on average (I didn't calculate TCP and other overhead). Anything greater than 250 GB per hour is not very common. No HTTP 503 errors were served from the origin servers during this hour!
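For the curious, the conversion from per-hour bytes to average bitrate:

    gb_per_hour = 776              # repository data served in one hour
    gbps = gb_per_hour * 8 / 3600  # 1.724..., before protocol overhead
    print(round(gbps, 3))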
We encountered many periods where hg.mozilla.org was operating at more than twice its normal and expected capacity and was able to handle the load just fine. As a server operator, I'm proud of this. The servers were provisioned beyond what is normally needed of them and it took a truly exceptional event (or two) to bring the service down. This is generally a good way to run hosted services (you rarely want to be barely provisioned because you fall over at the slightest change, and you don't want to be grossly over-provisioned because you are wasting money on idle resources).
Unfortunately, the hg.mozilla.org service did fall over. Multiple times, in fact. There is room to improve. As proud as I am that the service operated well beyond its expected limits, I can't help but feel ashamed that it did eventually cave in under extreme load and that people are probably making under-informed general assumptions like Mercurial can't scale. The simple fact of the matter is that clients cumulatively generated an exceptional amount of traffic to hg.mozilla.org this week. All servers have capacity limits. And this week we encountered the limit for the current configuration of hg.mozilla.org. Cause and effect.
The Firefox source repositories and automation have been closed the past few days due to a couple of outages.
Yesterday, aggregate CPU usage on many of the machines in the hg.mozilla.org cluster hit 100%. Previously, whenever hg.mozilla.org was under high load, we'd run out of network bandwidth before we ran out of CPU on the machines. In other words, Mercurial was generating data faster than the network could accept it.
When this happened, the service started issuing HTTP 503 Service Unavailable responses. This is the universal server signal for I'm down, go away. Unfortunately, not all clients heeded this signal.
Parts of Firefox's release automation retried failing requests immediately, or with insufficient jitter in their backoff interval. Actively retrying requests against a server that's experiencing load issues only makes the problem worse. This effectively prolonged the outage.
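The textbook remedy is exponential backoff with jitter. A minimal sketch (hypothetical, not what our automation actually runs):

    import random
    import time
    import urllib.error
    import urllib.request

    def fetch_with_backoff(url, max_attempts=6, base_delay=1.0, cap=60.0):
        for attempt in range(max_attempts):
            try:
                return urllib.request.urlopen(url).read()
            except urllib.error.HTTPError as e:
                if e.code != 503:
                    raise  # only back off on "server overloaded"
            # Sleep a random amount up to an exponentially growing ceiling,
            # so thousands of clients don't retry in lockstep.
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
        raise RuntimeError("gave up after %d attempts" % max_attempts)

The jitter is the important part: without it, a fleet of machines that failed together retries together, re-creating the very load spike that caused the 503s.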
Today, we had a similar but different network issue. The load balancer fronting hg.mozilla.org can only handle so much bandwidth. Today, we hit that limit. The load balancer started throttling connections. Load on hg.mozilla.org skyrocketed and request latency increased. From the perspective of clients, the service ground to a halt.
hg.mozilla.org was partially sharing a load balancer with ftp.mozilla.org. That meant if one of the services experienced very high load, the other service could effectively be locked out of bandwidth. We saw this happening this morning. ftp.mozilla.org load was high (it looks like downloads of Firefox Developer Edition are a major contributor - these don't go through the CDN for reasons unknown to me) and there wasn't enough bandwidth to go around.
Separately today, hg.mozilla.org again hit 100% CPU. At that time, it also set a new record for network throughput: ~3 Gbps. It normally consumes between 200 and 500 Mbps, with periodic spikes to 750 Mbps. (Yesterday's event saw a spike to around ~2 Gbps.)
Going back through the hg.mozilla.org server logs, an offender is quite obvious. Before March 9, total outbound transfer for the build/tools repo was around 1 tebibyte per day. Starting on March 9, it increased to 3 tebibytes per day! This is quite remarkable, as a clone of this repo is only about 20 MiB. This means the repo was getting cloned about 150,000 times per day! (Note: I think all these numbers may be low by ~20% - stay tuned for the final analysis.)
That 2 TiB/day increase is statistically significant because we transfer less than 10 TiB/day across all of hg.mozilla.org. And 1 TiB/day is close to 100 Mbps, assuming requests are evenly spread out (which of course they aren't).
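Checking the arithmetic:

    TIB = 1024 ** 4
    MIB = 1024 ** 2

    clones_per_day = 3 * TIB / (20 * MIB)  # 157286.4, i.e. ~150,000 clones
    avg_mbps = TIB * 8 / 86400 / 1e6       # ~102 Mbps for 1 TiB/day
    print(clones_per_day, avg_mbps)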
Multiple things went wrong. If only one or two happened, we'd likely be fine. Maybe there would have been a short blip. But not the major event we've been firefighting the last ~24 hours.
This post is only a summary of what went wrong. I'm sure there will be a post-mortem and that it will contain lots of details for those who want to know more.
I'm currently working on annotating moz.build files with metadata that defines things like which bug component and code reviewers map to which files. It's going to enable a lot of awesomeness.
As part of this project, I'm implementing a new moz.build processing mode. Instead of reading moz.build files by traversing DIRS variables from previously-executed moz.build files, we're evaluating moz.build files according to filesystem topology. This has uncovered a few cases where a moz.build file errors because of assumptions that no longer hold. For example, a moz.build file in a directory that is only active on Windows might assume that a check like "are we building for Windows" is always true.
One such problem was with gfx/angle/src/libGLESv2/moz.build. This file contained code similar to the following:
    if CONFIG['IS_WINDOWS']:
        SOURCES += ['foo.cpp']

    ...

    SOURCES['foo.cpp'].flags += ['-DBAR']
This always ran without issue because this moz.build was only included if building for Windows. This assumption is of course invalid when in filesystem traversal mode.
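One way the generated file could be made valid regardless of platform is to keep the per-source flags inside the same conditional that adds the source (a sketch based on the simplified example above, not the actual generated code):

    if CONFIG['IS_WINDOWS']:
        SOURCES += ['foo.cpp']
        # Only touch per-source flags when the source was actually added.
        SOURCES['foo.cpp'].flags += ['-DBAR']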
Anyway, as part of updating this trouble file, I lost maybe an hour of productivity. Here's how.
The top of the trouble moz.build file has a comment:
    # Please note this file is autogenerated from generate_mozbuild.py,
    # so do not modify it directly
OK. So, I need to modify generate_mozbuild.py. First things first: I need to locate it:
    $ hg locate generate_mozbuild.py
    gfx/skia/generate_mozbuild.py
So I load up this file. I see a main(). I run the script in my shell and get an error. Weird. I look around gfx/skia and see a README_MOZILLA file. I open it. README_MOZILLA contains some instructions. They aren't very good. I hop in #gfx on IRC and ask around. They tell me to do a Subversion clone of Skia and to check out the commit referenced in README_MOZILLA. There is no repo URL in README_MOZILLA. I search Google. I find a Git URL. I notice that README_MOZILLA contains a SHA-1 commit, not a Subversion integer revision. I figure the Git repo is what was meant. I clone the Git repo. I attempt to run the generation script referenced by README_MOZILLA. It fails. I ask again in #gfx. They are baffled at first. I dig around the source code. I see a reference in Skia's upstream code to a path that doesn't exist. I tell the #gfx people. They tell me sub-repos are likely involved and to use gclient to clone the repo. I search for the proper Skia source code docs and type the necessary gclient commands. (Fortunately, I've used gclient before, so this wasn't completely alien to me.)
I get the Skia clone in the proper state. I run the generation script and all works. But I don't see it writing the trouble moz.build file I set out to fix. I set some breakpoints. I run the code again. I'm baffled.
Suddenly it hits me: I've been poking around with gfx/skia which is separate from gfx/angle! I look around gfx/angle and see a README.mozilla file. I open it. It reveals the existence of the Git repo https://github.com/mozilla/angle. I open GitHub in my browser. I see a generate_mozbuild.py script.
I now realize there are multiple files named generate_mozbuild.py. Unfortunately, the one I care about - the ANGLE one - is not checked into mozilla-central. So my search for it with hg locate did not reveal its existence. Between trying to get the Skia code cloned and generating moz.build files, I probably lost an hour of work. All because a file with a similar name wasn't checked into mozilla-central!
I assumed that the single generate_mozbuild.py I found under source control was the only file of that name and that it must be the file I was interested in.
Maybe I should have known to look at gfx/angle/README.mozilla first. Maybe I should have known that gfx/angle and gfx/skia are completely independent.
But I didn't. My ignorance cost me.
Had the contents of the separate ANGLE repository been checked into mozilla-central, I would have seen the multiple generate_mozbuild.py files and I would likely have found the correct one immediately. But they weren't and I lost an hour of my time.
And I'm not done. Now I have to figure out how the separate ANGLE repo integrates with mozilla-central. I'll have to figure out how to submit the patch I still need to write. The GitHub description of this repo says Talk to vlad, jgilbert, or kamidphish for more info. So now I have to bother them before I can submit my patch. Maybe I'll just submit a pull request and see what happens.
I'm convinced I wouldn't have encountered this problem if a monolithic repository were used. I would have found the separate generate_mozbuild.py file immediately. And, the change process would likely have been known to me since all the code was in a repository I already knew how to submit patches from.
Separate repos are just lots of pain. You can bet I'll link to this post when people propose splitting up mozilla-central into multiple repositories.
Mozilla has historically done some funky things with the Firefox Mercurial repositories. One of the things we've done is create a bunch of named branches to track the Firefox release process. These are branch names like GECKO20b12_2011022218_RELBRANCH.
Over in bug 927219, we started the process of cleaning up some cruft left over from many of these old branches.
For starters, the old named branches in the Firefox repositories are being actively closed. When you hg commit --close-branch, Mercurial creates a special commit that says this branch is closed. Branches that are closed are automatically hidden from the output of hg branches and hg heads. As a result, the output of these commands is now much more usable.
Closed branches still constitute heads on the DAG. And several heads lead to degraded performance in some situations (notably push and pull times - the same thing happens in Git). I'd like to eventually merge these old heads so that repositories only have 1 or a small number of DAG heads. However, extra care must be taken before that step. Stay tuned.
Anyway, for the average person reading, you probably won't be impacted by these changes at all. The greatest impact will be felt by the person who lands the first change on top of any repository whose last commit was a branch close. If you commit on top of the tip commit, you'll be committing on top of a previously closed branch! You'll instead want to hg up default after you pull to ensure you are on the proper DAG head. And even then, if you have local commits, you may not be based on top of the appropriate commit! A simple run of hg log --graph should help you decipher the state of the world. (Please note that the usability problems around discovering the appropriate head to land on are a result of our poor branching strategy for the Firefox repositories. We probably should have named branches tracking the active Gecko releases. But that ship sailed years ago and fixing it is pretty far down the priority list. Wallpapering over things with the firefoxtree extension is my recommended solution until matters are fixed.)