My Current Thoughts on System Administration

April 17, 2015 at 03:35 PM | categories: sysadmin, Mozilla

I attended PyCon last week. It's a great conference. You should attend. While I should write up a detailed trip report, I wanted to quickly share one of my takeaways.

Ansible was talked about a lot at PyCon. Sitting through a few presentations and talking with others helped me articulate why I've been drawn to Ansible lately (over, say, Puppet, Chef, Salt, etc.).

First, Ansible doesn't require a central server. Administration is done remotely: Ansible establishes an SSH connection to a remote machine and does stuff. Having Ruby, Python, support libraries, etc. installed on production systems just for system administration never really jibed with me. I love Ansible's default hands-off approach. (Yes, you can use a central server for Ansible, but that's not the default behavior. While tools like Puppet could be used without a central server, it felt like they were optimized for central server use and thus local mode felt awkward.)

Related to central servers, I never liked how that model consists of clients periodically polling for and applying updates. I like the idea of immutable server images, and periodic updates work against that goal. The central model also has a major bazooka pointed at you: at any time, you are only one mistake away from completely hosing every machine doing continuous polling. For example, if you accidentally update firewall configs and lock out the central server and SSH connectivity, every machine will pick up these changes during periodic polling, and by the time anyone realizes what's happened, your machines are all effectively bricked. (Yes, I've seen this happen.) I like having humans control exactly when my systems apply changes, thank you. I concede periodic updates and central control have some benefits.

Choosing not to use a central server by default means that hosts are modeled as a set of applied Ansible playbooks rather than as a host with a set of Ansible playbooks attached (although Ansible does support both models). I can easily apply a playbook to a host in a one-off manner. This means I can have playbooks represent common, one-off tasks and can easily run those tasks without having to muck around with the host-to-playbook configuration. More on this later.
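
To illustrate what I mean by a one-off task, here's a hypothetical sketch (the playbook name, host group, and task are invented for the example; they aren't from any real repository):

---
# rotate-logs.yml: a hypothetical one-off maintenance task.
# With no fixed host-to-playbook mapping, I can point it at whatever
# host needs it at invocation time, e.g.:
#   ansible-playbook -i production rotate-logs.yml --limit web01
- hosts: all
  tasks:
    - name: force a log rotation
      command: logrotate -f /etc/logrotate.conf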

I love the simplicity of Ansible's configuration. It is just YAML files. Not some Ruby-inspired DSL that takes hours to learn. With Ansible, I'm learning what modules are available and how they work, not complicated syntax. Yes, there is complexity in Ansible's configuration. But at least I'm not trying to figure out the file syntax as part of learning it.
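
For a sense of what that looks like, here is a minimal sketch of a playbook. The names are hypothetical, but user, template, and service are stock Ansible modules:

---
# Create a service account, lay down its config, and restart the
# service when the config changes.
- hosts: webservers
  tasks:
    - name: create application user
      user: name=myapp state=present
    - name: deploy application config
      template: src=myapp.conf.j2 dest=/etc/myapp.conf
      notify: restart myapp
  handlers:
    - name: restart myapp
      service: name=myapp state=restarted

That's the entire format: YAML lists of plays, tasks, and handlers. There's no custom syntax to internalize before you can read it.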

In that vein, I appreciate the readability of Ansible playbooks. They are simple, linear lists of tasks. Conceptually, I love the promise of full dependency graphs and concurrent execution. But I've spent enough hours debugging race conditions and cyclic dependencies in Puppet that I'm left unconvinced the complexity and power are worth it. I do wish Ansible could run faster by running things concurrently. But I think they made the right decision by following KISS.

I enjoy how Ansible playbooks are effectively high-level scripts. If I have a shell script or block of code, I can usually port it to Ansible pretty easily. One pass to do the conversion 1:1. Another pass to Ansibilize it. Simple.
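
As a rough sketch of what those two passes look like, here's a hypothetical conversion (the package and host group are made up; shell, apt, and service are real Ansible modules):

---
# Pass 1: a 1:1 port of the original shell commands via the shell
# module, so behavior matches the script exactly.
- hosts: webservers
  tasks:
    - name: install and restart nginx (straight port of the script)
      shell: apt-get install -y nginx && service nginx restart

# Pass 2: the Ansibilized version of the same steps, using proper
# modules so the play is idempotent and only restarts on change.
- hosts: webservers
  tasks:
    - name: ensure nginx is installed
      apt: name=nginx state=present
      notify: restart nginx
  handlers:
    - name: restart nginx
      service: name=nginx state=restarted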

I love how Ansible playbooks can be checked into source control and live next to the code and applications they manage. I frequently see people maintain configuration management code in a source control repository separate from the code it manages. This has always bothered me. When I write a service, I want the code for deploying and managing that service to live next to it in version control. That way, I get the configuration management and the code versioned on the same timeline. If I check out a release from 2 years ago, I should still be able to use its exact configuration management code. This becomes difficult or impossible when your organization maintains configuration management code in a separate repository and a central server is required to do deployments (see Puppet).

Before PyCon, I was having an internal monologue about adopting the policy that all changes to remote servers be implemented with Ansible playbooks. I'm pleased to report that a fellow contributor to the Mercurial project has adopted this workflow himself and has only great things to say! So, starting today, I'm going to try to enforce that every change I make to a remote server is performed via Ansible and that the Ansible playbooks are checked into version control. The Ansible playbooks will become implicit documentation of every process involved in maintaining a server.

I've already applied this principle to deploying MozReview. Before, there was some internal Mozilla wiki documenting commands to execute in a terminal to deploy MozReview. I have replaced that documentation with a one-liner that invokes Ansible. And, the Ansible files are now in a public repository.

If you poke around that repository, you'll see that I have Ansible playbooks referencing Docker. I have Ansible provisioning Docker images used by the test and development environments. That same Ansible code is used to configure our production systems (or is at least in the process of being used that way). Having dev, test, and prod use the same configuration management has been a pipe dream of mine, and I've finally achieved it! I attempted this before with Puppet but was unable to make it work just right. The flexibility enabled by Ansible's design decisions is what finally made this possible.
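
I won't reproduce the actual playbooks here, but the structural idea is simple: put the service's configuration in shared roles and apply the same roles to every environment. A hypothetical sketch (the group and role names are invented for illustration):

---
# The same role configures the service everywhere; only the target
# differs. "dockerhosts" are containers being baked into dev/test
# images; "production" are the real servers.
- hosts: dockerhosts:production
  roles:
    - webapp    # hypothetical role holding the service's configuration

Because every environment runs the same role, configuration drift between dev, test, and prod becomes much harder to introduce.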

Ansible is my go-to system management tool right now. And I still feel like I have a lot to learn about its hidden powers.

If you are still using Puppet, Chef, or other tools invented in previous generations, I urge you to check out Ansible. I think you'll be pleasantly surprised.


Notes from Facebook's Developer Infrastructure at Scale F8 Talk

March 28, 2015 at 11:45 AM | categories: Mozilla

Any time Facebook talks about technical matters, I tend to listen. They have a track record of demonstrating engineering leadership in several spaces. And, unlike many companies that just talk, Facebook often gives others access to those ideas via source code and healthy open source projects. It's rare to see a company operating on the frontier of the computing field provide so much insight into their inner workings. You can gain so much by riding their coattails and following their lead instead of clinging to and cargo culting from the past.

The Facebook F8 developer conference was this past week. All the talks are now available online. I encourage you to glance through the list of talks and watch whatever is relevant to you. There's a little something for everyone.

Of particular interest to me is the Big Code: Developer Infrastructure at Facebook's Scale talk. This is highly relevant to my job role as Developer Productivity Engineer at Mozilla.

My notes for this talk follow.

"We don't want humans waiting on computers. We want computers waiting on humans." (This is the common theme of the talk.)

In 2005, Facebook was on Subversion. In 2007, they moved to Git, deploying a bridge so people worked in Git with a distributed workflow but pushed to Subversion under the hood.

New platforms over time. Server code, iOS, Android. One Git repo per platform/project -> 3 Git repos. Initially no code sharing, so no problem. Over time, code sharing between all repos. Lots of code copying and confusion as to what is where and who owns what.

Facebook is mere weeks away from completing their migration to consolidate the big three repos to a Mercurial monorepo. (See also my post about monorepos.)

Reasons:

  1. Easier code sharing.
  2. Easier large-scale changes. Rewrite the universe at once.
  3. Unified set of tooling.

Facebook employees run >1M source control commands per day. >100k commits per week. VCS tool needs to be fast to prevent distractions and context switching, which slow people down.

Facebook implemented sparse checkout and shallow history in Mercurial. Necessary to scale distributed version control to large repos.

Quote from Google: "We're excited about the work Facebook is doing with Mercurial and glad to be collaborating with Facebook on Mercurial development." (Well, I guess the cat is finally out of the bag: Google is working on Mercurial. This was kind of an open secret for months. But I guess now it is official.)

Push-pull-rebase bottleneck: if you rebase and push and someone beats you to it, you have to pull, rebase, and try again. This gets worse as commit rate increases and people do needless legwork. Facebook has moved to server-side rebasing on push to mostly eliminate this pain point. (This is part of a still-experimental feature in Mercurial, which should hopefully lose its experimental flag soon.)

Starting at 13:00, there is a speaker change and a move away from version control.

IDEs don't scale to Facebook scale. "Developing in Xcode at Facebook is an exercise in frustration." On average, 3.5 minutes to open Facebook for iOS in Xcode. 5 minutes on average to index. Pegs the CPU and makes the machine not very responsive. 50 Xcode crashes per day across all Facebook iOS developers.

Facebook measures everything about tools. Mercurial operation times. Xcode times. Build times. Data tells them what tools and workflows need to be worked on.

Facebook believes IDEs are worth the pain because they make people more productive.

Facebook wants to support all editors and IDEs since people want to use whatever is most comfortable.

React Native changed things. Supported developing on multiple platforms, which no single IDE supports. People launched several editors and tools to do React Native development. People needed 4 windows to do development. That experience was "not acceptable." So they built their own IDE. Set of plugins on top of Atom. Not a fork. They like the hackable and web-y nature of Atom.

The demo showing iOS development looks very nice! Doing Objective-C, JavaScript, simulator integration, and version control in one window!

It can connect to remote servers and transparently save and deploy changes. It can also get real-time compilation errors and hints from the remote server! (Demo was with Hack. Not sure if other langs are supported. Having beefy central servers for e.g. Gecko development would be a fun experiment.)

Starting at 32:00 presentation shifts to continuous integration.

Number one goal of CI at Facebook is developer efficiency. We don't want developers waiting on computers to build and test diffs.

3 goals for CI:

  1. High-signal feedback. Don't want developers chasing failures that aren't their fault. Wastes time.
  2. Must provide rapid feedback. Developers don't want to wait.
  3. Provide frequent feedback. Developers should know as soon as possible after they did something. (I think this refers to local feedback.)

Sandcastle is their CI system.

Diff lifecycle discussion.

Basic tests and lint run locally. (My understanding from talking with Facebookers is "local" often means on a Facebook server, not a local laptop. Machines at developers' fingertips are often dumb terminals.)

They appear to use code coverage to determine what tests to run. "We're not going to run a test unless your diff might actually have broken it."

They run flaky tests less often.

They run slow tests less often.

Goal is to get feedback to developers in under 10 minutes.

If they run fewer tests and get back to developers quicker, things are less likely to break than if they run more tests but take longer to give feedback.

They also want feedback quickly so reviewers can see results at review time.

They use Web Driver heavily. Love cross-platform nature of Web Driver.

In addition to test results, performance and size metrics are reported.

They have a "Ship It" button on the diff.

Landcastle handles landing the diff.

"It is not OK at Facebook to land a diff without using Landcastle." (Read: developers don't push directly to the master repo.)

Once Landcastle lands something, it runs tests again. If an issue is found, a task is filed. A task can be "push blocking." Code won't ship to users until the "push blocking" issue is resolved. (Tweets confirm they do backouts "fairly aggressively." A valid resolution to a push blocking task is to back out. But fixing forward is fine as well.)

After a while, branch cut occurs. Some cherry picks onto release branches.

In addition to diff-based testing, they do continuous testing runs. Much more comprehensive. No time restrictions. Continuous runs on master and release candidate branches. Auto bisect to pin down regressions.

Sandcastle processes >1000 test results per second. 5 years of machine work per day. Thousands of machines in 5 data centers.

They started with Buildbot. Single master. Hit the scaling limits of a single-threaded master. The master could not push work to workers fast enough. Sandcastle has a distributed queue. Workers just pull jobs from the distributed queue.

"High-signal feedback is critical." "Flaky failures erode developer confidence." "We need developers to trust Sandcastle."

Extremely careful separating infra failures from other failures. Developers don't see infra failures. Infra failures only reported to Sandcastle team.

Bots look for flaky tests. Stress test individual tests. Run tests in parallel with themselves. Goal: developers don't see flaky tests.

There is a "not my fault" button that developers can use to report bad signals.

"Whatever the scale of your engineering organization, developer efficiency is the key thing that your infrastructure teams should be striving for. This is why at Facebook we have some of our top engineers working on developer infrastructure." (Preach it.)

Excellent talk. Mozillians doing infra work or who are in charge of head count for infra work should watch this video.

Update 2015-03-28 21:35 UTC - Clarified some bits in response to new info Tweeted at me. Added link to my monorepos blog post.


New High Scores for hg.mozilla.org

March 19, 2015 at 08:20 PM | categories: Mercurial, Mozilla

It's been a rough week.

The very short summary of events this week is that both the Firefox and Firefox OS release automation have been performing a denial of service attack against hg.mozilla.org.

On the face of it, this is nothing new. The release automation is by far the top consumer of hg.mozilla.org data, requesting several terabytes per day via several million HTTP requests from thousands of machines in multiple data centers. The very nature of their existence makes them a significant denial of service threat.

Lots of things went wrong this week. While a post mortem will shed light on them, many fall under the umbrella of the release automation making more requests than it should have, and doing so in a way that both increased the chances of an outage occurring and increased the chances of a prolonged outage. This resulted in the hg.mozilla.org servers working harder than they ever have. As a result, we have some new high scores to share.

  • On UTC day March 19, hg.mozilla.org transferred 7.4 TB of data. This is a significant increase from the ~4 TB we expect on a typical weekday. (Even more significant when you consider that most load is generated during peak hours.)

  • During the 1300 UTC hour of March 17, the cluster received 1,363,628 HTTP requests. No HTTP 503 Service Not Available errors were encountered in that window! 300,000 to 400,000 requests per hour is typical.

  • During the 0800 UTC hour of March 19, the cluster transferred 776 GB of repository data. That comes out to at least 1.725 Gbps on average (I didn't calculate TCP and other overhead). Anything greater than 250 GB per hour is not very common. No HTTP 503 errors were served from the origin servers during this hour!

We encountered many periods where hg.mozilla.org was operating at more than twice its normal and expected operating capacity and was able to handle the load just fine. As a server operator, I'm proud of this. The servers were provisioned beyond what is normally needed of them, and it took a truly exceptional event (or two) to bring the service down. This is generally a good way to run hosted services (you rarely want to be barely provisioned, because you fall over at the slightest change, and you don't want to be grossly over-provisioned, because you are wasting money on idle resources).

Unfortunately, the hg.mozilla.org service did fall over. Multiple times, in fact. There is room to improve. As proud as I am that the service operated well beyond its expected limits, I can't help but feel ashamed that it did eventually cave in under this extreme load, and that people are probably making under-informed general assumptions like "Mercurial can't scale." The simple fact of the matter is that clients cumulatively generated an exceptional amount of traffic to hg.mozilla.org this week. All servers have capacity limits. And this week we encountered the limit for the current configuration of hg.mozilla.org. Cause and effect.


Network Events

March 18, 2015 at 02:25 PM | categories: Mozilla

The Firefox source repositories and automation have been closed the past few days due to a couple of outages.

Yesterday, aggregate CPU usage on many of the machines in the hg.mozilla.org cluster hit 100%. Previously, whenever hg.mozilla.org was under high load, we'd run out of network bandwidth before we ran out of CPU on the machines. In other words, Mercurial was generating data faster than the network could accept it.

When this happened, the service started issuing HTTP 503 Service Not Available responses. This is the universal server signal for "I'm down, go away." Unfortunately, not all clients took the hint.

Parts of Firefox's release automation retried failing requests immediately, or with insufficient jitter in their backoff interval. Actively retrying requests against a server that's experiencing load issues only makes the problem worse. This effectively prolonged the outage.

Today, we had a similar but different network issue. The load balancer fronting hg.mozilla.org can only handle so much bandwidth. Today, we hit that limit. The load balancer started throttling connections. Load on hg.mozilla.org skyrocketed and request latency increased. From the perspective of clients, the service ground to a halt.

hg.mozilla.org was partially sharing a load balancer with ftp.mozilla.org. That meant if one of the services experienced very high load, the other service could effectively be locked out of bandwidth. We saw this happening this morning. ftp.mozilla.org load was high (it looks like downloads of Firefox Developer Edition are a major contributor - these don't go through the CDN for reasons unknown to me) and there wasn't enough bandwidth to go around.

Separately today, hg.mozilla.org again hit 100% CPU. At that time, it also set a new record for network throughput: ~3 Gbps. It normally consumes between 200 and 500 Mbps, with periodic spikes to 750 Mbps. (Yesterday's event saw a spike to around ~2 Gbps.)

Going back through the hg.mozilla.org server logs, an offender is quite obvious. Before March 9, total outbound transfer for the build/tools repo was around 1 tebibyte per day. Starting on March 9, it increased to 3 tebibytes per day! This is quite remarkable, as a clone of this repo is only about 20 MiB. This means the repo was getting cloned about 150,000 times per day! (Note: I think all these numbers may be low by ~20% - stay tuned for the final analysis.)

An extra 2 TiB/day is significant because we transfer less than 10 TiB/day across all of hg.mozilla.org. And 1 TiB/day is close to 100 Mbps, assuming requests are evenly spread out (which of course they aren't).

Multiple things went wrong. If only one or two happened, we'd likely be fine. Maybe there would have been a short blip. But not the major event we've been firefighting the last ~24 hours.

This post is only a summary of what went wrong. I'm sure there will be a post-mortem and that it will contain lots of details for those who want to know more.


Lost Productivity Due to Non-Unified Repositories

February 17, 2015 at 02:50 PM | categories: Mozilla

I'm currently working on annotating moz.build files with metadata that defines things like which bug component and code reviewers map to which files. It's going to enable a lot of awesomeness.

As part of this project, I'm implementing a new moz.build processing mode. Instead of reading moz.build files by traversing DIRS variables from previously-executed moz.build files, we're evaluating moz.build files according to filesystem topology. This has uncovered a few cases where a moz.build file errors because of assumptions that no longer hold. For example, a moz.build file in a directory that is only built on Windows might assume it will only ever be evaluated when targeting Windows.

One such problem was with gfx/angle/src/libGLESv2/moz.build. This file contained code similar to the following:

# foo.cpp is only added to SOURCES when building for Windows...
if CONFIG['IS_WINDOWS']:
    SOURCES += ['foo.cpp']
...
# ...but flags are attached to it unconditionally. If foo.cpp was never
# added to SOURCES (any non-Windows evaluation), this line errors.
SOURCES['foo.cpp'].flags += ['-DBAR']

This always ran without issue because this moz.build was only included if building for Windows. This assumption is of course invalid when in filesystem traversal mode.

Anyway, as part of updating this trouble file, I lost maybe an hour of productivity. Here's how.

The top of the trouble moz.build file has a comment:

# Please note this file is autogenerated from generate_mozbuild.py,
# so do not modify it directly

OK. So, I need to modify generate_mozbuild.py. First things first: I need to locate it:

$ hg locate generate_mozbuild.py
gfx/skia/generate_mozbuild.py

So I load up this file. I see a main(). I run the script in my shell and get an error. Weird. I look around gfx/skia and see a README_MOZILLA file. I open it. README_MOZILLA contains some instructions. They aren't very good. I hop in #gfx on IRC and ask around. They tell me to do a Subversion clone of Skia and to check out the commit referenced in README_MOZILLA. There is no repo URL in README_MOZILLA. I search Google. I find a Git URL. I notice that README_MOZILLA contains a SHA-1 commit, not a Subversion integer revision. I figure the Git repo is what was meant. I clone the Git repo. I attempt to run the generation script referenced by README_MOZILLA. It fails. I ask again in #gfx. They are baffled at first. I dig around the source code. I see a reference in Skia's upstream code to a path that doesn't exist. I tell the #gfx people. They tell me sub-repos are likely involved and to use gclient to clone the repo. I search for the proper Skia source code docs and type the necessary gclient commands. (Fortunately I've used gclient before, so this wasn't completely alien to me.)

I get the Skia clone in the proper state. I run the generation script and all works. But I don't see it writing the trouble moz.build file I set out to fix. I set some breakpoints. I run the code again. I'm baffled.

Suddenly it hits me: I've been poking around with gfx/skia which is separate from gfx/angle! I look around gfx/angle and see a README.mozilla file. I open it. It reveals the existence of the Git repo https://github.com/mozilla/angle. I open GitHub in my browser. I see a generate_mozbuild.py script.

I now realize there are multiple files named generate_mozbuild.py. Unfortunately, the one I care about - the ANGLE one - is not checked into mozilla-central. So, my search for it with hg locate did not reveal its existence. Between me trying to get the Skia code cloned and generating moz.build files, I probably lost an hour of work. All because the file I actually needed wasn't checked into mozilla-central, while a similarly named one was!

I assumed that the single generate_mozbuild.py I found under source control was the only file of that name and that it must be the file I was interested in.

Maybe I should have known to look at gfx/angle/README.mozilla first. Maybe I should have known that gfx/angle and gfx/skia are completely independent.

But I didn't. My ignorance cost me.

Had the contents of the separate ANGLE repository been checked into mozilla-central, I would have seen the multiple generate_mozbuild.py files and I would likely have found the correct one immediately. But they weren't and I lost an hour of my time.

And I'm not done. Now I have to figure out how the separate ANGLE repo integrates with mozilla-central. I'll have to figure out how to submit the patch I still need to write. The GitHub description of this repo says Talk to vlad, jgilbert, or kamidphish for more info. So now I have to bother them before I can submit my patch. Maybe I'll just submit a pull request and see what happens.

I'm convinced I wouldn't have encountered this problem if a monolithic repository were used. I would have found the separate generate_mozbuild.py file immediately. And, the change process would likely have been known to me since all the code was in a repository I already knew how to submit patches from.

Separate repos are just lots of pain. You can bet I'll link to this post when people propose splitting up mozilla-central into multiple repositories.

