Automatically Redirecting Mercurial Pushes

April 30, 2015 at 12:30 PM | categories: Mercurial, Mozilla | View Comments

Managing URLs in distributed version control tools can be a pain, especially if multiple repositories are involved. For example, with Mozilla's repository-based code review workflow (you push to a special review repository to initiate code review - this is conceptually similar to GitHub pull requests), there exist separate code review repositories for each logical repository. Figuring out how repositories map to each other and setting up remote paths for each new clone can be a pain and time sink.

As of today, we can now do something better.

If you push to ssh://reviewboard-hg.mozilla.org/autoreview, Mercurial will automatically figure out the appropriate review repository and redirect your push automatically. In other words, if we have MozReview set up to review whatever repository you are working on, your push and review request will automatically go through. No need to figure out what the appropriate review repo is or configure repository URLs!

Here's what it looks like:

$ hg push review
pushing to ssh://reviewboard-hg.mozilla.org/autoreview
searching for appropriate review repository
redirecting push to ssh://reviewboard-hg.mozilla.org/version-control-tools/
searching for changes
remote: adding changesets
remote: adding manifests
remote: adding file changes
remote: added 1 changesets with 1 changes to 1 files
remote: Trying to insert into pushlog.
remote: Inserted into the pushlog db successfully.
submitting 1 changesets for review

changeset:  11043:b65b087a81be
summary:    mozreview: create per-commit identifiers (bug 1160266)
review:     https://reviewboard.mozilla.org/r/7953 (draft)

review id:  bz://1160266/gps
review url: https://reviewboard.mozilla.org/r/7951 (draft)
(visit review url to publish this review request so others can see it)

Read the full instructions for more details.

This requires an updated version-control-tools repository, which you can get by running mach mercurial-setup from a Firefox repository.

For those that are curious, the autoreview repo/server advertises a list of repository URLs and their root commit SHA-1. The client automatically sends the push to a URL sharing the same root commit. The code is quite simple.

While this is only implemented for MozReview, I could envision us doing something similar for other centralized repository-centric services, such as Try and Autoland. Stay tuned.

Read and Post Comments

My Current Thoughts on System Administration

April 17, 2015 at 03:35 PM | categories: sysadmin, Mozilla | View Comments

I attended PyCon last week. It's a great conference. You should attend. While I should write up a detailed trip report, I wanted to quickly share one of my takeaways.

Ansible was talked about a lot at PyCon. Sitting through a few presentations and talking with others helped me articulate why I've been drawn to Ansible (over say Puppet, Chef, Salt, etc) lately.

First, Ansible doesn't require a central server. Administration is done remotely Ansible establishes a SSH connection to a remote machine and does stuff. Having Ruby, Python, support libraries, etc installed on production systems just for system administration never really jived with me. I love Ansible's default hands off approach. (Yes, you can use a central server for Ansible, but that's not the default behavior. While tools like Puppet could be used without a central server, it felt like they were optimized for central server use and thus local mode felt awkward.)

Related to central servers, I never liked how that model consists of clients periodically polling for and applying updates. I like the idea of immutable server images and periodic updates work against this goal. The central model also has a major bazooka pointed at you: at any time, you are only one mistake away from completely hosing every machine doing continuous polling. e.g. if you accidentally update firewall configs and lock out central server and SSH connectivity, every machine will pick up these changes during periodic polling and by the time anyone realizes what's happened, your machines are all effectively bricked. (Yes, I've seen this happen.) I like having humans control exactly when my systems apply changes, thank you. I concede periodic updates and central control have some benefits.

Choosing not to use a central server by default means that hosts are modeled as a set of applied Ansible playbooks, not necessarily as a host with a set of Ansible playbooks attached. Although, Ansible does support both models. I can easily apply a playbook to a host in a one-off manner. This means I can have playbooks represent common, one-off tasks and I can easily run these tasks without having to muck around with the host to playbook configuration. More on this later.

I love the simplicity of Ansible's configuration. It is just YAML files. Not some Ruby-inspired DSL that takes hours to learn. With Ansible, I'm learning what modules are available and how they work, not complicated syntax. Yes, there is complexity in Ansible's configuration. But at least I'm not trying to figure out the file syntax as part of learning it.

Along that vein, I appreciate the readability of Ansible playbooks. They are simple, linear lists of tasks. Conceptually, I love the promise of full dependency graphs and concurrent execution. But I've spent hours debugging race conditions and cyclic dependencies in Puppet that I'm left unconvinced the complexity and power is worth it. I do wish Ansible could run faster by running things concurrently. But I think they made the right decision by following KISS.

I enjoy how Ansible playbooks are effectively high-level scripts. If I have a shell script or block of code, I can usually port it to Ansible pretty easily. One pass to do the conversion 1:1. Another pass to Ansibilize it. Simple.

I love how Ansible playbooks can be checked in to source control and live next to the code and applications they manage. I frequently see people maintain separate source control repositories for configuration management from the code it is managing. This always bothered me. When I write a service, I want the code for deploying and managing that service to live next to it in version control. That way, I get the configuration management and the code versioned in the same timeline. If I check out a release from 2 years ago, I should still be able to use its exact configuration management code. This becomes difficult to impossible when your organization is maintaining configuration management code in a separate repository where a central server is required to do deployments (see Puppet).

Before PyCon, I was having an internal monolog about adopting the policy that all changes to remote servers be implemented with Ansible playbooks. I'm pleased to report that a fellow contributor to the Mercurial project has adopted this workflow himself and he only has great things to say! So, starting today, I'm going to try to enforce that every change I make to a remote server is performed via Ansible and that the Ansible playbooks are checked into version control. The Ansible playbooks will become implicit documentation of every process involved with maintaining a server.

I've already applied this principle to deploying MozReview. Before, there was some internal Mozilla wiki documenting commands to execute in a terminal to deploy MozReview. I have replaced that documentation with a one-liner that invokes Ansible. And, the Ansible files are now in a public repository.

If you poke around that repository, you'll see that I have Ansible playbooks referencing Docker. I have Ansible provisioning Docker images used by the test and development environment. That same Ansible code is used to configure our production systems (or is at least in the process of being used in that way). Having dev, test, and prod using the same configuration management has been a pipe dream of mine and I finally achieved it! I attempted this before with Puppet but was unable to make it work just right. The flexibility that Ansible's design decisions have enabled has made this finally possible.

Ansible is my go to system management tool right now. And I still feel like I have a lot to learn about its hidden powers.

If you are still using Puppet, Chef, or other tools invented in previous generations, I urge you to check out Ansible. I think you'll be pleasantly surprised.

Read and Post Comments

Notes from Facebook's Developer Infrastructure at Scale F8 Talk

March 28, 2015 at 11:45 AM | categories: Mozilla | View Comments

Any time Facebook talks about technical matters I tend to listen. They have a track record of demonstrating engineering leadership in several spaces. And, unlike many companies that just talk, Facebook often gives others access to those ideas via source code and healthy open source projects. It's rare to see a company operating on the frontier of the computing field provide so much insight into their inner workings. You can gain so much by riding their cotails and following their lead instead of clinging to and cargo culting from the past.

The Facebook F8 developer conference was this past week. All the talks are now available online. I encourage you to glimpse through the list of talks and watch whatever is relevant to you. There's really a little bit for everyone.

Of particular interest to me is the Big Code: Developer Infrastructure at Facebook's Scale talk. This is highly relevant to my job role as Developer Productivity Engineer at Mozilla.

My notes for this talk follow.

"We don't want humans waiting on computers. We want computers waiting on humans." (This is the common theme of the talk.)

In 2005, Facebook was on Subversion. In 2007 moved to Git. Deployed a bridge so people worked in Git and had distributed workflow but pushed to Subversion under the hood.

New platforms over time. Server code, iOS, Android. One Git repo per platform/project -> 3 Git repos. Initially no code sharing, so no problem. Over time, code sharing between all repos. Lots of code copying and confusion as to what is where and who owns what.

Facebook is mere weeks away from completing their migration to consolidate the big three repos to a Mercurial monorepo. (See also my post about monorepos.)

Reasons:

  1. Easier code sharing.
  2. Easier large-scale changes. Rewrite the universe at once.
  3. Unified set of tooling.

Facebook employees run >1M source control commands per day. >100k commits per week. VCS tool needs to be fast to prevent distractions and context switching, which slow people down.

Facebook implemented sparse checkout and shallow history in Mercurial. Necessary to scale distributed version control to large repos.

Quote from Google: "We're excited about the work Facebook is doing with Mercurial and glad to be collaborating with Facebook on Mercurial development." (Well, I guess the cat is finally out of the bag: Google is working on Mercurial. This was kind of an open secret for months. But I guess now it is official.)

Push-pull-rebase bottleneck: if you rebase and push and someone beats you to it, you have to pull, rebase, and try again. This gets worse as commit rate increases and people do needless legwork. Facebook has moved to server-side rebasing on push to mostly eliminate this pain point. (This is part of a still-experimental feature in Mercurial, which should hopefully lose its experimental flag soon.)

Starting 13:00 in we have a speaker change and move away from version control.

IDEs don't scale to Facebook scale. "Developing in Xcode at Facebook is an exercise in frustration." On average 3.5 minutes to open Facebook for iOS in Xcode. 5 minutes on average to index. Pegs CPU and makes not very responsive. 50 Xcode crashes per day across all Facebook iOS developers.

Facebook measures everything about tools. Mercurial operation times. Xcode times. Build times. Data tells them what tools and workflows need to be worked on.

Facebook believes IDEs are worth the pain because they make people more productive.

Facebook wants to support all editors and IDEs since people want to use whatever is most comfortable.

React Native changed things. Supported developing on multiple platforms, which no single IDE supports. People launched several editors and tools to do React Native development. People needed 4 windows to do development. That experience was "not acceptable." So they built their own IDE. Set of plugins on top of ATOM. Not a fork. They like hackable and web-y nature of ATOM.

The demo showing iOS development looks very nice! Doing Objective-C, JavaScript, simulator integration, and version control in one window!

It can connect to remote servers and transparently save and deploy changes. It can also get real-time compilation errors and hints from the remote server! (Demo was with Hack. Not sure if others langs supported. Having beefy central servers for e.g. Gecko development would be a fun experiment.)

Starting at 32:00 presentation shifts to continuous integration.

Number one goal of CI at Facebook is developer efficiency. We don't want developers waiting on computers to build and test diffs.

3 goals for CI:

  1. High-signal feedback. Don't want developers chasing failures that aren't their fault. Wastes time.
  2. Must provide rapid feedback. Developers don't want to wait.
  3. Provide frequent feedback. Developers should know as soon as possible after they did something. (I think this refers to local feedback.)

Sandcastle is their CI system.

Diff lifecycle discussion.

Basic tests and lint run locally. (My understanding from talking with Facebookers is "local" often means on a Facebook server, not local laptop. Machines at developers fingertips are often dumb terminals.)

They appear to use code coverage to determine what tests to run. "We're not going to run a test unless your diff might actually have broken it."

They run flaky tests less often.

They run slow tests less often.

Goal is to get feedback to developers in under 10 minutes.

If they run fewer tests and get back to developers quicker, things are less likely to break than if they run more tests but take longer to give feedback.

They also want feedback quickly so reviewers can see results at review time.

They use Web Driver heavily. Love cross-platform nature of Web Driver.

In addition to test results, performance and size metrics are reported.

They have a "Ship It" button on the diff.

Landcastle handles landing diff.

"It is not OK at Facebook to land a diff without using Landcastle." (Read: developers don't push directly to the master repo.)

Once Landcastle lands something, it runs tests again. If an issue is found, a task is filed. Task can be "push blocking." Code won't ship to users until the "push blocking" issue resolved. (Tweets confirm they do backouts "fairly aggressively." A valid resolution to a push blocking task is to backout. But fixing forward is fine as well.)

After a while, branch cut occurs. Some cherry picks onto release branches.

In addition to diff-based testing, they do continuous testing runs. Much more comprehensive. No time restrictions. Continuous runs on master and release candidate branches. Auto bisect to pin down regressions.

Sandcastle processes >1000 test results per second. 5 years of machine work per day. Thousands of machines in 5 data centers.

They started with buildbot. Single master. Hit scaling limits of single thread single master. Master could not push work to workers fast enough. Sandcastle has distributed queue. Workers just pull jobs from distributed queue.

"High-signal feedback is critical." "Flaky failures erode developer confidence." "We need developers to trust Sandcastle."

Extremely careful separating infra failures from other failures. Developers don't see infra failures. Infra failures only reported to Sandcastle team.

Bots look for flaky tests. Stress test individual tests. Run tests in parallel with themselves. Goal: developers don't see flaky tests.

There is a "not my fault" button that developers can use to report bad signals.

"Whatever the scale of your engineering organization, developer efficiency is the key thing that your infrastructure teams should be striving for. This is why at Facebook we have some of our top engineers working on developer infrastructure." (Preach it.)

Excellent talk. Mozillians doing infra work or who are in charge of head count for infra work should watch this video.

Update 2015-03-28 21:35 UTC - Clarified some bits in response to new info Tweeted at me. Added link to my monorepos blog post.

Read and Post Comments

New High Scores for hg.mozilla.org

March 19, 2015 at 08:20 PM | categories: Mercurial, Mozilla | View Comments

It's been a rough week.

The very short summary of events this week is that both the Firefox and Firefox OS release automation has been performing a denial of service attack against hg.mozilla.org.

On the face of it, this is nothing new. The release automation is by far the top consumer of hg.mozilla.org data, requesting several terabytes per day via several million HTTP requests from thousands of machines in multiple data centers. The very nature of their existence makes them a significant denial of service threat.

Lots of things went wrong this week. While a post mortem will shed light on them, many fall under the umbrella of release automation was making more requests than it should have and was doing so in a way that both increased the chances of an outage occurring and increased the chances of a prolonged outage. This resulted in the hg.mozilla.org servers working harder than they ever have. As a result, we have some new high scores to share.

  • On UTC day March 19, hg.mozilla.org transferred 7.4 TB of data. This is a significant increase from the ~4 TB we expect on a typical weekday. (Even more significant when you consider that most load is generated during peak hours.)

  • During the 1300 UTC hour of March 17, the cluster received 1,363,628 HTTP requests. No HTTP 503 Service Not Available errors were encountered in that window! 300,000 to 400,000 requests per hour is typical.

  • During the 0800 UTC hour of March 19, the cluster transferred 776 GB of repository data. That comes out to at least 1.725 Gbps on average (I didn't calculate TCP and other overhead). Anything greater than 250 GB per hour is not very common. No HTTP 503 errors were served from the origin servers during this hour!

We encountered many periods where hg.mozilla.org was operating more than twice its normal and expected operating capacity and it was able to handle the load just fine. As a server operator, I'm proud of this. The servers were provisioned beyond what is normally needed of them and it took a truly exceptional event (or two) to bring the service down. This is generally a good way to do hosted services (you rarely want to be barely provisioned because you fall over at the slighest change and you don't want to be grossly over-provisioned because you are wasting money on idle resources).

Unfortunately, the hg.mozilla.org service did fall over. Multiple times, in fact. There is room to improve. As proud as I am that the service operated well beyond its expected limits, I can't help but feel ashamed that it did eventual cave in under even extreme load and that people are probably making under-informed general assumptions like Mercurial can't scale. The simple fact of the matter is that clients cumulatively generated an exceptional amount of traffic to hg.mozilla.org this week. All servers have capacity limits. And this week we encountered the limit for the current configuration of hg.mozilla.org. Cause and effect.

Read and Post Comments

Network Events

March 18, 2015 at 02:25 PM | categories: Mozilla | View Comments

The Firefox source repositories and automation have been closed the past few days due to a couple of outages.

Yesterday, aggregate CPU usage on many of the machines in the hg.mozilla.org cluster hit 100%. Previously, whenever hg.mozilla.org was under high load, we'd run out of network bandwidth before we ran out of CPU on the machines. In other words, Mercurial was generating data faster than the network could accept it.

When this happened, the service started issuing HTTP 503 Service Not Available responses. This is the universal server signal for I'm down, go away. Unfortunately, not all clients did this.

Parts of Firefox's release automation retried failing requests immediately, or with insufficient jitter in their backoff interval. Actively retrying requests against a server that's experiencing load issues only makes the problem worse. This effectively prolonged the outage.

Today, we had a similar but different network issue. The load balancer fronting hg.mozilla.org can only handle so much bandwidth. Today, we hit that limit. The load balancer started throttling connections. Load on hg.mozilla.org skyrocketed and request latency increased. From the perspective of clients, the service grinded to a halt.

hg.mozilla.org was partially sharing a load balancer with ftp.mozilla.org. That meant if one of the services experienced very high load, the other service could effectively be locked out of bandwidth. We saw this happening this morning. ftp.mozilla.org load was high (it looks like downloads of Firefox Developer Edition are a major contributor - these don't go through the CDN for reasons unknown to me) and there wasn't enough bandwidth to go around.

Separately today, hg.mozilla.org again hit 100% CPU. At that time, it also set a new record for network throughput: ~3 Gbps. It normally consumes between 200 and 500 Mbps, with periodic spikes to 750 Mbps. (Yesterday's event saw a spike to around ~2 Gbps.)

Going back through the hg.mozilla.org server logs, an offender is quite obvious. Before March 9, total outbound transfer for the build/tools repo was around 1 tebibyte per day. Starting on March 9, it increased to 3 tebibytes per day! This is quite remarkable, as a clone of this repo is only about 20 MiB. This means the repo was getting cloned about 150,000 times per day! (Note: I think all these numbers may be low by ~20% - stay tuned for the final analysis.)

2 TiB/day is statistically significant because we transfer less than 10 TiB/day across all of hg.mozilla.org. And, 1 TiB/day is close to 100 Mbps, assuming requests are evenly spread out (which of course they aren't).

Multiple things went wrong. If only one or two happened, we'd likely be fine. Maybe there would have been a short blip. But not the major event we've been firefighting the last ~24 hours.

This post is only a summary of what went wrong. I'm sure there will be a post-mortem and that it will contain lots of details for those who want to know more.

Read and Post Comments

« Previous Page -- Next Page »