New High Scores for

March 19, 2015 at 08:20 PM | categories: Mercurial, Mozilla

It's been a rough week.

The very short summary of events this week is that both the Firefox and Firefox OS release automation has been performing a denial of service attack against

On the face of it, this is nothing new. The release automation is by far the top consumer of data, requesting several terabytes per day via several million HTTP requests from thousands of machines in multiple data centers. The very nature of their existence makes them a significant denial of service threat.

Lots of things went wrong this week. While a post mortem will shed light on them, many fall under the umbrella of release automation was making more requests than it should have and was doing so in a way that both increased the chances of an outage occurring and increased the chances of a prolonged outage. This resulted in the servers working harder than they ever have. As a result, we have some new high scores to share.

  • On UTC day March 19, transferred 7.4 TB of data. This is a significant increase from the ~4 TB we expect on a typical weekday. (Even more significant when you consider that most load is generated during peak hours.)

  • During the 1300 UTC hour of March 17, the cluster received 1,363,628 HTTP requests. No HTTP 503 Service Not Available errors were encountered in that window! 300,000 to 400,000 requests per hour is typical.

  • During the 0800 UTC hour of March 19, the cluster transferred 776 GB of repository data. That comes out to at least 1.725 Gbps on average (I didn't calculate TCP and other overhead). Anything greater than 250 GB per hour is not very common. No HTTP 503 errors were served from the origin servers during this hour!

We encountered many periods where was operating more than twice its normal and expected operating capacity and it was able to handle the load just fine. As a server operator, I'm proud of this. The servers were provisioned beyond what is normally needed of them and it took a truly exceptional event (or two) to bring the service down. This is generally a good way to do hosted services (you rarely want to be barely provisioned because you fall over at the slighest change and you don't want to be grossly over-provisioned because you are wasting money on idle resources).

Unfortunately, the service did fall over. Multiple times, in fact. There is room to improve. As proud as I am that the service operated well beyond its expected limits, I can't help but feel ashamed that it did eventual cave in under even extreme load and that people are probably making under-informed general assumptions like Mercurial can't scale. The simple fact of the matter is that clients cumulatively generated an exceptional amount of traffic to this week. All servers have capacity limits. And this week we encountered the limit for the current configuration of Cause and effect.