Network Events

March 18, 2015 at 02:25 PM | categories: Mozilla

The Firefox source repositories and automation have been closed the past few days due to a couple of outages.

Yesterday, aggregate CPU usage on many of the machines in the cluster hit 100%. Previously, whenever was under high load, we'd run out of network bandwidth before we ran out of CPU on the machines. In other words, Mercurial was generating data faster than the network could accept it.

When this happened, the service started issuing HTTP 503 Service Not Available responses. This is the universal server signal for I'm down, go away. Unfortunately, not all clients did this.

Parts of Firefox's release automation retried failing requests immediately, or with insufficient jitter in their backoff interval. Actively retrying requests against a server that's experiencing load issues only makes the problem worse. This effectively prolonged the outage.

Today, we had a similar but different network issue. The load balancer fronting can only handle so much bandwidth. Today, we hit that limit. The load balancer started throttling connections. Load on skyrocketed and request latency increased. From the perspective of clients, the service grinded to a halt. was partially sharing a load balancer with That meant if one of the services experienced very high load, the other service could effectively be locked out of bandwidth. We saw this happening this morning. load was high (it looks like downloads of Firefox Developer Edition are a major contributor - these don't go through the CDN for reasons unknown to me) and there wasn't enough bandwidth to go around.

Separately today, again hit 100% CPU. At that time, it also set a new record for network throughput: ~3 Gbps. It normally consumes between 200 and 500 Mbps, with periodic spikes to 750 Mbps. (Yesterday's event saw a spike to around ~2 Gbps.)

Going back through the server logs, an offender is quite obvious. Before March 9, total outbound transfer for the build/tools repo was around 1 tebibyte per day. Starting on March 9, it increased to 3 tebibytes per day! This is quite remarkable, as a clone of this repo is only about 20 MiB. This means the repo was getting cloned about 150,000 times per day! (Note: I think all these numbers may be low by ~20% - stay tuned for the final analysis.)

2 TiB/day is statistically significant because we transfer less than 10 TiB/day across all of And, 1 TiB/day is close to 100 Mbps, assuming requests are evenly spread out (which of course they aren't).

Multiple things went wrong. If only one or two happened, we'd likely be fine. Maybe there would have been a short blip. But not the major event we've been firefighting the last ~24 hours.

This post is only a summary of what went wrong. I'm sure there will be a post-mortem and that it will contain lots of details for those who want to know more.