You may have noticed the Firefox trees were closed for the better part of yesterday.
Long story short, a file containing URLs for Firefox installers was updated to reference https://ftp.mozilla.org/ from http://download-installer.cdn.mozilla.net/. The original, latter host is a CDN. The former is not. When thousands of clients started hitting ftp.mozilla.org, it overwhelmed the servers and network, causing timeouts and other badness.
The change in question was accidental. It went through code review. From a code change standpoint, procedures were followed.
It is tempting to point fingers at the people involved. However, I want us to consider placing blame elsewhere: on the code review tool.
The diff being reviewed was to change the Firefox version number from 32.0 to 32.0.3. If you were asked to review this patch, I'm guessing your eyes would have glanced over everything in the URL except the version number part. I'm pretty sure mine would have.
Let's take a look at what the reviewer saw in Bugzilla/Splinter (click to see full size):
And here is what the reviewer would have seen had the review been conducted in MozReview:
Which tool makes the change of hostname more obvious? Bugzilla/Splinter or MozReview?
MozReview's support for intraline diffs more clearly draws attention to the hostname change. I posit that had this patch been reviewed with MozReview, the chances are greater we wouldn't have had a network outage yesterday.
And it isn't just intraline diffs that make Splinter/Bugzilla a sub-optimal code review tool. I recently blogged about the numerous ways that using Bugzilla for code revie results in harder reviews and buggier code. Every day we continue using Bugzilla/Splinter instead of investing in better code review tools is a day severe bugs like this can and will slip through the cracks.
If there is any silver lining to this outage, I hope it is that we need to double down on our investment in developer tools, particularly code review.