It's been a rough week.
The very short summary of events this week is that both the Firefox and Firefox OS release automation has been performing a denial of service attack against hg.mozilla.org.
On the face of it, this is nothing new. The release automation is by far the top consumer of hg.mozilla.org data, requesting several terabytes per day via several million HTTP requests from thousands of machines in multiple data centers. The very nature of their existence makes them a significant denial of service threat.
Lots of things went wrong this week. While a post mortem will shed light on them, many fall under the umbrella of release automation was making more requests than it should have and was doing so in a way that both increased the chances of an outage occurring and increased the chances of a prolonged outage. This resulted in the hg.mozilla.org servers working harder than they ever have. As a result, we have some new high scores to share.
On UTC day March 19, hg.mozilla.org transferred 7.4 TB of data. This is a significant increase from the ~4 TB we expect on a typical weekday. (Even more significant when you consider that most load is generated during peak hours.)
During the 1300 UTC hour of March 17, the cluster received 1,363,628 HTTP requests. No HTTP 503 Service Not Available errors were encountered in that window! 300,000 to 400,000 requests per hour is typical.
During the 0800 UTC hour of March 19, the cluster transferred 776 GB of repository data. That comes out to at least 1.725 Gbps on average (I didn't calculate TCP and other overhead). Anything greater than 250 GB per hour is not very common. No HTTP 503 errors were served from the origin servers during this hour!
We encountered many periods where hg.mozilla.org was operating more than twice its normal and expected operating capacity and it was able to handle the load just fine. As a server operator, I'm proud of this. The servers were provisioned beyond what is normally needed of them and it took a truly exceptional event (or two) to bring the service down. This is generally a good way to do hosted services (you rarely want to be barely provisioned because you fall over at the slighest change and you don't want to be grossly over-provisioned because you are wasting money on idle resources).
Unfortunately, the hg.mozilla.org service did fall over. Multiple times, in fact. There is room to improve. As proud as I am that the service operated well beyond its expected limits, I can't help but feel ashamed that it did eventual cave in under even extreme load and that people are probably making under-informed general assumptions like Mercurial can't scale. The simple fact of the matter is that clients cumulatively generated an exceptional amount of traffic to hg.mozilla.org this week. All servers have capacity limits. And this week we encountered the limit for the current configuration of hg.mozilla.org. Cause and effect.
Mozilla has historically done some funky things with the Firefox Mercurial repositories. One of the things we've done is create a bunch of named branches to track the Firefox release process. These are branch names like GECKO20b12_2011022218_RELBRANCH.
Over in bug 927219, we started the process of cleaning up some cruft left over from many of these old branches.
For starters, the old named branches in the Firefox repositories are being actively closed. When you hg commit --close-branch, Mercurial creates a special commit that says this branch is closed. Branches that are closed are automatically hidden from the output of hg branches and hg heads. As a result, the output of these commands is now much more usable.
Closed branches still constitute heads on the DAG. And several heads lead to degraded performance in some situations (notably push and pull times - the same thing happens in Git). I'd like to eventually merge these old heads so that repositories only have 1 or a small number of DAG heads. However, extra care must be taken before that step. Stay tuned.
Anyway, for the average person reading, you probably won't be impacted by these changes at all. The greatest impact will be from the person who lands the first change on top of any repository whose last commit was a branch close. If you commit on top of the tip commit, you'll be committing on top of a previously closed branch! You'll instead want to hg up default after you pull to ensure you are on the proper DAG head! And even then, if you have local commits, you may not be based on top of the appropriate commit! A simple run of hg log --graph should help you decipther the state of the world. (Please note that the usability problems around discovering the appropriate head to land on are a result of our poor branching strategy for the Firefox repositories. We probably should have named branches tracking the active Gecko releases. But that ship sailed years ago and fixing that is pretty far down the priority list. Wallpapering over things with the firefoxtree extensions is my recommended solution until matters are fixed.)
Mozilla's Mercurial documentation has historically been pretty bad. The documentation on MDN (which I refuse to link to) is horribly disjointed and contains a lot of outdated recommendations. I've made attempts to burn some of it to the ground, but it is just too overwhelming.
I've been casually creating my own Mercurial documentation tailored for Mozillians. It's called Mercurial for Mozillians.
It started as a way to document extensions inside the version-control-tools repository. But, it has since evolved to cover other topics, like how to install Mercurial, how to develop using bookmarks, and how to interact with a unified Firefox repository. The documentation is nowhere near complete. But it already has some very useful content beyond what MDN offers.
I'm not crazy about the idea of having generic Mercurial documentation on a Mozilla domain (this should be part of the official Mercurial documentation). Nor am I crazy about moving content off MDN. I'm sure content will move to its appropriate location later. Until then, enjoy some curated Mercurial documentation!
If you would like to contribute to Mercurial for Mozillians, read the docs.
The bzexport Mercurial extension - an extension that enables you to easily create new Bugzilla bugs and upload patches to Bugzilla for review - just received some major updates.
First, we now have automated test coverage of bzexport! This is built on top of the version control test harness I previously blogged about. As part of the tests, we start Docker containers that run the same code that's running on bugzilla.mozilla.org, so interactions with Bugzilla are properly tested. This is much, much better than mocking HTTP requests and responses because if Bugzilla changes, our tests will detect it. Yay continuous integration.
Second, bzexport now uses Bugzilla' REST API instead of the legacy bzAPI endpoint for all but 1 HTTP request. This should make BMO maintainers very happy.
Third and finally, bzexport now uses shared code for obtaining Bugzilla credentials. The behavior is documented, of course. Behavior is not backwards compatible. If you were using some old configuration values, you will now see warnings when running bzexport. These warnings are actionable, so I shouldn't need to describe them here.
Please obtain the new code by pulling the version-control-tools repository. Or, if you have a Firefox clone, run mach mercurial-setup.
Thanks go out to Steve Fink, Ed Morley, and Ted Mielczarek for looking at the code.
Starting today and continuing through next week, there will be a number of styling changes made to hg.mozilla.org.
The main goal of the work is to bring the style up-to-date with upstream Mercurial. This will result in more features being available to the web interface, hopefully making it more useful. This includes display of bookmarks and the Mercurial help documentation. As part of this work, we're also removing some files on the server that shouldn't be used. If you start getting 404s or notice an unexpected theme change, this is probably the reason why.
If you'd like to look over the changes before they are made or would like to file a bug against a regression (we suspect there will be minor regressions due to the large nature of the changes), head on over to bug 1117021 or ping people in #vcs on IRC.
Next Page »