Mozilla has historically done some funky things with the Firefox Mercurial repositories. One of the things we've done is create a bunch of named branches to track the Firefox release process. These are branch names like GECKO20b12_2011022218_RELBRANCH.
Over in bug 927219, we started the process of cleaning up some cruft left over from many of these old branches.
For starters, the old named branches in the Firefox repositories are being actively closed. When you hg commit --close-branch, Mercurial creates a special commit that says this branch is closed. Branches that are closed are automatically hidden from the output of hg branches and hg heads. As a result, the output of these commands is now much more usable.
Closed branches still constitute heads on the DAG. And several heads lead to degraded performance in some situations (notably push and pull times - the same thing happens in Git). I'd like to eventually merge these old heads so that repositories only have 1 or a small number of DAG heads. However, extra care must be taken before that step. Stay tuned.
Anyway, for the average person reading, you probably won't be impacted by these changes at all. The greatest impact will be from the person who lands the first change on top of any repository whose last commit was a branch close. If you commit on top of the tip commit, you'll be committing on top of a previously closed branch! You'll instead want to hg up default after you pull to ensure you are on the proper DAG head! And even then, if you have local commits, you may not be based on top of the appropriate commit! A simple run of hg log --graph should help you decipther the state of the world. (Please note that the usability problems around discovering the appropriate head to land on are a result of our poor branching strategy for the Firefox repositories. We probably should have named branches tracking the active Gecko releases. But that ship sailed years ago and fixing that is pretty far down the priority list. Wallpapering over things with the firefoxtree extensions is my recommended solution until matters are fixed.)
Mozilla's Mercurial documentation has historically been pretty bad. The documentation on MDN (which I refuse to link to) is horribly disjointed and contains a lot of outdated recommendations. I've made attempts to burn some of it to the ground, but it is just too overwhelming.
I've been casually creating my own Mercurial documentation tailored for Mozillians. It's called Mercurial for Mozillians.
It started as a way to document extensions inside the version-control-tools repository. But, it has since evolved to cover other topics, like how to install Mercurial, how to develop using bookmarks, and how to interact with a unified Firefox repository. The documentation is nowhere near complete. But it already has some very useful content beyond what MDN offers.
I'm not crazy about the idea of having generic Mercurial documentation on a Mozilla domain (this should be part of the official Mercurial documentation). Nor am I crazy about moving content off MDN. I'm sure content will move to its appropriate location later. Until then, enjoy some curated Mercurial documentation!
If you would like to contribute to Mercurial for Mozillians, read the docs.
The bzexport Mercurial extension - an extension that enables you to easily create new Bugzilla bugs and upload patches to Bugzilla for review - just received some major updates.
First, we now have automated test coverage of bzexport! This is built on top of the version control test harness I previously blogged about. As part of the tests, we start Docker containers that run the same code that's running on bugzilla.mozilla.org, so interactions with Bugzilla are properly tested. This is much, much better than mocking HTTP requests and responses because if Bugzilla changes, our tests will detect it. Yay continuous integration.
Second, bzexport now uses Bugzilla' REST API instead of the legacy bzAPI endpoint for all but 1 HTTP request. This should make BMO maintainers very happy.
Third and finally, bzexport now uses shared code for obtaining Bugzilla credentials. The behavior is documented, of course. Behavior is not backwards compatible. If you were using some old configuration values, you will now see warnings when running bzexport. These warnings are actionable, so I shouldn't need to describe them here.
Please obtain the new code by pulling the version-control-tools repository. Or, if you have a Firefox clone, run mach mercurial-setup.
Thanks go out to Steve Fink, Ed Morley, and Ted Mielczarek for looking at the code.
Starting today and continuing through next week, there will be a number of styling changes made to hg.mozilla.org.
The main goal of the work is to bring the style up-to-date with upstream Mercurial. This will result in more features being available to the web interface, hopefully making it more useful. This includes display of bookmarks and the Mercurial help documentation. As part of this work, we're also removing some files on the server that shouldn't be used. If you start getting 404s or notice an unexpected theme change, this is probably the reason why.
If you'd like to look over the changes before they are made or would like to file a bug against a regression (we suspect there will be minor regressions due to the large nature of the changes), head on over to bug 1117021 or ping people in #vcs on IRC.
At Mozilla, I often hear statements like Mercurial is slow. That's a very general statement. Depending on the context, it can mean one or more of several things:
- My Mercurial workflow is not very efficient
- hg commands I execute are slow to run
- hg commands I execute appear to stall
- The Mercurial server I'm interfacing with is slow
I want to spend time talking about a specific problem: why hg.mozilla.org (the server) is slow.
What Isn't Slow
If you are talking to hg.mozilla.org over HTTP or HTTPS (https://hg.mozilla.org/), there should not currently be any server performance issues. Our Mercurial HTTP servers are pretty beefy and are able to absorb a lot of load.
If https://hg.mozilla.org/ is slow, chances are:
- You are doing something like cloning a 1+ GB repository.
- You are asking the server to do something really expensive (like generate JSON for 100,000 changesets via the pushlog query interface).
- You don't have a high bandwidth connection to the server.
- There is a network event.
Previous Network Events
There have historically been network capacity issues in the datacenter where hg.mozilla.org is hosted (SCL3).
During Mozlandia, excessive traffic to ftp.mozilla.org essentially saturated the SCL3 network. During this time, requests to hg.mozilla.org were timing out: Mercurial traffic just couldn't traverse the network. Fortunately, events like this are quite rare.
Up until recently, Firefox release automation was effectively overwhelming the network by doing some clownshoesy things.
For example, gaia-central was being cloned all the time We had a ~1.6 GB repository being cloned over a thousand times per day. We were transferring close to 2 TB of gaia-central data out of Mercurial servers per day
We also found issues with pushlogs sending 100+ MB responses.
And the build/tools repo was getting cloned for every job. Ditto for mozharness.
In all, we identified a few terabytes of excessive Mercurial traffic that didn't need to exist. This excessive traffic was saturating the SCL3 network and slowing down not only Mercurial traffic, but other traffic in SCL3 as well.
Fortunately, people from Release Engineering were quick to respond to and fix the problems once they were identified. The problem is now firmly in control. Although, given the scale of Firefox's release automation, any new system that comes online that talks to version control is susceptible to causing server outages. I've already raised this concern when reviewing some TaskCluster code. The thundering herd of automation will be an ongoing concern. But I have plans to further mitigate risk in 2015. Stay tuned.
Looking back at our historical data, it appears that we hit these network saturation limits a few times before we reached a tipping point in early November 2014. Unfortunately, we didn't realize this because up until recently, we didn't have a good source of data coming from the servers. We lacked the tooling to analyze what we had. We lacked the experience to know what to look for. Outages are effective flashlights. We learned a lot and know what we need to do with the data moving forward.
Available Network Bandwidth
One person pinged me on IRC with the comment Git is cloning much faster than Mercurial. I asked for timings and the Mercurial clone wall time for Firefox was much higher than I expected.
The reason was network bandwidth. This person was performing a Git clone between 2 hosts in EC2 but was performing the Mercurial clone between hg.mozilla.org and a host in EC2. In other words, they were partially comparing the performance of a 1 Gbps network against a link over the public internet! When they did a fair comparison by removing the network connection as a variable, the clone times rebounded to what I expected.
The single-homed nature of hg.mozilla.org in a single datacenter in northern California is not only bad for disaster recovery reasons, it also means that machines far away from SCL3 or connecting to SCL3 over a slow network aren't getting optimal performance.
In 2015, expect us to build out a geo-distributed hg.mozilla.org so that connections are hitting a server that is closer and thus faster. This will probably be targeted at Firefox release automation in AWS first. We want those machines to have a fast connection to the server and we want their traffic isolated from the servers developers use so that hiccups in automation don't impact the ability for humans to access and interface with source code.
NFS on SSH Master Server
If you connect to http://hg.mozilla.org/ or https://hg.mozilla.org/, you are hitting a pool of servers behind a load balancer. These servers have repository data stored on local disk, where I/O is fast. In reality, most I/O is serviced by the page cache, so local disks don't come into play.
If you connect to ssh://hg.mozilla.org/, you are hitting a single, master server. Its repository data is hosted on an NFS mount. I/O on the NFS mount is horribly slow. Any I/O intensive operation performed on the master is much, much slower than it should be. Such is the nature of NFS.
We'll be exploring ways to mitigate this performance issue in 2015. But it isn't the biggest source of performance pain, so don't expect anything immediately.
Synchronous Replication During Pushes
When you hg push to hg.mozilla.org, the changes are first made on the SSH/NFS master server. They are subsequently mirrored out to the HTTP read-only slaves.
As is currently implemented, the mirroring process is performed synchronously during the push operation. The server waits for the mirrors to complete (to a reasonable state) before it tells the client the push has completed.
Depending on the repository, the size of the push, and server and network load, mirroring commonly adds 1 to 7 seconds to push times. This is time when a developer is sitting at a terminal, waiting for hg push to complete. The time for Try pushes can be larger: 10 to 20 seconds is not uncommon (but fortunately not the norm).
The current mirroring mechanism is overly simple and prone to many failures and sub-optimal behavior. I plan to work on fixing mirroring in 2015. When I'm done, there should be no user-visible mirroring delay.
Pushlog Replication Inefficiency
Up until yesterday (when we deployed a rewritten pushlog extension, the replication of pushlog data from master to server was very inefficient. Instead of tranferring a delta of pushes since last pull, we were literally copying the underlying SQLite file across the network!
Try's pushlog is ~30 MB. mozilla-central and mozilla-inbound are in the same ballpark. 30 MB x 10 slaves is a lot of data to transfer. These operations were capable of periodically saturating the network, slowing everyone down.
The rewritten pushlog extension performs a delta transfer automatically as part of hg pull. Pushlog synchronization now completes in milliseconds while commonly only consuming a few kilobytes of network traffic.
Early indications reveal that deploying this change yesterday decreased the push times to repositories with long push history by 1-3s.
Pretty much any interaction with the Try repository is guaranteed to have poor performance. The Try repository is doing things that distributed versions control systems weren't designed to do. This includes Git.
If you are using Try, all bets are off. Performance will be problematic until we roll out the headless try repository.
That being said, we've made changes recently to make Try perform better. The median time for pushing to Try has decreased significantly in the past few weeks. The first dip in mid-November was due to upgrading the server from Mercurial 2.5 to Mercurial 3.1 and from converting Try to use generaldelta encoding. The dip this week has been from merging all heads and from deploying the aforementioned pushlog changes. Pushing to Try is now significantly faster than 3 months ago.
Many of the reasons for hg.mozilla.org slowness are known. More often than not, they are due to clownshoes or inefficiencies on Mozilla's part rather than fundamental issues with Mercurial.
We have made significant progress at making hg.mozilla.org faster. But we are not done. We are continuing to invest in fixing the sub-optimal parts and making hg.mozilla.org faster yet. I'm confident that within a few months, nobody will be able to say that the servers are a source of pain like they have been for years.
Furthermore, Mercurial is investing in features to make the wire protocol faster, more efficient, and more powerful. When deployed, these should make pushes faster on any server. They will also enable workflow enhancements, such as Facebook's experimental extension to perform rebases as part of push (eliminating push races and having to manually rebase when you lose the push race).
Next Page »