Style Changes on hg.mozilla.org
January 09, 2015 at 03:25 PM | categories: Mercurial, Mozilla

Starting today and continuing through next week, there will be a number of styling changes made to hg.mozilla.org.
The main goal of the work is to bring the style up-to-date with upstream Mercurial. This will result in more features being available to the web interface, hopefully making it more useful. This includes display of bookmarks and the Mercurial help documentation. As part of this work, we're also removing some files on the server that shouldn't be used. If you start getting 404s or notice an unexpected theme change, this is probably the reason why.
If you'd like to look over the changes before they are made or would like to file a bug against a regression (we suspect there will be minor regressions given the scale of the changes), head on over to bug 1117021 or ping people in #vcs on IRC.
Mercurial Pushlog Is Now Robust Against Interrupts
December 30, 2014 at 12:25 PM | categories: Mozilla, Firefox

hg.mozilla.org - Mozilla's Mercurial server - has functionality called the pushlog which records who pushed what when. Essentially, it's a log of when a repository was changed. This is separate from the commit log because the commit log can be spoofed and the commit log doesn't record when commits were actually pushed.
Since its inception, the pushlog has suffered from data consistency issues. If you aborted the push at a certain time, data was not inserted in the pushlog. If you aborted the push at another time, data existed in the pushlog but not in the repository (the repository would get rolled back but the pushlog data wouldn't).
I'm pleased to announce that the pushlog is now robust against interruptions and its updates are consistent with what is recorded by Mercurial. The pushlog database commit/rollback is tied to Mercurial's own transaction API. What Mercurial does to the push transaction, the pushlog follows.
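To illustrate the mechanism, here is a minimal sketch of how an extension can tie a SQLite pushlog to the push transaction. This is not the actual hg.mozilla.org extension: the hook wiring, the callback names (currenttransaction, addfinalize, addabort), and the table schema are assumptions based on my memory of Mercurial's Python API.

# Sketch only: commit the pushlog row if and only if Mercurial commits the
# push transaction; roll it back if Mercurial aborts.
import sqlite3
import time

def pretxnchangegroup_pushlog(ui, repo, node=None, **kwargs):
    conn = sqlite3.connect(repo.vfs.join('pushlog2.db'))
    conn.execute('CREATE TABLE IF NOT EXISTS pushlog '
                 '(id INTEGER PRIMARY KEY, user TEXT, date INTEGER, node TEXT)')
    ui.write('Trying to insert into pushlog.\n')
    conn.execute('INSERT INTO pushlog (user, date, node) VALUES (?, ?, ?)',
                 (ui.username(), int(time.time()), node))
    ui.write('Inserted into the pushlog db successfully.\n')

    # Defer the SQLite commit/rollback to Mercurial's own transaction.
    tr = repo.currenttransaction()

    def finalize(tr):
        conn.commit()

    def abort(tr):
        ui.warn('rolling back pushlog\n')
        conn.rollback()

    tr.addfinalize('pushlog', finalize)
    tr.addabort('pushlog', abort)
    return 0

The important property is that the pushlog commit happens inside Mercurial's transaction lifecycle, so an interrupted push leaves neither the repository nor the pushlog half-updated.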
This former inconsistency has caused numerous problems over the years. When data was inconsistent, we often had to close trees until someone could SSH into the machines and manually run SQL to fix the problems. This also contributed to a culture of don't press ctrl+c during push: it could corrupt Mercurial. (Ctrl+c should be safe to press any time: if it isn't, there is a bug to be filed.)
Any time you remove a source of tree closures is a cause for celebration. Please join me in celebrating your new freedom to abort pushes without concern for data inconsistency.
In case you want to test things out, aborting a push (and rolling back the pushlog) should now result in something like:
pushing to ssh://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
Trying to insert into pushlog.
Inserted into the pushlog db successfully.
^C
rolling back pushlog
transaction abort!
rollback completed
Firefox Source Documentation Versus MDN
December 30, 2014 at 12:00 PM | categories: Mozilla

The Firefox source tree has had in-tree documentation powered by Sphinx for a while now. However, its canonical home has been a hard-to-find URL on ci.mozilla.org. I finally scratched an itch and wrote patches to make the docs easier to build. So, starting today, the docs are now available on Read the Docs at https://gecko.readthedocs.org/en/latest/!
While I was scratching itches, I decided to play around with another documentation-related task: automatic API documentation. I have a limited proof-of-concept for automatically generating XPIDL interface documentation. Essentially, we use the in-tree XPIDL parser to parse .idl files and turn the Python object representation into reStructuredText, which Sphinx parses and renders into pretty HTML for us. The concept can be applied to any source input, such as WebIDL and JavaScript code. I chose XPIDL because a parser is readily available and I know Joshua Cranmer has expressed interest in automatic XPIDL documentation generation. (As an aside, JavaScript tooling that supports the flavor of JavaScript used internally by Firefox is very limited. We need to prioritize removing Mozilla extensions to JavaScript if we ever want to start using the awesome tooling that exists in the wild.)
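To give a flavor of the proof-of-concept, here is a rough sketch of walking parsed XPIDL objects and emitting reStructuredText. The parser import path and the attribute names on the parsed objects (productions, members, kind, doccomments) are assumptions and may not match the in-tree xpidl module exactly.

# Hedged sketch: turn parsed XPIDL interfaces into reStructuredText for Sphinx.
from xpidl.xpidl import IDLParser  # assumed import path for the in-tree parser

def interface_to_rst(iface):
    """Render one parsed interface as reStructuredText."""
    lines = [iface.name, '=' * len(iface.name), '']
    lines.extend(getattr(iface, 'doccomments', []))
    lines.append('')

    for member in iface.members:
        kind = getattr(member, 'kind', 'member')
        lines.append('.. describe:: %s (%s)' % (member.name, kind))
        lines.append('')
        for comment in getattr(member, 'doccomments', []):
            lines.append('   ' + comment)
        lines.append('')

    return '\n'.join(lines)

def idl_file_to_rst(path):
    parser = IDLParser()
    idl = parser.parse(open(path).read(), filename=path)
    return '\n\n'.join(interface_to_rst(p) for p in idl.productions
                       if getattr(p, 'kind', None) == 'interface')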
As I was implementing this proof-of-concept, I was looking at XPIDL interface documentation on MDN to see how things are presented today. After perusing MDN for a bit and comparing its content against what I was able to derive from the source, something became extremely clear: MDN has significantly more content than the canonical source code. Obviously the .idl files document the interfaces, their attributes, their methods, and all the types and names in between: that's the very definition of an IDL. But what is generally missing from the source code is comments. What does this method do? What is each argument used for? Things like example usage are almost non-existent in the source code. MDN, by contrast, typically has no shortage of all these things.
As I was grasping the reality that MDN has a lot of out-of-tree supplemental content, I started asking myself what's the point in automatic API docs? Is manual document curation on MDN good enough? This question has sort of been tearing me apart. Let me try to explain.
MDN is an amazing site. You can tell a lot of love has gone into making the experience and much of its content excellent. However, the content around the technical implementation / internals of Gecko/Firefox generally sucks. There are some exceptions to the rule. But I find that things like internal API documentation to be lackluster on average. It is rare for me to find documentation that is up-to-date and useful. It is common to find documentation that is partial and incomplete. It is very common to find things like JSMs not documented at all. I think this is a problem. I argue the lack of good documentation raises the barrier to contributing. Furthermore, writing and maintaining excellent low-level documentation is too much effort.
My current thoughts on API and low-level documentation are that I question the value of this documentation existing on MDN. Specifically, I think things like JSM API docs (like Sqlite.jsm) and XPIDL interface documentation (like nsIFile) don't belong on MDN - at least not in wiki form. Instead, I believe that documentation like this should live in and be derived from source code. Now, if the MDN site wants to expose this as read-only content or if MDN wants to enable the content to be annotated in a wiki-like manner (like how MSDN and PHP documentation allow user comments), that's perfectly fine by me. Here's why.
First, if I must write separate-from-source-code API documentation on MDN (or any other platform for that matter), I must either perform extra work or forgo documenting one of the two places. In other words, if I write in-line documentation in the source code, I must spend extra effort to essentially copy large parts of it to MDN. And I must continue to spend extra effort to keep the two in sync. If I don't want to spend that extra effort (I'm as lazy as you), I have to choose between documenting the source code or documenting MDN. If I choose the source code, people either have to read the source to read the docs (because we don't generate documentation from source today) or someone else has to duplicate the docs (more work overall). If I choose to document on MDN, then people reading the source code (probably because they want to change it) are deprived of context that would make that process easier. This is a lose-lose scenario and a general waste of expensive people's time.
Second, I prefer having API documentation derived from source code because I feel it results in more accurate documentation that has a higher likelihood of remaining accurate and in sync with reality. Think about it: when was the last time you reviewed changes to a JSM and searched MDN for content that needed updating? I'm sure there are some pockets of people that do this right. But I've written dozens of JavaScript patches for Firefox and I'm pretty sure I've been asked to update external documentation less than 5% of the time. Inline source documentation, however, is another matter entirely. Because the documentation is often proximal to the code that changed, I frequently a) go ahead and make the documentation changes because everything is right there and it's low overhead to change as I adjust the source, or b) am asked to update the in-line docs when a reviewer sees I forgot to. Generally speaking, things tend to stay in sync and fewer bugs form when everything is proximally located. By fragmenting documentation between source code and external services like MDN, we increase the likelihood that things become out of sync. This results in misleading information and increases the barriers to contribution and change. In other words, developer inefficiency.
Third, having API documentation derived from source code opens up numerous possibilities to further aid developer productivity and improve the usefulness of documentation. For example:
- We can parse @param references out of documentation and issue warnings/errors when documentation doesn't match the AST (see the sketch after this list).
- We can issue warnings when code isn't documented.
- We can include in-line examples and execute and verify these as part of builds/tests.
- We can more easily cross-reference APIs because everything is documented consistently. We can also issue warnings when cross-references no longer work.
- We can derive files so editors and IDEs can display in-line API docs and complete functions as you type, allowing people to code faster.
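As an illustration of the first item, here is a hedged Python sketch that checks @param references in docstrings against the function signature in the AST. The JSM and XPIDL equivalents would need their own parsers, but the technique is the same.

# Sketch: warn when documented @param names don't match the actual arguments.
import ast
import re
import sys

PARAM_RE = re.compile(r'@param\s+(\w+)')

def check_file(path):
    tree = ast.parse(open(path).read(), filename=path)
    ok = True
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        doc = ast.get_docstring(node) or ''
        documented = set(PARAM_RE.findall(doc))
        actual = {a.arg for a in node.args.args if a.arg != 'self'}
        for name in documented - actual:
            ok = False
            print('%s:%d: @param %s does not match any argument of %s()'
                  % (path, node.lineno, name, node.name))
        for name in actual - documented:
            ok = False
            print('%s:%d: argument %s of %s() is undocumented'
                  % (path, node.lineno, name, node.name))
    return ok

if __name__ == '__main__':
    results = [check_file(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)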
While we don't generally do these things today, they are all within the realm of possibility. Sphinx supports doing many of these things. Stop reading and run mach build-docs right now and look at the warnings from malformed documentation. I don't know about you, but I love when my tools tell me when I'm creating a burden for others.
There really is so much more we could be doing with source-derived documentation. And I argue managing it would take less overall work and would result in higher quality documentation.
But the world of source-derived documentation isn't all roses. MDN has a very important advantage: it's a wiki. Just log in, edit in a WYSIWYG, and save. It's so easy. The moment we move to source-derived documentation, we introduce the massive Firefox source repository, the Firefox code review process, bugs/Bugzilla, version control overhead (although versioning documentation is another plus for source-derived documentation), landing changes, extra cost to Mozilla for building and running those checkins (even if they contain docs-only changes, sadly), and the time and cognitive burden associated with each one. That's a lot of extra work compared to clicking a few buttons on MDN! Moving documentation editing out of MDN and into the Firefox patch submission world would be a step in the wrong direction in terms of fostering contributions. Should someone really have to go through all that just to correct a typo? I have no doubt we'd lose contributors if we switched the change contribution process. And considering our lackluster track record of writing inline documentation in source, I don't feel great about losing any person who contributes documentation, no matter how small the contribution.
And this is my dilemma: the existing source-or-MDN situation is sub-par at pretty much everything except ease of contribution on MDN, yet deploying nice tools (like Sphinx) to address the suckitude will make contributing more difficult. Both alternatives suck.
I intend to continue this train of thought in a subsequent post. Stay tuned.
Why hg.mozilla.org is Slow
December 19, 2014 at 02:40 PM | categories: Mercurial, Mozilla

At Mozilla, I often hear statements like Mercurial is slow. That's a very general statement. Depending on the context, it can mean one or more of several things:
- My Mercurial workflow is not very efficient
- hg commands I execute are slow to run
- hg commands I execute appear to stall
- The Mercurial server I'm interfacing with is slow
I want to spend time talking about a specific problem: why hg.mozilla.org (the server) is slow.
What Isn't Slow
If you are talking to hg.mozilla.org over HTTP or HTTPS (https://hg.mozilla.org/), there should not currently be any server performance issues. Our Mercurial HTTP servers are pretty beefy and are able to absorb a lot of load.
If https://hg.mozilla.org/ is slow, chances are:
- You are doing something like cloning a 1+ GB repository.
- You are asking the server to do something really expensive (like generate JSON for 100,000 changesets via the pushlog query interface; see the sketch after this list).
- You don't have a high bandwidth connection to the server.
- There is a network event.
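To make the second bullet concrete, here is a small sketch contrasting an unbounded pushlog query with a bounded one. The json-pushes endpoint and its startID/endID parameters are recalled from memory, so treat the exact URL as an assumption.

# Sketch: ask the pushlog for a bounded window of pushes instead of everything.
import json
from urllib.request import urlopen

BASE = 'https://hg.mozilla.org/mozilla-central/json-pushes'

# Expensive: no bounds, so the server serializes the entire push history.
# history = json.load(urlopen(BASE))

# Cheap: only request the window of pushes you actually need.
window = json.load(urlopen(BASE + '?startID=10000&endID=10010'))
for push_id, push in sorted(window.items(), key=lambda kv: int(kv[0])):
    print(push_id, push['user'], len(push['changesets']), 'changesets')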
Previous Network Events
There have historically been network capacity issues in the datacenter where hg.mozilla.org is hosted (SCL3).
During Mozlandia, excessive traffic to ftp.mozilla.org essentially saturated the SCL3 network. During this time, requests to hg.mozilla.org were timing out: Mercurial traffic just couldn't traverse the network. Fortunately, events like this are quite rare.
Up until recently, Firefox release automation was effectively overwhelming the network by doing some clownshoesy things.
For example, gaia-central was being cloned all the time: a ~1.6 GB repository was being cloned over a thousand times per day, and we were transferring close to 2 TB of gaia-central data out of the Mercurial servers every day.
We also found issues with pushlogs sending 100+ MB responses.
And the build/tools repo was getting cloned for every job. Ditto for mozharness.
In all, we identified a few terabytes of excessive Mercurial traffic that didn't need to exist. This excessive traffic was saturating the SCL3 network and slowing down not only Mercurial traffic, but other traffic in SCL3 as well.
Fortunately, people from Release Engineering were quick to respond to and fix the problems once they were identified. The problem is now firmly under control. However, given the scale of Firefox's release automation, any new system that comes online and talks to version control is susceptible to causing server outages. I've already raised this concern when reviewing some TaskCluster code. The thundering herd of automation will be an ongoing concern. But I have plans to further mitigate risk in 2015. Stay tuned.
Looking back at our historical data, it appears that we hit these network saturation limits a few times before we reached a tipping point in early November 2014. Unfortunately, we didn't realize this because up until recently, we didn't have a good source of data coming from the servers. We lacked the tooling to analyze what we had. We lacked the experience to know what to look for. Outages are effective flashlights. We learned a lot and know what we need to do with the data moving forward.
Available Network Bandwidth
One person pinged me on IRC with the comment Git is cloning much faster than Mercurial. I asked for timings and the Mercurial clone wall time for Firefox was much higher than I expected.
The reason was network bandwidth. This person was performing a Git clone between 2 hosts in EC2 but was performing the Mercurial clone between hg.mozilla.org and a host in EC2. In other words, they were partially comparing the performance of a 1 Gbps network against a link over the public internet! When they did a fair comparison by removing the network connection as a variable, the clone times rebounded to what I expected.
The single-homed nature of hg.mozilla.org in a single datacenter in northern California is not only bad for disaster recovery reasons, it also means that machines far away from SCL3 or connecting to SCL3 over a slow network aren't getting optimal performance.
In 2015, expect us to build out a geo-distributed hg.mozilla.org so that connections are hitting a server that is closer and thus faster. This will probably be targeted at Firefox release automation in AWS first. We want those machines to have a fast connection to the server and we want their traffic isolated from the servers developers use so that hiccups in automation don't impact the ability for humans to access and interface with source code.
NFS on SSH Master Server
If you connect to http://hg.mozilla.org/ or https://hg.mozilla.org/, you are hitting a pool of servers behind a load balancer. These servers have repository data stored on local disk, where I/O is fast. In reality, most I/O is serviced by the page cache, so local disks don't come into play.
If you connect to ssh://hg.mozilla.org/, you are hitting a single, master server. Its repository data is hosted on an NFS mount. I/O on the NFS mount is horribly slow. Any I/O intensive operation performed on the master is much, much slower than it should be. Such is the nature of NFS.
We'll be exploring ways to mitigate this performance issue in 2015. But it isn't the biggest source of performance pain, so don't expect anything immediately.
Synchronous Replication During Pushes
When you hg push to hg.mozilla.org, the changes are first made on the SSH/NFS master server. They are subsequently mirrored out to the HTTP read-only slaves.
As is currently implemented, the mirroring process is performed synchronously during the push operation. The server waits for the mirrors to complete (to a reasonable state) before it tells the client the push has completed.
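For illustration only, synchronous mirroring of this kind is typically wired up as a server-side changegroup hook that pushes to every mirror before the client's push returns. The mirror list and hook wiring below are assumptions, not the actual hg.mozilla.org deployment.

# Sketch: replicate to each mirror while the pushing client waits.
import subprocess

MIRRORS = [
    'ssh://hgweb1.example.com/mozilla-central',  # hypothetical mirror hosts
    'ssh://hgweb2.example.com/mozilla-central',
]

def mirror_on_changegroup(ui, repo, **kwargs):
    for url in MIRRORS:
        ui.status('replicating to %s\n' % url)
        # Each of these is a full network round trip that the pushing
        # developer sits through, which is where the extra seconds come from.
        subprocess.check_call(['hg', '-R', repo.root, 'push', '-f', url])
    return 0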
Depending on the repository, the size of the push, and server and network load, mirroring commonly adds 1 to 7 seconds to push times. This is time when a developer is sitting at a terminal, waiting for hg push to complete. The time for Try pushes can be larger: 10 to 20 seconds is not uncommon (but fortunately not the norm).
The current mirroring mechanism is overly simple and prone to many failures and sub-optimal behavior. I plan to work on fixing mirroring in 2015. When I'm done, there should be no user-visible mirroring delay.
Pushlog Replication Inefficiency
Up until yesterday (when we deployed a rewritten pushlog extension), the replication of pushlog data from the master to the slaves was very inefficient. Instead of transferring a delta of the pushes since the last pull, we were literally copying the underlying SQLite file across the network!
Try's pushlog is ~30 MB. mozilla-central and mozilla-inbound are in the same ballpark. 30 MB x 10 slaves is a lot of data to transfer. These operations were capable of periodically saturating the network, slowing everyone down.
The rewritten pushlog extension performs a delta transfer automatically as part of hg pull. Pushlog synchronization now completes in milliseconds while commonly only consuming a few kilobytes of network traffic.
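Conceptually, the delta transfer looks something like the sketch below. The table and column names are assumptions, and fetch_pushes_since stands in for whatever wire-protocol command the rewritten extension actually adds to hg pull.

# Sketch: transfer only the pushes the mirror doesn't have yet.
import sqlite3

def sync_pushlog(local_db_path, fetch_pushes_since):
    conn = sqlite3.connect(local_db_path)
    last_id = conn.execute(
        'SELECT COALESCE(MAX(id), 0) FROM pushlog').fetchone()[0]

    # A handful of rows and a few kilobytes on the wire, instead of copying
    # a ~30 MB SQLite file to every slave.
    new_rows = list(fetch_pushes_since(last_id))
    conn.executemany(
        'INSERT INTO pushlog (id, user, date, node) VALUES (?, ?, ?, ?)',
        new_rows)
    conn.commit()
    return len(new_rows)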
Early indications reveal that deploying this change yesterday decreased push times to repositories with long push histories by 1-3 seconds.
Try
Pretty much any interaction with the Try repository is guaranteed to have poor performance. The Try repository is doing things that distributed version control systems weren't designed to do. This includes Git.
If you are using Try, all bets are off. Performance will be problematic until we roll out the headless try repository.
That being said, we've made changes recently to make Try perform better. The median time for pushing to Try has decreased significantly in the past few weeks. The first dip, in mid-November, came from upgrading the server from Mercurial 2.5 to Mercurial 3.1 and from converting Try to use generaldelta encoding. The dip this week came from merging all the heads and from deploying the aforementioned pushlog changes. Pushing to Try is now significantly faster than it was 3 months ago.
Conclusion
Many of the reasons for hg.mozilla.org slowness are known. More often than not, they are due to clownshoes or inefficiencies on Mozilla's part rather than fundamental issues with Mercurial.
We have made significant progress at making hg.mozilla.org faster. But we are not done. We are continuing to invest in fixing the sub-optimal parts and making hg.mozilla.org faster yet. I'm confident that within a few months, nobody will be able to say that the servers are a source of pain like they have been for years.
Furthermore, Mercurial is investing in features to make the wire protocol faster, more efficient, and more powerful. When deployed, these should make pushes faster on any server. They will also enable workflow enhancements, such as Facebook's experimental extension to perform rebases as part of push (eliminating push races and having to manually rebase when you lose the push race).
mach sub-commands
December 18, 2014 at 09:45 AM | categories: Mozilla

mach - the generic command line dispatching tool that powers the mach command to aid Firefox development - now has support for sub-commands.
You can now create simple and intuitive user interfaces involving sub-actions. e.g.
mach device sync
mach device run
mach device delete
Before, doing something like this required either a universal argument parser or separate top-level mach commands. Both constitute a poor user experience (a confusing array of available arguments or a proliferation of top-level commands). Both result in mach help being difficult to comprehend. And that's not good for usability and approachability.
Nothing in Firefox currently uses this feature, although there is an in-progress patch in bug 1108293 providing a mach command to analyze C/C++ build dependencies. It is my hope that others write useful commands and functionality on top of this feature.
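For anyone who wants to experiment, here is a rough sketch of what a sub-command provider might look like. The decorator names and signatures (CommandProvider, Command, SubCommand) are from memory, and the device command itself is hypothetical.

# Sketch: a hypothetical mach command with sub-commands.
from mach.decorators import CommandProvider, Command, SubCommand

@CommandProvider
class DeviceCommands(object):
    @Command('device', category='misc',
             description='Interact with a connected device.')
    def device(self):
        # Invoked for a bare `mach device`; mach can list sub-commands here.
        pass

    @SubCommand('device', 'sync', description='Sync files to the device.')
    def device_sync(self):
        print('syncing...')

    @SubCommand('device', 'run', description='Run a program on the device.')
    def device_run(self):
        print('running...')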
The documentation for mach has also been rewritten. It is now exposed as part of the in-tree Sphinx documentation.
Everyone should thank Andrew Halberstadt for promptly reviewing the changes!