Network Events
March 18, 2015 at 02:25 PM | categories: Mozilla
The Firefox source repositories and automation have been closed the past few days due to a couple of outages.
Yesterday, aggregate CPU usage on many of the machines in the hg.mozilla.org cluster hit 100%. Previously, whenever hg.mozilla.org was under high load, we'd run out of network bandwidth before we ran out of CPU on the machines. In other words, Mercurial was generating data faster than the network could accept it.
When this happened, the service started issuing HTTP 503 Service Unavailable responses. This is the universal server signal for I'm down, go away. Unfortunately, not all clients heeded that signal.
Parts of Firefox's release automation retried failing requests immediately, or with insufficient jitter in their backoff interval. Actively retrying requests against a server that's experiencing load issues only makes the problem worse. This effectively prolonged the outage.
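For what it's worth, the well-known antidote is exponential backoff with random jitter: treat a 503 as a cue to slow down rather than hammer the server. Here's a minimal sketch of that pattern - my own illustration, not the actual automation code, and it assumes a requests-style HTTP client:

import random
import time

import requests


def fetch_with_backoff(url, max_attempts=5, base_delay=1.0, max_delay=120.0):
    # GET a URL, backing off exponentially (with jitter) on 5xx responses.
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code < 500:
            return response

        # The server is overloaded (e.g. 503). Sleep for a randomized,
        # exponentially growing interval so thousands of clients don't
        # all retry in lockstep.
        delay = min(max_delay, base_delay * 2 ** attempt)
        time.sleep(random.uniform(0, delay))

    raise RuntimeError('giving up on %s after %d attempts' % (url, max_attempts))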
Today, we had a similar but different network issue. The load balancer fronting hg.mozilla.org can only handle so much bandwidth. Today, we hit that limit. The load balancer started throttling connections. Load on hg.mozilla.org skyrocketed and request latency increased. From the perspective of clients, the service ground to a halt.
hg.mozilla.org was partially sharing a load balancer with ftp.mozilla.org. That meant if one of the services experienced very high load, the other service could effectively be locked out of bandwidth. We saw this happening this morning. ftp.mozilla.org load was high (it looks like downloads of Firefox Developer Edition are a major contributor - these don't go through the CDN for reasons unknown to me) and there wasn't enough bandwidth to go around.
Separately today, hg.mozilla.org again hit 100% CPU. At that time, it also set a new record for network throughput: ~3 Gbps. It normally consumes between 200 and 500 Mbps, with periodic spikes to 750 Mbps. (Yesterday's event saw a spike to ~2 Gbps.)
Going back through the hg.mozilla.org server logs, an offender is quite obvious. Before March 9, total outbound transfer for the build/tools repo was around 1 tebibyte per day. Starting on March 9, it increased to 3 tebibytes per day! This is quite remarkable, as a clone of this repo is only about 20 MiB. This means the repo was getting cloned about 150,000 times per day! (Note: I think all these numbers may be low by ~20% - stay tuned for the final analysis.)
That extra 2 TiB/day is significant because we transfer less than 10 TiB/day across all of hg.mozilla.org. And 1 TiB/day works out to roughly 100 Mbps, assuming requests are evenly spread out (which of course they aren't).
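The back-of-the-envelope conversion, for those who want to check my math:

TIB = 1024 ** 4          # bytes in a tebibyte
SECONDS_PER_DAY = 86400

bytes_per_second = TIB / float(SECONDS_PER_DAY)   # ~12.7 MB/s
megabits_per_second = bytes_per_second * 8 / 1e6  # ~101.8 Mbps

So a sustained 1 TiB/day is a hair over 100 Mbps.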
Multiple things went wrong. If only one or two happened, we'd likely be fine. Maybe there would have been a short blip. But not the major event we've been firefighting the last ~24 hours.
This post is only a summary of what went wrong. I'm sure there will be a post-mortem and that it will contain lots of details for those who want to know more.
Lost Productivity Due to Non-Unified Repositories
February 17, 2015 at 02:50 PM | categories: Mozilla
I'm currently working on annotating moz.build files with metadata that defines things like which bug component and code reviewers map to which files. It's going to enable a lot of awesomeness.
As part of this project, I'm implementing a new moz.build processing mode. Instead of reading moz.build files by traversing DIRS variables from previously-executed moz.build files, we're evaluating moz.build files according to filesystem topology. This has uncovered a few cases where a moz.build file errors out because of assumptions that no longer hold. For example, a moz.build file in a directory that is only active on Windows might assume it will only ever be evaluated when building for Windows.
One such problem was with gfx/angle/src/libGLESv2/moz.build. This file contained code similar to the following:
if CONFIG['IS_WINDOWS']:
    SOURCES += ['foo.cpp']

...

SOURCES['foo.cpp'].flags += ['-DBAR']
This always ran without issue because this moz.build was only included if building for Windows. This assumption is of course invalid when in filesystem traversal mode.
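To make the failure concrete, here is a toy Python sketch of the pattern - my own illustration, not the real moz.build sandbox machinery, with IS_WINDOWS standing in for the actual config check:

class SourceFile(object):
    def __init__(self):
        self.flags = []


class Sources(object):
    # Toy stand-in for moz.build's SOURCES container: flags can only be
    # attached to files that were actually added.
    def __init__(self):
        self._files = {}

    def __iadd__(self, paths):
        for path in paths:
            self._files[path] = SourceFile()
        return self

    def __getitem__(self, path):
        return self._files[path]


CONFIG = {'IS_WINDOWS': False}  # what filesystem traversal sees on, say, Linux
SOURCES = Sources()

if CONFIG['IS_WINDOWS']:
    SOURCES += ['foo.cpp']

# In DIRS traversal mode this file was only evaluated on Windows, so the
# lookup below always succeeded. In filesystem traversal mode it runs
# everywhere and raises KeyError because foo.cpp was never added.
SOURCES['foo.cpp'].flags += ['-DBAR']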
Anyway, as part of updating this trouble file, I lost maybe an hour of productivity. Here's how.
The top of the trouble moz.build file has a comment:
# Please note this file is autogenerated from generate_mozbuild.py,
# so do not modify it directly
OK. So, I need to modify generate_mozbuild.py. First things first: I need to locate it:
$ hg locate generate_mozbuild.py
gfx/skia/generate_mozbuild.py
So I load up this file. I see a main(). I run the script in my shell and get an error. Weird. I look around gfx/skia and see a README_MOZILLA file. I open it. README_MOZILLA contains some instructions. They aren't very good. I hop in #gfx on IRC and ask around. They tell me to do a Subversion clone of Skia and to check out the commit referenced in README_MOZILLA. There is no repo URL in README_MOZILLA. I search Google. I find a Git URL. I notice that README_MOZILLA contains a SHA-1 commit, not a Subversion integer revision. I figure the Git repo is what was meant. I clone the Git repo. I attempt to run the generation script referenced by README_MOZILLA. It fails. I ask again in #gfx. They are baffled at first. I dig around the source code. I see a reference in Skia's upstream code to a path that doesn't exist. I tell the #gfx people. They tell me sub-repos are likely involved and to use gclient to clone the repo. I search for the proper Skia source code docs and type the necessary gclient commands. (Fortunately I've used gclient before, so this wasn't completely alien to me.)
I get the Skia clone in the proper state. I run the generation script and all works. But I don't see it writing the trouble moz.build file I set out to fix. I set some breakpoints. I run the code again. I'm baffled.
Suddenly it hits me: I've been poking around with gfx/skia which is separate from gfx/angle! I look around gfx/angle and see a README.mozilla file. I open it. It reveals the existence of the Git repo https://github.com/mozilla/angle. I open GitHub in my browser. I see a generate_mozbuild.py script.
I now realize there are multiple files named generate_mozbuild.py. Unfortunately, the one I care about - the ANGLE one - is not checked into mozilla-central. So, my search for it with hg locate did not reveal its existence. Between me trying to get the Skia code cloned and generating moz.build files, I probably lost an hour of work. All because a file with the same name wasn't checked into mozilla-central!
I assumed that the single generate_mozbuild.py I found under source control was the only file of that name and that it must be the file I was interested in.
Maybe I should have known to look at gfx/angle/README.mozilla first. Maybe I should have known that gfx/angle and gfx/skia are completely independent.
But I didn't. My ignorance cost me.
Had the contents of the separate ANGLE repository been checked into mozilla-central, I would have seen the multiple generate_mozbuild.py files and I would likely have found the correct one immediately. But they weren't and I lost an hour of my time.
And I'm not done. Now I have to figure out how the separate ANGLE repo integrates with mozilla-central. I'll have to figure out how to submit the patch I still need to write. The GitHub description of this repo says Talk to vlad, jgilbert, or kamidphish for more info. So now I have to bother them before I can submit my patch. Maybe I'll just submit a pull request and see what happens.
I'm convinced I wouldn't have encountered this problem if a monolithic repository were used. I would have found the separate generate_mozbuild.py file immediately. And, the change process would likely have been known to me since all the code was in a repository I already knew how to submit patches from.
Separate repos are just lots of pain. You can bet I'll link to this post when people propose splitting up mozilla-central into multiple repositories.
Branch Cleanup in Firefox Repositories
January 28, 2015 at 08:35 PM | categories: Mercurial, Mozilla
Mozilla has historically done some funky things with the Firefox Mercurial repositories. One of the things we've done is create a bunch of named branches to track the Firefox release process. These are branch names like GECKO20b12_2011022218_RELBRANCH.
Over in bug 927219, we started the process of cleaning up some cruft left over from many of these old branches.
For starters, the old named branches in the Firefox repositories are being actively closed. When you hg commit --close-branch, Mercurial creates a special commit that says this branch is closed. Branches that are closed are automatically hidden from the output of hg branches and hg heads. As a result, the output of these commands is now much more usable.
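If you're curious what that looks like in practice, the commands are roughly as follows (branch name taken from the example above):

$ hg up GECKO20b12_2011022218_RELBRANCH
$ hg commit --close-branch -m 'Close obsolete release branch'
$ hg branches           # closed branches no longer show up here
$ hg branches --closed  # ...but can still be listed explicitly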
Closed branches still constitute heads on the DAG. And an excessive number of heads leads to degraded performance in some situations (notably push and pull times - the same thing happens in Git). I'd like to eventually merge these old heads so that repositories only have 1 or a small number of DAG heads. However, extra care must be taken before that step. Stay tuned.
Anyway, for the average person reading, you probably won't be impacted by these changes at all. The greatest impact will be on the person who lands the first change on top of any repository whose last commit was a branch close. If you commit on top of the tip commit, you'll be committing on top of a previously closed branch! You'll instead want to hg up default after you pull to ensure you are on the proper DAG head. And even then, if you have local commits, you may not be based on top of the appropriate commit! A simple run of hg log --graph should help you decipher the state of the world. (Please note that the usability problems around discovering the appropriate head to land on are a result of our poor branching strategy for the Firefox repositories. We probably should have had named branches tracking the active Gecko releases. But that ship sailed years ago and fixing that is pretty far down the priority list. Wallpapering over things with the firefoxtree extension is my recommended solution until matters are fixed.)
Commit Part Numbers and MozReview
January 27, 2015 at 08:17 PM | categories: MozReview, Mozilla, code review
It is common for commit messages in Firefox to contain strings like Part 1, Part 2, etc. See this push for bug 784841 for an extreme multi-part example.
When code review is conducted in Bugzilla, these identifiers are necessary because Bugzilla orders attachments/patches by the time they were last updated or by their title (I'm not actually sure which!). If part numbers were omitted, it could be very confusing trying to figure out which order patches should be applied in.
However, when code review is conducted in MozReview, there is no need for explicit part numbers to convey ordering because the ordering of commits is implicitly defined by the repository history that you pushed to MozReview!
I argue that if you are using MozReview, you should stop writing Part N in your commit messages, as it provides little to no benefit.
I, for one, welcome this new world order: I've previously wasted a lot of time rewriting commit messages to reflect new part ordering after doing history rewriting. With MozReview, that overhead is gone and I barely pay a penalty for rewriting history, something that often produces a more reviewable series of commits and makes reviewing and landing a complex patch series significantly easier.
Automatic Python Static Analysis on MozReview
January 24, 2015 at 11:30 PM | categories: Python, MozReview, Mozilla, code review
A bunch of us were in Toronto last week hacking on MozReview.
One of the cool things we did was deploy a bot for performing Python static analysis. If you submit some .py files to MozReview, the bot should leave a review. If it finds violations (it uses flake8 internally), it will open an issue for each violation. It also leaves a comment that should hopefully give enough detail on how to fix the problem.
While we haven't done much in the way of performance optimizations, the bot typically submits results less than 10 seconds after the review is posted! So, a human should never be reviewing Python that the bot hasn't seen. This means you can stop thinking about style nits and start thinking about what the code does.
This bot should be considered an alpha feature. The code for the bot isn't even checked in yet. We're running the bot against production to get a feel for how it behaves. If things don't go well, we'll turn it off until the problems are fixed.
We'd like to eventually deploy C++, JavaScript, etc. bots. Python won out because it was the easiest to integrate (it has sane and efficient tooling that is compatible with Mozilla's code bases - most existing JavaScript tools won't work with Gecko-flavored JavaScript, sadly).
I'd also like to eventually make it easier to locally run the same static analysis we run in MozReview. Addressing problems locally before pushing is a no-brainer since it avoids needless context switching from other people and is thus better for productivity. This will come in time.
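In the meantime, nothing stops you from running flake8 locally on the files you've touched before pushing. One possible invocation - pipe the modified and added .py files from hg status into flake8, adjusting to taste:

$ pip install flake8
$ hg status --no-status --modified --added --include '**.py' | xargs flake8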
Report issues in #mozreview or in the Developer Services :: MozReview Bugzilla component.