Lost Productivity Due to Non-Unified Repositories

February 17, 2015 at 02:50 PM | categories: Mozilla

I'm currently working on annotating moz.build files with metadata that defines things like which bug component and code reviewers map to which files. It's going to enable a lot of awesomeness.

As part of this project, I'm implementing a new moz.build processing mode. Instead of reading moz.build files by traversing DIRS variables from previously-executed moz.build files, we're evaluating moz.build files according to filesystem topology. This has uncovered a few cases where a moz.build file errors because of assumptions that no longer hold. For example, in a directory that is only built on Windows, the moz.build file might assume that its check for Windows is always true.

One such problem was with gfx/angle/src/libGLESv2/moz.build. This file contained code similar to the following:

if CONFIG['IS_WINDOWS']:
    SOURCES += ['foo.cpp']
...
# Errors if foo.cpp was never added to SOURCES (i.e. on non-Windows platforms).
SOURCES['foo.cpp'].flags += ['-DBAR']

This always ran without issue because this moz.build file was only included when building for Windows. That assumption is, of course, invalid in filesystem traversal mode.
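
The fix is straightforward: only touch the per-file flags when the file was actually added to SOURCES. A minimal sketch of the safe pattern, reusing the hypothetical CONFIG key and file names from above:

if CONFIG['IS_WINDOWS']:
    SOURCES += ['foo.cpp']
    # The flags entry only exists if foo.cpp was added above, so keep both
    # statements under the same condition.
    SOURCES['foo.cpp'].flags += ['-DBAR']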

Anyway, as part of updating this troublesome file, I lost maybe an hour of productivity. Here's how.

The top of the troublesome moz.build file has a comment:

# Please note this file is autogenerated from generate_mozbuild.py,
# so do not modify it directly

OK. So, I need to modify generate_mozbuild.py. First things first: I need to locate it:

$ hg locate generate_mozbuild.py
gfx/skia/generate_mozbuild.py

So I load up this file. I see a main(). I run the script in my shell and get an error. Weird. I look around gfx/skia and see a README_MOZILLA file. I open it. README_MOZILLA contains some instructions. They aren't very good. I hop in #gfx on IRC and ask around. They tell me to do a Subversion clone of Skia and to check out the commit referenced in README_MOZILLA. There is no repo URL in README_MOZILLA. I search Google. I find a Git URL. I notice that README_MOZILLA contains a SHA-1 commit, not a Subversion integer revision. I figure the Git repo is what was meant. I clone the Git repo. I attempt to run the generation script referenced by README_MOZILLA. It fails. I ask again in #gfx. They are baffled at first. I dig around the source code. I see a reference in Skia's upstream code to a path that doesn't exist. I tell the #gfx people. They tell me sub-repos are likely involved and to use gclient to clone the repo. I search for the proper Skia source code docs and type the necessary gclient commands. (Fortunately I've used gclient before, so this wasn't completely alien to me.)

I get the Skia clone in the proper state. I run the generation script and all works. But I don't see it writing the troublesome moz.build file I set out to fix. I set some breakpoints. I run the code again. I'm baffled.

Suddenly it hits me: I've been poking around with gfx/skia, which is separate from gfx/angle! I look around gfx/angle and see a README.mozilla file. I open it. It reveals the existence of the Git repo https://github.com/mozilla/angle. I open GitHub in my browser. I see a generate_mozbuild.py script.

I now realize there are multiple files named generate_mozbuild.py. Unfortunately, the one I care about - the ANGLE one - is not checked into mozilla-central. So, my search for it with hg locate did not reveal its existence. Between trying to get the Skia code cloned and generating moz.build files, I probably lost an hour of work. All because a file with a similar name wasn't checked into mozilla-central!

I assumed that the single generate_mozbuild.py I found under source control was the only file of that name and that it must be the file I was interested in.

Maybe I should have known to look at gfx/angle/README.mozilla first. Maybe I should have known that gfx/angle and gfx/skia are completely independent.

But I didn't. My ignorance cost me.

Had the contents of the separate ANGLE repository been checked into mozilla-central, I would have seen the multiple generate_mozbuild.py files and I would likely have found the correct one immediately. But they weren't and I lost an hour of my time.

And I'm not done. Now I have to figure out how the separate ANGLE repo integrates with mozilla-central. I'll have to figure out how to submit the patch I still need to write. The GitHub description of this repo says "Talk to vlad, jgilbert, or kamidphish for more info." So now I have to bother them before I can submit my patch. Maybe I'll just submit a pull request and see what happens.

I'm convinced I wouldn't have encountered this problem if a monolithic repository were used. I would have found the separate generate_mozbuild.py file immediately. And, the change process would likely have been known to me since all the code was in a repository I already knew how to submit patches from.

Separate repos are just lots of pain. You can bet I'll link to this post when people propose splitting up mozilla-central into multiple repositories.

Branch Cleanup in Firefox Repositories

January 28, 2015 at 08:35 PM | categories: Mercurial, Mozilla

Mozilla has historically done some funky things with the Firefox Mercurial repositories. One of the things we've done is create a bunch of named branches to track the Firefox release process. These are branch names like GECKO20b12_2011022218_RELBRANCH.

Over in bug 927219, we started the process of cleaning up some cruft left over from many of these old branches.

For starters, the old named branches in the Firefox repositories are being actively closed. When you run hg commit --close-branch, Mercurial creates a special commit that marks the branch as closed. Branches that are closed are automatically hidden from the output of hg branches and hg heads. As a result, the output of these commands is now much more usable.
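
For illustration, closing one of those old branches and verifying it is hidden looks something like this (the branch name is taken from above):

$ hg up GECKO20b12_2011022218_RELBRANCH
$ hg commit --close-branch -m 'No longer used.'
$ hg branches            # closed branches no longer appear
$ hg branches --closed   # pass --closed to list them anyway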

Closed branches still constitute heads on the DAG. And a large number of heads leads to degraded performance in some situations (notably push and pull times - the same thing happens in Git). I'd like to eventually merge these old heads so that repositories only have 1 or a small number of DAG heads. However, extra care must be taken before that step. Stay tuned.

Anyway, for the average person reading, you probably won't be impacted by these changes at all. The greatest impact will be on the person who lands the first change on top of any repository whose last commit was a branch close. If you commit on top of the tip commit, you'll be committing on top of a previously closed branch! You'll instead want to run hg up default after you pull to ensure you are on the proper DAG head. And even then, if you have local commits, you may not be based on top of the appropriate commit! A simple run of hg log --graph should help you decipher the state of the world. (Please note that the usability problems around discovering the appropriate head to land on are a result of our poor branching strategy for the Firefox repositories. We probably should have used named branches tracking the active Gecko releases. But that ship sailed years ago and fixing that is pretty far down the priority list. Wallpapering over things with the firefoxtree extension is my recommended solution until matters are fixed.)
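
Concretely, a safe sequence after pulling looks something like this (the log invocation is just one way to inspect the nearby DAG):

$ hg pull
$ hg up default
$ hg log --graph -l 5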

Commit Part Numbers and MozReview

January 27, 2015 at 08:17 PM | categories: Mozilla, code review

It is common for commit messages in Firefox to contain strings like Part 1, Part 2, etc. See this push for bug 784841 for an extreme multi-part example.

When code review is conducted in Bugzilla, these identifiers are necessary because Bugzilla orders attachments/patches either by when they were last updated or by their patch title (I'm not actually sure which!). If part numbers were omitted, it could be very confusing trying to figure out the order in which patches should be applied.

However, when code review is conducted in MozReview, there is no need for explicit part numbers to convey ordering because the ordering of commits is implicitly defined by the repository history that you pushed to MozReview!

I argue that if you are using MozReview, you should stop writing Part N in your commit messages, as it provides little to no benefit.

I, for one, welcome this new world order: I've previously wasted a lot of time rewriting commit messages to reflect new part ordering after doing history rewriting. With MozReview, that overhead is gone and I barely pay a penalty for rewriting history, something that often produces a more reviewable series of commits and makes reviewing and landing a complex patch series significantly easier.

Automatic Python Static Analysis on MozReview

January 24, 2015 at 11:30 PM | categories: Python, Mozilla, code review

A bunch of us were in Toronto last week hacking on MozReview.

One of the cool things we did was deploy a bot for performing Python static analysis. If you submit some .py files to MozReview, the bot should leave a review. If it finds violations (it uses flake8 internally), it will open an issue for each violation. It also leaves a comment that should hopefully give enough detail on how to fix the problem.

While we haven't done much in the way of performance optimizations, the bot typically submits results less than 10 seconds after the review is posted! So, a human should never be reviewing Python that the bot hasn't seen. This means you can stop thinking about style nits and start thinking about what the code does.

This bot should be considered an alpha feature. The code for the bot isn't even checked in yet. We're running the bot against production to get a feel for how it behaves. If things don't go well, we'll turn it off until the problems are fixed.

We'd like to eventually deploy bots for C++, JavaScript, etc. Python won out because it was the easiest to integrate (it has sane and efficient tooling that is compatible with Mozilla's code bases - most existing JavaScript tools won't work with Gecko-flavored JavaScript, sadly).

I'd also like to eventually make it easier to locally run the same static analysis we run in MozReview. Addressing problems locally before pushing is a no-brainer, since it avoids needless context switching for other people and is thus better for productivity. This will come in time.
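
In the meantime, you can approximate what the bot flags by running flake8 yourself on the files you touched before pushing (the bot's exact configuration may differ from flake8's defaults):

$ pip install flake8
$ flake8 path/to/changed_file.py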

Report issues in #mozreview or in the Developer Services :: MozReview Bugzilla component.

End to End Testing with Docker

January 24, 2015 at 11:10 PM | categories: Docker, Mozilla, testing

I've written an extensive testing framework for Mozilla's version control tools. Despite it being a little rough around the edges, I'm a bit proud of it.

When you run tests for MozReview, Mozilla's heavily modified Review Board code review tool, the following things happen:

  • A MySQL server is started in a Docker container.
  • A Bugzilla server (running the same code as bugzilla.mozilla.org) is started on an Apache httpd server with mod_perl inside a Docker container.
  • A RabbitMQ server mimicking pulse.mozilla.org is started in a Docker container.
  • A Review Board Django development server is started.
  • A Mercurial HTTP server is started.

In the future, we'll likely also need to add support for various other services to support MozReview and other components of version control tools:

  • The Autoland HTTP service will be started in a Docker container, along with any other requirements it may have.
  • An IRC server will be started in a Docker container.
  • Zookeeper and Kafka will be started in multiple Docker containers.

The entire setup is pretty cool. You have actual services running on your local machine. Mike Conley and Steven MacLeod even did some pair coding of MozReview while on a plane last week. I think it's pretty cool this is even possible.

There is very little mocking in the tests. If we need an external service, we try to spin up an instance inside a local container. This way, we can't have unexpected test successes or failures due to bugs in mocking. We have very high confidence that if something works against local containers, it will work in production.

I currently have each test file owning its own set of Docker containers and processes. This way, we get full test isolation and can run tests concurrently without race conditions. This drastically reduces overall test execution time and makes individual tests easier to reason about.

As cool as the test setup is, there's a bunch I wish were better.

Spinning up and shutting down all those containers and processes takes a lot of time. We're currently sitting around 8s startup time and 2s shutdown time. 10s overhead per test is unacceptable. When I make a one-line change, I want the tests to be instantaneous. 10s is too long for me to sit idly by. Unfortunately, I've already gone to great pains to make test overhead as short as possible. Fig wasn't good enough for me for various reasons. I've reimplemented my own orchestration directly on top of the docker-py package to achieve some significant performance wins. Using concurrent.futures to perform operations against multiple containers concurrently was a big win. Bootstrapping containers (running their first-run entrypoint scripts and committing the result to be used later by tests) was a bigger win (first run of Bugzilla is 20-25 seconds).
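
To give a flavor of the concurrency win, here is a minimal sketch of starting several containers in parallel. It uses the modern Docker SDK for Python rather than the exact docker-py calls in the harness, and the image names are placeholders:

import concurrent.futures
import docker

client = docker.from_env()

# Placeholder image names; the real harness builds its own images.
IMAGES = ['bmo-mysql', 'bmo-httpd', 'pulse-rabbitmq']

def start(image):
    # detach=True returns immediately with a handle to the running container.
    return client.containers.run(image, detach=True)

with concurrent.futures.ThreadPoolExecutor() as pool:
    containers = list(pool.map(start, IMAGES))

# ... run tests against the services ...

for container in containers:
    container.stop()
    container.remove()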

I'm at the point of optimizing startup where the longest pole is the initialization of the services inside Docker containers themselves. MySQL takes a few seconds to start accepting connections. Apache + Bugzilla has a semi-involved initialization process. RabbitMQ takes about 4 seconds to initialize. There are some cascading dependencies in there, so the majority of startup time is waiting for processes to finish their startup routine.
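
Most of that waiting reduces to polling until a service accepts TCP connections. A sketch of the kind of helper involved (host, port, and timeout values are illustrative):

import socket
import time

def wait_for_port(host, port, timeout=30.0):
    """Poll until a TCP service accepts connections or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return
        except OSError:
            time.sleep(0.1)
    raise RuntimeError('%s:%d did not start within %.0fs' % (host, port, timeout))

# e.g. wait for MySQL before initializing Bugzilla
wait_for_port('127.0.0.1', 3306)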

Another concern with running all these containers is memory usage. When you start running 6+ instances of MySQL, Apache, RabbitMQ, and so on, it becomes really easy to exhaust system memory, incur swapping, and have performance fall off a cliff. I've spent a non-trivial amount of time figuring out the minimal amount of memory I can make services consume while still not sacrificing too much performance.
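
Docker at least makes the capping part easy; with the Docker SDK for Python it is a single argument (image name and limit are illustrative):

import docker

client = docker.from_env()

# Cap the container at 256 MB of RAM.
client.containers.run('bmo-mysql', detach=True, mem_limit='256m')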

It is quite an experience having the problem of trying to minimize resource usage and startup time for various applications. Searching the internet will happily give you recommended settings for applications. You can find out how to make a service start in 10s instead of 60s or consume 100 MB of RSS instead of 1 GB. But what the internet won't tell you is how to make the service start in 2s instead of 3s or consume as little memory as possible. I reckon I'm past the point of diminishing returns, where most people don't care about any further performance wins. But, because of how I'm using containers for end-to-end testing and have a surplus of short-lived containers, it is clearly a problem I need to solve.

I might be able to squeeze out a few more seconds of reduction by further optimizing startup and shutdown. But, I doubt I'll reduce things below 5s. If you ask me, that's still not good enough. I want no more than 2s overhead per test. And I don't think I'm going to get that unless I start utilizing containers across multiple tests. And I really don't want to do that because it sacrifices test purity. Engineering is full of trade-offs.

Another takeaway from implementing this test harness is that the pre-built Docker images available from the Docker Registry almost always become useless. I eventually make a customization that can't be shoehorned into the readily-available image and I find myself having to reinvent the wheel. I'm not a fan of the "download and run a binary" model, especially given Docker's less-than-stellar history on the security and cryptography fronts (I'll trust Linux distributions to get package distribution right, but I'm not going to be trusting the Docker Registry quite yet), so it's not a huge loss. I'm at the point where I've lost faith in Docker Registry images and my default position is to implement my own builder. Containers are supposed to do one thing, so it usually isn't that difficult to roll my own images.

There's a lot to love about Docker and containerized test execution. But I feel like I'm venturing into new territory and solving problems like startup time minimization that I shouldn't really have to be solving. I think I can justify it given the increased accuracy of the tests and the increased confidence that brings. I just wish the cost weren't so high. Hopefully, as others start leaning on containers and Docker more for test execution, people will figure out how to make some of these problems disappear.
