Bulk Analysis of Mozilla's Build and Test Data

April 01, 2013 at 01:12 PM | categories: Mozilla, Firefox

When you push code changes to Firefox and other similar Mozilla projects, a flood of automated jobs is triggered on Mozilla's infrastructure. It works like any other continuous integration system: first you build, then you run tests, etc. What sets it apart is the scale: Mozilla runs thousands of jobs per week, and their combined output adds up to tens of gigabytes.

Most of the data from Mozilla's continuous integration is available on public servers, notably ftp.mozilla.org. This includes compiled binaries, logs, etc.
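As an illustration, here is a minimal Python sketch of pulling one raw log over HTTP from ftp.mozilla.org and caching it locally. The directory path and file name are hypothetical; the real layout varies by branch, date, and builder.

```python
import os
import urllib.request

# Hypothetical log location; actual paths under ftp.mozilla.org
# differ per branch, timestamp, and builder name.
BASE = "https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds"
LOG = "mozilla-central-linux64/1364000000/build.log.gz"

def fetch_log(dest_dir="logs"):
    """Download one raw build log, skipping it if already cached."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(LOG))
    if not os.path.exists(dest):  # fetch once, analyze many times
        urllib.request.urlretrieve("%s/%s" % (BASE, LOG), dest)
    return dest

if __name__ == "__main__":
    print(fetch_log())
```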

While there are tools that can sift through this mountain of data (like TBPL), they don't allow ad-hoc queries over the raw data. Furthermore, these tools are very function-specific, and there are many data views they don't expose. These missing views have always bothered me because, well, there are cool and useful things I'd like to do with this data.

This itch has been bothering me for well over a year. The persistent burning sensation coupled with rain over the weekend caused me to scratch it.

The product of my weekend labor is a system facilitating bulk storage and analysis of Mozilla's build data. While it's currently very alpha, it's already showing promise for more thorough data analysis.

Essentially, the tool works by collecting the dumps of all the jobs executed on Mozilla's infrastructure. It can optionally supplement this with the raw logs from those jobs. Then, it combs through this data, extracts useful bits, and stores them. Once the initial fetching has completed, extracting a new metric is just a matter of re-parsing the local data set. And, since all data is stored locally, performance is not bound by Internet bandwidth: you can compute a new metric far faster than if you had to fetch the raw data all over again. The downside is you will likely be storing gigabytes of raw data locally. But disks are cheap, and you control what gets pulled in, so you can limit it to what you need.
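To make the fetch-once, parse-many idea concrete, here is a minimal sketch of the parsing phase. The log format, the regular expression, and the SQLite schema are all assumptions for illustration; the actual tool extracts and stores different fields.

```python
import gzip
import re
import sqlite3
from pathlib import Path

# Hypothetical timing marker; real logs encode timings differently
# depending on the job type.
ELAPSED_RE = re.compile(rb"elapsed:\s+(\d+) secs")

def parse_logs(log_dir="logs", db_path="builds.db"):
    """Re-parse every locally cached log and record elapsed times.

    Because the raw logs already live on disk, adding a new metric
    only means tweaking this function and re-running it; nothing
    needs to be re-downloaded.
    """
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS job_times (log TEXT, secs INT)")
    for path in Path(log_dir).glob("*.log.gz"):
        with gzip.open(path) as f:
            m = ELAPSED_RE.search(f.read())
        if m:
            db.execute("INSERT INTO job_times VALUES (?, ?)",
                       (path.name, int(m.group(1))))
    db.commit()
    return db

if __name__ == "__main__":
    db = parse_logs()
    for row in db.execute("SELECT log, secs FROM job_times"):
        print(row)
```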

Please note the project is very alpha and currently only serves my personal interests. However, I know there is talk about TBPL2, and what I have built could evolve into the data store for the next-generation TBPL tool. Also, most of the work so far has centered on data import. There is tons of analysis code waiting to be written.

If you are interested in improving the tool, please file a GitHub pull request.

I hope to soon blog about useful information I've obtained through this tool.