Visualizing Mozilla's release infrastructure machine efficiency

August 30, 2013 at 12:30 PM | categories: Mozilla

Have you ever wondered what the machines in Mozilla's automation infrastructure are doing all the time? I have. So, I decided to create a visualization of this data. You can find it at http://automation-dashboard.paas.allizom.org/.

When you start looking at the visualizations, you notice something: there's a lot of time when machines aren't doing anything! All that white space is time when machines aren't processing jobs. This is capacity Mozilla is failing to utilize.

While some may say Mozilla's automation infrastructure has a load or capacity problem, I say it has an efficiency problem. The average machine in our automation infrastructure is doing work less than 50% of the time during weekdays. Now, some of this might be VMs that are powered off (due to low demand). But considering physical machines are also under-utilized, I'd say it's a global problem.
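To make that metric concrete, here is a minimal sketch of how a per-machine busy-time percentage could be computed from job records. The record format (machine name plus start/end timestamps), the machine names, and the 8-hour window are all assumptions for illustration, not the dashboard's actual schema or data.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical job records: (machine, start, end). The real data source
# likely has a different schema; this only illustrates the metric.
jobs = [
    ("tst-linux64-ec2-001", datetime(2013, 8, 26, 9, 0), datetime(2013, 8, 26, 9, 40)),
    ("tst-linux64-ec2-001", datetime(2013, 8, 26, 11, 15), datetime(2013, 8, 26, 11, 50)),
    ("bld-lion-r5-042", datetime(2013, 8, 26, 9, 30), datetime(2013, 8, 26, 12, 0)),
]

window = timedelta(hours=8)  # an assumed weekday measurement window

# Sum up time spent running jobs per machine.
busy = defaultdict(timedelta)
for machine, start, end in jobs:
    busy[machine] += end - start

# Busy fraction = job time / window length.
for machine, total in sorted(busy.items()):
    pct = 100.0 * total.total_seconds() / window.total_seconds()
    print("%s: %.1f%% busy" % (machine, pct))
```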

Oh, and don't get too hung up on the machine-job efficiency metric. While important, it's only part of the problem. When jobs are running, they are typically only using a fraction of the CPU available to them. From data now available in mozharness, we know that many test suites only use 10-15% CPU. If you combine this with sub-50% machine utilization in terms of jobs, I estimate we're only utilizing somewhere between 5-10% of available CPU cycles in our automation infrastructure. We have an order of magnitude more capacity in the machines we already have. We don't have a capacity problem, we have an efficiency problem. In my opinion, we should throw less time and money at new hardware and invest in maximizing the return on what we already have.
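As a back-of-the-envelope check on that estimate, here is the arithmetic spelled out. The inputs are the rough figures cited above, not precise measurements:

```python
# Rough, illustrative inputs from the figures above.
machine_busy_fraction = 0.50        # machines busy with jobs < 50% of the time
cpu_during_job = 0.125              # midpoint of the 10-15% CPU range from mozharness data

overall = machine_busy_fraction * cpu_during_job
print("Estimated overall CPU utilization: %.1f%%" % (overall * 100))
# Prints ~6.2%, which lands in the 5-10% range cited above.
```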

Edit 2013-09-03

I started a thread about this data in another forum and others have pointed out that the data in this post is incomplete. Notably absent from this data is when on-demand EC2 instances are shut down, when there are or aren't jobs scheduled (if jobs aren't scheduled, poor machine utilization isn't such a big deal), and when some maintenance tasks are performed (e.g. Panda boards are checked for consistency between jobs). While we could probably hook job scheduling data up to the graph and numbers easily, I don't believe data on slave uptime or background tasks is publicly available. Perhaps we should publish this data to facilitate deeper analysis.

A goal of this post was to shed light on how little we utilize some of the machines in our automation infrastructure in order to inspire a conversation and ultimately to address the perceived problem. It was not a goal to point fingers and cast blame. If I inadvertently performed the latter or seemed to jump to conclusions based on incomplete data, I apologize.