Quantifying Mozilla's Automation Efficiency

July 14, 2013 at 11:15 PM | categories: Mozilla

Mozilla's build and test automation now records system resource usage (CPU, memory, and I/O) during mozharness jobs. This is accomplished through a generic resource collection feature added to mozharness: if a mozharness script inherits from a specific class, it magically performs system resource collection and reporting!
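Conceptually, the mixin pattern is simple. The sketch below is purely illustrative - the class and method names are invented, not mozharness's actual API - and only shows the general shape of inheriting a class that samples psutil on a background thread while the script does its normal work:

```python
import threading
import time

import psutil


class ResourceCollectionMixin(object):
    """Hypothetical mixin: sample system resources while a script runs."""

    def start_resource_collection(self, interval=1.0):
        self._samples = []
        self._stop = threading.Event()

        def sample():
            while not self._stop.is_set():
                self._samples.append({
                    'cpu_percent': psutil.cpu_percent(interval=None),
                    'memory_percent': psutil.virtual_memory().percent,
                    'io': psutil.disk_io_counters(),
                })
                time.sleep(interval)

        self._thread = threading.Thread(target=sample)
        self._thread.daemon = True
        self._thread.start()

    def stop_resource_collection(self):
        self._stop.set()
        self._thread.join()
        return self._samples


class SomeTestScript(ResourceCollectionMixin):
    """A script inheriting the mixin gets collection 'for free'."""

    def run(self):
        self.start_resource_collection()
        # ... perform the actual job here ...
        print('collected %d samples' % len(self.stop_resource_collection()))
```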

I want to emphasize that the current state of the feature is far from complete. There are numerous shortcomings and areas for improvement:

  • At the time I'm writing this, the mozharness patch is only deployed on the Cedar tree. Hopefully it will be deployed to the production infrastructure shortly.
  • This feature only works for mozharness jobs. Notably absent are desktop builds.
  • psutil - the underlying Python package used to collect data - isn't yet installable everywhere. As Release Engineering rolls it out to other machine classes in bug 893254, those jobs will magically start collecting resource usage.
  • While detailed resource usage is captured during job execution, we currently only report a very high-level summary at job completion time. This will be addressed with bug 893388.
  • Jobs running on virtual machines appear to misreport CPU usage (presumably because CPU steal is being counted as utilized CPU). Bug 893391 tracks this.
  • You need to manually open logs to view resource usage. (e.g. open this log and search for Total resource usage.) I hope to one day have key metrics reported in TBPL output and/or easily graphable.
  • Resource collection operates at the system level. Because only 1 job runs on a machine at a time and slaves typically do little else, we assume system resource usage is a sufficient proxy for automation job usage. This obviously isn't always correct, but it was the easiest approach to implement initially. (A short sketch of the system-level vs. per-process distinction follows this list.)
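To make that last point concrete, here is roughly the difference between system-level and per-process collection in psutil. This is a sketch for illustration only, not the code automation actually runs:

```python
import os

import psutil

# System-wide counters: everything running on the machine, not just the job.
system_cpu = psutil.cpu_percent(interval=1.0)
system_mem = psutil.virtual_memory().percent

# Per-process accounting would look more like this. But it requires tracking
# the job's entire process tree (children included), which is harder to get
# right - hence the system-level proxy for now.
job = psutil.Process(os.getpid())
job_cpu = job.cpu_percent(interval=1.0)

print('system CPU: %.1f%%  this process CPU: %.1f%%' % (system_cpu, job_cpu))
```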

That's a lot of shortcomings! It essentially means only OS X test jobs are providing meaningful data right now. But you have to start somewhere. And we have more data now than we did before. That's progress.

Purpose and Analysis

Collecting resource usage of automation jobs (something I'm quite frankly surprised we weren't doing before) should help raise awareness of inefficient machine utilization and other hardware problems. It will allow us to answer questions such as: are the machines working as hard as they can? Is a particular hardware component contributing to slower automation execution?

Indeed, a casual look at the first days of data has revealed some alarming readings, notably the abysmal CPU efficiency of our test jobs. For an OS X 10.8 opt build, the xpcshell job utilized an average of only 10% CPU during execution. A browser chrome job utilized only 12% CPU on average. Finally, a reftest job utilized only 13%.

Any CPU cycle not utilized by our automation infrastructure is forever lost and cannot be put to work again. So utilizing only 10-13% of available CPU cycles during test jobs wastes a lot of machine potential. It is the equivalent of buying 7.7 to 10 machines and only turning 1 of them on! Or, in terms of time, full utilization could in theory reduce the wall time of a 1 hour job to between 6:00 and 7:48. Or, in terms of overall automation load, it would significantly decrease the backlog and turnaround time. You get my drift. This is why parallelizing test execution within test suites - a means to increase CPU utilization - is such an exciting project to me. This work is all tracked in bug 845748 and in my opinion it cannot complete soon enough. (I'd also like to see more investigation into bottlenecks in test execution. Even small improvements of 1 or 2% can have a measurable impact when multiplied by thousands of tests per suite and hundreds of test job runs per day.)
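For the curious, the machine-equivalent and wall-time figures above are just straightforward arithmetic on the utilization numbers. A quick way to reproduce them:

```python
# Back-of-the-envelope math behind the 7.7-10 machine and 6:00-7:48 figures.
for utilization in (0.10, 0.13):
    machine_equivalent = 1 / utilization   # machines bought per machine's worth of work
    ideal_minutes = 60 * utilization       # 1 hour of work if run at 100% CPU
    minutes, seconds = divmod(int(round(ideal_minutes * 60)), 60)
    print('%2.0f%% CPU -> %.1f machine equivalent, ideal wall time %d:%02d' % (
        utilization * 100, machine_equivalent, minutes, seconds))
```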

Another interesting observation is that there is over 1 GB of write I/O during some test jobs. Browser chrome tests write close to 2 GB! That is surprisingly high to me. Are the tests really incurring that much I/O? If so, which ones? Do they need to? If not the tests, what background service is performing that much work? Could I/O wait be slowing tests down? Should we invest in more SSDs? More science is definitely needed.
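Incidentally, a rough local spot-check of the write I/O question is easy with psutil. A sketch - the command is a placeholder, so substitute whatever job you want to measure, and remember this counts machine-wide I/O, not just the job's:

```python
import subprocess

import psutil

# Snapshot machine-wide disk counters, run the job, then diff the counters.
before = psutil.disk_io_counters()
subprocess.call(['make', 'mochitest-browser-chrome'])  # placeholder command
after = psutil.disk_io_counters()

written_gb = (after.write_bytes - before.write_bytes) / float(1024 ** 3)
print('wrote %.2f GB during the job' % written_gb)
```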

I hope people find this data useful and that we put it to use to make more data-driven decisions around Mozilla's automation infrastructure.