mozilla-central Build Times

July 29, 2012 at 01:20 PM | categories: Mozilla, build system

In my previous post, I explained how Mozilla's build system works. In this post, I want to give people a feel for where time is spent and to identify obvious areas that need improvement.

To my knowledge, nobody has provided such a comprehensive collection of measurements before. Thus, most of our thoughts on where build time goes have been largely lacking scientific rigor. I hope this post changes matters and injects some much-needed quantitative figures into the discussion.

Methodology

All the times reported in this post were obtained from my 2011 generation MacBook Pro. It has 8 GB of RAM and a magnetic hard drive. It is not an ultimate build rig by any stretch of the imagination. However, it's no slouch either. The machine has 1 physical Core i7 processor with 4 cores, each clocked at 2.3GHz. Each core has hyperthreading, so to the OS there appears to be 8 cores. For the remainder of this post, I'll simply state that my machine has 8 cores.

When I obtained measurements, I tried to limit the number of processes running elsewhere on the machine. I also performed multiple runs and reported the best timings. This means that the results are likely synthetic and highly influenced by the page cache. More on that later.

I configured make to use up to 8 parallel processes (adding -j8 to the make flags). I also used silent builds (the -s flag to make). Silent builds are important because terminal rendering can add many seconds of wall time to builds, especially on slow terminals (like Windows). I measured results with make output being written to the terminal. In hindsight, I wish I hadn't done this. Next time.

To obtain the times, I used the ubiquitous time utility. Wall times are the real times from time. CPU time is the sum of the user and sys times.

CPU utilization is the percentage of CPU cores busy during the wall time of execution. In other words, I divided the CPU time by 8 times the wall time (8 for the number of cores in my machine). 100% is impossible to obtain, as obviously the CPU on my machine is doing other things during measurement. But, I tried to limit the number of background processes running, so there shouldn't have been that much noise.
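To make the methodology concrete, here is a rough sketch of what one timing run and the utilization arithmetic look like. The $OBJDIR placeholder and the bc one-liner are illustrative conveniences, not a transcript of the harness I actually used:

    # Time a build with silent output and up to 8 parallel jobs.
    # $OBJDIR stands in for the configured object directory.
    time make -s -j8 -C "$OBJDIR"

    # CPU time = user + sys; utilization = CPU time / (cores * wall time).
    # Example with the clobber totals reported below: 5589 / (8 * 1424) ~= 49.1%.
    real=1424; cpu=5589; cores=8
    echo "scale=3; 100 * $cpu / ($cores * $real)" | bc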

I built a debug version of Firefox (the browser app in mozilla-central) using r160922 of the Clang compiler (pulled and built the day I did this measurement). The revision of mozilla-central being built was 08428edb1e89. I also had --enable-tests, which adds a significant amount of extra work to builds.

Configure

time reported the following for running configure:

real 0m25.927s
user 0m9.595s
sys  0m8.729s

This is a one-time cost. Unless you are hacking on the build system or pull down a tree change that modified the build system, you typically don't need to worry about this.

Clobber Builds with Empty ccache

I built each tier separately with an empty ccache on a recently-configured object directory. This measures the optimal worst case time for building mozilla-central. In other words, we have nothing cached in the object directory, so the maximum amount of work needs to be performed. Since I measured multiple times and used the best results, this is what I mean by optimal.
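For reference, here is roughly how this kind of per-tier measurement can be scripted. The tier_<name> target naming is an assumption about the top-level Makefile of this era (and $OBJDIR is again a placeholder); check your tree for the exact target names:

    # Start from an empty ccache so every compilation does real work.
    ccache -C

    # Time each tier separately. Target names are assumed to follow the
    # tier_<name> convention of the top-level Makefile.
    for tier in base nspr js platform app; do
        ( cd "$OBJDIR" && time make -s -j8 tier_$tier )
    done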

The table below contains the measurements. I omitted CPU utilization calculation for small time values because I don't feel it is relevant.

Tier - Sub-tier Wall Time (s) CPU Time (s) CPU Utilization
base export 0.714 0.774 N/A
base libs 5.081 8.620 21%
base tools 0.453 0.444 N/A
base (total) 6.248 9.838 19.7%
nspr 9.309 8.404 11.3%
js export 1.030 1.877 22.8%
js libs 71.450 416.718 72.9%
js tools 0.324 0.246 N/A
js (total) 72.804 418.841 71.9%
platform export 40.487 141.704 43.7%
platform libs 1211 4896 50.5%
platform tools 70.416 90.917 16.1%
platform (total) 1312 5129 48.9%
app export 4.383 3.059 8.7%
app libs 18.727 18.976 12.7%
app tools (no-op) 0.519 0.968 N/A
app (total) 23.629 23.003 12.2%
Total 1424 (23:44) 5589 (93:15) 49.1%

It's worth calling out that linking libxul is part of the platform libs tier. libxul linking is unlike most other parts of the build: it is more I/O bound and can't use multiple cores. On my machine, libxul linking (not using gold) takes ~61s, during which only 1 CPU core is in use. That ~61s of wall time corresponds to roughly 5% of platform libs' wall time. Yet even if we subtract this ~61s from the effective CPU calculation, the percentage doesn't change much.

Clobber with Populated ccache

Using the ccache from a recently built tree to make C/C++ compilation faster, I measured how long it took each tier to build on a clobber build.

This measurement can be used to estimate the overhead of C/C++ compilation during builds. In theory, the difference in CPU times between this and the former measurement will be the amount of CPU time spent in the C/C++ compiler.

This will also isolate how much time we spend not in the C/C++ compiler. It will arguably be very difficult to make the C/C++ compiler faster (although things like reducing the abuse of templates can have a measurable impact). However, we do have control over many of the other things we do. If we find that CPU time spent outside the C/C++ compiler is large, we can look for pieces to optimize.
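If you want to sanity-check the compiler-time estimate on your own machine, ccache's statistics offer a rough cross-check. These are standard ccache flags; the build invocation mirrors the earlier sketch:

    ccache -z                       # zero the ccache statistics
    time make -s -j8 -C "$OBJDIR"   # clobber build against the populated ccache
    ccache -s                       # hit/miss counts show how much real compiling happened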

Tiers not containing compiled files are omitted from the data.

Tier - Sub-tier Wall Time (s) CPU Time (s) ccache savings (s) (Time in Compiler)
base libs 1.075 1.525 7.095
base tools 1.957 0.933 1.522
nspr 5.582 1.688 6.716
js libs 22.829 9.529 407.189
platform libs 431 328 4568
platform tools 14.498 25.744 65.173
app libs 10.193 15.809 3.167
Total 487.134 (8:07) 383.229 (6:23) 5059 (84:19)

No-op Build

A no-op build is a build performed in an object directory that was just built. Nothing changed in the source repository or the object directory, so theoretically the build should do nothing. And, it should be fast.

In reality, our build system isn't smart and performs some redundant work. Part of that redundant work exists because one of the first things the main Makefile does before invoking the tiers is delete a large chunk of the dist/ directory and the entirety of the _tests/ directory from the object directory.

In these measurements, I bypassed the deletion of these directories. In other words, I measure what no-op builds are if we eliminate the clown shoes around blowing away large parts of the object directory.

Tier - Sub-tier Wall Time (s) CPU Time (s)
base export 0.524 0.537
base libs 0.625 0.599
base tools 0.447 0.437
nspr 0.809 0.752
js export 0.334 0.324
js libs 0.375 0.361
platform export 10.904 13.136
platform libs 30.969 44.25
platform tools 8.213 10.737
app export 0.524 1.006
app libs 6.090 13.753
Total 59.814 85.892

So, no-op builds use ~60s of wall time and only make use of 17.9% of available CPU resources.

No-op Build With Directory Removal Silliness

As mentioned above, before the tiers are iterated, the top-level Makefile blows away large parts of dist/ and the entirety of _tests/. What impact does this have?

In this section, I try to isolate how much time was thrown away by doing this.

First, we have to account for the deletion of these directories. On my test build, deleting 15,005 files in these directories took ~3 seconds.
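This kind of number is easy to approximate yourself. Keep in mind the build removes only a chunk of dist/, not all of it, so treat the following as an illustration rather than an exact reproduction of what the Makefile does:

    # Count the files in question, then time their removal.
    find "$OBJDIR/dist" "$OBJDIR/_tests" -type f | wc -l   # ~15,000 files in my case
    time rm -rf "$OBJDIR/_tests" "$OBJDIR/dist"            # slight overstatement: the build
                                                           # deletes only parts of dist/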

The table below contains my results. This is a more accurate reading than the above on how long no-op builds take because this is actually what we do during normal builds. The time delta column contains the difference between this build and a build without the removal silliness. Positive times can be attributed to overhead associated with repopulating dist/ and _tests/.

Tier - Sub-tier Wall Time (s) Wall Time Delta (s) CPU Time (s) CPU Time Delta (s)
base export 0.544 Negligible 0.559 Negligible
base libs 0.616 Negligible 0.594 Negligible
base tools 0.447 Negligible 0.436 Negligible
nspr 0.803 Negligible 0.743 Negligible
js export 0.338 Negligible 0.329 Negligible
js libs 0.378 Negligible 0.363 Negligible
platform export 13.140 2.236 13.314 Negligible
platform libs 35.290 4.329 43.059 -1.191 (normal variance?)
platform tools 8.819 0.606 10.983 0.246
app export 0.525 Negligible 1.012 Negligible
app libs 8.876 2.786 13.527 -0.226
Total 69.776 9.962 84.919 -0.973 (normal variance)

If a delta is listed as negligible, it was within 100ms of the original value and I figured this was either due to expected variance between runs or below our threshold for caring. In the case of the base, nspr, and js tiers, the delta was actually much smaller than 100ms, often less than 10ms.

It certainly appears that the penalty for deleting large parts of dist/ and the entirety of _tests/ is about 10 seconds.

The Overhead of Make

We've measured supposed no-op build times. As I stated above, our no-op builds actually aren't no-op builds. Even if we bypass the deletion of dist/ and _tests/, we always evaluate some make rules. Can we measure how much work it takes to just load the makefiles without actually doing anything? This would allow us to get a rough estimate of how much time we are wasting by doing redundant work. It will also help us establish a baseline for make overhead.

Turns out we can! Make has a --dry-run argument which evaluates the makefiles but doesn't actually do anything. It simply prints what would have been done.

Using --dry-run, I timed the different tiers. These times should roughly be the overhead associated with make itself; the difference from a no-op build is then the time spent actually running commands. It is possible that --dry-run adds a little overhead because it prints the commands that would have been executed. (Previous timings were using -s, which suppresses this.)
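A sketch of the dry-run timing, with the printed commands discarded so terminal output doesn't inflate the wall time (the tier target naming is, as before, assumed):

    # Evaluate the makefiles for one tier without running any commands.
    ( cd "$OBJDIR" && time make -j8 --dry-run tier_platform > /dev/null )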

The delta times in the following table are the difference in times between the true no-op build from above (the one where we don't delete dist/ and _tests/) and the times measured here. It roughly isolates the amount of time spent outside of make, doing redundant work.

Tier - Sub-tier Wall Time (s) Wall Time Delta (s) CPU Time (s) CPU Time Delta (s)
base export 0.369 0.155 0.365 0.172
base libs 0.441 0.184 0.431 0.168
base tools 0.368 0.079 0.364 0.073
nspr 0.636 0.173 0.591 0.161
js export 0.225 0.109 0.225 0.099
js libs 0.278 0.097 0.273 0.088
platform export 3.841 7.063 6.108 7.028
platform libs 8.938 22.031 14.723 29.527
platform tools 3.962 4.251 6.185 4.552
app export 0.422 0.102 0.865 0.141
app libs 0.536 5.554 1.148 12.605
Total 20.016 39.798 31.278 54.614

Observations

The numbers say a lot. I'll get to key takeaways in a bit.

First, what the numbers don't show is the variance between runs. Subsequent runs are almost always significantly faster than the initial one, even on no-op builds. I suspect this is due mostly to I/O wait. In the initial tier run, files are loaded into the page cache. Then, in subsequent runs, all I/O comes from physical memory rather than waiting on a magnetic hard drive.

Because of the suspected I/O related variance, I fear that the numbers I obtained are highly synthetic, at least for my machine. It is unlikely I'll ever see all these numbers in one mozilla-central build. Instead, it requires a specific sequence of events to obtain the best times possible. And, this sequence of events is not likely to correspond with real-world usage.

That being said, I think these numbers are important. If you remove I/O from the equation - say you have an SSD with near-zero service times or enough memory that you don't need to touch the hard drive - these numbers tell you what limits you are brushing up against. And, as computers get more powerful, I think more and more people will cross this threshold and will be more limited by the build system than by the capabilities of their hardware. (A few months ago, I measured resource usage when compiling mozilla-central on Linux and concluded you need roughly 9GB of dedicated memory to compile and link mozilla-central without page cache eviction. In other words, if building on a machine with only 8GB of RAM, your hard drive will play a role.)

Anyway, to the numbers.

I think the most important number in the above tables is 49.1%. That is the effective CPU utilization during a clobber build. This means that during a build, on average half of the available CPU cores are unused. Now, I could be generous and bump this number to 50.7%. That's what the effective CPU utilization is if you remove the ~60s of libxul linking from the calculation.

The 49.1% has me reaching the following conclusions:

  1. I/O wait really matters.
  2. Our recursive use of make is incapable of executing more than 4 items at a time on average (assuming 8 cores).
  3. My test machine had more CPU wait than I thought.

I/O wait is easy to prove: compare times on an SSD (or another I/O setup with near-zero service times, e.g. a filled page cache with no eviction, like a machine with 16+ GB of memory that has built mozilla-central recently) against times on a magnetic drive.
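One way to see this on your own machine is to compare a cold page cache against a warm one. purge is the macOS command for dropping the filesystem cache; on Linux, writing to /proc/sys/vm/drop_caches does the same. Both may require elevated privileges:

    sudo purge                      # drop the page cache (Linux: echo 3 > /proc/sys/vm/drop_caches)
    time make -s -j8 -C "$OBJDIR"   # no-op build with a cold cache: dominated by I/O wait
    time make -s -j8 -C "$OBJDIR"   # same no-op build with a warm cache: noticeably faster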

A derived time not captured in any table is 11:39. This is the total CPU time of a clobber build (93:15) divided by the number of cores (8). If we had 100% CPU utilization across all cores during builds, we should be able to build mozilla-central in 11:39. This is an ideal figure and won't be reached. As mentioned above, libxul linking takes ~60s itself! I think 13:00 is a more realistic optimal compilation time for a modern 8 core machine. This points out a startling number: we are wasting ~12 minutes of wall time due to not fully utilizing CPU cores during clobber builds.
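Spelled out as a quick bc check, using only the numbers already reported above:

    # 93:15 of CPU time is 5595 seconds; spread perfectly across 8 cores:
    echo "5595 / 8" | bc            # => 699 seconds, i.e. ~11:39 of wall time
    # The actual clobber wall time was 23:44 (1424 seconds), so the waste is roughly:
    echo "(1424 - 699) / 60" | bc   # => ~12 minutes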

Another important number is 5059 out of 5589, or 90.5%. That is the CPU time in a clobber build spent in the C/C++ compiler, as measured by the speedup of using ccache. It's unlikely we are going to make the C/C++ compiler go much faster (short of not compiling things). So, this is a big fat block of time we will never be able to optimize. On my machine, compiling mozilla-central will always take at least ~10:30 of wall time just compiling C/C++.

A clobber build with a saturated ccache took 487s wall time but only 383s CPU time. That's only about 10% total CPU utilization. This represents only 6.8% of the total CPU time from the original clobber build, although it is 34.2% of the total wall time.

The above means that everything that isn't the C/C++ compiler is horribly inefficient. These are clown shoes of epic proportions. We're not even using 1 full core for build actions outside of the C/C++ compiler!

Because we are inefficient when it comes to core usage, I think a key takeaway is that throwing more cores at the existing build system will have diminishing returns. Sure, some parts of the build system today could benefit from it (mainly js, layout, and dom, as they have Makefiles with large numbers of source files). But, most of the build system won't take advantage of many cores. If you want to throw money at a build machine, I think your first choice should be an SSD. If you can't do that, have as much memory as you can so most of your filesystem I/O is serviced by the page cache, not your disk drive.

In the final table, we isolated how much time make spends just figuring out what to do. That amounts to ~20 seconds wall time and ~31s CPU time. That leaves ~40s wall and ~55s CPU for non-make work during no-op builds. Translation: we are doing 40s of wall time work during no-op builds. Nothing changed. We are throwing 40s of wall time away because the build system isn't using proper dependencies and is doing redundant work.

I've long been a critic of us blowing away parts of dist/ and _tests/ at the top of builds. Well, after measuring it, I have mixed reactions. It only amounts to ~10s of added time to builds. This doesn't seem like a lot in the grand scheme of things. However, this is ~10s on top of the ~60s it actually takes to iterate through the tiers. So, in terms of percentages for no-op builds, it is actually quite significant.

No-op builds with the existing build system take ~70s under ideal conditions. In order of time, the breakdown is roughly:

  • ~40s for doing redundant work in Makefiles
  • ~20s for make traversal and loading overhead
  • ~10s for repopulating deleted content from dist/ and _tests/

In other words, ~50s of ~70s no-op build times are spent doing work we have already done. This is almost purely clown shoes. Assuming we can't make make traversal and loading faster, the shortest possible no-op build time will be ~20s.

Splitting things up a bit more:

  • ~22s - platform libs make evaluation
  • ~20s - make file traversal and loading (readying for evaluation)
  • ~10s - repopulating deleted content from dist/ and _tests/
  • ~7s - platform export make evaluation
  • ~5.5s - app libs make evaluation
  • ~4s - platform tools make evaluation

The ~20s for make file traversal and loading is interesting. I suspect (although I haven't yet measured) that a lot of this is due to the sheer size of rules.mk. As I measured on Friday, the overhead of rules.mk with pymake is significant. I hypothesized that it would have a similar impact on GNU make. I think a good amount of this ~20s is similar overhead. I need to isolate it, however. I am tempted to say that if we truly did no-op builds and made Makefiles load into make faster, we could attain no-op build times in the ~10s range. I think this is pretty damn good! Even ~20s isn't too bad. As surprising as it is for me to say it, recursive make is not (a significant) part of our no-op build problem.

Why is the Build System Slow?

People often ask the question above. As the data has told me, the answer, like the answer to many complicated problems, is nuanced.

If you are doing a clobber build on a fresh machine, the build system is slow because 1) compiling all the C/C++ takes a lot of time (84:19 of CPU time, actually) and 2) we don't make efficient use of all available cores when building. Half of the CPU horsepower during a fresh build goes unharnessed.

If you are doing a no-op build, the build system is slow mainly because it is performing a lot of needless and redundant work. A significant contributor is the overhead of make, probably due to rules.mk being large.

If you are doing an incremental build, you will fall somewhere between these two extremes. You will likely get nipped by both inefficient core usage and redundant work. Which one hurts the most depends on the scope of the incremental change.

If you are building on a machine with a magnetic hard drive (not an SSD), your builds are slow because you are waiting on I/O. You can combat this by putting 8+GB of memory in your system and doing your best to ensure that building mozilla-central can use as much of it as possible. I highly recommend 12GB, if not 16GB.

Follow-ups

The measurements reported in this post are only the tip of the iceberg. If I had infinite time, I would:

  • Measure other applications, not just browser/Firefox. I've heard that mobile/Fennec's build config is far from optimal, for example. I would love to quantify that.
  • Set up buildbot to record and post measurements so we have a dashboard of build times. We have some of this today, but the granularity isn't as fine as what I captured.
  • Record per-directory times.
  • Isolate time spent in different processes (DTrace could be used here).
  • Capture I/O numbers.
  • Correlate the impact of I/O service times on build times.
  • Isolate the overhead of ccache (mainly in terms of I/O).
  • Obtain numbers on other platforms and systems. Ensure results can be reproduced.

Next Steps

If we want to make our existing recursive make build backend faster, I recommend the following actions (in no particular order):

  1. Factor pieces out of rules.mk into separate .mk files and conditionally load them based on the presence of specific variables. In other words, finish what we have started. This definitely cuts down on the overhead with pymake (as measured on Friday) and likely makes GNU make faster as well.
  2. Don't blow away parts of dist/ and _tests/ at the top of builds. I know this introduces a problem where we could leave orphaned files in the object directory. We should solve this problem by having proper manifests for everything so we can detect and delete orphans. The cheap man's solution is to periodically clobber these directories.
  3. Don't perform unnecessary work during no-op builds. I suspect a lot of redundant work is due to rules in Makefiles, not the rules in rules.mk. As we eliminate rules from Makefiles, this problem should gradually go away since rules.mk is generally intelligent about these things.
  4. More parallelism. I'm not sure how we're going to solve this with recursive make short of using PARALLEL_DIRS more and/or consolidating Makefiles.

Again, these steps apply to our current recursive make build backend.

Because the most significant losses are due to unrealized parallelism, our focus should be on increasing parallelism. We can only do this so much with recursive make. It is clear now more than ever that recursive make needs to be replaced with something that can fully realize the potential of multiple CPU cores. That could be non-recursive make or a separate build backend altogether.

We will likely not have an official alternate build backend soon. Until then, there is no shortage of clown shoes that can be looked at.

The redundant work during no-op builds is definitely tempting to address, as I think it has a significant impact on most developers. Eliminating the absurdly long no-op build times removes the need for hacks like smart-make and instills a culture of trusting the build system.

I suspect a lot of the redundant work during no-op builds is due to poorly implemented rules in individual Makefiles rather than silliness in rules.mk. Therefore, removing rules from Makefiles again seems to be one of the most important things we can do to make the build system faster. It also prepares us for implementing newer build backends, so it is a win-win!