Quantifying Mozilla's Automation Efficiency

July 14, 2013 at 11:15 PM | categories: Mozilla

Mozilla's build and test automation now records system resource usage (CPU, memory, and I/O) during mozharness jobs. This is implemented as a generic resource collection feature in mozharness: if a mozharness script inherits from a specific class, it magically performs system resource collection and reporting!
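Under the hood the data comes from the psutil Python package (more on that below). As a rough illustration of what system-wide sampling looks like, here is a minimal sketch; the helper is hypothetical and is not the actual mozharness code:

import time
import psutil

def sample_resources(duration=60, interval=1.0):
    """Poll system-wide CPU, memory, and I/O counters for `duration` seconds."""
    samples = []
    start_io = psutil.disk_io_counters()
    deadline = time.time() + duration
    while time.time() < deadline:
        # cpu_percent() blocks for `interval` seconds, which paces the loop.
        samples.append({
            'cpu_percent': psutil.cpu_percent(interval=interval),
            'memory_percent': psutil.virtual_memory().percent,
        })
    end_io = psutil.disk_io_counters()
    return {
        'avg_cpu_percent': sum(s['cpu_percent'] for s in samples) / len(samples),
        'write_bytes': end_io.write_bytes - start_io.write_bytes,
    }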

I want to emphasize that the current state of the feature is far from complete. There are numerous shortcomings and areas for improvement:

  • At the time I'm writing this, the mozharness patch is only deployed on the Cedar tree. Hopefully it will be deployed to the production infrastructure shortly.
  • This feature only works for mozharness jobs. Notably absent are desktop builds.
  • psutil - the underlying Python package used to collect data - isn't yet installable everywhere. As Release Engineering rolls it out to other machine classes in bug 893254, those jobs will magically start collecting resource usage.
  • While detailed resource usage is captured during job execution, we currently only report a very high-level summary at job completion time. This will be addressed with bug 893388.
  • Jobs running on virtual machines appear to misreport CPU usage (presumably due to CPU steal being counted as utilized CPU). Bug 893391 tracks.
  • You need to manually open logs to view resource usage. (e.g. open this log and search for Total resource usage.) I hope to one day have key metrics reported in TBPL output and/or easily graphable.
  • Resource collection operates at the system level. Because there is only 1 job running on a machine and slaves typically do little else, we assume system resource usage is a sufficient proxy for automation job usage. This obviously isn't always correct. But, it was the easiest to initially implement.

Those are a lot of shortcomings! And, it essentially means only OS X test jobs are providing meaningful data now. But, you have to start somewhere. And, we have more data now than we did before. That's progress.

Purpose and Analysis

Collecting resource usage of automation jobs (something I'm quite frankly surprised we weren't doing before) should help raise awareness of inefficient machine utilization and hardware problems. It will allow us to answer questions such as: are the machines working as hard as they can? Is a particular hardware component contributing to slower automation execution?

Indeed, a casual look at the first days of the data has shown some alarming readings, notably the abysmal CPU efficiency of our test jobs. For an OS X 10.8 opt build, the xpcshell job utilized an average of only 10% CPU during execution. A browser chrome job utilized only 12% CPU on average. Finally, a reftest job utilized only 13%.

Any CPU cycle not utilized by our automation infrastructure is forever lost and cannot be put to work again. So, utilizing only 10-13% of available CPU cycles during test jobs wastes a lot of machine potential. This is the equivalent of buying 7.7 to 10 machines and only turning 1 of them on! Or, in terms of time, full utilization would reduce the wall time of a 1 hour job to between 6:00 and 7:48. Or, in terms of overall automation load, it would significantly decrease the backlog and turnaround time. You get my drift. This is why parallelizing test execution within test suites - a means to increase CPU utilization - is such an exciting project to me. This work is all tracked in bug 845748 and in my opinion it cannot complete soon enough. (I'd also like to see more investigation into bottlenecks in test execution. Even small improvements of 1 or 2% can have a measurable impact when multiplied by thousands of tests per suite and hundreds of test job runs per day.)
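For the curious, here is the back-of-the-envelope arithmetic behind those machine-count and wall time figures (a throwaway sketch, not part of any tooling):

# Back-of-the-envelope arithmetic behind the figures above.
for utilization in (0.10, 0.13):
    print('%d%% CPU ~= buying %.1f machines and turning 1 on' % (
        utilization * 100, 1 / utilization))
    minutes = 60 * utilization
    print('  fully utilized 1 hour job: %d:%02d' % (
        minutes, round(minutes % 1 * 60)))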

Another interesting observation is that there is over 1 GB of write I/O during some test jobs. Browser chrome tests write close to 2GB! That is surprisingly high to me. Are the tests really incurring that much I/O? If so, which ones? Do they need to? If not tests, what background service is performing that much work? Could I/O wait be slowing tests down? Should we invest in more SSDs? More science is definitely needed.

I hope people find this data useful and that we put it to use to make more data-driven decisions around Mozilla's automation infrastructure.


The Importance of Time on Automated Machine Configuration

June 24, 2013 at 09:00 PM | categories: sysadmin, Mozilla, Puppet

Usage of machine configuration management software like Puppet and Chef has taken off in recent years. And rightly so - these pieces of software make the lives of countless system administrators much better (in theory).

In their default (and common) configuration, these pieces of software do a terrific job of ensuring a machine is provisioned with today's configuration. However, for many server provisioning scenarios, we actually care about yesterday's configuration.

In this post, I will talk about the importance of time when configuring machines.

Describing the problem

If you've worked on any kind of server application, chances are you've had to deal with a rollback. Some new version of a package or web application is rolled out to production. However, due to unforeseen problems, it needed to be rolled back.

Or, perhaps you operate a farm of machines that continuously build or compile software from version control. It's desirable to be able to reproduce the output from a previous build (ideally bit identical).

In these scenarios, the wall clock time plays a crucial role when dealing with a central, master configuration server (such as a Puppet master).

Since a client will always pull the latest revision of its configuration from the server, it's very easy to define your configurations such that the result of machine provisioning today is different from yesterday (or last week or last month).

For example, let's say you are running Puppet to manage a machine that sits in a continuous integration farm and recompiles a source tree over and over. In your Puppet manifest you have:

package { "gcc":
    ensure => latest,
}

If you run Puppet today, you may pull down GCC 4.7 from the remote package repository because 4.7 is the latest version available. But if you run Puppet tomorrow, you may pull down GCC 4.8 because the package repository has been updated! If for some reason you need to rebuild one of today's builds tomorrow (perhaps you want to rebuild that revision plus a minor patch), the two builds will use different compiler versions (or different versions of any other package, for that matter) and the output may not be consistent - it may not even work at all! So much for repeatability.

File templates are another example. In Puppet, file templates are evaluated on the server and the results are sent to the client. So, the output of file template execution today might be different from the output tomorrow. If you needed to roll back your server to an old version, you may not be able to do that because the template on the server isn't backwards compatible! This can be worked around, sure (commonly by copying the template and branching differences), but over time these hacks accumulate in a giant pile of complexity.

The common issue here is that time has an impact on the outcome of machine configuration. I refer to this issue as time-dependent idempotency. In other words, does time play a role in the supposedly idempotent configuration process? If the output is consistent no matter when you run the configuration, it is time-independent and truly idempotent. If it varies depending on when configuration is performed, it is time-dependent and thus not truly idempotent.

Solving the problem

My attitude towards machine configuration and automation is that it should be as time independent as possible. If I need to revert to yesterday's state or want to reproduce something that happened months ago, I want strong guarantees that it will be similar, if not identical. Now, this is just my opinion. I've worked in environments where we had these strong guarantees. And having had this luxury, I abhor the alternative, where so many pieces of configuration vary over time as the central configuration moves forward without the ability to turn back the clock. As always, your needs may be different and this post may not apply to you!

I said as possible a few times in the previous paragraph. While you could likely make all parts of your configuration time independent, it's not a good idea. In the real world, things change over time and making all configuration data static regardless of time will produce a broken or bad configuration.

User access is one such piece of configuration. Employees come and go. Passwords and SSH keys change. You don't want to revert user access to the way it was two months ago, restoring access to a disgruntled former employee or allowing access via a compromised password. Network configuration is another. Say the network topology changed and the firewall rules need updating. If you reverted the networking configuration, the machine likely wouldn't work on the network!

This highlights an important fact: if making your machine configuration time independent is a goal, you will need to bifurcate configuration by time dependency and solve for both. You'll need to identify every piece of configuration and ask: does this go in the bucket that is constant over time or the bucket that changes over time?

Machine configuration software can do a terrific job of ensuring an applied configuration is idempotent. The problem is it typically can't manage both time-dependent and time-independent attributes at the same time. Solving this requires a little brain power, but is achievable if there is will. In the next section, I'll describe how.

Technical implementation

Time-dependent machine configuration is a solved problem. Deploy Puppet master (or similar) and you are good to go.

Time-independent configuration is a bit more complicated.

As I mentioned above, the first step is to isolate all of the configuration you want to be time independent. Next, you need to ensure time dependency doesn't creep into that configuration. You need to identify things that can change over time and take measures to ensure those changes won't affect the configuration. I encourage you to employ the external system test: does this aspect of configuration depend on an external system or entity? If so, how will I prevent changes in it over time from affecting us?

Package repositories are one such external system. New package versions are released all the time. Old packages are deleted. If your configuration says to install the latest package, there's no guarantee the installed version will stay the same over time, because the package repository itself changes. If you simply pin a package to a specific version, that version may disappear from the server. The solution: pin packages to specific versions and run your own package mirror that doesn't delete or modify existing packages.

Does your configuration fetch a file from a remote server or use a file as a template? Cache that file locally (in case it disappears) and put it under version control. Have the configuration reference the version control revision of that file. As long as the repository is accessible, the exact version of the file can be retrieved at any time without variation.
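As a small illustration of the pinning idea, the following sketch retrieves a file exactly as it existed at a given revision of a local clone (the repository and path arguments are hypothetical):

import subprocess

def file_at_revision(repo, revision, path):
    """Return the contents of `path` as it existed at `revision` in a local clone."""
    # `git show <revision>:<path>` emits the stored blob without touching
    # the working directory.
    return subprocess.check_output(
        ['git', 'show', '%s:%s' % (revision, path)], cwd=repo)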

In my professional career, I've used two separate systems for managing time-independent configuration data. Both relied heavily on version control. Essentially, all the time-independent configuration data is collected into a single repository - a repository independent from all the time-dependent data (although that's technically an implementation detail). For Puppet, this would include all the manifests, modules, and files used directly by Puppet. When you want to activate a machine with a configuration, you simply say check out revision X of this repository and apply its configuration. Since revision X of the repository is constant over time, the set of configuration data being used to configure the machine is constant. And, if you've done things correctly, the output is idempotent over time.

In one of these systems, we actually had two versions of Puppet running on a machine. First, we had the daemon communicating with a central Puppet master. It was continually applying time-dependent configuration (user accounts, passwords, networking, etc). We supplemented this with a manually executed standalone Puppet instance. When you ran a script, it asked the Puppet master for its configuration. Part of that configuration was the revision of the time-independent Git repository containing the Puppet configuration files the client should use. The script then pulled the Git repo, checked out the specified revision, merged the Puppet master's settings for the node with that config (not the manifests, just some variables), then ran Puppet locally to apply the configuration. While a machine's configuration typically referenced the SHA-1 of a specific Git commit to use, we could use anything git checkout understood. We had some machines running master or other branches if we didn't care about time-independent idempotency for that machine at that time. What this all meant was that if you wanted to roll back a machine's configuration, you simply specified an earlier Git commit SHA-1 and then re-ran local Puppet.
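A rough sketch of that standalone flow, with hypothetical paths and the revision supplied by whatever invokes the script:

import subprocess

REPO = '/var/lib/config-repo'  # hypothetical local clone of the config repo

def apply_configuration(revision):
    """Check out the pinned revision of the config repo and apply it locally."""
    subprocess.check_call(['git', 'fetch', 'origin'], cwd=REPO)
    subprocess.check_call(['git', 'checkout', revision], cwd=REPO)
    # Masterless Puppet run against the checked-out manifests.
    subprocess.check_call([
        'puppet', 'apply',
        '--modulepath', REPO + '/modules',
        REPO + '/manifests/site.pp',
    ])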

We were largely satisfied with this model. We felt like we got the best of both worlds. And, since we were using the same technology (Puppet) for time-dependent and time-independent configuration, it was a pretty simple-to-understand system. A downside was that there were two Puppet instances instead of one. With a little effort, someone could probably devise a way for the Puppet master to merge the two configuration trees. I'll leave that as an exercise for the reader. Perhaps someone has done this already! If you know of someone, please leave a comment!

Challenges

The solution I describe does not come without its challenges.

First, deciding whether a piece of configuration is time dependent or time independent can be quite complicated. For example, should a package update for a critical security fix be time dependent or time independent? It depends! What's the risk of the machine not receiving that update? How often is that machine rolled back? Is that package important to the operation/role of that machine (if so, I'd lean more towards time independent)?

Second, minimizing exposure to external entities is hard. While I recommend putting as much as possible under version control in a single repository and pinning versions everywhere when you interface with an external system, this isn't always feasible. It's probably a silly idea to have your 200 GB Apt repository under version control and distributed locally to every machine in your network. So, you end up introducing specialized one-off systems as necessary. For our package repository, we just ran an internal HTTP server that only allowed inserts (no deletes or mutates). If we were creative, we could have likely devised a way for the client to pass a revision with the request and have the server dynamically serve from that revision of an underlying repository. Although, that may not work for every server type due to limited control over client behavior.

Third, ensuring compatibility between the time-dependent configuration and time-independent configuration is hard. This is a consequence of separating those configurations. Will a time-independent configuration from a revision two years ago work with the time-dependent configuration of today? This issue can be mitigated by first having as much configuration as possible be time independent and second not relying on wide support windows. If it's good enough to only support compatibility for time-independent configurations less than a month old, then it's good enough! With this issue, I feel you are trading long-term future incompatibility for well-defined and understood behavior in the short to medium term. That's a trade-off I'm willing to make.

Conclusion

Many machine configuration management systems only care about idempotency today. However, it's also possible to achieve consistent state over time. It requires a little extra effort and brain power, but it's certainly doable.

The next time you are programming your system configuration tool, I hope you take the time to consider the effects time will have and that you will take the necessary steps to ensure consistency over time (assuming you want that, of course).


Using Docker to Build Firefox

May 19, 2013 at 01:45 PM | categories: Mozilla, Firefox

I have the privilege of having my desk located around a bunch of really intelligent people from the Mozilla Services team. They've been talking a lot about all the new technologies around server provisioning. One that interested me is Docker.

Docker is a pretty nifty piece of software. It's essentially a glorified wrapper around Linux Containers. But, calling it that is doing it an injustice.

Docker interests me because it allows simple environment isolation and repeatability. I can create a run-time environment once, package it up, then run it again on any other machine. Furthermore, everything that runs in that environment is isolated from the underlying host (much like a virtual machine). And best of all, everything is fast and simple.

For my initial experimentation with Docker, I decided to create an environment for building Firefox.

Building Firefox with Docker

To build Firefox with Docker, you'll first need to install Docker. That's pretty simple.

Then, it's just a matter of creating a new container with our build environment:

curl https://gist.github.com/indygreg/5608534/raw/30704c59364ce7a8c69a02ee7f1cfb23d1ffcb2c/Dockerfile | docker build

The output will look something like:

FROM ubuntu:12.10
MAINTAINER Gregory Szorc "gps@mozilla.com"
RUN apt-get update
===> d2f4faba3834
RUN dpkg-divert --local --rename --add /sbin/initctl && ln -s /bin/true /sbin/initctl
===> aff37cc837d8
RUN apt-get install -y autoconf2.13 build-essential unzip yasm zip
===> d0fc534feeee
RUN apt-get install -y libasound2-dev libcurl4-openssl-dev libdbus-1-dev libdbus-glib-1-dev libgtk2.0-dev libiw-dev libnotify-dev libxt-dev mesa-common-dev uuid-dev
===> 7c14cf7af304
RUN apt-get install -y binutils-gold
===> 772002841449
RUN apt-get install -y bash-completion curl emacs git man-db python-dev python-pip vim
===> 213b117b0ff2
RUN pip install mercurial
===> d3987051be44
RUN useradd -m firefox
===> ce05a44dc17e
Build finished. image id: ce05a44dc17e
ce05a44dc17e

As you can see, it is essentially bootstrapping an environment to build Firefox.

When this has completed, you can start a shell in a container created from that image by taking the image id printed at the end and running it:

docker run -i -t ce05a44dc17e /bin/bash
# You should now be inside the container as root.
su - firefox
hg clone https://hg.mozilla.org/mozilla-central
cd mozilla-central
./mach build

If you want to package up this container for distribution, you just find its ID then export it to a tar archive:

docker ps -a
# Find ID of container you wish to export.
docker export 2f6e0edf64e8 > image.tar
# Distribute that file somewhere.
docker import - < image.tar

Simple, isn't it?

Future use at Mozilla

I think it would be rad if Release Engineering used Docker for managing their Linux builder configurations. Want to develop against the exact system configuration that Mozilla uses in its automation? You could do that. No need to worry about custom apt repositories, downloading custom toolchains, keeping everything isolated from the rest of your system, etc: Docker does all of that automatically. Mozilla simply needs to publish Docker images on the Internet and anybody can come along and reproduce the official environment with minimal effort. Once we do that, there are few excuses for someone breaking Linux builds because of an environment discrepancy.

Release Engineering could also use Docker to manage isolation of environments between builds. For example, it could spin up a new container for each build or test job. It could even save images from the results of these jobs. Have a weird build failure like a segmentation fault in the compiler? Publish the Docker image and have someone take a look! No need to take the builder offline while someone SSH's into it. No need to worry about the probing changing state because you can always revert to the state at the time of the failure! And, builds would likely start faster. As it stands, our automation spends minutes managing packages before builds begin. This lag would largely be eliminated with Docker. If nothing else, executing automation jobs inside a container would allow us to extract accurate resource usage info (CPU, memory, I/O) since the Linux kernel effectively gives containers their own namespace independent of the global system's.

I might also explore publishing Docker images that construct an ideal development environment (since getting recommended tools in the hands of everybody is a hard problem).

Maybe I'll even consider hooking up build system glue to automatically run builds inside containers.

Lots of potential here.

Conclusion

I encourage Linux users to play around with Docker. It enables some new and exciting workflows and is a really powerful tool despite its simplicity. So far, the only major faults I have with it are that the docs say it should not be used in production (yet) and it only works on Linux.


Build System Status Update 2013-05-14

May 13, 2013 at 07:35 PM | categories: Mozilla, build system

I'd like to make an attempt at delivering regular status updates on the Gecko/Firefox build system and related topics. Here we go with the first instance. I'm sure I missed awesomeness. Ping me and I'll add it to the next update.

MozillaBuild Windows build environment updated

Kyle Huey released version 1.7 of our Windows build environment. It contains a newer version of Python and a modern version of Mercurial among other features.

I highly recommend every Windows developer update ASAP. Please note that you will likely encounter Python errors unless you clobber your build.

New submodule and peers

I used my power as module owner to create a submodule of the build config module whose scope is the (largely mechanical) transition of content from Makefile.in to moz.build files. I granted Joey Armstrong and Mike Shal peer status for this module. I would like to eventually see both elevated to build peers of the main build module.

moz.build transition

The following progress has been made:

  • Mike Shal has converted variables related to defining XPIDL files in bug 818246.
  • Mike Shal converted MODULE in bug 844654.
  • Mike Shal converted EXPORTS in bug 846634.
  • Joey Armstrong converted xpcshell test manifests in bug 844655.
  • Brian O'Keefe converted PROGRAM in bug 862986.
  • Mike Shal is about to land conversion of CPPSRCS in bug 864774.

Non-recursive XPIDL generation

In bug 850380 I'm trying to land non-recursive building of XPIDL files. As part of this I'm trying to combine the generation of .xpt and .h for each input .idl file into a single process call because profiling revealed that parsing the IDL consumes most of the CPU time. This shaves a few dozen seconds off of build times.

I have encountered multiple pymake bugs while developing this patch, which is the primary reason it hasn't landed yet.

WebIDL refactoring

I was looking at my build logs and noticed WebIDL generation was taking longer than I thought it should. I filed bug 861587 to investigate making it faster. While my initial profiling turned out to be wrong, Boris Zbarsky looked into things and discovered that the serialization and deserialization of the parser output was extremely slow. He is currently trying to land a refactor of how WebIDL bindings are handled. The early results look very promising.

I think the bug is a good example of the challenges we face improving the build system, as Boris can surely attest.

Test directory reorganization

Joel Maher is injecting sanity into the naming scheme of test directories in bug 852065.

Manifests for mochitests

Jeff Hammel, Joel Maher, Ted Mielczarek, and I are working out using manifests for mochitests (like xpcshell tests) in bug 852416.

Mach core is now a standalone package

I extracted the mach core to a standalone repository and added it to PyPI.

Mach now categorizes commands in its help output.

Requiring Python 2.7.3

Now that the Windows build environment ships with Python 2.7.4, I've filed bug 870420 to require Python 2.7.3+ to build the tree. We already require Python 2.7.0+. I want to bump the point release because there are many small bug fixes in 2.7.3, especially around Python 3 compatibility.

This is currently blocked on RelEng rolling out 2.7.3 to all the builders.

Eliminating master xpcshell manifest

Now that xpcshell test manifests are defined in moz.build files, we theoretically don't need the master manifest. Joshua Cranmer is working on removing it in bug 869635.

Enabling GTests and dual linking libxul

Benoit Gerard and Mike Hommey are working in bug 844288 to dual link libxul so GTests can eventually be enabled and executed as part of our automation.

This will regress build times since we need to link libxul twice. But, giving C++ developers the ability to write unit tests with a real testing framework is worth it, in my opinion.

ICU landing

ICU was briefly enabled in bug 853301 but then backed out because it broke cross-compiling. It should be on track for enabling in Firefox 24.

Resource monitoring in mozbase

I gave mozbase a class to record system resource usage. I plan to eventually hook this up to the build system so the build system records how long it took to perform key events. This will give us better insight into slow and inefficient parts of the build and will help us track build system speed improvements over time.

Sorted lists in moz.build files

I'm working on requiring lists in moz.build be sorted. Work is happening in bug 863069.

This idea started as a suggestion on the dev-platform list. If anyone has more great ideas, don't hold them back!

Smartmake added to mach

Nicholas Alexander taught mach how to build intelligently by importing some of Josh Matthews' smartmake tool's functionality into the tree.

Source server fixed

Kyle Huey and Ted Mielczarek collaborated to fix the source server.

Auto clobber functionality

Auto clobber functionality was added to the tree. After flirting briefly with on-by-default, we changed it to opt-in. When you encounter it, it will tell you how to enable it.

Faster clobbers on automation

I was looking at build logs and noticed that we were performing clobbers inefficiently.

Massimo Gervasini and Chris AtLee deployed changes to automation to make it more efficient. My measurements showed a Windows try build that took 15 fewer minutes to start - a huge improvement.

Upgrading to Mercurial 2.5.4

RelEng is tracking the global deployment of Mercurial 2.5.4. hg.mozilla.org is currently running 2.0.2 and automation is all over the map. The upgrade should make Mercurial operations faster and more robust across the board.

I'm considering adding code to mach or the build system that prompts the user when her Mercurial is out of date (since an out of date Mercurial can result in a sub-par user experience).

Parallelize reftests

Nathan Froyd is leading an effort to parallelize reftest execution. If he pulls this off, it could shave hours off of the total automation load per checkin. Go Nathan!

Overhaul of MozillaBuild in the works

I am mentoring a pair of interns this summer. I'm still working out the final set of goals, but I'm keen to have one of them overhaul the MozillaBuild Windows development environment. Cross your fingers.


Mozilla Build System Brain Dump

May 13, 2013 at 05:25 PM | categories: build system, Mozilla, Firefox, mach

I hold a lot of context in my head when it comes to the future of Mozilla's build system and the interaction with it. I wanted to perform a brain dump of sorts so people have an idea of where I'm coming from when I inevitably propose radical changes.

The sad state of build system interaction and the history of mach

I believe that Mozilla's build system has had a poor developer experience for as long as there has been a Mozilla build system. Getting started with Firefox development was a rite of passage. It required following (often out-of-date) directions on MDN. It required finding pages through MDN search or asking other people for info over IRC. It was the kind of process that turned away potential contributors because it was just too damn hard.

mach - while born out of my initial efforts to radically change the build system proper - morphed into a generic command dispatching framework by the time it landed in mozilla-central. It has one overarching purpose: provide a single gateway point for performing common developer tasks (such as building the tree and running tests). The concept was nothing new - individual developers had long coded up scripts and tools to streamline workflows. Some even published these for others to use. What set mach apart was a unified interface for these commands (the mach script in the top directory of a checkout) and that these productivity gains were in the tree and thus easily discoverable and usable by everybody without significant effort (just run mach help).

While mach doesn't yet satisfy everyone's needs, it's slowly growing new features and making developers' lives easier with each one. All of this is happening despite the fact that there is not a single person tasked with working on mach full time. Until a few months ago, mach was largely my work. Recently, Matt Brubeck has been contributing a flurry of enhancements - thanks Matt! Ehsan Akhgari and Nicholas Alexander have contributed a few commands as well! There are also a few people with a single command to their name. This is fulfilling my original vision of enabling developers to scratch their own itches by contributing mach commands.
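To give a flavor of what contributing a command involves, a command lives in a mach_commands.py file in the tree and registers itself with decorators, roughly like this (a sketch from memory; the exact decorator arguments may differ):

from mach.decorators import CommandProvider, Command

@CommandProvider
class HelloCommands(object):
    @Command('hello', description='Verify that mach command dispatch works.')
    def hello(self):
        print('hello from a custom mach command')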

I've noticed more people referencing mach in IRC channels. And, more people get angry when a mach command breaks or changes behavior. So, I consider the mach experiment a success. Is it perfect? No. If it's not good enough for you, please file a bug and/or code up a patch. If nothing else, please tell me: I love to learn about everyone's subtle requirements so I can keep them in mind when refactoring the build system and hacking on mach.

The object directory is a black box

One of the ideas I'm trying to advance is that the object directory should be considered a black box for the majority of developers. In my ideal world, developers don't need to look inside the object directory. Instead, they interact with it through condoned and supported tools (like mach).

I say this for a few reasons. First, as the build config module owner I would like the ability to massively refactor the internals of the object directory without disrupting workflows. If people are interacting directly with the object directory, I get significant push back if things change. This inevitably holds back much-needed improvements and triggers resentment towards me, build peers, and the build system. Not a good situation. Whereas if people are indirectly interacting with the object directory, we simply need to maintain a consistent interface (like mach) and nobody should care if things change.

Second, I believe that the methods used when directly interacting with the object directory are often sub-par compared with going through a more intelligent tool, and that productivity suffers as a result. For example, when you type make inside the object directory, you need to know to pass -j8, to use make vs pymake, and that you also need to build toolkit/library, etc. Also, by invoking make directly, you bypass other handy features, such as automatic compiler warning aggregation (which only happens if you invoke the build system through mach). If you go through a tool like mach, you should automatically get the most ideal experience possible.

In order for this vision to be realized, we need massive improvements to tools like mach to cover the missing workflows that still require direct object directory interaction. We also need people to start using mach. I think increased mach usage comes after mach has established itself as obviously superior to the alternatives (I already believe it offers this for tasks like running tests).

I don't want to force mach upon people but...

Nobody likes being forced to change a process that has been familiar for years. Developers especially. I get it. That's why I've always attempted to position mach as an alternative to existing workflows. If you don't like mach, you can always fall back to the previous workflow. Or, you can improve mach (patches more than welcome!). Having gone down the please-use-this-tool-it's-better road before at other organizations, I strongly believe that the best way to drive adoption of a new tool is to gradually sway people through obvious superiority and praise (as opposed to a mandate to switch). I've been trying this approach with mach.

Lately, more and more people have been saying things like "we should have the build infrastructure build through mach instead of client.mk" and "why do we need testsuite-targets.mk when we have mach commands?" While I personally feel that client.mk and testsuite-targets.mk are antiquated as a developer-facing interface compared to mach, I'm reluctant to eliminate them because I don't like forcing change on others. That being said, there are compelling reasons to eliminate or at least refactor how they work.

Let's take testsuite-targets.mk as an example. This is the make file that provides the targets to run tests (like make xpcshell-test and make mochitest-browser-chrome). What's interesting about this file is that it's only used in local builds: our automation infrastructure does not use testsuite-targets.mk! Instead, mozharness and the old buildbot configs manually build up the command used to invoke the test harnesses. Initially, the mach commands for running tests simply invoked make targets defined in testsuite-targets.mk. Lately, we've been converting the mach commands to invoke the Python test runners directly. I'd argue that the logic for invoking the test runner only needs to live in one place in the tree. Furthermore, as a build module peer, I have little desire to support multiple implementations, especially considering how fragile they can be.

I think we're trending towards an outcome where mach (or the code behind mach commands) becomes the authoritative invocation method and legacy interfaces like client.mk and testsuite-targets.mk are reimplemented to either call mach commands or the same routines that power them. Hopefully this will be completely transparent to developers.

The future of mozconfigs and environment configuration

mozconfig files are shell scripts used to define variables consumed by the build system. They are the only officially supported mechanism for configuring how the build system works.

I'd argue mozconfig files are a mediocre solution at best. First, there's the issue of mozconfig statements that don't actually do anything. I've seen no-op mozconfig content cargo culted into the in-tree mozconfigs (used for the builder configurations)! Oops. Second, doing things in mozconfig files is just awkward. Defining the object directory requires mk_add_options MOZ_OBJDIR=some-path. What's mk_add_options? If some-path is relative, what is it relative to? While certainly addressable, the documentation on how mozconfig files work is not terrific and fails to explain many pitfalls. Even with proper documentation, there's still the issue of the file format allowing no-op variable assignments to persist.

I'm very tempted to reinvent build configuration as something not mozconfigs. What exactly, I don't know. mach has support for ini-like configuration files. We could certainly have mach and the build system pull configs from the same file.

I'm not sure what's going to happen here. But deprecating mozconfig files as they are today is part of many of the options.

Handling multiple mozconfig files

A lot of developers only have a single mozconfig file (per source tree at least). For these developers, life is easy. You simply install your mozconfig in one of the default locations and it's automagically used when you use mach or client.mk. Easy peasy.

I'm not sure what the relative numbers are, but many developers maintain multiple mozconfig files per source tree. e.g. they'll have one mozconfig to build desktop Firefox and another one for Android. They may have debug variations of each.

Some developers even have a single mozconfig file but leverage the fact that mozconfig files are shell scripts and have their mozconfig dynamically do things depending on the current working directory, value of an environment variable, etc.

I've also seen wrapper scripts that are glorified ways of setting environment variables, changing directories, etc., before invoking a command.

I've been thinking a lot about providing a common and well-supported solution for switching between active build configurations. Installing mach on $PATH goes a long way to facilitate this. If you are in an object directory, the mozconfig used when that object directory was created is automatically applied. Simple enough. However, I want people to start treating object directories as black boxes. So, I'd rather not see people have their shell inside the object directory.

Whenever I think about solutions, I keep arriving at a virtualenv-like solution. Developers would potentially need to activate a Mozilla build environment (similar to how Windows developers need to launch MozillaBuild). Inside this environment, the shell prompt would contain the name of the current build configuration. Users could switch between configurations using mach switch or some other magic command on the $PATH.

Truth be told, I'm skeptical that people would find this useful. I'm not sure it's that much better than exporting the MOZCONFIG environment variable to define the active config. This one requires more thought.

The integration between the build environment and Python

We use Python extensively in the build system and for common developer tasks. mach is written in Python. moz.build processing is implemented in Python. Most of the test harnesses are written in Python.

Doing practically anything in the tree requires a Python interpreter that knows about all the Python code in the tree and how to load it.

Currently, we have two very similar Python environments. One is a virtualenv created while running configure at the beginning of a build. The other is essentially a cheap knock-off that mach creates when it is launched.

At some point I'd like to consolidate these Python environments. From any Python process we should have a way to automatically bootstrap/activate into a well-defined Python environment. This certainly sounds like establishing a unified Python virtualenv used by both the build system and mach.
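virtualenv's activate_this.py mechanism hints at what such bootstrapping could look like: any running Python process can adopt an existing environment after the fact (a sketch; the object directory path is a placeholder):

# Python 2 era sketch: adopt an existing virtualenv from a running process.
activate = '/path/to/objdir/_virtualenv/bin/activate_this.py'
execfile(activate, dict(__file__=activate))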

Unfortunately, things aren't straightforward. The virtualenv today is constructed in the object directory. How do we determine the current object directory? By loading the mozconfig file. How do we do that? Well, if you are mach, we use Python. And, how does mach know where to find the code to load the mozconfig file? You can see the dilemma here.

A related issue is that of portable build environments. Currently, a lot of our automation recreates the build system's virtualenv from its own configuration (not the one from the source tree). This has bitten us and will continue to bite us. We'd really like to package up the virtualenv (or at least its config) with tests so there is no potential for discrepancy.

The inner workings of how we integrate with Python should be invisible to most developers. But, I figured I'd capture it here because it's an annoying problem. And, it's also related to an activated build environment. What if we required all developers to activate their shell with a Mozilla build environment (like we do on Windows)? Not only would this solve Python issues, but it would also facilitate simpler config switching (outlined above). Hmmm...

Direct interaction with the build system considered harmful

Ever since there was a build system, developers have been typing make (or make.py) to build the tree. One of the goals of the transition to moz.build files is to facilitate building the tree with Tup. make will do nothing when you're not using Makefiles! Another goal of the moz.build transition is to start derecursifying the make build system so that we build things in parallel. It's likely we'll produce monolithic make files and then process all targets for a related class (IDLs, C++ compilation, etc.) in one invocation of make. So, uh, what happens during a partial tree build? If a .cpp file from /dom/src/storage is being handled by a monolithic make file invoked by the Makefile at the top of the tree, how does a partial tree build pick that up? Does it build just that target or every target in the monolithic/non-recursive make file?

Unless the build peers go out of their way to install redundant targets in leaf Makefiles, directly invoking make from a subdirectory of the tree won't do what it has done for years.

As I said above, I'm sympathetic to resistance against forced changes in procedure, so it's likely we'll provide backwards-compatible behavior. But, I'd prefer not to. I'd first prefer that partial-tree builds weren't necessary because a full tree build finishes quickly. But, we're not going to get there for a bit. As an alternative, I'll take people building through mach build. That way, we have an easily extensible interface on which to build partial tree logic. We saw this recently when dumbmake/smartmake landed. And, going through mach also reinforces my ideal that the object directory is a black box.

Semi-persistent state

Currently, most state as it pertains to a checkout or build is in the object directory. This is fine for artifacts from the build system. However, there is a whole class of state that arguably shouldn't be in the object directory. Specifically, it shouldn't be clobbered when you rebuild. This includes logs from previous builds, the warnings database, previously failing tests, etc. The list is only going to grow over time.

I'd like to establish a location for semi-persistent state related to the tree and builds. Perhaps we change the clobber logic to ignore a specific directory. Perhaps we start storing things in the user's home directory. Perhaps we could establish a second object directory named the state directory? How would this interact with build environments?

This will probably sit on the backburner until there is a compelling use case for it.

The battle against C++

Compiling C++ consumes the bulk of our build time. Anything we can do to speed up C++ compilation will work wonders for our build times.

I'm optimistic things like precompiled headers and compiling multiple .cpp files with a single process invocation will drastically decrease build times. However, no matter how much work we put in to make C++ compilation faster, we still have a giant issue: dependency hell.

As shown in my build system presentation a few months back, we have dozens of header files included by hundreds if not thousands of C++ files. If you change one of these headers, you invalidate build dependencies and trigger a rebuild. This is why whenever files like mozilla-config.h change you are essentially confronted with a full rebuild. ccache may help if you are lucky. But, I fear that as long as headers proliferate the way they do, there is little the build system by itself can do.

My attitude towards this is to wait and see what we can get out of precompiled headers and the like. Maybe that makes it good enough. If not, I'll likely be making a lot of noise at Platform meetings requesting that C++ gurus brainstorm on a solution for reducing header proliferation.

Conclusion

Believe it or not, these are only some of the topics floating around in my head! But I've probably managed to bore everyone enough, so I'll call it a day.

I'm always interested in opinions and ideas, especially if they are different from mine. I encourage you to leave a comment if you have something to say.

