I've written an extensive testing framework for Mozilla's version control tools. Despite it being a little rough around the edges, I'm a bit proud of it.
When you run tests for MozReview, Mozilla's heavily modified Review Board code review tool, the following things happen:
- A MySQL server is started in a Docker container.
- A Bugzilla server (running the same code as bugzilla.mozilla.org) is started on an Apache httpd server with mod_perl inside a Docker container.
- A RabbitMQ server mimicking pulse.mozilla.org is started in a Docker container.
- A Review Board Django development server is started.
- A Mercurial HTTP server is started.
In the future, we'll likely also need to add support for various other services to support MozReview and other components of version control tools:
- The Autoland HTTP service will be started in a Docker container, along with any other requirements it may have.
- An IRC server will be started in a Docker container.
- Zookeeper and Kafka will be started in multiple Docker containers.
The entire setup is pretty cool. You have actual services running on your local machine. Mike Conley and Steven MacLeod even did some pair coding of MozReview while on a plane last week. I think it's pretty cool this is even possible.
There is very little mocking in the tests. If we need an external service, we try to spin up an instance inside a local container. This way, we can't have unexpected test successes or failures due to bugs in mocking. We have very high confidence that if something works against local containers, it will work in production.
I currently have each test file owning its own set of Docker containers and processes. This way, we get full test isolation and can run tests concurrently without race conditions. This drastically reduces overall test execution time and makes individual tests easier to reason about.
As cool as the test setup is, there's a bunch I wish were better.
Spinning up and shutting down all those containers and processes takes a lot of time. We're currently sitting around 8s startup time and 2s shutdown time. 10s overhead per test is unacceptable. When I make a one-line change, I want the tests to be instantaneous. 10s is too long for me to sit idly by. Unfortunately, I've already gone to great pains to make test overhead as short as possible. Fig wasn't good enough for me for various reasons. I've reimplemented my own orchestration directly on top of the docker-py package to achieve some significant performance wins. Using concurrent.futures to perform operations against multiple containers concurrently was a big win. Bootstrapping containers (running their first-run entrypoint scripts and committing the result to be used later by tests) was a bigger win (first run of Bugzilla is 20-25 seconds).
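To make the concurrent.futures point concrete, here is a minimal sketch of starting several services in parallel with a thread pool. It is not the harness's actual code: `start_service` is a hypothetical stand-in that sleeps instead of calling the Docker API, and the service names and delays are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def start_service(name, startup_seconds):
    """Hypothetical stand-in for starting one container and waiting on it."""
    time.sleep(startup_seconds)  # simulates blocking on a Docker API call
    return name


def start_all(services, max_workers=4):
    """Start every service concurrently; return names in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(start_service, name, seconds)
                   for name, seconds in services]
        # result() re-raises any exception raised in the worker thread.
        return [future.result() for future in futures]


if __name__ == "__main__":
    services = [("mysql", 0.2), ("rabbitmq", 0.2), ("bugzilla", 0.2)]
    began = time.time()
    started = start_all(services)
    print(started, "in %.1fs" % (time.time() - began))
```

Threads are a reasonable fit here because the work is almost entirely waiting on network calls, so the total wait approaches that of the slowest service rather than the sum of all of them.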
I'm at the point of optimizing startup where the longest pole is the initialization of the services inside Docker containers themselves. MySQL takes a few seconds to start accepting connections. Apache + Bugzilla has a semi-involved initialization process. RabbitMQ takes about 4 seconds to initialize. There are some cascading dependencies in there, so the majority of startup time is waiting for processes to finish their startup routine.
Another concern with running all these containers is memory usage. When you start running 6+ instances of MySQL + Apache, RabbitMQ, + ..., it becomes really easy to exhaust system memory, incur swapping, and have performance fall off a cliff. I've spent a non-trivial amount of time figuring out the minimal amount of memory I can make services consume while still not sacrificing too much performance.
It is quite an experience having the problem of trying to minimize resource usage and startup time for various applications. Searching the internet will happily give you recommended settings for applications. You can find out how to make a service start in 10s instead of 60s or consume 100 MB of RSS instead of 1 GB. But what the internet won't tell you is how to make the service start in 2s instead of 3s or consume as little memory as possible. I reckon I'm past the point of diminishing returns where most people don't care about any further performance wins. But, because of how I'm using containers for end-to-end testing and I have a surplus of short-lived containers, it is clearly a problem I need to solve.
I might be able to squeeze out a few more seconds of reduction by further optimizing startup and shutdown. But, I doubt I'll reduce things below 5s. If you ask me, that's still not good enough. I want no more than 2s overhead per test. And I don't think I'm going to get that unless I start utilizing containers across multiple tests. And I really don't want to do that because it sacrifices test purity. Engineering is full of trade-offs.
Another takeaway from implementing this test harness is that the pre-built Docker images available from the Docker Registry almost always become useless. I eventually make a customization that can't be shoehorned into the readily-available image and I find myself having to reinvent the wheel. I'm not a fan of the download-and-run-a-binary model, especially given Docker's less-than-stellar history on the security and cryptography fronts (I'll trust Linux distributions to get package distribution right, but I'm not going to be trusting the Docker Registry quite yet), so it's not a huge loss. I'm at the point where I've lost faith in Docker Registry images and my default position is to implement my own builder. Containers are supposed to do one thing, so it usually isn't that difficult to roll my own images.
There's a lot to love about Docker and containerized test execution. But I feel like I'm forging into new territory and solving problems like startup time minimization that I shouldn't really have to be solving. I think I can justify it given the increased accuracy from the tests and the increased confidence that brings. I just wish the cost weren't so high. Hopefully as others start leaning on containers and Docker more for test execution, people start figuring out how to make some of these problems disappear.
Warning: This post is long and rambling. There is marginal value in reading beyond the first few paragraphs unless you care about Docker.
I recently wrote about how Mozilla tests version control. In this post, I want to talk about the part of that effort that consumed the most time: adding Docker support to the test harness.
Introducing the Problem and Desired End State
Running Docker containers inside tests just seems like an obvious thing you'd want to do. I mean, wouldn't it be cool if your tests could spin up MySQL, Redis, Cassandra, Nginx, etc inside Docker containers and test things against actual instances of the things running in your data centers? Of course it would! If you ask me, this approach beats mocking because many questions around accuracy of the mocked interface are removed. Furthermore, you can run all tests locally, while on a plane: no data center or staging environment required. How cool is that! And, containers are all isolated so there's no need to pollute your system with extra packages and system services. Seems like wins all around.
When Mozilla started adding customizations to the Review Board code review software in preparation for deploying it at Mozilla as a replacement for Bugzilla's Splinter, it quickly became apparent that we had a significant testing challenge ahead of us. We weren't just standing up Review Board and telling people to use it; we were integrating user authentication with Bugzilla, having Review Board update Bugzilla after key events, and driving the initiation of code review in Review Board by pushing code to a Mercurial server. That's 3 user-visible services all communicating with each other to expose a unified workflow. It's the kind of thing testing nightmares are made of.
During my early involvement with the project, I recognized the challenge ahead and was quick to insist that we write automated tests for as much as possible. I insisted that all the code (there are multiple extensions to Review Board, a Mercurial hook, and a Mercurial extension) live under one common repository and share testing. That way we could tinker with all the parts easily and test them in concert without having to worry about version sync. We moved all the code to the version-control-tools repository and Review Board was the driving force behind improvements to the test harness in that repository. We had Mercurial .t tests starting Django dev servers hosting Review Board running from per-test SQLite databases and all was nice. Pretty much every scenario involving the interaction between Mercurial and Review Board was tested. If you cared about just these components, life was happy.
A large piece of the integration story was lacking in this testing world: Bugzilla. We had somewhat complex code for having Review Board and Bugzilla talk to each other but no tests for it because nobody had yet hooked Bugzilla up to the tests. As my responsibilities in the project expanded from covering just the Mercurial and Review Board interaction to Bugzilla as well, I again looked at the situation and thought there's a lot of complex interaction here and alpha testing has revealed the presence of many bugs: we need a better testing story. So, I set out to integrate Bugzilla into the test harness.
My goals were for Review Board tests to be able to make requests against a Bugzilla instance configured just like bugzilla.mozilla.org, to allow tests to execute concurrently (don't make developers wait on machines), for tests to run as quickly as possible, to run tests in an environment as similar to production as possible, and to be able to run tests from a plane or train or anywhere without internet connectivity. I was unwilling to budge on these core testing requirements because they represent what's best from test accuracy and developer productivity standpoints: you want your tests to be representative of the real world and you want to enable people to hack on this service anywhere, anytime, and not be held back by tests that take too long to execute. Move fast and don't break things.
Before I go on, it's time for a quick aside on tolerable waiting times. Throughout this article I talk about minimizing the run time of tests. This may sound like premature optimization. I argue it isn't, at least not if you are optimizing for developer productivity. There is a fair bit of academic research in this area. "A study on tolerable waiting time: how long are Web users willing to wait?" gets cited a lot. It says 2 seconds for web content. If you read a few paragraphs in, it references other literature. The sources disagree on specific thresholds, but one thing is common: the thresholds are typically low - just a few seconds. The latencies I deal with are all longer than what research says leads to badness. When given a choice, I want to optimize workflows for what humans are conditioned to tolerate. If I can't do that, I've failed and the software will be ineffective.
The architecture of Bugzilla created some challenges and eliminated some implementation possibilities. First, I wasn't using just any Bugzilla: I was using Mozilla's branch of Bugzilla that powers bugzilla.mozilla.org. Let's call it BMO. I could try hosting it from local SQLite files and running a local, Perl-based HTTP server (Bugzilla is written in Perl). But my experience with Perl and takeaways from talking to the BMO admins was that pain would likely be involved. Plus, this would be a departure from test accuracy. So, I would be using MySQL, Apache HTTPd, and mod_perl, just like BMO uses them in production.
Running Apache and MySQL is always a... fun endeavor. It wasn't a strict requirement, but I also highly preferred that the tests didn't pollute the system they ran on. In other words, having tests connect to an already-running MySQL or Apache server felt like the wrong solution. That's just one more thing people must set up and run locally to run the tests. That's just one more thing that could differ from production and cause bad testing results. It felt like a dangerous approach. Plus, there's the requirement to run things concurrently. Could I have multiple tests talking to the same MySQL server concurrently? They'd have to use separate databases so they don't conflict. That's a possibility. Honestly, I didn't entertain the thought of running Apache and MySQL manually for too long. I knew about this thing called Docker and that it theoretically fit my use case perfectly: construct building blocks for your application and then dynamically hook things up. Perfect. I could build Docker containers for all the required services and have each test start a new, independent set of containers for just that test.
So, I set out integrating Docker into the version-control-tools test harness. Specifically, my goal was to enable the running of independent BMO instances during individual tests. It sounded simple enough.
What I didn't know was that integrating a Dockerized BMO into the test harness would take the better part of 2 weeks. And it's still not up to my standards. This post is the story about the trials and tribulations I encountered along the way. I hope it serves as a warning and potentially a guide for others attempting similar feats. If any Docker developers are reading, I hope it gives you ideas on how to improve Docker.
Running Bugzilla inside Docker
First things first: to run BMO inside Docker I needed to make Docker containers for BMO. Fortunately, David Lawrence has prior art here. I really just wanted to take that code, dump it into version-control-tools and call it a day. In hindsight, I probably should have done that. Instead, armed with the knowledge of the Docker best practice of one container per service and David Lawrence's similar wishes to make his code conform to that ideal, I decided to spend some time to fix David's code so that MySQL and Apache were in separate containers, not part of a single container running supervisord. Easy enough, right?
It was relatively easy extracting the MySQL and Apache parts of BMO into separate containers. For MySQL, I started with the official MySQL container from the Docker library and added a custom my.cnf. Simple enough. For Apache, I just copied everything from David's code that wasn't MySQL. I was able to manually hook the containers together using the Docker CLI. It sort of just worked. I was optimistic this project would only take a few hours.
A garbage collection bug in Docker
My first speed bump came as I was iterating on Dockerfiles. All of a sudden I get an error from Docker that it is out of space. Wat? I look at docker images and don't see anything too obvious eating up space. What could be going on? At this point, I'm using boot2docker to host Docker. boot2docker is this nifty tool that allows Windows and OS X users to easily run Docker (Docker requires a Linux host). boot2docker spins up a Linux virtual machine running Docker and tells you how to point your local docker CLI at that VM. So, when Docker complains it is out of space, I knew immediately that the VM must be low on space. I SSH into it, run df, and sure enough, the VM is nearly out of space. But I looked at docker images -a and confirmed there's not enough data to fill the disk. What's going on? I can't find the issue right now, but it turns out there is a bug in Docker! When running Docker on aufs filesystems (like boot2docker does), Docker does not always remove data volume containers when deleting a container. It turns out that the MySQL containers from the official Docker library were creating a data-only container to hold persistent MySQL data that outlives the container itself. These containers are apparently light magic. They are containers that are attached to other containers, but they don't really show up in the Docker interfaces. When you delete the host container, these containers are supposed to be garbage collected. Except on aufs, they aren't. My MySQL containers were creating 1+ GB InnoDB data files on start and the associated data containers were sitting around after container deletion, effectively leaking 1+ GB every time I created a MySQL container, quickly filling the boot2docker disk. Derp.
I worked around this problem by forking the official MySQL container. I didn't need persistent MySQL data (the containers only need to live for one invocation - for the lifetime of a single test), so I couldn't care less about persisted data volumes. So, I changed the MySQL container to hold its data locally, not in a data volume container. The solution was simple enough. But it took me a while to identify the problem. Here I was seeing Docker do something extremely stupid. Surely my understanding of Docker was wrong and I was doing something stupid to cause it to leak data. I spent hours digging through the documentation to make sure I was doing things exactly as recommended. It wasn't until I started an Ubuntu VM and tried the same thing there that I realized this looked like a bug in boot2docker. A few Google searches later led me to a comment hiding at the bottom of an existing GitHub issue that pins aufs as the culprit. And here I thought Docker reached 1.0 and wouldn't have bad bugs like this. I certainly wouldn't expect boot2docker to be shipping a VM with a sub-par storage driver (shouldn't it be using devicemapper or btrfs instead?). Whatever.
Wrangling with Mozilla's Branch of Bugzilla
At this point, I've got basic Docker containers for MySQL and Apache+mod_perl+Bugzilla being created. Now, I needed to convert from vanilla Bugzilla to BMO. Should be straightforward. Just change the Git remote URL and branch to check out. I did this and all of a sudden my image started encountering errors building! It turns out that the BMO code base doesn't work on a fresh database! Fortunately, this is a known issue and I've worked around it previously. When I tackled it a few months ago, I spent a handful of hours dissecting this problem. It wasn't pretty. But this time I knew what to do. I even had a Puppet manifest for installing BMO on a fresh machine. So, I just needed to translate that Puppet config into Dockerfile commands. No big deal, right? Well, when I did that Puppet config a few months ago, I based it on Ubuntu because I'm more familiar with Debian-based distros and figured Ubuntu would be the easiest since it tends to have the largest package diversity. Unfortunately, David's Docker work is based on Fedora. So, I spent some time converting the Dockerfile to Ubuntu rather than trying to port things to Fedora. Arguably the wrong decision since Mozilla operates the RedHat flavor of Linux distributions in production. But I was willing to trade accuracy for time here, having lost time dealing with the aufs bug.
Unfortunately, I underestimated how long it would take to port the image to Ubuntu. It didn't take so long from a code change perspective. Instead, most of the time was spent waiting for Docker to run the commands to build the image. In the final version, Apt is downloading and installing over 250 packages. And Bugzilla's bootstrap process installs dozens of packages from CPAN. Every time I made a small change, I invalidated Docker's image building cache, causing extreme delays while waiting for Apt and CPAN to do their thing. This experience partially contributed to my displeasure with how Docker currently handles image creation. If Docker images were composed of pre-built pieces instead of stacked commands, my cache hit rate would have been much higher and I would have converted the image in no time. But no, that's not how things work. So I lost numerous hours through this 2-week process waiting for Docker images to perform operations I've already done elsewhere dozens of times before.
Docker Container Orchestration
After porting the Bugzilla image to Ubuntu and getting BMO to bootstrap in a manually managed container (up to this point I'm using the docker CLI to create images, start containers, etc.), it was time to automate the process so that tests could run the containers. At this time, I started looking for tools that performed multi-container orchestration. I had multiple containers that needed to be treated as a single logical unit, so I figured I'd use an existing tool to solve this problem for me. Don't reinvent the wheel unless you have to, right? I discovered Fig, which seemed to fit the bill. I read that it is being integrated into Docker itself, so it must be best of breed. Even if it weren't, its future seems to be more certain than other tools. So, I stopped my tools search and used Fig without much consideration for other tools.
Lack of a useful feature in Fig
I quickly whipped up a fig.yml and figured it would just work. Nope! Starting the containers from scratch using Fig resulted in an error. I wasn't sure what the error was at first because Fig didn't tell me. After some investigation, I realized that my bmoweb container (the container holding Apache + BMO code) was failing in its entrypoint command (that's a command that runs when the container starts up, but not the primary command a container runs - that's a bit confusing I know - read the docs). I added some debug statements and quickly realized that Bugzilla was erroring connecting to MySQL. Strange, I thought. Fig is essentially a DSL around manual docker commands, so I checked everything by typing everything into the shell. No error. Again on a new set of containers. No error. I thought maybe my environment variable handling was wrong - that the dynamically allocated IP address and port number of the linked MySQL container being passed to the bmoweb container weren't getting honored. I added some logging to disprove that theory. The wheels inside my brain spun for a little bit. And, aided by some real-time logging, I realized I was dealing with a race condition: Fig was starting the MySQL and bmoweb containers concurrently and bmoweb was attempting to access the MySQL server before MySQL had fully initialized and started listening on its TCP port! That made sense. And I think it's a reasonable optimization for Fig to start containers concurrently to speed up start time. But surely a tool that orchestrates different containers has considered the problem of dependencies and has a mechanism to declare them to prevent these race conditions. I check the online docs and there's nothing to be found. A red panda weeps. So, I change the bmoweb entrypoint script to wait until it can open a TCP socket to MySQL before actually using MySQL and sure enough, the race condition goes away and the bmoweb container starts just fine!
OK, I'm real close now. I can feel it.
I start playing around with manually starting and stopping containers as part of a toy test. The good news is things appear to work. The bad news is it is extremely slow. It didn't take long for me to realize that the reason for the slowness is Bugzilla's bootstrap on first run. Bugzilla, like many complex applications, has a first run step that sets up database schema, writes out some files on the filesystem, inserts some skeleton data in the database, creates an admin user, etc. Much to my dismay this was taking a long time. Something on the order of 25 to 30 seconds. And that's on a Haswell with plenty of RAM and an SSD. Oy. The way things are currently implemented would result in a 25 to 30 second delay when running every test. Change 1 line and wait say 25s for any kind of output. Are you kidding me?! Unacceptable. It violated my core goal of having tests that are quick to run. Again, humans should not have to wait on machines.
I think about this problem for like half a second and the solution is obvious: take a snapshot of the bootstrapped images and start instances of that snapshot from tests. In other words, you perform the common bootstrap operations once and only once. And, you can probably do that outside the scope of running tests so that the same snapshot can be used across multiple invocations of the test harness. Sounds simple! To the Docker uninitiated, it sounds like the solution would be to move the BMO bootstrapping into the Dockerfile code so it gets executed at image creation time. Yes, that would be ideal. Unfortunately, when building images via Dockerfile, you can't tell Docker to link that image to another container. Without container linking, you can't have MySQL. Without MySQL, you can't do BMO bootstrap. So, BMO bootstrap must be done during container startup. And in Docker land, that means putting it as part of your entrypoint script (where it was conveniently already located for this reason).
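The snapshot idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness's actual code: `client` is assumed to behave like docker-py's low-level client (the method names `create_container`, `start`, `stop`, `commit`, and `remove_container` follow that library, but check them against the version you have installed), `wait_until_ready` is a hypothetical callable that blocks until the entrypoint has finished its first-run work, and container linking is left out for brevity. For BMO the same treatment would need to be applied to both the MySQL and bmoweb containers.

```python
def snapshot_bootstrapped(client, base_image, wait_until_ready,
                          repository, tag="bootstrapped"):
    """Run base_image once so its entrypoint bootstraps, then commit it.

    Returns the ID of the committed (bootstrapped) image.
    """
    container = client.create_container(base_image)["Id"]
    client.start(container)
    try:
        # Hypothetical hook: block until first-run setup has completed.
        wait_until_ready(container)
        # Stop gracefully so the service can flush its state to disk.
        client.stop(container, timeout=20)
        image_id = client.commit(container, repository=repository,
                                 tag=tag)["Id"]
    finally:
        # The container is only scaffolding; the committed image is the result.
        client.remove_container(container, force=True)
    return image_id
```

Tests would then start containers from the returned image ID instead of the base image, skipping the slow first-run step.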
Talking Directly to the Docker API
Of course, the tools that I found that help with Docker image building and container orchestration don't seem to have an answer for this "create a snapshot of a bootstrapped container" problem. I'm sure someone has solved this problem. But in my limited searching, I couldn't find anything. And, I figured the problem would be easy enough to solve manually, so I set about creating a script to do it. I'm not a huge fan of shell script for automation. It's hard to debug and simple things can be hard and hard things can be harder. Plus, why solve problems such as parsing output for relevant data when you can talk to an API directly and get native types? Since the existing test harness automation in version-control-tools was written in Python, I naturally decided to write some Python to create the bootstrapped images. So, I do a PyPI search and discover docker-py, a Python client library to the Docker Remote API, an HTTP API that the Docker daemon runs and is what the docker CLI tool itself uses to interface with Docker. Good, now I have access to the full power of Docker and am not limited by what the docker CLI may not expose. So, I spent some time looking at source and the Docker Remote API documentation to get an understanding of my new abilities and what I'd need to do. I was pleasantly surprised to learn that the docker CLI is pretty similar to the Remote API and the Python API was similar as well, so the learning curve was pretty shallow. Yay for catching a break!
Confusion Over Container Stopping
I wrote some Python for building the BMO images, launching the containers, committing the result, and saving state to disk (so it could be consulted later - preventing a bootstrap by subsequent consumers). This was pleasantly smooth at first, but I encountered some bumps along the way. First, I didn't have a complete grasp on the differences between stop and kill. I was seeing some weird behavior by MySQL on startup and didn't know why. Turns out I was forcefully killing the container after bootstrap via the kill API and this was sending a SIGKILL to MySQL, effectively causing unclean shutdown. After some documentation reading, I realized stop is the better API - it issues SIGTERM, waits for a grace period, then issues SIGKILL. Issuing SIGTERM made MySQL shut down gracefully and this issue stemming from my ignorance was resolved. (If anyone from Docker is reading, I think the help output for docker kill should mention the forcefulness of the command versus stop. Not all of us remember the relative forcefulness of the POSIX signals and having documentation reinforce their cryptic meaning could help people select the proper command.) A few lines of Python later and I was talking directly to the Docker Remote API, doing everything I needed to do to save (commit in Docker parlance) a bootstrapped BMO environment for re-use among multiple tests.
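For anyone else fuzzy on the distinction, the stop behavior described above (SIGTERM, a grace period, then SIGKILL) can be mimicked on an ordinary local process. This sketch uses only Python's subprocess module and does not touch Docker at all; `stop_process` and the sleeping child are illustrative names and choices, not part of any Docker API.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=10):
    """Ask politely with SIGTERM, then force with SIGKILL after a grace period."""
    proc.terminate()  # SIGTERM on POSIX: the process may catch it and clean up
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught, so no cleanup runs
        proc.wait()
        return "killed"


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, grace_seconds=5))
```

A database that receives only the second signal never gets the chance to flush and close its files, which is consistent with the odd MySQL startup behavior described above.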
It was pretty easy to hook the bootstrapped images up to a single test. Just load the bootstrapped image IDs from the config file and start new containers based on them. That's Docker 101 (except I was using Python to do everything).
Concurrent Execution Confuses Bugzilla
Now that I could start Dockerized BMO from a single test, it was time to make things work concurrently. I hooked Docker up to a few tests and launched them in parallel to see what would happen. The containers appeared to start just fine! Great anticipation on my part to design for concurrency from the beginning, I thought. It appeared I was nearly done. Victory was near. So, I changed some tests to actually interact with BMO running from Docker. (Up until this point I was merely starting containers, not doing anything with them.) Immediately I see errors. Cannot connect to Bugzilla http://... connection refused. Huh? It took a few moments, but I realized the experience I had with MySQL starting and this error were very similar. I changed my code that starts the BMO containers to wait for the HTTP server's TCP socket to start accepting connections before returning control and sure enough, I was able to make HTTP requests against Bugzilla running in Docker! Woo!
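Waiting for a socket to accept connections is simple enough to sketch in a few lines. The function below is a minimal illustration, not the harness's actual code; the name `wait_for_tcp` and the default timeout and polling interval are assumptions chosen for the example.

```python
import socket
import time


def wait_for_tcp(host, port, timeout=30.0, interval=0.1):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: the service is not up yet, so retry.
            time.sleep(interval)
    return False
```

Note that an open port only proves the server is listening, not that the application behind it is fully initialized, so a stricter check might issue a real HTTP request once the socket is up.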
Next step, make an authenticated query against Bugzilla running in Docker. HTTP request completes... with an internal server error. What?! I successfully browsed BMO from containers days before and was able to log in just fine - this shouldn't be happening. This problem took me ages to diagnose. I traced every step of provisioning and couldn't figure out what was going on. After resorting to print debugging in nearly every component, including Bugzilla's Perl code itself, I found the culprit: Bugzilla wasn't liking the dynamic nature of the MySQL and HTTP endpoints. You see, when you start Docker containers, network addresses change. The IP address assigned to the container is whatever is available to Docker at the time the container was started. Likewise the IP address and port number of linked services can change. So, your container entrypoint has to deal with this dynamic nature of addresses. For example, if you have a configuration file, you need to update that configuration file on every run with the proper network address info. My Bugzilla entrypoint script was doing this. Or so I thought. It turns out that Bugzilla's bootstrap process has multiple config files. There's an answers file that provides static answers to questions asked when running the bootstrap script (checksetup.pl). checksetup.pl will produce a localconfig file (actually a Perl script) containing all that data. There's also a data/params file containing yet more configuration options. And, the way I was running bootstrap, checksetup.pl refused to update files with new values. I initially had the entrypoint script updating only the answers file and running checksetup.pl, thinking checksetup.pl would update localconfig if the answers change. Nope! checksetup.pl only appears to update localconfig if localconfig is missing a value. So, here my entrypoint script was, successfully calling checksetup.pl with the proper network values, which checksetup.pl was more than happy to use. But when I started the web application, it used the old values from localconfig and data/params and blew up. Derp. So, to have dynamic MySQL hosts and ports and a dynamic self-referential HTTP URL, I needed to manually update localconfig and data/params during the entrypoint script. The entrypoint script now rewrites Perl scripts during container load to reflect appropriate variables. Oy.
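Rewriting a value inside a Perl-syntax config file can be done with a small regular-expression substitution. The sketch below is written in Python purely for illustration (the actual entrypoint may well do this differently); the helper name `set_perl_scalar` is hypothetical, and the `$db_host` and `$db_port` variables in the usage note are assumed to be the relevant localconfig settings. data/params would need similar but separate handling.

```python
import re


def set_perl_scalar(text, name, value):
    """Replace the single `$name = ...;` assignment in Perl source text."""
    if isinstance(value, int):
        literal = str(value)
    else:
        # Single-quote the string, escaping backslashes and quotes for Perl.
        literal = "'%s'" % value.replace("\\", "\\\\").replace("'", "\\'")
    pattern = re.compile(r"^\$%s\s*=.*?;[ \t]*$" % re.escape(name),
                         re.MULTILINE)
    # A function replacement avoids re treating backslashes as escapes.
    new_text, count = pattern.subn(
        lambda match: "$%s = %s;" % (name, literal), text)
    if count != 1:
        raise ValueError("expected exactly one assignment to $%s" % name)
    return new_text
```

For example, `set_perl_scalar(config_text, "db_host", "172.17.0.5")` followed by `set_perl_scalar(config_text, "db_port", 49153)` would point the file at whatever address the linked MySQL container received on this run. Failing loudly when the variable is missing guards against silently starting the web application with stale values.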
At some point I got working BMO containers running concurrently from multiple tests. This was a huge milestone. But it only revealed my next problem: resource constraints. The running containers were consuming gobs of memory and I couldn't run more than 2 or 3 tests concurrently before running out of memory. Before, I was able to run 8 tests concurrently no problem. Well crap, I just slowed down the test harness significantly by reducing concurrency. No bueno.
Some quick investigation revealed the culprit was MySQL and Apache being greedier than they needed to be. MySQL was consuming 1GB RSS on start. Apache was something like 350 MB. It had been a while since I ran a MySQL server, so I had to scour the net for settings to put MySQL on a diet. The results were not promising. I knew enough about MySQL to know that the answers I found had similar quality to comments on the php.net function documentation circa 2004 (it was not uncommon to see things like SQL injection in the MySQL pages back then - who knows, maybe that's still the case). Anyway, a little tuning later and I was able to get MySQL using a few hundred MB RAM and I reduced the Apache worker pool to something reasonable (maybe 2) to free up enough memory to be able to run tests with the desired concurrency again. If using Docker as part of testing ever takes off, I imagine there will be two flavors of every container: low memory and regular. I'm not running a production service here: I'll happily trade high-end performance for lower memory use as long as it doesn't impact my tests too much.
Caching, Invalidating, and Garbage Collecting Bootstrapped Images
As part of iterating on making BMO bootstrap work, I encountered another problem: knowing when to perform a bootstrap. As mentioned earlier, bootstrap was slow: 25 to 30 seconds. While I had reduced the cost of bootstrap to at most once per test suite execution (as opposed to once per test), there was still the potential for a painful 25-30s delay when running tests. Unacceptable! Furthermore, when I changed how bootstrap worked, I needed a way to invalidate the previous bootstrapped image. Otherwise, we may use an outdated bootstrapped image that doesn't represent the environment it needs to and test execution would fail. How should I do this?
Docker has considered this problem and they have a solution: build context. When you do a docker build, Docker takes all the files from the directory containing the Dockerfile and makes them available to the environment doing the building. If you ADD one of these files in your Dockerfile, the image ID will change if the file changes, invalidating the cache used by Docker to build images. So, if I ADDed the scripts that perform BMO bootstrap to my Docker images, Docker would automagically invalidate the built images and force a bootstrap for me. Nice! Unfortunately, docker build doesn't allow you to add files outside of the current directory to the build context. Internet sleuthing reveals the solution here is to copy things to a temporary directory and run docker build from that. Seriously? Fortunately, I was using the Docker API directly via Python. And that API simply takes an archive of files. And since you can create archives dynamically inside Python using e.g. tarfile, it wasn't too difficult to build proper custom context archives that contained my extra data that could be used to invalidate bootstrapped images. I threw some simple ADD directives into my Dockerfiles and now I got bootstrapped image invalidation!
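For the curious, building a custom context archive in memory looks roughly like this. This is a simplified sketch rather than the code in our repository; the function name build_context and the file names in the usage example are made up for illustration.

```python
import io
import tarfile


def build_context(dockerfile_text, extra_files):
    """Return an in-memory tar archive usable as a Docker build context.

    extra_files maps archive member names to bytes, so content from anywhere
    on disk (or generated on the fly) can be made visible to ADD.
    """
    members = {"Dockerfile": dockerfile_text.encode("utf-8")}
    members.update(extra_files)

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # Sorted names and default TarInfo metadata (mtime 0, uid/gid 0)
        # keep the archive byte-for-byte stable for identical inputs.
        for name in sorted(members):
            data = members[name]
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

With docker-py, an archive like this can be handed to the build call through its fileobj argument together with custom_context=True (at least in the versions I'm aware of; check the docs for yours). Since the member contents are what matter, changing a bootstrap script changes the context and Docker's build cache is invalidated from the matching ADD onward.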
To avoid having to perform bootstrap on every test run, I needed a mapping between the base images and the bootstrapped result. I ended up storing this in a simple JSON file. I realize now I could have queried Docker for images having the base image as its parent since there is supposed to be a 1:1 relationship between them. I may do this as a follow-up.
With the look-up table in place, ensuring bootstrapped images were current involved doing a couple docker builds, finding the bootstrapped images from those base images, and doing the bootstrap if necessary. If everything is up-to-date, docker build finishes quickly and we have less than 1s of delay. Very acceptable. If things aren't current, well, there's not much you can do there if accuracy is important. I was happy with where I was.
Once I started producing bootstrapped images every time the code impacting the generation of that image changed, I ran into a new problem: garbage collection. All those old bootstrapped images were piling up inside of Docker! I needed a way to prune them. Docker has support for associating a repository and a tag with images. Great, I thought, I'll just associate all images with a well-defined repository, leave the tag blank (because it isn't really relevant), and garbage collection will iterate over all images in to-be-pruned repositories and delete all but the most recent one. Of course, this simple solution did not work. As far as I can tell, Docker doesn't really let you have multiple untagged images. You can set a repository with no tag and Docker will automagically assign the latest tag to that image. But the second you create a new image in that repository, the original image loses that repository association. I'm not sure if this is by design or a bug, but it feels wrong to me. I want the ability to associate tags with images (and containers) so I can easily find all entities in a logical set. It seemed to me that repository facilitated that need (albeit with the restriction of only associating 1 identifier per image). My solution here was to assign type 1 UUIDs to the tag field for each image. This forced Docker to retain the repository association when new images were created. I chose type 1 UUIDs so I can later extract the time component embedded within and do time-based garbage collection e.g. delete all images created more than a week ago.
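Pulling the time back out of a type 1 UUID is straightforward with Python's uuid module. Here's a minimal sketch of the time-based pruning decision; the function names are made up, and actually listing and deleting images via Docker is left out.

```python
import datetime
import uuid

# UUID timestamps count 100-nanosecond intervals since 1582-10-15 (RFC 4122).
UUID_EPOCH = datetime.datetime(1582, 10, 15)


def uuid1_created(tag):
    """Return the (naive, UTC) creation time embedded in a type 1 UUID string."""
    u = uuid.UUID(tag)
    if u.version != 1:
        raise ValueError("not a type 1 UUID: %s" % tag)
    return UUID_EPOCH + datetime.timedelta(microseconds=u.time // 10)


def expired_tags(tags, now, max_age=datetime.timedelta(days=7)):
    """Return the tags whose embedded timestamp is older than max_age."""
    return [t for t in tags if now - uuid1_created(t) > max_age]
```

Passing now in explicitly, as a naive UTC datetime, keeps the function easy to test and avoids surprises from the local time zone.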
Making Things Work in Jenkins/Ubuntu
At about this point, I figured things were working well enough on my boot2docker machine that it was time to update the Jenkins virtual machine / Vagrant configuration to run Docker. So, I hacked up the provisioner to install the docker.io package and tried to run things. First, I had to update code that talks to Docker to know where Docker is in an Ubuntu VM. Before, I was keying things off DOCKER_HOST, which I guess is used by the docker CLI and which boot2docker reminds you to set. Easy enough. When I finally got my scripts talking to Docker, they threw a cryptic error. Huh? This worked in boot2docker! When in doubt, always check your package versions. Sure enough, Ubuntu was installing an old Docker version. I added the Docker Apt repo to the Vagrant provisioner and tried again. Bingo - working Docker in an Ubuntu VM!
Choice of storage engines
I started building the BMO Docker images and quickly noticed something: building images was horribly slow. Specifically, the part where new images are committed was taking seemingly forever. 5 to 8 seconds or something. Ugh. This wouldn't really bother me except due to subsequent issues, I found myself changing images enough as part of debugging that image building latency became a huge time sink. I felt I was spending more time waiting for layers to commit than making progress. So, I decided to do something about it. I remembered glancing at an overview of storage options in Docker the week or two prior. I instinctively pinned the difference on different storage drivers between boot2docker and Ubuntu. Sure enough, boot2docker was using aufs and Ubuntu was using devicemapper. OK, now I had identified a potential culprit. Time to test the theory. A few paragraphs into that blog post, I see a sorted list of storage driver priorities. I see aufs first, btrfs second, and devicemapper third. I know aufs has kernel inclusion issues (plus a nasty data leakage bug). I don't want that. devicemapper is slow. I figured the list is ordered for a reason and just attempted to use btrfs without really reading the article. Sure enough, btrfs is much faster at committing images than devicemapper. And, it isn't aufs. While images inside btrfs are building, I glance over the article and come to the conclusion that btrfs is in fact good enough for me.
So now I'm running Docker on btrfs on Ubuntu and Docker on aufs in boot2docker. Hopefully that will be the last noticeable difference between host environments. After all, Docker is supposed to abstract all this away, right? I wish.
The Mystery of Inconsistent State
It was here that I experienced the most baffling, mind bending puzzle yet. As I was trying to get things working on the Jenkins/Ubuntu VM - things that had already been proved out in boot2docker - I was running into inexplicable issues during creation of the bootstrapped BMO containers. It seemed that my bootstrapped containers were somehow missing data. It appeared as if bootstrap had completed but data written during bootstrap failed to write. You start the committed/bootstrapped image and bootstrap had obviously completed partially, but it appeared to have never finished. Same Docker version. Same images. Same build scripts. Only the host environment was different. Ummmm, Bueller?
This problem had me totally and completely flabbergasted. My brain turned to mush exhausting possibilities. My initial instinct was this was a filesystem buffering problem. Different storage driver (btrfs vs aufs) means different semantics in how data is flushed, right? I once again started littering code with print statements to record the presence or non-presence of files and content therein. MySQL wasn't seeing all its data, so I double and triple check I'm shutting down MySQL correctly. Who knows, maybe one of the options I used to trim the fat from MySQL removed some of the safety from writing data and unclean shutdown is causing MySQL to lose data?
While I was investigating this problem, I noticed an additional oddity: I was having trouble getting reliable debug output from running containers (docker logs -f). It seemed as if I was losing log events. I could tell from the state of a container that something happened, but I was seeing no evidence from docker logs -f that that thing actually happened. Weird! On a hunch, I threw some sys.stdout.flush() calls in my Python scripts, and sure enough my missing output started arriving! Pipe buffering strikes again. So, now we have dirty hacks in all the Python scripts related to Docker to unbuffer stdout to prevent data loss. Don't ask how much time was wasted tracking down bad theories due to stdout output being buffered.
Getting back to the problem at hand, I still had Docker containers seemingly losing data. And it was only happening when Ubuntu/btrfs was the host environment for Docker. I eventually exhausted all leads in my "filesystem wasn't flushed" theory. At some point, I compared the output of docker logs -f between boot2docker and Ubuntu and eventually noticed that the bmoweb container in Ubuntu wasn't printing as much. This wasn't obvious at first because the output from bootstrap on Ubuntu looked fine. Besides, the script that waits for bootstrap to complete waits for the Apache HTTP TCP socket to come alive before it gracefully stops the container and snapshots the bootstrapped result: bootstrap must be completing, ignore what docker logs -f says.
Eventually I hit an impasse and resorted to context dumping everything on IRC. Ben Kero was around and he picked up on something almost immediately. He simply said "... systemd?" I knew almost instantly what he was referring to and knew the theory fit the facts. Do you know what I was doing wrong?
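The hack amounts to something like the following sketch (the helper name log is made up; our scripts just sprinkle flush calls where needed):

```python
import sys


def log(message, stream=None):
    """Write one line and flush immediately so log readers see it right away.

    When stdout is a pipe rather than a terminal, Python buffers output in
    blocks, so without the flush a line can sit in the buffer until the
    process exits (or be lost if the process is killed first).
    """
    stream = sys.stdout if stream is None else stream
    stream.write(message + "\n")
    stream.flush()
```

Running the interpreter with -u or setting PYTHONUNBUFFERED in the container's environment should achieve the same thing without touching the scripts.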
I still don't know what and quite frankly I don't care, but something in my Ubuntu host environment had a trigger on the TCP port the HTTP server would be listening on. Remember, I was detecting bootstrap completion by waiting until a TCP socket could be opened to the HTTP server. As soon as that connection was established, we stopped the containers gracefully and took a snapshot of the bootstrapped result. Except on Ubuntu something was accepting that socket open, giving a false positive to my wait code, and triggering early shutdown. Docker issued the signal to stop the container gracefully, but it wasn't finished bootstrapping yet, so it forcefully killed the container, resulting in bootstrap being in a remarkably-consistent-across-runs inconsistent state. Changing the code from wait on TCP socket to wait for valid HTTP response fixed the problem. And just for good measure, I changed the code waiting on the MySQL server to also try to establish an actual connection to the MySQL application layer, not merely a TCP socket.
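A wait loop along these lines captures the difference. This is a sketch rather than our exact code; the function name, the timeouts, and the choice to require a 200 status are all assumptions for illustration.

```python
import http.client
import time
import urllib.request

# Ignore any configured proxy: the target is a local container.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_http(url, timeout=60.0, interval=0.5, request_timeout=5.0):
    """Poll url until it answers with HTTP 200; return False on timeout.

    A bare TCP connect is not enough: something other than the web server
    may accept the connection. Only a well-formed HTTP response counts.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=request_timeout) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Refused, reset, timed out, garbage reply, or an error status
            # (urllib's URLError and HTTPError are OSError subclasses).
            pass
        time.sleep(interval)
    return False
```

The same principle applies to MySQL: open a real client connection and run a trivial query instead of trusting that the port accepts connections.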
After solving this mystery, I thought there's no way I could be so blind as to not see the container receiving the stop signal during bootstrap. So, I changed things back to prove to myself I wasn't on crack. No matter how hard I tried, I could not get the logs to show that the signal was received. I think what was happening was that my script was starting the container and issuing the graceful stop so quickly that it wasn't captured by log clients. Sure enough, adding some sleeps in the proper places made it possible to catch the events in action. In hindsight, I suppose I could have used docker events to shed some light on this as well. If Docker persisted logs/output from containers and allowed me to scroll back in time, I think this would have saved me. Although, there's a chance my entrypoint script wouldn't have informed me about the received signal. Perhaps checksetup.pl was ignoring it? What I really need is a unified event + log stream from Docker containers so I can debug exactly what's going on.
Everything is Working, Right?
After solving the inconsistent bootstrap state problem, things were looking pretty good. I had BMO bootstrapping and running from tests on both boot2docker and Ubuntu hosts. Tests were seeing completely independent environments and there were no race conditions. I was nearly done.
So, I started porting more and more tests to Docker. I started running tests more and more. Things worked. Most of the time. But I'm still frustrated by periodic apparent bugs in Docker. For example, our containers periodically fail to shut down. Our images periodically fail to delete.
During container shutdown and delete at the end of tests, we periodically see error messages like the following:
docker.errors.APIError: 500 Server Error: Internal Server Error ("Cannot destroy container f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: Driver aufs failed to remove root filesystem f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: rename /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255 /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255-removing: device or resource busy")
500 Server Error: Internal Server Error ("Cannot destroy container 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: Unable to remove filesystem for 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: remove /mnt/sda1/var/lib/docker/containers/7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: directory not empty")
Due to the way we're executing tests (Mercurial's .t test format), this causes the test's output to change and the test to fail. Sadness.
I think these errors are limited to boot2docker/aufs. But we haven't executed enough test runs in the Jenkins/Ubuntu/btrfs VM yet to be sure. This definitely smells like a bug in Docker and it is very annoying.
After much wrangling and going deeper in a rabbit hole than I ever felt was possible, I finally managed to get BMO running inside Docker as part of our test infrastructure. We're now building tests for complicated components that touch Mercurial, Review Board, and Bugzilla and people are generally happy with how things work.
There are still a handful of bugs, workarounds, and components that aren't as optimal as I would like them to be. But you can't always have perfection.
My takeaway from this ordeal is that Docker still has a number of bugs and user experience issues to resolve. I really want to support Docker and to see it succeed. But every time I try doing something non-trivial with Docker, I get bit hard. Yes, some of the issues I experienced were due to my own ignorance. But at the same time, if one of Docker's mantras is about simplicity and usability, then should there be such gaping cracks for people like me to fall through?
In the end, the promise of Docker fits my use case almost perfectly. I know the architecture is a good fit for testing. We will likely stick with Docker, especially now that I've spent the time to make it work. I really wish this project would have taken a few days, not a few weeks.
Docker is a really nifty tool. It vastly lowers the barrier to distributing and executing applications. It forces people to think about building server side code as a collection of discrete applications and services. When it was released, I instantly realized its potential, including for uses it wasn't primarily intended for, such as applications in automated build and test environments.
Over the months, Docker's feature set has grown and many of its shortcomings have been addressed. It's more usable than ever. Most of my early complaints and concerns have been addressed or are actively being addressed.
But one supposedly solved part of Docker still bothers me: image creation.
One of the properties that gets people excited about Docker is the ability to ship execution environments around as data. Simply produce an image once, transfer it to a central server, pull it down from anywhere, and execute. That's pretty damn elegant. I dare say Docker has solved the image distribution problem. (Ignore for a minute that the implementation detail of how images map to filesystems still has a few quirks to work out. But they'll solve that.)
The ease at which Docker manages images is brilliant. I, like many, was overcome with joy and marvelled at how amazing it was. But as I started producing more and more images, my initial excitement turned to frustration.
The thing that bothers me most about images is that the de facto and recommended method for producing images is neither deterministic nor results in minimal images. I strongly believe that the current recommended and applied approach is far from optimal and has too many drawbacks. Let me explain.
If you look at the Dockerfiles from the official Docker library (examples: Node, MySQL), you notice something in common: they tend to use apt-get update as one of their first steps. For those not familiar with Apt, that command will synchronize the package repository indexes with a remote server. In other words, depending on when you run the command, different versions of packages will be pulled down and the result of image creation will differ. The same thing happens when you clone a Git repository. Depending on when you run the command - when you create the image - you may get different output. If you create an image from scratch today, it could have a different version of say Python than it did the day before. This can be a big deal, especially if you are trying to use Docker to accurately reproduce environments.
This non-determinism of building Docker images really bothers me. It seems to run counter to Docker's goal of facilitating reliable environments for running applications. Sure, one person can produce an image once, upload it to a Docker Registry server, and have others pull it. But there are applications where independent production of the same base image is important.
One area is the security arena. There are many people who are justifiably paranoid about running binaries produced by others and pre-built Docker images set off all kinds of alarms. So, these people would rather build an image from source, from a Dockerfile, than pull binaries. Except then they build the image from a Dockerfile and the application doesn't run because of an incompatibility with a new version of some random package whose version wasn't pinned. Of course, you probably lost numerous hours tracking down this obscure reason. How frustrating! Determinism and verifiability as part of Docker image creation help solve this problem.
Deterministic image building is also important for disaster recovery. What happens if your Docker Registry and all hosts with copies of its images go down? If you go to build the images from scratch again, what guarantee do you have that things will behave the same? Without determinism, you are taking a risk that things will be different and your images won't work as intended. That's scary. (Yes, Docker is no different here from existing tools that attempt to solve this problem.)
What if your open source product relies on a proprietary component that can't be legally distributed? So much for Docker image distribution. The best you can do is provide a base image and instructions for completing the process. But if that doesn't work deterministically, your users now have varying Docker images, again undermining Docker's goal of increasing consistency.
My other main concern about Docker images is that they tend to be large, both in size and in scope. Many Docker images use a full Linux install as their base. A lot of people start with a base e.g. Ubuntu or Debian install, apt-get install the required packages, do some extra configuration, and call it a day. Simple and straightforward, yes. But this practice makes me more than a bit uneasy.
One of the themes surrounding Docker is minimalism. Containers are lighter than VMs; just ship your containers around; deploy dozens or hundreds of containers simultaneously; compose your applications of many, smaller containers instead of larger, monolithic ones. I get it and am totally on board. So why are Docker images built on top of the bloaty excess of a full operating system (modulo the kernel)? Do I really need a package manager in my Docker image? Do I need a compiler or header files so I can e.g. build binary Python extensions? No, I don't, thank you.
As a security-minded person, I want my Docker images to consist of only the files they need, especially binary files. By leaving out non-critical elements from your image and your run-time environment, you are reducing the surface area to attack. If your application doesn't need a shell, don't include a shell and don't leave yourself potentially vulnerable to shellshock. I want the attacker who inevitably breaks out of my application into the outer container to get nothing, not something that looks like an operating system and has access to tools like curl and wget that could potentially be used to craft a more advanced attack (which might even be able to exploit a kernel vulnerability to break out of the container). Of course, you can and should pursue additional security protections in addition to attack surface reduction to secure your execution environment. Defense in depth. But that doesn't give Docker images a free pass on being bloated.
Another reason I want smaller containers is... because they are smaller. People tend to have relatively slow upload bandwidth. Pushing Docker images that can be hundreds of megabytes clogs my tubes. However, I'll gladly push 10, 20, or even 50 megabytes of only the necessary data. When you factor in that Docker image creation isn't deterministic, you also realize that different people are producing different versions of images from the same Dockerfiles and that you have to spend extra bandwidth transferring the different versions around. This bites me all the time when I'm creating new images and am experimenting with the creation steps. I tend to bypass the fake caching mechanism (fake because the output isn't deterministic) and this really results in data explosion.
I understand why Docker images are neither deterministic nor minimal: making them so is a hard problem. I think Docker was right to prioritize solving distribution (it opens up many new possibilities). But I really wish some effort could be put into making images deterministic (and thus verifiable) and more minimal. I think it would make Docker an even more appealing platform, especially for the security conscious. (As an aside, I would absolutely love if we could ship a verifiable Firefox build, for example.)
These are hard problems. But they are solvable. Here's how I would do it.
First, let's tackle deterministic image creation. Despite computers and software being ideally deterministic, building software tends not to be, so deterministic image creation is a hard problem. Even tools like Puppet and Chef which claim to solve aspects of this problem don't do a very good job with determinism. Read my post on The Importance of Time on Machine Provisioning for more on the topic. But there are solutions. NixOS and the Nix package manager have the potential to be used as the basis of a deterministic image building platform. The high-level overview of Nix is that the inputs and contents of a package determine the package ID. If you know how Git or Mercurial get their commit SHA-1's, it's pretty much the same concept. In theory, two people on different machines start with the same environment and bootstrap the exact same packages, all from source. Gitian is a similar solution, although I prefer Nix's content-based approach and how it goes about managing packages and environments. Nix feels so right as a base for deterministically building software. Anyway, yes, fully verifiable build environments are turtles all the way down (I recommend reading Tor's overview of the problem and their approach). However, Nix's approach addresses many of the turtles and silences most of the critics. I would absolutely love if more and more Docker images were the result of a deterministic build process like Nix. Perhaps you could define the full set of packages (with versions) that would be used. Let's call this the package manifest. You would then PGP sign and distribute your manifest. You could then have Nix step through all the dependencies, compiling everything from source. If PGP verification fails, compilation output changes, or extra files are needed, the build aborts or issues a warning. I have a feeling the security-minded community would go crazy over this. I know I would.
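To make the "inputs determine the ID" idea concrete, here's a toy Python sketch. To be clear, this is not Nix's actual hashing scheme, just an illustration of the concept; the function name, the fields, and the choice of SHA-256 over JSON are all mine.

```python
import hashlib
import json


def package_id(name, version, source_sha256, dependency_ids):
    """Derive an ID from everything that goes into building a package.

    Dependencies are referenced by their own IDs, so a change anywhere
    in the dependency tree changes the ID of everything built on top.
    """
    payload = json.dumps(
        {
            "name": name,
            "version": version,
            "source": source_sha256,
            "deps": sorted(dependency_ids),  # order must not matter
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two people who compute the same ID on different machines know they are talking about the same sources, versions, and dependency tree. A signed manifest could then be little more than a list of such IDs.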
OK, so now you can use Nix to produce packages (and thus images) (more) deterministically. How do you make them minimal? Well, instead of just packaging the entire environment, I'd employ tools like makejail. The purpose of makejail is to create minimal chroot jail environments. These are very similar to Docker/LXC containers. In fact, you can often take a tarball of a chroot directory tree and convert it into a Docker container! With makejail, you define a configuration file saying among other things what binaries to run inside the jail. makejail will trace file I/O of that binary and copy over accessed files. The result is an execution environment that (hopefully) contains only what you need. Then, create an archive of that environment and pipe it into docker build to create a minimal Docker image.
In summary, Nix provides you with a reliable and verifiable build environment. Tools like makejail pare down the produced packages into something minimal, which you then turn into your Docker image. Regular people can still pull binary images, but they are much smaller and more in tune with Docker's principles of minimalism. The paranoid among us can produce the same bits from source (after verifying the inputs look credible and waiting through a few hours of compiling). Or, perhaps the individual files in the image could be signed and thus verified via trust somehow? The company deploying Docker can have peace of mind that disaster scenarios resulting in Docker image loss should not result in total loss of the image (just rebuild it exactly as it was before).
You'll note that my proposed solution does not involve Dockerfiles as they exist today. I just don't think Dockerfile's design of stackable layers of commands is the right model, at least for people who care about determinism and minimalism. You really want a recipe that knows how to create a set of relevant files and some metadata like what ports to expose, what command to run on container start, etc and turn that into your Docker image. I suppose you could accomplish this all inside Dockerfiles. But that's a pretty radical departure from how Dockerfiles work today. I'm not sure the two solutions are compatible. Something to think about.
I'm pretty sure of what it would take to add deterministic and verifiable building of minimal and more secure Docker images. And, if someone solved this problem, it could be applicable outside of Docker (again, Docker images are essentially chroot environments plus metadata). As I was putting the finishing touches on this article, I discovered nix-docker. It looks very promising! I hope the Docker community latches on to these ideas and makes deterministic, verifiable, and minimal images the default, not the exception.