Warning: This post is long and rambling. There is marginal value in reading beyond the first few paragraphs unless you care about Docker.
I recently wrote about how Mozilla tests version control. In this post, I want to talk about the part of that effort that consumed the most time: adding Docker support to the test harness.
Introducing the Problem and Desired End State
Running Docker containers inside tests just seems like an obvious thing you'd want to do. I mean, wouldn't it be cool if your tests could spin up MySQL, Redis, Cassandra, Nginx, etc inside Docker containers and test things against actual instances of the things running in your data centers? Of course it would! If you ask me, this approach beats mocking because many questions around accuracy of the mocked interface are removed. Furthermore, you can run all tests locally, while on a plane: no data center or staging environment required. How cool is that! And, containers are all isolated so there's no need to pollute your system with extra packages and system services. Seems like wins all around.
When Mozilla started adding customizations to the Review Board code review software in preparation for deploying it at Mozilla as a replacement for Bugzilla's Splinter, it quickly became apparant that we had a significant testing challenge ahead of us. We weren't just standing up Review Board and telling people to use it, we were integrating user authentication with Bugzilla, having Review Board update Bugzilla after key events, and were driving the initiation of code review in Review Board by pushing code to a Mercurial server. That's 3 user-visible services all communicating with each to expose a unified workflow. It's the kind of thing testing nightmares are made of.
During my early involvement with the project, I recognized the challenge ahead and was quick to insist that we write automated tests for as much as possible. I insisted that all the code (there are multiple extensions to ReviewBoard, a Mercurial hook, and a Mercurial extension) live under one common repository and share testing. That way we could tinker with all the parts easily and test them in concern without having to worry about version sync. We moved all the code to the version-control-tools repository and Review Board was the driving force behind improvements to the test harness in that repository. We had Mercurial .t tests starting Django dev servers hosting Review Board running from per-test SQLite databases and all was nice. Pretty much every scenario involving the interaction between Mercurial and ReviewBoard was tested. If you cared about just these components, life was happy.
A large piece of the integration story was lacking in this testing world: Bugzilla. We had somewhat complex code for having Review Board and Bugzilla talk to each other but no tests for it because nobody had yet hooked Bugzilla up to the tests. As my responsibilities in the project expanded from covering just the Mercurial and Review Board interaction to Bugzilla as well, I again looked at the situation and thought there's a lot of complex interaction here and alpha testing has revealed the presence of many bugs: we need a better testing story. So, I set out to integrate Bugzilla into the test harness.
My goals were for Review Board tests to be able to make requests against a Bugzilla instance configured just like bugzilla.mozilla.org, to allow tests to execute concurrently (don't make developers wait on machines), for tests to run as quickly as possible, to run tests in an environment as similar to production as possible, and to be able to run tests from a plane or train or anywhere without internet connectivity. I was unwilling to budge on these core testing requirements because they represent what's best from test accuracy and developer productivity standpoints: you want your tests to be representative of the real world and you want to enable people to hack on this service anywhere, anytime, and not be held back by tests that take too long to execute. Move fast and don't break things.
Before I go on, it's time for a quick aside on tolerable waiting times. Throughout this article I talk about minimizing the run time of tests. This may sound like premature optimization. I argue it isn't, at least not if you are optimizing for developer productivity. There is a fair bit of academic research in this area. A study on tolerable waiting time: how long are Web users willing to wait gets cited a lot. It says 2 seconds for web content. If you read a few paragraphs in, it references other literature. They disagree on specific thresholds, but one thing is common: the thresholds are typically low - just a few seconds. The latencies I deal with are all longer than what research says leads to badness. When given a choice, I want to optimize workflows for what humans are conditioned to tolerate. If I can't do that, I've failed and the software will be ineffective.
The architecture of Bugzilla created some challenges and eliminated some implementation possibilities. First, I wasn't using any Bugzilla: I was using Mozilla's branch of Bugzilla that powers bugzilla.mozilla.org. Let's call it BMO. I could try hosting it from local SQLite files and running a local, Perl-based HTTP server (Bugzilla is written in Perl). But my experience with Perl and takeaways from talking to the BMO admins was that pain would likely be involved. Plus, this would be a departure from test accuracy. So, I would be using MySQL, Apache HTTPd, and mod_perl, just like BMO uses them in production.
Running Apache and MySQL is always a... fun endeavor. It wasn't a strict requirement, but I also highly preferred that the tests didn't pollute the system they ran on. In other words, having tests connect to an already-running MySQL or Apache server felt like the wrong solution. That's just one more thing people must set up and run locally to run the tests. That's just one more thing that could differ from production and cause bad testing results. It felt like a dangerous approach. Plus, there's the requirement to run things concurrently. Could I have multiple tests talking to the same MySQL server concurrently? They'd have to use separate databases so they don't conflict. That's a possibility. Honestly, I didn't entertain the thought of running Apache and MySQL manually for too long. I knew about this thing called Docker and that it theoretically fit my use case perfectly: construct building blocks for your application and then dymanically hook things up. Perfect. I could build Docker containers for all the required services and have each test start a new, independent set of containers for just that test.
So, I set out integrating Docker into the version-control-tools test harness. Specifically, my goal was to enable the running of independent BMO instances during individual tests. It sounded simple enough.
What I didn't know was that integrating a Dockerized BMO into the test harness would take the better part of 2 weeks. And it's still not up to my standards. This post is the story about the trials and tribulations I encountered along the way. I hope it serves as a warning and potentially a guide for others attempting similar feats. If any Docker developers are reading, I hope it gives you ideas on how to improve Docker.
Running Bugzilla inside Docker
First thing's first: to run BMO inside Docker I needed to make Docker containers for BMO. Fortunately, David Lawrence has prior art here. I really just wanted to take that code, dump it into version-control-tools and call it a day. In hindsight, I probably should have done that. Instead, armed with the knowledge of the Docker best practice of one container per service and David Lawrence's similar wishes to make his code conform to that ideal, I decided to spend some time to fix David's code so that MySQL and Apache were in separate containers, not part of a single container running supervisord. Easy enough, right?
It was relatively easy extracting the MySQL and Apache parts of BMO into separate containers. For MySQL, I started with the official MySQL container from the Docker library and added a custom my.cnf. Simple enough. For Apache, I just copied everything from David's code that wasn't MySQL. I was able to manually hook the containers together using the Docker CLI. It sort of just worked. I was optimistic this project would only take a few hours.
A garbage collection bug in Docker
My first speed bump came as I was iterating on Dockerfiles. All of a sudden I get an error from Docker that it is out of space. Wat? I look at docker images and don't see anything too obvious eating up space. What could be going on? At this point, I'm using boot2docker to host Docker. boot2docker is this nifty tool that allows Windows and OS X users to easily run Docker (Docker requires a Linux host). boot2docker spins up a Linux virtual machine running Docker and tells you how to point your local docker CLI interface at that VM. So, when Docker complains it is out of space, I knew immediately that the VM must be low on space. I SSH into it, run df, and sure enough, the VM is nearly out of space. But I looked at docker images -a and confirmed there's not enough data to fill the disk. What's going on? I can't find the issue right now, but it turns out there is a bug in Docker! When running Docker on aufs filesystems (like boot2docker does), Docker does not always remove data volumes containers when deleting a container. It turns out that the MySQL containers from the official Docker library were creating a data-only container to hold persistent MySQL data that outlives the container itself. These containers are apparently light magic. They are containers that are attached to other containers, but they don't really show up in the Docker interfaces. When you delete the host container, these containers are supposed to be garbage collected. Except on aufs, they aren't. My MySQL containers were creating 1+ GB InnoDB data files on start and the associated data containers were sitting around after container deletion, effectively leaking 1+ GB every time I created a MySQL container, quickly filling the boot2docker disk. Derp.
I worked around this problem by forking the official MySQL container. I didn't need persistent MySQL data (the containers only need to live for one invocation - for the lifetime of a single test), so I couldn't care less about persisted data volumes. So, I changed the MySQL container to hold its data locally, not in a data volume container. The solution was simple enough. But it took me a while to identify the problem. Here I was seeing Docker do something extremely stupid. Surely my understanding of Docker was wrong and I was doing something stupid to cause it to leak data. I spent hours digging through the documentation to make sure I was doing things exactly as recommended. It wasn't until I started an Ubuntu VM and tried the same thing there did I realize this looked like a bug in boot2docker. A few Google searches later led me to a comment hiding at the bottom of an existing GitHub issue that pins aufs as the culprit. And here I thought Docker reached 1.0 and wouldn't have bad bugs like this. I certainly wouldn't expect boot2docker to be shipping a VM with a sub-par storage driver (shouldn't it be using devicemapper or btrfs instead). Whatever.
Wrangling with Mozilla's Branch of Bugzilla
At this point, I've got basic Docker containers for MySQL and Apache+mod_perl+Bugzilla being created. Now, I needed to convert from vanilla Bugzilla to BMO. Should be straightforward. Just change the Git remote URL and branch to check out. I did this and all-of-a-sudden my image started encountering errors building! It turns out that the BMO code base doesn't work on a fresh database! Fortunately, this is a known issue and I've worked around it previously. When I tackled it a few months ago, I spent a handful of hours disecting this problem. It wasn't pretty. But this time I knew what to do. I even had a Puppet manifest for installing BMO on a fresh machine. So, I just needed to translate that Puppet config into Dockerfile commands. No big deal, right? Well, when I did that Puppet config a few months ago, I based it on Ubuntu because I'm more familiar with Debian-based distros and figured Ubuntu would be the easiest since it tends to have the largest package diversity. Unfortunately, David's Docker work is based on Fedora. So, I spent some time converting the Dockerfile to Ubuntu rather than trying to port things to Fedora. Arguably the wrong decision since Mozilla operates the RedHat flavor of Linux distributions in production. But I was willing to trade accuracy for time here, having lost time dealing with the aufs bug.
Unfortunately, I under-estimated how long it would take to port the image to Ubuntu. It didn't take so long from a code change perspective. Instead, most of the time was spent waiting for Docker to run the commands to build the image. In the final version, Apt is downloading and installing over 250 packages. And Bugzilla's bootstrap process installs dozens of packages from CPAN. Every time I made a small change, I invalidated Docker's image building cache, causing extreme delays while waiting for Apt and CPAN to do their thing. This experience partially contributed to my displeasure with how Docker currently handles image creation. If Docker images were composed of pre-built pieces instead of stacked commands, my cache hit rate would have been much higher and I would have converted the image in no time. But no, that's not how things work. So I lost numerous hours through this 2 week process waiting for Docker images to perform operations I've already done elsewhere dozens of times before.
Docker Container Orchestration
After porting the Bugzilla image to Ubuntu and getting BMO to bootstrap in a manually managed container (up to this point I'm using the docker CLI to create images, start containers, etc), it was time to automate the process so that tests could run the containers. At this time, I started looking for tools that performed multiple container orchestration. I had multiple containers that needed to be treated as a single logical unit, so I figured I'd use an existing tool to solve this problem for me. Don't reinvent the wheel unless you have to, right? I discovered Fig, which seemed to fit the bill. I read that it is being integrated into Docker itself, so it must be best of breed. Even if it weren't its future seems to be more certain than other tools. So, I stopped my tools search and used Fig without much consideration for other tools.
Lack of a useful feature in Fig
I quickly whipped up a fig.yml and figured it would just work. Nope! Starting the containers from scratch using Fig resulted in an error. I wasn't sure what the error was at first because Fig didn't tell me. After some investigation, I realized that my bmoweb container (the container holding Apache + BMO code) was failing in its entrypoint command (that's a command that runs when the container starts up, but not the primary command a container runs - that's a bit confusing I know - read the docs). I added some debug statements and quickly realized that Bugzilla was erroring connecting to MySQL. Strange, I thought. Fig is essentially a DSL around manual docker commands, so I checked everything by typing everything into the shell. No error. Again on a new set of containers. No error. I thought maybe my environment variable handling was wrong - that the dynamically allocated IP address and port number of the linked MySQL container being passed to the bmoweb container weren't getting honored. I added some logging to disprove that theory. The wheels inside my brain spun for a little bit. And, aided by some real-time logging, I realized I was dealing with a race condition: Fig was starting the MySQL and bmoweb containers concurrently and bmoweb was attempting to access the MySQL server before MySQL had fully initialized and started listening on its TCP port! That made sense. And I think it's a reasonable optimization for Fig to start containers concurrently to speed up start time. But surely a tool that orchestrates different containers has considered the problem of dependencies and has a mechanism to declare them to prevent these race conditions. I check the online docs and there's nothing to be found. A red panda weeps. So, I change the bmoweb entrypoint script to wait until it can open a TCP socket to MySQL before actually using MySQL and sure enough, the race condition goes away and the bmoweb container starts just fine!
OK, I'm real close now. I can feel it.
I start playing around with manually starting and stopping containers as part of a toy test. The good news is things appear to work. The bad news is it is extremely slow. It didn't take long for me to realize that the reason for the slowness is Bugzilla's bootstrap on first run. Bugzilla, like many complex applications, has a first run step that sets up database schema, writes out some files on the filesystem, inserts some skeleton data in the database, creates an admin user, etc. Much to my dismay this was taking a long time. Something on the order of 25 to 30 seconds. And that's on a Haswell with plenty of RAM and an SSD. Oy. The way things are currently implemented would result in a 25 to 30 second delay when running every test. Change 1 line and wait say 25s for any kind of output. Are you kidding me?! Unacceptable. It violated my core goal of having tests that are quick to run. Again, humans should not have to wait on machines.
I think about this problem for like half a second and the solution is obvious: take a snapshot of the bootstrapped images and start instances of that snapshot from tests. In other words, you perform the common bootstrap operations once and only once. And, you can probably do that outside the scope of running tests so that the same snapshot can be used across multiple invocations of the test harness. Sounds simple! To the Docker uninitiated, it sounds like the solution would be to move the BMO bootstrapping into the Dockerfile code so it gets executed at image creation time. Yes, that would be ideal. Unfortunately, when building images via Dockerfile, you can't tell Docker to link that image to another container. Without container linking, you can't have MySQL. Without MySQL, you can't do BMO bootstrap. So, BMO bootstrap must be done during container startup. And in Docker land, that means putting it as part of your entrypoint script (where it was conveniently already located for this reason).
Talking Directly to the Docker API
Of course, the tools that I found that help with Docker image building and container orchestration don't seem to have an answer for this create a snapshot of a bootstrapped container problem. I'm sure someone has solved this problem. But in my limited searching, I couldn't find anything. And, I figured the problem would be easy enough to solve manually, so I set about creating a script to do it. I'm not a huge fan of shell script for automation. It's hard to debug and simple things can be hard and hard things can be harder. Plus, why solve solutions such as parsing output for relevant data when you can talk to an API directly and get native types. Since the existing test harness automation in version-control-tools was written in Python, I naturally decided to write some Python to create the bootstrapped images. So, I do a PyPI search and discover docker-py, a Python client library to the Docker Remote API, an HTTP API that the Docker daemon runs and is what the docker CLI tool itself uses to interface with Docker. Good, now I have access to the full power of Docker and am not limited by what the docker CLI may not expose. So, I spent some time looking at source and the Docker Remote API documentation to get an understanding of my new abilities and what I'd need to do. I was pleasantly surprised to learn that the docker CLI is pretty similar to the Remote API and the Python API was similar as well, so the learning was pretty shallow. Yay for catching a break!
Confusion Over Container Stopping
I wrote some Python for building the BMO images, launching the containers, committing the result, and saving state to disk (so it could be consulted later - preventing a bootstrap by subsequent consumers). This was pleasantly smooth at first, but I encountered some bumps along the way. First, I didn't have a complete grasp on the differences between stop and kill. I was seeing some weird behavior by MySQL on startup and didn't know why. Turns out I was forcefully killing the container after bootstrap via the kill API and this was sending a SIGKILL to MySQL, effectively causing unclean shutdown. After some documentation reading, I realized stop is the better API - it issues SIGTERM, waits for a grace period, then issues SIGKILL. Issuing SIGTERM made MySQL shut down gracefully and this issue stemming from my ignorance was resolved. (If anyone from Docker is reading, I think the help output for docker kill should mention the forcefullness of the command versus stop. Not all of us remember the relative forcefullness of the POSIX signals and having documentation reinforce their cryptic meaning could help people select the proper command.) A few lines of Python later and I was talking directly to the Docker Remote API, doing everything I needed to do to save (commit in Docker parlance) a bootstrapped BMO environment for re-use among multiple tests.
It was pretty easy to hook the bootstrapped images up to a single test. Just load the bootstrapped image IDs from the config file and start new containers based on them. That's Docker 101 (except I was using Python to do everything).
Concurrent Execution Confuses Bugzilla
Now that I could start Dockerized BMO from a single test, it was time to make things work concurrently. I hooked Docker up to a few tests and launched them in parallel to see what would happen. The containers appeared to start just fine! Great anticipation on my part to design for concurrency from the beginning, I thought. It appeared I was nearly done. Victory was near. So, I changed some tests to actually interact with BMO running from Docker. (Up until this point I was merely starting containers, not doing anything with them.) Immediately I see errors. Cannot connect to Bugzilla http://... connection refused. Huh? It took a few moments, but I realized the experience I had with MySQL starting and this error were very similar. I changed my start BMO containers code to wait for the HTTP server's TCP socket to start accepting connections before returning control and sure enough, I was able to make HTTP requests against Bugzilla running in Docker! Woo!
Next step, make an authenticated query against Bugzilla running in Docker. HTTP request completes... with an internal server error. What?! I successfully browsed BMO from containers days before and was able to log in just fine - this shouldn't be happening. This problem took me ages to diagnose. I traced every step of provisioning and couldn't figure out what was going on. After resorting to print debugging in nearly every component, including Bugzilla's Perl code itself, I found the culprit: Bugzilla wasn't liking the dynamic nature of the MySQL and HTTP endpoints. You see, when you start Docker containers, network addresses change. The IP address assigned to the container is whatever is available to Docker at the time the container was started. Likewise the IP address and port number of linked services can change. So, your container entrypoint has to deal with this dynamic nature of addresses. For example, if you have a configuration file, you need to update that configuration file on every run with the proper network address info. My Bugzilla entrypoint script was doing this. Or so I thought. It turns out that Bugzilla's bootstrap process has multiple config files. There's an answers file that provides static answers to questions asked when running the bootstrap script (checksetup.pl). checksetup.pl will produce a localconfig file (actually a Perl script) containing all that data. There's also a data/params file containing yet more configuration options. And, the way I was running bootstrap, checksetup.pl refused to update files with new values. I initially had the entrypoint script updating only the answers file and running checksetup.pl, thinking checksetup.pl would update localconfig if the answers change. Nope! checksetup.pl only appears to update localconfig if localconfig is missing a value. So, here my entrypoint script was, successully calling checksetup.pl with the proper network values, which checksetup.pl was more than happy to use. But when I started the web application, it used the old values from localconfig and data/params and blew up. Derp. So, to have dynamic MySQL hosts and ports and a dynamic self-referential HTTP URL, I needed to manually update localconfig and data/params during the entrypoint script. The entrypoint script now rewrites Perl scripts during container load to reflect appropriate variables. Oy.
At some point I got working BMO containers running concurrently from multiple tests. This was a huge milestone. But it only revealed my next problem: resource constraints. The running containers were consuming gobs of memory and I couldn't run more than 2 or 3 tests concurrently before running out of memory. Before, I was able to run 8 tests concurrently no problem. Well crap, I just slowed down the test harness significantly by reducing concurrency. No bueno.
Some quick investigation revealed the culprit was MySQL and Apache being greedier than they needed to be. MySQL was consuming 1GB RSS on start. Apache was something like 350 MB. It had been a while since I ran a MySQL server, so I had to scour the net for settings to put MySQL on a diet. The results were not promising. I knew enough about MySQL to know that the answers I found had similar quality to comments on the php.net function documentation circa 2004 (it was not uncommon to see things like SQL injection in the MySQL pages back then - who knows, maybe that's still the case). Anyway, a little tuning later and I was able to get MySQL using a few hundred MB RAM and I reduced the Apache worker pool to something reasonable (maybe 2) to free up enough memory to be able to run tests with the desired concurrency again. If using Docker as part of testing ever takes off, I imagine there will be two flavors of every container: low memory and regular. I'm not running a production service here: I'll happily trade memory for high-end performance as long as it doesn't impact my tests too much.
Caching, Invalidating, and Garbage Collecting Bootstrapped Images
As part of iterating on making BMO bootstrap work, I encountered another problem: knowing when to perform a bootstrap. As mentioned earlier, bootstrap was slow: 25 to 30 seconds. While I had reduced the cost of bootstrap to at most once per test suite execution (as opposed to once per test), there was still the potential for a painful 25-30s delay when running tests. Unacceptable! Furthermore, when I changed how bootstrap worked, I needed a way to invalidate the previous bootstrapped image. Otherwise, we may use an outdated bootstrapped image that doesn't represent the environment it needs to and test execution would fail. How should I do this?
Docker has considered this problem and they have a solution: build context. When you do a docker build, Docker takes all the files from the directory containing the Dockerfile and makes them available to the environment doing the building. If you ADD one of these files in your Dockerfile, the image ID will change if the file changes, invalidating the cache used by Docker to build images. So, if I ADDed the scripts that perform BMO bootstrap to my Docker images, Docker would automagically invalidate the built images and force a bootstrap for me. Nice! Unfortunately, docker build doesn't allow you to add files outside of the current directory to the build context. Internet sleuthing reveals the solution here is to copy things to a temporary directory and run docker build from that. Seriously? Fortunately, I was using the Docker API directly via Python. And that API simply takes an archive of files. And since you can create archives dynamically inside Python using e.g. tarfile, it wasn't too difficult to build proper custom context archives that contained my extra data that could be used to invalidate bootstrapped images. I threw some simple ADD directives into my Dockerfiles and now I got bootstrapped image invalidation!
To avoid having to perform bootstrap on every test run, I needed a mapping between the base images and the bootstrapped result. I ended up storing this in a simple JSON file. I realize now I could have queried Docker for images having the base image as its parent since there is supposed to be a 1:1 relationship between them. I may do this as a follow-up.
With the look-up table in place, ensuring bootstrapped images were current involved doing a couple docker builds, finding the bootstrapped images from those base images, and doing the bootstrap if necessary. If everything is up-to-date, docker build finishes quickly and we have less than 1s of delay. Very acceptable. If things aren't current, well, there's not much you can do there if accuracy is important. I was happy with where I was.
Once I started producing bootstrapped images every time the code impacting the generation of that image changed, I ran into a new problem: garbage collection. All those old bootstrapped images were piling up inside of Docker! I needed a way to prune them. Docker has support for associating a repository and a tag with images. Great, I thought, I'll just associate all images with a well-defined repository, leave the tag blank (because it isn't really relevant), and garbage collection will iterate over all images in to-be-pruned repositories and delete all but the most recent one. Of course, this simple solution did not work. As far as I can tell, Docker doesn't really let you have multiple untagged images. You can set a repository with no tag and Docker will automagically assign the latest tag to that image. But the second you create a new image in that repository, the original image loses that repository association. I'm not sure if this is by design or a bug, but it feels wrong to me. I want the ability to associate tags with images (and containers) so I can easily find all entities in a logical set. It seemed to me that repository facilitated that need (albeit with the restriction of only associating 1 identifier per image). My solution here was to assign type 1 UUIDs to the tag field for each image. This forced Docker to retain the repository association when new images were created. I chose type 1 UUIDs so I can later extract the time component embedded within and do time-based garbage collection e.g. delete all images created more than a week ago.
Making Things Work in Jenkins/Ubuntu
At about this point, I figured things were working well enough on my boot2docker machine that it was time to update the Jenkins virtual machine / Vagrant configuration to run Docker. So, I hacked up the provisioner to install the docker.io package and tried to run things. First, I had to update code that talks to Docker to know where Docker is in an Ubuntu VM. Before, I was keying things off DOCKER_HOST, which I guess is used by the docker CLI and boot2docker reminds you to set. Easy enough. When I finally got things talking to Docker, my scripts threw a cryptic error when talking to Docker. Huh? This worked in boot2docker! When in doubt, always check your package versions. Sure enough, Ubuntu was installing an old Docker version. I added the Docker Apt repo to the Vagrant provisioner and tried again. Bingo - working Docker in an Ubuntu VM!
Choice of storage engines
I started building the BMO Docker images quickly noticed something: building images was horribly slow. Specifically, the part where new images are committed was taking seemingly forever. 5 to 8 seconds or something. Ugh. This wouldn't really bother me except due to subsequent issues, I found myself changing images enough as part of debugging that image building latency became a huge time sink. I felt I was spending more time waiting for layers to commit than making progress. So, I I decided to do something about it. I remembered glancing at an overview of storage options in Docker the week or two prior. I instinctively pinned the difference on different storage drivers between boot2docker and Ubuntu. Sure enough, boot2docker was using aufs and Ubuntu was using devicemapper. OK, now I identified a potential culprit. Time to test the theory. A few paragraphs into that blog post, I see a sorted list of storage driver priorities. I see aufs first, btrfs second, and devicemapper third. I know aufs has kernel inclusion issues (plus a nasty data leakage bug). I don't want that. devicemapper is slow. I figured the list is ordered for a reason and just attempted to use btrfs without really reading the article. Sure enough, btrfs is much faster at committing images than devicemapper. And, it isn't aufs. While images inside btrfs are building, I glance over the article and come to the conclusion that btrfs is in fact good enough for me.
So now I'm running Docker on btrfs on Ubuntu and Docker on aufs in boot2docker. Hopefully that will be the last noticable difference between host environments. After all, Docker is supposed to abstract all this away, right? I wish.
The Mystery of Inconsistent State
It was here that I experienced the most baffling, mind bending puzzle yet. As I was trying to get things working on the Jenkins/Ubuntu VM - things that had already been proved out in boot2docker - I was running into inexplicable issues during creation of the bootstrapped BMO containers. It seemed that my bootstrapped containers were somehow missing data. It appeared as if bootstrap had completed but data written during bootstrap failed to write. You start the committed/bootstrapped image and bootstrap had obviously completed partially, but it appeared to have never finished. Same Docker version. Same images. Same build scripts. Only the host environment was different. Ummmm, Bueller?
This problem had me totally and completely flabbergasted. My brain turned to mush exhausting possibilities. My initial instinct was this was a filesystem buffering problem. Different storage driver (btrfs vs aufs) means different semantics in how data is flushed, right? I once again started littering code with print statements to record the presence or non-presence of files and content therein. MySQL wasn't seeing all its data, so I double and triple check I'm shutting down MySQL correctly. Who knows, maybe one of the options I used to trim the fat from MySQL removed some of the safety from writing data and unclean shutdown is causing MySQL to lose data?
While I was investigating this problem, I noticed an additional oddity: I was having trouble getting reliable debug output from running containers (docker log -f). It seemed as if I was losing log events. I could tell from the state of a container that something happened, but I was seeing no evidence from docker logs -f that that thing actually happened. Weird! On a hunch, a threw some sys.stdout.flush() calls in my Python scripts, and sure enough my missing output started arriving! Pipe buffering strikes again. So, now we have dirty hacks in all the Python scripts related to Docker to unbuffer stdout to prevent data loss. Don't ask how much time was wasted tracking down bad theories due to stdout output being buffered.
Getting back to the problem at hand, I still hand Docker containers seemingly lose data. And it was only happening when Ubuntu/btrfs was the host environment for Docker. I eventually exhausted all leads in my filesystem wasn't flushed theory. At some point, I compared the logs of docker logs -f between boot2docker and Ubuntu and eventually noticed that the bmoweb container in Ubuntu wasn't printing as much. This wasn't obvious at first because the output from bootstrap on Ubuntu looked fine. Besides, the script that waits for bootstrap to complete waits for the Apache HTTP TCP socket to come alive before it gracefully stops the container and snapshots the bootstrapped result: bootstrap must be completing, ignore what docker logs -f says.
Eventually I hit an impasse and resort to context dumping everything on IRC. Ben Kero is around and he picks up on something almost immediately. He simply says ... systemd?. I knew almost instantly what he was referring to and knew the theory fit the facts. Do you know what I was doing wrong?
I still don't know what and quite frankly I don't care, but something in my Ubuntu host environment had a trigger on the TCP port the HTTP server would be listening on. Remember, I was detecting bootstrap completion by waiting until a TCP socket could be opened to the HTTP server. As soon as that connection was established, we stopped the containers gracefully and took a snapshot of the bootstrapped result. Except on Ubuntu something was accepting that socket open, giving a false positive to my wait code, and triggering early shutdown. Docker issued the signal to stop the container gracefully, but it wasn't finished bootstrapping yet, so it forcefully killed the container, resulting in bootstrap being in a remarkably-consistent-across-runs inconsistent state. Changing the code from wait on TCP socket to wait for valid HTTP response fixed the problem. And just for good measure, I changed the code waiting on the MySQL server to also try to establish an actual connection to the MySQL application layer, not merely a TCP socket.
After solving this mystery, I thought there's no way I could be so blind as to not see the container receiving the stop signal during bootstrap. So, I changed things back to prove to myself I wasn't on crack. No matter how hard I tried, I could not get the logs to show that the signal was received. I think what was happening was that my script was starting the container and issuing the graceful stop so quickly that it wasn't captured by log clients. Sure enough, adding some sleeps in the proper places made it possible to catch the events in action. In hindsight, I suppose I could have used docker events to shed some light on this as well. If Docker persisted logs/output from containers and allowed me to scroll back in time, I think this would have saved me. Although, there's a chance my entrypoint script wouldn't have informed me about the received signal. Perhaps checksetup.pl was ignoring it? What I really need is a unified event + log stream from Docker containers so I can debug exactly what's going on.
Everything is Working, Right?
After solving the inconsistent bootstrap state problem, things were looking pretty good. I had BMO bootstrapping and running from tests on both boot2docker and Ubuntu hosts. Tests were seeing completely independent environments and there were no race conditions. I was nearly done.
So, I started porting more and more tests to Docker. I started running tests more and more. Things worked. Most of the time. But I'm still frustrated by periodic apparent bugs in Docker. For example, our containers periodically fail to shut down. Our images periodically fail to delete.
During container shutdown and delete at the end of tests, we periodically see error messagess like the following:
docker.errors.APIError: 500 Server Error: Internal Server Error ("Cannot destroy container f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: Driver aufs failed to remove root filesystem f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: rename /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255 /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255-removing: device or resource busy")
500 Server Error: Internal Server Error ("Cannot destroy container 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: Unable to remove filesystem for 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: remove /mnt/sda1/var/lib/docker/containers/7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: directory not empty")
Due to the way we're executing tests (Mercurial's .t test format), this causes the test's output to change and the test to fail. Sadness.
I think these errors are limited to boot2docker/aufs. But we haven't executed enough test runs in the Jenkins/Ubuntu/btrfs VM yet to be sure. This definitely smells like a bug in Docker and it is very annoying.
After much wrangling and going deeper in a rabbit hole than I ever felt was possible, I finally managed to get BMO running inside Docker as part of our test infrastructure. We're now building tests for complicated components that touch Mercurial, Review Board, and Bugzilla and people are generally happy with how things work.
There are still a handful of bugs, workarounds, and components that aren't as optimal as I would like them to be. But you can't always have perfection.
My takeaway from this ordeal is that Docker still has a number of bugs and user experience issues to resolve. I really want to support Docker and to see it succeed. But every time I try doing something non-trivial with Docker, I get bit hard. Yes, some of the issues I experienced were due to my own ignorance. But at the same time, if one of Docker's mantras is about simplicity and usability, then should there be such gaping cracks for people like me to fall through?
In the end, the promise of Docker fits my use case almost perfectly. I know the architecture is a good fit for testing. We will likely stick with Docker, especially now that I've spent the time to make it work. I really wish this project would have taken a few days, not a few weeks.
Version control services and interaction with them play an important role at any company. Despite version control being a critical part of your infrastructure, my experience from working at a few companies and talking with others is that version control often doesn't get the testing love that other services do. Hooks get written, spot-tested by the author, and deployed. Tools that interact with version control often rely on behavior that may or may not change over time, especially when the version of your version control software is upgraded.
We've seen this pattern at Mozilla. Mercurial hooks and extensions were written and deployed to the server without test coverage. As a result, things break when we try to upgrade the server. This happens a few times and you naturally develop an attitude of fear, uncertainty, and doubt around touching anything on the server (or the clients for that matter). If it isn't broken, why fix it prevails for months or years. Then one an enthusiastic individual comes around wanting to deploy some hot new functionality. You tell them the path is arduous because the server is running antiquated versions of software and nothing is tested. The individual realizes the amazing change isn't worth the effort and justifiably throws up their hands and gives up. This is almost a textbook definition of how not having test coverage can result in technical debt. This is the position Mozilla is trying to recover from.
One of the biggest impacts I've had since joining the Developer Services Team at Mozilla a little over a month ago has been changing the story about how we test version control at Mozilla.
I'm proud to say that Mozilla now has a robust enough testing infrastructure in place around our Mercurial server that we're feeling pretty good about silencing the doubters when it comes to changing server behavior. Here's how we did it.
The genesis of this project was likely me getting involved with the hg-git and Mercurial projects. For hg-git, I learned a bit about Mercurial internals and how extensions work. When I looked at Mercurial extensions and hooks used by Mozilla, I started to realize what parts were good and what parts were bad. I realized what parts would likely break after upgrades. When I started contributing patches to Mercurial itself, I took notice of how Mercurial is tested. When I discovered T Tests, I thought, wow, that's pretty cool: we should use them to test Mozilla's Mercurial customizations!
After some frustrations with Mercurial extensions breaking after Mercurial upgrades, I wanted to do something about it to prevent this from happening again. I'm a huge fan of unified repositories. So earlier this year, I reached out to the various parties who maintain all the different components and convinced nearly everyone that establishing a single repository for all the version control code was a good idea. The version-control-tools repository was born. Things were slow at first. It was initially pretty much my playground for hosting Mercurial extensions that I authored. Fast forward a few months, and the version-control-tools repository now contains full history imports of our Mercurial hooks that are deployed on hg.mozilla.org, the templates used to render HTML on hg.mozilla.org, and pretty much every Mercurial extension authored by Mozillians, including pushlog. Having all the code in one repository has been very useful. It has simplified server deployments: we now pull 1 repository instead of 3. If there is a dependency between different components, we can do the update atomically. These are all benefits of using a single repository instead of N>1.
While version-control-tools was still pretty much my personal playground, I introduced a short script for running tests. It was pretty basic: just find test files and invoke them with Mercurial's test harness. It served my needs pretty well. Over time, as more and more functionality was rolled into version-control-tools, we expanded the scope of the test harness.
We can now run Python unit tests (in addition to Mercurial .t tests). Test all of the things!
We set up continuous integration with Jenkins so tests run after check-in and alert us when things fail.
We added support for running tests against multiple versions of Mercurial. We even test the bleeding edge of Mercurial so we know when an upstream Mercurial change breaks our code. So, no more surprises on Mercurial release day. I can tell you today that we have a handful of extensions that are broken in Mercurial 3.2, due for release around November 1. (Hopefully we'll fix them before release.)
We have Vagrant configurations so you can start a virtual machine that runs the tests the same way Jenkins does.
The latest addition to the test harness is the ability to spin up Docker containers as part of tests. Right now, this is limited to running Bugzilla during tests. But I imagine the scope will only increase over time.
Before I go on, I want to quickly explain how amazing Mercurial's .t tests are. These are a flavor of tests used by Mercurial and the dominant form of new tests added to the version-control-tools repository. These tests are glorified shell scripts annotated with expected command output and other metadata. It might be easier to explain by showing. Take bzpost's tests as an example. The bzpost extension automatically posts commit URLs to Bugzilla during push. Read more if you are interested. What I like so much about .t tests is that they are actually testing the user experience. The test actually runs hg push and verifies the output is exactly what is intended. Furthermore, since we're running a Dockerized Bugzilla server during the test, we're able to verify that the bzpost extension actually resulted in Bugzilla comments being added to the appropriate bug(s). Contrast this with unit tests that only test a subset of functionality. Or, contrast with writing a lot of boilerplate and often hard-to-read code that invokes processes and uses regular expressions, etc to compare output. I find .t tests are more concise and they do a better job of testing user experience. More than once I've written a .t test and thought this user experience doesn't feel right, I should change the behavior to be more user friendly. This happened because I was writing actual end-user commands as part of writing tests and seeing the exact output the user would see. It is much harder to attain this sense of understanding when writing unit tests. I can name a few projects with poor command line interfaces that could benefit from this approach... I'm not saying .t tests are perfect or that they should replace other testing methodologies such as unit tests. I just think they are very useful for accurately testing higher-level functionality and for assessing user experience. I really wish we had these tests for mach commands...
Anyway, with a proper testing harness in place for our version control code, we've been pretty good about ensuring new code is properly tested. When people submit new hooks or patches to existing hooks, we can push back and refuse to grant review unless tests are included. When someone requests a new deployment to the server, we can look at what changed, cross-reference to test coverage, and assess the riskiness of the deployment. We're getting to the point where we just trust our tests and server deployments are minor events. Concerns over accidental regressions due to server changes are waning. We can tell people if you really care about this not breaking, you need a test and if you add a test, we'll support it for you. People are often more than happy to write tests to ensure them peace of mind, especially when that test's presence shifts maintenance responsibility away from them. We're happy because we don't have many surprises (and fire drills) at deployment time. It's a win-win!
So, what's next? Good question! We still have a number of large gaps in our test coverage. Our code to synchronize repositories from the master server to read-only slaves is likely the most critical omission. We also don't yet have a good way of reproducing our server environment. Ideally, we'd run the continuous integration in an environment that's very similar to production. Same package versions and everything. This would also allow us to simulate the actual hg.mozilla.org server topology during tests. Currently, our tests are more unit-style than integration-style. We rely on the consistent behavior of Mercurial and other tools as sufficient proxies for test accuracy and we back those up with running the tests on the staging server before production deployment. But these aren't a substitute for an accurate reproduction of the production servers, especially when it comes to things like the replication tests. We'll get there some day. I also have plans to improve Mercurial's test harness to better facilitate some of our advanced use cases. I would absolutely love to make Mercurial's .t test harness more consumable outside the context of Mercurial. (cram is one such attempt at this.) We also need to incorporate the Git server code into this repository. Currently, I'm pretty sure everything Git at Mozilla is untested. Challenge accepted!
In summary, our story for testing version control at Mozilla has gone from a cobbled together mess to something cohesive and comprehensive. This has given us confidence to move fast without breaking things. I think the weeks of people time invested into improving the state of testing was well spent and will pay enormous dividends going forward. Looking back, the mountain of technical debt now looks like a mole hill. I feel good knowing that I played a part in making this change.
Docker is a really nifty tool. It vastly lowers the barrier to distributing and executing applications. It forces people to think about building server side code as a collection of discrete applications and services. When it was released, I instantly realized its potential, including for uses it wasn't primary intended for, such as applications in automated build and test environments.
Over the months, Docker's feature set has grown and many of its shortcomings have been addressed. It's more usable than ever. Most of my early complaints and concerns have been addressed or are actively being addressed.
But one supposedly solved part of Docker still bothers me: image creation.
One of the properties that gets people excited about Docker is the ability to ship execution environments around as data. Simply produce an image once, transfer it to a central server, pull it down from anywhere, and execute. That's pretty damn elegant. I dare say Docker has solved the image distribution problem. (Ignore for a minute that the implementation detail of how images map to filesystems still has a few quirks to work out. But they'll solve that.)
The ease at which Docker manages images is brilliant. I, like many, was overcome with joy and marvelled at how amazing it was. But as I started producing more and more images, my initial excitement turned to frustration.
The thing that bothers me most about images is that the de facto and recommended method for producing images is neither deterministic nor results in minimal images. I strongly believe that the current recommended and applied approach is far from optimal and has too many drawbacks. Let me explain.
If you look at the Dockerfiles from the official Docker library (examples: Node, MySQL), you notice something in common: they tend to use apt-get update as one of their first steps. For those not familiar with Apt, that command will synchronize the package repository indexes with a remote server. In other words, depending on when you run the command, different versions of packages will be pulled down and the result of image creation will differ. The same thing happens when you clone a Git repository. Depending on when you run the command - when you create the image - you may get different output. If you create an image from scratch today, it could have a different version of say Python than it did the day before. This can be a big deal, especially if you are trying to use Docker to accurately reproduce environments.
This non-determinism of building Docker images really bothers me. It seems to run counter to Docker's goal of facilitating reliable environments for running applications. Sure, one person can produce an image once, upload it to a Docker Registry server, and have others pull it. But there are applications where independent production of the same base image is important.
One area is the security arena. There are many people who are justifiably paranoid about running binaries produced by others and pre-built Docker images set off all kinds of alarms. So, these people would rather build an image from source, from a Dockerfile, than pull binaries. Except then they build the image from a Dockerfile and the application doesn't run because of an incompatibility with a new version of some random package whose version wasn't pinned. Of course, you probably lost numerous hours tracing down this obscure reason. How frustrating! Determinism and verifiability as part of Docker image creation help solve this problem.
Deterministic image building is also important for disaster recovery. What happens if your Docker Registry and all hosts with copies of its images go down? If you go to build the images from scratch again, what guarantee do you have that things will behave the same? Without determinism, you are taking a risk that things will be different and your images won't work as intended. That's scary. (Yes, Docker is no different here from existing tools that attempt to solve this problem.)
What if your open source product relies on a proprietary component that can't be legally distributed? So much for Docker image distribution. The best you can do is provide a base image and instructions for completing the process. But if that doesn't work deterministically, your users now have varying Docker images, again undermining Docker's goal of increasing consistency.
My other main concern about Docker images is that they tend to be large, both in size and in scope. Many Docker images use a full Linux install as their base. A lot of people start with a base e.g. Ubuntu or Debian install, apt-get install the required packages, do some extra configuration, and call it a day. Simple and straightforward, yes. But this practice makes me more than a bit uneasy.
One of the themes surrounding Docker is minimalism. Containers are lighter than VMs; just ship your containers around; deploy dozens or hundreds of containers simultaneously; compose your applications of many, smaller containers instead of larger, monolithic ones. I get it and am totally on board. So why are Docker images built on top of the bloaty excess of a full operating system (modulo the kernel)? Do I really need a package manager in my Docker image? Do I need a compiler or header files so I can e.g. build binary Python extensions? No, I don't, thank you.
As a security-minded person, I want my Docker images to consist of only the files they need, especially binary files. By leaving out non-critical elements from your image and your run-time environment, you are reducing the surface area to attack. If your application doesn't need a shell, don't include a shell and don't leave yourself potentially vulnerable to shellshock. I want the attacker who inevitably breaks out of my application into the outer container to get nothing, not something that looks like an operating system and has access to tools like curl and wget that could potentially be used to craft a more advanced attack (which might even be able to exploit a kernel vulnerability to break out of the container). Of course, you can and should pursue additional security protections in addition to attack surface reduction to secure your execution environment. Defense in depth. But that doesn't give Docker images a free pass on being bloated.
Another reason I want smaller containers is... because they are smaller. People tend to have relatively slow upload bandwidth. Pushing Docker images that can be hundreds of megabytes clogs my tubes. However, I'll gladly push 10, 20, or even 50 megabytes of only the necessary data. When you factor in that Docker image creation isn't deterministic, you also realize that different people are producing different versions of images from the same Dockerfiles and that you have to spend extra bandwidth transferring the different versions around. This bites me all the time when I'm creating new images and am experimenting with the creation steps. I tend to bypass the fake caching mechanism (fake because the output isn't deterministic) and this really results in data explosion.
I understand why Docker images are neither deterministic nor minimal: making them so is a hard problem. I think Docker was right to prioritize solving distribution (it opens up many new possibilities). But I really wish some effort could be put into making images deterministic (and thus verifiable) and more minimal. I think it would make Docker an even more appealing platform, especially for the security conscious. (As an aside, I would absolutely love if we could ship a verifiable Firefox build, for example.)
These are hard problems. But they are solvable. Here's how I would do it.
First, let's tackle deterministic image creation. Despite computers and software being ideally deterministic, building software tends not to be, so deterministic image creation is a hard problem. Even tools like Puppet and Chef which claim to solve aspects of this problem don't do a very good job with determinism. Read my post on The Importance of Time on Machine Provisioning for more on the topic. But there are solutions. NixOS and the Nix package manager have the potential to be used as the basis of a deterministic image building platform. The high-level overview of Nix is that the inputs and contents of a package determine the package ID. If you know how Git or Mercurial get their commit SHA-1's, it's pretty much the same concept. In theory, two people on different machines start with the same environment and bootstrap the exact same packages, all from source. Gitian is a similar solution. Although I prefer Nix's content-based approach and how it goes about managing packages and environments. Nix feels so right as a base for deterministically building software. Anyway, yes, fully verifiable build environments are turtles all the way down (I recommend reading Tor's overview of the problem and their approach. However, Nix's approach addresses many of the turtles and silences most of the critics. I would absolutely love if more and more Docker images were the result of a deterministic build process like Nix. Perhaps you could define the full set of packages (with versions) that would be used. Let's call this the package manifest. You would then PGP sign and distribute your manifest. You could then have Nix step through all the dependencies, compiling everything from source. If PGP verification fails, compilation output changes, or extra files are needed, the build aborts or issues a warning. I have a feeling the security-minded community would go crazy over this. I know I would.
OK, so now you can use Nix to produce packages (and thus images) (more) deterministically. How do you make them minimal? Well, instead of just packaging the entire environment, I'd employ tools like makejail. The purpose of makejail is to create minimal chroot jail environments. These are very similar to Docker/LXC containers. In fact, you can often take a tarball of a chroot directory tree and convert it into a Docker container! With makejail, you define a configuration file saying among other things what binaries to run inside the jail. makejail will trace file I/O of that binary and copy over accessed files. The result is an execution environment that (hopefully) contains only what you need. Then, create an archive of that environment and pipe it into docker build to create a minimal Docker image.
In summary, Nix provides you with a reliable and verifiable build environment. Tools like makejail pair down the produced packages into something minimal, which you then turn into your Docker image. Regular people can still pull binary images, but they are much smaller and more in tune with Docker's principles of minimalism. The paranoid among us can produce the same bits from source (after verifying the inputs look credible and waiting through a few hours of compiling). Or, perhaps the individual files in the image could be signed and thus verified via trust somehow? The company deploying Docker can have peace of mind that disaster scenarios resulting in Docker image loss should not result in total loss of the image (just rebuild it exactly as it was before).
You'll note that my proposed solution does not involve Dockerfiles as they exist today. I just don't think Dockerfile's design of stackable layers of commands is the right model, at least for people who care about determinism and minimalism. You really want a recipe that knows how to create a set of relevant files and some metadata like what ports to expose, what command to run on container start, etc and turn that into your Docker image. I suppose you could accomplish this all inside Dockerfiles. But that's a pretty radical departure from how Dockerfiles work today. I'm not sure the two solutions are compatible. Something to think about.
I'm pretty sure of what it would take to add deterministic and verifiable building of minimal and more secure Docker images. And, if someone solved this problem, it could be applicable outside of Docker (again, Docker images are essentially chroot environments plus metadata). As I was putting the finishing touches on this article, I discovered nix-docker. It looks very promising! I hope the Docker community latches on to these ideas and makes deterministic, verifiable, and minimal images the default, not the exception.
I recently gained SSH access to Mozilla's Mercurial servers. This allows me to run some custom queries directly against the data. I was interested in some high-level numbers and thought I'd share the results.
hg.mozilla.org hosts a total of 3,445 repositories. Of these, there are 1,223 distinct root commits (i.e. distinct graphs). Altogether, there are 32,123,211 commits. Of those, there are 865,594 distinct commits (not double counting commits that appear in multiple repositories).
We have a high ratio of total commits to distinct commits (about 37:1). This means we have high duplication of data on disk. This basically means a lot of repos are clones/forks of existing ones. No big surprise there.
What is surprising to me is the low number of total distinct commits. I was expecting the number to run into the millions. (Firefox itself accounts for ~240,000 commits.) Perhaps a lot of the data is sitting in Git, Bitbucket, and GitHub. Sounds like a good data mining expedition...
When companies or organizations deploy version control, they have to make many choices. One of them is how many repositories to create. Your choices are essentially a) a single, monolithic repository that holds everything b) many separate, smaller repositories that hold all the individual parts c) something in between.
The prevailing convention today (especially in the open source realm) is to create many separate and loosely coupled repositories, each repository mapping to a specific product or service. That does seem reasonable: if you were organizing files on your filesystem, you would group them by functionality or role (photos, music, documents, etc). And, version control tools are functionally filesystems. So it makes sense to draw repository boundaries at directory/role levels.
Further reinforcing the separate repository convention is the scaling behavior of our version control tools. Git, the popular tool in open source these days, doesn't scale well to very large repositories due to - among other things - not having narrow clones (fetching a subset of files). It scales well enough to the overwhelming majority of projects. But if you are a large organization generating lots of data (read: gigabytes of data over hundreds of thousands of files and commits) for version control, Git is unsuitable in its current form. Other tools (like Mercurial) don't currently fare that much better (although Mercurial has plans to tackle these scaling vectors).
Despite popular convention and even limitations in tools, companies like Google and Facebook opt to run large, monolithic repositories. Google runs Perforce. Facebook is on Mercurial, or at least is in the process of migrating to Mercurial.
Why do these companies run monolithic repositories? In Google's words:
We have a single large depot with almost all of Google's projects on it. This aids agile development and is much loved by our users, since it allows almost anyone to easily view almost any code, allows projects to share code, and allows engineers to move freely from project to project. Documentation and data is stored on the server as well as code.
So, monolithic repositories are all about moving fast and getting things done more efficiently. In other words, monolithic repositories increase developer productivity.
Furthermore, monolithic repositories are also more compatible with the ebb and flow of large organizations and large software projects. Components, features, products, and teams come and go, merge and split. The only constant is change. And if you are maintaining separate repositories that attempt to map to this ever-changing organizational topology, you are going to have a bad time. Either you'll be constantly copying, moving, merging, splitting, etc data and repositories. Or your repositories will be organized in a very non-logical and non-intuitive manner. That translates to overhead and lost productivity. I think that monolithic repositories handle the realities of large organizations much better. Big change or reorganization you want to reflect? You can make a single, atomic, history-preserving commit to move things around. I think that's much more manageable, especially when you consider the difficulty and annoyance of history-preserving changes across repositories.
Naysayers will decry monolithic repositories on principled and practical grounds.
The principled camp will say that separate repositories constitute a loosely coupled (dare I say service oriented) architecture that maps better to how software is consumed, assembled, and deployed and that erecting barriers in the form of separate repositories deliberately enforces this architecture. I agree. However, you can still maintain a loosely coupled architecture with monolithic repositories. The Subversion model of checking out a single tree from a larger repository proves this. Furthermore, I would say architecture decisions should be enforced by people (via code review, etc), not via version control repository topology. I believe this principled argument against monolithic repositories to be rather weak.
The principled camp living in the open source realm may also decry monolithic repositories as an affront to the spirit of open source. They would say that a monolithic repository creates unfairly strong ties to the organization that operates it and creates barriers to forking, etc. This may be true. But monolithic repositories don't intrinsically infringe on the basic software freedoms, organizations do. Therefore, I find this principled argument rather weak.
The practical camp will say that monolithic repositories just don't scale or aren't suitable for general audiences. These concerns are real.
Fully distributed version control systems (every commit on every machine) definitely don't scale past certain limits. Depending on your repository and user base, your scaling limits include disk space (repository data terabytes in size), bandwidth (repository data terabytes in size), filesystem (repository hundreds of thousands or millions of files), CPU and memory (operations on large repositories take too many system resources), and many heads/branches (tools like Git and Mercurial don't scale well to tens of thousands of heads/branches). These limitations with fully distributed version control are why distributed version control tools like Git and Mercurial support a partially-distributed mode that behaves more like your classical server-client model, like those employed by Subversion, Perforce, etc. Git supports shallow clone and sparse checkout. Mercurial supports shallow clone (via remotefilelog) and has planned support for narrow clone and sparse checkout in the next release or two. Of course, you can avoid the scaling limitations of distributed version control by employing a non-distributed tool, such as Subversion. Many companies continue to reach this conclusion today. However, users adapted to the distributed workflow would likely be up in arms (they would probably use tools like hg-subversion or git-svn to maintain their workflows). So, while scaling of version control can be a real concern, there are solutions and workarounds. However, they do involve falling back to a partially-distributed model.
Another concern with monolithic repositories is user access control. You inevitably have code or data that is more sensitive and want to limit who can change or even access it. Separate repositories seem to facilitate a simpler model: per-repository access control. With monolithic repositories, you have to worry about per-directory/subtree permissions, an increased risk of data leaking, etc. This concern is more real with distributed version control, as distributed data and access control aren't naturally compatible. But these issues can be resolved. And if the tooling supports it, there is only a semantic difference between managing access control between repositories versus components of a single repository.
When it comes to repository hosting conversions, I agree with Google and Facebook: I prefer monolithic repositories. When I am interacting with version control, I just want to get stuff done. I don't want to waste time dealing with multiple commands to manage multiple repositories. I don't want to waste time or expend cognitive load dealing with submodule, subrepository, or big files management. I don't want to waste time trying to find and reuse code, data, or documentation. I want everything at my fingertips, where it can be easily discovered, inspected, and used. Monolithic repositories facilitate these workflows more than separate repositories and make me more productive as a result.
Now, if only all the tools and processes we use and love would work with monolithic repositories...
Next Page »