Mercurial and Git both experience scaling pains as the number of heads in a repository approaches infinity. Operations like push and pull slow to a crawl and everyone gets frustrated.
This is the problem Mozilla's Try repository has been dealing with for years. We know the solution doesn't scale. But we've been content kicking the can by resetting the repository (blowing away data) to make the symptoms temporarily go away.
One of my official goals is to ship a scalable Try solution by the end of 2014.
Today, I believe I finally have enough code cobbled together to produce a working concept. And I could use your help testing it.
I would like people to push their Try, code review, and other miscellaneous heads to a special repository. To do this:
$ hg push -r . -f ssh://email@example.com/gecko-headless
- Consider the changeset belonging to the working copy
- Allow the creation of new heads
- Send it to the gecko-headless repo on hg.gregoryszorc.com using SSH
Here's what happening.
I have deployed a special repository to my personal server that I believe will behave very similarly to the final solution.
When you push to this repository, instead of your changesets being applied directly to the repository, it siphons them off to a Mercurial bundle. It then saves this bundle somewhere along with some metadata describing what is inside.
When you run hg pull -r
Things this repository doesn't do:
- This repository will not actually send changesets to Try for you.
- You cannot
hg clonethe repository and get all of the commits from bundles. This isn't a goal. It will likely never be supported.
- We do not yet record a pushlog entry for pushes to the repository.
- The hgweb HTML interface does not yet handle commits that only exist in bundles. People want this to work. It will eventually work.
- Pulling from the repository over HTTP with a vanilla Mercurial install may not preserve phase data.
The purpose of this experiment is to expose the repository to some actual traffic patterns so I can see what's going on and get a feel for real-world performance, variability, bugs, etc. I plan to do all of this in the testing environment. But I'd like some real-world use on the actual Firefox repository to give me peace of mind.
Please report any issues directly to me. Leave a comment here. Ping me on IRC. Send me an email. etc.
Update 2014-11-21: People discovered a bug with pushed changesets accidentally being advanced to the public phase, despite the repository being non-publishing. I have fixed the issue. But you must now push to the repository over SSH, not HTTP.
We had a hiccup on hg.mozilla.org yesterday. It resulted in prolonged tree closures for Firefox. Bug 1094922 tracks the issue.
We noticed that many HTTP requests to hg.mozilla.org were getting 503 responses. On initial glance, the servers were healthy. CPU was below 100% utilization, I/O wait was reasonable. And there was little to no swapping. Furthermore, the logs showed a healthy stream of requests being successfully processed at levels that are typical. In other words, it looked like business as usual on the servers.
Upon deeper investigation, we noticed that the WSGI process pool on the servers was fully saturated. There were were 24 slots/processes per server allocated to process Mercurial requests and all 24 of them were actively processing requests. This created a backlog of requests that had been accepted by the HTTP server but were waiting an internal dispatch/proxy to WSGI. To the client, this would appear as a request with a long lag before response generation.
This being an emergency (trees were already closed and developers were effectively unable to use hg.mozilla.org), we decided to increase the size of the WSGI worker pool. After all, we had CPU, I/O, and memory capacity to spare and we could identify the root cause later. We first bumped worker pool capacity from 24 to 36 and immediately saw a significant reduction in the number of pending requests awaiting a WSGI worker. We still had spare CPU, I/O, and memory capacity and were still seeing requests waiting on a WSGI worker, so we bumped the capacity to 48 processes. At that time, we stopped seeing worker pool exhaustion and all incoming requests were being handed off to a WSGI worker as soon as they came in.
At this time, things were looking pretty healthy from the server end.
Impact on Memory and Swapping
Increasing the number of WSGI processes had the side-effect of increasing the total amount of system memory used by Mercurial processes in two ways. First, more processes means more memory. That part is obvious. Second, more processes means fewer requests for each process per unit of time and thus it takes longer for each process to reach its max number of requests being being reaped. (It is a common practice in servers to have a single process hand multiple requests. This prevents overhead associated with spawning a new process and loading possibly expensive context in it.)
We had our Mercurial WSGI processes configured to serve 100 requests before being reaped. With the doubling of WSGI processes from 24 to 48, those processes were lingering for 2x as long as before. Since the Mercurial processes grow over time (they are aggressive about caching repository data), this was slowly exhausting our memory pool.
It took a few hours, but a few servers started flirting with high swap usage. (We don't expect the servers to swap.) This is how we identified that memory use wasn't sane.
We lowered the maximum requests per process from 100 to 50 to match the ratio increase in the WSGI worker pool.
Mercurial Memory "Leak"
When we started looking at the memory usage of WSGI processes in more detail, we noticed something strange: RSS of Mercurial processes was growing steadily when processes were streaming bundle data. This seemed very odd to me. Being a Mercurial developer, I was pretty sure the behavior was wrong.
I filed a bug against Mercurial.
I was able to reproduce the issue locally and started running a bisection to find the regressing changeset. To my surprise, this issue has been around since Mercurial 2.7!
I looked at the code in question, identified why so much memory was being allocated, and submitted patches to stop an unbounded memory growth during clone/pull and to further reduce memory use during those operations. Both of those patches have been queued to go in the next stable release of Mercurial, 3.2.1.
Mercurial 3.2 is still not as memory efficient during clones as Mercurial 2.5.4. If I have time, I'd like to formulate more patches. But the important fix - not growing memory unbounded during clone/pull - is in place.
Armed with the knowledge that Mercurial is leaky (although not a leak in the traditional sense since the memory was eventually getting garbage collected), we further reduced the max requests per process from 50 to 20. This will cause processes to get reaped sooner and will be more aggressive about keeping RSS growth in check.
We suspect the root cause of the event is a network event.
Before this outage, we rarely had more than 10 requests being served from the WSGI worker pool. In other words, we were often well below 50% capacity. But something changed yesterday. More slots were being occupied and high-bandwidth operations were taking longer to complete. Kendall Libby noted that outbound traffic dropped by ~800 Mbps during the event. For reasons that still haven't been identified, the network became slower, clones weren't being processed as quickly, and clients were occupying WSGI processes for longer amounts of time. This eventually exhausted the available process pool, leading to HTTP 503's, intermittent service availability, and a tree closure.
Interestingly, we noticed that in-flight HTTP requests are abnormally high again this morning. However, because the servers are now configured to handle the extra capacity, we seem to be powering through it without any issues.
You can make the argument that the servers weren't configured to serve as much traffic as possible. After all, we were able to double the WSGI process pool without hitting CPU, I/O, and memory limits.
The servers were conservatively configured. However, the worker pool was initially configured at 2x CPU core count. And as a general rule of thumb, you don't want your worker pool to be much greater than CPU count because that introduces context switching and can give each individual process a smaller slice of the CPU to process requests, leading to higher latency. Since clone operations often manage to peg a single CPU core, there is some justification for keeping the ratio of WSGI workers to CPU count low. Furthermore, we rarely came close to exhausting the WSGI worker pool before. There was little to no justification for increasing capacity to a threshold not normally seen.
But at the same time, even with 4x workers to CPU cores, our CPU usage rarely flirts with 100% across all cores, even with the majority of workers occupied. Until we actually hit CPU (or I/O) limits, running a high multiplier seems like the right thing to do.
Long term, we expect CPU usage during clone operations to drop dramatically. Mike Hommey has contributed a patch to Mercurial that allows servers to hand out a URL of a bundle file to fetch during clone. So, a server can say I have your data: fetch this static file from S3 and then apply this small subset of the data that I'll give you. When properly deployed and used at Mozilla, this will effectively drop server-side CPU usage for clones to nothing.
Where to do Better
There was a long delay between the Nagios alerts firing and someone with domain-specific knowledge looking at the problem.
The trees could have reopened earlier. We were pretty confident about the state of things at 1000. Trees opened in metered mode at 1246 and completely at 1909. Although, the swapping issue wasn't mitigated until 1615, so you can argue that being conservative on the tree reopening was justified. There is a chance that full reopening could have triggered excessive swap and another round of chaos for everyone involved.
We need an alert on WSGI pool exhaustion. It took longer than it should have to identify this problem. However, now that we've encountered it, it should be obvious if/when it happens again.
Firefox release automation is the largest single consumer of hg.mozilla.org. Since they are operating thousands of machines, any reduction in interaction or increase in efficiency will result in drastic load reductions on the server. Chris AtLee and Jordan Lund have been working on bug 1050109 to reduce clones of the mozharness and build/tools repositories, which should go a long way to dropping load on the server.
Timeline of Events
All times PST.
- 0705 - First Nagios alerts fire
- 0819 - Trees closed
- 0915 - WSGI process pool increased from 24 to 36
- 0945 - WSGI process pool increased from 36 to 48
- 1246 - Trees reopen in metered mode
- 1615 - Decrease max requests per process from 100 to 50
- 1909 - Trees open completely
- 0012 - Patches to reduce memory usage submitted to Mercurial
- 0800 - Mercurial patches accepted
- 0915 - Decrease max requests per process from 50 to 20
Warning: This post is long and rambling. There is marginal value in reading beyond the first few paragraphs unless you care about Docker.
I recently wrote about how Mozilla tests version control. In this post, I want to talk about the part of that effort that consumed the most time: adding Docker support to the test harness.
Introducing the Problem and Desired End State
Running Docker containers inside tests just seems like an obvious thing you'd want to do. I mean, wouldn't it be cool if your tests could spin up MySQL, Redis, Cassandra, Nginx, etc inside Docker containers and test things against actual instances of the things running in your data centers? Of course it would! If you ask me, this approach beats mocking because many questions around accuracy of the mocked interface are removed. Furthermore, you can run all tests locally, while on a plane: no data center or staging environment required. How cool is that! And, containers are all isolated so there's no need to pollute your system with extra packages and system services. Seems like wins all around.
When Mozilla started adding customizations to the Review Board code review software in preparation for deploying it at Mozilla as a replacement for Bugzilla's Splinter, it quickly became apparant that we had a significant testing challenge ahead of us. We weren't just standing up Review Board and telling people to use it, we were integrating user authentication with Bugzilla, having Review Board update Bugzilla after key events, and were driving the initiation of code review in Review Board by pushing code to a Mercurial server. That's 3 user-visible services all communicating with each to expose a unified workflow. It's the kind of thing testing nightmares are made of.
During my early involvement with the project, I recognized the challenge ahead and was quick to insist that we write automated tests for as much as possible. I insisted that all the code (there are multiple extensions to ReviewBoard, a Mercurial hook, and a Mercurial extension) live under one common repository and share testing. That way we could tinker with all the parts easily and test them in concern without having to worry about version sync. We moved all the code to the version-control-tools repository and Review Board was the driving force behind improvements to the test harness in that repository. We had Mercurial .t tests starting Django dev servers hosting Review Board running from per-test SQLite databases and all was nice. Pretty much every scenario involving the interaction between Mercurial and ReviewBoard was tested. If you cared about just these components, life was happy.
A large piece of the integration story was lacking in this testing world: Bugzilla. We had somewhat complex code for having Review Board and Bugzilla talk to each other but no tests for it because nobody had yet hooked Bugzilla up to the tests. As my responsibilities in the project expanded from covering just the Mercurial and Review Board interaction to Bugzilla as well, I again looked at the situation and thought there's a lot of complex interaction here and alpha testing has revealed the presence of many bugs: we need a better testing story. So, I set out to integrate Bugzilla into the test harness.
My goals were for Review Board tests to be able to make requests against a Bugzilla instance configured just like bugzilla.mozilla.org, to allow tests to execute concurrently (don't make developers wait on machines), for tests to run as quickly as possible, to run tests in an environment as similar to production as possible, and to be able to run tests from a plane or train or anywhere without internet connectivity. I was unwilling to budge on these core testing requirements because they represent what's best from test accuracy and developer productivity standpoints: you want your tests to be representative of the real world and you want to enable people to hack on this service anywhere, anytime, and not be held back by tests that take too long to execute. Move fast and don't break things.
Before I go on, it's time for a quick aside on tolerable waiting times. Throughout this article I talk about minimizing the run time of tests. This may sound like premature optimization. I argue it isn't, at least not if you are optimizing for developer productivity. There is a fair bit of academic research in this area. A study on tolerable waiting time: how long are Web users willing to wait gets cited a lot. It says 2 seconds for web content. If you read a few paragraphs in, it references other literature. They disagree on specific thresholds, but one thing is common: the thresholds are typically low - just a few seconds. The latencies I deal with are all longer than what research says leads to badness. When given a choice, I want to optimize workflows for what humans are conditioned to tolerate. If I can't do that, I've failed and the software will be ineffective.
The architecture of Bugzilla created some challenges and eliminated some implementation possibilities. First, I wasn't using any Bugzilla: I was using Mozilla's branch of Bugzilla that powers bugzilla.mozilla.org. Let's call it BMO. I could try hosting it from local SQLite files and running a local, Perl-based HTTP server (Bugzilla is written in Perl). But my experience with Perl and takeaways from talking to the BMO admins was that pain would likely be involved. Plus, this would be a departure from test accuracy. So, I would be using MySQL, Apache HTTPd, and mod_perl, just like BMO uses them in production.
Running Apache and MySQL is always a... fun endeavor. It wasn't a strict requirement, but I also highly preferred that the tests didn't pollute the system they ran on. In other words, having tests connect to an already-running MySQL or Apache server felt like the wrong solution. That's just one more thing people must set up and run locally to run the tests. That's just one more thing that could differ from production and cause bad testing results. It felt like a dangerous approach. Plus, there's the requirement to run things concurrently. Could I have multiple tests talking to the same MySQL server concurrently? They'd have to use separate databases so they don't conflict. That's a possibility. Honestly, I didn't entertain the thought of running Apache and MySQL manually for too long. I knew about this thing called Docker and that it theoretically fit my use case perfectly: construct building blocks for your application and then dymanically hook things up. Perfect. I could build Docker containers for all the required services and have each test start a new, independent set of containers for just that test.
So, I set out integrating Docker into the version-control-tools test harness. Specifically, my goal was to enable the running of independent BMO instances during individual tests. It sounded simple enough.
What I didn't know was that integrating a Dockerized BMO into the test harness would take the better part of 2 weeks. And it's still not up to my standards. This post is the story about the trials and tribulations I encountered along the way. I hope it serves as a warning and potentially a guide for others attempting similar feats. If any Docker developers are reading, I hope it gives you ideas on how to improve Docker.
Running Bugzilla inside Docker
First thing's first: to run BMO inside Docker I needed to make Docker containers for BMO. Fortunately, David Lawrence has prior art here. I really just wanted to take that code, dump it into version-control-tools and call it a day. In hindsight, I probably should have done that. Instead, armed with the knowledge of the Docker best practice of one container per service and David Lawrence's similar wishes to make his code conform to that ideal, I decided to spend some time to fix David's code so that MySQL and Apache were in separate containers, not part of a single container running supervisord. Easy enough, right?
It was relatively easy extracting the MySQL and Apache parts of BMO into separate containers. For MySQL, I started with the official MySQL container from the Docker library and added a custom my.cnf. Simple enough. For Apache, I just copied everything from David's code that wasn't MySQL. I was able to manually hook the containers together using the Docker CLI. It sort of just worked. I was optimistic this project would only take a few hours.
A garbage collection bug in Docker
My first speed bump came as I was iterating on Dockerfiles. All of a sudden I get an error from Docker that it is out of space. Wat? I look at docker images and don't see anything too obvious eating up space. What could be going on? At this point, I'm using boot2docker to host Docker. boot2docker is this nifty tool that allows Windows and OS X users to easily run Docker (Docker requires a Linux host). boot2docker spins up a Linux virtual machine running Docker and tells you how to point your local docker CLI interface at that VM. So, when Docker complains it is out of space, I knew immediately that the VM must be low on space. I SSH into it, run df, and sure enough, the VM is nearly out of space. But I looked at docker images -a and confirmed there's not enough data to fill the disk. What's going on? I can't find the issue right now, but it turns out there is a bug in Docker! When running Docker on aufs filesystems (like boot2docker does), Docker does not always remove data volumes containers when deleting a container. It turns out that the MySQL containers from the official Docker library were creating a data-only container to hold persistent MySQL data that outlives the container itself. These containers are apparently light magic. They are containers that are attached to other containers, but they don't really show up in the Docker interfaces. When you delete the host container, these containers are supposed to be garbage collected. Except on aufs, they aren't. My MySQL containers were creating 1+ GB InnoDB data files on start and the associated data containers were sitting around after container deletion, effectively leaking 1+ GB every time I created a MySQL container, quickly filling the boot2docker disk. Derp.
I worked around this problem by forking the official MySQL container. I didn't need persistent MySQL data (the containers only need to live for one invocation - for the lifetime of a single test), so I couldn't care less about persisted data volumes. So, I changed the MySQL container to hold its data locally, not in a data volume container. The solution was simple enough. But it took me a while to identify the problem. Here I was seeing Docker do something extremely stupid. Surely my understanding of Docker was wrong and I was doing something stupid to cause it to leak data. I spent hours digging through the documentation to make sure I was doing things exactly as recommended. It wasn't until I started an Ubuntu VM and tried the same thing there did I realize this looked like a bug in boot2docker. A few Google searches later led me to a comment hiding at the bottom of an existing GitHub issue that pins aufs as the culprit. And here I thought Docker reached 1.0 and wouldn't have bad bugs like this. I certainly wouldn't expect boot2docker to be shipping a VM with a sub-par storage driver (shouldn't it be using devicemapper or btrfs instead). Whatever.
Wrangling with Mozilla's Branch of Bugzilla
At this point, I've got basic Docker containers for MySQL and Apache+mod_perl+Bugzilla being created. Now, I needed to convert from vanilla Bugzilla to BMO. Should be straightforward. Just change the Git remote URL and branch to check out. I did this and all-of-a-sudden my image started encountering errors building! It turns out that the BMO code base doesn't work on a fresh database! Fortunately, this is a known issue and I've worked around it previously. When I tackled it a few months ago, I spent a handful of hours disecting this problem. It wasn't pretty. But this time I knew what to do. I even had a Puppet manifest for installing BMO on a fresh machine. So, I just needed to translate that Puppet config into Dockerfile commands. No big deal, right? Well, when I did that Puppet config a few months ago, I based it on Ubuntu because I'm more familiar with Debian-based distros and figured Ubuntu would be the easiest since it tends to have the largest package diversity. Unfortunately, David's Docker work is based on Fedora. So, I spent some time converting the Dockerfile to Ubuntu rather than trying to port things to Fedora. Arguably the wrong decision since Mozilla operates the RedHat flavor of Linux distributions in production. But I was willing to trade accuracy for time here, having lost time dealing with the aufs bug.
Unfortunately, I under-estimated how long it would take to port the image to Ubuntu. It didn't take so long from a code change perspective. Instead, most of the time was spent waiting for Docker to run the commands to build the image. In the final version, Apt is downloading and installing over 250 packages. And Bugzilla's bootstrap process installs dozens of packages from CPAN. Every time I made a small change, I invalidated Docker's image building cache, causing extreme delays while waiting for Apt and CPAN to do their thing. This experience partially contributed to my displeasure with how Docker currently handles image creation. If Docker images were composed of pre-built pieces instead of stacked commands, my cache hit rate would have been much higher and I would have converted the image in no time. But no, that's not how things work. So I lost numerous hours through this 2 week process waiting for Docker images to perform operations I've already done elsewhere dozens of times before.
Docker Container Orchestration
After porting the Bugzilla image to Ubuntu and getting BMO to bootstrap in a manually managed container (up to this point I'm using the docker CLI to create images, start containers, etc), it was time to automate the process so that tests could run the containers. At this time, I started looking for tools that performed multiple container orchestration. I had multiple containers that needed to be treated as a single logical unit, so I figured I'd use an existing tool to solve this problem for me. Don't reinvent the wheel unless you have to, right? I discovered Fig, which seemed to fit the bill. I read that it is being integrated into Docker itself, so it must be best of breed. Even if it weren't its future seems to be more certain than other tools. So, I stopped my tools search and used Fig without much consideration for other tools.
Lack of a useful feature in Fig
I quickly whipped up a fig.yml and figured it would just work. Nope! Starting the containers from scratch using Fig resulted in an error. I wasn't sure what the error was at first because Fig didn't tell me. After some investigation, I realized that my bmoweb container (the container holding Apache + BMO code) was failing in its entrypoint command (that's a command that runs when the container starts up, but not the primary command a container runs - that's a bit confusing I know - read the docs). I added some debug statements and quickly realized that Bugzilla was erroring connecting to MySQL. Strange, I thought. Fig is essentially a DSL around manual docker commands, so I checked everything by typing everything into the shell. No error. Again on a new set of containers. No error. I thought maybe my environment variable handling was wrong - that the dynamically allocated IP address and port number of the linked MySQL container being passed to the bmoweb container weren't getting honored. I added some logging to disprove that theory. The wheels inside my brain spun for a little bit. And, aided by some real-time logging, I realized I was dealing with a race condition: Fig was starting the MySQL and bmoweb containers concurrently and bmoweb was attempting to access the MySQL server before MySQL had fully initialized and started listening on its TCP port! That made sense. And I think it's a reasonable optimization for Fig to start containers concurrently to speed up start time. But surely a tool that orchestrates different containers has considered the problem of dependencies and has a mechanism to declare them to prevent these race conditions. I check the online docs and there's nothing to be found. A red panda weeps. So, I change the bmoweb entrypoint script to wait until it can open a TCP socket to MySQL before actually using MySQL and sure enough, the race condition goes away and the bmoweb container starts just fine!
OK, I'm real close now. I can feel it.
I start playing around with manually starting and stopping containers as part of a toy test. The good news is things appear to work. The bad news is it is extremely slow. It didn't take long for me to realize that the reason for the slowness is Bugzilla's bootstrap on first run. Bugzilla, like many complex applications, has a first run step that sets up database schema, writes out some files on the filesystem, inserts some skeleton data in the database, creates an admin user, etc. Much to my dismay this was taking a long time. Something on the order of 25 to 30 seconds. And that's on a Haswell with plenty of RAM and an SSD. Oy. The way things are currently implemented would result in a 25 to 30 second delay when running every test. Change 1 line and wait say 25s for any kind of output. Are you kidding me?! Unacceptable. It violated my core goal of having tests that are quick to run. Again, humans should not have to wait on machines.
I think about this problem for like half a second and the solution is obvious: take a snapshot of the bootstrapped images and start instances of that snapshot from tests. In other words, you perform the common bootstrap operations once and only once. And, you can probably do that outside the scope of running tests so that the same snapshot can be used across multiple invocations of the test harness. Sounds simple! To the Docker uninitiated, it sounds like the solution would be to move the BMO bootstrapping into the Dockerfile code so it gets executed at image creation time. Yes, that would be ideal. Unfortunately, when building images via Dockerfile, you can't tell Docker to link that image to another container. Without container linking, you can't have MySQL. Without MySQL, you can't do BMO bootstrap. So, BMO bootstrap must be done during container startup. And in Docker land, that means putting it as part of your entrypoint script (where it was conveniently already located for this reason).
Talking Directly to the Docker API
Of course, the tools that I found that help with Docker image building and container orchestration don't seem to have an answer for this create a snapshot of a bootstrapped container problem. I'm sure someone has solved this problem. But in my limited searching, I couldn't find anything. And, I figured the problem would be easy enough to solve manually, so I set about creating a script to do it. I'm not a huge fan of shell script for automation. It's hard to debug and simple things can be hard and hard things can be harder. Plus, why solve solutions such as parsing output for relevant data when you can talk to an API directly and get native types. Since the existing test harness automation in version-control-tools was written in Python, I naturally decided to write some Python to create the bootstrapped images. So, I do a PyPI search and discover docker-py, a Python client library to the Docker Remote API, an HTTP API that the Docker daemon runs and is what the docker CLI tool itself uses to interface with Docker. Good, now I have access to the full power of Docker and am not limited by what the docker CLI may not expose. So, I spent some time looking at source and the Docker Remote API documentation to get an understanding of my new abilities and what I'd need to do. I was pleasantly surprised to learn that the docker CLI is pretty similar to the Remote API and the Python API was similar as well, so the learning was pretty shallow. Yay for catching a break!
Confusion Over Container Stopping
I wrote some Python for building the BMO images, launching the containers, committing the result, and saving state to disk (so it could be consulted later - preventing a bootstrap by subsequent consumers). This was pleasantly smooth at first, but I encountered some bumps along the way. First, I didn't have a complete grasp on the differences between stop and kill. I was seeing some weird behavior by MySQL on startup and didn't know why. Turns out I was forcefully killing the container after bootstrap via the kill API and this was sending a SIGKILL to MySQL, effectively causing unclean shutdown. After some documentation reading, I realized stop is the better API - it issues SIGTERM, waits for a grace period, then issues SIGKILL. Issuing SIGTERM made MySQL shut down gracefully and this issue stemming from my ignorance was resolved. (If anyone from Docker is reading, I think the help output for docker kill should mention the forcefullness of the command versus stop. Not all of us remember the relative forcefullness of the POSIX signals and having documentation reinforce their cryptic meaning could help people select the proper command.) A few lines of Python later and I was talking directly to the Docker Remote API, doing everything I needed to do to save (commit in Docker parlance) a bootstrapped BMO environment for re-use among multiple tests.
It was pretty easy to hook the bootstrapped images up to a single test. Just load the bootstrapped image IDs from the config file and start new containers based on them. That's Docker 101 (except I was using Python to do everything).
Concurrent Execution Confuses Bugzilla
Now that I could start Dockerized BMO from a single test, it was time to make things work concurrently. I hooked Docker up to a few tests and launched them in parallel to see what would happen. The containers appeared to start just fine! Great anticipation on my part to design for concurrency from the beginning, I thought. It appeared I was nearly done. Victory was near. So, I changed some tests to actually interact with BMO running from Docker. (Up until this point I was merely starting containers, not doing anything with them.) Immediately I see errors. Cannot connect to Bugzilla http://... connection refused. Huh? It took a few moments, but I realized the experience I had with MySQL starting and this error were very similar. I changed my start BMO containers code to wait for the HTTP server's TCP socket to start accepting connections before returning control and sure enough, I was able to make HTTP requests against Bugzilla running in Docker! Woo!
Next step, make an authenticated query against Bugzilla running in Docker. HTTP request completes... with an internal server error. What?! I successfully browsed BMO from containers days before and was able to log in just fine - this shouldn't be happening. This problem took me ages to diagnose. I traced every step of provisioning and couldn't figure out what was going on. After resorting to print debugging in nearly every component, including Bugzilla's Perl code itself, I found the culprit: Bugzilla wasn't liking the dynamic nature of the MySQL and HTTP endpoints. You see, when you start Docker containers, network addresses change. The IP address assigned to the container is whatever is available to Docker at the time the container was started. Likewise the IP address and port number of linked services can change. So, your container entrypoint has to deal with this dynamic nature of addresses. For example, if you have a configuration file, you need to update that configuration file on every run with the proper network address info. My Bugzilla entrypoint script was doing this. Or so I thought. It turns out that Bugzilla's bootstrap process has multiple config files. There's an answers file that provides static answers to questions asked when running the bootstrap script (checksetup.pl). checksetup.pl will produce a localconfig file (actually a Perl script) containing all that data. There's also a data/params file containing yet more configuration options. And, the way I was running bootstrap, checksetup.pl refused to update files with new values. I initially had the entrypoint script updating only the answers file and running checksetup.pl, thinking checksetup.pl would update localconfig if the answers change. Nope! checksetup.pl only appears to update localconfig if localconfig is missing a value. So, here my entrypoint script was, successully calling checksetup.pl with the proper network values, which checksetup.pl was more than happy to use. But when I started the web application, it used the old values from localconfig and data/params and blew up. Derp. So, to have dynamic MySQL hosts and ports and a dynamic self-referential HTTP URL, I needed to manually update localconfig and data/params during the entrypoint script. The entrypoint script now rewrites Perl scripts during container load to reflect appropriate variables. Oy.
At some point I got working BMO containers running concurrently from multiple tests. This was a huge milestone. But it only revealed my next problem: resource constraints. The running containers were consuming gobs of memory and I couldn't run more than 2 or 3 tests concurrently before running out of memory. Before, I was able to run 8 tests concurrently no problem. Well crap, I just slowed down the test harness significantly by reducing concurrency. No bueno.
Some quick investigation revealed the culprit was MySQL and Apache being greedier than they needed to be. MySQL was consuming 1GB RSS on start. Apache was something like 350 MB. It had been a while since I ran a MySQL server, so I had to scour the net for settings to put MySQL on a diet. The results were not promising. I knew enough about MySQL to know that the answers I found had similar quality to comments on the php.net function documentation circa 2004 (it was not uncommon to see things like SQL injection in the MySQL pages back then - who knows, maybe that's still the case). Anyway, a little tuning later and I was able to get MySQL using a few hundred MB RAM and I reduced the Apache worker pool to something reasonable (maybe 2) to free up enough memory to be able to run tests with the desired concurrency again. If using Docker as part of testing ever takes off, I imagine there will be two flavors of every container: low memory and regular. I'm not running a production service here: I'll happily trade memory for high-end performance as long as it doesn't impact my tests too much.
Caching, Invalidating, and Garbage Collecting Bootstrapped Images
As part of iterating on making BMO bootstrap work, I encountered another problem: knowing when to perform a bootstrap. As mentioned earlier, bootstrap was slow: 25 to 30 seconds. While I had reduced the cost of bootstrap to at most once per test suite execution (as opposed to once per test), there was still the potential for a painful 25-30s delay when running tests. Unacceptable! Furthermore, when I changed how bootstrap worked, I needed a way to invalidate the previous bootstrapped image. Otherwise, we may use an outdated bootstrapped image that doesn't represent the environment it needs to and test execution would fail. How should I do this?
Docker has considered this problem and they have a solution: build context. When you do a docker build, Docker takes all the files from the directory containing the Dockerfile and makes them available to the environment doing the building. If you ADD one of these files in your Dockerfile, the image ID will change if the file changes, invalidating the cache used by Docker to build images. So, if I ADDed the scripts that perform BMO bootstrap to my Docker images, Docker would automagically invalidate the built images and force a bootstrap for me. Nice! Unfortunately, docker build doesn't allow you to add files outside of the current directory to the build context. Internet sleuthing reveals the solution here is to copy things to a temporary directory and run docker build from that. Seriously? Fortunately, I was using the Docker API directly via Python. And that API simply takes an archive of files. And since you can create archives dynamically inside Python using e.g. tarfile, it wasn't too difficult to build proper custom context archives that contained my extra data that could be used to invalidate bootstrapped images. I threw some simple ADD directives into my Dockerfiles and now I got bootstrapped image invalidation!
To avoid having to perform bootstrap on every test run, I needed a mapping between the base images and the bootstrapped result. I ended up storing this in a simple JSON file. I realize now I could have queried Docker for images having the base image as its parent since there is supposed to be a 1:1 relationship between them. I may do this as a follow-up.
With the look-up table in place, ensuring bootstrapped images were current involved doing a couple docker builds, finding the bootstrapped images from those base images, and doing the bootstrap if necessary. If everything is up-to-date, docker build finishes quickly and we have less than 1s of delay. Very acceptable. If things aren't current, well, there's not much you can do there if accuracy is important. I was happy with where I was.
Once I started producing bootstrapped images every time the code impacting the generation of that image changed, I ran into a new problem: garbage collection. All those old bootstrapped images were piling up inside of Docker! I needed a way to prune them. Docker has support for associating a repository and a tag with images. Great, I thought, I'll just associate all images with a well-defined repository, leave the tag blank (because it isn't really relevant), and garbage collection will iterate over all images in to-be-pruned repositories and delete all but the most recent one. Of course, this simple solution did not work. As far as I can tell, Docker doesn't really let you have multiple untagged images. You can set a repository with no tag and Docker will automagically assign the latest tag to that image. But the second you create a new image in that repository, the original image loses that repository association. I'm not sure if this is by design or a bug, but it feels wrong to me. I want the ability to associate tags with images (and containers) so I can easily find all entities in a logical set. It seemed to me that repository facilitated that need (albeit with the restriction of only associating 1 identifier per image). My solution here was to assign type 1 UUIDs to the tag field for each image. This forced Docker to retain the repository association when new images were created. I chose type 1 UUIDs so I can later extract the time component embedded within and do time-based garbage collection e.g. delete all images created more than a week ago.
Making Things Work in Jenkins/Ubuntu
At about this point, I figured things were working well enough on my boot2docker machine that it was time to update the Jenkins virtual machine / Vagrant configuration to run Docker. So, I hacked up the provisioner to install the docker.io package and tried to run things. First, I had to update code that talks to Docker to know where Docker is in an Ubuntu VM. Before, I was keying things off DOCKER_HOST, which I guess is used by the docker CLI and boot2docker reminds you to set. Easy enough. When I finally got things talking to Docker, my scripts threw a cryptic error when talking to Docker. Huh? This worked in boot2docker! When in doubt, always check your package versions. Sure enough, Ubuntu was installing an old Docker version. I added the Docker Apt repo to the Vagrant provisioner and tried again. Bingo - working Docker in an Ubuntu VM!
Choice of storage engines
I started building the BMO Docker images quickly noticed something: building images was horribly slow. Specifically, the part where new images are committed was taking seemingly forever. 5 to 8 seconds or something. Ugh. This wouldn't really bother me except due to subsequent issues, I found myself changing images enough as part of debugging that image building latency became a huge time sink. I felt I was spending more time waiting for layers to commit than making progress. So, I I decided to do something about it. I remembered glancing at an overview of storage options in Docker the week or two prior. I instinctively pinned the difference on different storage drivers between boot2docker and Ubuntu. Sure enough, boot2docker was using aufs and Ubuntu was using devicemapper. OK, now I identified a potential culprit. Time to test the theory. A few paragraphs into that blog post, I see a sorted list of storage driver priorities. I see aufs first, btrfs second, and devicemapper third. I know aufs has kernel inclusion issues (plus a nasty data leakage bug). I don't want that. devicemapper is slow. I figured the list is ordered for a reason and just attempted to use btrfs without really reading the article. Sure enough, btrfs is much faster at committing images than devicemapper. And, it isn't aufs. While images inside btrfs are building, I glance over the article and come to the conclusion that btrfs is in fact good enough for me.
So now I'm running Docker on btrfs on Ubuntu and Docker on aufs in boot2docker. Hopefully that will be the last noticable difference between host environments. After all, Docker is supposed to abstract all this away, right? I wish.
The Mystery of Inconsistent State
It was here that I experienced the most baffling, mind bending puzzle yet. As I was trying to get things working on the Jenkins/Ubuntu VM - things that had already been proved out in boot2docker - I was running into inexplicable issues during creation of the bootstrapped BMO containers. It seemed that my bootstrapped containers were somehow missing data. It appeared as if bootstrap had completed but data written during bootstrap failed to write. You start the committed/bootstrapped image and bootstrap had obviously completed partially, but it appeared to have never finished. Same Docker version. Same images. Same build scripts. Only the host environment was different. Ummmm, Bueller?
This problem had me totally and completely flabbergasted. My brain turned to mush exhausting possibilities. My initial instinct was this was a filesystem buffering problem. Different storage driver (btrfs vs aufs) means different semantics in how data is flushed, right? I once again started littering code with print statements to record the presence or non-presence of files and content therein. MySQL wasn't seeing all its data, so I double and triple check I'm shutting down MySQL correctly. Who knows, maybe one of the options I used to trim the fat from MySQL removed some of the safety from writing data and unclean shutdown is causing MySQL to lose data?
While I was investigating this problem, I noticed an additional oddity: I was having trouble getting reliable debug output from running containers (docker log -f). It seemed as if I was losing log events. I could tell from the state of a container that something happened, but I was seeing no evidence from docker logs -f that that thing actually happened. Weird! On a hunch, a threw some sys.stdout.flush() calls in my Python scripts, and sure enough my missing output started arriving! Pipe buffering strikes again. So, now we have dirty hacks in all the Python scripts related to Docker to unbuffer stdout to prevent data loss. Don't ask how much time was wasted tracking down bad theories due to stdout output being buffered.
Getting back to the problem at hand, I still hand Docker containers seemingly lose data. And it was only happening when Ubuntu/btrfs was the host environment for Docker. I eventually exhausted all leads in my filesystem wasn't flushed theory. At some point, I compared the logs of docker logs -f between boot2docker and Ubuntu and eventually noticed that the bmoweb container in Ubuntu wasn't printing as much. This wasn't obvious at first because the output from bootstrap on Ubuntu looked fine. Besides, the script that waits for bootstrap to complete waits for the Apache HTTP TCP socket to come alive before it gracefully stops the container and snapshots the bootstrapped result: bootstrap must be completing, ignore what docker logs -f says.
Eventually I hit an impasse and resort to context dumping everything on IRC. Ben Kero is around and he picks up on something almost immediately. He simply says ... systemd?. I knew almost instantly what he was referring to and knew the theory fit the facts. Do you know what I was doing wrong?
I still don't know what and quite frankly I don't care, but something in my Ubuntu host environment had a trigger on the TCP port the HTTP server would be listening on. Remember, I was detecting bootstrap completion by waiting until a TCP socket could be opened to the HTTP server. As soon as that connection was established, we stopped the containers gracefully and took a snapshot of the bootstrapped result. Except on Ubuntu something was accepting that socket open, giving a false positive to my wait code, and triggering early shutdown. Docker issued the signal to stop the container gracefully, but it wasn't finished bootstrapping yet, so it forcefully killed the container, resulting in bootstrap being in a remarkably-consistent-across-runs inconsistent state. Changing the code from wait on TCP socket to wait for valid HTTP response fixed the problem. And just for good measure, I changed the code waiting on the MySQL server to also try to establish an actual connection to the MySQL application layer, not merely a TCP socket.
After solving this mystery, I thought there's no way I could be so blind as to not see the container receiving the stop signal during bootstrap. So, I changed things back to prove to myself I wasn't on crack. No matter how hard I tried, I could not get the logs to show that the signal was received. I think what was happening was that my script was starting the container and issuing the graceful stop so quickly that it wasn't captured by log clients. Sure enough, adding some sleeps in the proper places made it possible to catch the events in action. In hindsight, I suppose I could have used docker events to shed some light on this as well. If Docker persisted logs/output from containers and allowed me to scroll back in time, I think this would have saved me. Although, there's a chance my entrypoint script wouldn't have informed me about the received signal. Perhaps checksetup.pl was ignoring it? What I really need is a unified event + log stream from Docker containers so I can debug exactly what's going on.
Everything is Working, Right?
After solving the inconsistent bootstrap state problem, things were looking pretty good. I had BMO bootstrapping and running from tests on both boot2docker and Ubuntu hosts. Tests were seeing completely independent environments and there were no race conditions. I was nearly done.
So, I started porting more and more tests to Docker. I started running tests more and more. Things worked. Most of the time. But I'm still frustrated by periodic apparent bugs in Docker. For example, our containers periodically fail to shut down. Our images periodically fail to delete.
During container shutdown and delete at the end of tests, we periodically see error messagess like the following:
docker.errors.APIError: 500 Server Error: Internal Server Error ("Cannot destroy container f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: Driver aufs failed to remove root filesystem f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: rename /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255 /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255-removing: device or resource busy")
500 Server Error: Internal Server Error ("Cannot destroy container 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: Unable to remove filesystem for 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: remove /mnt/sda1/var/lib/docker/containers/7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: directory not empty")
Due to the way we're executing tests (Mercurial's .t test format), this causes the test's output to change and the test to fail. Sadness.
I think these errors are limited to boot2docker/aufs. But we haven't executed enough test runs in the Jenkins/Ubuntu/btrfs VM yet to be sure. This definitely smells like a bug in Docker and it is very annoying.
After much wrangling and going deeper in a rabbit hole than I ever felt was possible, I finally managed to get BMO running inside Docker as part of our test infrastructure. We're now building tests for complicated components that touch Mercurial, Review Board, and Bugzilla and people are generally happy with how things work.
There are still a handful of bugs, workarounds, and components that aren't as optimal as I would like them to be. But you can't always have perfection.
My takeaway from this ordeal is that Docker still has a number of bugs and user experience issues to resolve. I really want to support Docker and to see it succeed. But every time I try doing something non-trivial with Docker, I get bit hard. Yes, some of the issues I experienced were due to my own ignorance. But at the same time, if one of Docker's mantras is about simplicity and usability, then should there be such gaping cracks for people like me to fall through?
In the end, the promise of Docker fits my use case almost perfectly. I know the architecture is a good fit for testing. We will likely stick with Docker, especially now that I've spent the time to make it work. I really wish this project would have taken a few days, not a few weeks.
Version control services and interaction with them play an important role at any company. Despite version control being a critical part of your infrastructure, my experience from working at a few companies and talking with others is that version control often doesn't get the testing love that other services do. Hooks get written, spot-tested by the author, and deployed. Tools that interact with version control often rely on behavior that may or may not change over time, especially when the version of your version control software is upgraded.
We've seen this pattern at Mozilla. Mercurial hooks and extensions were written and deployed to the server without test coverage. As a result, things break when we try to upgrade the server. This happens a few times and you naturally develop an attitude of fear, uncertainty, and doubt around touching anything on the server (or the clients for that matter). If it isn't broken, why fix it prevails for months or years. Then one an enthusiastic individual comes around wanting to deploy some hot new functionality. You tell them the path is arduous because the server is running antiquated versions of software and nothing is tested. The individual realizes the amazing change isn't worth the effort and justifiably throws up their hands and gives up. This is almost a textbook definition of how not having test coverage can result in technical debt. This is the position Mozilla is trying to recover from.
One of the biggest impacts I've had since joining the Developer Services Team at Mozilla a little over a month ago has been changing the story about how we test version control at Mozilla.
I'm proud to say that Mozilla now has a robust enough testing infrastructure in place around our Mercurial server that we're feeling pretty good about silencing the doubters when it comes to changing server behavior. Here's how we did it.
The genesis of this project was likely me getting involved with the hg-git and Mercurial projects. For hg-git, I learned a bit about Mercurial internals and how extensions work. When I looked at Mercurial extensions and hooks used by Mozilla, I started to realize what parts were good and what parts were bad. I realized what parts would likely break after upgrades. When I started contributing patches to Mercurial itself, I took notice of how Mercurial is tested. When I discovered T Tests, I thought, wow, that's pretty cool: we should use them to test Mozilla's Mercurial customizations!
After some frustrations with Mercurial extensions breaking after Mercurial upgrades, I wanted to do something about it to prevent this from happening again. I'm a huge fan of unified repositories. So earlier this year, I reached out to the various parties who maintain all the different components and convinced nearly everyone that establishing a single repository for all the version control code was a good idea. The version-control-tools repository was born. Things were slow at first. It was initially pretty much my playground for hosting Mercurial extensions that I authored. Fast forward a few months, and the version-control-tools repository now contains full history imports of our Mercurial hooks that are deployed on hg.mozilla.org, the templates used to render HTML on hg.mozilla.org, and pretty much every Mercurial extension authored by Mozillians, including pushlog. Having all the code in one repository has been very useful. It has simplified server deployments: we now pull 1 repository instead of 3. If there is a dependency between different components, we can do the update atomically. These are all benefits of using a single repository instead of N>1.
While version-control-tools was still pretty much my personal playground, I introduced a short script for running tests. It was pretty basic: just find test files and invoke them with Mercurial's test harness. It served my needs pretty well. Over time, as more and more functionality was rolled into version-control-tools, we expanded the scope of the test harness.
We can now run Python unit tests (in addition to Mercurial .t tests). Test all of the things!
We set up continuous integration with Jenkins so tests run after check-in and alert us when things fail.
We added support for running tests against multiple versions of Mercurial. We even test the bleeding edge of Mercurial so we know when an upstream Mercurial change breaks our code. So, no more surprises on Mercurial release day. I can tell you today that we have a handful of extensions that are broken in Mercurial 3.2, due for release around November 1. (Hopefully we'll fix them before release.)
We have Vagrant configurations so you can start a virtual machine that runs the tests the same way Jenkins does.
The latest addition to the test harness is the ability to spin up Docker containers as part of tests. Right now, this is limited to running Bugzilla during tests. But I imagine the scope will only increase over time.
Before I go on, I want to quickly explain how amazing Mercurial's .t tests are. These are a flavor of tests used by Mercurial and the dominant form of new tests added to the version-control-tools repository. These tests are glorified shell scripts annotated with expected command output and other metadata. It might be easier to explain by showing. Take bzpost's tests as an example. The bzpost extension automatically posts commit URLs to Bugzilla during push. Read more if you are interested. What I like so much about .t tests is that they are actually testing the user experience. The test actually runs hg push and verifies the output is exactly what is intended. Furthermore, since we're running a Dockerized Bugzilla server during the test, we're able to verify that the bzpost extension actually resulted in Bugzilla comments being added to the appropriate bug(s). Contrast this with unit tests that only test a subset of functionality. Or, contrast with writing a lot of boilerplate and often hard-to-read code that invokes processes and uses regular expressions, etc to compare output. I find .t tests are more concise and they do a better job of testing user experience. More than once I've written a .t test and thought this user experience doesn't feel right, I should change the behavior to be more user friendly. This happened because I was writing actual end-user commands as part of writing tests and seeing the exact output the user would see. It is much harder to attain this sense of understanding when writing unit tests. I can name a few projects with poor command line interfaces that could benefit from this approach... I'm not saying .t tests are perfect or that they should replace other testing methodologies such as unit tests. I just think they are very useful for accurately testing higher-level functionality and for assessing user experience. I really wish we had these tests for mach commands...
Anyway, with a proper testing harness in place for our version control code, we've been pretty good about ensuring new code is properly tested. When people submit new hooks or patches to existing hooks, we can push back and refuse to grant review unless tests are included. When someone requests a new deployment to the server, we can look at what changed, cross-reference to test coverage, and assess the riskiness of the deployment. We're getting to the point where we just trust our tests and server deployments are minor events. Concerns over accidental regressions due to server changes are waning. We can tell people if you really care about this not breaking, you need a test and if you add a test, we'll support it for you. People are often more than happy to write tests to ensure them peace of mind, especially when that test's presence shifts maintenance responsibility away from them. We're happy because we don't have many surprises (and fire drills) at deployment time. It's a win-win!
So, what's next? Good question! We still have a number of large gaps in our test coverage. Our code to synchronize repositories from the master server to read-only slaves is likely the most critical omission. We also don't yet have a good way of reproducing our server environment. Ideally, we'd run the continuous integration in an environment that's very similar to production. Same package versions and everything. This would also allow us to simulate the actual hg.mozilla.org server topology during tests. Currently, our tests are more unit-style than integration-style. We rely on the consistent behavior of Mercurial and other tools as sufficient proxies for test accuracy and we back those up with running the tests on the staging server before production deployment. But these aren't a substitute for an accurate reproduction of the production servers, especially when it comes to things like the replication tests. We'll get there some day. I also have plans to improve Mercurial's test harness to better facilitate some of our advanced use cases. I would absolutely love to make Mercurial's .t test harness more consumable outside the context of Mercurial. (cram is one such attempt at this.) We also need to incorporate the Git server code into this repository. Currently, I'm pretty sure everything Git at Mozilla is untested. Challenge accepted!
In summary, our story for testing version control at Mozilla has gone from a cobbled together mess to something cohesive and comprehensive. This has given us confidence to move fast without breaking things. I think the weeks of people time invested into improving the state of testing was well spent and will pay enormous dividends going forward. Looking back, the mountain of technical debt now looks like a mole hill. I feel good knowing that I played a part in making this change.
I recently gained SSH access to Mozilla's Mercurial servers. This allows me to run some custom queries directly against the data. I was interested in some high-level numbers and thought I'd share the results.
hg.mozilla.org hosts a total of 3,445 repositories. Of these, there are 1,223 distinct root commits (i.e. distinct graphs). Altogether, there are 32,123,211 commits. Of those, there are 865,594 distinct commits (not double counting commits that appear in multiple repositories).
We have a high ratio of total commits to distinct commits (about 37:1). This means we have high duplication of data on disk. This basically means a lot of repos are clones/forks of existing ones. No big surprise there.
What is surprising to me is the low number of total distinct commits. I was expecting the number to run into the millions. (Firefox itself accounts for ~240,000 commits.) Perhaps a lot of the data is sitting in Git, Bitbucket, and GitHub. Sounds like a good data mining expedition...
Next Page »