Mercurial Server Hiccup 2014-11-06

November 07, 2014 at 11:00 AM | categories: Mercurial, Mozilla | View Comments

We had a hiccup on hg.mozilla.org yesterday. It resulted in prolonged tree closures for Firefox. Bug 1094922 tracks the issue.

What Happened

We noticed that many HTTP requests to hg.mozilla.org were getting 503 responses. On initial glance, the servers were healthy. CPU was below 100% utilization, I/O wait was reasonable. And there was little to no swapping. Furthermore, the logs showed a healthy stream of requests being successfully processed at levels that are typical. In other words, it looked like business as usual on the servers.

Upon deeper investigation, we noticed that the WSGI process pool on the servers was fully saturated. There were were 24 slots/processes per server allocated to process Mercurial requests and all 24 of them were actively processing requests. This created a backlog of requests that had been accepted by the HTTP server but were waiting an internal dispatch/proxy to WSGI. To the client, this would appear as a request with a long lag before response generation.

Mitigation

This being an emergency (trees were already closed and developers were effectively unable to use hg.mozilla.org), we decided to increase the size of the WSGI worker pool. After all, we had CPU, I/O, and memory capacity to spare and we could identify the root cause later. We first bumped worker pool capacity from 24 to 36 and immediately saw a significant reduction in the number of pending requests awaiting a WSGI worker. We still had spare CPU, I/O, and memory capacity and were still seeing requests waiting on a WSGI worker, so we bumped the capacity to 48 processes. At that time, we stopped seeing worker pool exhaustion and all incoming requests were being handed off to a WSGI worker as soon as they came in.

At this time, things were looking pretty healthy from the server end.

Impact on Memory and Swapping

Increasing the number of WSGI processes had the side-effect of increasing the total amount of system memory used by Mercurial processes in two ways. First, more processes means more memory. That part is obvious. Second, more processes means fewer requests for each process per unit of time and thus it takes longer for each process to reach its max number of requests being being reaped. (It is a common practice in servers to have a single process hand multiple requests. This prevents overhead associated with spawning a new process and loading possibly expensive context in it.)

We had our Mercurial WSGI processes configured to serve 100 requests before being reaped. With the doubling of WSGI processes from 24 to 48, those processes were lingering for 2x as long as before. Since the Mercurial processes grow over time (they are aggressive about caching repository data), this was slowly exhausting our memory pool.

It took a few hours, but a few servers started flirting with high swap usage. (We don't expect the servers to swap.) This is how we identified that memory use wasn't sane.

We lowered the maximum requests per process from 100 to 50 to match the ratio increase in the WSGI worker pool.

Mercurial Memory "Leak"

When we started looking at the memory usage of WSGI processes in more detail, we noticed something strange: RSS of Mercurial processes was growing steadily when processes were streaming bundle data. This seemed very odd to me. Being a Mercurial developer, I was pretty sure the behavior was wrong.

I filed a bug against Mercurial.

I was able to reproduce the issue locally and started running a bisection to find the regressing changeset. To my surprise, this issue has been around since Mercurial 2.7!

I looked at the code in question, identified why so much memory was being allocated, and submitted patches to stop an unbounded memory growth during clone/pull and to further reduce memory use during those operations. Both of those patches have been queued to go in the next stable release of Mercurial, 3.2.1.

Mercurial 3.2 is still not as memory efficient during clones as Mercurial 2.5.4. If I have time, I'd like to formulate more patches. But the important fix - not growing memory unbounded during clone/pull - is in place.

Armed with the knowledge that Mercurial is leaky (although not a leak in the traditional sense since the memory was eventually getting garbage collected), we further reduced the max requests per process from 50 to 20. This will cause processes to get reaped sooner and will be more aggressive about keeping RSS growth in check.

Root Cause

We suspect the root cause of the event is a network event.

Before this outage, we rarely had more than 10 requests being served from the WSGI worker pool. In other words, we were often well below 50% capacity. But something changed yesterday. More slots were being occupied and high-bandwidth operations were taking longer to complete. Kendall Libby noted that outbound traffic dropped by ~800 Mbps during the event. For reasons that still haven't been identified, the network became slower, clones weren't being processed as quickly, and clients were occupying WSGI processes for longer amounts of time. This eventually exhausted the available process pool, leading to HTTP 503's, intermittent service availability, and a tree closure.

Interestingly, we noticed that in-flight HTTP requests are abnormally high again this morning. However, because the servers are now configured to handle the extra capacity, we seem to be powering through it without any issues.

In Hindsight

You can make the argument that the servers weren't configured to serve as much traffic as possible. After all, we were able to double the WSGI process pool without hitting CPU, I/O, and memory limits.

The servers were conservatively configured. However, the worker pool was initially configured at 2x CPU core count. And as a general rule of thumb, you don't want your worker pool to be much greater than CPU count because that introduces context switching and can give each individual process a smaller slice of the CPU to process requests, leading to higher latency. Since clone operations often manage to peg a single CPU core, there is some justification for keeping the ratio of WSGI workers to CPU count low. Furthermore, we rarely came close to exhausting the WSGI worker pool before. There was little to no justification for increasing capacity to a threshold not normally seen.

But at the same time, even with 4x workers to CPU cores, our CPU usage rarely flirts with 100% across all cores, even with the majority of workers occupied. Until we actually hit CPU (or I/O) limits, running a high multiplier seems like the right thing to do.

Long term, we expect CPU usage during clone operations to drop dramatically. Mike Hommey has contributed a patch to Mercurial that allows servers to hand out a URL of a bundle file to fetch during clone. So, a server can say I have your data: fetch this static file from S3 and then apply this small subset of the data that I'll give you. When properly deployed and used at Mozilla, this will effectively drop server-side CPU usage for clones to nothing.

Where to do Better

There was a long delay between the Nagios alerts firing and someone with domain-specific knowledge looking at the problem.

The trees could have reopened earlier. We were pretty confident about the state of things at 1000. Trees opened in metered mode at 1246 and completely at 1909. Although, the swapping issue wasn't mitigated until 1615, so you can argue that being conservative on the tree reopening was justified. There is a chance that full reopening could have triggered excessive swap and another round of chaos for everyone involved.

We need an alert on WSGI pool exhaustion. It took longer than it should have to identify this problem. However, now that we've encountered it, it should be obvious if/when it happens again.

Firefox release automation is the largest single consumer of hg.mozilla.org. Since they are operating thousands of machines, any reduction in interaction or increase in efficiency will result in drastic load reductions on the server. Chris AtLee and Jordan Lund have been working on bug 1050109 to reduce clones of the mozharness and build/tools repositories, which should go a long way to dropping load on the server.

Timeline of Events

All times PST.

November 6

  • 0705 - First Nagios alerts fire
  • 0819 - Trees closed
  • 0915 - WSGI process pool increased from 24 to 36
  • 0945 - WSGI process pool increased from 36 to 48
  • 1246 - Trees reopen in metered mode
  • 1615 - Decrease max requests per process from 100 to 50
  • 1909 - Trees open completely

November 7

  • 0012 - Patches to reduce memory usage submitted to Mercurial
  • 0800 - Mercurial patches accepted
  • 0915 - Decrease max requests per process from 50 to 20
Read and Post Comments

Soft Launch of MozReview

October 30, 2014 at 11:15 AM | categories: Review Board, Mozilla, code review | View Comments

We performed a soft launch of MozReview: Mozilla's new code review tool yesterday!

What does that mean? How do I use it? What are the features? How do I get in touch or contribute? These are all great questions. The answers to those and more can all be found in the MozReview documentation. If they aren't, it's a bug in the documentation. File a bug or submit a patch. Instructions to do that are in the documentation.

Read and Post Comments

Implications of Using Bugzilla for Firefox Patch Development

October 27, 2014 at 03:27 PM | categories: Review Board, Bugzilla, Mozilla, code review | View Comments

Mozilla is very close to rolling out a new code review tool based on Review Board. When I became involved in the project, I viewed it as an opportunity to start from a clean slate and design the ideal code development workflow for the average Firefox developer. When the design of the code review experience was discussed, I would push for decisions that were compatible with my utopian end state.

As part of formulating the ideal workflows and design of the new tool, I needed to investigate why we do things the way we do, whether they are optimal, and whether they are necessary. As part of that, I spent a lot of time thinking about Bugzilla's role in shaping the code that goes into Firefox. This post is a summary of my findings.

The primary goal of this post is to dissect the practices that Bugzilla influences and to prepare the reader for the potential to reassemble the pieces - to change the workflows - in the future, primarily around Mozilla's new code review tool. By showing that Bugzilla has influenced the popularization of what I consider non-optimal practices, it is my hope that readers start to question the existing processes and open up their mind to change.

Since the impetus for this post in the near deployment of Mozilla's new code review tool, many of my points will focus on code review.

Before I go into my findings, I'd like to explicitly state that while many of the things I'm about to say may come across as negativity towards Bugzilla, my intentions are not to put down Bugzilla or the people who maintain it. Yes, there are limitations in Bugzilla. But I don't think it is correct to point fingers and blame Bugzilla or its maintainers for these limitations. I think we got where we are following years of very gradual shifts. I don't think you can blame Bugzilla for the circumstances that led us here. Furthermore, Bugzilla maintainers are quick to admit the faults and limitations of Bugzilla. And, they are adamant about and instrumental in rolling out the new code review tool, which shifts code review out of Bugzilla. Again, my intent is not to put down Bugzilla. So please don't direct ire that way yourself.

So, let's drill down into some of the implications of using Bugzilla.

Difficult to Separate Contexts

The stream of changes on a bug in Bugzilla (including review comments) is a flat, linear list of plain text comments. This works great when the activity of a bug follows a nice, linear, singular topic flow. However, real bug activity does not happen this way. All but the most trivial bugs usually involve multiple points of discussion. You typically have discussion about what the bug is. When a patch comes along, reviewer feedback comes in both high-level and low-level forms. Each item in each group is its own logical discussion thread. When patches land, you typically have points of discussion tracking the state of this patch. Has it been tested, does it need uplift, etc.

Bugzilla has things like keywords, flags, comment tags, and the whiteboard to enable some isolation of these various contexts. However, you still have a flat, linear list of plain text comments that contain the meat of the activity. It can be extremely difficult to follow these many interleaved logical threads.

In the context of code review, lumping all review comments into the same linear list adds overhead and undermines the process of landing the highest-quality patch possible.

Review feedback consists of both high-level and low-level comments. High-level would be things like architecture discussions. Low-level would be comments on the code itself. When these two classes of comments are lumped together in the same text field, I believe it is easy to lose track of the high-level comments and focus on the low-level. After all, you may have a short paragraph of high-level feedback right next to a mountain of low-level comments. Your eyes and brain tend to gravitate towards the larger set of more concrete low-level comments because you sub-consciously want to fix your problems and that large mass of text represents more problems, easier problems to solve than the shorter and often more abstract high-level summary. You want instant gratification and the pile of low-level comments is just too tempting to pass up. We have to train ourselves to temporarily ignore the low-level comments and focus on the high-level feedback. This is very difficult for some people. It is not an ideal disposition. Benjamin Smedberg's recent post on code review indirectly talks about some of this by describing his rational approach of tackling high-level first.

As review iterations occur, the bug devolves into a mix of comments related to high and low-level comments. It thus becomes harder and harder to track the current high-level state of the feedback, as they must be picked out from the mountain of low-level comments. If you've ever inherited someone else's half-finished bug, you know what I'm talking about.

I believe that Bugzilla's threadless and contextless comment flow disposes us towards focusing on low-level details instead of the high-level. I believe that important high-level discussions aren't occurring at the rate they need and that technical debt increases as a result.

Difficulty Tracking Individual Items of Feedback

Code review feedback consists of multiple items of feedback. Each one is related to the review at hand. But oftentimes each item can be considered independent from others, relevant only to a single line or section of code. Style feedback is one such example.

I find it helps to model code review as a tree. You start with one thing you want to do. That's the root node. You split that thing into multiple commits. That's a new layer on your tree. Finally, each comment on those commits and the comments on those comments represent new layers to the tree. Code review thus consists of many related, but independent branches, all flowing back to the same central concept or goal. There is a one to many relationship at nearly every level of the tree.

Again, Bugzilla lumps all these individual items of feedback into a linear series of flat text blobs. When you are commenting on code, you do get some code context printed out. But everything is plain text.

The result of this is that tracking the progress on individual items of feedback - individual branches in our conceptual tree - is difficult. Code authors must pore through text comments and manually keep an inventory of their progress towards addressing the comments. Some people copy the review comment into another text box or text editor and delete items once they've fixed them locally! And, when it comes time to review the new patch version, reviewers must go through the same exercise in order to verify that all their original points of feedback have been adequately addressed! You've now redundantly duplicated the feedback tracking mechanism among at least two people. That's wasteful in of itself.

Another consequence of this unstructured feedback tracking mechanism is that points of feedback tend to get lost. On complex reviews, you may be sorting through dozens of individual points of feedback. It is extremely easy to lose track of something. This could have disastrous consequences, such as the accidental creation of a 0day bug in Firefox. OK, that's a worst case scenario. But I know from experience that review comments can and do get lost. This results in new bugs being filed, author and reviewer double checking to see if other comments were not acted upon, and possibly severe bugs with user impacting behavior. In other words, this unstructured tracking of review feedback tends to lessen code quality and is thus a contributor to technical debt.

Fewer, Larger Patches

Bugzilla's user interface encourages the writing of fewer, larger patches. (The opposite would be many, smaller patches - sometimes referred to as micro commits.)

This result is achieved by a user interface that handles multiple patches so poorly that it effectively discourages that approach, driving people to create larger patches.

The stream of changes on a bug (including review comments) is a flat, linear list of plain text comments. This works great when the activity of a bug follows a nice, linear flow. However, reviewing multiple patches doesn't work in a linear model. If you attach multiple patches to a bug, the review comments and their replies for all the patches will be interleaved in the same linear comment list. This flies in the face of the reality that each patch/review is logically its own thread that deserves to be followed on its own. The end result is that it is extremely difficult to track what's going on in each patch's review. Again, we have different contexts - different branches of a tree - all living in the same flat list.

Because conducting review on separate patches is so painful, people are effectively left with two choices: 1) write a single, monolithic patch 2) create a new bug. Both options suck.

Larger, monolithic patches are harder and slower to review. Larger patches require much more cognitive load to review, as the reviewer needs to capture the entire context in order to make a review determination. This takes more time. The increased surface area of the patch also increases the liklihood that the reviewer will find something wrong and will require a re-review. The added complexity of a larger patch also means the chances of a bug creeping in are higher, leading to more bugs being filed and more reviews later. The more review cycles the patch goes through, the greater the chances it will suffer from bit rot and will need updating before it lands, possibly incurring yet more rounds of review. And, since we measure progress in terms of code landing, the delay to get a large patch through many rounds of review makes us feel lethargic and demotivates us. Large patches have intrinsic properties that lead to compounding problems and increased development cost.

As bad as large patches are, they are roughly in the same badness range as the alternative: creating more bugs.

When you create a new bug to hold the context for the review of an individual commit, you are doing a lot of things, very few of them helpful. First, you must create a new bug. There's overhead to do that. You need to type in a summary, set up the bug dependencies, CC the proper people, update the commit message in your patch, upload your patch/attachment to the new bug, mark the attachment on the old bug obsolete, etc. This is arguably tolerable, especially with tools that can automate the steps (although I don't believe there is a single tool that does all of what I mentioned automatically). But the badness of multiple bugs doesn't stop there.

Creating multiple bugs fragments the knowledge and history of your change and diminishes the purpose of a bug. You got in the situation of creating multiple bugs because you were working on a single logical change. It just so happened that you needed/wanted multiple commits/patches/reviews to represent that singular change. That initial change was likely tracked by a single bug. And now, because of Bugzilla's poor user interface around mutliple patch reviews, you now find yourself creating yet another bug. Now you have two bug numbers - two identifiers that look identical, only varying by their numeric value - referring to the same logical thing. We've started with a single bug number referring to your logical change and created what are effectively sub-issues, but allocated them in the same namespace as normal bugs. We've diminished the importance of the average bug. We've introduced confusion as to where one should go to learn about this single, logical change. Should I go to bug X or bug Y? Sure, you can likely go to one and ultimately find what you were looking for. But that takes more effort.

Creating separate bugs for separate reviews also makes refactoring harder. If you are going the micro commit route, chances are you do a lot of history rewriting. Commits are combined. Commits are split. Commits are reordered. And if those commits are all mapping to individual bugs, you potentially find yourself in a huge mess. Combining commits might mean resolving bugs as duplicates of each other. Splitting commits means creating yet another bug. And let's not forget about managing bug dependencies. Do you set up your dependencies so you have a linear, waterfall dependency corresponding to commit order? That logically makes sense, but it is hard to keep in sync. Or, do you just make all the review bugs depend on a single parent bug? If you do that, how do you communicate the order of the patches to the reviewer? Manually? That's yet more overhead. History rewriting - an operation that modern version control tools like Git and Mercurial have enabled to be a lightweight operation and users love because it doesn't constrain them to pre-defined workflows - thus become much more costly. The cost may even be so high that some people forego rewriting completely, trading their effort for some poor reviewer who has to inherit a series of patches that isn't organized as logically as it could be. Like larger patches, this increases cognitive load required to perform reviews and increases development costs.

As you can see, reviewing multiple, smaller patches with Bugzilla often leads to a horrible user experience. So, we find ourselves writing larger, monolithic patches and living with their numerous deficiencies. At least with monolithic patches we have a predictable outcome for how interaction with Bugzilla will play out!

I have little doubt that large patches (whose existence is influenced by the UI of Bugzilla) slows down the development velocity of Firefox.

Commit Message Formatting

The heavy involvement of Bugzilla in our code development lifecycle has influenced how we write commit messages. Let's start with the obvious example. Here is our standard commit message format for Firefox:

Bug 1234 - Fix some feature foo; r=gps

The bug is right there at the front of the commit message. That prominent placement is effectively saying the bug number is the most important detail about this commit - everything else is ancillary.

Now, I'm sure some of you are saying, but Greg, the short description of the change is obviously more important than the bug number. You are right. But we've allowed ourselves to make the bug and the content therein more important than the commit.

Supporting my theory is the commit message content following the first/summary line. That data is almost always - wait for it - nothing: we generally don't write commit messages that contain more than a single summary line. My repository forensics show that that less than 20% of commit messages to Firefox in 2014 contain multiple lines (this excludes merge and backout commits). (We are doing better than 2013 - the rate was less than 15% then).

Our commit messages are basically saying, here's a highly-abbreviated summary of the change and a pointer (a bug number) to where you can find out more. And of course loading the bug typically reveals a mass of interleaved comments on various topics, hardly the high-level summary you were hoping was captured in the commit message.

Before I go on, in case you are on the fence as to the benefit of detailed commit messages, please read Phabricator's recommendations on revision control and writing reviewable code. I think both write-ups are terrific and are excellent templates that apply to nearly everyone, especially a project as large and complex as Firefox.

Anyway, there are many reasons why we don't capture a detailed, multi-line commit message. For starters, you aren't immediately rewarded for doing it: writing a good commit message doesn't really improve much in the short term (unless someone yells at you for not doing it). This is a generic problem applicable to all organizations and tools. This is a problem that culture must ultimately rectify. But our tools shouldn't reinforce the disposition towards laziness: they should reward best practices.

I don't believe Bugzilla and our interactions with it do an adequate job rewarding good commit message writing. Chances are your mechanism for posting reviews to Bugzilla or posting the publishing of a commit to Bugzilla (pasting the URL in the simple case) brings up a text box for you to type review notes, a patch description, or extra context for the landing. These should be going in the commit message, as they are the type of high-level context and summarizations of choices or actions that people crave when discerning the history of a repository. But because that text box is there, taunting you with its presence, we write content there instead of in the commit message. Even where tools like bzexport exist to upload patches to Bugzilla, potentially nipping this practice in the bug, it still engages in frustrating behavior like reposting the same long commit message on every patch upload, producing unwanted bug spam. Even a tool that is pretty sensibly designed has an implementation detail that undermines a good practice.

Machine Processing of Patches is Difficult

I have a challenge for you: identify all patches currently under consideration for incorporation in the Firefox source tree, run static analysis on them, and tell me if they meet our code style policies.

This should be a solved problem and deployed system at Mozilla. It isn't. Part of the problem is because we're using Bugzilla for conducting review and doing patch management. That may sound counter-intuitive at first: Bugzilla is a centralized service - surely we can poll it to discover patches and then do stuff with those patches. We can. In theory. Things break down very quickly if you try this.

We are uploading patch files to Bugzilla. Patch files are representations of commits that live outside a repository. In order to get the full context - the result of the patch file - you need all the content leading up to that patch file - the repository data. When a naked patch file is uploaded to Bugzilla, you don't always have this context.

For starters, you don't know with certainly which repository the patch belongs to because that isn't part of the standard patch format produced by Mercurial or Git. There are patches for various repositories floating around in Bugzilla. So now you need a way to identify which repository a patch belongs to. It is a solvable problem (aggregate data for all repositories and match patches based on file paths, referenced commits, etc), albeit one Mozilla has not yet solved (but should).

Assuming you can identify the repository a patch belongs to, you need to know the parent commit so you can apply this patch. Some patches list their parent commits. Others do not. Even those that do may lie about it. Patches in MQ don't update their parent field when they are pushed, only after they are refreshed. You could be testing and uploading a patch with a different parent commit than what's listed in the patch file! Even if you do identify the parent commit, this commit could belong to another patch under consideration that's also on Bugzilla! So now you need to assemble a directed graph with all the patches known from Bugzilla applied. Hopefully they all fit in nicely.

Of course, some patches don't have any metadata at all: they are just naked diffs or are malformed commits produced by tools that e.g. attempt to convert Git commits to Mercurial commits (Git users: you should be using hg-git to produce proper Mercurial commits for Firefox patches).

Because Bugzilla is talking in terms of patch files, we often lose much of the context needed to build nice tools, preventing numerous potential workflow optimizations through automation. There are many things machines could be doing for us (such as looking for coding style violations). Instead, humans are doing this work and costing Mozilla a lot of time and lost developer productivity in the process. (A human costs ~$100/hr. A machine on EC2 is pennies per hour and should do the job with lower latency. In other words, you can operate over 300 machines 24 hours a day for what you may an engineer to work an 8 hour shift.)

Conclusion

I have outlined a few of the side-effects of using Bugzilla as part of our day-to-day development, review, and landing of changes to Firefox.

There are several takeways.

First, one cannot argue the fact that Firefox development is bug(zilla) centric. Nearly every important milestone in the lifecycle of a patch involves Bugzilla in some way. This has its benefits and drawbacks. This article has identified many of the drawbacks. But before you start crying to expunge Bugzilla from the loop completely, consider the benefits, such as a place anyone can go to to add metadata or comments on something. That's huge. There is a larger discussion to be had here. But I don't want to be inviting it quite yet.

A common thread between many of the points above is Bugzilla's unstructured and generic handling of code and metadata attached to it (patches, review comments, and landing information). Patches are attachments, which can be anything under the sun. Review comments are plain text comments with simple author, date, and tag metadata. Landings are also communicated by plain text review comments (at least initially - keywords and flags are used in some scenarios).

By being a generic tool, Bugzilla throws away a lot of the rich metadata that we produce. That data is still technically there in many scenarios. But it becomes extremely difficult if not practically impossible for both humans and machines to access efficiently. We lose important context and feedback by normalizing all this data to Bugzilla. This data loss creates overhead and technical debt. It slows Mozilla down.

Fortunately, the solutions to these problems and shortcomings are conceptually simple (and generally applicable): preserve rich context. In the context of patch distribution, push commits to a repository and tell someone to pull those commits. In the context of code review, create sub-reviews for different commits and allow tracking and easy-to-follow (likely threaded) discussions on found issues. Design workflow to be code first, not tool or bug first. Optimize workflows to minimize people time. Lean heavily on machines to do grunt work. Integrate issue tracking and code review, but not too tightly (loosely coupled, highly cohesive). Let different tools specialize in the handling of different forms of data: let code review handle code review. Let Bugzilla handle issue tracking. Let a landing tool handle tracking the state of landings. Use middleware to make them appear as one logical service if they aren't designed to be one from the start (such as is Mozilla's case with Bugzilla).

Another solution that's generally applicable is to refine and optimize the whole process to land a finished commit. Your product is based on software. So anything that adds overhead or loss of quality in the process of developing that software is fundamentally a product problem and should be treated as such. Any time and brain cycles lost to development friction or bugs that arise from things like inadequate code reviews tools degrade the quality of your product and take away from the user experience. This should be plain to see. Attaching a cost to this to convince the business-minded folks that it is worth addressing is a harder matter. I find management with empathy and shared understanding of what amazing tools can do helps a lot.

If I had to sum up the solution in one sentence, it would be: invest in tools and developer happiness.

I hope to soon publish a post on how Mozilla's new code review tool addresses many of the workflow deficiencies present today. Stay tuned.

Read and Post Comments

The Rabbit Hole of Using Docker in Automated Tests

October 16, 2014 at 01:45 PM | categories: Review Board, Mercurial, Docker, Mozilla | View Comments

Warning: This post is long and rambling. There is marginal value in reading beyond the first few paragraphs unless you care about Docker.

I recently wrote about how Mozilla tests version control. In this post, I want to talk about the part of that effort that consumed the most time: adding Docker support to the test harness.

Introducing the Problem and Desired End State

Running Docker containers inside tests just seems like an obvious thing you'd want to do. I mean, wouldn't it be cool if your tests could spin up MySQL, Redis, Cassandra, Nginx, etc inside Docker containers and test things against actual instances of the things running in your data centers? Of course it would! If you ask me, this approach beats mocking because many questions around accuracy of the mocked interface are removed. Furthermore, you can run all tests locally, while on a plane: no data center or staging environment required. How cool is that! And, containers are all isolated so there's no need to pollute your system with extra packages and system services. Seems like wins all around.

When Mozilla started adding customizations to the Review Board code review software in preparation for deploying it at Mozilla as a replacement for Bugzilla's Splinter, it quickly became apparant that we had a significant testing challenge ahead of us. We weren't just standing up Review Board and telling people to use it, we were integrating user authentication with Bugzilla, having Review Board update Bugzilla after key events, and were driving the initiation of code review in Review Board by pushing code to a Mercurial server. That's 3 user-visible services all communicating with each to expose a unified workflow. It's the kind of thing testing nightmares are made of.

During my early involvement with the project, I recognized the challenge ahead and was quick to insist that we write automated tests for as much as possible. I insisted that all the code (there are multiple extensions to ReviewBoard, a Mercurial hook, and a Mercurial extension) live under one common repository and share testing. That way we could tinker with all the parts easily and test them in concern without having to worry about version sync. We moved all the code to the version-control-tools repository and Review Board was the driving force behind improvements to the test harness in that repository. We had Mercurial .t tests starting Django dev servers hosting Review Board running from per-test SQLite databases and all was nice. Pretty much every scenario involving the interaction between Mercurial and ReviewBoard was tested. If you cared about just these components, life was happy.

A large piece of the integration story was lacking in this testing world: Bugzilla. We had somewhat complex code for having Review Board and Bugzilla talk to each other but no tests for it because nobody had yet hooked Bugzilla up to the tests. As my responsibilities in the project expanded from covering just the Mercurial and Review Board interaction to Bugzilla as well, I again looked at the situation and thought there's a lot of complex interaction here and alpha testing has revealed the presence of many bugs: we need a better testing story. So, I set out to integrate Bugzilla into the test harness.

My goals were for Review Board tests to be able to make requests against a Bugzilla instance configured just like bugzilla.mozilla.org, to allow tests to execute concurrently (don't make developers wait on machines), for tests to run as quickly as possible, to run tests in an environment as similar to production as possible, and to be able to run tests from a plane or train or anywhere without internet connectivity. I was unwilling to budge on these core testing requirements because they represent what's best from test accuracy and developer productivity standpoints: you want your tests to be representative of the real world and you want to enable people to hack on this service anywhere, anytime, and not be held back by tests that take too long to execute. Move fast and don't break things.

Before I go on, it's time for a quick aside on tolerable waiting times. Throughout this article I talk about minimizing the run time of tests. This may sound like premature optimization. I argue it isn't, at least not if you are optimizing for developer productivity. There is a fair bit of academic research in this area. A study on tolerable waiting time: how long are Web users willing to wait gets cited a lot. It says 2 seconds for web content. If you read a few paragraphs in, it references other literature. They disagree on specific thresholds, but one thing is common: the thresholds are typically low - just a few seconds. The latencies I deal with are all longer than what research says leads to badness. When given a choice, I want to optimize workflows for what humans are conditioned to tolerate. If I can't do that, I've failed and the software will be ineffective.

The architecture of Bugzilla created some challenges and eliminated some implementation possibilities. First, I wasn't using any Bugzilla: I was using Mozilla's branch of Bugzilla that powers bugzilla.mozilla.org. Let's call it BMO. I could try hosting it from local SQLite files and running a local, Perl-based HTTP server (Bugzilla is written in Perl). But my experience with Perl and takeaways from talking to the BMO admins was that pain would likely be involved. Plus, this would be a departure from test accuracy. So, I would be using MySQL, Apache HTTPd, and mod_perl, just like BMO uses them in production.

Running Apache and MySQL is always a... fun endeavor. It wasn't a strict requirement, but I also highly preferred that the tests didn't pollute the system they ran on. In other words, having tests connect to an already-running MySQL or Apache server felt like the wrong solution. That's just one more thing people must set up and run locally to run the tests. That's just one more thing that could differ from production and cause bad testing results. It felt like a dangerous approach. Plus, there's the requirement to run things concurrently. Could I have multiple tests talking to the same MySQL server concurrently? They'd have to use separate databases so they don't conflict. That's a possibility. Honestly, I didn't entertain the thought of running Apache and MySQL manually for too long. I knew about this thing called Docker and that it theoretically fit my use case perfectly: construct building blocks for your application and then dymanically hook things up. Perfect. I could build Docker containers for all the required services and have each test start a new, independent set of containers for just that test.

So, I set out integrating Docker into the version-control-tools test harness. Specifically, my goal was to enable the running of independent BMO instances during individual tests. It sounded simple enough.

What I didn't know was that integrating a Dockerized BMO into the test harness would take the better part of 2 weeks. And it's still not up to my standards. This post is the story about the trials and tribulations I encountered along the way. I hope it serves as a warning and potentially a guide for others attempting similar feats. If any Docker developers are reading, I hope it gives you ideas on how to improve Docker.

Running Bugzilla inside Docker

First thing's first: to run BMO inside Docker I needed to make Docker containers for BMO. Fortunately, David Lawrence has prior art here. I really just wanted to take that code, dump it into version-control-tools and call it a day. In hindsight, I probably should have done that. Instead, armed with the knowledge of the Docker best practice of one container per service and David Lawrence's similar wishes to make his code conform to that ideal, I decided to spend some time to fix David's code so that MySQL and Apache were in separate containers, not part of a single container running supervisord. Easy enough, right?

It was relatively easy extracting the MySQL and Apache parts of BMO into separate containers. For MySQL, I started with the official MySQL container from the Docker library and added a custom my.cnf. Simple enough. For Apache, I just copied everything from David's code that wasn't MySQL. I was able to manually hook the containers together using the Docker CLI. It sort of just worked. I was optimistic this project would only take a few hours.

A garbage collection bug in Docker

My first speed bump came as I was iterating on Dockerfiles. All of a sudden I get an error from Docker that it is out of space. Wat? I look at docker images and don't see anything too obvious eating up space. What could be going on? At this point, I'm using boot2docker to host Docker. boot2docker is this nifty tool that allows Windows and OS X users to easily run Docker (Docker requires a Linux host). boot2docker spins up a Linux virtual machine running Docker and tells you how to point your local docker CLI interface at that VM. So, when Docker complains it is out of space, I knew immediately that the VM must be low on space. I SSH into it, run df, and sure enough, the VM is nearly out of space. But I looked at docker images -a and confirmed there's not enough data to fill the disk. What's going on? I can't find the issue right now, but it turns out there is a bug in Docker! When running Docker on aufs filesystems (like boot2docker does), Docker does not always remove data volumes containers when deleting a container. It turns out that the MySQL containers from the official Docker library were creating a data-only container to hold persistent MySQL data that outlives the container itself. These containers are apparently light magic. They are containers that are attached to other containers, but they don't really show up in the Docker interfaces. When you delete the host container, these containers are supposed to be garbage collected. Except on aufs, they aren't. My MySQL containers were creating 1+ GB InnoDB data files on start and the associated data containers were sitting around after container deletion, effectively leaking 1+ GB every time I created a MySQL container, quickly filling the boot2docker disk. Derp.

I worked around this problem by forking the official MySQL container. I didn't need persistent MySQL data (the containers only need to live for one invocation - for the lifetime of a single test), so I couldn't care less about persisted data volumes. So, I changed the MySQL container to hold its data locally, not in a data volume container. The solution was simple enough. But it took me a while to identify the problem. Here I was seeing Docker do something extremely stupid. Surely my understanding of Docker was wrong and I was doing something stupid to cause it to leak data. I spent hours digging through the documentation to make sure I was doing things exactly as recommended. It wasn't until I started an Ubuntu VM and tried the same thing there did I realize this looked like a bug in boot2docker. A few Google searches later led me to a comment hiding at the bottom of an existing GitHub issue that pins aufs as the culprit. And here I thought Docker reached 1.0 and wouldn't have bad bugs like this. I certainly wouldn't expect boot2docker to be shipping a VM with a sub-par storage driver (shouldn't it be using devicemapper or btrfs instead). Whatever.

Wrangling with Mozilla's Branch of Bugzilla

At this point, I've got basic Docker containers for MySQL and Apache+mod_perl+Bugzilla being created. Now, I needed to convert from vanilla Bugzilla to BMO. Should be straightforward. Just change the Git remote URL and branch to check out. I did this and all-of-a-sudden my image started encountering errors building! It turns out that the BMO code base doesn't work on a fresh database! Fortunately, this is a known issue and I've worked around it previously. When I tackled it a few months ago, I spent a handful of hours disecting this problem. It wasn't pretty. But this time I knew what to do. I even had a Puppet manifest for installing BMO on a fresh machine. So, I just needed to translate that Puppet config into Dockerfile commands. No big deal, right? Well, when I did that Puppet config a few months ago, I based it on Ubuntu because I'm more familiar with Debian-based distros and figured Ubuntu would be the easiest since it tends to have the largest package diversity. Unfortunately, David's Docker work is based on Fedora. So, I spent some time converting the Dockerfile to Ubuntu rather than trying to port things to Fedora. Arguably the wrong decision since Mozilla operates the RedHat flavor of Linux distributions in production. But I was willing to trade accuracy for time here, having lost time dealing with the aufs bug.

Unfortunately, I under-estimated how long it would take to port the image to Ubuntu. It didn't take so long from a code change perspective. Instead, most of the time was spent waiting for Docker to run the commands to build the image. In the final version, Apt is downloading and installing over 250 packages. And Bugzilla's bootstrap process installs dozens of packages from CPAN. Every time I made a small change, I invalidated Docker's image building cache, causing extreme delays while waiting for Apt and CPAN to do their thing. This experience partially contributed to my displeasure with how Docker currently handles image creation. If Docker images were composed of pre-built pieces instead of stacked commands, my cache hit rate would have been much higher and I would have converted the image in no time. But no, that's not how things work. So I lost numerous hours through this 2 week process waiting for Docker images to perform operations I've already done elsewhere dozens of times before.

Docker Container Orchestration

After porting the Bugzilla image to Ubuntu and getting BMO to bootstrap in a manually managed container (up to this point I'm using the docker CLI to create images, start containers, etc), it was time to automate the process so that tests could run the containers. At this time, I started looking for tools that performed multiple container orchestration. I had multiple containers that needed to be treated as a single logical unit, so I figured I'd use an existing tool to solve this problem for me. Don't reinvent the wheel unless you have to, right? I discovered Fig, which seemed to fit the bill. I read that it is being integrated into Docker itself, so it must be best of breed. Even if it weren't its future seems to be more certain than other tools. So, I stopped my tools search and used Fig without much consideration for other tools.

Lack of a useful feature in Fig

I quickly whipped up a fig.yml and figured it would just work. Nope! Starting the containers from scratch using Fig resulted in an error. I wasn't sure what the error was at first because Fig didn't tell me. After some investigation, I realized that my bmoweb container (the container holding Apache + BMO code) was failing in its entrypoint command (that's a command that runs when the container starts up, but not the primary command a container runs - that's a bit confusing I know - read the docs). I added some debug statements and quickly realized that Bugzilla was erroring connecting to MySQL. Strange, I thought. Fig is essentially a DSL around manual docker commands, so I checked everything by typing everything into the shell. No error. Again on a new set of containers. No error. I thought maybe my environment variable handling was wrong - that the dynamically allocated IP address and port number of the linked MySQL container being passed to the bmoweb container weren't getting honored. I added some logging to disprove that theory. The wheels inside my brain spun for a little bit. And, aided by some real-time logging, I realized I was dealing with a race condition: Fig was starting the MySQL and bmoweb containers concurrently and bmoweb was attempting to access the MySQL server before MySQL had fully initialized and started listening on its TCP port! That made sense. And I think it's a reasonable optimization for Fig to start containers concurrently to speed up start time. But surely a tool that orchestrates different containers has considered the problem of dependencies and has a mechanism to declare them to prevent these race conditions. I check the online docs and there's nothing to be found. A red panda weeps. So, I change the bmoweb entrypoint script to wait until it can open a TCP socket to MySQL before actually using MySQL and sure enough, the race condition goes away and the bmoweb container starts just fine!

OK, I'm real close now. I can feel it.

Bootstrapping Bugzilla

I start playing around with manually starting and stopping containers as part of a toy test. The good news is things appear to work. The bad news is it is extremely slow. It didn't take long for me to realize that the reason for the slowness is Bugzilla's bootstrap on first run. Bugzilla, like many complex applications, has a first run step that sets up database schema, writes out some files on the filesystem, inserts some skeleton data in the database, creates an admin user, etc. Much to my dismay this was taking a long time. Something on the order of 25 to 30 seconds. And that's on a Haswell with plenty of RAM and an SSD. Oy. The way things are currently implemented would result in a 25 to 30 second delay when running every test. Change 1 line and wait say 25s for any kind of output. Are you kidding me?! Unacceptable. It violated my core goal of having tests that are quick to run. Again, humans should not have to wait on machines.

I think about this problem for like half a second and the solution is obvious: take a snapshot of the bootstrapped images and start instances of that snapshot from tests. In other words, you perform the common bootstrap operations once and only once. And, you can probably do that outside the scope of running tests so that the same snapshot can be used across multiple invocations of the test harness. Sounds simple! To the Docker uninitiated, it sounds like the solution would be to move the BMO bootstrapping into the Dockerfile code so it gets executed at image creation time. Yes, that would be ideal. Unfortunately, when building images via Dockerfile, you can't tell Docker to link that image to another container. Without container linking, you can't have MySQL. Without MySQL, you can't do BMO bootstrap. So, BMO bootstrap must be done during container startup. And in Docker land, that means putting it as part of your entrypoint script (where it was conveniently already located for this reason).

Talking Directly to the Docker API

Of course, the tools that I found that help with Docker image building and container orchestration don't seem to have an answer for this create a snapshot of a bootstrapped container problem. I'm sure someone has solved this problem. But in my limited searching, I couldn't find anything. And, I figured the problem would be easy enough to solve manually, so I set about creating a script to do it. I'm not a huge fan of shell script for automation. It's hard to debug and simple things can be hard and hard things can be harder. Plus, why solve solutions such as parsing output for relevant data when you can talk to an API directly and get native types. Since the existing test harness automation in version-control-tools was written in Python, I naturally decided to write some Python to create the bootstrapped images. So, I do a PyPI search and discover docker-py, a Python client library to the Docker Remote API, an HTTP API that the Docker daemon runs and is what the docker CLI tool itself uses to interface with Docker. Good, now I have access to the full power of Docker and am not limited by what the docker CLI may not expose. So, I spent some time looking at source and the Docker Remote API documentation to get an understanding of my new abilities and what I'd need to do. I was pleasantly surprised to learn that the docker CLI is pretty similar to the Remote API and the Python API was similar as well, so the learning was pretty shallow. Yay for catching a break!

Confusion Over Container Stopping

I wrote some Python for building the BMO images, launching the containers, committing the result, and saving state to disk (so it could be consulted later - preventing a bootstrap by subsequent consumers). This was pleasantly smooth at first, but I encountered some bumps along the way. First, I didn't have a complete grasp on the differences between stop and kill. I was seeing some weird behavior by MySQL on startup and didn't know why. Turns out I was forcefully killing the container after bootstrap via the kill API and this was sending a SIGKILL to MySQL, effectively causing unclean shutdown. After some documentation reading, I realized stop is the better API - it issues SIGTERM, waits for a grace period, then issues SIGKILL. Issuing SIGTERM made MySQL shut down gracefully and this issue stemming from my ignorance was resolved. (If anyone from Docker is reading, I think the help output for docker kill should mention the forcefullness of the command versus stop. Not all of us remember the relative forcefullness of the POSIX signals and having documentation reinforce their cryptic meaning could help people select the proper command.) A few lines of Python later and I was talking directly to the Docker Remote API, doing everything I needed to do to save (commit in Docker parlance) a bootstrapped BMO environment for re-use among multiple tests.

It was pretty easy to hook the bootstrapped images up to a single test. Just load the bootstrapped image IDs from the config file and start new containers based on them. That's Docker 101 (except I was using Python to do everything).

Concurrent Execution Confuses Bugzilla

Now that I could start Dockerized BMO from a single test, it was time to make things work concurrently. I hooked Docker up to a few tests and launched them in parallel to see what would happen. The containers appeared to start just fine! Great anticipation on my part to design for concurrency from the beginning, I thought. It appeared I was nearly done. Victory was near. So, I changed some tests to actually interact with BMO running from Docker. (Up until this point I was merely starting containers, not doing anything with them.) Immediately I see errors. Cannot connect to Bugzilla http://... connection refused. Huh? It took a few moments, but I realized the experience I had with MySQL starting and this error were very similar. I changed my start BMO containers code to wait for the HTTP server's TCP socket to start accepting connections before returning control and sure enough, I was able to make HTTP requests against Bugzilla running in Docker! Woo!

Next step, make an authenticated query against Bugzilla running in Docker. HTTP request completes... with an internal server error. What?! I successfully browsed BMO from containers days before and was able to log in just fine - this shouldn't be happening. This problem took me ages to diagnose. I traced every step of provisioning and couldn't figure out what was going on. After resorting to print debugging in nearly every component, including Bugzilla's Perl code itself, I found the culprit: Bugzilla wasn't liking the dynamic nature of the MySQL and HTTP endpoints. You see, when you start Docker containers, network addresses change. The IP address assigned to the container is whatever is available to Docker at the time the container was started. Likewise the IP address and port number of linked services can change. So, your container entrypoint has to deal with this dynamic nature of addresses. For example, if you have a configuration file, you need to update that configuration file on every run with the proper network address info. My Bugzilla entrypoint script was doing this. Or so I thought. It turns out that Bugzilla's bootstrap process has multiple config files. There's an answers file that provides static answers to questions asked when running the bootstrap script (checksetup.pl). checksetup.pl will produce a localconfig file (actually a Perl script) containing all that data. There's also a data/params file containing yet more configuration options. And, the way I was running bootstrap, checksetup.pl refused to update files with new values. I initially had the entrypoint script updating only the answers file and running checksetup.pl, thinking checksetup.pl would update localconfig if the answers change. Nope! checksetup.pl only appears to update localconfig if localconfig is missing a value. So, here my entrypoint script was, successully calling checksetup.pl with the proper network values, which checksetup.pl was more than happy to use. But when I started the web application, it used the old values from localconfig and data/params and blew up. Derp. So, to have dynamic MySQL hosts and ports and a dynamic self-referential HTTP URL, I needed to manually update localconfig and data/params during the entrypoint script. The entrypoint script now rewrites Perl scripts during container load to reflect appropriate variables. Oy.

Resource Constraints

At some point I got working BMO containers running concurrently from multiple tests. This was a huge milestone. But it only revealed my next problem: resource constraints. The running containers were consuming gobs of memory and I couldn't run more than 2 or 3 tests concurrently before running out of memory. Before, I was able to run 8 tests concurrently no problem. Well crap, I just slowed down the test harness significantly by reducing concurrency. No bueno.

Some quick investigation revealed the culprit was MySQL and Apache being greedier than they needed to be. MySQL was consuming 1GB RSS on start. Apache was something like 350 MB. It had been a while since I ran a MySQL server, so I had to scour the net for settings to put MySQL on a diet. The results were not promising. I knew enough about MySQL to know that the answers I found had similar quality to comments on the php.net function documentation circa 2004 (it was not uncommon to see things like SQL injection in the MySQL pages back then - who knows, maybe that's still the case). Anyway, a little tuning later and I was able to get MySQL using a few hundred MB RAM and I reduced the Apache worker pool to something reasonable (maybe 2) to free up enough memory to be able to run tests with the desired concurrency again. If using Docker as part of testing ever takes off, I imagine there will be two flavors of every container: low memory and regular. I'm not running a production service here: I'll happily trade memory for high-end performance as long as it doesn't impact my tests too much.

Caching, Invalidating, and Garbage Collecting Bootstrapped Images

As part of iterating on making BMO bootstrap work, I encountered another problem: knowing when to perform a bootstrap. As mentioned earlier, bootstrap was slow: 25 to 30 seconds. While I had reduced the cost of bootstrap to at most once per test suite execution (as opposed to once per test), there was still the potential for a painful 25-30s delay when running tests. Unacceptable! Furthermore, when I changed how bootstrap worked, I needed a way to invalidate the previous bootstrapped image. Otherwise, we may use an outdated bootstrapped image that doesn't represent the environment it needs to and test execution would fail. How should I do this?

Docker has considered this problem and they have a solution: build context. When you do a docker build, Docker takes all the files from the directory containing the Dockerfile and makes them available to the environment doing the building. If you ADD one of these files in your Dockerfile, the image ID will change if the file changes, invalidating the cache used by Docker to build images. So, if I ADDed the scripts that perform BMO bootstrap to my Docker images, Docker would automagically invalidate the built images and force a bootstrap for me. Nice! Unfortunately, docker build doesn't allow you to add files outside of the current directory to the build context. Internet sleuthing reveals the solution here is to copy things to a temporary directory and run docker build from that. Seriously? Fortunately, I was using the Docker API directly via Python. And that API simply takes an archive of files. And since you can create archives dynamically inside Python using e.g. tarfile, it wasn't too difficult to build proper custom context archives that contained my extra data that could be used to invalidate bootstrapped images. I threw some simple ADD directives into my Dockerfiles and now I got bootstrapped image invalidation!

To avoid having to perform bootstrap on every test run, I needed a mapping between the base images and the bootstrapped result. I ended up storing this in a simple JSON file. I realize now I could have queried Docker for images having the base image as its parent since there is supposed to be a 1:1 relationship between them. I may do this as a follow-up.

With the look-up table in place, ensuring bootstrapped images were current involved doing a couple docker builds, finding the bootstrapped images from those base images, and doing the bootstrap if necessary. If everything is up-to-date, docker build finishes quickly and we have less than 1s of delay. Very acceptable. If things aren't current, well, there's not much you can do there if accuracy is important. I was happy with where I was.

Once I started producing bootstrapped images every time the code impacting the generation of that image changed, I ran into a new problem: garbage collection. All those old bootstrapped images were piling up inside of Docker! I needed a way to prune them. Docker has support for associating a repository and a tag with images. Great, I thought, I'll just associate all images with a well-defined repository, leave the tag blank (because it isn't really relevant), and garbage collection will iterate over all images in to-be-pruned repositories and delete all but the most recent one. Of course, this simple solution did not work. As far as I can tell, Docker doesn't really let you have multiple untagged images. You can set a repository with no tag and Docker will automagically assign the latest tag to that image. But the second you create a new image in that repository, the original image loses that repository association. I'm not sure if this is by design or a bug, but it feels wrong to me. I want the ability to associate tags with images (and containers) so I can easily find all entities in a logical set. It seemed to me that repository facilitated that need (albeit with the restriction of only associating 1 identifier per image). My solution here was to assign type 1 UUIDs to the tag field for each image. This forced Docker to retain the repository association when new images were created. I chose type 1 UUIDs so I can later extract the time component embedded within and do time-based garbage collection e.g. delete all images created more than a week ago.

Making Things Work in Jenkins/Ubuntu

At about this point, I figured things were working well enough on my boot2docker machine that it was time to update the Jenkins virtual machine / Vagrant configuration to run Docker. So, I hacked up the provisioner to install the docker.io package and tried to run things. First, I had to update code that talks to Docker to know where Docker is in an Ubuntu VM. Before, I was keying things off DOCKER_HOST, which I guess is used by the docker CLI and boot2docker reminds you to set. Easy enough. When I finally got things talking to Docker, my scripts threw a cryptic error when talking to Docker. Huh? This worked in boot2docker! When in doubt, always check your package versions. Sure enough, Ubuntu was installing an old Docker version. I added the Docker Apt repo to the Vagrant provisioner and tried again. Bingo - working Docker in an Ubuntu VM!

Choice of storage engines

I started building the BMO Docker images quickly noticed something: building images was horribly slow. Specifically, the part where new images are committed was taking seemingly forever. 5 to 8 seconds or something. Ugh. This wouldn't really bother me except due to subsequent issues, I found myself changing images enough as part of debugging that image building latency became a huge time sink. I felt I was spending more time waiting for layers to commit than making progress. So, I I decided to do something about it. I remembered glancing at an overview of storage options in Docker the week or two prior. I instinctively pinned the difference on different storage drivers between boot2docker and Ubuntu. Sure enough, boot2docker was using aufs and Ubuntu was using devicemapper. OK, now I identified a potential culprit. Time to test the theory. A few paragraphs into that blog post, I see a sorted list of storage driver priorities. I see aufs first, btrfs second, and devicemapper third. I know aufs has kernel inclusion issues (plus a nasty data leakage bug). I don't want that. devicemapper is slow. I figured the list is ordered for a reason and just attempted to use btrfs without really reading the article. Sure enough, btrfs is much faster at committing images than devicemapper. And, it isn't aufs. While images inside btrfs are building, I glance over the article and come to the conclusion that btrfs is in fact good enough for me.

So now I'm running Docker on btrfs on Ubuntu and Docker on aufs in boot2docker. Hopefully that will be the last noticable difference between host environments. After all, Docker is supposed to abstract all this away, right? I wish.

The Mystery of Inconsistent State

It was here that I experienced the most baffling, mind bending puzzle yet. As I was trying to get things working on the Jenkins/Ubuntu VM - things that had already been proved out in boot2docker - I was running into inexplicable issues during creation of the bootstrapped BMO containers. It seemed that my bootstrapped containers were somehow missing data. It appeared as if bootstrap had completed but data written during bootstrap failed to write. You start the committed/bootstrapped image and bootstrap had obviously completed partially, but it appeared to have never finished. Same Docker version. Same images. Same build scripts. Only the host environment was different. Ummmm, Bueller?

This problem had me totally and completely flabbergasted. My brain turned to mush exhausting possibilities. My initial instinct was this was a filesystem buffering problem. Different storage driver (btrfs vs aufs) means different semantics in how data is flushed, right? I once again started littering code with print statements to record the presence or non-presence of files and content therein. MySQL wasn't seeing all its data, so I double and triple check I'm shutting down MySQL correctly. Who knows, maybe one of the options I used to trim the fat from MySQL removed some of the safety from writing data and unclean shutdown is causing MySQL to lose data?

While I was investigating this problem, I noticed an additional oddity: I was having trouble getting reliable debug output from running containers (docker log -f). It seemed as if I was losing log events. I could tell from the state of a container that something happened, but I was seeing no evidence from docker logs -f that that thing actually happened. Weird! On a hunch, a threw some sys.stdout.flush() calls in my Python scripts, and sure enough my missing output started arriving! Pipe buffering strikes again. So, now we have dirty hacks in all the Python scripts related to Docker to unbuffer stdout to prevent data loss. Don't ask how much time was wasted tracking down bad theories due to stdout output being buffered.

Getting back to the problem at hand, I still hand Docker containers seemingly lose data. And it was only happening when Ubuntu/btrfs was the host environment for Docker. I eventually exhausted all leads in my filesystem wasn't flushed theory. At some point, I compared the logs of docker logs -f between boot2docker and Ubuntu and eventually noticed that the bmoweb container in Ubuntu wasn't printing as much. This wasn't obvious at first because the output from bootstrap on Ubuntu looked fine. Besides, the script that waits for bootstrap to complete waits for the Apache HTTP TCP socket to come alive before it gracefully stops the container and snapshots the bootstrapped result: bootstrap must be completing, ignore what docker logs -f says.

Eventually I hit an impasse and resort to context dumping everything on IRC. Ben Kero is around and he picks up on something almost immediately. He simply says ... systemd?. I knew almost instantly what he was referring to and knew the theory fit the facts. Do you know what I was doing wrong?

I still don't know what and quite frankly I don't care, but something in my Ubuntu host environment had a trigger on the TCP port the HTTP server would be listening on. Remember, I was detecting bootstrap completion by waiting until a TCP socket could be opened to the HTTP server. As soon as that connection was established, we stopped the containers gracefully and took a snapshot of the bootstrapped result. Except on Ubuntu something was accepting that socket open, giving a false positive to my wait code, and triggering early shutdown. Docker issued the signal to stop the container gracefully, but it wasn't finished bootstrapping yet, so it forcefully killed the container, resulting in bootstrap being in a remarkably-consistent-across-runs inconsistent state. Changing the code from wait on TCP socket to wait for valid HTTP response fixed the problem. And just for good measure, I changed the code waiting on the MySQL server to also try to establish an actual connection to the MySQL application layer, not merely a TCP socket.

After solving this mystery, I thought there's no way I could be so blind as to not see the container receiving the stop signal during bootstrap. So, I changed things back to prove to myself I wasn't on crack. No matter how hard I tried, I could not get the logs to show that the signal was received. I think what was happening was that my script was starting the container and issuing the graceful stop so quickly that it wasn't captured by log clients. Sure enough, adding some sleeps in the proper places made it possible to catch the events in action. In hindsight, I suppose I could have used docker events to shed some light on this as well. If Docker persisted logs/output from containers and allowed me to scroll back in time, I think this would have saved me. Although, there's a chance my entrypoint script wouldn't have informed me about the received signal. Perhaps checksetup.pl was ignoring it? What I really need is a unified event + log stream from Docker containers so I can debug exactly what's going on.

Everything is Working, Right?

After solving the inconsistent bootstrap state problem, things were looking pretty good. I had BMO bootstrapping and running from tests on both boot2docker and Ubuntu hosts. Tests were seeing completely independent environments and there were no race conditions. I was nearly done.

So, I started porting more and more tests to Docker. I started running tests more and more. Things worked. Most of the time. But I'm still frustrated by periodic apparent bugs in Docker. For example, our containers periodically fail to shut down. Our images periodically fail to delete.

During container shutdown and delete at the end of tests, we periodically see error messagess like the following:

docker.errors.APIError: 500 Server Error: Internal Server Error ("Cannot destroy container f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: Driver aufs failed to remove root filesystem f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255: rename /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255 /mnt/sda1/var/lib/docker/aufs/mnt/f13828df94c9d295bfe24b69ac02377a757edcf948a3355cf7bc16ff2de84255-removing: device or resource busy")
500 Server Error: Internal Server Error ("Cannot destroy container 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: Unable to remove filesystem for 7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: remove /mnt/sda1/var/lib/docker/containers/7e87e5950501734b2a1c02705e9c19f65357a15bad605d8168452aa564d63786: directory not empty")

Due to the way we're executing tests (Mercurial's .t test format), this causes the test's output to change and the test to fail. Sadness.

I think these errors are limited to boot2docker/aufs. But we haven't executed enough test runs in the Jenkins/Ubuntu/btrfs VM yet to be sure. This definitely smells like a bug in Docker and it is very annoying.

Conclusion

After much wrangling and going deeper in a rabbit hole than I ever felt was possible, I finally managed to get BMO running inside Docker as part of our test infrastructure. We're now building tests for complicated components that touch Mercurial, Review Board, and Bugzilla and people are generally happy with how things work.

There are still a handful of bugs, workarounds, and components that aren't as optimal as I would like them to be. But you can't always have perfection.

My takeaway from this ordeal is that Docker still has a number of bugs and user experience issues to resolve. I really want to support Docker and to see it succeed. But every time I try doing something non-trivial with Docker, I get bit hard. Yes, some of the issues I experienced were due to my own ignorance. But at the same time, if one of Docker's mantras is about simplicity and usability, then should there be such gaping cracks for people like me to fall through?

In the end, the promise of Docker fits my use case almost perfectly. I know the architecture is a good fit for testing. We will likely stick with Docker, especially now that I've spent the time to make it work. I really wish this project would have taken a few days, not a few weeks.

Read and Post Comments

Robustly Testing Version Control at Mozilla

October 14, 2014 at 12:00 PM | categories: Mercurial, Mozilla | View Comments

Version control services and interaction with them play an important role at any company. Despite version control being a critical part of your infrastructure, my experience from working at a few companies and talking with others is that version control often doesn't get the testing love that other services do. Hooks get written, spot-tested by the author, and deployed. Tools that interact with version control often rely on behavior that may or may not change over time, especially when the version of your version control software is upgraded.

We've seen this pattern at Mozilla. Mercurial hooks and extensions were written and deployed to the server without test coverage. As a result, things break when we try to upgrade the server. This happens a few times and you naturally develop an attitude of fear, uncertainty, and doubt around touching anything on the server (or the clients for that matter). If it isn't broken, why fix it prevails for months or years. Then one an enthusiastic individual comes around wanting to deploy some hot new functionality. You tell them the path is arduous because the server is running antiquated versions of software and nothing is tested. The individual realizes the amazing change isn't worth the effort and justifiably throws up their hands and gives up. This is almost a textbook definition of how not having test coverage can result in technical debt. This is the position Mozilla is trying to recover from.

One of the biggest impacts I've had since joining the Developer Services Team at Mozilla a little over a month ago has been changing the story about how we test version control at Mozilla.

I'm proud to say that Mozilla now has a robust enough testing infrastructure in place around our Mercurial server that we're feeling pretty good about silencing the doubters when it comes to changing server behavior. Here's how we did it.

The genesis of this project was likely me getting involved with the hg-git and Mercurial projects. For hg-git, I learned a bit about Mercurial internals and how extensions work. When I looked at Mercurial extensions and hooks used by Mozilla, I started to realize what parts were good and what parts were bad. I realized what parts would likely break after upgrades. When I started contributing patches to Mercurial itself, I took notice of how Mercurial is tested. When I discovered T Tests, I thought, wow, that's pretty cool: we should use them to test Mozilla's Mercurial customizations!

After some frustrations with Mercurial extensions breaking after Mercurial upgrades, I wanted to do something about it to prevent this from happening again. I'm a huge fan of unified repositories. So earlier this year, I reached out to the various parties who maintain all the different components and convinced nearly everyone that establishing a single repository for all the version control code was a good idea. The version-control-tools repository was born. Things were slow at first. It was initially pretty much my playground for hosting Mercurial extensions that I authored. Fast forward a few months, and the version-control-tools repository now contains full history imports of our Mercurial hooks that are deployed on hg.mozilla.org, the templates used to render HTML on hg.mozilla.org, and pretty much every Mercurial extension authored by Mozillians, including pushlog. Having all the code in one repository has been very useful. It has simplified server deployments: we now pull 1 repository instead of 3. If there is a dependency between different components, we can do the update atomically. These are all benefits of using a single repository instead of N>1.

While version-control-tools was still pretty much my personal playground, I introduced a short script for running tests. It was pretty basic: just find test files and invoke them with Mercurial's test harness. It served my needs pretty well. Over time, as more and more functionality was rolled into version-control-tools, we expanded the scope of the test harness.

We can now run Python unit tests (in addition to Mercurial .t tests). Test all of the things!

We set up continuous integration with Jenkins so tests run after check-in and alert us when things fail.

We added code coverage so we can see what is and isn't being tested. Using code coverage data, we've identified a server upgrade bug before it happens. We're also using the data to ensure that code is tested as thoroughly as it needs to be. The code coverage data has been invaluable at assessing the quality of our tests. I'm still shocked that Firefox developers tolerate not having JavaScript code coverage when developing Firefox features. (I'm not saying code coverage is perfect, merely that it is a valuable tool in your arsenal.)

We added support for running tests against multiple versions of Mercurial. We even test the bleeding edge of Mercurial so we know when an upstream Mercurial change breaks our code. So, no more surprises on Mercurial release day. I can tell you today that we have a handful of extensions that are broken in Mercurial 3.2, due for release around November 1. (Hopefully we'll fix them before release.)

We have Vagrant configurations so you can start a virtual machine that runs the tests the same way Jenkins does.

The latest addition to the test harness is the ability to spin up Docker containers as part of tests. Right now, this is limited to running Bugzilla during tests. But I imagine the scope will only increase over time.

Before I go on, I want to quickly explain how amazing Mercurial's .t tests are. These are a flavor of tests used by Mercurial and the dominant form of new tests added to the version-control-tools repository. These tests are glorified shell scripts annotated with expected command output and other metadata. It might be easier to explain by showing. Take bzpost's tests as an example. The bzpost extension automatically posts commit URLs to Bugzilla during push. Read more if you are interested. What I like so much about .t tests is that they are actually testing the user experience. The test actually runs hg push and verifies the output is exactly what is intended. Furthermore, since we're running a Dockerized Bugzilla server during the test, we're able to verify that the bzpost extension actually resulted in Bugzilla comments being added to the appropriate bug(s). Contrast this with unit tests that only test a subset of functionality. Or, contrast with writing a lot of boilerplate and often hard-to-read code that invokes processes and uses regular expressions, etc to compare output. I find .t tests are more concise and they do a better job of testing user experience. More than once I've written a .t test and thought this user experience doesn't feel right, I should change the behavior to be more user friendly. This happened because I was writing actual end-user commands as part of writing tests and seeing the exact output the user would see. It is much harder to attain this sense of understanding when writing unit tests. I can name a few projects with poor command line interfaces that could benefit from this approach... I'm not saying .t tests are perfect or that they should replace other testing methodologies such as unit tests. I just think they are very useful for accurately testing higher-level functionality and for assessing user experience. I really wish we had these tests for mach commands...

Anyway, with a proper testing harness in place for our version control code, we've been pretty good about ensuring new code is properly tested. When people submit new hooks or patches to existing hooks, we can push back and refuse to grant review unless tests are included. When someone requests a new deployment to the server, we can look at what changed, cross-reference to test coverage, and assess the riskiness of the deployment. We're getting to the point where we just trust our tests and server deployments are minor events. Concerns over accidental regressions due to server changes are waning. We can tell people if you really care about this not breaking, you need a test and if you add a test, we'll support it for you. People are often more than happy to write tests to ensure them peace of mind, especially when that test's presence shifts maintenance responsibility away from them. We're happy because we don't have many surprises (and fire drills) at deployment time. It's a win-win!

So, what's next? Good question! We still have a number of large gaps in our test coverage. Our code to synchronize repositories from the master server to read-only slaves is likely the most critical omission. We also don't yet have a good way of reproducing our server environment. Ideally, we'd run the continuous integration in an environment that's very similar to production. Same package versions and everything. This would also allow us to simulate the actual hg.mozilla.org server topology during tests. Currently, our tests are more unit-style than integration-style. We rely on the consistent behavior of Mercurial and other tools as sufficient proxies for test accuracy and we back those up with running the tests on the staging server before production deployment. But these aren't a substitute for an accurate reproduction of the production servers, especially when it comes to things like the replication tests. We'll get there some day. I also have plans to improve Mercurial's test harness to better facilitate some of our advanced use cases. I would absolutely love to make Mercurial's .t test harness more consumable outside the context of Mercurial. (cram is one such attempt at this.) We also need to incorporate the Git server code into this repository. Currently, I'm pretty sure everything Git at Mozilla is untested. Challenge accepted!

In summary, our story for testing version control at Mozilla has gone from a cobbled together mess to something cohesive and comprehensive. This has given us confidence to move fast without breaking things. I think the weeks of people time invested into improving the state of testing was well spent and will pay enormous dividends going forward. Looking back, the mountain of technical debt now looks like a mole hill. I feel good knowing that I played a part in making this change.

Read and Post Comments

« Previous Page -- Next Page »