Benefits of Clone Offload on Version Control Hosting

July 27, 2018 at 03:48 PM | categories: Mercurial, Mozilla | View Comments

Back in 2015, I implemented a feature in Mercurial 3.6 that allows servers to advertise URLs of pre-generated bundle files. When a compatible client performs a hg clone against a repository leveraging this feature, it downloads and applies the bundle from a URL then goes back to the server and performs the equivalent of an hg pull to obtain the changes to the repository made after the bundle was generated.

On, we've been using this feature since 2015. We host bundles in Amazon S3 and make them available via the CloudFront CDN. We perform IP filtering on the server so clients connecting from AWS IPs are served S3 URLs corresponding to the closest region / S3 bucket where bundles are hosted. Most Firefox build and test automation is run out of EC2 and automatically clones high-volume repositories from an S3 bucket hosted in the same AWS region. (Doing an intra-region transfer is very fast and clones can run at >50 MB/s.) Everyone else clones from a CDN. See our official docs for more.

I last reported on this feature in October 2015. Since then, Bitbucket also deployed this feature in early 2017.

I was reminded of this clone bundles feature this week when posted Best way to do linux clones for your CI and that post was making the rounds in my version control circles. tl;dr apparently suffers high load due to high clone volume against the Linux Git repository and since Git doesn't have an equivalent feature to clone bundles built in to Git itself, they are asking people to perform equivalent functionality to mitigate server load.

(A clone bundles feature has been discussed on the Git mailing list before. I remember finding old discussions when I was doing research for Mercurial's feature in 2015. I'm sure the topic has come up since.)

Anyway, I thought I'd provide an update on just how valuable the clone bundles feature is to Mozilla. In doing so, I hope maintainers of other version control tools see the obvious benefits and consider adopting the feature sooner.

In a typical week, is currently serving ~135 TB of data. The overwhelming majority of this data is related to the Mercurial wire protocol (i.e. not HTML / JSON served from the web interface). Of that ~135 TB, ~5 TB is served from the CDN, ~126 TB is served from S3, and ~4 TB is served from the Mercurial servers themselves. In other words, we're offloading ~97% of bytes served from the Mercurial servers to S3 and the CDN.

If we assume this offloaded ~131 TB is equally distributed throughout the week, this comes out to ~1,732 Mbps on average. In reality, we do most of our load from California's Sunday evenings to early Friday evenings. And load is typically concentrated in the 12 hours when the sun is over Europe and North America (where most of Mozilla's employees are based). So the typical throughput we are offloading is more than 2 Gbps. And at a lower level, automation tends to perform clones soon after a push is made. So load fluctuates significantly throughout the day, corresponding to when pushes are made.

By volume, most of the data being offloaded is for the mozilla-unified Firefox repository. Without clone bundles and without the special stream clone Mercurial feature (which we also leverage via clone bundles), the servers would be generating and sending ~1,588 MB of zstandard level 3 compressed data for each clone of that repository. Each clone would consume ~280s of CPU time on the server. And at ~195,000 clones per month, that would come out to ~309 TB/mo or ~72 TB/week. In CPU time, that would be ~54.6 million CPU-seconds, or ~21 CPU-months. I will leave it as an exercise to the reader to attach a dollar cost to how much it would take to operate this service without clone bundles. But I will say the total AWS bill for our S3 and CDN hosting for this service is under $50 per month. (It is worth noting that intra-region data transfer from S3 to other AWS services is free. And we are definitely taking advantage of that.)

Despite a significant increase in the size of the Firefox repository and clone volume of it since 2015, our servers are still performing less work (in terms of bytes transferred and CPU seconds consumed) than they were in 2015. The ~97% of bytes and millions of CPU seconds offloaded in any given week have given us a lot of breathing room and have saved Mozilla several thousand dollars in hosting costs. The feature has likely helped us avoid many operational incidents due to high server load. It has made Firefox automation faster and more reliable.

Succinctly, Mercurial's clone bundles feature has successfully and largely effortlessly offloaded a ton of load from the Mercurial servers. Other version control tools should implement this feature because it is a game changer for server operators and results in a better client-side experience (eliminates server-side CPU bottleneck and may eliminate network bottleneck due to a geo-local CDN typically being as fast as your Internet pipe). It's a win-win. And a massive win if you are operating at scale.

Read and Post Comments

Deterministic Firefox Builds

June 20, 2018 at 11:10 AM | categories: Mozilla | View Comments

As of Firefox 60, the build environment for official Firefox Linux builds switched from CentOS to Debian.

As part of the transition, we overhauled how the build environment for Firefox is constructed. We now populate the environment from deterministic package snapshots and are much more stringent about dependencies and operations being deterministic and reproducible. The end result is that the build environment for Firefox is deterministic enough to enable Firefox itself to be built deterministically.

Changing the underlying operating system environment used for builds was a risky change. Differences in the resulting build could result in new bugs or some users not being able to run the official builds. We figured a good way to mitigate that risk was to make the old and new builds as bit-identical as possible. After all, if the environments produce the same bits, then nothing has effectively changed and there should be no new risk for end-users.

Employing the diffoscope tool, we identified areas where Firefox builds weren't deterministic in the same environment and where there was variance across build environments. We iterated on differences and changed systems so variance would no longer occur. By the end of the process, we had bit-identical Firefox builds across environments.

So, as of Firefox 60, Firefox builds on Linux are deterministic in our official build environment!

That being said, the builds we ship to users are using PGO. And an end-to-end build involving PGO is intrinsically not deterministic because it relies on timing data that varies from one run to the next. And we don't yet have continuous automated end-to-end testing that determinism holds. But the underlying infrastructure to support deterministic and reproducible Firefox builds is there and is not going away. I think that's a milestone worth celebrating.

This milestone required the effort of many people, often working indirectly toward it. Debian's reproducible builds effort gave us an operating system that provided deterministic and reproducible guarantees. Switching Firefox CI to Taskcluster enabled us to switch to Debian relatively easily. Many were involved with non-determinism fixes in Firefox over the years. But Mike Hommey drove the transition of the build environment to Debian and he deserves recognition for his individual contribution. Thanks to all these efforts - and especially Mike Hommey's - we can now say Firefox builds deterministically!

The fx-reproducible-build bug tracks ongoing efforts to further improve the reproducibility story of Firefox. (~300 bugs in its dependency tree have already been resolved!)

Read and Post Comments

Scaling Firefox Development Workflows

May 16, 2018 at 04:10 PM | categories: Mozilla | View Comments

One of the central themes of my time at Mozilla has been my pursuit of making it easier to contribute to and hack on Firefox.

I vividly remember my first day at Mozilla in 2011 when I went to build Firefox for the first time. I thought the entire experience - from obtaining the source code, installing build dependencies, building, running tests, submitting patches for review, etc was quite... lacking. When I asked others if they thought this was an issue, many rightfully identified problems (like the build system being slow). But there was a significant population who seemed to be naive and/or apathetic to the breadth of the user experience shortcomings. This is totally understandable: the scope of the problem is immense and various people don't have the perspective, are blinded/biased by personal experience, and/or don't have the product design or UX experience necessary to comprehend the problem.

When it comes to contributing to Firefox, I think the problems have as much to do with user experience (UX) as they do with technical matters. As I wrote in 2012, user experience matters and developers are people too. You can have a technically superior product, but if the UX is bad, you will have a hard time attracting and retaining new users. And existing users won't be as happy. These are the kinds of problems that a product manager or designer deals with. A difference is that in the case of Firefox development, the target audience is a very narrow and highly technically-minded subset of the larger population - much smaller than what your typical product targets. The total addressable population is (realistically) in the thousands instead of millions. But this doesn't mean you ignore the principles of good product design when designing developer tooling. When it comes to developer tooling and workflows, I think it is important to have a product manager mindset and treat it not as a collection of tools for technically-minded individuals, but as a product having an overall experience. You only have to look as far as the Firefox Developer Tools to see this approach applied and the positive results it has achieved.

Historically, Mozilla has lacked a formal team with the domain expertise and mandate to treat Firefox contribution as a product. We didn't have anything close to this until a few years ago. Before we had such a team, I took on some of these problems individually. In 2012, I wrote mach - a generic CLI command dispatch tool - to provide a central, convenient, and easy-to-use command to discover development actions and to run them. (Read the announcement blog post for some historical context.) I also introduced a one-line bootstrap tool (now mach bootstrap) to make it easier to configure your machine for building Firefox. A few months later, I was responsible for introducing files, which paved the way for countless optimizations and for rearchitecting the Firefox build system to use modern tools - a project that is still ongoing (digging out from ~two decades of technical debt is a massive effort). And a few months after that, I started going down the version control rabbit hole and improving matters there. And I was also heavily involved with MozReview for improving the code review experience.

Looking back, I was responsible for and participated in a ton of foundational changes to how Firefox is developed. Of course, dozens of others have contributed to getting us to where we are today and I can't and won't take credit for the hard work of others. Nor will I claim I was the only person coming up with good ideas or transforming them into reality. I can name several projects (like Taskcluster and Treeherder) that have been just as or more transformational than the changes I can take credit for. It would be vain and naive of me to elevate my contributions on a taller pedestal and I hope nobody reads this and thinks I'm doing that.

(On a personal note, numerous people have told me that things like mach and the bootstrap tool have transformed the Firefox contribution experience for the better. I've also had very senior people tell me that they don't understand why these tools are important and/or are skeptical of the need for investments in this space. I've found this dichotomy perplexing and troubling. Because some of the detractors (for lack of a better word) are highly influential and respected, their apparent skepticism sews seeds of doubt and causes me to second guess my contributions and world view. This feels like a form or variation of imposter syndrome and it is something I have struggled with during my time at Mozilla.)

From my perspective, the previous five or so years in Firefox development workflows has been about initiating foundational changes and executing on them. When it was introduced, mach was radical. It didn't do much and its use was optional. Now almost everyone uses it. Similar stories have unfolded for Taskcluster, MozReview, and various other tools and platforms. In other words, we laid a foundation and have been steadily building upon it for the past several years. That's not to say other foundational changes haven't occurred since (they have - the imminent switch to Phabricator is a terrific example). But the volume of foundational changes has slowed since 2012-2014. (I think this is due to Mozilla deciding to invest more in tools as a result of growing pains from significant company expansion that began in 2010. With that investment, we invested in the bigger ticket long-standing workflow pain points, such as CI (Taskcluster), the Firefox build system, Treeherder, and code review.)

Workflows Today and in the Future

Over the past several years, the size, scope, and complexity of Firefox development activities has increased.

One way to see this is at the source code level. The following chart shows the size of the mozilla-central version control repository over time.

mozilla-central size over time

The size increases are obvious. The increases cumulatively represent new features, technologies, and workflows. For example, the repository contains thousands of Web Platform Tests (WPT) files, a shared test suite for web platform implementations, like Gecko and Blink. WPT didn't exist a few years ago. Now we have files under source control, tools for running those tests, and workflows revolving around changing those tests. The incorporation of Rust and components of Servo into Firefox is also responsible for significant changes. Firefox features such as Developer Tools have been introduced or ballooned in size in recent years. The Go Faster project and the move to system add-ons has introduced various new workflows and challenges for testing Firefox.

Many of these changes are building upon the user-facing foundational workflow infrastructure that was last significantly changed in 2012-2014. This has definitely contributed to some growing pains. For example, there are now 92 mach commands instead of like 5. mach help - intended to answer what can I do and how should I do it - is overwhelming, especially to new users. The repository is around 2 gigabytes of data to clone instead of around 500 megabytes. We have 240,000 files in a full checkout instead of 70,000 files. There's a ton of new pieces floating around. Any product manager tasked with user acquisition and retention will tell you that increasing the barrier to entry and use will jeopardize these outcomes. But with the growth of Firefox's technical underbelly in the previous years, we've made it harder to contribute by requiring users to download and see a lot more files (version control) and be overwhelmed by all the options for actions to take (mach having 92 commands). And as the sheer number of components constituting Firefox increases, it becomes harder and harder for everyone - not just new contributors - to reason about how everything fits together.

I've been framing this general problem as scaling Firefox development workflows and every time I think about the high-level challenges facing Firefox contribution today and in the years ahead, this problem floats to the top of my list of concerns. Yes, we have pressing issues like improving the code review experience and making the Firefox build system and Taskcluster-based CI fast, efficient, and reliable. But even if you make these individual pieces great, there is still a cross-domain problem of how all these components weave together. This is why I think it is important to take a wholistic view and treat developer workflow as a product.

When I look at this the way a product manager or designer would, I see a few fundamental problems that need addressing.

First, we're not optimizing for comprehensive end-to-end workflows. By and large, we're designing our tools in isolation. We focus more on maximizing the individual components instead of maximizing the interaction between them. For example, Taskcluster and Treeherder are pretty good in isolation. But we're missing features like Treeherder being able to tell me the command to run locally to reproduce a failure: I want to see a failure on Treeherder and be able to copy and paste commands into my terminal to debug the failure. In the case of code review, we've designed two good code review tools (MozReview and Phabricator) but we haven't invested in making submitting code reviews turn key (the initial system configuration is difficult and we still don't have things like automatic bug filing or reviewer selection). We are leaving many workflow optimizations on the table by not implementing thoughtful tie-ins and transitions between various tools.

Second, by-and-large we're still optimizing for a single, monolithic user segment instead of recognizing and optimizing for different users and their workflow requirements. For example, mach help lists 92 commands. I don't think any single person cares about all 92 of those commands. The average person may only care about 10 or even 20. In terms of user interface design, the features and workflow requirements of small user segments are polluting the interface for all users and making the entire experience complicated and difficult to reason about. As a concrete example, why should a system add-on developer or a Firefox Developer Tools developer (these people tend to care about testing a standalone Firefox add-on) care about Gecko's build system or tests? If you aren't touching Gecko or Firefox's chrome code, why should you be exposed to workflows and requirements that don't have a major impact on you? Or something more extreme, if you are developing a standalone Rust module or Python package in mozilla-central, why do you need to care about Firefox at all? (Yes, Firefox or another downstream consumer may care about changes to that standalone component and you can't ignore those dependencies. But it should at least be possible to hide those dependencies.)

Waving my hands, the solution to these problems is to treat Firefox development workflow as a product and to apply the same rigor that we use for actual Firefox product development. Give people with a vision for the entire workflow the ability to prioritize investment across tools and platforms. Give them a process for defining features that work across tools. Perform formal user studies. See how people are actually using the tools you build. Bring in design and user experience experts to help formulate better workflows. Perform user typing so different, segmentable workflows can be optimized for. Treat developers as you treat users of real products: listen to them. Give developers a voice to express frustrations. Let them tell you what they are trying to do and what they wish they could do. Integrate this feedback into a feature roadmap. Turn common feedback into action items for new features.

If you think these ideas are silly and it doesn't make sense to apply a product mindset to developer workflows and tooling, then you should be asking whether product management and all that it entails is also a silly idea. If you believe that aspects of product management have beneficial outcomes (which most companies do because otherwise there wouldn't be product managers), then why wouldn't you want to apply the methods of that discipline to developers and development workflows? Developers are users too and the fact that they work for the same company that is creating the product shouldn't make them immune from the benefits of product management.

If we want to make contributing to Firefox an even better experience for Mozilla employees and community contributors, I think we need to take a step back and assess the situation as a product manager would. The improvements that have been made to the individual pieces constituting Firefox's development workflow during my nearly seven years at Mozilla have been incredible. But I think in order to achieve the next round of major advancements in workflow productivity, we'll need to focus on how all of the pieces fit together. And that requires treating the entire workflow as a cohesive product.

Read and Post Comments

Revisiting Using Docker

May 16, 2018 at 01:45 PM | categories: Docker, Mozilla | View Comments

When Docker was taking off like wildfire in 2013, I was caught up in the excitement like everyone else. I remember knowing of the existence of LXC and container technologies in Linux at the time. But Docker seemed to be the first open source tool to actually make that technology usable (a terrific example of how user experience matters).

At Mozilla, Docker was adopted all around me and by me for various utilities. Taskcluster - Mozilla's task execution framework geared for running complex CI systems - adopted Docker as a mechanism to run processes in self-contained images. Various groups in Mozilla adopted Docker for running services in production. I adopted Docker for integration testing of complex systems.

Having seen various groups use Docker and having spent a lot of time in the trenches battling technical problems, my conclusion is Docker is unsuitable as a general purpose container runtime. Instead, Docker has its niche for hosting complex network services. Other uses of Docker should be highly scrutinized and potentially discouraged.

When Docker hit first the market, it was arguably the only game in town. Using Docker to achieve containerization was defensible because there weren't exactly many (any?) practical alternatives. So if you wanted to use containers, you used Docker.

Fast forward a few years. We now have the Open Container Initiative (OCI). There are specifications describing common container formats. So you can produce a container once and take it to any number OCI compatible container runtimes for execution. And in 2018, there are a ton of players in this space. runc, rkt, and gVisor are just some. So Docker is no longer the only viable tool for executing a container. If you are just getting started with the container space, you would be wise to research the available options and their pros and cons.

When you look at all the options for running containers in 2018, I think it is obvious that Docker - usable though it may be - is not ideal for a significant number of container use cases. If you divide use cases into a spectrum where one end is run a process in a sandbox and the other is run a complex system of orchestrated services in production, Docker appears to be focusing on the latter. Take it from Docker themselves:

Docker is the company driving the container movement and the only container platform provider to address every application across the hybrid cloud. Today's businesses are under pressure to digitally transform but are constrained by existing applications and infrastructure while rationalizing an increasingly diverse portfolio of clouds, datacenters and application architectures. Docker enables true independence between applications and infrastructure and developers and IT ops to unlock their potential and creates a model for better collaboration and innovation.

That description of Docker (the company) does a pretty good job of describing what Docker (the technology) has become: a constellation of software components providing the underbelly for managing complex applications in complex infrastructures. That's pretty far detached on the spectrum from run a process in a sandbox.

Just because Docker (the company) is focused on a complex space doesn't mean they are incapable of exceeding at solving simple problems. However, I believe that in this particular case, the complexity of what Docker (the company) is focusing on has inhibited its Docker products to adequately address simple problems.

Let's dive into some technical specifics.

At its most primitive, Docker is a glorified tool to run a process in a sandbox. On Linux, this is accomplished by using the clone(2) function with specific flags and combined with various other techniques (filesystem remounting, capabilities, cgroups, chroot, seccomp, etc) to sandbox the process from the main operating system environment and kernel. There are a host of tools living at this not-quite-containers level that make it easy to run a sandboxed process. The bubblewrap tool is one of them.

Strictly speaking, you don't need anything fancy to create a process sandbox: just an executable you want to invoke and an executable that makes a set of system calls (like bubblewrap) to run that executable.

When you install Docker on a machine, it starts a daemon running as root. That daemon listens for HTTP requests on a network port and/or UNIX socket. When you run docker run from the command line, that command establishes a connection to the Docker daemon and sends any number of HTTP requests to instruct the daemon to take actions.

A daemon with a remote control protocol is useful. But it shouldn't be the only way to spawn containers with Docker. If all I want to do is spawn a temporary container that is destroyed afterwards, I should be able to do that from a local command without touching a network service. Something like bubblewrap. The daemon adds all kinds of complexity and overhead. Especially if I just want to run a simple, short-lived command.

Docker at this point is already pretty far detached from a tool like bubblewrap. And the disparity gets worse.

Docker adds another abstraction on top of basic process sandboxing in the form of storage / filesystem management. Docker insists that processes execute in self-contained, chroot()'d filesystem environment and that these environments (Docker images) be managed by Docker itself. When Docker images are imported into Docker, Docker manages them using one of a handful of storage drivers. You can choose from devicemapper, overlayfs, zfs, btrfs, and aufs and employ various configurations of all these. Docker images are composed of layers, with one layer stacked on top of the prior. This allows you to have an immutable base layer (that can be shared across containers) where run-time file changes can be isolated to a specific container instance.

Docker's ability to manage storage is pretty cool. And I dare say Docker's killer feature in the very beginning of Docker was the ability to easily produce and exchange self-contained Docker images across machines.

But this utility comes at a very steep price. Again, if our use case is run a process in a sandbox, do we really care about all this advanced storage functionality? Yes, if you are running hundreds of containers on a single system, a storage model built on top of copy-on-write is perhaps necessary for scaling. But for simple cases where you just want to run a single or small number of processes, it is extremely overkill and adds many more problems than it solves.

I cannot stress this enough, but I have spent hours debugging and working around problems due to how filesystems/storage works in Docker.

When Docker was initially released, aufs was widely used. As I previously wrote, aufs has abysmal performance as you scale up the number of concurrent I/O operations. We shaved minutes off tasks in Firefox CI by ditching aufs for overlayfs.

But overlayfs is far from a panacea. File metadata-only updates are apparently very slow in overlayfs. We're talking ~100ms to call fchownat() or utimensat(). If you perform an rsync -a or chown -R on a directory with only just a hundred files that were defined in a base image layer, you can have delays of seconds.

The Docker storage drivers backed by real filesystems like zfs and btrfs are a bit better. But they have their quirks too. For example, creating layers in images is comparatively very slow compared to overlayfs (which are practically instantaneous). This matters when you are iterating on a Dockerfile for example and want to quickly test changes. Your edit-compile cycle grows frustratingly long very quickly.

And I could opine on a handful of other problems I've encountered over the years.

Having spent hours of my life debugging and working around issues with Docker's storage, my current attitude is enough of this complexity, just let me use a directory backed by the local filesystem, dammit.

For many use cases, you don't need the storage complexity that Docker forces upon you. Pointing Docker at a directory on a local filesystem to chroot into is good enough. I know the behavior and performance properties of common Linux filesystems. ext4 isn't going to start making fchownat() or utimensat() calls take ~100ms. It isn't going to complain when a hard link spans multiple layers in an image. Or slow down to a crawl when multiple threads are performing concurrent read I/O. There's not going to be intrinsically complicated algorithms and caching to walk N image layers to find the most recent version of a file (or if there is, it will be so far down the stack in kernel land that I likely won't ever have to deal with it as a normal user). Docker images with their multiple layers add complexity and overhead. For many use uses, the pain it inflicts offsets the initial convenience it saves.

Docker's optimized-for-complex-use-cases architecture demonstrates its inefficiency in simple benchmarks.

On my machine, docker run -i --rm debian:stretch /bin/ls / takes ~850ms. Almost a second to perform a directory listing (sometimes it does take over 1 second - I was being generous and quoting a quicker time). This command takes ~1ms when run in a local shell. So we're at 2.5-3 magnitudes of overhead. The time here does include time to initially create the container and destroy it afterwards. We can isolate that overhead by starting a persistent container and running docker exec -i <cid> /bin/ls / to spawn a new process in an existing container. This takes ~85ms. So, ~2 magnitudes of overhead to spawn a process in a Docker container versus spawning it natively. What's adding so much overhead, I'm not sure. Yes, there are HTTP requests under the hood. But HTTP to a local daemon shouldn't be that slow. I'm not sure what's going on.

If we docker export that image to the local filesystem and use runc state to configure so we can run it with runc, runc run takes ~85ms to run /bin/ls /. If we runc exec <cid> /bin/ls / to start a process in an existing container, that completes in ~10ms. runc appears to be executing these simple tasks ~10x faster than Docker.

But to even get to that point, we had to make a filesystem available to spawn the container in. With Docker, you need to load an image into Docker. Using docker save to produce a 105,523,712 tar file, docker load -i image.tar takes ~1200ms to complete. tar xf image.tar takes ~65ms to extract that image to the local filesystem. Granted, Docker is computing the SHA-256 of the image as part of import. But SHA-256 runs at ~250MB/s on my machine and on that ~105MB input takes ~400ms. Where is that extra ~750ms of overhead in Docker coming from?

The Docker image loading overhead is still present on large images. With a 4,336,605,184 image, docker load was taking ~32s and tar x was taking ~2s. Obviously the filesystem was buffering writes in the tar case. And the ~2s is ignoring the ~17s to obtain the SHA-256 of the entire input. But there's still a substantial disparity here. (I suspect a lot of it is overlayfs not being as optimal as ext4.)

Several years ago there weren't many good choices for tools to execute containers. But today, there are good tools readily available. And thanks to OCI standards, you can often swap in alternate container runtimes. Docker (the tool) has an architecture that is optimized for solving complex use cases (coincidentally use cases that Docker the company makes money from). Because of this, my conclusion - drawn from using Docker for several years - is that Docker is unsuitable for many common use cases. If you care about low container startup/teardown overhead, low latency when interacting with containers (including spawning processes from outside of them), and for workloads where Docker's storage model interferes with understanding or performance, I think Docker should be avoided. A simpler tool (such as runc or even bubblewrap) should be used instead.

Call me a curmudgeon, but having seen all the problems that Docker's complexity causes, I'd rather see my containers resemble a tarball that can easily be chroot()'d into. I will likely be revisiting projects that use Docker and replacing Docker with something lighter weight and architecturally simpler. As for the use of Docker in the more complex environments it seems to be designed for, I don't have a meaningful opinion as I've never really used it in that capacity. But given my negative experiences with Docker over the years, I am definitely biased against Docker and will lean towards simpler products, especially if their storage/filesystem management model is simpler. Docker introduced containers to the masses and they should be commended for that. But for my day-to-day use cases for containers, Docker is simply not the right tool for the job.

I'm not sure exactly what I'll replace Docker with for my simpler use cases. If you have experiences you'd like to share, sharing them in the comments will be greatly appreciated.

Read and Post Comments

Release of python-zstandard 0.9

April 09, 2018 at 09:30 AM | categories: Python, Mozilla | View Comments

I have just released python-zstandard 0.9.0. You can install the latest release by running pip install zstandard==0.9.0.

Zstandard is a highly tunable and therefore flexible compression algorithm with support for modern features such as multi-threaded compression and dictionaries. Its performance is remarkable and if you use it as a drop-in replacement for zlib, bzip2, or other common algorithms, you'll frequently see more than a doubling in performance.

python-zstandard provides rich bindings to the zstandard C library without sacrificing performance, safety, features, or a Pythonic feel. The bindings run on Python 2.7, 3.4, 3.5, 3.6, 3.7 using either a C extension or CFFI bindings, so it works with CPython and PyPy.

I can make a compelling argument that python-zstandard is one of the richest compression packages available to Python programmers. Using it, you will be able to leverage compression in ways you couldn't with other packages (especially those in the standard library) all while achieving ridiculous performance. Due to my focus on performance, python-zstandard is able to outperform Python bindings to other compression libraries that should be faster. This is because python-zstandard is very diligent about minimizing memory allocations and copying, minimizing Python object creation, reusing state, etc.

While python-zstandard is formally marked as a beta-level project and hasn't yet reached a 1.0 release, it is suitable for production usage. python-zstandard 0.8 shipped with Mercurial and is in active production use there. I'm also aware of other consumers using it in production, including at Facebook and Mozilla.

The sections below document some of the new features of python-zstandard 0.9.

File Object Interface for Reading

The 0.9 release contains a stream_reader() API on the compressor and decompressor objects that allows you to treat those objects as readable file objects. This means that you can pass a ZstdCompressor or ZstdDecompressor around to things that accept file objects and things generally just work. e.g.:

   with open(compressed_file, 'rb') as ifh:
       cctx = zstd.ZstdDecompressor()
       with cctx.stream_reader(ifh) as reader:
           while True:
               chunk =
               if not chunk:

This is probably the most requested python-zstandard feature.

While the feature is usable, it isn't complete. Support for readline(), readinto(), and a few other APIs is not yet implemented. In addition, you can't use these reader objects for opening zstandard compressed tarball files because Python's tarfile package insists on doing backward seeks when reading. The current implementation doesn't support backwards seeking because that requires buffering decompressed output and that is not trivial to implement. I recognize that all these features are useful and I will try to work them into a subsequent release of 0.9.

Negative Compression Levels

The 1.3.4 release of zstandard (which python-zstandard 0.9 bundles) supports negative compression levels. I won't go into details, but negative compression levels disable extra compression features and allow you to trade compression ratio for more speed.

When compressing a 6,472,921,921 byte uncompressed bundle of the Firefox Mercurial repository, the previous fastest we could go with level 1 was ~510 MB/s (measured on the input side) yielding a 1,675,227,803 file (25.88% of original).

With level -1, we compress to 1,934,253,955 (29.88% of original) at ~590 MB/s. With level -5, we compress to 2,339,110,873 bytes (36.14% of original) at ~720 MB/s.

On the decompress side, level 1 decompresses at ~1,150 MB/s (measured at the output side), -1 at ~1,320 MB/s, and -5 at ~1,350 MB/s (generally speaking, zstandard's decompression speeds are relatively similar - and fast - across compression levels).

And that's just with a single thread. zstandard supports using multiple threads to compress a single input and python-zstandard makes this feature easy to use. Using 8 threads on my 4+4 core i7-6700K, level 1 compresses at ~2,000 MB/s (3.9x speedup), -1 at ~2,300 MB/s (3.9x speedup), and -5 at ~2,700 MB/s (3.75x speedup).

That's with a large input. What about small inputs?

If we take 456,599 Mercurial commit objects spanning 298,609,254 bytes from the Firefox repository and compress them individually, at level 1 we yield a total of 133,457,198 bytes (44.7% of original) at ~112 MB/s. At level -1, we compress to 161,241,797 bytes (54.0% of original) at ~215 MB/s. And at level -5, we compress to 185,885,545 bytes (62.3% of original) at ~395 MB/s.

On the decompression side, level 1 decompresses at ~260 MB/s, -1 at ~1,000 MB/s, and -5 at ~1,150 MB/s.

Again, that's 456,599 operations on a single thread with Python.

python-zstandard has an experimental API where you can pass in a collection of inputs and it batch compresses or decompresses them in a single operation. It releases and GIL and uses multiple threads. It puts the results in shared buffers in order to minimize the overhead of memory allocations and Python object creation and garbage collection. Using this mode with 8 threads on my 4+4 core i7-6700K, level 1 compresses at ~525 MB/s, -1 at ~1,070 MB/s, and -5 at ~1,930 MB/s. On the decompression side, level 1 is ~1,320 MB/s, -1 at ~3,800 MB/s, and -5 at ~4,430 MB/s.

So, my consumer grade desktop i7-6700K is capable of emitting decompressed data at over 4 GB/s with Python. That's pretty good if you ask me. (Full disclosure: the timings were taken just around the compression operation itself: overhead of loading data into memory was not taken into account. See the script in the source repository for more.

Long Distance Matching Mode

Negative compression levels take zstandard into performance territory that has historically been reserved for compression formats like lz4 that are optimized for that domain. Long distance matching takes zstandard in the other direction, towards compression formats that aim to achieve optimal compression ratios at the expense of time and memory usage.

python-zstandard 0.9 supports long distance matching and all the configurable parameters exposed by the zstandard API.

I'm not going to capture many performance numbers here because python-zstandard performs about the same as the C implementation because LDM mode spends most of its time in zstandard C code. If you are interested in numbers, I recommend reading the zstandard 1.3.2 and 1.3.4 release notes.

I will, however, underscore that zstandard can achieve close to lzma's compression ratios (what the xz utility uses) while completely smoking lzma on decompression speed. For a bundle of the Firefox Mercurial repository, zstandard level 19 with a long distance window size of 512 MB using 8 threads compresses to 1,033,633,309 bytes (16.0%) in ~260s wall, 1,730s CPU. xz -T8 -8 compresses to 1,009,233,160 (15.6%) in ~367s wall, ~2,790s CPU.

On the decompression side, zstandard takes ~4.8s and runs at ~1,350 MB/s as measured on the output side while xz takes ~54s and runs at ~114 MB/s. Zstandard, however, does use a lot more memory than xz for decompression, so that performance comes with a cost (512 MB versus 32 MB for this configuration).

Other Notable Changes

python-zstandard now uses the advanced compression and decompression APIs everywhere. All tunable compression and decompression parameters are available to python-zstandard. This includes support for disabling magic headers in frames (saves 4 bytes per frame - this can matter for very small inputs, especially when using dictionary compression).

The full dictionary training API is exposed. Dictionary training can now use multiple threads.

There are a handful of utility functions for inspecting zstandard frames, querying the state of compressors, etc.

Lots of work has gone into shoring up the code base. We now build with warnings as errors in CI. I performed a number of focused auditing passes to fix various classes of deficiencies in the C code. This includes use of the buffer protocol: python-zstandard is now able to accept any Python object that provides a view into its underlying raw data.

Decompression contexts can now be constructed with a max memory threshold so attempts to decompress something that would require more memory will result in error.

See the full release notes for more.


Since I last released a major version of python-zstandard, a lot has changed in the zstandard world. As I blogged last year, zstandard circa early 2017 was a very compelling compression format: it already outperformed popular compression formats like zlib and bzip2 across the board. As a general purpose compression format, it made a compelling case for itself. In my mind, brotli was its only real challenger.

As I wrote then, zstandard isn't perfect. (Nothing is.) But a year later, it is refreshing to see advancements.

A criticism one year ago was zstandard was pretty good as a general purpose compression format but it wasn't great if you live at the fringes. If you were a speed freak, you'd probably use lz4. If you cared about compression ratios, you'd probably use lzma. But recent releases of zstandard have made huge strides into the territory of these niche formats. Negative compression levels allow zstandard to flirt with lz4's performance. Long distance matching allows zstandard to achieve close to lzma's compression ratios. This is a big friggin deal because it makes it much, much harder to justify a domain-specific compression format over zstandard. I think lzma still has a significant edge for ultra compression ratios when memory utilization is a concern. But for many consumers, memory is readily available and it is easy to justify trading potentially hundreds of megabytes of memory to achieve a 10x speedup for decompression. Even if you aren't willing to sacrifice more memory, the ability to tweak compression parameters is huge. You can do things like store multiple versions of a compressed document and conditionally serve the one most appropriate for the client, all while running the same zstandard-only code on the client. That's huge.

A year later, zstandard continues to impress me for its set of features and its versatility. The library is continuing to evolve - all while maintaining backwards compatibility on the decoding side. (That's a sign of a good format design if you ask me.) I was honestly shocked to see that zstandard was able to change its compression settings in a way that allowed it to compete with lz4 and lzma without requiring a format change.

The more I use zstandard, the more I think that everyone should use this and that popular compression formats just aren't cut out for modern computing any more. Every time I download a zlib/gz or bzip2 compressed archive, I'm thinking if only they used zstandard this archive would be smaller, it would have decompressed already, and I wouldn't be thinking about how annoying it is to wait for compression operations to complete. In my mind, zstandard is such an obvious advancement over the status quo and is such a versatile format - now covering the gamut of super fast compression to ultra ratios - that it is bordering on negligent to not use zstandard. With the removal of the controversial patent rights grant license clause in zstandard 1.3.1, that justifiable resistance to widespread adoption of zstandard has been eliminated. Zstandard is objectively superior for many workloads and I heavily encourage its use. I believe python-zstandard provides a high-quality interface to zstandard and I encourage you to give it and zstandard a try the next time you compress data.

If you run into any problems or want to get involved with development, python-zstandard lives at indygreg/python-zstandard on GitHub.

*(I updated the post on 2018-05-16 to remove a paragraph about zstandard competition. In the original post, I unfairly compared zstandard to Snappy instead of Brotli and made some inaccurate statements around that comparison.)

Read and Post Comments

Next Page ยป