Global Kernel Locks in APFS

October 29, 2018 at 02:20 PM | categories: Python, Mercurial, Apple | View Comments

Over the past several months, a handful of people had been complaining that Mercurial's test harness was executing much slower on Macs. But this slowdown seemingly wasn't occurring on Linux or Windows. And not every Mac user experienced the slowness!

Before jetting off to the Mercurial 4.8 developer meetup in Stockholm a few weeks ago, I sat down with a relatively fresh 6+6 core MacBook Pro and experienced the problem firsthand: on my 4+4 core i7-6700K running Linux, the Mercurial test harness completes in ~12 minutes, but on this MacBook Pro, it was executing in ~38 minutes! On paper, this result doesn't make any sense because there's no way that the MacBook Pro should be ~3x slower than that desktop machine.

Looking at Activity Monitor when running the test harness with 12 tests in parallel revealed something odd: the system was spending ~75% of overall CPU time inside the kernel! When reducing the number of tests that ran in parallel, the percentage of CPU time spent in the kernel decreased and the overall test harness execution time also decreased. This kind of behavior is usually a sign of something very inefficient in kernel land.

I sample profiled all processes on the system when running the Mercurial test harness. Aggregate thread stacks revealed a common pattern: readdir() being in the stack.

Upon closer examination of the stacks, readdir() calls into apfs_vnop_readdir(), which calls into some functions with bt or btree in their name, which call into lck_mtx_lock(), lck_mtx_lock_grab_mutex() and various other functions with lck_mtx in their name. And the caller of most readdir() appeared to be Python 2.7's module importing mechanism (notably import.c:case_ok()).

APFS refers to the Apple File System, which is a filesystem that Apple introduced in 2017 and is the default filesystem for new versions of macOS and iOS. If upgrading an old Mac to a new macOS, its HFS+ filesystems would be automatically converted to APFS.

While the source code for APFS is not available for me to confirm, the profiling results showing excessive time spent in lck_mtx_lock_grab_mutex() combined with the fact that execution time decreases when the parallel process count decreases leads me to the conclusion that APFS obtains a global kernel lock during read-only operations such as readdir(). In other words, APFS slows down when attempting to perform parallel read-only I/O.

This isn't the first time I've encountered such behavior in a filesystem: last year I blogged about very similar behavior in AUFS, which was making Firefox CI significantly slower.

Because Python 2.7's module importing mechanism was triggering the slowness by calling readdir(), I posted to python-dev about the problem, as I thought it was important to notify the larger Python community. After all, this is a generic problem that affects the performance of starting any Python process when running on APFS. i.e. if your build system invokes many Python processes in parallel, you could be impacted by this. As part of obtaining data for that post, I discovered that Python 3.7 does not call readdir() as part of module importing and therefore doesn't exhibit a severe slowdown. (Python's module importing code was rewritten significantly in Python 3 and the fix was likely introduced well before Python 3.7.)

I've produced a gist that can reproduce the problem. The script essentially performs a recursive directory walk. It exercises the opendir(), readdir(), closedir(), and lstat() functions heavily and is essentially a benchmark of the filesystem and filesystem cache's ability to return file metadata.

When you tell it to walk a very large directory tree - say a Firefox version control checkout (which has over 250,000 files) - the excessive time spent in the kernel is very apparent on macOS 10.13 High Sierra:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s

real    2m52.470s
user    1m54.053s
sys    23m42.808s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 523.440s

real    8m43.740s
user    1m13.397s
sys     3m50.687s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 210.487s

real    3m30.731s
user    2m40.216s
sys    33m34.406s

On the same machine upgraded to macOS 10.14 Mojave, we see a bit of a speedup!:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 97.833s

real    1m37.981s
user    1m40.272s
sys    10m49.091s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 461.415s

real    7m41.657s
user    1m05.830s
sys     3m47.041s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 140.474s

real    2m20.727s
user    3m01.048s
sys    17m56.228s

Contrast with my i7-6700K Linux machine backed by EXT4:

$ time ./slow-readdir.py -l 8 ~/src/firefox
ran 8 walks across 8 processes in 6.018s

real    0m6.191s
user    0m29.670s
sys     0m17.838s

$ time ./slow-readdir.py -l 8 -j 1 ~/src/firefox
ran 8 walks across 1 processes in 33.958s

real    0m34.164s
user    0m17.136s
sys     0m13.369s

$ time ./slow-readdir.py -l 12 -j 12 ~/src/firefox
ran 12 walks across 12 processes in 25.465s

real    0m25.640s
user    1m4.801s
sys     1m20.488s

It is apparent that macOS 10.14 Mojave has received performance work relative to macOS 10.13! Overall kernel CPU time when performing parallel directory walks has decreased substantially - to ~50% of original on some invocations! Stacks seem to reveal new code for lock acquisition, so this might indicate generic improvements to the kernel's locking mechanism rather than APFS specific changes. Changes to file metadata caching could also be responsible for performance changes. Although it is difficult to tell without access to the APFS source code. Despite those improvements, APFS is still spending a lot of CPU time in the kernel. And the kernel CPU time is still comparatively very high compared to Linux/EXT4, even for single process operation.

At this time, I haven't conducted a comprehensive analysis of APFS to determine what other filesystem operations seem to acquire global kernel locks: all I know is readdir() does. A casual analysis of profiled stacks when running Mercurial's test harness against Python 3.7 seems to show apfs_* functions still on the stack a lot and that seemingly indicates more APFS slowness under parallel I/O load. But HFS+ exhibited similar problems (it appeared HFS+ used a single I/O thread inside the kernel for many operations, making I/O on macOS pretty bad), so I'm not sure if these could be considered regressions the way readdir()'s new behavior is.

I've reported this issue to Apple at https://bugreport.apple.com/web/?problemID=45648013 and on OpenRadar at https://openradar.appspot.com/radar?id=5025294012383232. I'm told that issues get more attention from Apple when there are many duplicates of the same issue. So please reference this issue if you file your own report.

Now that I've elaborated on the technical details, I'd like to add some personal commentary. While this post is about APFS, this issue of global kernel locks during common I/O operations is not unique to APFS. I already referenced similar issues in AUFS. And I've encountered similar behaviors with Btrfs (although I can't recall exactly which operations). And NTFS has its own bag of problems.

This seeming pattern of global kernel locks for common filesystem operations and slow filesystems is really rubbing me the wrong way. Modern NVMe SSDs are capable of reading and writing well over 2 gigabytes per second and performing hundreds of thousands of I/O operations per second. We even have Intel soon producing persistent solid state storage that plugs into DIMM slots because it is that friggin fast.

Today's storage hardware is capable of ludicrous performance. It is fast enough that you will likely saturate multiple CPU cores processing the read or written data coming from and going to storage - especially if you are using higher-level, non-JITed (read: slower) programming languages (like Python). There has also been a trend that systems are growing more CPU cores faster than they are instructions per second per core. And SSDs only achieve these ridiculous IOPS numbers if many I/O operations are queued and can be more efficiently dispatched within the storage device. What this all means is that it probably makes sense to use parallel I/O across multiple threads in order to extract all potential performance from your persistent storage layer.

It's also worth noting that we now have solid state storage that outperforms (in some dimensions) what DRAM from ~20 years ago was capable of. Put another way I/O APIs and even some filesystems were designed in an era when its RAM was slower than what today's persistent storage is capable of! While I'm no filesystems or kernel expert, it does seem a bit silly to be using APIs and filesystems designed for an era when storage was multiple orders of magnitude slower and systems only had a single CPU core.

My takeaway is I can't help but feel that systems-level software (including the kernel) is severely limiting the performance potential of modern storage devices. If we have e.g. global kernel locks when performing common I/O operations, there's no chance we'll come close to harnessing the full potential of today's storage hardware. Furthermore, the behavior of filesystems is woefully under documented and software developers have little solid advice for how to achieve optimal I/O performance. As someone who cares about performance, I want to squeeze every iota of potential out of hardware. But the lack of documentation telling me which operations acquire locks, which strategies are best for say reading or writing 10,000 files using N threads, etc makes this extremely difficult. And even if this documentation existed, because of differences in behavior across filesystems and operating systems and the difficulty in programmatically determining the characteristics of filesystems at run time, it is practically impossible to design a one size fits all approach to high performance I/O.

The filesystem is a powerful concept. I want to agree and use the everything is a file philosophy. Unfortunately, filesystems don't appear to be scaling very well to support the potential of modern day storage technology. We're probably at the point where commodity priced solid state storage is far more capable than today's software for the majority of applications. Storage hardware manufacturers will keep producing faster and faster storage and their marketing teams will keep convincing us that we need to buy it. But until software catches up, chances are most of us won't come close to realizing the true potential of modern storage hardware. And that's even true for specialized applications that do employ tricks taking hundreds or thousands of person hours to implement in order to eek out every iota of performance potential. The average software developer and application using filesystems as they were designed to be used has little to no chance of coming close to utilizing the performance potential of modern storage devices. That's really a shame.

Read and Post Comments

Release of python-zstandard 0.9

April 09, 2018 at 09:30 AM | categories: Python, Mozilla | View Comments

I have just released python-zstandard 0.9.0. You can install the latest release by running pip install zstandard==0.9.0.

Zstandard is a highly tunable and therefore flexible compression algorithm with support for modern features such as multi-threaded compression and dictionaries. Its performance is remarkable and if you use it as a drop-in replacement for zlib, bzip2, or other common algorithms, you'll frequently see more than a doubling in performance.

python-zstandard provides rich bindings to the zstandard C library without sacrificing performance, safety, features, or a Pythonic feel. The bindings run on Python 2.7, 3.4, 3.5, 3.6, 3.7 using either a C extension or CFFI bindings, so it works with CPython and PyPy.

I can make a compelling argument that python-zstandard is one of the richest compression packages available to Python programmers. Using it, you will be able to leverage compression in ways you couldn't with other packages (especially those in the standard library) all while achieving ridiculous performance. Due to my focus on performance, python-zstandard is able to outperform Python bindings to other compression libraries that should be faster. This is because python-zstandard is very diligent about minimizing memory allocations and copying, minimizing Python object creation, reusing state, etc.

While python-zstandard is formally marked as a beta-level project and hasn't yet reached a 1.0 release, it is suitable for production usage. python-zstandard 0.8 shipped with Mercurial and is in active production use there. I'm also aware of other consumers using it in production, including at Facebook and Mozilla.

The sections below document some of the new features of python-zstandard 0.9.

File Object Interface for Reading

The 0.9 release contains a stream_reader() API on the compressor and decompressor objects that allows you to treat those objects as readable file objects. This means that you can pass a ZstdCompressor or ZstdDecompressor around to things that accept file objects and things generally just work. e.g.:

   with open(compressed_file, 'rb') as ifh:
       cctx = zstd.ZstdDecompressor()
       with cctx.stream_reader(ifh) as reader:
           while True:
               chunk = reader.read(32768)
               if not chunk:
                   break

This is probably the most requested python-zstandard feature.

While the feature is usable, it isn't complete. Support for readline(), readinto(), and a few other APIs is not yet implemented. In addition, you can't use these reader objects for opening zstandard compressed tarball files because Python's tarfile package insists on doing backward seeks when reading. The current implementation doesn't support backwards seeking because that requires buffering decompressed output and that is not trivial to implement. I recognize that all these features are useful and I will try to work them into a subsequent release of 0.9.

Negative Compression Levels

The 1.3.4 release of zstandard (which python-zstandard 0.9 bundles) supports negative compression levels. I won't go into details, but negative compression levels disable extra compression features and allow you to trade compression ratio for more speed.

When compressing a 6,472,921,921 byte uncompressed bundle of the Firefox Mercurial repository, the previous fastest we could go with level 1 was ~510 MB/s (measured on the input side) yielding a 1,675,227,803 file (25.88% of original).

With level -1, we compress to 1,934,253,955 (29.88% of original) at ~590 MB/s. With level -5, we compress to 2,339,110,873 bytes (36.14% of original) at ~720 MB/s.

On the decompress side, level 1 decompresses at ~1,150 MB/s (measured at the output side), -1 at ~1,320 MB/s, and -5 at ~1,350 MB/s (generally speaking, zstandard's decompression speeds are relatively similar - and fast - across compression levels).

And that's just with a single thread. zstandard supports using multiple threads to compress a single input and python-zstandard makes this feature easy to use. Using 8 threads on my 4+4 core i7-6700K, level 1 compresses at ~2,000 MB/s (3.9x speedup), -1 at ~2,300 MB/s (3.9x speedup), and -5 at ~2,700 MB/s (3.75x speedup).

That's with a large input. What about small inputs?

If we take 456,599 Mercurial commit objects spanning 298,609,254 bytes from the Firefox repository and compress them individually, at level 1 we yield a total of 133,457,198 bytes (44.7% of original) at ~112 MB/s. At level -1, we compress to 161,241,797 bytes (54.0% of original) at ~215 MB/s. And at level -5, we compress to 185,885,545 bytes (62.3% of original) at ~395 MB/s.

On the decompression side, level 1 decompresses at ~260 MB/s, -1 at ~1,000 MB/s, and -5 at ~1,150 MB/s.

Again, that's 456,599 operations on a single thread with Python.

python-zstandard has an experimental API where you can pass in a collection of inputs and it batch compresses or decompresses them in a single operation. It releases and GIL and uses multiple threads. It puts the results in shared buffers in order to minimize the overhead of memory allocations and Python object creation and garbage collection. Using this mode with 8 threads on my 4+4 core i7-6700K, level 1 compresses at ~525 MB/s, -1 at ~1,070 MB/s, and -5 at ~1,930 MB/s. On the decompression side, level 1 is ~1,320 MB/s, -1 at ~3,800 MB/s, and -5 at ~4,430 MB/s.

So, my consumer grade desktop i7-6700K is capable of emitting decompressed data at over 4 GB/s with Python. That's pretty good if you ask me. (Full disclosure: the timings were taken just around the compression operation itself: overhead of loading data into memory was not taken into account. See the bench.py script in the source repository for more.

Long Distance Matching Mode

Negative compression levels take zstandard into performance territory that has historically been reserved for compression formats like lz4 that are optimized for that domain. Long distance matching takes zstandard in the other direction, towards compression formats that aim to achieve optimal compression ratios at the expense of time and memory usage.

python-zstandard 0.9 supports long distance matching and all the configurable parameters exposed by the zstandard API.

I'm not going to capture many performance numbers here because python-zstandard performs about the same as the C implementation because LDM mode spends most of its time in zstandard C code. If you are interested in numbers, I recommend reading the zstandard 1.3.2 and 1.3.4 release notes.

I will, however, underscore that zstandard can achieve close to lzma's compression ratios (what the xz utility uses) while completely smoking lzma on decompression speed. For a bundle of the Firefox Mercurial repository, zstandard level 19 with a long distance window size of 512 MB using 8 threads compresses to 1,033,633,309 bytes (16.0%) in ~260s wall, 1,730s CPU. xz -T8 -8 compresses to 1,009,233,160 (15.6%) in ~367s wall, ~2,790s CPU.

On the decompression side, zstandard takes ~4.8s and runs at ~1,350 MB/s as measured on the output side while xz takes ~54s and runs at ~114 MB/s. Zstandard, however, does use a lot more memory than xz for decompression, so that performance comes with a cost (512 MB versus 32 MB for this configuration).

Other Notable Changes

python-zstandard now uses the advanced compression and decompression APIs everywhere. All tunable compression and decompression parameters are available to python-zstandard. This includes support for disabling magic headers in frames (saves 4 bytes per frame - this can matter for very small inputs, especially when using dictionary compression).

The full dictionary training API is exposed. Dictionary training can now use multiple threads.

There are a handful of utility functions for inspecting zstandard frames, querying the state of compressors, etc.

Lots of work has gone into shoring up the code base. We now build with warnings as errors in CI. I performed a number of focused auditing passes to fix various classes of deficiencies in the C code. This includes use of the buffer protocol: python-zstandard is now able to accept any Python object that provides a view into its underlying raw data.

Decompression contexts can now be constructed with a max memory threshold so attempts to decompress something that would require more memory will result in error.

See the full release notes for more.

Conclusion

Since I last released a major version of python-zstandard, a lot has changed in the zstandard world. As I blogged last year, zstandard circa early 2017 was a very compelling compression format: it already outperformed popular compression formats like zlib and bzip2 across the board. As a general purpose compression format, it made a compelling case for itself. In my mind, brotli was its only real challenger.

As I wrote then, zstandard isn't perfect. (Nothing is.) But a year later, it is refreshing to see advancements.

A criticism one year ago was zstandard was pretty good as a general purpose compression format but it wasn't great if you live at the fringes. If you were a speed freak, you'd probably use lz4. If you cared about compression ratios, you'd probably use lzma. But recent releases of zstandard have made huge strides into the territory of these niche formats. Negative compression levels allow zstandard to flirt with lz4's performance. Long distance matching allows zstandard to achieve close to lzma's compression ratios. This is a big friggin deal because it makes it much, much harder to justify a domain-specific compression format over zstandard. I think lzma still has a significant edge for ultra compression ratios when memory utilization is a concern. But for many consumers, memory is readily available and it is easy to justify trading potentially hundreds of megabytes of memory to achieve a 10x speedup for decompression. Even if you aren't willing to sacrifice more memory, the ability to tweak compression parameters is huge. You can do things like store multiple versions of a compressed document and conditionally serve the one most appropriate for the client, all while running the same zstandard-only code on the client. That's huge.

A year later, zstandard continues to impress me for its set of features and its versatility. The library is continuing to evolve - all while maintaining backwards compatibility on the decoding side. (That's a sign of a good format design if you ask me.) I was honestly shocked to see that zstandard was able to change its compression settings in a way that allowed it to compete with lz4 and lzma without requiring a format change.

The more I use zstandard, the more I think that everyone should use this and that popular compression formats just aren't cut out for modern computing any more. Every time I download a zlib/gz or bzip2 compressed archive, I'm thinking if only they used zstandard this archive would be smaller, it would have decompressed already, and I wouldn't be thinking about how annoying it is to wait for compression operations to complete. In my mind, zstandard is such an obvious advancement over the status quo and is such a versatile format - now covering the gamut of super fast compression to ultra ratios - that it is bordering on negligent to not use zstandard. With the removal of the controversial patent rights grant license clause in zstandard 1.3.1, that justifiable resistance to widespread adoption of zstandard has been eliminated. Zstandard is objectively superior for many workloads and I heavily encourage its use. I believe python-zstandard provides a high-quality interface to zstandard and I encourage you to give it and zstandard a try the next time you compress data.

If you run into any problems or want to get involved with development, python-zstandard lives at indygreg/python-zstandard on GitHub.

*(I updated the post on 2018-05-16 to remove a paragraph about zstandard competition. In the original post, I unfairly compared zstandard to Snappy instead of Brotli and made some inaccurate statements around that comparison.)

Read and Post Comments

from __past__ import bytes_literals

March 13, 2017 at 09:55 AM | categories: Python, Mercurial, Mozilla | View Comments

Last year, I simultaneously committed one of the ugliest and impressive hacks of my programming career. I haven't had time to write about it. Until now.

In summary, the hack is a source-transforming module loader for Python. It can be used by Python 3 to import a Python 2 source file while translating certain primitives to their Python 3 equivalents. It is kind of like 2to3 except it executes at run-time during import. The main goal of the hack was to facilitate porting Mercurial to Python 3 while deferring having to make the most invasive - and therefore most annoying - elements of the port in the canonical source code representation.

For the technically curious, it works as follows.

The hg Python executable registers a custom meta path finder instance. This entity is invoked during import statements to try to find the module being imported. It tells a later phase of the import mechanism how to load that module from wherever it is (usually a .py or .pyc file on disk) to a Python module object. The custom finder only responds to requests for modules known to be managed by the Mercurial project. For these modules, it tells the next stage of the import mechanism to invoke a custom SourceLoader instance. Here's where the real magic is: when the custom loader is invoked, it tokenizes the Python source code using the tokenize module, iterates over the token stream, finds specific patterns, and rewrites them to something more appropriate. It then untokenizes back to Python source code then falls back to the built-in loader which does the heavy lifting of compiling the source to Python code objects. So, we have Python 2 source files on disk that magically get transformed to be Python compatible when they are loaded by Python 3. Oh, and there is no performance penalty for the token transformation on subsequence loads because the transformed bytecode is cached in the .pyc file (using a custom header so we know it was transformed and can be invalidated when the transformation logic changes).

At the time I wrote it, the token stream manipulation converted most string literals ('') to bytes literals (b''). In other words, it restored the Python 2 behavior of string literals being bytes and not unicode. We jokingly call it from __past__ import bytes_literals (a play on Python 2's from __future__ import unicode_literals special syntax which changes string literals from Python 2's str/bytes type to unicode to match Python 3's behavior).

Since I implemented the first version, others have implemented:

As one can expect, when I tweeted a link to this commit, many Python developers (including a few CPython core developers) expressed a mix of intrigue and horror. But mostly horror.

I fully concede that what I did here is a gross hack. And, it is the intention of the Mercurial project to undo this hack and perform a proper port once Python 3 support in Mercurial is more mature. But, I want to lay out my defense on why I did this and why the Mercurial project is tolerant of this ugly hack.

Individuals within the Mercurial project have wanted to port to Python 3 for years. Until recently, it hasn't been a project priority because a port was too much work for too little end-user gain. And, on the technical front, a port was just not practical until Python 3.5. (Two main blockers were no u'' literals - restored in Python 3.3 - and no % formatting for b'' literals - restored in 3.5. And as I understand it, senior members of the Mercurial project had to lobby Python maintainers pretty hard to get features like % formatting of b'' literals restored to Python 3.)

Anyway, after a number of failed attempts to initiate the Python 3 port over the years, the Mercurial project started making some positive steps towards Python 3 compatibility, such as switching to absolute imports and addressing syntax issues that allowed modules to be parsed into an AST and even compiled and loadable. These may seem like small steps, but for a larger project, it was a lot of work.

The porting effort hit a large wall when it came time to actually make the AST-valid Python code run on Python 3. Specifically, we had a strings problem.

When you write software that exchanges data between machines - sometimes machines running different operating systems or having different encodings - and there is an expectation that things work the same and data roundtrips accordingly, trying to force text encodings is essentially impossible and inevitably breaks something or someone. It is much easier for Mercurial to operate bytes first and only take text encoding into consideration when absolutely necessary (such as when emitting bytes to the terminal in the wanted encoding or when emitting JSON). That's not to say Mercurial ignores the existence of encodings. Far from it: Mercurial does attempt to normalize some data to Unicode. But it often does so with a special Python type that internally stores the raw byte sequence of the source so that a consumer can choose to operate at the bytes or Unicode level.

Anyway, this means that practically every string variable in Mercurial is a bytes type (or something that acts like a bytes type). And since string literals in Python 3 are the str type (which represents Unicode), that would mean having to prefix almost every '' string literal in Mercurial with b'' in order to placate Python 3. Having to update every occurrence of simple primitives that could be statically transformed automatically felt like busy work. We wanted to spend time on the meaningful parts of the Python 3 port so we could find interesting problems and challenges, not toil with mechanical conversions that add little to no short-term value while simultaneously increasing cognitive dissonance and quite possibly increasing the odds of introducing a bug in Python 2. In other words, why should humans do the work that machines can do for us? Thus, the source-transforming module importer was born.

While I concede what Mercurial did is a giant hack, I maintain it was the correct thing to do. It has allowed the Python 3 port to move forward without being blocked on the more tedious and invasive transformations that could introduce subtle bugs (including performance regressions) in Python 2. Perfect is the enemy of good. People time is valuable. The source-transforming module importer allowed us to unblock an important project without sinking a lot of people time into it. I'd make that trade-off again.

While I won't encourage others to take this approach to porting to Python 3, if you want to, Mercurial's source is available under a GPL license and the custom module importer could be adapted to any project with minimal modifications. If someone does extract it as reusable code, please leave a comment and I'll update the post to link to it.

Read and Post Comments

Better Compression with Zstandard

March 07, 2017 at 09:55 AM | categories: Python, Mercurial, Mozilla | View Comments

I think I first heard about the Zstandard compression algorithm at a Mercurial developer sprint in 2015. At one end of a large table a few people were uttering expletives out of sheer excitement. At developer gatherings, that's the universal signal for something is awesome. Long story short, a Facebook engineer shared a link to the RealTime Data Compression blog operated by Yann Collet (then known as the author of LZ4 - a compression algorithm known for its insane speeds) and people were completely nerding out over the excellent articles and the data within showing the beginnings of a new general purpose lossless compression algorithm named Zstandard. It promised better-than-deflate/zlib compression ratios and performance on both compression and decompression. This being a Mercurial meeting, many of us were intrigued because zlib is used by Mercurial for various functionality (including on-disk storage and compression over the wire protocol) and zlib operations frequently appear as performance hot spots.

Before I continue, if you are interested in low-level performance and software optimization, I highly recommend perusing the RealTime Data Compression blog. There are some absolute nuggets of info in there.

Anyway, over the months, the news about Zstandard (zstd) kept getting better and more promising. As the 1.0 release neared, the Facebook engineers I interact with (Yann Collet - Zstandard's author - is now employed by Facebook) were absolutely ecstatic about Zstandard and its potential. I was toying around with pre-release versions and was absolutely blown away by the performance and features. I believed the hype.

Zstandard 1.0 was released on August 31, 2016. A few days later, I started the python-zstandard project to provide a fully-featured and Pythonic interface to the underlying zstd C API while not sacrificing safety or performance. The ulterior motive was to leverage those bindings in Mercurial so Zstandard could be a first class citizen in Mercurial, possibly replacing zlib as the default compression algorithm for all operations.

Fast forward six months and I've achieved many of those goals. python-zstandard has a nearly complete interface to the zstd C API. It even exposes some primitives not in the C API, such as batch compression operations that leverage multiple threads and use minimal memory allocations to facilitate insanely fast execution. (Expect a dedicated post on python-zstandard from me soon.)

Mercurial 4.1 ships with the python-zstandard bindings. Two Mercurial 4.1 peers talking to each other will exchange Zstandard compressed data instead of zlib. For a Firefox repository clone, transfer size is reduced from ~1184 MB (zlib level 6) to ~1052 MB (zstd level 3) in the default Mercurial configuration while using ~60% of the CPU that zlib required on the compressor end. When cloning from hg.mozilla.org, the pre-generated zstd clone bundle hosted on a CDN using maximum compression is ~707 MB - ~60% the size of zlib! And, work is ongoing for Mercurial to support Zstandard for on-disk storage, which should bring considerable performance wins over zlib for local operations.

I've learned a lot working on python-zstandard and integrating Zstandard into Mercurial. My primary takeaway is Zstandard is awesome.

In this post, I'm going to extol the virtues of Zstandard and provide reasons why I think you should use it.

Why Zstandard

The main objective of lossless compression is to spend one resource (CPU) so that you may reduce another (I/O). This trade-off is usually made because data - either at rest in storage or in motion over a network or even through a machine via software and memory - is a limiting factor for performance. So if compression is needed for your use case to mitigate I/O being the limiting resource and you can swap in a different compression algorithm that magically reduces both CPU and I/O requirements, that's pretty exciting. At scale, better and more efficient compression can translate to substantial cost savings in infrastructure. It can also lead to improved application performance, translating to better end-user engagement, sales, productivity, etc. This is why companies like Facebook (Zstandard), Google (brotli, snappy, zopfli), and Pied Piper (middle-out) invest in compression.

Today, the most widely used compression algorithm in the world is likely DEFLATE. And, software most often interacts with DEFLATE via what is likely the most widely used software library in the world, zlib.

Being at least 27 years old, DEFLATE is getting a bit long in the tooth. Computers are completely different today than they were in 1990. The Pentium microprocessor debuted in 1993. If memory serves (pun intended), it used PC66 DRAM, which had a transfer rate of 533 MB/s. For comparison, a modern NVMe M.2 SSD (like the Samsung 960 PRO) can read at 3000+ MB/s and write at 2000+ MB/s. In other words, persistent storage today is faster than the RAM from the era when DEFLATE was invented. And of course CPU and network speeds have increased as well. We also have completely different instruction sets on CPUs for well-designed algorithms and software to take advantage of. What I'm trying to say is the market is ripe for DEFLATE and zlib to be dethroned by algorithms and software that take into account the realities of modern computers.

(For the remainder of this post I'll use zlib as a stand-in for DEFLATE because it is simpler.)

Zstandard initially piqued my attention by promising better-than-zlib compression and performance in both the compression and decompression directions. That's impressive. But it isn't unique. Brotli achieves the same, for example. But what kept my attention was Zstandard's rich feature set, tuning abilities, and therefore versatility.

In the sections below, I'll describe some of the benefits of Zstandard in more detail.

Before I do, I need to throw in an obligatory disclaimer about data and numbers that I use. Benchmarking is hard. Benchmarks should not be trusted. There are so many variables that can influence performance and benchmarks. (A recent example that surprised me is the CPU frequency/power ramping properties of Xeon versus non-Xeon Intel CPUs. tl;dr a Xeon won't hit max CPU frequency if only a core or two is busy, meaning that any single or low-threaded benchmark is likely misleading on Xeons unless you change power settings to mitigate its conservative power ramping defaults. And if you change power settings, does that reflect real-life usage?)

Reporting useful and accurate performance numbers for compression is hard because there are so many variables to care about. For example:

  • Every corpus is different. Text, JSON, C++, photos, numerical data, etc all exhibit different properties when fed into compression and could cause compression ratios or speeds to vary significantly.
  • Few large inputs versus many smaller inputs (some algorithms work better on large inputs; some libraries have high per-operation overhead).
  • Memory allocation and use strategy. Performance can vary significantly depending on how a compression library allocates, manages, and uses memory. This can be an implementation specific detail as opposed to a core property of the compression algorithm.

Since Mercurial is the driver for my work in Zstandard, the data and numbers I report in this post are mostly Mercurial data. Specifically, I'll be referring to data in the mozilla-unified Firefox repository. This repository contains over 300,000 commits spanning almost 10 years. The data within is a good mix of text (mostly C++, JavaScript, Python, HTML, and CSS source code and other free-form text) and binary (like PNGs). The Mercurial layer adds some binary structures to e.g. represent metadata for deltas, diffs, and patching. There are two Mercurial-specific pieces of data I will use. One is a Mercurial bundle. This is essentially a representation of all data in a repository. It stores a mix of raw, fulltext data and deltas on that data. For the mozilla-unified repo, an uncompressed bundle (produced via hg bundle -t none-v2 -a) is ~4457 MB. The other piece of data is revlog chunks. This is a mix of fulltext and delta data for a specific item tracked in version control. I frequently use the changelog corpus, which is the fulltext data describing changesets or commits to Firefox. The numbers quoted and used for charts in this post are available in a Google Sheet.

All performance data was obtained on an i7-6700K running Ubuntu 16.10 (Linux 4.8.0) with a mostly stock config. Benchmarks were performed in memory to mitigate storage I/O or filesystem interference. Memory used is DDR4-2133 with a cycle time of 35 clocks.

While I'm pretty positive about Zstandard, it isn't perfect. There are corpora for which Zstandard performs worse than other algorithms, even ones I compare it directly to in this post. So, your mileage may vary. Please enlighten me with your counterexamples by leaving a comment.

With that (rather large) disclaimer out of the way, let's talk about what makes Zstandard awesome.

Flexibility for Speed Versus Size Trade-offs

Compression algorithms typically contain parameters to control how much work to do. You can choose to spend more CPU to (hopefully) achieve better compression or you can spend less CPU to sacrifice compression. (OK, fine, there are other factors like memory usage at play too. I'm simplifying.) This is commonly exposed to end-users as a compression level. (In reality there are often multiple parameters that can be tuned. But I'll just use level as a stand-in to represent the concept.)

But even with adjustable compression levels, the performance of many compression algorithms and libraries tend to fall within a relatively narrow window. In other words, many compression algorithms focus on niche markets. For example, LZ4 is super fast but doesn't yield great compression ratios. LZMA yields terrific compression ratios but is extremely slow.

This can be visualized in the following chart showing results when compressing a mozilla-unified Mercurial bundle:

bundle compression with common algorithms

This chart plots the logarithmic compression speed in megabytes per second against achieved compression ratio. The further right a data point is, the better the compression and the smaller the output. The higher up a point is, the faster compression is.

The ideal compression algorithm lives in the top right, which means it compresses well and is fast. But the powers of mathematics push compression algorithms away from the top right.

On to the observations.

LZ4 is highly vertical, which means its compression ratios are limited in variance but it is extremely flexible in speed. So for this data, you might as well stick to a lower compression level because higher values don't buy you much.

Bzip2 is the opposite: a horizontal line. That means it is consistently the same speed while yielding different compression ratios. In other words, you might as well crank bzip2 up to maximum compression because it doesn't have a significant adverse impact on speed.

LZMA and zlib are more interesting because they exhibit more variance in both the compression ratio and speed dimensions. But let's be frank, they are still pretty narrow. LZMA looks pretty good from a shape perspective, but its top speed is just too slow - only ~26 MB/s!

This small window of flexibility means that you often have to choose a compression algorithm based on the speed versus size trade-off you are willing to make at that time. That choice often gets baked into software. And as time passes and your software or data gains popularity, changing the software to swap in or support a new compression algorithm becomes harder because of the cost and disruption it will cause. That's technical debt.

What we really want is a single compression algorithm that occupies lots of space in both dimensions of our chart - a curve that has high variance in both compression speed and ratio. Such an algorithm would allow you to make an easy decision choosing a compression algorithm without locking you into a narrow behavior profile. It would allow you make a completely different size versus speed trade-off in the future by only adjusting a config knob or two in your application - no swapping of compression algorithms needed!

As you can guess, Zstandard fulfills this role. This can clearly be seen in the following chart (which also adds brotli for comparison).

bundle compression with modern algorithms

The advantages of Zstandard (and brotli) are obvious. Zstandard's compression speeds go from ~338 MB/s at level 1 to ~2.6 MB/s at level 22 while covering compression ratios from 3.72 to 6.05. On one end, zstd level 1 is ~3.4x faster than zlib level 1 while achieving better compression than zlib level 9! That fastest speed is only 2x slower than LZ4 level 1. On the other end of the spectrum, zstd level 22 runs ~1 MB/s slower than LZMA at level 9 and produces a file that is only 2.3% larger.

It's worth noting that zstd's C API exposes several knobs for tweaking the compression algorithm. Each compression level maps to a pre-defined set of values for these knobs. It is possible to set these values beyond the ranges exposed by the default compression levels 1 through 22. I've done some basic experimentation with this and have made compression even faster (while sacrificing ratio, of course). This covers the gap between Zstandard and brotli on this end of the tuning curve.

The wide span of compression speeds and ratios is a game changer for compression. Unless you have special requirements such as lightning fast operations (which LZ4 can provide) or special corpora that Zstandard can't handle well, Zstandard is a very safe and flexible choice for general purpose compression.

Multi-threaded Compression

Zstd 1.1.3 contains a multi-threaded compression API that allows a compression operation to leverage multiple threads. The output from this API is compatible with the Zstandard frame format and doesn't require any special handling on the decompression side. In other words, a compressor can switch to the multi-threaded API and decompressors won't care.

This is a big deal for a few reasons. First, today's advancements in computer processors tend to yield more capacity from more cores not from faster clocks and better cycle efficiency (although many cases do benefit greatly from modern instruction sets like AVX and therefore better cycle efficiency). Second, so many compression libraries are only single-threaded and require consumers to invent their own framing formats or storage models to facilitate multi-threading. (See Blosc for such a library.) Lack of a multi-threaded API in the compression library means trusting another piece of software or writing your own multi-threaded code.

The following chart adds a plot of Zstandard multi-threaded compression with 4 threads.

multi-threaded compression

The existing curve for Zstandard basically shifted straight up. Nice!

The ~338 MB/s speed for single-threaded compression on zstd level 1 increases to ~1,376 MB/s with 4 threads. That's ~4.06x faster. And, it is ~2.26x faster than the previous fastest entry, LZ4 at level 1! The output size only increased by ~4 MB or ~0.3% over single-threaded compression.

The scaling properties for multi-threaded compression on this input are terrific: all 4 cores are saturated and the output size barely changed.

Because Zstandard's multi-threaded compression API produces data compatible with any Zstandard decompressor, it can logically be considered an extension of compression levels. This means that the already extremely flexible speed vs ratio curve becomes even wider in the speed axis. Zstandard was already a justifiable choice with its extreme versatility. But when you throw in native multi-threaded compression API support, the flexibility for tuning compression performance is just absurd. With enough cores, you are likely to run into I/O limits long before you exhaust the CPU, at which point you can crank up the compression level and sacrifice as much CPU as you are willing to burn. That's a good position to be in.

Decompression Speed

Compression speed and ratios only tell half the story about a compression algorithm. Except for archiving scenarios where you write once and read rarely, you probably care about decompression performance.

Popular compression algorithms like zlib and bzip2 have less than stellar decompression speeds. On my i7-6700K, zlib decompression can deliver many decompressed data sets at the output end at 200+ MB/s. However, on the input/compressed end, it frequently fails to reach 100 MB/s or even 80 MB/s. This is significant because if your application is reading data over a 1 Gbps network or from a local disk (modern SSDs can read at several hundred MB/s or more), then your application has a CPU bottleneck at decoding the data - and that's before you actually do anything useful with the data in the application layer! (Remember: the idea behind compression is to spend CPU to mitigate an I/O bottleneck. So if compression makes you CPU bound, you've undermined the point of compression!) And if my Skylake CPU running at 4.0 GHz is CPU - not I/O - bound, A Xeon in a data center will be even slower and even more CPU bound (Xeons tend to run at much lower clock speeds - the laws of thermodynamics require that in order to run more cores in the package). In short, if you are using zlib for high throughput scenarios, there's a good chance it is a bottleneck and slowing down your application.

We again measure the speed of algorithms using a Firefox Mercurial bundle. The following charts plot decompression speed versus ratio for this file. The first chart measures decompression speed on the input end of the decompressor. The second measures speed at the output end.

decompression input

decompression output

Zstandard matches its great compression speed with great decompression speed. Zstandard can deliver decompressed output at 1000+ MB/s while consuming input at 200-275MB/s. Furthermore, decompression speed is mostly independent of the compression level. (Although higher compression levels require more memory in the decompressor.) So, if you want to throw more CPU at re-compression later so data at rest takes less space, you can do that without sacrificing read performance. I haven't done the math, but there is probably a break-even point where having dedicated machines re-compress terabytes or petabytes of data at rest offsets the costs of those machine through reduced storage costs.

While Zstandard is not as fast decompressing as LZ4 (which can consume compressed input at 500+ MB/s), its performance is often ~4x faster than zlib. On many CPUs, this puts it well above 1 Gbps, which is often desirable to avoid a bottleneck at the network layer.

It's also worth noting that while Zstandard and brotli were comparable on the compression half of this data, Zstandard has a clear advantage doing decompression.

Finally, you don't appear to pay a price for multi-threaded Zstandard compression on the decompression side (zstdmt in the chart).

Dictionary Support

The examples so far in this post have used a single 4,457 MB piece of input data to measure behavior. Large data can behave completely differently from small data. This is because so much of what compression algorithms do is find patterns that came before so incoming data can be referenced to old data instead of uniquely stored. And if data is small, there isn't much of it that came before to reference!

This is often why many small, independent chunks of input compress poorly compared to a single large chunk. This can be demonstrated by comparing the widely-used zip and tar archive formats. On the surface, both do the same thing: they are a container of files. But they employ compression at different phases. A zip file will zlib compress each entry independently. However, a tar file doesn't use compression internally. Instead, the tar file itself is fed into a compression algorithm and compressed as a whole.

We can observe the difference on real world data. Firefox ships with a file named omni.ja. Despite the weird extension, this is a zip file. The file contains most of the assets for non-compiled code used by Firefox. This includes the JavaScript, HTML, CSS, and images that power many parts of the Firefox frontend. The file weighs in at 9,783,749 bytes for the 64-bit Windows Firefox Nightly from 2017-03-06. (Or 9,965,793 bytes when using zip -9 - the code for generating omni.ja is smarter than zip and creates smaller files.) But a zlib level 9 compressed tar.gz file of that directory is 8,627,155 bytes. That 1,156KB / 13% size difference is significant when you are talking about delivering bits to end users! (In this case, the content within the archive needs to be individually addressable to facilitate fast access to any item without having to decompress the entire archive: this matters for performance.)

A more extreme example of the differences between zip and tar is the files in the Firefox source checkout. On revision a08ec245fa24 of the Firefox Mercurial repository, a zip file of all files in version control is 430,446,549 bytes versus 322,916,403 bytes for a tar.gz file (1,177,430,383 bytes uncompressed spanning 180,912 files). Using Zstandard, compressing each file discretely at compression level 3 yields 391,387,299 bytes of compressed data versus 294,926,418 as a single stream (without the tar container). Same compression algorithm. Different application method. Drastically different results. That's the impact of input size on compression performance.

While the compression ratio and speed of a single large stream is often better than multiple smaller chunks, there are still use cases that either don't have enough data or prefer independent access to each piece of input (like Firefox's omni.ja file). So a robust compression algorithm should handle small inputs as well as it does large inputs.

Zstandard helps offset the inherent inefficiencies of small inputs by supporting dictionary compression. A dictionary is essentially data used to seed the compressor's state. If the compressor sees data that exists in the dictionary, it references the dictionary instead of storing new data in the compressed output stream. This results in smaller output sizes and better compression ratios. One drawback to this is the dictionary has to be used to decompress data, which means you need to figure out how to distribute the dictionary and ensure it remains in sync with all data producers and consumers. This isn't always trivial.

Dictionary compression only works if there is enough repeated data and patterns in the inputs that can be extracted to yield a useful dictionary. Examples of this include markup languages, source code, or pieces of similar data (such as JSON payloads from HTTP API requests or telemetry data), which often have many repeated keywords and patterns.

Dictionaries are typically produced by training them on existing data. Essentially, you feed a bunch of samples into an algorithm that spits out a meaningful and useful dictionary. The more coherency in the data that will be compressed, the better the dictionary and the better the compression ratios.

Dictionaries can have a significant effect on compression ratios and speed.

Let's go back to Firefox's omni.ja file. Compressing each file discretely at zstd level 12 yields 9,177,410 bytes of data. But if we produce a 131,072 byte dictionary by training it on all files within omni.ja, the total size of each file compressed discretely is 7,942,886 bytes. Including the dictionary, the total size is 8,073,958 bytes, 1,103,452 bytes smaller than non-dictionary compression! (The zlib-based omni.ja is 9,783,749 bytes.) So Zstandard plus dictionary compression would likely yield a meaningful ~1.5 MB size reduction to the omni.ja file. This would make the Firefox distribution smaller and may improve startup time (since many files inside omni.ja are accessed at startup), which would make a number of people very happy. (Of course, Firefox doesn't yet contain the zstd C library. And adding it just for this use case may not make sense. But Firefox does ship with the brotli library and brotli supports dictionary compression and has similar performance characteristics as Zstandard, so, uh, someone may want to look into transitioning omni.jar to not zlib.)

But the benefits of dictionary compression don't end at compression ratios: operations with dictionaries can be faster as well!

The following chart shows performance when compressing Mercurial changeset data (describes a Mercurial commit) for the Firefox repository. There are 382,530 discrete inputs spanning 221,429,458 bytes (mean: 579 bytes, median: 306 bytes). (Note: measurements were conducted in Python and therefore may introduce some overhead.)

dictionary compression performance

Aside from zstd level 3 dictionary compression, Zstandard is faster than zlib level 6 across the board (I suspect this one-off is an oddity with the zstd compression parameters at this level and this corpus because zstd level 4 is faster than level 3, which is weird).

It's also worth noting that non-dictionary zstandard compression has similar compression ratios to zlib. Again, this demonstrates the intrinsic difficulties of compressing small inputs.

But the real takeaway from this data are the speed differences with dictionary compression enabled. Dictionary decompression is 2.2-2.4x faster than non-dictionary decompression. Already respectable ~240 MB/s decompression speed (measured at the output end) becomes ~530 MB/s. Zlib level 6 was ~140 MB/s, so swapping in dictionary compression makes things ~3.8x faster. It takes ~1.5s of CPU time to zlib decompress this corpus. So if Mercurial can be taught to use Zstandard dictionary compression for changelog data, certain operations on this corpus will complete ~1.1s faster. That's significant.

It's worth stating that Zstandard isn't the only compression algorithm or library to support dictionary compression. Brotli and zlib do as well, for example. But, Zstandard's support for dictionary compression seems to be more polished than other libraries I've seen. It has multiple APIs for training dictionaries from sample data. (Brotli has none nor does brotli's documentation say how to generate dictionaries as far as I can tell.)

Dictionary compression is definitely an advanced feature, applicable only to certain use cases (lots of small, similar data). But there's no denying that if you can take advantage of dictionary compression, you may be rewarded with significant performance wins.

A Versatile C API

I spend a lot of my time these days in higher-level programming languages like Python and JavaScript. By the time you interact with compression in high-level languages, the low-level compression APIs provided by the compression library are most likely hidden from you and bundled in a nice, friendly abstraction, suitable for a higher-level language. And more often than not, many features of that low-level API are not exposed for you to call. So, you don't get an appreciation for how good (or bad) or feature rich (or lacking) the low-level API is.

As part of writing python-zstandard, I've spent a lot of time interfacing with the zstd C API. And, as part of evaluating other compression libraries for use in Mercurial, I've been looking at C APIs for other libraries and the Python bindings to them. A takeaway from this is an appreciation for the quality of zstd's C API.

Many compression library APIs are either too simple or too complex. Zstandard's is in the Goldilocks zone. Aside from a few minor missing features, its C API was more than adequate in its 1.0 release.

What I really appreciate about the zstd C API is that it provides high, medium, and low-level APIs. From the highest level, you throw it pointers to input and output buffers and it does an operation. From the medium level, you use a reusable context holding state and other parameters and it does an operation. From the low-level, you are calling multiple functions and shuffling bytes around, maintaining your own state and potentially bypassing the Zstandard framing format in the process. The different levels give you almost total control over everything. This is critical for performance optimization and when writing bindings for higher-level languages that may have different expectations on the behavior of software. The performance I've achieved in python-zstandard just isn't (easily) possible with other compression libraries because of their lacking API design.

Oftentimes when interacting with a C library I think if only there were a function to let me do X my life would be much easier. I rarely have this experience with Zstandard. The C API is well thought out, has almost all the features I want/need, and is pretty easy to use. While most won't notice this difference, it should be a significant advantage for Zstandard in the long run, as more bindings are written and more people have a high-quality experience with it because the C API allows them to.

Zstandard Isn't Perfect

I've been pretty positive about Zstandard so far in this post. In fear of sounding like a fanboy who is so blinded by admiration that he can't see faults and because nothing is perfect, I need to point out some negatives about Zstandard. (Aside: put little faith in the words uttered by someone who can't find a fault in something they praise.)

First, the framing format is a bit heavyweight in some scenarios. The frame header is at least 6 bytes. For input of 256-65791 bytes, recording the original source size and its checksum will result in a 12 byte frame. Zlib, by contrast, is only 6 bytes for this scenario. When storing tens of thousands of compressed records (this is a use case in Mercurial), the frame overhead can matter and this can make it difficult for compressed Zstandard data to be as small as zlib for very small inputs. (It's worth noting that zlib doesn't store the decompressed size in its header. There are pros and cons to this, which I'll discuss in my eventual post about python-zstandard and how it achieves optimal performance.) If the frame overhead matters to you, the zstd C API does expose a block API that operates at a level below the framing format, allowing you to roll your own framing protocol. I also filed a GitHub issue to make the 4 byte magic number optional, which would go a long way to cutting down on frame overhead.

Second, the C API is not yet fully stabilized. There are a number of functions marked as experimental that aren't exported from the shared library and are only available via static linking. There's a ton of useful functionality in there, including low-level compression parameter adjustment, digested dictionaries (for reusing computed dictionaries across multiple contexts), and the multi-threaded compression API. python-zstandard makes heavy use of these experimental APIs. This requires bundling zstd with python-zstandard and statically linking with this known version because functionality could change at any time. This is a bit annoying, especially for distro packagers.

Third, the low-level compression parameters are under-documented. I think I understand what a lot of them do. But it isn't obvious when I should consider adjusting what. The default compression levels seem to work pretty well and map to reasonable compression parameters. But a few times I've noticed that tweaking things slightly can result in desirable improvements. I wish there were a guide of sorts to help you tune these parameters.

Fourth, dictionary compression is still a bit too complicated and hand-wavy for my liking. I can measure obvious benefits when using it largely out of the box with some corpora. But it isn't always a win and the cost for training dictionaries is too high to justify using it outside of scenarios where you are pretty sure it will be beneficial. When I do use it, I'm not sure which compression levels it works best with, how many samples need to be fed into the dictionary trainer, which training algorithm to use, etc. If that isn't enough, there is also the concept of content-only dictionaries where you use a fulltext as the dictionary. This can be useful for delta-encoding schemes (where compression effectively acts like a diff/delta generator instead of using something like Myers diff). If this topic interests you, there is a thread on the Mercurial developers list where Yann Collet and I discuss this.

Fifth, the patent rights grant. There is some wording in the PATENTS file in the Zstandard project that may... concern lawyers. While Zstandard is covered by the standard BSD 3-Clause license, that supplemental PATENTS file may scare some lawyers enough that you won't be able to use Zstandard. You may want to talk to a lawyer before using Zstandard, especially if you or your company likes initiating patent lawsuits against companies (or wishes to reserve that right - as many companies do), as that is the condition upon which the license terminates. Note that there is a long history between Facebook and consumers of its open source software regarding this language in the PATENTS file. Do a search for React patent grant to read more.

Sixth and finally, Zstandard is still relatively new. I can totally relate to holding off until something new and shiny proves itself. That being said, the Zstandard framing protocol has some escape hatches for future needs. And, the project proved during its pre-1.0 days that it knows how to handle backwards and future compatibility issues. And considering Facebook and others are using Zstandard in production, I wouldn't be too worried. I think the biggest risk is to people (like me) who are writing code against the experimental C APIs. But even then, the changes to the experimental APIs in the past several months have been minor. I'm not losing sleep over it.

That may seem like a long and concerning list. Most of the issues are relatively minor. The language in the PATENTS file may be a showstopper to some. From my perspective, the biggest thing Zstandard has going against it is its youth. But that will only improve with age. While I'm usually pretty conservative about adopting new technology (I've gotten burned enough times that I prefer the neophytes do the field testing for me), the upside to using Zstandard is potentially drastic performance and efficiency gains. And that can translate to success versus failure or millions of dollars in saved infrastructure costs and productivity gains. I'm willing to take my chances.

Conclusion

For the corpora I've thrown at it, Zstandard handily outperforms zlib in almost every dimension. And, it even manages to best other modern compression algorithms like brotli in many tests.

The underlying algorithm and techniques used by Zstandard are highly parameterized, lending themselves to a variety of use cases from embedded hardware to massive data crunching machines with hundreds of gigabytes of memory and dozens of CPU cores.

The C API is well-designed and facilitates high performance and adaptability to numerous use cases. It is batteries included, providing functions to train dictionaries and perform multi-threaded compression.

Zstandard is backed by Facebook and seems to have a healthy open source culture on Github. My interactions with Yann Collet have been positive and he seems to be a great project maintainer.

Zstandard is an exciting advancement for data compression and therefore for the entire computing field. As someone who has lived in the world of zlib for years, was a casual user of compression, and thought zlib was good enough for most use cases, I can attest that Zstandard is game changing. After being enlightened to all the advantages of Zstandard, I'll never casually use zlib again: it's just too slow and inflexible for the needs of modern computing. If you use compression, I highly recommend investigating Zstandard.

(I updated the post on 2017-03-08 to include a paragraph about the supplemental license in the PATENTS file.)

Read and Post Comments

Automatic Python Static Analysis on MozReview

January 24, 2015 at 11:30 PM | categories: Python, MozReview, Mozilla, code review | View Comments

A bunch of us were in Toronto last week hacking on MozReview.

One of the cool things we did was deploy a bot for performing Python static analysis. If you submit some .py files to MozReview, the bot should leave a review. If it finds violations (it uses flake8 internally), it will open an issue for each violation. It also leaves a comment that should hopefully give enough detail on how to fix the problem.

While we haven't done much in the way of performance optimizations, the bot typically submits results less than 10 seconds after the review is posted! So, a human should never be reviewing Python that the bot hasn't seen. This means you can stop thinking about style nits and start thinking about what the code does.

This bot should be considered an alpha feature. The code for the bot isn't even checked in yet. We're running the bot against production to get a feel for how it behaves. If things don't go well, we'll turn it off until the problems are fixed.

We'd like to eventually deploy C++, JavaScript, etc bots. Python won out because it was the easiest to integrate (it has sane and efficient tooling that is compatible with Mozilla's code bases - most existing JavaScript tools won't work with Gecko-flavored JavaScript, sadly).

I'd also like to eventually make it easier to locally run the same static analysis we run in MozReview. Addressing problems locally before pushing is a no-brainer since it avoids needless context switching from other people and is thus better for productivity. This will come in time.

Report issues in #mozreview or in the Developer Services :: MozReview Bugzilla component.

Read and Post Comments

Next Page ยป