Surprisingly Slow

April 06, 2021 at 07:00 AM | categories: Programming

I have an affinity for performance optimization and making software as efficient as possible. Over the years, I've encountered specific instances and common patterns that make software or computers slow. In this post, I'll shine a spotlight on some of them.

I'm titling this post Surprisingly Slow because the slowness was either surprising to me or the sub-optimal practices leading to slowness are prevalent enough that I think many programmers would be surprised by their existence.

The sections below are largely independent. So feel free to cherry pick the ones that interest you.

Environment Detection in Build Systems (e.g. configure and cmake)

This is the topic that inspired this post.

Build systems often feature an environment detection / configuration phase before the build phase. In UNIX land, autoconf generated configure scripts are prevalent. CMake is also popular. These tools run a bunch of code to probe the state of the current system so that the build configuration is appropriate for the current build environment. For example, they'll probe for which compiler to use, its version, and what bugs and capabilities it has.

This environment detection and configuration is a necessary evil because machines and environments often vary substantially and you need to account for those variances.

The problem is that this configuration step often takes longer to run than the build itself! Build systems for small programs or libraries will often spend 10+ seconds running configure and complete the actual compilation and linking in a fraction of that time. In other words, the setup to perform the build takes longer than the build itself!

Depending on how many CPU cores you have, the discrepancy may not be obvious. But I have a 16 core / 32 thread Ryzen 5950X as my primary PC and the relative slowness of the configuration step is painful to observe.

What I find even more shocking is that configuration time often still eclipses actual build time even for large projects. I'm not sure if this is still true, but a few years ago Mozilla observed that building LLVM/Clang on a 96 vCPU EC2 instance resulted in more time spent in cmake/configuring than compiling and linking! And that's a very large C++ project with thousands of source files being compiled!

Build configuration is often a discrete step that executes serially before what most people consider the actual build. To improve efficiency, build configuration needs to be parallelized. Even better, it should be integrated into the main build DAG itself so parts of the build can start running without having to wait for all build configuration. Unfortunately, many common tools performing build configuration can't easily be adapted to this model. So there's not much many of us can do.

Another solution to this problem is avoiding the problem of environment detection in the first place. If you have deterministic and reproducible build environments, you can take a lot of shortcuts to skip environment detection that just isn't needed any more. This is more or less the approach of modern build tools like Bazel. I do wonder how much of the speed gains from tools like Bazel are due to eliminating environment configuration. I suspect it is a lot!

New Process Overhead on Windows

New processes on Windows can't be spawned as quickly as they can on POSIX based operating systems, like Linux. On Windows, assume a new process will take 10-30ms to spawn. On Linux, new processes (often via fork() + exec() will take single digit milliseconds to spawn, if that).

However, thread creation on Windows is very fast (~dozens of microseconds).

These Stack Overflow have some more details.

A few dozen milliseconds is an eternity in CPU time. And it is long enough that it eats into a large percentage of the time budget for people to perceive something as instantaneous. So this may contribute to the perception that Windows is slower than Linux.

If your program architecture consists of spawning new processes left and right (this is common in UNIX land), this can pose performance problems on Windows, as the overhead of new process creation on Windows can really add up:

10ms * 1,000 invocations = 10s
20ms * 10,000 invocations = 200s
30ms * 100,000 invocations = 3,000s

Using the example of configure above, configure files are often shell scripts. And shell scripts often do a lot of their work by spawning other processes like grep, sed, and sort. Even the [ operator could be a new process (seriously: there's probably a /usr/bin/[ executable in your POSIX environment). (Although [ might be a shell built-in.) Command pipe chains (e.g. command | grep | awk) spawn multiple processes serially and can be visually slow to run. Anyway, it is not uncommon for a configure script to spawn thousands of new processes. Assuming 10ms per process, at 1,000 invocations that is 10s of overhead just spawning new processes! This further exacerbates the problem in the previous section!

If your software runs on Windows, consider the impact that relatively slow process spawning will have. Consider a multi-threaded architecture or using longer-lived daemon/background processes instead.

Closing File Handles on Windows

Many years ago I was profiling Mercurial to help improve the working directory checkout speed on Windows, as users were observing that checkout times on Windows were much slower than on Linux, even on the same machine.

I thought I could chalk this up to NTFS versus Linux filesystems or general kernel/OS level efficiency differences. What I actually learned was much more surprising.

When I started profiling Mercurial on Windows, I observed that most I/O APIs were completing in a few dozen microseconds, maybe a single millisecond or two ever now and then. Windows/NTFS performance seemed great!

Except for CloseHandle(). These calls were often taking 1-10+ milliseconds to complete. It seemed odd to me that file writes - even sustained file writes that were sufficient to blow past any write buffering capacity - were fast but closes slow. It was even more perplexing that CloseHandle() was slow even if you were using completion ports (i.e. async I/O). This behavior for completion ports was counter to what the MSDN documentation said should happen (the function should return immediately and its status can be retrieved later).

While I didn't realize it at the time, the cause for this was/is Windows Defender. Windows Defender (and other anti-virus / scanning software) typically work on Windows by installing what's called a filesystem filter driver. This is a kernel driver that essentially hooks itself into the kernel and receives callbacks on I/O and filesystem events. It turns out the close file callback triggers scanning of written data. And this scanning appears to occur synchronously, blocking CloseHandle() from returning. This adds milliseconds of overhead. The net effect is that file mutation I/O on Windows is drastically reduced by Windows Defender and other A/V scanners.

As far as I can tell, as long as Windows Defender (and presumably other A/V scanners) are running, there's no way to make the Windows I/O APIs consistently fast. You can disable A/V scanning (at your own peril). But the trick that Mercurial employs (which has later been emulated by rustup among other tools) is to use a thread pool for calling CloseHandle(). Even if you perform all file open and write I/O on a single thread and use a background thread pool only for calling CloseHandle(), you can see a >3x speedup in time to write files. This optimization should ideally be employed by any software that creates or mutates as little as a few hundred files on Windows. This includes version control tools, installers, and archive extraction tools. Fun fact: rustup can extract tar files on Windows faster than open source and commercial fast extraction/copy tools because it employs this trick and more. I believe rustup on Windows is actually faster at extracting tar archives than it is on Linux!

The artificial I/O latency added by scanning software such as Windows Defender is super annoying. But the performance gains from working around it by using a thread pool for background is often worth the complexity. I have no doubt that if this optimization were baked into popular Windows tools (namely installers), people would be shocked by how much faster things could be.

Writing to Terminals

As a maintainer of Firefox's build system, I fielded a handful of reports from people complaining about builds being slower than their peers on identical hardware. While there are many causes for this, one of the most surprising was the impact the terminal has on build performance.

Writing to the terminal is usually fast. Until it isn't.

What I learned is that writing tons of output or getting clever with writing to the terminal (e.g. writing colors, moving the cursor position to write over existing content) can drastically slow down applications.

Writing to the terminal via stderr/stdout is likely performed via blocking I/O. So if the thing handling your write() (the terminal emulator) doesn't finish its handling promptly, your process just sits around waiting on the terminal to do its thing.

We discovered that different terminals have their own quirks. Historically, the Windows Command Prompt and the built-in Terminal.app on macOS were very slow at handling tons of output. I remember (but can't find the bug or commit to Firefox) when we made the build system quiet by default and that reduced build times by minutes in some configurations.

A few years ago, npm infamously had a performance sucking progress spinner. While I'm not sure how much of this was terminal slowness versus calling progress update code too frequently, the terminal likely played a part because terminals do have a limit to how often they can accept input to draw.

I've found that modern terminals are better about writing a ton of plain text than they were in ~2012, when I was tackling these problems in Firefox's build system. But I would still exercise extreme caution when doing fancy things with the terminal, like coloring text, drawing footers, etc. Always use buffered I/O to minimize the number of write() actually going to the terminal, flushing as needed (hopefully sparingly). Consider using an async thread for writing to stdout/stderr. Record the total time spent in blocking I/O to stdout/stderr so you can measure terminal I/O latency. And periodically compare the wall time delta between stdout/stderr connected to a terminal and /dev/null when running your program to see if there is a discrepancy worth caring about. Finally, consider throttling writes to the terminal. Instead of writing a footer after every line of output, consider buffering lines for a few milliseconds and emitting all lines plus the new footer in batches. If drawing a progress bar or spinner or something of that nature, I would limit drawing to ~10 Hz to minimize terminal overhead.

Thermal Throttling / ACPI C/P-States / Processor Throttling Behavior

We like to think that a computer and its processors are either on or off. If only things were that simple.

Processors are constantly changing their operating envelope as they are running. The following statements are all true (although not every item applies to all machines or CPU models):

The MHz each CPU core is running at can fluctuate wildly from 1 second to the next.
CPU cores may go to sleep or enter a very low power mode, even if others are running.
Cores may underclock significantly if temperature goes beyond a threshold. They may refuse to run faster until the temperature drops. Faulty sensors can lead to premature behavior.
Cores may only reach their maximum frequency if other cores are also running. The physical proximity of that other core may matter.
It could take dozens, hundreds, or even thousands of milliseconds for an idling core to ramp up to its full speed.
The behavior of power scaling can vary substantially depending on whether a machine is connected to external power or running off the battery.
The behavior of power scaling can vary substantially depending on whether the battery is fully charged or nearly empty.
Apple laptops may exhibit thermal throttling when charging from the left side. (Yes, seriously: always charge your MacBook Pro from the right. And if your employees use Apple laptops for CPU heavy tasks, consider an awareness campaign to encourage charging from the right side. Even better, deploy software that checks for left side charging and alert accordingly. Although I have yet to find any software or API to detect this.)
A core may slow down in order to process certain instructions (like AVX-512).

Modern CPUs are really dynamic beasts and their operating behavior is often seemingly unpredictable. Furthermore, CPU models can vary from one to the next. For example, an EPYC or Xeon processor will likely behave differently from a Ryzen or Core i7/i9 which will behave differently depending on whether you are running in a desktop or laptop. (I observed a few years ago that Xeon cores won't turbo as easily as consumer grade CPUs.)

Power fluctuations and their impact on performance are one of the reasons why it is extremely difficult to conduct proper benchmarks. When benchmarking, you need to control the power variable or at least report its state so results are qualified appropriately. I am very skeptical of benchmark results that don't report the power configuration/methodology (this is most of them, sadly) and especially of benchmarks conducted on laptops, as battery operated devices are much more susceptible to power throttling than desktops or servers.

I have personally had a MacBook Pro become thermal throttled because an internal screw came loose and blocked a fan from spinning. macOS didn't warn me: all I knew was that my Firefox builds become 2-3x slower for no apparent reason! I have also observed my MacBook Pro becoming hot due to left side charging. Charging from the right magically made things faster.

At Mozilla, when we started rolling out Xeon desktops to employees, we had reports of wildly varying build speeds. On some operating systems (Mozilla had very lax central machine provisioning and allowed people full domain of their company issued hardware), the default ACPI C/P-States were such that CPU cores were scaling differently.

What we observed was the compile phase of the build was fine. But some people were reporting linking times 2-4x longer (dozens of seconds to minutes) than others on equivalent configurations! This was a big deal because the wall time of an incremental/non-full build is dominated by linking time. We eventually discovered that on the slow machines, the CPU core doing the linking was only running at 25-50% of its potential. Think 1.0-1.5 GHz. But if you started additional CPU heavy tasks, that core ramped up. We discovered that different operating systems had different defaults for the ACPI C/P-States. The more conservative settings would result in CPU cores not scaling their frequency unless there was sufficient CPU load to merit it. Changing to more aggressive power settings ensured better and consistent results.

Laptops are highly susceptible to thermal throttling and aggressive power throttling to conserve battery. I hold the general opinion that laptops are just too variable to have reliable performance. Given the choice, I want CPU heavy workloads running in controlled and observed desktops or server environments.

But servers aren't immune: their ACPI C-State and P-State settings can drastically impact performance. Dialing these up to max so all the cores run at full (or are ready to run at full in a few milliseconds) is possible. However, this may greatly increase your power consumption. You can do this on some cloud providers (like AWS) for no additional direct cost to you. However, higher energy consumption is bad for the environment. Data centers already have a carbon footprint about the size of the airline industry (during non-pandemic times) and that footprint is growing. So think about your ethical responsibilities to the environment before having your server fleet consume potentially megawatts more power.

Python, Node.js, Ruby, and other Interpreter Startup Overhead

Complex systems will often execute Python, Node.js, and other interpreters thousands or more times during their execution. For example, the Firefox build system invokes thousands of Python processes performing common tasks, such as wrapping the compiler invocation. And the Mercurial test harness invokes thousands of Python processes by running hg as part of its testing. I've heard of similar stories involving Node.js, Ruby, and other interpreters, often in the context of use in build systems.

An oft ignored fact about launching a new interpreter process is that each invocation often takes single to dozens of milliseconds to initialize the interpreter. i.e. the new process spends time at the beginning of process execution just getting to the code you are telling it to run. Sometimes the new process overhead is so bad that the slowdown is obvious and rules out the use of a technology. The JVM historically has been notorious for this, which is why use of Java typically entails fewer, longer-running processes over more, domain-limited processes.

I've written about Python's startup overhead before. In 2014 I measured that Mercurial's test harness spends 10-18% of its total CPU time just getting to the point where the interpreter/process can run custom bytecode and 30-38% of its total CPU time getting to the point where Mercurial performs command dispatch (additional time here is mostly module importing overhead).

You may think that a few milliseconds of overhead can't matter that much. But if you multiply by 1,000, 10,000, 100,000 or more, milliseconds matter:

1ms * 1,000 invocations = 1s
10ms * 10,000 invocations = 100s
100ms * 100,000 invocations = 10,000s (2.77 hours)

On Windows, this problem is compounded because of it relatively slow new process startup (see section above).

Programmers need to think long and hard about your process invocation model. Consider the use of fewer processes and/or consider alternative programming languages that don't have significant startup overhead if this could become a problem (anything that compiles down to assembly is usually fine).

Pretty Much all Storage I/O

Of my general affinity for performance optimization, I have a special affinity for I/O optimization. I think the main reason is that the disconnect between the potential for modern storage devices and what is actually achieved is so wide. On paper, software should be getting ~10x the performance from modern storage devices than what we typically see.

Modern storage devices are absurdly fast. The NVMe storage in my primary PC can sustain reads at >3 GB/s (>6 GB/s sequential), writes at ~1 GB/s (4+ GB/s sequential), can perform >500,000 I/O operations per second, and can service many I/O operations in the ~10 microsecond latency range. Modern NVMe storage is roughly on par with the performance of DDR2 DRAM (launched in 2003) in terms of throughput (latency still trails but ~10us is nothing to scoff at).

For comparison, the 1 TB Western Digital Caviar Black spinning disk I retired from my PC a few weeks ago can only do ~90 MB/s sequential reads and writes, 1-2 MB/s random reads and writes, ~12 ms access times. I'm unsure what IOPS is, but considering ~12 ms access times and the physical nature of spinning disks, it can't be more than a few hundred.

Modern NVMe storage is 1.5-3 magnitudes faster than the best spinning disks from little over a decade ago. So why isn't all storage I/O ~instantaneous?

The short answer is that most software fails to utilize the potential of modern storage devices or even worse actively undermines it through bad practices.

For the former, I'll refer you to the excellent Modern Storage is Plenty Fast. It is the APIs That are Bad. tl;dr you can harness the full power of your modern storage device if you bypass the standard OS/kernel I/O primitives and issue I/O operations directly against the device. So, software abstractions in the OS/kernel are eating a lot of potential.

For the software undermining storage device potential aspect, I'll briefly touch on the fsync() POSIX function. By calling this function, you effectively say be sure the state of this file descriptor is persisted to the storage device or I don't want to lose any changes I've made.

Data consistency and durability are important. But the cost to achieving them can be absurdly high. And as it turns out, it is also subtly difficult to do correctly in practice. I'll refer you to Dan Luu's excellent Files are Hard. The papers linked offer a sobering assessment. I'll reinforce the message with PostgreSQL's fsync() surprise, which chronicles how PostgreSQL maintainers learned about how Linux can flat out drop errors when performing device I/O, leading to data corruption. Yikes!

Anyway, about fsync(). The concept of fsync() is sound: ensure this thing is persisted to the storage device. But the implementation is often a pile of inefficiency leading to slowness.

On many Linux filesystems (including ext4), the implementation of fsync() is such that upon calls, all unflushed writes are persisted to storage. So if process A writes out a 1 GB file and process B writes 1 byte to another file and calls fsync() on that single byte write, Linux/ext4 will need to write 1 GB to the storage device, not 1 byte. So on Linux/ext4, all it takes is a random process somewhere to issue fsync() and all dirty page cache entries need to be flushed. On most systems, there's usually something continuously incurring write I/O, so the amount of storage device I/O incurred by fsync() is almost always larger than just the mutated file/directory you actually want persisted.

This behavior can cause a ton of problems. For starters, it artificially increases I/O latency. You'd think that calling fsync() after a minimal change would be ~instantaneous. But if there are lots of dirty pages to be flushed, it could take seconds. At my current employer, we ran into this exact problem with GitHub Enterprise, which has a monolithic architecture. A MySQL database was running off the same ext4 filesystem as the Git repositories. MySQL will call fsync() frequently to ensure transactions and the transaction journal are persisted to storage. But if a Git GC were running and Git just finished writing a multi-gigabyte packfile, MySQL's fsync() would be stuck waiting on Git's large write to finish persisting. This led to slowness of future MySQL transactions and even some application-level timeouts. When people say databases and other stores should be isolated to their own volumes/filesystems, fsync()'s wonky behavior is a big reason why.

Fortunately, newer versions of Linux/ext4 contain a fast commits feature that changes behavior and enables more granular flushing of fsync() to storage, just like it is documented to do. But as the feature is pretty new, it could take a while to stabilize and make its way to distros. I can't wait for it though!

Another problem with fsync() is that it is called more often than it needs to be. Now, if you have mission critical data and need consistency and durability, you should absolutely be calling fsync() appropriately. But the reality is that many data workloads and machine environments don't actually need strong data guarantees!

Take for example Kubernetes pods or CI runners. Or even servers for a stateless service. Ask yourself, what's the worst that could happen if the machine loses power and there is data loss on the local filesystem? In a lot of scenarios the answer is nothing. You've designed your system to be stateless and fault tolerant. You manage your servers as cattle. You treat local filesystems as ephemeral. So if a machine fails, you provision a new one to replace it. In these scenarios, fsync() buys you little to nothing but can cost you a lot!

The cost of avoidable fsync() can be substantial. Combined with the inefficient global flushing behavior of Linux/ext4, it can be a performance sapper, especially on slower storage devices. Fortunately, there are options. Many databases and other popular software has a way to prevent the issuance of fsync(). If your data is ephemeral, consider disabling fsync() for a likely significant performance boost! For software that doesn't support disabling fsync(), the aptly named eatmydata tool and LD_PRELOAD library can be used to nerf fsync() and other similar functionality by intercepting the function calls and making them no-op. Last but not least, for ephemeral machines, consider building a patched Linux kernel that turns fsync() and friends into no-ops. (I'm not sure of anyone who does this. But I've considered it because getting eatmydata to work in places like launched containers can be a bit of a pain.)

I'll close this section with a link to my favorite commit to the Firefox repository: Disable Places during reftests, preventing 50 GB of I/O. While this commit goes beyond disabling fsync(), fsync() (and its Windows equivalent) was responsible for some of the performance loss. Excessive I/O and needless persisting of changes to device can really sap performance. Storage software usually errors on the side of consistency (this is the correct default in my opinion). Given the costs that consistency imposes, you should seriously consider nerfing the guarantees and speeding up I/O when that option is viable for you.

Data Compression

I could write an entire post on the topic of data compression and its widespread suboptimal use. Here is the concise version.

At its core, data compression is a trade-off between CPU and I/O usage. Typically it involves one of the following scenarios:

I/O (either storage or network) is the bottleneck, so we want to trade more CPU to reduce I/O throughput.
At rest storage is expensive, so we want to trade more CPU for lower storage utilization/costs.

Since the early days of computing, a maxim has been that storage is slow and expensive compared to CPU. So trading CPU to reduce storage utilization seemed like a solid bet.

Fast forward to 2021.

As I wrote in the previous section, modern storage I/O is absurdly fast. It is also historically cheap.

Networks have also gotten faster. 1 gbps (125 MB/s) is pretty universal at this point. 2.5 gbps (312 MB/s) is getting deployed in consumer and office environments. 10 gbps (1250 MB/s) is common in data centers. And faster than 10 gbps is possible.

Meanwhile CPUs have somewhat plateaued in their single core performance in the past decade. We've been stuck at ~4 GHz for years. All of the performance gains from CPUs have come from adding more CPU cores to the package and instructions per cycle (IPC) efficiency wins (we've also gotten some agonizing security vulnerabilities like Spectre and Meltdown out of this IPC work as well).

What this all means is that the relative performance difference between CPUs and I/O has compressed significantly (pardon my pun). ~30 years ago, CPUs ran at ~100 MHz and Internet was using dial-up at say 50 kbps, or 0.05 mbps, or 6.25 kBps. That's 16,000 cycles per byte. Today, we're at ~4 GHz with say 1 Gbps / 125 MB/s networks. That's 32 cycles per byte, a decrease of 500x. (In fairness, the ratio closes when you consider that we likely have >1 CPU core competing for I/O and factor in IPC gains. But we're still talking about the relative difference in CPU and I/O decreasing by 1-1.5 magnitudes.) Years ago, trading CPU to lessen the I/O load was often obviously correct. Today, because of the advancements in I/O performance relative to CPU and a substantially reduced cycles per I/O byte budget, the answer is a lot murkier.

Not helping is the prevalence of ancient compression algorithms. DEFLATE - the algorithm behind the ubiquitous zlib library and gzip data format - is ~30 years old. DEFLATE was designed in an era when computers had like 1 MB RAM and 100 MB hard drives. Different times.

DEFLATE/zlib became very popular in a world where I/O was much slower and compression was often a necessity. Not using compression on a dial-up modem resulted in massive performance differences! And because of its popularity in the early days of the Internet, DEFLATE/zlib is available in the standard library of many programming languages. It seems to be the first compression format people reach for when someone says/thinks add compression.

The ubiquity of zlib is good from a dependency perspective: everyone can read zlib/gzip. But for scenarios where you control the reader and writer, use of zlib in 2021 constitutes negligence because its performance lags contemporary solutions. Modern compression libraries (zstandard is my favorite) can yield substantially faster compression and decompression speed while delivering better compression ratios in most data sets. My 2017 Better Compression with Zstandard post dives into the numbers. (I've been meaning to revisit that post since zstandard has since seen multiple 10+% speedups in subsequent releases, making it even more compelling.) If you don't need the ubiquity of zlib (e.g. you control the writers and readers), there's little reason to use zlib over something more modern. Compared to zlib, modern compression libraries like zstandard are the closest thing to magical pixie dust that you can sprinkle on your software for free performance.

If you are using compression (especially zlib) for real-time compression (sending compressed data somewhere where it will be decompressed immediately), you need to measure the line speed of the compressor and decompressor. Then compare that to the uncompressed line speed. Are you bottlenecked by I/O in the uncompressed case? If not, do you need the bandwidth or I/O capacity being saved by compression? If not, why are you using compression at all? You just measured that all compression did was artificially slow down your software for no reason! Given that zlib compression will often fail to saturate a 1 gbps link, there's a very real chance your use of compression introduces an artificial CPU bottleneck!

If you are using compression (especially zlib) for data archiving (storing compressed data somewhere where it will be decompressed eventually), you need to measure and compare compression ratios and line speeds of different compression formats and their settings. Like the real-time compression scenario, if decompression reduces your line speed from uncompressed, you are artificially slowing down access to your data. Maybe that's justified to save on storage costs. But in many cases, you can swap in a different compression library and get similar to better compression ratios while achieving better (de)compression speeds. Who wouldn't want free performance and storage cost reductions?

As an aside, one of the reasons I love zstandard is it can be tuned from something that is screaming fast (GB/s at compression and decompression ends) to something that is very slow on the compression side but yields terrific compression ratios, while still preserving GB/s decompression speeds. This enables you to use the same format for vastly different use cases. You can also dynamically change the storage characteristics of your data. For example, you can initially write data with a fast setting so you aren't CPU constrained on the writer. Then you can have some batch job come around and recompress your data with more aggressive settings, making it much smaller. It's not like zlib where the range of compression settings goes from kinda slow and not very good compression ratios to pretty slow and still not very good compression ratios.

When you know to look for it, inefficiency due to unjustified use of compression or failure to leverage modern compression libraries is everywhere. Here are some common operations in my daily workflow that are bottlenecked by use of slow compression formats and could be made faster by using a different compression format:

Installing Apt packages (packages are gzip compressed). (Fun fact, installing apt packages is also subject to fsync() slowness as described above because the package manager will issue an fsync() at least once for each package.)
Installing Homebrew packages (packages are gzip compressed).
Installing Python packages via pip (source archives are gzip tarballs and wheels are zip files, which use zlib compression).
Pushing/pulling Docker images (layers inside Docker images are gzip compressed).
Git (wire protocol data exchange and on-disk storage is using zlib). (When I added zstandard support to Mercurial, it reduced the transfer size from servers to ~89% of original while using ~60% of the server-side CPU.)

In the corporate world, there's probably multiple petabyte scale data warehouses, data lakes, data coliseums (I can't keep up with what we're calling them now) storing data in gzip. Dozens of terabytes could likely be shaved by moving to something like zstandard. If using LZMA (which has extremely slow decompression speeds), storage costs are cheap, but data access is extremely slow, making your data querying slow. I haven't had the opportunity to measure it, but I suspect some of the reputation Hadoop and other Big Data systems have for being slow is because they are CPU constrained by suboptimal use of compression.

My experience is that many programmers don't understand the trade-offs and nuances of compression and/or lack knowledge about the existence of more modern, superior compression libraries. Instead, the collective opinion is compression is good, use [zlib] compression. Like many things in software, the real world is complex and nuanced. The dynamics of the relative power and cost of computer components has shifted the pendulum towards compression adding more cost than it saves. And it hasn't helped that industry still widely uses a ~30 year old compression format (DEFLATE/zlib) that is far from ideal for modern computers. If you take the time to measure, I'm sure you'll find many cases where use of compression is either ill-advised or would benefit from a more modern compression library (like zstandard).

x86_64 Binaries in Linux Distribution Packages

Linux distributions often provide pre-built binaries to install via packaging tools (e.g. apt install or yum install).

To keep things simpler and to ensure maximum compatibility, these pre-built binaries are built such that they run on as many computers as possible. Currently, many Linux distributions (including RHEL and Debian) have binary compatibility with the first x86_64 processor, the AMD K8, launched in 2003. These processors featured modern instruction sets, like MMX, 3DNow!, SSE, and SSE2.

What this means is that by default, binaries provided by many Linux distributions won't contain instructions from modern Instruction Set Architectures (ISAs). No SSE4. No AVX. No AVX2. And more. (Well, technically binaries can contain newer instructions. But they likely won't be in default code paths and there will likely be run-time dispatching code to opt into using them.)

Furthermore, C/C++ compilers (like Clang and GCC) will also target an ancient x86_64 microarchitecture level by default (this is where the distribution's binary compatibility defaults come from). So if you compile your own code and don't specify settings like -march or -mtune to change the default targeting settings, your compiled binaries won't leverage SSE4, AVX, etc. You can still force your application to emit these instructions in dynamic code paths without -march/-mtune overrides. But you have to opt in and add additional code complexity to do that.

Because of the conservative microarchitecture targeting settings of compilers and distribution binaries by extension, that's nearly 20 years of ISA work and efficiency gains from more powerful ISAs (like superlinear vectorized instructions) left on the table. And here I get frustrated when my PRs linger unreviewed for more than a day. Imagine what it is like to be an AMD or Intel engineer and have your ISA work take ~decades to be adopted at scale!

Truth be told, I'm unsure how much of a performance impact this ISA backwards compatibility sacrifices. It will vary heavily from workload to workload. But I have no doubt there are some very large datacenters running CPU intensive workloads that could see massive efficiency gains by leveraging modern ISAs. If you are running thousands of servers and your CPU load isn't coming from a JIT'ed language like Java (JITs can emit instructions for the machine they are running on... because they compile just in time), it might very well be worth compiling CPU heavy packages (and their dependencies of course) from source targeting a modern microarchitecture level so you don't leave the benefits of modern ISAs on the table. And be forewarned: use of modern ISAs isn't a silver bullet! Some instructions can actually result in the CPU underclocking in order to run them, making code using those instructions fast but other code slow.

Maintaining binary compatibility with a vanishingly small number of ancient CPUs at the expense of performance on modern CPUs seems... questionable. Fortunately, Linux distributions and Clang/GCC are paying attention.

GCC 11 and Clang 12 define x86_64-{v2, v3, v4} architecture levels targeting ~Nehalem (released 2008), ~Haswell (released 2013), and AVX-512 (~2015), respectively. So you can add e.g. -march=x86_64-v3 to target Haswell era and newer and have the compiler emit SSE4, AVX, AVX2, and other modern instructions.

RHEL9 will be raising their minimum architecture requirement from x86_64 to x86_64-v2, effectively requiring CPUs from 2008+ instead of 2003+.

If you'd like to learn more about this topic, start at this Pharonix article and follow the links to other articles and mailing list discussions.

It's worth noting that at the time I write this, AWS 4th generation EC2 instances (c4, m4, and r4) all support AVX2 and I believe are compatible with GCC/Clang's x86_64-v3 target. And 5th generation Intel instances have AVX-512, presumably making them compatible with x86_64-v4. So even if your distribution targets x86_64-v2, there is still potential free performance from newer ISAs on the table.

If I were operating a server fleet consisting of thousands of machines, I would be very tempted to compile all packages from source targeting a modern microarchitecture level. This would be costly in terms of complexity. But for some workloads, the performance gains could be worth the effort. And this conservative targeting approach may provide justification for running modern-optimized Linux distributions or cloud vendor specific Linux distributions (e.g. Amazon Linux). I'm unsure if distributions like Amazon Linux take advantage of this. If not, they should look into it!

Read the next section for an example of where failure to leverage modern ISAs translates to a performance loss.

Many Implementations of Myers Diff and Other Line Based Diffing Algorithms

This one is rather domain specific but I find it an illustrious example because the behavior is quite counter-intuitive!

Various classes of software need to take two text documents and emit a textual diff of their contents. Think what git diff displays.

There are various algorithms for generating a diff of text. Myers Diff is probably the most famous. The run-time of the algorithms is proportional to the number of lines. Probably O(nlog(n)) or O(n^2).

These text-based diffing algorithms often operate at the line level (rather than say the byte or codepoint level) because it drastically limits the search space and minimizes n to keep the algorithm run-time in check.

Over the years, various people have realized that when diffing two text documents, large parts of the inputs are often identical (why would you diff unrelated content after all). So most implementations of diff algorithms have a myriad of optimizations to limit the number of lines compared. Two common optimizations are to identify and exclude the common prefix and suffix of the input.

This is over-simplified, but text-based diffing algorithms often do the following:

Split the input into lines.
Hash each line to facilitate fast line equivalence testing (comparing a u32 or u64 checksum is a ton faster than memcmp() or strcmp()).
Identity and exclude common prefix and suffix lines.
Feed remaining lines into diffing algorithm.

The idea is that steps 1-3 - which should be O(n) - reduce work for an algorithm (step 4) with run-time complexity worse than O(n). Sounds good on paper.

So what actually happens?

If you profile a number of these diff implementations, you find that steps 1-3 actually take more time than the supposedly slow/expensive algorithm! How can this be?!

One culprit is the line splitting. Even assuming we can use memory 0-copy / references for storing the line contents (as opposed to allocating a new string to hold each parsed line, which can be much less efficient), splitting text into lines can be grossly inefficient!

There are various reasons for this. Maybe you are decoding the text into code points rather than operate in the domain of bytes (you shouldn't need to decode the entire input to search for newlines). Maybe you are traversing the file one character/byte at a time looking for LF.

An efficient solution to this problem employs the use of vectorized CPU instructions (like AVX/AVX2) which can scan several bytes at a time looking for a sentinel value or matching a byte mask. So instead of 1 instruction per input byte, you have 1/n. Your C runtime library probably has assembly implementations of memchr(), strchr(), and similar functions and automatically chooses the newest/fastest assembly/instructions supported by the run-time CPU (glibc does).

In theory, compilers recognize such patterns and emit modern vectorized instructions automagically. In reality, because the default target ISA of compilers is relatively ancient compared to what your CPU is capable of (see previous section), you are stuck with old instructions and linear scanning. Your best bet is to stick with functions in the C runtime that are probably backed by assembly. (Although watch out for function call overhead.)

Another culprit causing inefficiency is hashing each line. The hashing is performed to reduce equivalence testing to a u32/u64 compare rather than strcmp(). Many implementations don't seem to give much consideration to the hashing algorithm, using something like crc32 or djb2. An inefficiency here is many older hashing algorithms operate at the byte level: you need to feed in 1 byte at a time, update state (XOR if often employed), then feed in the next byte. This is inefficient because it throws away the instruction pipelining and superscalar properties of modern CPUs. A better approach is to use a hashing algorithm that digests 4, 8, or more bytes at a time. Again, this lowers run-time from ~n cycles per byte to ~1/n.

Another common inefficiency is computing the lines and hashes of content in the common prefix and suffix. Use of memcmp() (or even better: hand-rolled assembly to give you the offset of the first divergence) is more efficient, as again, your C runtime library probably has assembly implementations of memcmp() which can compare input at near native memory speed.

I quite enjoy this example because it demonstrates that something that is seemingly O(n) is slower than O(nlog(n))/O(n^2). This is because often the result of the optimization reduces the n of the expensive algorithm to such a small value that the computational complexity is trivial. Compilers targeting ancient microarchitectures and failing to leverage vectorized instructions which unlock superlinear performance further shift the time towards the O(n) optimizations.

Conclusion

Computers and software can be surprisingly slow for surprising reasons. While this post was long and touched on a number of topics, it only scratched the surface of potential topics. I could easily find another 10 topics to write about. But that will have to be for another post.

Before I go, if you find inaccuracies in this post, please shoot me an email (address in resume in site header) so I can correct the post, as I don't want to unintentionally mislead others.

Also, computers and software are complex. When it comes to performance and optimizations, always be measuring. The issues I described could be manifesting in your software and environments but the effort to address them may not be worth the reward. Computers and software, like life, are full of trade-offs. Performance is just one trade-off. Please don't cargo cult my advice without measuring and applying critical thinking first.