Absorbing Commit Changes in Mercurial 4.8

November 05, 2018 at 09:25 AM | categories: Mercurial, Mozilla

Every so often a tool you use introduces a feature that is so useful that you can't imagine how things were before that feature existed. The recent 4.8 release of the Mercurial version control tool introduces such a feature: the hg absorb command.

hg absorb is a mechanism to automatically and intelligently incorporate uncommitted changes into prior commits. Think of it as hg histedit or git rebase -i with auto squashing.

Imagine you have a set of changes to prior commits in your working directory. hg absorb figures out which changes map to which commits and absorbs each of those changes into the appropriate commit. Using hg absorb, you can replace cumbersome and often merge conflict ridden history editing workflows with a single command that often just works. Read on for more details and examples.

Modern version control workflows often entail having multiple unlanded commits in flight. What this looks like varies heavily by the version control tool, standards and review workflows employed by the specific project/repository, and personal preferences.

A workflow practiced by a lot of projects is to author your changes as a sequence of standalone commits, with each commit representing a discrete, logical unit of work. Each commit is then reviewed/evaluated/tested on its own as part of a larger series. (This workflow is practiced by Firefox, the Git and Mercurial projects, and the Linux Kernel, to name a few.)

A common task that arises when working with such a workflow is the need to incorporate changes into an old commit. For example, let's say we have a stack of the following commits:

$ hg show stack
  @  1c114a ansible/hg-web: serve static files as immutable content
  o  d2cf48 ansible/hg-web: synchronize templates earlier
  o  c29f28 ansible/hg-web: convert hgrc to a template
  o  166549 ansible/hg-web: tell hgweb that static files are in /static/
  o  d46d6a ansible/hg-web: serve static template files from httpd
  o  37fdad testing: only print when in verbose mode
 /   (stack base)
o  e44c2e (@) testing: install Mercurial 4.8 final

Contained within this stack are 5 commits changing the way that static files are served by hg.mozilla.org (but that's not important).

Let's say I submit this stack of commits for review. The reviewer spots a problem with the second commit (serve static template files from httpd) and wants me to make a change.

How do you go about making that change?

Again, this depends on the exact tool and workflow you are using.

A common workflow is to not rewrite the existing commits at all: you simply create a new fixup commit on top of the stack, leaving the existing commits as-is. e.g.:

$ hg show stack
  o  deadad fix typo in httpd config
  o  1c114a ansible/hg-web: serve static files as immutable content
  o  d2cf48 ansible/hg-web: synchronize templates earlier
  o  c29f28 ansible/hg-web: convert hgrc to a template
  o  166549 ansible/hg-web: tell hgweb that static files are in /static/
  o  d46d6a ansible/hg-web: serve static template files from httpd
  o  37fdad testing: only print when in verbose mode
 /   (stack base)
o  e44c2e (@) testing: install Mercurial 4.8 final

When the entire series of commits is incorporated into the repository, the end state of the files is the same, so all is well. But this strategy of using fixup commits (while popular - especially with Git-based tooling like GitHub that puts a larger emphasis on the end state of changes rather than the individual commits) isn't practiced by all projects. hg absorb will not help you if this is your workflow.

A popular variation of this fixup commit workflow is to author a new commit then incorporate this commit into a prior commit. This typically involves the following actions:

<save changes to a file>

$ hg commit
<type commit message>

$ hg histedit
<manually choose what actions to perform to what commits>

OR

<save changes to a file>

$ git add <file>
$ git commit
<type commit message>

$ git rebase --interactive
<manually choose what actions to perform to what commits>

Essentially, you produce a new commit, then you run a history editing command. You tell that history editing command what to do (e.g. to squash or fold one commit into another), and it performs the work and produces a set of rewritten commits.

In simple cases, you may make a small change to a single file and things are pretty straightforward: you need to know which two commits to squash together, which is often trivial. But it can be cumbersome if there are several commits and it isn't clear which one should receive the new changes.

In more complex cases, you may make multiple modifications to multiple files. You may even want to squash your fixups into separate commits. And for some code reviews, this complex case can be quite common. It isn't uncommon for me to be incorporating dozens of reviewer-suggested changes across several commits!

These complex use cases are where things can get really complicated for version control tool interactions. Let's say we want to make multiple changes to a file and then incorporate those changes into multiple commits. To keep it simple, let's assume 2 modifications in a single file squashing into 2 commits:

<save changes to file>

$ hg commit --interactive
<select changes to commit>
<type commit message>

$ hg commit
<type commit message>

$ hg histedit
<manually choose what actions to perform to what commits>

OR

<save changes to file>

$ git add --interactive
<select changes to stage>

$ git commit
<type commit message>

$ git add <file>
$ git commit
<type commit message>

$ git rebase --interactive
<manually choose which actions to perform to what commits>

We can see that the number of actions required of the user has already increased substantially. Not captured by the number of lines is the effort that must go into the interactive commands like hg commit --interactive, git add --interactive, hg histedit, and git rebase --interactive. For these commands, users must tell the VCS tool exactly what actions to take. This takes time and imposes cognitive load, which ultimately distracts the user from the task at hand and is bad for concentration and productivity. The user just wants to amend old commits: telling the VCS tool what actions to take is an obstacle in their way. (A compelling argument can be made that the effort these workflows require to produce a clean history is too high, and that it is easier to make the trade-off favoring simpler workflows over cleaner history.)

These kinds of squash fixup workflows are what hg absorb is designed to make easier. When using hg absorb, the above workflow can be reduced to:

<save changes to file>

$ hg absorb
<hit y to accept changes>

OR

<save changes to file>

$ hg absorb --apply-changes

Let's assume the following changes are made in the working directory:

$ hg diff
diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -76,7 +76,7 @@ LimitRequestFields 1000
     # Serve static files straight from disk.
     <Directory /repo/hg/htdocs/static/>
         Options FollowSymLinks
-        AllowOverride NoneTypo
+        AllowOverride None
         Require all granted
     </Directory>

@@ -86,7 +86,7 @@ LimitRequestFields 1000
     # and URLs are versioned by the v-c-t revision, they are immutable
     # and can be served with aggressive caching settings.
     <Location /static/>
-        Header set Cache-Control "max-age=31536000, immutable, bad"
+        Header set Cache-Control "max-age=31536000, immutable"
     </Location>

     #LogLevel debug

That is, we have 2 separate uncommitted changes to ansible/roles/hg-web/templates/vhost.conf.j2.

Here is what happens when we run hg absorb:

$ hg absorb
showing changes for ansible/roles/hg-web/templates/vhost.conf.j2
        @@ -78,1 +78,1 @@
d46d6a7 -        AllowOverride NoneTypo
d46d6a7 +        AllowOverride None
        @@ -88,1 +88,1 @@
1c114a3 -        Header set Cache-Control "max-age=31536000, immutable, bad"
1c114a3 +        Header set Cache-Control "max-age=31536000, immutable"

2 changesets affected
1c114a3 ansible/hg-web: serve static files as immutable content
d46d6a7 ansible/hg-web: serve static template files from httpd
apply changes (yn)?
<press "y">
2 of 2 chunk(s) applied

hg absorb automatically figured out that the 2 separate uncommitted changes mapped to 2 different changesets (Mercurial's term for a commit). It printed a summary of which lines would be changed in which changesets and prompted me to accept its plan for how to proceed. The human effort involved is a quick review of the proposed changes and answering a prompt.

At a technical level, hg absorb finds all uncommitted changes and attempts to map each changed line to an unambiguous prior commit. For every change that can be mapped cleanly, the uncommitted changes are absorbed into the appropriate prior commit. Commits impacted by the operation are rebased automatically. If a change cannot be mapped to an unambiguous prior commit, it is left uncommitted and users can fall back to an existing workflow (e.g. using hg histedit).
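
To build intuition for how this line mapping works, here is a toy sketch in Python. This is my own illustration, not Mercurial's actual implementation (which operates on filelog revisions): each commit is modeled as a snapshot of a single file's lines, an annotate pass records which commit owns each line of the tip, and an uncommitted edit is absorbed only if every line it touches is owned by exactly one commit.

# Toy model of hg absorb's line mapping (illustration only, not Mercurial's
# actual implementation). Each "commit" is a full snapshot of one file's lines.
import difflib

def annotate(snapshots):
    """For each line of the newest snapshot, record the index of the
    commit that introduced or last modified it."""
    owners = [0] * len(snapshots[0])
    for rev in range(1, len(snapshots)):
        old, new = snapshots[rev - 1], snapshots[rev]
        new_owners = [rev] * len(new)
        matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                new_owners[j1:j2] = owners[i1:i2]
        owners = new_owners
    return owners

def absorb(snapshots, working_copy):
    """Map uncommitted edits to prior commits. Returns (plan, leftovers):
    plan maps commit index -> line replacements to fold into that commit;
    leftovers are hunks with no unambiguous target."""
    owners = annotate(snapshots)
    tip = snapshots[-1]
    plan, leftovers = {}, []
    matcher = difflib.SequenceMatcher(a=tip, b=working_copy, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'equal':
            continue
        hunk_owners = set(owners[i1:i2])
        if tag == 'replace' and len(hunk_owners) == 1:
            plan.setdefault(hunk_owners.pop(), []).append(
                (tip[i1:i2], working_copy[j1:j2]))
        else:
            # insertions and hunks spanning multiple commits stay uncommitted
            leftovers.append((tag, tip[i1:i2], working_copy[j1:j2]))
    return plan, leftovers

# Two commits touching different parts of the same file plus two
# uncommitted fixes: each fix maps cleanly to the commit owning its line.
commits = [
    ['<Directory>', 'AllowOverride NoneTypo', '</Directory>'],
    ['<Directory>', 'AllowOverride NoneTypo', '</Directory>',
     '<Location>', 'Cache-Control immutable, bad', '</Location>'],
]
working = ['<Directory>', 'AllowOverride None', '</Directory>',
           '<Location>', 'Cache-Control immutable', '</Location>']
print(absorb(commits, working))

Running this on the toy two-commit stack maps each uncommitted fix to the commit that introduced the corresponding line, mirroring the hg absorb session shown above.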

But wait - there's more!

The automatic rewriting logic of hg absorb is implemented by following the history of lines. This is fundamentally different from the approach taken by hg histedit or git rebase, which tend to rely on 3-way merge strategies to derive a new version of a file given multiple input versions. This approach, combined with the fact that hg absorb skips over changes with an ambiguous application commit, means that hg absorb will never encounter merge conflicts! Now, you may be thinking that because hg absorb only applies changes with unambiguous targets, a classical 3-way merge would have handled those same changes cleanly anyway. That sounds logically correct. But it isn't: hg absorb can avoid merge conflicts in cases where the merging performed by hg histedit or git rebase -i would fail.

The above example attempts to exercise such a use case. Focusing on the initial change:

diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -76,7 +76,7 @@ LimitRequestFields 1000
     # Serve static files straight from disk.
     <Directory /repo/hg/htdocs/static/>
         Options FollowSymLinks
-        AllowOverride NoneTypo
+        AllowOverride None
         Require all granted
     </Directory>

This patch needs to be applied against the commit which introduced it. That commit had the following diff:

diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -73,6 +73,15 @@ LimitRequestFields 1000
         {% endfor %}
     </Location>

+    # Serve static files from templates directory straight from disk.
+    <Directory /repo/hg/hg_templates/static/>
+        Options None
+        AllowOverride NoneTypo
+        Require all granted
+    </Directory>
+
+    Alias /static/ /repo/hg/hg_templates/static/
+
     #LogLevel debug
     LogFormat "%h %v %u %t \"%r\" %>s %b %D \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""
     ErrorLog "/var/log/httpd/hg.mozilla.org/error_log"

But after that commit was another commit with the following change:

diff --git a/ansible/roles/hg-web/templates/vhost.conf.j2 b/ansible/roles/hg-web/templates/vhost.conf.j2
--- a/ansible/roles/hg-web/templates/vhost.conf.j2
+++ b/ansible/roles/hg-web/templates/vhost.conf.j2
@@ -73,14 +73,21 @@ LimitRequestFields 1000
         {% endfor %}
     </Location>

-    # Serve static files from templates directory straight from disk.
-    <Directory /repo/hg/hg_templates/static/>
-        Options None
+    # Serve static files straight from disk.
+    <Directory /repo/hg/htdocs/static/>
+        Options FollowSymLinks
         AllowOverride NoneTypo
         Require all granted
     </Directory>

...

When we use hg histedit or git rebase -i to rewrite this history, the VCS first attempts to re-order commits before squashing 2 commits together. When it attempts to reorder the fixup diff to sit immediately after the commit that introduces the line, there is a good chance the tool will encounter a merge conflict. Essentially, the VCS is thinking: you changed this line, but the lines around the change in the final version are different from the lines in the initial version; I don't know whether those other lines matter, so I don't know what the end state should be, and I'm giving up and letting the user choose.

But since hg absorb operates at the level of line history, it knows that this individual line wasn't actually changed (even though the lines around it were), assumes there is no conflict, and offers to absorb the change. So not only is hg absorb significantly simpler than today's hg histedit or git rebase -i workflows in terms of VCS command interactions, it can also avoid time-consuming merge conflict resolution!

Another feature of hg absorb is that all the rewriting occurs in memory and the working directory is not touched when running the command. This means that the operation is fast (working directory updates often account for a lot of the execution time of hg histedit or git rebase commands). It also means that tools looking at the last modified time of files (e.g. build systems like GNU Make) won't rebuild extra (unrelated) files that were touched as part of updating the working directory to an old commit in order to apply changes. This makes hg absorb more friendly to edit-compile-test-commit loops and allows developers to be more productive.
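
A quick (hypothetical) way to see this property for yourself is to note a file's modification time before and after absorbing changes into history:

$ stat ansible/roles/hg-web/templates/vhost.conf.j2
<note the modification time>

$ hg absorb --apply-changes

$ stat ansible/roles/hg-web/templates/vhost.conf.j2
<the modification time is unchanged, so mtime-based tools see nothing to rebuild>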

And that's hg absorb in a nutshell.

When I first saw a demo of hg absorb at a Mercurial developer meetup, my jaw - along with those all over the room - hit the figurative floor. I thought it was magical and too good to be true. I thought Facebook (the original authors of the feature) were trolling us with an impossible demo. But it was all real. And now hg absorb is available in the core Mercurial distribution for anyone to use.

From my experience, hg absorb just works almost all of the time: I run the command and it maps all of my uncommitted changes to the appropriate commit and there's nothing more for me to do! In a word, it is magical.

To use hg absorb, you'll need to activate the absorb extension. Simply put the following in your hgrc config file:

[extensions]
absorb =
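
You can then confirm the extension is active by asking for the command's help; if the extension isn't enabled, Mercurial won't display the full command help:

$ hg help absorb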

hg absorb is currently an experimental feature. That means there is no commitment to backwards compatibility and some rough edges are expected. I also anticipate new features (such as hg absorb --interactive) will be added before the experimental label is removed. If you encounter problems or want to leave comments, file a bug, make noise in #mercurial on Freenode, or submit a patch. But don't let the experimental label scare you away from using it: hg absorb is being used by some large install bases and also by many of the Mercurial core developers. The experimental label is mainly there because it is a brand new feature in core Mercurial and the experimental label is usually affixed to new features.

If you practice workflows that frequently require amending old commits, I think you'll be shocked at how much easier hg absorb makes these workflows. I think you'll find it to be a game changer: once you use hg absorb, you'll soon wonder how you managed to get work done without it.


Global Kernel Locks in APFS

October 29, 2018 at 02:20 PM | categories: Python, Mercurial, Apple

Over the past several months, a handful of people had been complaining that Mercurial's test harness was executing much slower on Macs. But this slowdown seemingly wasn't occurring on Linux or Windows. And not every Mac user experienced the slowness!

Before jetting off to the Mercurial 4.8 developer meetup in Stockholm a few weeks ago, I sat down with a relatively fresh 6+6 core MacBook Pro and experienced the problem firsthand: on my 4+4 core i7-6700K running Linux, the Mercurial test harness completes in ~12 minutes, but on this MacBook Pro, it was executing in ~38 minutes! On paper, this result doesn't make any sense because there's no way that the MacBook Pro should be ~3x slower than that desktop machine.

Looking at Activity Monitor when running the test harness with 12 tests in parallel revealed something odd: the system was spending ~75% of overall CPU time inside the kernel! When reducing the number of tests that ran in parallel, the percentage of CPU time spent in the kernel decreased and the overall test harness execution time also decreased. This kind of behavior is usually a sign of something very inefficient in kernel land.

I sample profiled all processes on the system when running the Mercurial test harness. Aggregate thread stacks revealed a common pattern: readdir() being in the stack.

Upon closer examination of the stacks, readdir() calls into apfs_vnop_readdir(), which calls into some functions with bt or btree in their name, which in turn call into lck_mtx_lock(), lck_mtx_lock_grab_mutex(), and various other functions with lck_mtx in their name. And the caller of most readdir() invocations appeared to be Python 2.7's module importing mechanism (notably import.c:case_ok()).

APFS refers to the Apple File System, a filesystem that Apple introduced in 2017 and that is the default filesystem for new versions of macOS and iOS. When an older Mac is upgraded to a newer macOS, its HFS+ filesystems are automatically converted to APFS.

While the source code for APFS is not available for me to confirm, the profiling results showing excessive time spent in lck_mtx_lock_grab_mutex() combined with the fact that execution time decreases when the parallel process count decreases leads me to the conclusion that APFS obtains a global kernel lock during read-only operations such as readdir(). In other words, APFS slows down when attempting to perform parallel read-only I/O.

This isn't the first time I've encountered such behavior in a filesystem: last year I blogged about very similar behavior in AUFS, which was making Firefox CI significantly slower.

Because Python 2.7's module importing mechanism was triggering the slowness by calling readdir(), I posted to python-dev about the problem, as I thought it was important to notify the larger Python community. After all, this is a generic problem that affects the performance of starting any Python process when running on APFS. i.e. if your build system invokes many Python processes in parallel, you could be impacted by this. As part of obtaining data for that post, I discovered that Python 3.7 does not call readdir() as part of module importing and therefore doesn't exhibit a severe slowdown. (Python's module importing code was rewritten significantly in Python 3 and the fix was likely introduced well before Python 3.7.)

I've produced a gist that can reproduce the problem. The script essentially performs a recursive directory walk. It exercises the opendir(), readdir(), closedir(), and lstat() functions heavily and is essentially a benchmark of the filesystem and filesystem cache's ability to return file metadata.
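
The gist itself isn't reproduced here, but a rough approximation of what such a script does (a sketch written for illustration, not the actual slow-readdir.py) looks something like this:

#!/usr/bin/env python
# Illustrative approximation of a parallel directory walk benchmark
# (not the actual slow-readdir.py gist). Each worker recursively walks
# a tree and lstat()s every entry, which exercises opendir(), readdir(),
# closedir(), and lstat() under the hood.
import multiprocessing
import os
import sys
import time

def walk_tree(root):
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass
    return count

def main():
    root = sys.argv[1]
    walks = 12       # total number of tree walks to perform
    processes = 12   # how many walks to run in parallel
    start = time.time()
    pool = multiprocessing.Pool(processes)
    pool.map(walk_tree, [root] * walks)
    pool.close()
    pool.join()
    print('ran %d walks across %d processes in %.3fs'
          % (walks, processes, time.time() - start))

if __name__ == '__main__':
    main()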

When you tell it to walk a very large directory tree - say a Firefox version control checkout (which has over 250,000 files) - the excessive time spent in the kernel is very apparent on macOS 10.13 High Sierra:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 172.209s

real    2m52.470s
user    1m54.053s
sys    23m42.808s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 523.440s

real    8m43.740s
user    1m13.397s
sys     3m50.687s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 210.487s

real    3m30.731s
user    2m40.216s
sys    33m34.406s

On the same machine upgraded to macOS 10.14 Mojave, we see a bit of a speedup!:

$ time ./slow-readdir.py -l 12 ~/src/firefox
ran 12 walks across 12 processes in 97.833s

real    1m37.981s
user    1m40.272s
sys    10m49.091s

$ time ./slow-readdir.py -l 12 -j 1 ~/src/firefox
ran 12 walks across 1 processes in 461.415s

real    7m41.657s
user    1m05.830s
sys     3m47.041s

$ time ./slow-readdir.py -l 18 -j 18 ~/src/firefox
ran 18 walks across 18 processes in 140.474s

real    2m20.727s
user    3m01.048s
sys    17m56.228s

Contrast with my i7-6700K Linux machine backed by EXT4:

$ time ./slow-readdir.py -l 8 ~/src/firefox
ran 8 walks across 8 processes in 6.018s

real    0m6.191s
user    0m29.670s
sys     0m17.838s

$ time ./slow-readdir.py -l 8 -j 1 ~/src/firefox
ran 8 walks across 1 processes in 33.958s

real    0m34.164s
user    0m17.136s
sys     0m13.369s

$ time ./slow-readdir.py -l 12 -j 12 ~/src/firefox
ran 12 walks across 12 processes in 25.465s

real    0m25.640s
user    1m4.801s
sys     1m20.488s

It is apparent that macOS 10.14 Mojave has received performance work relative to macOS 10.13! Overall kernel CPU time when performing parallel directory walks has decreased substantially - to ~50% of the original on some invocations! Stacks seem to reveal new code for lock acquisition, so this might indicate generic improvements to the kernel's locking mechanism rather than APFS-specific changes. Changes to file metadata caching could also be responsible for the performance differences, although it is difficult to tell without access to the APFS source code. Despite those improvements, APFS is still spending a lot of CPU time in the kernel. And that kernel CPU time remains very high compared to Linux/EXT4, even for single-process operation.

At this time, I haven't conducted a comprehensive analysis of APFS to determine which other filesystem operations acquire global kernel locks: all I know is that readdir() does. A casual analysis of profiled stacks when running Mercurial's test harness against Python 3.7 still shows apfs_* functions on the stack a lot, which seemingly indicates more APFS slowness under parallel I/O load. But HFS+ exhibited similar problems (it appeared HFS+ used a single I/O thread inside the kernel for many operations, making I/O on macOS pretty bad), so I'm not sure whether these could be considered regressions the way readdir()'s new behavior is.

I've reported this issue to Apple at https://bugreport.apple.com/web/?problemID=45648013 and on OpenRadar at https://openradar.appspot.com/radar?id=5025294012383232. I'm told that issues get more attention from Apple when there are many duplicates of the same issue. So please reference this issue if you file your own report.

Now that I've elaborated on the technical details, I'd like to add some personal commentary. While this post is about APFS, this issue of global kernel locks during common I/O operations is not unique to APFS. I already referenced similar issues in AUFS. And I've encountered similar behaviors with Btrfs (although I can't recall exactly which operations). And NTFS has its own bag of problems.

This seeming pattern of global kernel locks for common filesystem operations and slow filesystems is really rubbing me the wrong way. Modern NVMe SSDs are capable of reading and writing well over 2 gigabytes per second and performing hundreds of thousands of I/O operations per second. We even have Intel soon producing persistent solid state storage that plugs into DIMM slots because it is that friggin fast.

Today's storage hardware is capable of ludicrous performance. It is fast enough that you will likely saturate multiple CPU cores processing the data coming from and going to storage - especially if you are using higher-level, non-JITed (read: slower) programming languages (like Python). There has also been a trend of systems gaining CPU cores faster than they gain per-core performance. And SSDs only achieve these ridiculous IOPS numbers if many I/O operations are queued and can be dispatched efficiently within the storage device. What this all means is that it probably makes sense to use parallel I/O across multiple threads in order to extract all potential performance from your persistent storage layer.

It's also worth noting that we now have solid state storage that outperforms (in some dimensions) what DRAM from ~20 years ago was capable of. Put another way, I/O APIs and even some filesystems were designed in an era when the RAM of the day was slower than what today's persistent storage is capable of! While I'm no filesystems or kernel expert, it does seem a bit silly to be using APIs and filesystems designed for an era when storage was multiple orders of magnitude slower and systems only had a single CPU core.

My takeaway is I can't help but feel that systems-level software (including the kernel) is severely limiting the performance potential of modern storage devices. If we have e.g. global kernel locks when performing common I/O operations, there's no chance we'll come close to harnessing the full potential of today's storage hardware. Furthermore, the behavior of filesystems is woefully under-documented and software developers have little solid advice for how to achieve optimal I/O performance. As someone who cares about performance, I want to squeeze every iota of potential out of hardware. But the lack of documentation telling me which operations acquire locks, which strategies are best for, say, reading or writing 10,000 files using N threads, etc. makes this extremely difficult. And even if this documentation existed, because of differences in behavior across filesystems and operating systems and the difficulty in programmatically determining the characteristics of filesystems at run time, it is practically impossible to design a one-size-fits-all approach to high performance I/O.

The filesystem is a powerful concept. I want to agree with and use the everything is a file philosophy. Unfortunately, filesystems don't appear to be scaling very well to support the potential of modern day storage technology. We're probably at the point where commodity-priced solid state storage is far more capable than today's software for the majority of applications. Storage hardware manufacturers will keep producing faster and faster storage and their marketing teams will keep convincing us that we need to buy it. But until software catches up, chances are most of us won't come close to realizing the true potential of modern storage hardware. And that's even true for specialized applications that do employ tricks taking hundreds or thousands of person hours to implement in order to eke out every iota of performance potential. The average software developer and application using filesystems as they were designed to be used has little to no chance of coming close to utilizing the performance potential of modern storage devices. That's really a shame.


Benefits of Clone Offload on Version Control Hosting

July 27, 2018 at 03:48 PM | categories: Mercurial, Mozilla

Back in 2015, I implemented a feature in Mercurial 3.6 that allows servers to advertise URLs of pre-generated bundle files. When a compatible client performs an hg clone against a repository leveraging this feature, it downloads and applies a bundle from one of the advertised URLs, then goes back to the server and performs the equivalent of an hg pull to obtain the changes to the repository made after the bundle was generated.
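
Server-side, the feature is driven by a manifest that lists available bundles: each line is a bundle URL followed by optional key=value attributes (such as the bundle format) that clients use to choose an appropriate entry. The URLs and attribute values below are invented for illustration, but a .hg/clonebundles.manifest in the server-side repository looks something like this:

https://cdn.example.com/bundles/myrepo.zstd.hg BUNDLESPEC=zstd-v2
https://s3.example.com/bundles/myrepo.packed.hg BUNDLESPEC=none-packed1;requirements%3Dgeneraldelta%2Crevlogv1
https://cdn.example.com/bundles/myrepo.gzip.hg BUNDLESPEC=gzip-v2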

On hg.mozilla.org, we've been using this feature since 2015. We host bundles in Amazon S3 and make them available via the CloudFront CDN. We perform IP filtering on the server so clients connecting from AWS IPs are served S3 URLs corresponding to the closest region / S3 bucket where bundles are hosted. Most Firefox build and test automation is run out of EC2 and automatically clones high-volume repositories from an S3 bucket hosted in the same AWS region. (Doing an intra-region transfer is very fast and clones can run at >50 MB/s.) Everyone else clones from a CDN. See our official docs for more.

I last reported on this feature in October 2015. Since then, Bitbucket also deployed this feature in early 2017.

I was reminded of this clone bundles feature this week when kernel.org posted Best way to do linux clones for your CI and that post was making the rounds in my version control circles. tl;dr git.kernel.org apparently suffers high load due to high clone volume against the Linux Git repository, and since Git doesn't have an equivalent feature built in, they are asking people to implement equivalent functionality themselves to mitigate server load.

(A clone bundles feature has been discussed on the Git mailing list before. I remember finding old discussions when I was doing research for Mercurial's feature in 2015. I'm sure the topic has come up since.)

Anyway, I thought I'd provide an update on just how valuable the clone bundles feature is to Mozilla. In doing so, I hope maintainers of other version control tools see the obvious benefits and consider adopting the feature sooner.

In a typical week, hg.mozilla.org is currently serving ~135 TB of data. The overwhelming majority of this data is related to the Mercurial wire protocol (i.e. not HTML / JSON served from the web interface). Of that ~135 TB, ~5 TB is served from the CDN, ~126 TB is served from S3, and ~4 TB is served from the Mercurial servers themselves. In other words, we're offloading ~97% of bytes served from the Mercurial servers to S3 and the CDN.

If we assume this offloaded ~131 TB is equally distributed throughout the week, this comes out to ~1,732 Mbps on average. In reality, most of the load occurs between Sunday evening and early Friday evening, California time, and is typically concentrated in the 12 hours when the sun is over Europe and North America (where most of Mozilla's employees are based). So the typical throughput we are offloading is more than 2 Gbps. At a finer granularity, automation tends to perform clones soon after a push is made, so load fluctuates significantly throughout the day, corresponding to when pushes land.

By volume, most of the data being offloaded is for the mozilla-unified Firefox repository. Without clone bundles and without the special stream clone Mercurial feature (which we also leverage via clone bundles), the servers would be generating and sending ~1,588 MB of zstandard level 3 compressed data for each clone of that repository. Each clone would consume ~280s of CPU time on the server. And at ~195,000 clones per month, that would come out to ~309 TB/mo or ~72 TB/week. In CPU time, that would be ~54.6 million CPU-seconds, or ~21 CPU-months. I will leave it as an exercise to the reader to attach a dollar cost to how much it would take to operate this service without clone bundles. But I will say the total AWS bill for our S3 and CDN hosting for this service is under $50 per month. (It is worth noting that intra-region data transfer from S3 to other AWS services is free. And we are definitely taking advantage of that.)
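
For anyone who wants to check the back-of-the-envelope math, it works out roughly like this (a quick sketch using only the figures quoted in this post):

# Rough back-of-the-envelope math using the figures quoted above.
offloaded_tb_per_week = 131
bits_per_week = offloaded_tb_per_week * 1e12 * 8
print('average offloaded throughput: %.0f Mbps'
      % (bits_per_week / (7 * 86400) / 1e6))

clone_mb = 1588           # compressed bytes served per full clone
clone_cpu_seconds = 280   # server CPU consumed per clone
clones_per_month = 195000
print('monthly transfer without offload: %.1f TB'
      % (clone_mb * clones_per_month / 1e6))
print('monthly server CPU without offload: %.1f CPU-months'
      % (clone_cpu_seconds * clones_per_month / (86400 * 30.0)))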

Despite a significant increase in the size of the Firefox repository and clone volume of it since 2015, our servers are still performing less work (in terms of bytes transferred and CPU seconds consumed) than they were in 2015. The ~97% of bytes and millions of CPU seconds offloaded in any given week have given us a lot of breathing room and have saved Mozilla several thousand dollars in hosting costs. The feature has likely helped us avoid many operational incidents due to high server load. It has made Firefox automation faster and more reliable.

Succinctly, Mercurial's clone bundles feature has successfully and largely effortlessly offloaded a ton of load from the hg.mozilla.org Mercurial servers. Other version control tools should implement this feature because it is a game changer for server operators and results in a better client-side experience (eliminates server-side CPU bottleneck and may eliminate network bottleneck due to a geo-local CDN typically being as fast as your Internet pipe). It's a win-win. And a massive win if you are operating at scale.


Deterministic Firefox Builds

June 20, 2018 at 11:10 AM | categories: Mozilla

As of Firefox 60, the build environment for official Firefox Linux builds switched from CentOS to Debian.

As part of the transition, we overhauled how the build environment for Firefox is constructed. We now populate the environment from deterministic package snapshots and are much more stringent about dependencies and operations being deterministic and reproducible. The end result is that the build environment for Firefox is deterministic enough to enable Firefox itself to be built deterministically.

Changing the underlying operating system environment used for builds was a risky change. Differences in the resulting build could result in new bugs or some users not being able to run the official builds. We figured a good way to mitigate that risk was to make the old and new builds as bit-identical as possible. After all, if the environments produce the same bits, then nothing has effectively changed and there should be no new risk for end-users.

Employing the diffoscope tool, we identified areas where Firefox builds weren't deterministic in the same environment and where there was variance across build environments. We iterated on differences and changed systems so variance would no longer occur. By the end of the process, we had bit-identical Firefox builds across environments.
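
If you want to perform a similar comparison yourself, diffoscope can diff two build artifacts or directories directly and produce a browsable report (the paths here are hypothetical):

$ diffoscope --html report.html firefox-centos-build/ firefox-debian-build/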

So, as of Firefox 60, Firefox builds on Linux are deterministic in our official build environment!

That being said, the builds we ship to users are built with PGO. And an end-to-end build involving PGO is intrinsically not deterministic because it relies on timing data that varies from one run to the next. Nor do we yet have continuous automated end-to-end testing that verifies determinism holds. But the underlying infrastructure to support deterministic and reproducible Firefox builds is there and is not going away. I think that's a milestone worth celebrating.

This milestone required the effort of many people, often working indirectly toward it. Debian's reproducible builds effort gave us an operating system that provided deterministic and reproducible guarantees. Switching Firefox CI to Taskcluster enabled us to switch to Debian relatively easily. Many were involved with non-determinism fixes in Firefox over the years. But Mike Hommey drove the transition of the build environment to Debian and he deserves recognition for his individual contribution. Thanks to all these efforts - and especially Mike Hommey's - we can now say Firefox builds deterministically!

The fx-reproducible-build bug tracks ongoing efforts to further improve the reproducibility story of Firefox. (~300 bugs in its dependency tree have already been resolved!)


Scaling Firefox Development Workflows

May 16, 2018 at 04:10 PM | categories: Mozilla

One of the central themes of my time at Mozilla has been my pursuit of making it easier to contribute to and hack on Firefox.

I vividly remember my first day at Mozilla in 2011 when I went to build Firefox for the first time. I thought the entire experience - from obtaining the source code, installing build dependencies, building, and running tests, to submitting patches for review - was quite... lacking. When I asked others if they thought this was an issue, many rightfully identified problems (like the build system being slow). But there was a significant population who seemed to be naive about and/or apathetic to the breadth of the user experience shortcomings. This is totally understandable: the scope of the problem is immense and various people don't have the perspective, are blinded/biased by personal experience, and/or don't have the product design or UX experience necessary to comprehend the problem.

When it comes to contributing to Firefox, I think the problems have as much to do with user experience (UX) as they do with technical matters. As I wrote in 2012, user experience matters and developers are people too. You can have a technically superior product, but if the UX is bad, you will have a hard time attracting and retaining new users. And existing users won't be as happy. These are the kinds of problems that a product manager or designer deals with. A difference is that in the case of Firefox development, the target audience is a very narrow and highly technically-minded subset of the larger population - much smaller than what your typical product targets. The total addressable population is (realistically) in the thousands instead of millions. But this doesn't mean you ignore the principles of good product design when designing developer tooling. When it comes to developer tooling and workflows, I think it is important to have a product manager mindset and treat it not as a collection of tools for technically-minded individuals, but as a product having an overall experience. You only have to look as far as the Firefox Developer Tools to see this approach applied and the positive results it has achieved.

Historically, Mozilla has lacked a formal team with the domain expertise and mandate to treat Firefox contribution as a product. We didn't have anything close to this until a few years ago. Before we had such a team, I took on some of these problems individually. In 2012, I wrote mach - a generic CLI command dispatch tool - to provide a central, convenient, and easy-to-use command to discover development actions and to run them. (Read the announcement blog post for some historical context.) I also introduced a one-line bootstrap tool (now mach bootstrap) to make it easier to configure your machine for building Firefox. A few months later, I was responsible for introducing moz.build files, which paved the way for countless optimizations and for rearchitecting the Firefox build system to use modern tools - a project that is still ongoing (digging out from ~two decades of technical debt is a massive effort). And a few months after that, I started going down the version control rabbit hole and improving matters there. And I was also heavily involved with MozReview for improving the code review experience.

Looking back, I was responsible for and participated in a ton of foundational changes to how Firefox is developed. Of course, dozens of others have contributed to getting us to where we are today and I can't and won't take credit for the hard work of others. Nor will I claim I was the only person coming up with good ideas or transforming them into reality. I can name several projects (like Taskcluster and Treeherder) that have been just as or more transformational than the changes I can take credit for. It would be vain and naive of me to elevate my contributions on a taller pedestal and I hope nobody reads this and thinks I'm doing that.

(On a personal note, numerous people have told me that things like mach and the bootstrap tool have transformed the Firefox contribution experience for the better. I've also had very senior people tell me that they don't understand why these tools are important and/or are skeptical of the need for investments in this space. I've found this dichotomy perplexing and troubling. Because some of the detractors (for lack of a better word) are highly influential and respected, their apparent skepticism sows seeds of doubt and causes me to second guess my contributions and world view. This feels like a form or variation of imposter syndrome and it is something I have struggled with during my time at Mozilla.)

From my perspective, the previous five or so years in Firefox development workflows has been about initiating foundational changes and executing on them. When it was introduced, mach was radical. It didn't do much and its use was optional. Now almost everyone uses it. Similar stories have unfolded for Taskcluster, MozReview, and various other tools and platforms. In other words, we laid a foundation and have been steadily building upon it for the past several years. That's not to say other foundational changes haven't occurred since (they have - the imminent switch to Phabricator is a terrific example). But the volume of foundational changes has slowed since 2012-2014. (I think this is due to Mozilla deciding to invest more in tools as a result of growing pains from significant company expansion that began in 2010. With that investment, we invested in the bigger ticket long-standing workflow pain points, such as CI (Taskcluster), the Firefox build system, Treeherder, and code review.)

Workflows Today and in the Future

Over the past several years, the size, scope, and complexity of Firefox development activities has increased.

One way to see this is at the source code level. The following chart shows the size of the mozilla-central version control repository over time.

(Chart: mozilla-central repository size over time)

The size increases are obvious. The increases cumulatively represent new features, technologies, and workflows. For example, the repository contains thousands of Web Platform Tests (WPT) files, a shared test suite for web platform implementations, like Gecko and Blink. WPT didn't exist a few years ago. Now we have files under source control, tools for running those tests, and workflows revolving around changing those tests. The incorporation of Rust and components of Servo into Firefox is also responsible for significant changes. Firefox features such as Developer Tools have been introduced or ballooned in size in recent years. The Go Faster project and the move to system add-ons have introduced various new workflows and challenges for testing Firefox.

Many of these changes are building upon the user-facing foundational workflow infrastructure that was last significantly changed in 2012-2014. This has definitely contributed to some growing pains. For example, there are now 92 mach commands instead of like 5. mach help - intended to answer what can I do and how should I do it - is overwhelming, especially to new users. The repository is around 2 gigabytes of data to clone instead of around 500 megabytes. We have 240,000 files in a full checkout instead of 70,000 files. There's a ton of new pieces floating around. Any product manager tasked with user acquisition and retention will tell you that increasing the barrier to entry and use will jeopardize these outcomes. But with the growth of Firefox's technical underbelly in the previous years, we've made it harder to contribute by requiring users to download and see a lot more files (version control) and be overwhelmed by all the options for actions to take (mach having 92 commands). And as the sheer number of components constituting Firefox increases, it becomes harder and harder for everyone - not just new contributors - to reason about how everything fits together.

I've been framing this general problem as scaling Firefox development workflows and every time I think about the high-level challenges facing Firefox contribution today and in the years ahead, this problem floats to the top of my list of concerns. Yes, we have pressing issues like improving the code review experience and making the Firefox build system and Taskcluster-based CI fast, efficient, and reliable. But even if you make these individual pieces great, there is still a cross-domain problem of how all these components weave together. This is why I think it is important to take a holistic view and treat developer workflow as a product.

When I look at this the way a product manager or designer would, I see a few fundamental problems that need addressing.

First, we're not optimizing for comprehensive end-to-end workflows. By and large, we're designing our tools in isolation. We focus more on maximizing the individual components instead of maximizing the interaction between them. For example, Taskcluster and Treeherder are pretty good in isolation. But we're missing features like Treeherder being able to tell me the command to run locally to reproduce a failure: I want to see a failure on Treeherder and be able to copy and paste commands into my terminal to debug the failure. In the case of code review, we've designed two good code review tools (MozReview and Phabricator) but we haven't invested in making submitting code reviews turn key (the initial system configuration is difficult and we still don't have things like automatic bug filing or reviewer selection). We are leaving many workflow optimizations on the table by not implementing thoughtful tie-ins and transitions between various tools.

Second, by-and-large we're still optimizing for a single, monolithic user segment instead of recognizing and optimizing for different users and their workflow requirements. For example, mach help lists 92 commands. I don't think any single person cares about all 92 of those commands. The average person may only care about 10 or even 20. In terms of user interface design, the features and workflow requirements of small user segments are polluting the interface for all users and making the entire experience complicated and difficult to reason about. As a concrete example, why should a system add-on developer or a Firefox Developer Tools developer (these people tend to care about testing a standalone Firefox add-on) care about Gecko's build system or tests? If you aren't touching Gecko or Firefox's chrome code, why should you be exposed to workflows and requirements that don't have a major impact on you? Or something more extreme, if you are developing a standalone Rust module or Python package in mozilla-central, why do you need to care about Firefox at all? (Yes, Firefox or another downstream consumer may care about changes to that standalone component and you can't ignore those dependencies. But it should at least be possible to hide those dependencies.)

Waving my hands, the solution to these problems is to treat Firefox development workflow as a product and to apply the same rigor that we use for actual Firefox product development. Give people with a vision for the entire workflow the ability to prioritize investment across tools and platforms. Give them a process for defining features that work across tools. Perform formal user studies. See how people are actually using the tools you build. Bring in design and user experience experts to help formulate better workflows. Perform user typing so different, segmentable workflows can be optimized for. Treat developers as you treat users of real products: listen to them. Give developers a voice to express frustrations. Let them tell you what they are trying to do and what they wish they could do. Integrate this feedback into a feature roadmap. Turn common feedback into action items for new features.

If you think these ideas are silly and it doesn't make sense to apply a product mindset to developer workflows and tooling, then you should be asking whether product management and all that it entails is also a silly idea. If you believe that aspects of product management have beneficial outcomes (which most companies do because otherwise there wouldn't be product managers), then why wouldn't you want to apply the methods of that discipline to developers and development workflows? Developers are users too and the fact that they work for the same company that is creating the product shouldn't make them immune from the benefits of product management.

If we want to make contributing to Firefox an even better experience for Mozilla employees and community contributors, I think we need to take a step back and assess the situation as a product manager would. The improvements that have been made to the individual pieces constituting Firefox's development workflow during my nearly seven years at Mozilla have been incredible. But I think in order to achieve the next round of major advancements in workflow productivity, we'll need to focus on how all of the pieces fit together. And that requires treating the entire workflow as a cohesive product.
