Python Packaging Do's and Don'ts

July 15, 2014 at 05:20 PM | categories: Python, Mozilla

Are you someone who casually interacts with Python but don't know the inner workings of Python? Then this post is for you. Read on to learn why some things are the way they are and how to avoid making some common mistakes.

Always use Virtualenvs

It is an easy trap to view virtualenvs as an obstacle, a distraction towards accomplishing something. People see me adding virtualenvs to build instructions and they say I don't use virtualenvs, they aren't necessary, why are you doing that?

A virtualenv is effectively an overlay on top of your system Python install. Creating a virtualenv can be thought of as copying your system Python environment into a local location. When you modify virtualenvs, you are modifying an isolated container. Modifying virtualenvs has no impact on your system Python.

A goal of a virtualenv is to isolate your system/global Python install from unwanted changes. When you accidentally make a change to a virtualenv, you can just delete the virtualenv and start over from scratch. When you accidentally make a change to your system Python, it can be much, much harder to recover from that.

Another goal of virtualenvs is to allow different versions of packages to exist. Say you are working on two different projects and each requires a specific version of Django. With virtualenvs, you install one version in one virtualenv and a different version in another virtualenv. Things happily coexist because the virtualenvs are independent. Contrast with trying to manage both versions of Django in your system Python installation. Trust me, it's not fun.

Casual Python users may not encounter scenarios where virtualenvs make their lives better... until they do, at which point they realize their system Python install is beyond saving. People who eat, breath, and die Python run into these scenarios all the time. We've learned how bad life without virtualenvs can be and so we use them everywhere.

Use of virtualenvs is a best practice. Not using virtualenvs will result in something unexpected happening. It's only a matter of time.

Please use virtualenvs.

Never use sudo

Do you use sudo to install a Python package? You are doing it wrong.

If you need to use sudo to install a Python package, that almost certainly means you are installing a Python package to your system/global Python install. And this means you are modifying your system Python instead of isolating it and keeping it pristine.

Instead of using sudo to install packages, create a virtualenv and install things into the virtualenv. There should never be permissions issues with virtualenvs - the user that creates a virtualenv has full realm over it.

Never modify the system Python environment

On some systems, such as OS X with Homebrew, you don't need sudo to install Python packages because the user has write access to the Python directory (/usr/local in Homebrew).

For the reasons given above, don't muck around with the system Python environment. Instead, use a virtualenv.

Beware of the package manager

Your system's package manager (apt, yum, etc) is likely using root and/or installing Python packages into the system Python.

For the reasons given above, this is bad. Try to use a virtualenv, if possible. Try to not use the system package manager for installing Python packages.

Use pip for installing packages

Python packaging has historically been a mess. There are a handful of tools and APIs for installing Python packages. As a casual Python user, you only need to know of one of them: pip.

If someone says install a package, you should be thinking create a virtualenv, activate a virtualenv, pip install <package>. You should never run pip install outside of a virtualenv. (The exception is to install virtualenv and pip itself, which you almost certainly want in your system/global Python.)

Running pip install will install packages from PyPI, the Python Packaging Index by default. It's Python's official package repository.

There are a lot of old and outdated tutorials online about Python packaging. Beware of bad content. For example, if you see documentation that says use easy_install, you should be thinking, easy_install is a legacy package installer that has largely been replaced by pip, I should use pip instead. When in doubt, consult the Python packaging user guide and do what it recommends.

Don't trust the Python in your package manager

The more Python programming you do, the more you learn to not trust the Python package provided by your system / package manager.

Linux distributions such as Ubuntu that sit on the forward edge of versions are better than others. But I've run into enough problems with the OS or package manager maintained Python (especially on OS X), that I've learned to distrust them.

I use pyenv for installing and managing Python distributions from source. pyenv also installs virtualenv and pip for me, packages that I believe should be in all Python installs by default. As a more experienced Python programmer, I find pyenv just works.

If you are just a beginner with Python, it is probably safe to ignore this section. Just know that as soon as something weird happens, start suspecting your default Python install, especially if you are on OS X. If you suspect trouble, use something like pyenv to enforce a buffer so the system can have its Python and you can have yours.

Recovering from the past

Now that you know the preferred way to interact with Python, you are probably thinking oh crap, I've been wrong all these years - how do I fix it?

The goal is to get a Python install somewhere that is as pristine as possible. You have two approaches here: cleaning your existing Python or creating a new Python install.

To clean your existing Python, you'll want to purge it of pretty much all packages not installed by the core Python distribution. The exception is virtualenv, pip, and setuptools - you almost certainly want those installed globally. On Homebrew, you can uninstall everything related to Python and blow away your Python directory, typically /usr/local/lib/python*. Then, brew install python. On Linux distros, this is a bit harder, especially since most Linux distros rely on Python for OS features and thus they may have installed extra packages. You could try a similar approach on Linux, but I don't think it's worth it.

Cleaning your system Python and attempting to keep it pure are ongoing tasks that are very difficult to keep up with. All it takes is one dependency to get pulled in that trashes your system Python. Therefore, I shy away from this approach.

Instead, I install and run Python from my user directory. I use pyenv. I've also heard great things about Miniconda. With either solution, you get a Python in your home directory that starts clean and pure. Even better, it is completely independent from your system Python. So if your package manager does something funky, there is a buffer. And, if things go wrong with your userland Python install, you can always nuke it without fear of breaking something in system land. This seems to be the best of both worlds.

Please note that installing packages in the system Python shouldn't be evil. When you create virtualenvs, you can - and should - tell virtualenv to not use the system site-packages (i.e. don't use non-core packages from the system installation). This is the default behavior in virtualenv. It should provide an adequate buffer. But from my experience, things still manage to bleed through. My userland Python install is extra safety. If something wrong happens, I can only blame myself.

Conclusion

Python's long and complicated history of package management makes it very easy for you to shoot yourself in the foot. The long list of outdated tutorials on The Internet make this a near certainty for casual Python users. Using the guidelines in this post, you can adhere to best practices that will cut down on surprises and rage and keep your Python running smoothly.


Update Bugzilla Automatically on Push

June 30, 2014 at 11:15 PM | categories: Mercurial, Mozilla

Do you manually create Bugzilla comments when you push changes to a Firefox source repository? Yeah, I do too.

That's always annoyed me.

It is screaming to be automated.

So I automated it.

You can too. From a Firefox source checkout:

$ ./mach mercurial-setup

That should clone the version-control-tools repository into ~/.mozbuild/version-control-tools.

Then, add the following to your ~/.hgrc file:

[extensions]
bzpost = ~/.mozbuild/version-control-tools/hgext/bzpost

[bugzilla]
username = me@example.com
password = password

Now, when you hg push to a Firefox repository, the commit URLs will get posted to referenced bugs automatically.

Please note that pushing to release trees such as mozilla-central is not yet supported. In due time.

Please let me know if you run into any issues.

Estimated Cost Savings

Assuming the following:

  • It costs Mozilla $200,000 per year per full-time engineer working on Firefox (a general rule of thumb for non-senior positions is that your true employee cost is 2x your base salary).
  • Each full-time engineer works 40 hours per week for 46 weeks out of the year.
  • It takes 15 seconds to manually update Bugzilla for each push.
  • There are 20,000 pushes requiring Bugzilla attention per year.

We arrive at the following:

  • Cost per employee per hour worked: $108.70
  • Total person-time to manually update Bugzilla: ~83 hours
  • Total cost to manually update Bugzilla after push: $9,058.

I was intentionally conservative with all the inputs except time worked (I think many of us work more than 40 hour weeks). My estimates also don't take into account the lost productivity associated with getting mentally derailed by interacting with Bugzilla. With this in mind, I could very easily justify a total cost at least 2x-3x higher.

It took me maybe 3 hours to crank this out. I could spend another few weeks on it full time and Mozilla would still save money (assuming 100% adoption).

I encourage people to run their own cost calculations on other tasks that can be automated. Inefficiencies multiplied by millions of dollars (your collective employee cost) result in large piles of money. Not having tools (even simple ones like this) is equivalent to setting loads of cash on fire.


Track Firefox Repositories with Local-Only Mercurial Tags

June 30, 2014 at 10:25 AM | categories: Mercurial, Mozilla

After reading my recent Please Stop Using MQ post, a number of people asked me about my development workflow. While I still owe a full answer to that question, one of the cornerstores is a unified Mercurial repository. Instead of having separate clones for mozilla-central, mozilla-inbound, aurora, beta, etc, I have a single clone with the changesets from all the repositories.

I feel having a unified repository has made me more productive. I no longer have to waste time shuffling changesets between local clones. This introduced all kinds of extra cognitive load and manual processes that slowed me down. I highly encourage others to adopt unified repositories.

Because the various Firefox repositories don't have unique branches or bookmarks tracking the various heads, aggregating multiple repositories introduces a client-side problem of identifying which head is which. If you merely do a hg pull, you'll get a bunch of anonymous heads.

To solve this problem, you'll need to employ minor client-side magic. Previously, I recommended my heavyweight mozext extension.

Today, I'm proud to announce a new, lighter extension: firefoxtree. When pulling from a known Firefox repository, this extension will add a local-only Mercurial tag for that repository. For example, when pulling mozilla-central, the central tag will be created.

Local-only tags are a Mercurial feature only available to extensions. A local-only tag is effectively an overlay of tags that don't get transferred as part of push and pull operations. They behave like normal tags: you can hg up to them and reference them elsewhere changeset identifiers are used. They are also read only: if you update to a tag and then commit, the tag will not move forward (contrast with branches or bookmarks).

Example Usage

Clone https://hg.mozilla.org/hgcustom/version-control-tools and add the following to your .hg/hgrc:

[extensions]
firefoxtree = /path/to/version-control-tools/hgext/firefoxtree

To use, simply pull from a Firefox repo:

$ hg pull https://hg.mozilla.org/mozilla-central

$ hg up central

# Do your development work.

# Time to land.
$ hg pull https://hg.mozilla.org/integration/mozilla-inbound
$ hg rebase -d inbound
$ hg out -r . https://hg.mozilla.org/integration/mozilla-inbound
$ hg push -r .  https://hg.mozilla.org/integration/mozilla-inbound

Please note that hg push tries to push all local changes on all heads to a remote by default. When operating a unified repo, you'll need to use the -r argument to hg push and hg out to limit what changesets are considered. I most frequently use -r . to limit changes to the current checked out changeset.

Also note that this extension conflicts with my mozext extension. I hope to update mozext to make it behave in the same manner. (It was easier to write a new extension than to update mozext.)


Please Stop Using MQ

June 23, 2014 at 11:50 AM | categories: Mercurial, Mozilla

Are you a Mercurial user?

Do you use the mq extension? If so, please don't.

Why, you ask? Good question. The short version is that mq is a solution spawned from version control techniques that were popular over a decade ago. We have better tools now. mq is technologically obsolete.

But if you want to know the full answer, keep reading.

A history of Mercurial and mq

The mq extension is just that: an extension to Mercurial's core capabilities. (Mercurial consists of a core/basic feature set supplemented by built-in and 3rd party extensions that provide more advanced or lesser used, niche features.) Modern versions of Mercurial (if you aren't running version 3.0+, please upgrade ASAP) are significantly different from the Mercurial that necessitated the existence and usage of the mq extension.

In the early days of Mercurial (Mercurial was announced in April 2005), your choices for branching were not spectacular. Your option in the core of Mercurial was to use Mercurial branches. Mercurial branches are very heavy beasts. Branches are permanent. If you change the branch a changeset (Mercurial's term for a commit) is assciated with, the SHA-1 of that changeset changes. The takeaway is Mercurial branches aren't very user friendly. That's why modern versions of Mercurial print a warning when you create one:

$ hg branch my-new-feature
marked working directory as branch my-new-feature
(branches are permanent and global, did you want a bookmark?)

Branches were and still are a relatively poor user experience inside Mercurial, especially when compared to Git branches (comparing Git and Mercurial branches is an apples to oranges comparison: Mercurial branches are more synonymous with CVS or Subversion branches). In defense of branches, they are good for some roles, such as tracking release branches. (Mozilla's aurora, beta, release, etc Firefox repositories should arguably be modeled as branches instead of separate repositories.)

Further complicating the usability of branches in early versions of Mercurial was the lack of commands to rewrite history. Concepts such as rebasing or reordering patches were not always available in Mercurial (or if they were there were significant limitations). Today, this seems like an obvious shortcoming of a version control system. But keep in mind the state of version control around 2005-2007. CVS and Subversion were the big players. Perforce, SourceSafe, TFS, etc were popular in corporate settings. While I'm sure there were version control systems at the time that supported rewriting history, my recollection is that rewriting history did not become a thing until Git rose in popularity (which I think was around 2008 or 2009). At that time, the concept of mutating past commits was alien, if not absurd. Why would you want to lose data on what you've already done? Isn't that the antithesis of a version control system?! Git - and many of its concepts, including history rewriting - were perceived as radical. It doesn't matter if they borrowed the ideas from other tools: Git's newfound popularity made them common and a new necessity. If you went from Git to Subversion (or even Mercurial), the superiority of Git's flexibility was obvious (although the UI was not so great, but people can - and do - still cope).

It was under this renaissance of modern version control tools in the late 2000's where mq became popular. But its popularity was not because of the strength of mq: it was because of its necessity.

mq was added to Mercurial in February 2006 and released as part of Mercurial 0.8.1 in April 2006. As far as I can tell, at the time of mq's release, mq was the only tolerable way to perform history rewriting in Mercurial (hg qref and hg qpush --move are a very crude method of history rewriting). Now, you could leverage multiple Mercurial commands to give the illusion of history rewriting, but it was very cumbersome. No sane person wanted to deal with that. So mq was used.

The introduction of the transplant extension in the Mercurial 0.9.2 release in December 2006 added another history rewriting facility to Mercurial. The transplant command effectively copies changesets between branches or heads.

The Mercurial 0.9.5 release in September 2007 introduced the record extension. The record command provides interactive and incremental commit support, similar to git add -i. While record isn't strictly history rewriting, it provides a very useful functionality for helping to produce separate, smaller commits from a larger ongoing change.

The release of Mercurial 1.1 in December 2008 was - on paper at least - an milestone for Mercurial. This release added the rebase extension. This introduced the rebase command, which moves changesets (as opposed to transplant, which merely copies them). Rebasing allowed you to pull remote changes and then easily move your local changesets on top of the new changes, among other things.

An even bigger feature added in Mercurial 1.1 was the bookmarks extension. Bookmarks are Mercurial's lightweigh branches. Instead of permanent association with a changeset, a bookmark is a movable label attached to a changeset. They are very similar to Git branches.

The initial bookmarks feature was far from robust. You couldn't push bookmarks and share them with others until Mercurial 1.6 in June 2010. It's worth noting that bookmarks became part of the Mercurial core (as opposed to existing in an extension that must be enabled) in Mercurial 1.8, released in March 2011.

Bookmarks, record, and rebase provided a decent framework for history editing in Mercurial. But there were still some rough edges, notably around making it easier to edit history: mq was still easier to use for doing tasks like reordering patches.

It wasn't until Mercurial 2.2 in May 2012 that Mercurial gained the ability to easily reorder patches using something other than mq. It came via the histedit extension, which provides the histedit command. This command provides a mechanism to interactively edit history. It is very similar to git rebase --interactive.

With the introduction of the histedit extension, you could (finally) perform most of the more advanced repository interaction workflows that mq allowed without using mq. Continue reading the next section to learn why not using mq is important.

For Mozillians reading, it's worth noting that the conversion of the Firefox source repository from CVS to Mercurial occurred in March 2007. At that time, Mercurial had mq and transplant for performing history rewriting. mq was thus the most convenient method for rewriting history and managing individual patches (a process Firefox development largely maintains to this day). Thus, mq effectively became a de facto requirement for developing Firefox patches. Despite mq being arguably unnecessary since Mercurial 2.2's release in May 2012, mq is still widely used within Mozilla. My perception is that most developers continue to use mq. Furthermore, even new developers are still picking it up instead of utilizing Mercurial's other commands for managing changesets. FWIW, I believe many are so turned off by mq or don't want to learn Mercurial or mq that they do their day-to-day development in Git and only involve Mercurial when doing the final push to the canonical Firefox Mercurial repository.

Now that we learned why mq is popular, let's talk about why it shouldn't be used any more.

The hack that is mq

In terms of architecture, mq is a giant hack: a wart on top of Mercurial. The goal of every version control system is to track changes over time. mq actively works against that.

At the core of every Mercurial repository is a store that contains the repository data. The changesets in the repository are logically represented as a directed acyclic graph (DAG). When you run hg commit, hg pull, hg import, or any command that introduces new changesets, new data is added to the DAG and written to the store.

One of the most important properties of the store is that it is supposed to be append-only: once data is added, it is never removed. This property allows Mercurial to fulfill the contract of version control systems, which is to keep track of data without losing anything.

This append-only property is important because it means Mercurial can know about all of the history for all of time. If you need to perform a merge, Mercurial knows exactly what came before and thus the most effective way to perform that merge. (It's worth noting that merging can be quite complicated. All this extra data helps do the merge correctly the first time, which means you can spend your time writing code instead of resolving merge conflicts.)

The "mq is a hack" statement derives from how mq interacts with Mercurial's store. Your mq patch queue is effectively an overlay over Mercurial's built-in store that is in an ever-changing state of partial application. When you hg qpush to apply a patch from the patch queue, you effectively do an hg import to import a Mercurial changeset file into the core Mercurial store. That's mostly fine. But when you hg qpop (unapply a patch from mq), you are effectively asking Mercurial to delete a changeset and all data associated with it from the core Mercurial store. In other words, qpop breaks the ideal append-only property of the Mercurial store: qpop forces Mercurial to delete data and lose track of history. This makes Mercurial very sad.

Because you are throwing away repository data, merges become much harder. This means you spend more of your time dealing with resolving conflicts. To further complicate that, the mq code paths for performing merges and conflict resolution aren't as robust as those used by, say, hg rebase and hg histedit. So if you use mq, you are pretty much guaranteed a poorer conflict resolution process. This means you spend more of your time wrestling with tools instead of, you know, actually doing something productive.

Deleting repository data by unapplying patches via mq also has negative performance implications. Given a large enough repository (such as Firefox's - it currently has ~200,000 changesets), these performance issues can result in severely degraded performance. An extreme example of this is Mozilla's Try repository. The runaway process/CPU issues leading to outages is due to a known Mercurial issue dealing with inefficient handling of stores that aren't append-only. When you use mq, you risk running into the same issues. Although the performance implication should hopefully be a magnitude or two less pronounced than what Mozilla's Try repository experiences.

In theory, mq could compensate for deletion of data from Mercurial's core store by retaining that data itself. But it doesn't. This leads to yet more reasons why you shouldn't use mq.

mq throws away perfectly fine history data. When you hg qrefresh, your current patch is overwritten with the new one. The old version is discared. This actively works against the goal of a version control system to record history. Have you ever started down a path and realized a few commits later that you need to throw it away and backtrack to a few commits back? I do that all the time. If you qrefresh, tough luck: you've lost your history. If only the history store was append-only.

(In fairness to mq, the patch queue maintained by mq is itself a Mercurial repository and can be committed to, preventing history loss. The mqext extension even provides a mechanism for automatically committing the patch queue repo during qrefresh and other mutation events. But as we will see, even with this, mq is not perfect.)

mq also fails to adequately update patches when they are moved around. Little known fact: mq patches have their parent changeset encoded in them. If you run hg qpush --exact, Mercurial will apply the patch against the parent changeset it was actually created against, as opposed to the current work directory changeset. In theory, you should never encounter conflicts when applying patches this way. Because Mercurial's changeset SHA-1s are dependent on content (like Git), if 23d165967da3 is the child of d5741ada659d, then there should be absolutely no way that 23d165967da3 should fail to apply directly to d5741ada659d. The problem is that mq fails to update the parent changeset when you push patches on top of a new parent! e.g.

# Let's create a root commit.
$ echo 'foo' > foo
$ hg commit -A -m 'initial commit'
adding foo
$ hg log
changeset:   0:23d165967da3
user:        Gregory Szorc <gps@mozilla.com>
date:        Sun Jun 22 13:48:05 2014 -0700
summary:     initial commit

Notice the changeset of that commit. 023d165....

Now let's create a new mq patch.

$ echo 'patch 1' > foo
$ hg qnew -m 'make mq patch 1' p1

# Creating an mq patch creates a file in .hg/patches.
$ cat .hg/patches/p1
# HG changeset patch
# Parent 23d165967da39e5846976eb6ed967f1058827ffa
# User Gregory Szorc <gps@mozilla.com>
make p1

diff --git a/foo b/foo
--- a/foo
+++ b/foo
@@ -1,1 +1,1 @@
-foo
+patch 1

Notice how the parent is listed as the changeset in Mercurial's core store.

Now let's pop that patch, create a new commit, and push the old mq patch.

$ hg qpop
popping p1
patch queue now empty

$ echo 'bar' > bar
$ hg commit -A -m 'second commit'
adding bar

$ hg log
changeset:   1:d5741ada659d
tag:         tip
user:        Gregory Szorc <gps@mozilla.com>
date:        Sun Jun 22 13:48:37 2014 -0700
summary:     second commit

changeset:   0:23d165967da3
user:        Gregory Szorc <gps@mozilla.com>
date:        Sun Jun 22 13:48:05 2014 -0700
summary:     initial commit

$ hg qpush
applying p1
now at: p1

$ hg log --graph
@ changeset:   2:64851294d038
| tag:         p1
| tag:         qbase
| tag:         qtip
| tag:         tip
| user:        Gregory Szorc <gps@mozilla.com>
| date:        Sun Jun 22 14:27:55 2014 -0700
| summary:     make mq patch 1
|
o changeset:   1:d5741ada659d
| tag:         qparent
| user:        Gregory Szorc <gps@mozilla.com>
| date:        Sun Jun 22 13:48:37 2014 -0700
| summary:     second commit
|
o changeset:   0:23d165967da3
  user:        Gregory Szorc <gps@mozilla.com>
  date:        Sun Jun 22 13:48:05 2014 -0700
  summary:     initial commit

We see our mq patch is now applied on top of the second commit, d5741ada659d because it was pushed while the working directory was sitting at d5741ada659d.

But let's see what mq says:

$ hg cat .hg/patches/p1
# HG changeset patch
# Parent 23d165967da39e5846976eb6ed967f1058827ffa
# User Gregory Szorc <gps@mozilla.com>
make p1

diff --git a/foo b/foo
--- a/foo
+++ b/foo
@@ -1,1 +1,1 @@
-foo
+patch 1

mq still says 23d165967da3 is the parent, even though we changed the parent! mq is lying to us! Now, we can rectify the situation by running hg qrefresh. That will cause mq to regenerate the patch file, which will pick up the new parent. But who does that? Unless you need to qrefresh to resolve conflicts or other changes, nobody. This means that if you distribute the patch - say you are uploading it for review - the parent changeset may not be accurate and anyone applying that patch may apply it to the wrong changeset and get unexpected results. That's no good.

For what it's worth, this behavior of not auto refreshing patches is by design. We don't necessarily want application to be a mutating operation, especially since mq repositories aren't automatically versioned. This is partially because mq was effectively designed to be quilt built into Mercurial. And quilt is a tool that was invented before the modern version control tools era. mq's behavior is influenced more from emulating quilt than from the desire to facilitate a sane patch/stack/feature based workflow. If you ever wanted to know why mq works the way it does and not like something more modern, that's why.

Another huge reason to not use mq is because its concepts and workflows are alien to modern and well-understood practices. While I'm a huge fan of Mercurial and prefer it over Git (read my thoughts on the topic), I don't pretend that Git isn't the most widely used version control software right now (at least in the open source world). It got that way despite its horrible UI shortcomings. I argue its rise was because people enjoy workflows such as lightweight branches and fast, often-simple merging. (GitHub and its fairly good web UI certainly helped the cause.) People today know Git. They know lightweight branches. I think they largely understand the concept of repositories with multiple heads and grok how a distributed version control system means you can commit locally without affecting a remote (server). They grok how can you push local commits to a remote (sometimes in the form of a pull request). mq doesn't work this way - the way that people have come to expect from Git. There are cognitive biases that lead us to believe that because mq is different (from Git) that it is inferior. Why should someone apply and unapply individual patches on a temporary queue (or stack depending on how you think about it) instead of working with branches? If someone groks Git and multiple heads/branches/bookmarks, why should they go through all the trouble of learning mq? The fact that mq doesn't work as well as non-mq mechanisms is icing on the figurative cake.

If the reasons I've listed are not enough, know this: mq users will not able to fully utilize the game-changing Changeset Evolution feature of Mercurial. Remember in the late 2000's when people thought Git was crazy for allowing you to mutate history - the antithesis of version control because it threw away data? Changeset Evolution and obsolescence markers - the mechanism used to enable Changeset Evolution - are Mercurial's answer to that. They enable Mercurial to retain metadata about previous changesets and how they evolved into newer changesets. That criticism about history rewriting deleting history: Changeset Evolution makes it largely go away. A light version of the history of past commits is preserved and propagated during push and pull operations.

Changeset Evolution changes the game much like Git changed the landscape of version control in the late 2000's. So much of our notions of how modern version control work are based on the current behavior of tools. For example, any Git user will tell you, never git push --force. Bad things will happen. Why do they say that? They say that because the user experience of distributed history rewriting in Git can be horrible and error prone. And it is like this because Git's implementation of history rewriting burns the old books of history once they are translated to the new world order. (OK, to be fair to Git it does have a reflog that can kinda sorta be used to recover in disaster scenarios. But it's far from robust and there are numerous caveats, notably that the reflog is not part of data distributed with push/fetch and therefore susceptible to single/few points of failure.) Mercurial's obsolescence markers, by contrast, leave footnotes in our history books and transfer these annotations as part of the distributed repository data. Changeset Evolution uses these footnotes to reconstruct the pages of history, even if they differ from our own perspective. For example, you can force push rewritten history and other clients can recover from that automagically. Want to do complex history rewriting and force push? Go right ahead: Mercurial and Changeset Evolution will enable you to work without imposing as many (practical) restrictions on your workflow as other tools. Good tools are flexible and impose fewer restrictions. This is why Mercurial and Changeset Evolution have the potential to change the world of version control much like Git changed things half a decade ago.

Are you still not convinced mq is a bad idea? Know this: a Mercurial core developer recently proposed marking mq as deprecated. That's right: there's very little love for mq within the Mercurial Project. Maintainers generally think what I've stated in this post: mq is yesterday's solution to yesterday's problem and it should be cast aside.

If you are a mq user, I hope you are now convinced that mq is a bad idea and you should transition away from it as soon as possible. In the next section, I'll talk about moving off mq.

Moving away from mq

I completely abandonded mq in December 2013 and am never using it again. In hindsight, I should have made the transition much sooner.

I could write an entire post about my workflow. Since this post is already a bit long, I'm going to describe it with mostly command line examples.

I use bookmarks for managing my feature branches. I use one bookmark per logical feature. This workflow is very similar to using Git branches.

# Create a new bookmark for the new feature I'm developing.
$ hg bookmark gps/my-new-feature

# make changes to files
(gps/my-new-feature) $ hg commit
# make more changes)
(gps/my-new-feature) $ hg commit
# keep making small changes
(gps/my-new-features) $ hg commit

# I decide I want to reorder or merge a few commits. I look at the
# DAG to see which changeset to rewrite.
(gps/my-new-features) $ hg log --graph

(gps/my-new-feature) $ hg histedit 5e907ed10e22
# this opens an editor where I can say what changes to make)

# Let's start a new feature.
(gps/my-new-feature) $ hg up central
(leaving bookmark gps/my-new-feature)
$ hg bookmark gps/feature2
# Make some changes
(gps/feature2) $ hg commit

# I got review on my original feature. Let's push it.
(gps/feature2) $ hg up gps/my-new-feature
(activating bookmark gps/marionette-restart)

(gps/my-new-feature) $ hg pull gecko
pulling from http://hg.stage.mozaws.net/gecko
searching for changes
adding changesets
adding manifests
adding file changes
added 118 changesets with 543 changes to 328 files (+1 heads)
OBSEXC: pull obsolescence markers
OBSEXC: looking for common markers in 215590 nodes
OBSEXC: no unknown remote markers
OBSEXC: DONE
updating bookmark aurora
updating bookmark b2g28
updating bookmark b2ginbound
updating bookmark beta
updating bookmark central
updating bookmark esr24
updating bookmark fx-team
updating bookmark inbound
updating bookmark release
(run 'hg heads .' to see heads, 'hg merge' to merge)

(gps/my-new-feature) $ hg rebase -d fx-team

# Verify I'm about to push what I want to push.
(gps/my-new-feature) $ hg out -r . fx-team

# And push the changes.
(gps/my-new-feature) $ hg push -r . fx-team

# And delete the bookmark that I don't need any more.
(gps/my-new-feature) $ hg bookmark -d gps/my-new-feature

That's how bookmarks work. If you upgrade to Mercurial 3.0 or newer, Mercurial will print messages when you enter and leave bookmarks. This is a great UI improvement!

I also have the prompt extension installed and configured so my shell prompt contains the active bookmark. This helps me keep track of what head I'm committing to.

I'm also a user of the experimental evolve extension. This is the extension that is implementing the Changeset Evolution feature. Eventually the functionality will be merged into core Mercurial. One such workflow with evolve is as follows.

# Create a bookmark for a new feature
$ hg bookmark gps/my-new-feature

# Make some changes.
(gps/my-new-feature) $ hg commit -m 'first commit'

# Make some more changes.
(gps/my-new-feature) $ hg commit -m 'second commit'

# Submit code for review. (Exact commands excluded.)

# Wait for review comments. There is a nit on the first commit.
$ hg previous
[204142] first commit

# Make changes
$ hg amend
1 new unstable changesets

# (amend is provided by evolve. It is equivalent to hg commit --amend)

# evolve is the command that looks at the history rewriting markers
# and knows how to rebase, etc other changesets in the face of
# rewriting.
$ hg evolve
move:[204143] second commit
atop:[204144] first commit
merging foo
warning: conflicts during merge.
merging foo incomplete! (edit conflicts, then use 'hg resolve --mark')
evolve failed!
fix conflict and run "hg evolve --continue"
abort: unresolved merge conflicts (see hg help resolve)

# Manually address conflict markers in file "foo"
$ hg resolve -m foo
no more unresolved files

$ hg evolve --continue
grafting revision 204143

# And I'm ready to push.

If you really love the concept of mq and patch queues and absolutely must do development that way (as opposed to bookmarks/branches), then you should consider using the shelve extension instead of mq. While shelve is still a bit hacky in that it violates the append-only goal of the Mercurial store, it does so in a way that integrates much better than mq. For example, when you unshelve, the changesets will only apply to the parent changeset they were last attached to. You will get no merge conflicts during unshelve. If you want to rebase, you will need to use the rebase command, which will go through Mercurial's more robust merging code paths, leaving conflict markers instead of .rej files (assuming you are using the internal merge tool). Shelve also groups multiple changesets together. Contrast with mq's default behavior of a flat list of patches (my mq queue grew to over 100 patches with patches belonging to dozens of bugs). Shelve is all around a better mq and is a good compromise between the stack/queue based development workflow of mq with the more modern development workflow of bookmarks. If nothing else, shelve integrates much better into the Mercurial core. As a quantitative measurement of this, mq weighs in at 3461 lines. Shelve, by comparison, is a svelt 704 lines. Considering they do nearly the same thing (shelve offloads functionality like folding and reordering to core Mercurial commands that didn't exist at the time of mq's inception), I trust shelve with its reduced surface area to be more robust and bug free.

But as good as shelve is, it's still not as integrated as bookmarks or branches. If you can, try to stick to the Mercurial core features and avoid shelve. But whatever you do, try to avoid mq!

Conclusion

I never used Mercurial before becoming an employee of Mozilla and a contributor to Firefox. I was a mq user for my first two plus years of Mercurial use. I blindly followed the Mozilla documentation and advice of my fellow Mozillians to use mq. It wasn't until I truly invested in learning Mercurial (as opposed to learning merely the command line interface - the minimum required for me to contribute to Firefox) that I realized how wrong the common Mozilla mindset about Mercurial and mq was. I was incorrectly associating mq with "The Mercurial Way" and thought Mercurial was inherently lacking. I knew about branches and bookmarks, but didn't realize I could use them as an alternative to mq.

Only after learning Mercurial's internals, contributing patches to the Mercurial core, and reading the changelog of the Mercurial project itself did I realize the truth: mq's popularity is a result of historical necessity that no longer exists; and, continued usage of mq represents a general failure in user eduction about mq's drawbacks and Mercurial's modern features. In other words, the more informed I became, the more I learned what a bad idea mq is and why it should be avoided.

At Mozilla, we happened to switch to Mercurial at a time (2007) when mq was the only reasonable solution available to Mercurial. The procedures and documentation developed in 2007 have persisted to 2014, largely ignoring advancements in Mercurial over that time. To be fair to my fellow Mozillians and Mercurial users everywhere, Mercurial wasn't truly ready to jettison mq until the histedit extension came into existence a few years ago. But that was years ago. We all should have had time to switch, but we haven't. Or if we have, we've switched to Git. (And I can't blame anyone for switching to Git - both its mind share and the network effect of GitHub are very compelling reasons.)

I hope you now know why mq is a bad idea. Please join me in realizing that mq was a solution for yesterday's problems. mq is a legacy tool that loses data and makes your life harder. It's time to let go of the past and embrace the future. Please join me in shaving the mq neck beard that has been growing since CVS was a popular version control system.

Addendum on Code Review

While I've wanted to write this post for a while, the trigger for me finally writing it is my work on integrating Mercurial with the Review Board code review software. The Bugzilla Team at Mozilla is working on a project to integrate Review Board into Bugzilla for code review, supplementing the antiquated-by-comparison Splinter code review feature. Instead of uploading patches to Bugzilla - how Mozilla and Firefox development has done it for over a decade - you will push changesets to a Mercurial repository and pushed changesets will automatically turn into code reviews on Review Board and get cross-posted to referenced Bugzilla bugs.

This is all part of a larger goal to make it easier for patches to be submitted and landed consistently and centrally. This should lead to various automation and tooling improvements, such as bots conducting aspects of code review and automatically landing reviewed patches.

These things are very difficult to do in a Bugzilla-only and patch-based world because it is difficult to first aggregate patches and second process them in a unified manner. Patches come in many different flavors. You have Mercurial style diffs. You have plain patches, and patches with Git or Mercurial style annotation. You have different lines of context in the patch. You have different handling of binary content. You have mq lying about the parent changeset. All these contribute to a massive impedance mismatch of sorts that makes it extremely difficult to roll out any kind of automation. Any tools built on top must first solve a very difficult data normalization and distribution problem. These are problems that should not exist and drastically raise the barrier to forward progress.

Contrast this with pushing to Mercurial (or Git) repositories. The repository is the data, not the patch file. Clients are able to extract variations of that data however they see fit. Commit data is centralized, distributable, and unified. There are dozens of existing tools and services that already know how to speak with repositories. Our data problem not only goes away, but many solutions have already been invented for us.

Back to Review Board.

I've written an extension that pushes review-specific metadata to the Mercurial server. This effectively allows Mercurial to speak code review and interact with the code review server. For Mercurial clients that are writing obsolescence markers, we are able to track the changesets undergoing review against their logical successors. For example, if you pull and rebase, this will rewrite changeset X to changeset Y (because the SHA-1 changes). Mercurial knows that a) changeset X was under review and b) Y is the new version of X. So, when you push Y for code review, the review associated with X is updated automagically. You could accomplish this with heuristic-based matching or user prompting, but these are error prone (you wouldn't believe the number of corner cases) or require excessive user time, especially when complex history rewriting is involved. Mercurial's obsolescence markers allow the matching to just work in most scenarios and for people to spend more time writing and reviewing code as opposed to telling the code review tool what and how to review.

As I was implementing this extension, I realized that mq users (most Mercurial users at Mozilla I reckon) would be unable to reap the benefits of all this magic because, well, they are using mq. I care passionately about developer productivity. Good tools are the tools that don't require constant massaging. I want everyone to get the full benefits of Mozilla's new code review system when it launches. And the only way that happens is if people stop using mq. And the only way that happens is if people learn why mq is a bad idea. So if nothing else has convinced you to stop using mq yet, maybe the lure of an amazing and automagical code review and collaboration experience will sway you.


Using Mercurial for Status Reports

April 01, 2014 at 12:30 PM | categories: Mercurial, Mozilla

Mercurial has a pair of amazing features called Revisions Sets and Templates. Combined, they allow you to query Mercurial like a database and to generate custom reports from obtained data.

As I've demonstrated, you can write Mercurial extensions to provide custom revision set queries and template functions and keywords. My mozext extension aggregates Mozilla's pushlog data into a local SQLite database and makes this data available to revision sets and templates.

My hack of the day is to use revision sets and templates to create a weekly status report:

hg log -r 'public() and me() and firstpushdate("-7")' \
--template '* {ifeq(reviewer, "gps", "Review: ", "Landing: ")}{firstline(desc)}\n'

When I run this, I get the output:

* Review: Bug 957241 - Don't package the full sdk when we don't need it. r=gps
* Review: Bug 987146 - Represent SQL queries more efficiently. r=gps.
* Review: Bug 987984 - VirtualenvManager.call_setup() should use self.python_path instead of sys.executable, r=gps
* Landing: Bug 987398 - Part 1: Run mochitests from manifests with mach; r=ahal
* Landing: Bug 987398 - Part 2: Handle install-to-subdir in TestResolver; r=ahal
* Landing: Bug 987414 - Pass multiple test arguments to mach testing commands; r=ahal
* Review: Bug 988141 - Clean up config/recurse.mk after bug 969164. r=gps
* Landing: Bug 973992 - Support experiments add-ons; r=Unfocused
* Review: Bug 927672 - Force pymake to fall back to mozmake when run on build slaves. r=gps
* Review: Bug 989147 - Use new sccache for Linux and Android builds. r=gps
* Review: Bug 989147 - Add missing part of the patch from rebase conflict. r=gps
* Landing: Bug 975000 - Disable updating and compatibility checking for Experiments; r=Unfocused
* Landing: Bug 985084 - Experiment add-ons should be disabled by default; r=Unfocused
* Landing: Backed out changeset 4834a3833639 and c580afddd1cb (bug 985084 and bug 97500)
* Landing: Bug 975000 - Disable updating and compatibility checking for Experiments; r=Unfocused
* Landing: Bug 985084 - Experiment add-ons should be disabled by default; r=Unfocused
* Landing: Bug 989137 - Part 1: Uninstall unknown experiments; r=Unfocused
* Landing: Bug 989137 - Part 2: Don't use a global logger; r=gfritzsche
* Landing: Bug 989137 - Part 3: Log.jsm API to get a Logger that prefixes messages; r=bsmedberg
* Landing: Bug 989137 - Part 4: Use a prefixing logger for Experiments logging; r=gfritzsche
* Landing: Bug 989137 - Part 5: Prefix each log message with the instance of the object; r=gfritzsche
* Review: Bug 988849 - Add mach target for jit tests; r=gps
* Landing: Bug 989137 - Part 6: Create experiment XPIs during the build; r=bsmedberg
* Landing: Bug 989137 - Part 7: Remove unncessary content from test experiments; r=Unfocused
* Landing: Bug 985084 - Part 2: Properly report userDisabled in the API; r=Unfocused

Which I can then copy and paste directly into the status tool to capture all my weekly code contributions! That takes a few seconds to run and saves me a few minutes of typing.

For the curious, let's break that Mercurial command down.

  • public() selects all public changesets. These are changesets in the repository that have been pushed to a publishing repository. In other words, patches that landed in Firefox.
  • me() is a custom revset from my mozext extension that parses the commit message and selects changesets that I authored or reviewed.
  • firstpushdate("-7") is a custom revset from my mozext extension. It selects changesets that were first pushed in the last 7 days (using pushlog data stored in a local SQLite database).

The template piece should be easy to read. I have a simple branch testing whether the changeset is a review or not, then output a label followed by the first line of the commit message.

I have this command saved under the [alias] section of my ~/.hgrc file so I can just type hg statusreport.

While there is room to improve the tool (stripping r= lines from commit messages for example), I think it's a pretty cool hack and shows how Mercurial can grow to solve problems you don't think your version control system knows how to solve.


« Previous Page -- Next Page »