<?xml version="1.0" encoding="UTF-8"?>
<feed
  xmlns="http://www.w3.org/2005/Atom"
  xmlns:thr="http://purl.org/syndication/thread/1.0"
  xml:lang="en"
   >
  <title type="text">Gregory Szorc's Digital Home</title>
  <subtitle type="text">Rambling on</subtitle>

  <updated>2013-05-19T20:42:38Z</updated>
  <generator uri="http://blogofile.com/">Blogofile</generator>

  <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog" />
  <id>http://gregoryszorc.com/blog/feed/atom/</id>
  <link rel="self" type="application/atom+xml" href="http://gregoryszorc.com/blog/feed/atom/" />
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Using Docker to Build Firefox]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/05/19/using-docker-to-build-firefox" />
    <id>http://gregoryszorc.com/blog/2013/05/19/using-docker-to-build-firefox</id>
    <updated>2013-05-19T13:45:00Z</updated>
    <published>2013-05-19T13:45:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <category scheme="http://gregoryszorc.com/blog" term="Firefox" />
    <summary type="html"><![CDATA[Using Docker to Build Firefox]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/05/19/using-docker-to-build-firefox"><![CDATA[<p>I have the privilege of having my desk located around a bunch of
really intelligent people from the Mozilla Services team. They've been
talking a lot about all the new technologies around server provisioning.
One that interested me is <a href="http://www.docker.io/">Docker</a>.</p>
<p>Docker is a pretty nifty piece of software. It's essentially a glorified
wrapper around <a href="http://lxc.sourceforge.net/">Linux Containers</a>. But,
calling it that is doing it an injustice.</p>
<p>Docker interests me because it allows simple environment isolation and
repeatability. I can create a run-time environment
once, package it up, then run it again on any other machine.
Furthermore, everything that runs in that environment is isolated from
the underlying host (much like a virtual machine). And best of all,
everything is fast and simple.</p>
<p>For my initial experimentation with Docker, I decided to create an
environment for building Firefox.</p>
<h2>Building Firefox with Docker</h2>
<p>To build Firefox with Docker, you'll first need to
<a href="http://www.docker.io/gettingstarted/">install</a> Docker. That's pretty
simple.</p>
<p>Then, it's just a matter of creating a new container with our build
environment:</p>
<div class="pygments_murphy"><pre>curl https://gist.github.com/indygreg/5608534/raw/30704c59364ce7a8c69a02ee7f1cfb23d1ffcb2c/Dockerfile | docker build
</pre></div>

<p>The output will look something like:</p>
<div class="pygments_murphy"><pre>FROM ubuntu:12.10
MAINTAINER Gregory Szorc &quot;gps@mozilla.com&quot;
RUN apt-get update
===&gt; d2f4faba3834
RUN dpkg-divert --local --rename --add /sbin/initctl &amp;&amp; ln -s /bin/true /sbin/initctl
===&gt; aff37cc837d8
RUN apt-get install -y autoconf2.13 build-essential unzip yasm zip
===&gt; d0fc534feeee
RUN apt-get install -y libasound2-dev libcurl4-openssl-dev libdbus-1-dev libdbus-glib-1-dev libgtk2.0-dev libiw-dev libnotify-dev libxt-dev mesa-common-dev uuid-dev
===&gt; 7c14cf7af304
RUN apt-get install -y binutils-gold
===&gt; 772002841449
RUN apt-get install -y bash-completion curl emacs git man-db python-dev python-pip vim
===&gt; 213b117b0ff2
RUN pip install mercurial
===&gt; d3987051be44
RUN useradd -m firefox
===&gt; ce05a44dc17e
Build finished. image id: ce05a44dc17e
ce05a44dc17e
</pre></div>

<p>As you can see, it is essentially <em>bootstrapping</em> an environment to
build Firefox.</p>
<p>When this has completed, you can activate a shell in the container by
taking the image id printed at the end and running it:</p>
<div class="pygments_murphy"><pre>docker run -i -t ce05a44dc17e /bin/bash
<span class="c"># You should now be inside the container as root.</span>
su - firefox
hg clone https://hg.mozilla.org/mozilla-central
<span class="nb">cd </span>mozilla-central
./mach build
</pre></div>

<p>If you want to package up this container for distribution, you just find
its ID then export it to a tar archive:</p>
<div class="pygments_murphy"><pre>docker ps -a
<span class="c"># Find ID of container you wish to export.</span>
docker <span class="nb">export </span>2f6e0edf64e8 &gt; image.tar
<span class="c"># Distribute that file somewhere.</span>
docker import - &lt; image.tar
</pre></div>

<p>Simple, isn't it?</p>
<h2>Future use at Mozilla</h2>
<p>I think it would be rad if Release Engineering used Docker for managing
their Linux builder configurations. Want to develop against the exact
system configuration that Mozilla uses in its automation? You could do
that. No need to worry about custom apt repositories, downloading
custom toolchains, keeping everything isolated from the rest of your
system, etc. Docker does all of that automatically. Mozilla simply needs to
publish Docker images on the Internet and anybody can come along and
reproduce the official environment with minimal effort. Once we do that,
there are few excuses for someone breaking Linux builds because of
an environment discrepancy.</p>
<p>Release Engineering could also use Docker to manage isolation of
environments between builds. For example, it could spin up a new
container for each build or test job. It could even save images from the
results of these jobs. Have a weird build failure like a segmentation
fault in the compiler? Publish the Docker image and have someone take a
look! No need to take the builder offline while someone SSHes into it.
No need to worry about the probing changing state because you can always
revert to the state at the time of the failure! And, builds would likely
start faster. As it stands, our automation <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=851294">spends minutes managing
packages</a> before
builds begin. This lag would largely be eliminated with Docker. If
nothing else, executing automation jobs inside a container would allow
us to extract accurate resource usage info (CPU, memory, I/O) since
the Linux kernel effectively gives containers their own namespace
independent of the global system's.</p>
<p>I might also explore publishing Docker images that construct an ideal
development environment (since getting recommended tools in the hands of
everybody is a hard problem).</p>
<p>Maybe I'll even consider hooking up build system glue to automatically
run builds inside containers.</p>
<p>Lots of potential here.</p>
<h2>Conclusion</h2>
<p>I encourage Linux users to play around with Docker. It enables some
new and exciting workflows and is a really powerful tool despite its
simplicity. So far, the only major faults I have with it are that the
docs say it should not be used in production (yet) and it only works on
Linux.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Build System Status Update 2013-05-14]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/05/13/build-system-status-update-2013-05-14" />
    <id>http://gregoryszorc.com/blog/2013/05/13/build-system-status-update-2013-05-14</id>
    <updated>2013-05-13T19:35:00Z</updated>
    <published>2013-05-13T19:35:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <category scheme="http://gregoryszorc.com/blog" term="build system" />
    <summary type="html"><![CDATA[Build System Status Update 2013-05-14]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/05/13/build-system-status-update-2013-05-14"><![CDATA[<p>I'd like to make an attempt at delivering regular status updates on the
Gecko/Firefox build system and related topics. Here we go with the
first instance. I'm sure I missed awesomeness. Ping me and I'll add it
to the next update.</p>
<h2>MozillaBuild Windows build environment updated</h2>
<p>Kyle Huey
<a href="https://groups.google.com/d/msg/mozilla.dev.platform/XRecAHF-H28/aSbrdKJLUNoJ">released version 1.7</a>
of our Windows build environment. It contains a newer version of Python
and a modern version of Mercurial among other features.</p>
<p><strong>I highly recommend every Windows developer update ASAP.</strong> Please note
that you will likely encounter Python errors unless you clobber your
build.</p>
<h2>New submodule and peers</h2>
<p>I used my power as module owner to create a submodule of the build
config module whose scope is the (largely mechanical) transition of
content from Makefile.in to moz.build files. I granted Joey Armstrong
and Mike Shal peer status for this module. I would like to eventually
see both elevated to build peers of the main build module.</p>
<h2>moz.build transition</h2>
<p>The following progress has been made:</p>
<ul>
<li>Mike Shal has converted variables related to defining XPIDL files in
  <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=818246">bug 818246</a>.</li>
<li>Mike Shal converted MODULE in
  <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=844654">bug 844654</a>.</li>
<li>Mike Shal converted EXPORTS in
  <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=846634">bug 846634</a>.</li>
<li>Joey Armstrong converted xpcshell test manifests in
  <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=844655">bug 844655</a>.</li>
<li>Brian O'Keefe converted PROGRAM in
  <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=862986">bug 862986</a>.</li>
<li>Mike Shal is about to land conversion of CPPSRCS in
  <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=864774">bug 864774</a>.</li>
</ul>
<h2>Non-recursive XPIDL generation</h2>
<p>In <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=850380">bug 850380</a>
I'm trying to land non-recursive building of XPIDL files. As part of
this I'm trying to combine the generation of .xpt and .h for each input
.idl file into a single process call because profiling revealed that
parsing the IDL consumes most of the CPU time. This shaves a few dozen
seconds off of build times.</p>
<p>I have encountered multiple pymake bugs when developing this patch, which
is the primary reason it hasn't landed yet.</p>
<h2>WebIDL refactoring</h2>
<p>I was looking at my build logs and noticed WebIDL generation was taking
longer than I thought it should. I filed
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=861587">bug 861587</a> to
investigate making it faster. While my initial profiling turned out to
be wrong, Boris Zbarsky looked into things and discovered that the
serialization and deserialization of the parser output was extremely
slow. He is currently trying to land a refactor of how WebIDL bindings
are handled. The early results look <strong>very</strong> promising.</p>
<p>I think the bug is a good example of the challenges we face improving
the build system, as Boris can surely attest.</p>
<h2>Test directory reorganization</h2>
<p>Joel Maher is injecting sanity into the naming scheme of test
directories in
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=852065">bug 852065</a>.</p>
<h2>Manifests for mochitests</h2>
<p>Jeff Hammel, Joel Maher, Ted Mielczarek, and I are working out using
manifests for mochitests (like xpcshell tests) in
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=852416">bug 852416</a>.</p>
<h2>Mach core is now a standalone package</h2>
<p>I extracted the mach core to a
<a href="https://github.com/indygreg/mach">standalone repository</a> and
<a href="https://pypi.python.org/pypi/mach/">added it to PyPI</a>.</p>
<p>Mach now <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=856392">categorizes</a>
commands in its help output.</p>
<h2>Requiring Python 2.7.3</h2>
<p>Now that the Windows build environment ships with Python 2.7.4, I've
filed <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=870420">bug 870420</a>
to require Python 2.7.3+ to build the tree. We already require
Python 2.7.0+. I want to bump the point release because there are
<a href="http://hg.python.org/cpython/file/d46c1973d3c4/Misc/NEWS">many</a> small
bug fixes in 2.7.3, especially around Python 3 compatibility.</p>
<p>This is currently blocked on RelEng rolling out 2.7.3 to all the
builders.</p>
<h2>Eliminating master xpcshell manifest</h2>
<p>Now that xpcshell test manifests are defined in moz.build files, we
theoretically don't need the master manifest. Joshua Cranmer is working
on removing it in
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=869635">bug 869635</a>.</p>
<h2>Enabling GTests and dual linking libxul</h2>
<p>Benoit Girard and Mike Hommey are working in
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=844288">bug 844288</a> to
dual link libxul so GTests can eventually be enabled and executed as
part of our automation.</p>
<p>This will regress build times since we need to link libxul twice. But,
giving C++ developers the ability to write unit tests with a real
testing framework is worth it, in my opinion.</p>
<h2>ICU landing</h2>
<p>ICU was briefly enabled in
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=853301">bug 853301</a> but
then backed out because it broke cross-compiling. It should be on track
for enabling in Firefox 24.</p>
<h2>Resource monitoring in mozbase</h2>
<p>I <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=802420">gave mozbase</a>
a class to record system resource usage. I plan to eventually hook this
up to the build system so the build system records how long it took to
perform key events. This will give us better insight into slow and
inefficient parts of the build and will help us track build system speed
improvements over time.</p>
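<p>As a rough illustration of recording how long key events take, here is a
minimal sketch. It is not the mozbase class: <em>PhaseTimer</em> and its methods
are hypothetical names, and it only tracks wall-clock time rather than CPU,
memory, or I/O.</p>

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical helper: records wall-clock duration of named build phases."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the timer can be exercised deterministically.
        self._clock = clock
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Accumulate so a phase entered more than once is summed.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Return phase names ordered from slowest to fastest."""
        return sorted(self.durations, key=self.durations.get, reverse=True)
```

<p>Wrapping each build step in <em>with timer.phase("compile"):</em> would be
enough to see which steps dominate a build and to compare runs over time.</p>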
<h2>Sorted lists in moz.build files</h2>
<p>I'm working on requiring lists in moz.build be sorted. Work is happening
in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=863069">bug 863069</a>.</p>
<p>This idea started as a suggestion on the dev-platform list. If anyone
has more great ideas, don't hold them back!</p>
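<p>To give an idea of what enforcing sorted lists could look like, here is a
small sketch. <em>SortedList</em> is a hypothetical name, not the type used in
the tree, and it assumes plain case-sensitive ordering of each chunk that is
assigned or appended.</p>

```python
class SortedList(list):
    """Hypothetical list type that rejects out-of-order content."""

    def __init__(self, iterable=()):
        items = list(iterable)
        self._check(items)
        super().__init__(items)

    @staticmethod
    def _check(items):
        # Each chunk must already be in sorted (case-sensitive) order.
        if items != sorted(items):
            raise ValueError("list is not sorted: %r" % (items,))

    def extend(self, iterable):
        items = list(iterable)
        self._check(items)
        super().extend(items)

    def __iadd__(self, other):
        # Route "+=" through extend() so the same check applies.
        self.extend(other)
        return self
```

<p>With something like this backing a variable, appending
<em>["z.cpp", "y.cpp"]</em> would raise an error when the file is read instead
of silently landing an unsorted list.</p>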
<h2>Smartmake added to mach</h2>
<p>Nicholas Alexander
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=677452">taught mach</a> how
to build intelligently by importing some of Josh Matthews' smartmake
tool's functionality into the tree.</p>
<h2>Source server fixed</h2>
<p>Kyle Huey and Ted Mielczarek
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=846864">collaborated</a> to
fix the source server.</p>
<h2>Auto clobber functionality</h2>
<p>Auto clobber functionality was added to the tree. After flirting briefly
with on-by-default, we changed it to opt-in. When you encounter it, it
will tell you how to enable it.</p>
<h2>Faster clobbers on automation</h2>
<p>I was looking at build logs and identified we were inefficiently
performing clobber.</p>
<p>Massimo Gervasini and Chris AtLee
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=851270">deployed changes</a>
to automation to make it more efficient. My measurements showed a
Windows try build that took 15 fewer minutes to start - a <em>huge</em>
improvement.</p>
<h2>Upgrading to Mercurial 2.5.4</h2>
<p>RelEng is <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=741353">tracking</a>
the global deployment of Mercurial 2.5.4. hg.mozilla.org is
currently running 2.0.2 and automation is all over the map. The upgrade
should make Mercurial operations faster and more robust across the
board.</p>
<p>I'm considering adding code to mach or the build system that prompts the
user when her Mercurial is out of date (since an out of date Mercurial
can result in a sub-par user experience).</p>
<h2>Parallelize reftests</h2>
<p>Nathan Froyd is
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=813742">leading an effort</a>
to parallelize reftest execution. If he pulls this off, it could shave
hours off of the total automation load per checkin. Go Nathan!</p>
<h2>Overhaul of MozillaBuild in the works</h2>
<p>I am mentoring a pair of interns this summer. I'm still working out the
final set of goals, but I'm keen to have one of them overhaul the
MozillaBuild Windows development environment. Cross your fingers.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Mozilla Build System Brain Dump]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/05/13/mozilla-build-system-brain-dump" />
    <id>http://gregoryszorc.com/blog/2013/05/13/mozilla-build-system-brain-dump</id>
    <updated>2013-05-13T17:25:00Z</updated>
    <published>2013-05-13T17:25:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="build system" />
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <category scheme="http://gregoryszorc.com/blog" term="Firefox" />
    <category scheme="http://gregoryszorc.com/blog" term="mach" />
    <summary type="html"><![CDATA[Mozilla Build System Brain Dump]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/05/13/mozilla-build-system-brain-dump"><![CDATA[<p>I hold a lot of context in my head when it comes to the future of
Mozilla's build system and the interaction with it. I wanted to
perform a brain dump of sorts so people have an idea of where I'm
coming from when I inevitably propose radical changes.</p>
<h2>The sad state of build system interaction and the history of mach</h2>
<p>I believe that Mozilla's build system has had a poor developer
experience for as long as there has been a Mozilla build system.
Getting started with Firefox development was a rite of passage. It
required following (often out-of-date) directions on MDN. It
required finding pages through MDN search or asking other people
for info over IRC. It was the kind of process that turned away
potential contributors because it was just too damn hard.</p>
<p>mach - while born out of my initial efforts to radically change
the build system proper - morphed into a generic command
dispatching framework by the time it landed in mozilla-central.
It has one overarching purpose: provide a single gateway point for
performing common developer tasks (such as building the tree and
running tests). The concept was nothing new - individual developers
had long coded up scripts and tools to streamline workflows. Some
even published these for others to use. What set mach apart was a
unified interface for these <em>commands</em> (the mach script in the
top directory of a checkout) and that these productivity gains
were <strong>in the tree</strong> and thus easily discoverable and usable by
<em>everybody</em> without significant effort (just run <em>mach help</em>).</p>
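<p>For readers who have not seen a command dispatching framework, the idea
can be sketched in a few lines. This is not mach's actual API: the
<em>command</em> decorator, the <em>COMMANDS</em> registry, and the
<em>dispatch</em> function are hypothetical names chosen for illustration.</p>

```python
# Hypothetical registry mapping command names to (handler, description).
COMMANDS = {}


def command(name, description):
    """Decorator that registers a function as a named command."""
    def register(func):
        COMMANDS[name] = (func, description)
        return func
    return register


@command("build", "Build the tree.")
def build(args):
    return "building %s" % (" ".join(args) or "everything")


@command("xpcshell-test", "Run xpcshell tests.")
def xpcshell_test(args):
    return "running xpcshell tests in %s" % (" ".join(args) or "all directories")


def dispatch(argv):
    """Route argv to a registered command, or list what is available."""
    if not argv or argv[0] == "help" or argv[0] not in COMMANDS:
        lines = ["%s  %s" % (n, d) for n, (f, d) in sorted(COMMANDS.items())]
        return "\n".join(lines)
    handler, _ = COMMANDS[argv[0]]
    return handler(argv[1:])
```

<p>Because every command registers itself in one place, a single entry point
can both run commands and list them, which is what makes them discoverable.</p>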
<p>While mach doesn't yet satisfy everyone's needs, it's slowly
growing new features and making developers' lives easier with
every one. All of this is happening despite the fact that
not a single person is tasked with working on mach full time.
Until a few months ago, mach was largely my work. Recently, Matt
Brubeck has been contributing a flurry of enhancements - thanks
Matt! Ehsan Akhgari and Nicholas Alexander have contributed a
few commands as well! There are also a few people with a single
command to their name. This is fulfilling my original vision of
facilitating developers to scratch their own itches by
contributing mach commands.</p>
<p>I've noticed more people referencing mach in IRC channels. And,
more people get angry when a mach command breaks or changes
behavior. So, I consider the mach experiment a success. Is it
perfect? No. If it's not good enough for you, please file a bug
and/or code up a patch. If nothing else, please tell me: I love to
know about everyone's subtle requirements so I can keep them in
mind when refactoring the build system and hacking on mach.</p>
<h2>The object directory is a black box</h2>
<p>One of the ideas I'm trying to advance is that the object directory
should be considered a black box for the majority of developers. In
my ideal world, developers don't need to look inside the object
directory. Instead, they interact with it through condoned and
supported tools (like mach).</p>
<p>I say this for a few reasons. First, as the build config module owner
I would like the ability to massively refactor the <em>internals</em> of
the object directory without disrupting workflows. If people are
interacting directly with the object directory, I get significant
push back if things change. This inevitably holds back much-needed
improvements and triggers resentment towards me, build peers, and
the build system. Not a good situation. Whereas if people are
indirectly interacting with the object directory, we simply need to
maintain a consistent interface (like mach) and nobody should care
if things change.</p>
<p>Second, I believe that the methods used when directly interacting
with the object directory are often sub-par compared with going
through a more intelligent tool and that productivity suffers as a
result. For example, when you type <em>make</em> inside the object
directory you need to know to pass <em>-j8</em>, use make vs pymake,
and that you also need to build <em>toolkit/library</em>, etc.
Also, by invoking make directly, you bypass other handy features,
such as automatic compiler warning aggregation (which only happens
if you invoke the build system through mach). If you go through a
tool like <em>mach</em>, you <em>should</em> automatically get the most ideal
experience possible.</p>
<p>In order for this vision to be realized, we need massive
improvements to tools like mach to cover the missing workflows that
still require direct object directory interaction. We also need people
to start using mach. I think increased mach usage comes after mach
has established itself as obviously superior to the alternatives
(I already believe it offers this for tasks like running tests).</p>
<h2>I don't want to force mach upon people but...</h2>
<p>Nobody likes when they are forced to change a process that has been
familiar for years. Developers especially. I get it. That's why
I've always attempted to position mach as an alternative to
existing workflows. If you don't like mach, you can always fall
back to the previous workflow. Or, you can improve mach (patches
more than welcome!). Having gone down the
please-use-this-tool-it's-better road before at other
organizations, I strongly believe that the best method to encourage
adoption of a new tool is to gradually sway people through
obvious superiority and praise (as opposed to a mandate to switch).
I've been trying this approach with mach.</p>
<p>Lately, more and more people have been saying things like
<em>we should have the build infrastructure build through mach
instead of client.mk</em> and <em>why do we need testsuite-targets.mk when
we have mach commands.</em> While I personally feel that client.mk
and testsuite-targets.mk are antiquated as a developer-facing
interface compared to mach, I'm reluctant to eliminate them because
I don't like forcing change on others. That being said, there are
compelling reasons to eliminate or at least refactor how they work.</p>
<p>Let's take <em>testsuite-targets.mk</em> as an example. This is the make
file that provides the targets to run tests (like <em>make xpcshell-test</em>
and <em>make mochitest-browser-chrome</em>). What's interesting about this
file is that it's only used in local builds: our automation
infrastructure does not use <em>testsuite-targets.mk</em>! Instead,
<em>mozharness</em> and the old buildbot configs manually build up the
command used to invoke the test harnesses. Initially, the mach
commands for running tests simply invoked make targets defined
in <em>testsuite-targets.mk</em>. Lately, we've been converting the mach
commands to invoke the Python test runners directly. I'd argue that
the logic for <em>invoke the test runner</em> only needs to live in one
place in the tree. Furthermore, as a build module peer, I have little
desire to support multiple implementations. Especially considering
how fragile they can be.</p>
<p>I think we're trending towards an outcome where mach (or the code
behind mach commands) transitions into the authoritative invocation
method and <em>legacy</em> interfaces like <em>client.mk</em> and
<em>testsuite-targets.mk</em> are reimplemented to either call mach
commands or the same routine that powers them. Hopefully this
will be completely transparent to developers.</p>
<h2>The future of mozconfigs and environment configuration</h2>
<p><em>mozconfig</em> files are shell scripts used to define variables consumed
by the build system. They are the only officially supported mechanism
for configuring how the build system works.</p>
<p>I'd argue mozconfig files are a mediocre solution at best. First,
there's the issue of mozconfig statements that don't actually do
anything. I've seen no-op mozconfig content cargo culted into the
in-tree mozconfigs (used for the builder configurations)! Oops.
Second, doing things in mozconfig files is just awkward. Defining
the object directory requires <em>mk_add_options MOZ_OBJDIR=some-path</em>.
What's <em>mk_add_options</em>? If <em>some-path</em> is relative, what is it
relative <em>to</em>? While certainly addressable, the documentation on
how mozconfig files work is not terrific and fails to explain many
pitfalls. Even with proper documentation, there's still the issue
of the file format allowing no-op variable assignments to persist.</p>
<p>I'm very tempted to reinvent build configuration as something not
mozconfigs. What exactly, I don't know. mach has support for ini-like
configuration files. We could certainly have mach and the build
system pull configs from the same file.</p>
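<p>One way an ini-like file could avoid the no-op assignment problem is to
reject options nobody consumes. The following is only a sketch under
assumptions: the <em>[build]</em> section name and the option names in
<em>KNOWN_OPTIONS</em> are invented for illustration and are not an existing
format.</p>

```python
import configparser

# Hypothetical set of options the build system understands.
KNOWN_OPTIONS = {"objdir", "application", "debug"}


def load_build_config(text):
    """Parse ini-style text and reject options that would be no-ops."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("build"):
        raise ValueError("missing [build] section")
    options = dict(parser.items("build"))
    # Anything not in the known set is either a typo or dead configuration.
    unknown = sorted(set(options) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError("unknown build options: %s" % ", ".join(unknown))
    return options
```

<p>A misspelled or obsolete option then fails loudly at load time instead of
being cargo culted from one config to the next.</p>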
<p>I'm not sure what's going to happen here. But deprecating mozconfig
files as they are today is part of many of the options.</p>
<h2>Handling multiple mozconfig files</h2>
<p>A lot of developers only have a single mozconfig file (per source tree
at least). For these developers, life is easy. You simply install
your mozconfig in one of the default locations and it's automagically
used when you use mach or client.mk. Easy peasy.</p>
<p>I'm not sure what the relative numbers are, but many developers
maintain multiple mozconfig files per source tree. e.g. they'll
have one mozconfig to build desktop Firefox and another one for
Android. They may have debug variations of each.</p>
<p>Some developers even have a single mozconfig file but leverage the
fact that mozconfig files are shell scripts and have their
mozconfig dynamically do things depending on the current working
directory, value of an environment variable, etc.</p>
<p>I've also seen wrapper scripts that glorify setting environment
variables, changing directory, etc and invoke a command.</p>
<p>I've been thinking a lot about providing a common and well-supported
solution for switching between active build configurations.
<a href="https://developer.mozilla.org/en-US/docs/Developer_Guide/mach#Adding_mach_to_your_shell%27s_search_path">Installing mach on $PATH</a>
goes a long way to facilitate this. If you are in an object
directory, the mozconfig used when that object directory was
created is automatically applied. Simple enough. However, I want
people to start treating object directories as black boxes. So, I'd
rather not see people have their shell inside the object directory.</p>
<p>Whenever I think about solutions, I keep arriving at a
virtualenv-like solution. Developers would potentially need to
<em>activate</em> a Mozilla build environment (similar to how Windows
developers need to launch MozillaBuild). Inside this environment,
the shell prompt would contain the name of the current build
configuration. Users could switch between configurations using
<em>mach switch</em> or some other magic command on the $PATH.</p>
<p>Truth be told, I'm skeptical that people would find this useful. I'm
not sure it's that much better than exporting the MOZCONFIG
environment variable to define the active config. This one requires
more thought.</p>
<h2>The integration between the build environment and Python</h2>
<p>We use Python extensively in the build system and for common
developer tasks. mach is written in Python. moz.build processing
is implemented in Python. Most of the test harnesses are written in
Python.</p>
<p>Doing practically anything in the tree requires a Python
interpreter that knows about all the Python code in the tree and
how to load it.</p>
<p>Currently, we have two very similar Python environments. One is
a virtualenv created while running configure at the beginning of
a build. The other is essentially a cheap knock-off that mach
creates when it is launched.</p>
<p>At some point I'd like to consolidate these Python environments.
From any Python process we should have a way to automatically
bootstrap/activate into a well-defined Python environment. This
certainly sounds like establishing a unified Python virtualenv
used by both the build system and mach.</p>
<p>Unfortunately, things aren't straightforward. The virtualenv today
is constructed in the object directory. How do we determine the
current object directory? By loading the mozconfig file. How do we
do that? Well, if you are mach, we use Python. And, how does mach
know where to find the code to load the mozconfig file? You can
see the dilemma here.</p>
<p>A related issue is that of portable build environments. Currently, a
lot of our automation recreates the build system's virtualenv from
its own configuration (not that from the source tree). This has
and will continue to bite us. We'd <em>really</em> like to package up the
virtualenv (or at least its config) with tests so there is no
potential for discrepancy.</p>
<p>The inner workings of how we integrate with Python should be
invisible to most developers. But, I figured I'd capture it
here because it's an annoying problem. And, it's also related
to an <em>activated</em> build environment. What if we required all
developers to <em>activate</em> their shell with a Mozilla build
environment (like we do on Windows)? Not only would this solve
Python issues, but it would also facilitate simpler config
switching (outlined above). Hmmm...</p>
<h2>Direct interaction with the build system considered harmful</h2>
<p>Ever since there was a build system, developers have been typing
<em>make</em> (or <em>make.py</em>) to build the tree. One of the goals of the
transition to <em>moz.build</em> files is to facilitate building the tree
with Tup. <em>make</em> will do nothing when you're not using Makefiles!
Another goal of the <em>moz.build</em> transition is to start
derecursifying the make build system such that we build things in
parallel. It's likely we'll produce monolithic make files and then
process <em>all</em> targets for a related class (<em>IDLs</em>, <em>C++ compilation</em>,
etc.) in one invocation of <em>make</em>. So, uh, what happens during a partial
tree build? If a .cpp file from <em>/dom/src/storage</em> is being handled by
a monolithic make file invoked by the Makefile at the top of the
tree, how does a partial tree build pick that up? Does it build just
that target or every target in the monolithic/non-recursive make file?</p>
<p>Unless the build peers go out of our way to install redundant targets
in leaf Makefiles, directly invoking <em>make</em> from a subdirectory of
the tree won't do what it's done for years.</p>
<p>As I said above, I'm sympathetic to forced changes in procedure, so
it's likely we'll provide backwards-compatible behavior. But, I'd
prefer to not do it. I'd first prefer partial-tree builds are not
necessary and a full tree build finishes quickly. But, we're not going
to get there for a bit. As an alternative, I'll take people building
through <em>mach build</em>. That way, we have an easily extensible interface
on which to build partial tree logic. We saw this recently when
dumbmake/smartmake landed. And, going through <em>mach</em> also reinforces my
ideal that the object directory is a black box.</p>
<h2>Semi-persistent state</h2>
<p>Currently, most state as it pertains to a checkout or build is in the
object directory. This is fine for artifacts from the build system.
However, there is a whole class of state that arguably shouldn't be in
the object directory. Specifically, it shouldn't be clobbered when you
rebuild. This includes logs from previous builds, the warnings database,
previously failing tests, etc. The list is only going to grow over time.</p>
<p>I'd like to establish a location for semi-persistent state related to
the tree and builds. Perhaps we change the clobber logic to ignore a
specific directory. Perhaps we start storing things in the user's home
directory. Perhaps we could establish a second <em>object directory</em> named
the <em>state directory</em>? How would this interact with <em>build environments</em>?</p>
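<p>As a rough illustration of the first option (a clobber that ignores a
specific directory), here is a minimal Python sketch. The <em>.state</em>
directory name and the <em>clobber</em> helper are hypothetical; nothing like
this exists in the tree today.</p>

```python
import os
import shutil
import tempfile

# Hypothetical name; the post does not settle on where this state would live.
STATE_DIR_NAME = ".state"


def clobber(objdir, keep=(STATE_DIR_NAME,)):
    """Delete everything directly inside objdir except the entries in `keep`."""
    removed = []
    for entry in sorted(os.listdir(objdir)):
        if entry in keep:
            continue
        path = os.path.join(objdir, entry)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
        removed.append(entry)
    return removed


def demo():
    """Build a throwaway object directory, clobber it, and report the result."""
    objdir = tempfile.mkdtemp()
    try:
        os.mkdir(os.path.join(objdir, "dist"))
        os.mkdir(os.path.join(objdir, STATE_DIR_NAME))
        with open(os.path.join(objdir, STATE_DIR_NAME, "warnings.json"), "w") as fh:
            fh.write("{}")
        removed = clobber(objdir)
        return removed, sorted(os.listdir(objdir))
    finally:
        shutil.rmtree(objdir)


print(demo())
```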
<p>This will probably sit on the backburner until there is a compelling use
case for it.</p>
<h2>The battle against C++</h2>
<p>Compiling C++ consumes the bulk of our build time. Anything we can do to
speed up C++ compilation will work wonders for our build times.</p>
<p>I'm optimistic things like precompiled headers and compiling multiple
.cpp files with a single process invocation will drastically decrease
build times. However, no matter how much work we put in to make C++
compilation faster, we still have a giant issue: dependency hell.</p>
<p>As <a href="/presentations/2012-11-29-firefox-build-system/#34">shown</a> in my
build system presentation a few months back, we have dozens of header
files included by hundreds if not thousands of C++ files. If you change
one file, you invalidate build dependencies and trigger a rebuild. This
is why whenever files like mozilla-config.h change you are essentially
confronted with a full rebuild. ccache may help if you are lucky. But, I
fear that as long as headers proliferate the way they do, there is
little the build system by itself can do.</p>
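<p>To make the fan-out concrete, here is a minimal Python sketch. The include
graph is invented for illustration (only <em>mozilla-config.h</em> is a file
name taken from above), and a real build system tracks far more than direct
includes.</p>

```python
from collections import defaultdict

# Hypothetical include graph: each file maps to the headers it includes directly.
INCLUDES = {
    "a.cpp": ["nsFoo.h"],
    "b.cpp": ["nsBar.h"],
    "c.cpp": [],
    "nsFoo.h": ["mozilla-config.h"],
    "nsBar.h": ["mozilla-config.h"],
}


def files_to_rebuild(changed, includes):
    """Return every .cpp file that directly or transitively includes `changed`."""
    # Invert the graph: header -> set of files that include it.
    included_by = defaultdict(set)
    for source, headers in includes.items():
        for header in headers:
            included_by[header].add(source)

    # Walk outwards from the changed file through everything that includes it.
    seen = set()
    stack = [changed]
    while stack:
        current = stack.pop()
        for dependent in included_by[current]:
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return sorted(name for name in seen if name.endswith(".cpp"))


print(files_to_rebuild("mozilla-config.h", INCLUDES))
print(files_to_rebuild("nsFoo.h", INCLUDES))
```

<p>In this toy graph, touching the widely included header invalidates more
translation units than touching a header with a single user. That is the
effect described above, at a much smaller scale.</p>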
<p>My attitude towards this is to wait and see what we can get out of
precompiled headers and the like. Maybe that makes it good enough. If
not, I'll likely be making a lot of noise at Platform meetings
requesting that C++ gurus brainstorm on a solution for reducing
header proliferation.</p>
<h2>Conclusion</h2>
<p>Believe it or not, these are only some of the topics floating around in
my head! But I've probably managed to bore everyone enough so I'll
call it a day.</p>
<p>I'm always interested in opinions and ideas, especially if they are
different from mine. I encourage you to leave a comment if you have
something to say.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[The State of Mercurial at Mozilla]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/05/13/the-state-of-mercurial-at-mozilla" />
    <id>http://gregoryszorc.com/blog/2013/05/13/the-state-of-mercurial-at-mozilla</id>
    <updated>2013-05-13T13:25:00Z</updated>
    <published>2013-05-13T13:25:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mercurial" />
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <summary type="html"><![CDATA[The State of Mercurial at Mozilla]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/05/13/the-state-of-mercurial-at-mozilla"><![CDATA[<p>I have an opinion on the usage of Mercurial at Mozilla: it stinks.</p>
<p>Here's why.</p>
<h2>The server is configured poorly</h2>
<p>Our Mozilla server, hg.mozilla.org, is currently running Mercurial 2.0.2.
In terms of Mercurial features, stability, and performance, we are light
years behind.</p>
<p>You know that annoying phases configuration you need to set when pushing
to Try? That's because the server isn't new enough to tell the client the
same thing the configuration option does. It
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=725362">will be fixed</a>
when the server is upgraded to 2.1+.</p>
<p>Furthermore, we are running the server over NFS, which introduces known
badness, including slowness.</p>
<p>I believe we blame Mercurial for issues that would go away if we
configured the Mercurial server properly.</p>
<p>Fortunately, it appears the
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=781012">upgrade to 2.5</a>
is near and I've heard we're moving from NFS to local disk storage as
part of that. This should go a long way to making the server better.
The upgrade can't happen soon enough.</p>
<h2>User education is poor</h2>
<p>I think a lot of people are ignorant of the features and abilities of
Mercurial.</p>
<p>I commonly hear people are dissatisfied with the behavior of their
Mercurial client. They encounter performance issues, bugs, corruption,
etc. Nine times out of ten this is due to running an old Mercurial
release. Just last Friday someone on my team asked me about
weird behavior involving file case. My first question: <em>what version
of Mercurial are you using?</em> He was running 2.0.2. I told him to
upgrade to 2.5+. It fixed his problem. <strong>If you aren't running
Mercurial 2.5 or newer, upgrade immediately.</strong></p>
<p>I've heard people say we should switch to Git because Git has feature X.
Most of the time, Mercurial has these features. Unfortunately, people
just don't realize it. When I point them at
<a href="http://mercurial.selenic.com/wiki/UsingExtensions">Mercurial's extensions list</a>
their eyes light up and they thank me for making their lives easier.</p>
<p>I think a problem is a lot of new Mozilla contributors knew Git
before and only pick up the bare essentials of Mercurial that allow them
to land patches. They prefer Git because it is familiar and just don't
bother to pick up Mercurial. The potential of Mercurial is thus lost on
them.</p>
<p>Perhaps we should have a brown bag and/or better documentation on
getting the most out of Mercurial?</p>
<h2>The branching model is far from ideal</h2>
<p>For Gecko/Firefox development, we maintain separate repositories for the
trunk and release branches. This introduces all kinds of annoying.</p>
<p>We should not have separate repositories for <em>central</em>, <em>inbound</em>,
<em>aurora</em>, <em>beta</em>, <em>release</em>, etc. We should be using some combination of
branches and bookmarks and have all the release heads in one
repository, just like how the
<a href="https://github.com/mozilla/mozilla-central/">GitHub mirror</a> is
configured.</p>
<p>As an experiment, I created a
<a href="https://hg.mozilla.org/users/gszorc_mozilla.com/gecko">unified Mercurial repository</a>.
Each current repository is tracked as a bookmark (there are
<a href="https://developer.mozilla.org/en-US/docs/Developer_Guide/Source_Code/Mercurial">instructions</a>
for reproducing this). Unfortunately, the web interface isn't showing
bookmarks (perhaps because the version of Mercurial is too old?), so
you'll have to clone the repository to play around. Just run
<em>hg bookmarks</em> and e.g. <em>hg up aurora</em> after cloning.
<em>Warning: I'm not actively synchronizing this repository, so don't rely
on it being up to date</em>.</p>
<p>A Mercurial contributor (who is familiar with Mozilla's
development model) suggested we use Mercurial branches for every Gecko
release (20, 21, 22, etc). I think this and other uses of branches and
bookmarks are ideas worth exploring.</p>
<h2>We're failing to harness the extensibility</h2>
<p>Gecko/Firefox has a complicated code lifecycle and landing process.
This could be significantly streamlined if we fully harnessed and
embraced the extensibility of Mercurial. While there are some
Mozilla-centric extensions (details in my
<a href="/blog/2013/05/12/thoughts-on-mercurial-%28and-git%29/">recent post</a>),
I don't think they are well known or widely used.</p>
<p>I think Mozilla should embrace the functionality of extensions like
these (whether they be for Mercurial, Git, or something else) and invest
resources in improving the workflows for all developers. Until these
tools are obviously superior and advertised, I believe many developers
will unknowingly continue to toil without them. And, it's likely hurting
our ability to attract and retain new contributors as well.</p>
<h2>Conclusion</h2>
<p>Mozilla's current usage of Mercurial is far from ideal. It's no wonder
people don't like Mercurial (and why some want to switch to Git).</p>
<p>Fortunately, little has to do with shortcomings of Mercurial itself (at
least with newer versions). If you want to know why Mercurial isn't
working too well for Gecko/Firefox development, most of the problems
are self-inflicted or the solutions reside within each of us. Time
will tell if we as a community have the will to address these issues.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Thoughts on Mercurial (and Git)]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/05/12/thoughts-on-mercurial-(and-git)" />
    <id>http://gregoryszorc.com/blog/2013/05/12/thoughts-on-mercurial-(and-git)</id>
    <updated>2013-05-12T12:00:00Z</updated>
    <published>2013-05-12T12:00:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <category scheme="http://gregoryszorc.com/blog" term="Mercurial" />
    <category scheme="http://gregoryszorc.com/blog" term="Git" />
    <summary type="html"><![CDATA[Thoughts on Mercurial (and Git)]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/05/12/thoughts-on-mercurial-(and-git)"><![CDATA[<p>My first experience with Mercurial (Firefox development)
was very unpleasant. Coming from Git, I thought Mercurial was slow
and perhaps even more awkward to use than Git. I frequently
encountered repository corruption that required me to reclone. I thought
the concept of a patch queue was silly compared to Git branches. It
was all extremely frustrating and I dare say a hindrance to my
productivity. It didn't help that I was surrounded by a bunch of people
who had previous experience with Git and opined about every minute
difference.</p>
<p>Two years later and I'm on much better terms with Mercurial. I initially
thought it might be Stockholm Syndrome, but after reflection I can point
at specific changes and enlightenments that have reshaped my opinions.</p>
<h2>Newer versions of Mercurial are much better</h2>
<p>I first started using Mercurial in the 1.8 days and thought it was
horrible. However, modern releases are much, much better. I've noticed
a steady improvement in the quality and speed of Mercurial in the last
few years.</p>
<p><strong>If you aren't running 2.5 or later (Mercurial 2.6 was released earlier
this month), you should take the time to upgrade today.</strong> When you upgrade,
you should of course read the
<a href="http://mercurial.selenic.com/wiki/WhatsNew">changelog</a> and
<a href="http://mercurial.selenic.com/wiki/UpgradeNotes">upgrade notes</a> so you
can make the most of the new features.</p>
<h2>Proper configuration is key</h2>
<p>For <em>my</em> workflow, the default configuration of Mercurial out of the box
is... far from optimal. There are a number of basic changes that need to
be made to satisfy <em>my</em> expectations for a version control tool.</p>
<p>I used to think this was a shortcoming with Mercurial: why not ship a
powerful and useful environment out of the box? But, after talking to a
Mercurial core contributor, this is mostly by design. Apparently a
principle of the Mercurial project is that the CLI tool (<em>hg</em>) should be
simple by default and should minimize foot guns. They view actions like
rebasing and patch queues as advanced and thus don't have them enabled
by default. Seasoned developers may scoff at this. But, I see where
Mercurial is coming from. I only need to refer everyone to her first
experience with Git as an example of what happens when you don't aim for
simplicity. (I've never met a Git user who didn't think it overly
complicated at first.)</p>
<p>Anyway, to get the most out of Mercurial, it is essential to configure
it to your liking, much like you install plugins or extensions in your
code editor.</p>
<p><strong>Every person running Mercurial should go to
<a href="http://mercurial.selenic.com/wiki/UsingExtensions">http://mercurial.selenic.com/wiki/UsingExtensions</a>
and take the time to find extensions that will make your life better</strong>.
You should also run <em>hg help hgrc</em> to view all the configuration
options. There is a mountain of productivity wins waiting to be realized.</p>
<p>For reference, my <a href="https://gist.github.com/indygreg/5511712">~/.hgrc</a>.
Worth noting are some of the built-in extensions I've enabled:</p>
<ul>
<li>color - Colorize terminal output. Clear UX win.</li>
<li>histedit - Provides <em>git rebase --interactive</em> behavior.</li>
<li>pager - Feed command output into a pager (like <em>less</em>). Clear UX win.</li>
<li>progress - Draw progress bars on long-running operations. Clear UX
  win.</li>
<li>rebase - Ability to easily rebase patches on top of other heads.
  This is a basic feature of patch management.</li>
<li>transplant - Easily move patches between repositories, branches, etc.</li>
</ul>
<p>If I were on Linux, I'd also use the <em>inotify</em> extension, which installs
filesystem watchers so operations like <em>hg status</em> are instantaneous.</p>
<p>In addition to the built-in extensions, there are a number of 3rd party
extensions that improve my Mozilla workflow:</p>
<ul>
<li><a href="https://bitbucket.org/sfink/mqext">mqext</a> - Automatically commit to
  your patch queue when you qref, etc. This is a lifesaver. If that's
  not enough, it suggests reviewers for your patch, suggests a bug
  component, and lets you find bugs touching the files you are
  touching.</li>
<li><a href="https://github.com/pbiggar/trychooser">trychooser</a> - Easily push
  changes to Mozilla's Try infrastructure.</li>
<li><a href="https://hg.mozilla.org/users/robarnold_cmu.edu/qimportbz">qimportbz</a> -
  Easily import patches from Bugzilla.</li>
<li><a href="https://hg.mozilla.org/users/tmielczarek_mozilla.com/bzexport">bzexport</a> -
  Easily export patches to Bugzilla.</li>
</ul>
<p>I'm amazed more developers don't use these simple productivity wins.
Could it be that people simply don't realize they are available?</p>
<p>Mozilla has a <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=794580">bug</a>
tracking easier configuration of the user's Mercurial environment. My
hope is one day people simply run a single command and get a
Mozilla-optimized Mercurial environment that <em>just works</em>. In the same
vein, if your extensions are out of date, it prompts you to update them.
This is one of the benefits of a unified developer tool like mach: you
can put these checks in one place and everyone can reap the benefits
easily.</p>
<h2>Mercurial is extensible</h2>
<p>The major differentiator from almost every other version control system
(especially Git) is the ease and degree to which Mercurial can be
extended and contorted. <strong>If you take anything away from this
post it should be that Mercurial is a flexible and agile tool.</strong></p>
<p>If you want to change the behavior of a built-in command, you can write
an extension that monkeypatches that command. If you want to write a new
command, you can of course do that easily. You can have extensions
interact with one another - all natively. You can even override the wire
protocol to provide new <em>capabilities</em> to extend how peers communicate
with one another. You can leverage this to transfer additional metadata
or data types. This has nearly infinite potential. If that's not enough,
it's possible to create a new branching/development primitive through
just an extension alone! If you want to invent Git-style branches with
Mercurial, you could do that! It may require client and server support,
but it's possible.</p>
<p>Mercurial accomplishes this by being written (mostly) in Python (as
opposed to C) and by having a clear API on which extensions can be
built. Writing extensions in Python is huge. You can easily drop into
the debugger to learn the API and your write-test loop is much smaller.</p>
<p>By contrast, most other version control systems (including Git) require
you to parse output of commands (this is the UNIX piping principle).
Mercurial supports this too, but the native Python API is so much more
powerful. Instead of parsing output, you can just read the raw values
from a Python data structure. Yes please.</p>
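<p>The difference can be sketched in a few lines of Python. The sample text
imitates <em>hg status</em> output (a status letter, a space, then a path);
the <em>STRUCTURED</em> dict is only an illustration of already-parsed values
and is not Mercurial's real API.</p>

```python
# Sample text in the style of `hg status` output: status letter, space, path.
SAMPLE_OUTPUT = "M build/Makefile.in\nA python/new file.py\n? notes.txt\n"


def parse_status(text):
    """Group paths by status letter by slicing each line of command output."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Slice rather than split() so that paths containing spaces survive.
        code, path = line[0], line[2:]
        result.setdefault(code, []).append(path)
    return result


# What in-process code gets to work with instead: plain Python values.
# Illustration only; this is NOT the object Mercurial actually returns.
STRUCTURED = {
    "M": ["build/Makefile.in"],
    "A": ["python/new file.py"],
    "?": ["notes.txt"],
}

print(parse_status(SAMPLE_OUTPUT) == STRUCTURED)
```

<p>Every tool that consumes the text form has to get details like the
embedded space right. Code that receives the values directly never has to
care how they would be printed.</p>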
<p>Since I imagine a lot of people at Mozilla will be reading this, here
are some ways Mozilla could leverage the extensibility of Mercurial:</p>
<ul>
<li>Command to create try pushes (it exists - see above).</li>
<li>Record who pushed what when (we have this - it's called the pushlog).</li>
<li>Command to land patches. If inbound1 is closed,
  automatically rebase on inbound2. etc. This could even be
  monkeypatched into <em>hg push</em> so pushes to inbound are automatically
  intercepted and magic ensues.</li>
<li>Record the automation success/fail status against individual
  revisions and integrate with commands (e.g. only pull up to the most
  recent stable changeset).</li>
<li>Command to create a review request for a patch or patch queue.</li>
<li>Command to assist with reviews. Perhaps a reviewer wants to make minor
  changes. Mercurial could download and apply the patch(es), wait for
  your changes, then reupload to Bugzilla (or the review tool)
  automatically.</li>
<li>Annotating commits or pushes with automation info (which jobs to
  run, etc).</li>
<li>Find Bugzilla component for patch (it exists - see above).</li>
<li>Expose custom protocol for configuring automation settings for a
  repository or a head. e.g. clients (with access) could reconfigure
  PGO scheduling, coalescing, etc without having to involve RelEng -
  useful for twigs and lesser used repositories.</li>
<li>So much more.</li>
</ul>
<p>Essentially, Mercurial itself could become the CLI tool code development
centers around. Whether that is a good idea is up for
debate. But, it can. And that says a lot about the flexibility of
Mercurial.</p>
<h2>Future potential of Mercurial</h2>
<p>When you consider the previous three points, you arrive at a new one:
Mercurial has a ton of future potential. The fact that extensions can
evolve vanilla Mercurial into something that resembles Mercurial in
name only is a testament to this.</p>
<p>When I sat down with a Mercurial core contributor, they reinforced this.
To them, Mercurial is a core library with a limited set of user-facing
commands forming the stable API. Since core features (like storage) are
internal APIs (not public commands - like Git), this means they aren't
bound to backwards compatibility and can refactor internals as needed
and evolve over time without breaking the world. That is a terrific
luxury.</p>
<p>An example of this future potential is
<a href="http://mercurial.selenic.com/wiki/ChangesetEvolution">changeset evolution</a>.
If you don't know what that is, you should because it's awesome. One of
the things they figured out is how to propagate rebasing between
clones!</p>
<h2>Comparing to Git</h2>
<p>Two years ago I would have said I would never opt to use Mercurial over
Git. I cannot say that today.</p>
<p>I do believe Git still has the advantage over Mercurial in a few areas:</p>
<ul>
<li>Branch management. Mercurial branches are a non-starter for
  light-weight work. Mercurial bookmarks are kinda-sorta like Git
  branches, but not quite. I <em>really</em> like aspects of Git branches.
  Hopefully changeset evolution will cover the remaining gaps and more.</li>
<li>Patch conflict management. Git seems to do a better job of resolving
  patch conflicts. But, I think this is mostly due to Mercurial's patch
  queue extension not using the same merge code as built-in commands
  (this is a fixable problem).</li>
<li>Developer mind share and GitHub. The GitHub ecosystem makes up for
  many of Git's shortcomings. Bitbucket isn't the same.</li>
</ul>
<p>However, I believe Mercurial has the upper hand for:</p>
<ul>
<li>Command line friendliness. Git's command line syntax is notoriously
  awful and the concepts can be difficult to master.</li>
<li>Extensibility. It's so easy to program custom workflows and commands
  with Mercurial. If you want to hack your version control system,
  Mercurial wins hands down. Where Mercurial embraces extensibility, I
  couldn't even find a page listing all the useful Git <em>extensions</em>!</li>
<li>Open source culture. Every time I've popped into the Mercurial IRC
  channel I've had a good experience. I get a response quickly and
  without snark. Git by contrast, well, let's just say I'd rather be
  affiliated with the Mercurial crowd.</li>
<li>Future potential. Git is a content addressable key-value store with a
  version control system bolted on top. Mercurial is designed to be a
  version control system. Furthermore, Mercurial's code base is much
  easier to hack on than Git's. While Git has largely maintained feature
  parity in the last few years, Mercurial has grown new features. I see
  Mercurial evolving faster than Git and in ways Git cannot.</li>
</ul>
<p>It's worth calling out the major detractors for each.</p>
<p>I think Git's major weakness is its lack of extensibility and inability
to evolve (at least currently). Git will need to grow a better
extensibility model with better abstractions to compete with Mercurial
on new features. Or, the Git community will need to be receptive to
experimental features living in the core. All of this will require
some major API breakage. Unfortunately, I see little evidence this will
occur. I'm unable to find a <em>vision</em> document for the future of Git, a
branch with major new features, or interesting threads on the mailing
list. I tried to ask in their IRC channel and got crickets.</p>
<p>I think Mercurial's greatest weakness is lack of developer mindshare.
Git and GitHub are where it's at. This is huge, especially for projects
wanting collaboration.</p>
<p>Of all those points, I want to stress the extensibility and future
potential of Mercurial. If hacking your tools to maximize potential
and awesomeness is your game, Mercurial wins. End of debate. However,
if you don't want to harness these advantages, then I think Git and
Mercurial are mostly on equal footing. But given the rate of
development in the Mercurial project and relative stagnation of Git
(I can't name a major new Git feature in years), I wouldn't be
surprised if Mercurial's feature set obviously overtakes Git's in
the next year or two. Mind share will of course take longer and will
likely depend on what hosting sites like GitHub and Bitbucket do
(I wouldn't be surprised if GitHub rebranded as <em>CodeHub</em> or
something some day). Time will tell.</p>
<h2>Extending case study</h2>
<p><em>I have removed the case study that appeared in the original article
because as Mike Hommey observed in the comments, it wasn't a totally
accurate comparison. I don't believe the case study added
much to the post, so I likely won't write a new one.</em></p>
<h2>Conclusion</h2>
<p>From where I started with Mercurial, I never thought I'd say this. But
here it goes: I like Mercurial.</p>
<p>I started warming up when it became faster and more robust in recent
versions in the last few years. When I learned about its flexibility and
the fundamentals of the project and thus its future potential, I became
a true fan.</p>
<p>It's easy to not like Mercurial if you are a new user coming
from Git and are forced to use a new tool. But, once you take the time to
properly configure it and appreciate it for what it is and what it
can be, Mercurial is easy to like.</p>
<p>I think Mercurial and Git are both fine version control systems. I would
happily use either one for a new project. If the social aspects of
development (including encouraging new contributors) were important to
me, I would likely select Git and GitHub. But, if I wanted something
just for me or I was a large project looking for a system that scales
and is flexible or was looking to the future, I'd go with Mercurial.</p>
<p>Mercurial is a rising star in the version control world. It's getting
faster and better and enabling others to more easily innovate through
powerful extensions. The future is bright for this tool.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Mozilla Automation Load Over Time]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/05/06/mozilla-automation-load-over-time" />
    <id>http://gregoryszorc.com/blog/2013/05/06/mozilla-automation-load-over-time</id>
    <updated>2013-05-06T11:45:00Z</updated>
    <published>2013-05-06T11:45:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <summary type="html"><![CDATA[Mozilla Automation Load Over Time]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/05/06/mozilla-automation-load-over-time"><![CDATA[<p><a href="/images/mozilla-automation-load-1.png">This chart</a> plots per-month sums
of total time of jobs in Mozilla's automation in days. The line running
through it is a best fit linear regression.</p>
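<p>For anyone unfamiliar with how such a line is computed, here is a minimal
least-squares sketch in Python. The monthly totals in it are made up for the
example and are not the data behind the chart.</p>

```python
def linear_fit(ys):
    """Ordinary least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up per-month totals in days; NOT the data behind the chart.
monthly_days = [100.0, 120.0, 140.0, 160.0]
slope, intercept = linear_fit(monthly_days)
print(slope, intercept)
```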
<p>The raw data is <a href="https://gist.github.com/indygreg/5527091">available</a>.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[SQLite.jsm - SQLite Done Betterer]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/04/14/sqlite.jsm---sqlite-done-betterer" />
    <id>http://gregoryszorc.com/blog/2013/04/14/sqlite.jsm---sqlite-done-betterer</id>
    <updated>2013-04-14T23:55:00Z</updated>
    <published>2013-04-14T23:55:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <summary type="html"><![CDATA[SQLite.jsm - SQLite Done Betterer]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/04/14/sqlite.jsm---sqlite-done-betterer"><![CDATA[<p>Did you know there is now a better way to interact with SQLite from
JavaScript in Firefox? It's called
<a href="https://developer.mozilla.org/en-US/docs/Mozilla/JavaScript_code_modules/Sqlite.jsm">SQLite.jsm</a>
and it's available in Firefox 20 and later.</p>
<p>SQLite.jsm is an abstraction around the low-level Storage APIs that you
would have used before. However, it eliminates most of the footguns and
makes it easier to write code that doesn't jank the browser and is less
prone to memory leaks. It even has an API to free as much memory as
possible from the current connection!</p>
<p>If you currently use SQLite via Storage, I highly recommend taking a
look at SQLite.jsm - especially if you are using synchronous APIs.</p>
<p>If you are investigating SQLite for the storage needs of your add-on or
browser feature, please keep in mind that SQLite can incur lots of
filesystem I/O and may run slowly on old machines (especially with
magnetic hard drives) and especially with its default configuration.
You may be interested in low-level file I/O using
<a href="https://developer.mozilla.org/en-US/docs/JavaScript_OS.File">OS.File</a>
instead.</p>
<p>If you insist on using SQLite, please educate yourself on and then
seriously consider using
<a href="https://www.sqlite.org/wal.html">Write-Ahead Logging</a> mode on your
database. Some detailed discussion on SQLite behavior as it pertains to
Firefox can be found in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=830492">bug 830492</a>.
I hope to eventually incorporate more <em>sane by default</em> connection
options into SQLite.jsm to make it easier for add-ons and browser
features to have the least-impactful behavior by default (e.g. enable
WAL by default). Until then, please, please, please research
<a href="https://www.sqlite.org/pragma.html">PRAGMA</a> statements to optimize
how your SQLite database runs so it has as little performance overhead
as possible. Also consider dropping into an IRC channel on
irc.mozilla.org and asking for advice from one of the many who have
fallen into SQLite's many performance pitfalls (including me).</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Making hg-git Faster]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/04/14/making-hg-git-faster" />
    <id>http://gregoryszorc.com/blog/2013/04/14/making-hg-git-faster</id>
    <updated>2013-04-14T21:45:00Z</updated>
    <published>2013-04-14T21:45:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <category scheme="http://gregoryszorc.com/blog" term="Mercurial" />
    <category scheme="http://gregoryszorc.com/blog" term="Git" />
    <summary type="html"><![CDATA[Making hg-git Faster]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/04/14/making-hg-git-faster"><![CDATA[<p>When enterprising individuals at Mozilla
<a href="http://bluishcoder.co.nz/2011/02/10/git-conversion-of-mozilla-central.html">started maintaining</a>
a Git mirror of Firefox's main source repository (hosted in Mercurial),
they ran into a significant problem: conversion was slow. The initial
conversion apparently took over 6 days and used a lot of memory.
Furthermore, each subsequent commit took many seconds, even on modern
hardware. This meant that they could only maintain a Git mirror of
a few project branches and that updates would be slow. Furthermore,
the slowness of the conversion significantly discouraged people
from using the tool locally as part of regular development.</p>
<p>I thought this was unacceptable. I wanted to enable people to use
their tool of choice (Git) to develop Firefox. So, I did what annoyed
engineers do when confronted with an itch: I scratched it.</p>
<h2>Diagnosing the Problem</h2>
<p>When I started tackling this problem, I had little knowledge of the
problem space other than the problem statement: <em>converting from
Mercurial to Git is prohibitively slow</em> and that the slow tool was
<a href="http://hg-git.github.io/">hg-git</a>. My challenge was thus to make
hg-git faster.</p>
<p>When confronted with a performance problem, one of the first things you
do is identify the source of the bad performance. Then, you need to
ascertain whether that is something you have the ability to change.</p>
<p>This often starts by answering some high-level questions then drilling
down into more detail as necessary. For a long-running system tool like
hg-git, I start with the <em>top test</em>: how much CPU, memory, and I/O is
the process utilizing?</p>
<p>In the case of hg-git, we were CPU bound. The Python process was
consistently pegging a single CPU core while periodically incurring I/O
(but not nearly enough to saturate a magnetic disk). This told me a few
things. First, I should look for bottlenecks inside Python. Second, I
should investigate whether parallel execution would be possible. The
latter is especially important these days because the trend in
processors is towards more cores rather than higher clock speeds. It's
no longer acceptable to let increases in clock speed or cycle efficiency
bail you out: if you want a CPU bound process to run as fast as
possible, it's often necessary to involve all available CPU cores.</p>
<p>Once I diagnosed CPU as the limiting factor, I pulled out the next tool
in the arsenal: a code profiler. I quickly discovered exactly where the
conversion was taking the most CPU time. As feared, it was in the
<em>export Mercurial changeset to Git commit</em> function.
Specifically, profiling had flagged the conversion of Mercurial
manifests to Git trees and blobs. Furthermore, most of the time was
spent in functions in Mercurial itself (Mercurial is implemented in
Python and hg-git calls into it natively) and Dulwich (a pure Python
implementation of Git). So, I was either looking at deficiencies in
Mercurial and/or Dulwich, a bad conversion algorithm in hg-git, or both.
To know which, I would need a better grasp on the internal storage
models of Mercurial and Git.</p>
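<p>For readers who haven't used a Python profiler, here is a minimal sketch of this step using the standard library's <em>cProfile</em> and <em>pstats</em> modules. The <em>export_changeset</em> function is a hypothetical stand-in for the slow code, not anything from hg-git, and this is not necessarily the exact tooling I used.</p>

```python
import cProfile
import io
import pstats


def export_changeset(n):
    """Hypothetical stand-in for the slow conversion function."""
    return sum(i * i for i in range(n))


def profile_report(func, *args, limit=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_report(export_changeset, 100000))
```

<p>Sorting by cumulative time surfaces the call trees that dominate a run, which is how a hot spot like manifest conversion would show up.</p>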
<h2>Learning about Mercurial's and Git's internal storage models</h2>
<p>To understand why conversion from Mercurial to Git was slow, I needed to
understand how each stored data internally. My hope was that if I attained a
better understanding I could apply the knowledge to assess the algorithm
hg-git was using and optimize it, hopefully introducing parallel
execution along the way.</p>
<h3>Git's internals</h3>
<p>I already had a fairly good understanding of how Git works internally.
And, it's quite simple really. The <a href="http://git-scm.com/book/en/Git-Internals">Git Internals</a>
chapter of <em>Pro Git</em> is extremely useful. While I encourage readers
to read all of the <a href="http://git-scm.com/book/en/Git-Internals-Git-Objects">Git Objects</a>
section, the gist is:</p>
<ul>
<li>Git's core storage is a key-value data store. Keys are SHA-1 checksums
  of content. Each entity is stored in a <em>Git object</em>.</li>
<li>A <em>blob</em> is an object holding the raw content of a file.</li>
<li>A <em>tree</em> is an object holding a list of <em>tree entries</em>. Each tree entry
  defines a blob, another tree object, etc. A tree is essentially a
  directory listing.</li>
<li>A <em>commit</em> object holds metadata about an individual Git commit. Each
  commit object refers to a specific <em>tree</em> object.</li>
</ul>
<p>When you introduce a new file that hasn't been seen before, a new <em>blob</em>
is added to storage. That blob is referenced by a <em>tree</em>. When you
update a file, a new <em>tree</em> is created referring to the new <em>blob</em> that
was created.</p>
<p>Things get a little complicated when you consider directories. If you
update the file <em>foo/bar/baz.c</em>, the tree for <em>foo/bar</em> changes (because
the SHA-1 of <em>baz.c</em> changed). And, the SHA-1 for the <em>foo/bar</em> tree
changes, so the <em>bar</em> entry in <em>foo</em>'s tree changes, changing the SHA-1
for the root tree.</p>
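<p>To make the ripple effect concrete, here is a small sketch that computes Git object IDs with nothing but <em>hashlib</em>. The header and tree entry layout follow Git's object format. The helper names (<em>git_object_id</em>, <em>root_for</em>) are made up for illustration, and the sketch only hashes objects; it does not write them anywhere.</p>

```python
import hashlib


def git_object_id(kind: bytes, payload: bytes) -> str:
    """SHA-1 of a Git object: a 'kind size NUL' header followed by the payload."""
    header = kind + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(content: bytes) -> str:
    return git_object_id(b"blob", content)


def tree_id(entries: dict) -> str:
    """entries maps a name to (mode, hex object id)."""
    def sort_key(name):
        # Git sorts subtrees as if their names ended with a slash.
        raw = name.encode()
        return raw + b"/" if entries[name][0] == "40000" else raw

    payload = b""
    for name in sorted(entries, key=sort_key):
        mode, hex_id = entries[name]
        payload += mode.encode() + b" " + name.encode() + b"\0"
        payload += bytes.fromhex(hex_id)
    return git_object_id(b"tree", payload)


def root_for(baz_content: bytes) -> str:
    """Root tree ID of a repository containing only foo/bar/baz.c."""
    bar = tree_id({"baz.c": ("100644", blob_id(baz_content))})
    foo = tree_id({"bar": ("40000", bar)})
    return tree_id({"foo": ("40000", foo)})


if __name__ == "__main__":
    # Changing one file changes every tree on the path up to the root.
    print(root_for(b"int main() { return 0; }\n"))
    print(root_for(b"int main() { return 1; }\n"))
```

<p>The two printed root IDs differ even though only <em>baz.c</em> changed, because each parent tree embeds the ID of its child.</p>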
<p>That's essentially how Git addresses commits, directories, and files. If
you don't grok this, please, please read the aforementioned page on it -
it may even help you better grok Git!</p>
<h3>Mercurial's internals</h3>
<p>Unlike Git, I didn't really have a clue how Mercurial worked internally.
So, I needed to do some self-education here.</p>
<p>The best resource for Mercurial's storage model I've found is the
<a href="http://hgbook.red-bean.com/read/behind-the-scenes.html">Behind the Scenes</a>
chapter from <em>Mercurial: The Definitive Guide</em>. The gist is:</p>
<ul>
<li>History for an individual file is stored in a <em>filelog</em>. Each
  <em>filelog</em> contains the history of a single file. Each file revision
  has a hash based on the file contents.</li>
<li>The <em>manifest</em> lists every file, its permissions, and its file
  revision for each changeset in the repository.</li>
<li>The <em>changelog</em> contains information about each changeset, including
  the revision of the <em>manifest</em> to use.</li>
<li>Each of these logs contains <em>revisions</em> and you can address an
  individual revision within the log.</li>
</ul>
<p>From a high level, Mercurial's storage model is very similar to Git's.
They both address files by hashing their content. Where Git uses
multiple tree objects to define every file in a commit, Mercurial has a
single manifest containing a flat list. Aside from that, the
differences are mostly in implementation details. These are important,
as we'll soon see.</p>
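<p>The flat-versus-nested difference is easy to see in code. Below is a sketch, using plain dictionaries rather than the real Mercurial or Dulwich data structures, of regrouping a flat manifest into one listing per directory. The function name and the choice to key the root directory by an empty string are my own assumptions.</p>

```python
def manifest_to_trees(manifest: dict) -> dict:
    """Group a flat {path: file_node} mapping into {directory: {name: entry}}.

    The root directory is keyed by the empty string. Subdirectory entries are
    recorded as None because their identity depends on their own contents.
    """
    trees = {"": {}}
    for path, node in manifest.items():
        parts = path.split("/")
        # Make sure every ancestor directory exists and is linked to its parent.
        for depth in range(1, len(parts)):
            directory = "/".join(parts[:depth])
            parent = "/".join(parts[:depth - 1])
            trees.setdefault(directory, {})
            trees[parent][parts[depth - 1]] = None
        trees["/".join(parts[:-1])][parts[-1]] = node
    return trees


if __name__ == "__main__":
    flat = {"README": "n1", "foo/a.c": "n3", "foo/bar/baz.c": "n2"}
    for directory, listing in sorted(manifest_to_trees(flat).items()):
        print(repr(directory), listing)
```

<p>One manifest with three files becomes three directory listings. A conversion that does this regrouping from scratch for every changeset does work proportional to the size of the whole repository each time.</p>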
<h2>Analyzing hg-git's conversion algorithm</h2>
<p>Armed with knowledge of how Git and Mercurial internally store data, I
was ready to analyze how hg-git was performing conversion from Mercurial
to Git. Since profiling revealed it was the <em>convert a single changeset
into Git commit</em> function that was taking all the time, I started there.</p>
<p>In Python (but not the actual Python), the algorithm was essentially:</p>
<div class="pygments_murphy"><pre><span class="k">def</span> <span class="nf">export_changeset_to_git</span><span class="p">(</span><span class="n">changeset</span><span class="p">,</span> <span class="n">git</span><span class="p">,</span> <span class="n">already_converted</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Receives the Mercurial changeset and a handle on a Git object store.&quot;&quot;&quot;</span>
    <span class="c"># This is an entity that helps us build Git tree objects from</span>
    <span class="c"># paths and blobs. The logic is at</span>
    <span class="c"># https://github.com/jelmer/dulwich/blob/2a8548be3b1fd4a1ae7d0436dce91611112c47c2/dulwich/index.py#L298</span>
    <span class="n">tree_builder</span> <span class="o">=</span> <span class="n">TreeBuilder</span><span class="p">()</span>

    <span class="k">for</span> <span class="nb">file</span> <span class="ow">in</span> <span class="n">changeset</span><span class="o">.</span><span class="n">manifest</span><span class="p">:</span>
        <span class="n">blob_id</span> <span class="o">=</span> <span class="n">already_converted</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="nb">file</span><span class="o">.</span><span class="n">id</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">blob_id</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">blob</span> <span class="o">=</span> <span class="n">Blob</span><span class="p">(</span><span class="nb">file</span><span class="o">.</span><span class="n">data</span><span class="p">())</span>
            <span class="n">git</span><span class="o">.</span><span class="n">store</span><span class="p">(</span><span class="n">blob</span><span class="o">.</span><span class="n">id</span><span class="p">,</span> <span class="n">blob</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
            <span class="n">already_converted</span><span class="p">[</span><span class="nb">file</span><span class="o">.</span><span class="n">id</span><span class="p">]</span> <span class="o">=</span> <span class="n">blob</span><span class="o">.</span><span class="n">id</span>
            <span class="n">blob_id</span> <span class="o">=</span> <span class="n">blob</span><span class="o">.</span><span class="n">id</span>

        <span class="n">tree_builder</span><span class="o">.</span><span class="n">add_file</span><span class="p">(</span><span class="nb">file</span><span class="o">.</span><span class="n">path</span><span class="p">,</span> <span class="n">blob_id</span><span class="p">,</span> <span class="nb">file</span><span class="o">.</span><span class="n">mode</span><span class="p">)</span>

    <span class="k">for</span> <span class="n">tree</span> <span class="ow">in</span> <span class="n">tree_builder</span><span class="o">.</span><span class="n">all_trees</span><span class="p">():</span>
        <span class="n">git</span><span class="o">.</span><span class="n">store</span><span class="p">(</span><span class="n">tree</span><span class="o">.</span><span class="n">id</span><span class="p">,</span> <span class="n">tree</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>

    <span class="n">root_tree</span> <span class="o">=</span> <span class="n">tree_builder</span><span class="o">.</span><span class="n">root_tree</span>

    <span class="c"># And proceed to build the Git commit and insert it.</span>
</pre></div>

<p>On the face of it, this code doesn't seem too bad. If I were writing the
functionality from scratch, I'd likely do something very similar. So,
why is it so slow?</p>
<p>As I mentioned earlier, profiling results had identified Mercurial and
Dulwich as the hot spots. The Mercurial hotspot was in iteration over
the files in the manifest. And the Dulwich offender was Git <em>tree</em>
object construction. But why?</p>
<p>First, it turns out that iterating a manifest the way hg-git was isn't
exactly performant. I never traced all the gory details, but I'm pretty
sure every time it accessed the file context through the change context
there was I/O involved. Not good, especially when you may not need
that information because the result was already cached!</p>
<p>Second, it turns out that creating Git <em>tree</em> objects in Dulwich is
rather slow. And, the problem is magnified when converting large
repositories - like mozilla-central (Firefox's canonical repository).</p>
<p>So, I was faced with a decision: make Mercurial and/or Dulwich faster or
change hg-git. Since improving these would have benefits outside of
hg-git, I initially went down those roads. However, I eventually
abandoned the effort because of the work involved. And, in the case of
Dulwich, improving things would likely require rewriting some pieces in
C - not something I cared to do nor something that the Dulwich people
would likely accept since Dulwich is all about being a pure Python
implementation of Git! And in hindsight, this was the right call.
Mercurial and Dulwich are fast enough - it's hg-git that was being
suboptimal.</p>
<p>I was faced with two problems: don't mass iterate over manifests and
don't mass generate Git trees. Both were seemingly impossible to avoid
because both are critical to converting a Mercurial changeset to Git.</p>
<p>I thought about this problem for a while. I experimented with numerous
micro benchmarks. I engaged the very helpful Mercurial developers on IRC
(thanks everyone!). And, I eventually arrived at what I think is an
elegant solution.</p>
<p>When I took a step back and looked at the larger problem of exporting
Mercurial changesets to Git, I realized it would be beneficial in terms
of efficiency for the conversion to be more aware of what had occurred
before. Before I came along, hg-git was asking Mercurial for the full
state of each changeset for each changeset conversion. When you think
about it in low-level operations, this is extremely inefficient. Let's
take Git trees as an example.</p>
<p>When you perform a commit, only the trees - and
their parents - that had modified files will change. All the
other trees will be identical across commits. For large repositories (in
terms of files and directories) like mozilla-central, the number of
<em>static</em> trees across small commits is quite significant compared to
changed trees. The overhead of computing all these trees is not
insignificant!</p>
<p>Instead of throwing away all the trees and file context information
between changeset exports, what if I preserved it and reused it for the
next changeset? I think I'm on to something...</p>
<h2>Implementing incremental changeset export</h2>
<p>To minimize the work performed when exporting Mercurial changesets to
Git, I <a href="https://github.com/indygreg/hg-git/commit/aef6eacf86fb08101ea98a7787f3b20dd67287c2">implemented</a>
a standalone class that can emit Mercurial changeset deltas in terms of
Git objects. Essentially, it caches a Git tree representation of a
Mercurial manifest. When you feed a new Mercurial changeset into it, it
asks Mercurial to compare those changesets using the same API used by
<em>hg status</em>. This API is efficient and returns the information I care
about: the paths that changed. Once we have the changed files, we
<em>simply</em> reflect those changes in terms of updating Git trees.</p>
<p>If a file changes or is added, we emit a <em>blob</em>. If a <em>tree</em> changes, we
emit the new <em>tree</em> object. When the consumer has finished writing the
set of new objects to Git, it asks for the SHA-1 of the root tree. (Up
until this point the consumer is not aware of what any of the emitted
objects actually are - just that they likely need to be added to
storage.) It then uses the SHA-1 of the root tree to construct the
commit. Then it moves on to the next changeset.</p>
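<p>Here is a heavily simplified sketch of the idea, not the code from the linked commit. The class name is hypothetical, tree IDs are stand-in hashes rather than real Git tree IDs, and file deletions are ignored. What it shows is the caching: directory listings survive between changesets and only directories on the path of a changed file are rehashed.</p>

```python
import hashlib


class IncrementalTrees:
    """Keeps per-directory listings between changesets (hypothetical helper)."""

    def __init__(self):
        self.entries = {"": {}}  # directory path to {name: object id}
        self.root = None         # root tree id after the last apply()
        self.hashed = 0          # number of tree ids computed so far

    def _tree_id(self, directory):
        self.hashed += 1
        listing = sorted(self.entries[directory].items())
        return hashlib.sha1(repr(listing).encode()).hexdigest()

    def apply(self, changed):
        """changed maps path to new blob id. Returns the new root tree id."""
        dirty = set()
        for path, blob in changed.items():
            directory, _, name = path.rpartition("/")
            self.entries.setdefault(directory, {})[name] = blob
            # The containing directory and all of its ancestors are dirty.
            while True:
                dirty.add(directory)
                if not directory:
                    break
                directory = directory.rpartition("/")[0]

        def depth(d):
            return d.count("/") + 1 if d else 0

        # Rehash the deepest directories first so parents see fresh child ids.
        for directory in sorted(dirty, key=depth, reverse=True):
            tree = self._tree_id(directory)
            if directory:
                parent, _, name = directory.rpartition("/")
                self.entries.setdefault(parent, {})[name] = tree
            else:
                self.root = tree
        return self.root
```

<p>After an initial changeset touching <em>foo/bar/baz.c</em>, <em>foo/a.c</em> and <em>docs/readme</em>, a follow-up changeset that only modifies <em>foo/bar/baz.c</em> rehashes three trees (<em>foo/bar</em>, <em>foo</em> and the root) and leaves <em>docs</em> alone.</p>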
<p>The impact of this change is significant. On my computer (an
i7-2600k), converting Mercurial's own repository to Git went from <strong>21:07</strong>
to <strong>8:14</strong>. mozilla-central is even more drastic. The first 200
commits (the first commit was a large dump from CVS) took <strong>8:17</strong>
before and now take <strong>2:32</strong>. I don't have exact numbers from newer
commits, but I do know they were at least twice as slow as the initial
commits and showed an even more drastic speedup.</p>
<p>But I was just getting started.</p>
<p>The initial implementation wasn't very efficient in terms of reducing
tree object calculations. I changed that earlier today when I
<a href="https://groups.google.com/d/msg/hg-git/I5w_FscF6lw/LAc0pw1iilQJ">submitted a patch for consideration</a>
that only calculates tree changes for trees that actually changed. I
also removed some needless sorting on the order of export operations.
This second patch reduced conversion of Mercurial's repository down to
<strong>5:33</strong>. Even more impressive is that <strong>mozilla-central's changesets are
now exporting almost 4x faster</strong> with this patch alone. The first 200
changesets now export in <strong>42s</strong> (down from <strong>2:32</strong> which is down from
<strong>8:17</strong>). This is mostly due to eliminating the overhead of reprocessing
non-dirty trees on every export.</p>
<p>And I'm not through.</p>
<p>As part of building the standalone incremental changeset exporter, one
of the goals in the back of my mind was to eventually have things
execute in parallel.</p>
<p>In my <a href="https://github.com/indygreg/hg-git/tree/performance-next">personal development branch</a>
I have a <a href="https://github.com/indygreg/hg-git/commit/e74641284fecc928b0b8f8dcc01ef9b99e09c3cc">patch</a>
to perform Mercurial changeset export on multiple cores. Essentially
hg-git fires up a bunch of worker processes and asks each to export a
consecutive range of changesets. Each worker writes new Git objects into
Git and then tells the coordinator process the root tree SHA-1
corresponding to each Mercurial changeset. The coordinator process then
uses these root tree SHA-1's to derive Git commit objects (you can't
create the commit object until you know the SHA-1 of the commit's
parents).</p>
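<p>The division of labor can be sketched as follows. This is an illustration under several assumptions, not the patch itself: threads stand in for worker processes to keep the example portable, <em>export_range</em> fakes the tree export with a hash, history is assumed to be linear, and all function names are invented.</p>

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def consecutive_ranges(total, workers):
    """Split range(total) into at most `workers` consecutive, near-equal chunks."""
    size, extra = divmod(total, workers)
    ranges, start = [], 0
    for i in range(workers):
        stop = start + size + (1 if extra > i else 0)
        if stop > start:
            ranges.append(range(start, stop))
        start = stop
    return ranges


def export_range(revs):
    """Stand-in for a worker: returns (rev, root tree id) pairs for its range."""
    return [(rev, hashlib.sha1(b"tree %d" % rev).hexdigest()) for rev in revs]


def convert(total, workers=4):
    # Threads keep the sketch portable; the patch described above uses processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(export_range, consecutive_ranges(total, workers))
        roots = dict(pair for chunk in chunks for pair in chunk)
    # Commits are created serially because each one embeds its parent's id.
    commits, parent = [], ""
    for rev in range(total):
        parent = hashlib.sha1((parent + roots[rev]).encode()).hexdigest()
        commits.append(parent)
    return commits


if __name__ == "__main__":
    print(convert(10, workers=3)[-1])
```

<p>The expensive part (blobs and trees) fans out across workers, while the cheap serial part (chaining commits) stays in the coordinator. The result is the same regardless of how many workers are used.</p>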
<p>The blob and tree exporting on separate processes makes Mercurial to Git
export scale out to however many cores you feel like throwing at it.
When 32 core machines come around, you can convert using all available
cores and the speedup should roughly be linear with the number of cores.</p>
<p>I'm still working out some kinks in the multiple processes patch
(the <em>multiprocessing</em> module is very difficult to get working on all
platforms and I don't want to break hg-git when it lands). But,
<a href="http://ehsanakhgari.org/">Ehsan Akhgari</a> has been using it to power the
<a href="https://github.com/mozilla/mozilla-central/">GitHub mirror</a> of
mozilla-central for months without issue. (His use of these patches
freed up the CPU required to support conversion of more project branches
on the Git mirror. And, he's still not using the 4x improvement patch I
wrote today - he will shortly - so who knows what improvements will stem
from that.)</p>
<p>With all the patches applied, hg-git now feels like a Ferrari when
exporting Mercurial changesets to Git. Conversion of Mercurial's
repository now takes <strong>1:25</strong> (down from <strong>21:07</strong>). <strong>Conversion of
mozilla-central has gone from 6+ days to about 3 hours!</strong> More
importantly, ongoing conversions feel somewhat snappy now.</p>
<h2>Making Git export even faster</h2>
<p>With the patch today, I'd say optimization of exporting Mercurial
changesets is nearing its limits. There are a few things I could try
that may net another 2 or 3x improvement. But, I think the ~50x
improvement I've already attained (at least for mozilla-central) is
pretty damn good and good enough for most users. (Part of performance
optimization is knowing when things are good enough and stopping before you
invest excessive time in the long tail.)</p>
<p>There is one giant refactor that could likely net a significant win for
Git export. However, it requires optimizing for initial export over
recurring incremental export (which is why I have little interest in
it). Incremental export incurs a lot of <em>random</em> I/O accessing Mercurial
filelogs and extracting specific file revisions as they are needed. An
optimal export would iterate over the filelogs and export Git blobs from
each filelog in the order they occur within the filelogs. It would
cache the mapping from file node to blob SHA-1. After all blobs are exported, the
mappings would be combined and distributed to all workers. Then, tree
export would occur in parallel largely under the existing model modulo
blob writing. This would minimize overall I/O and work in Mercurial and
would likely be significantly faster. However, it's mostly useful for
initial export and IMO not worth implementing. (It's possible to employ
a variation for incremental export that iterates over filelogs and
exports not-yet-seen revisions. Perhaps I will investigate this some
day.)</p>
<h2>What about converting Git to Mercurial?</h2>
<p>Now that I've tackled Mercurial to Git conversion, it's very tempting to
work magic on the inverse: converting Git commits to Mercurial
changesets. While I haven't looked at this problem in detail, I already
know it will be at least slightly more challenging.</p>
<p>The reason is parallelization. With Mercurial export, I have each child
process reading directly from Mercurial and writing directly to Git.
There are no locks involved. There is just a coordinator that ensures
minimal redundant work among workers. There is some redundant
work, sure. But, the alternative would be lots of locking and/or
exchange of state across processes - not cheap operations! Furthermore,
the writes into Git can occur in any order (since Git is just a
key-value store). The only hard requirement is a child commit must
come after its parent (because you need the parent commit's SHA-1).
And, single-threaded insert of commit objects isn't a big deal because
you can crank through hundreds of them per second (it might even be over
1000/s on my machine).</p>
<p>Mercurial's storage implementation does not afford me the same
<em>carelessness</em> with regards to writing into storage. Since Mercurial
uses shared files for individual file and manifest history, we have
a contention problem. We <em>could</em> lock files when writing to them.
However, these files (<em>revlogs</em> in Mercurial speak) also use
transparent delta compression. You get the best performance/compression
when changes are written in the order they actually occurred in (at
least in the typical case).</p>
<p>To optimally write to Mercurial you need to order inserts. This means
parallel reads from Git (in separate worker processes) would be very
difficult to implement. Doable, sure, but you're looking at a lot of
transferred state and ordering. This likely involves a lot more memory
and CPU usage.</p>
<p>The best idea I've come up with so far is a single process
that reads off Git commits and iterates trees. It hashes the paths of
seen files to a consistent worker process which then pulls the blob from
Git's storage and inserts it into the filelog. You don't need to lock
filelogs because only one worker owns a specific path. Workers report
the blob's corresponding file node to another process which then
assembles manifests, writes manifests in order, and finally creates
and writes changesets. Unfortunately, the worker processes are just
doing blob I/O. There is no parallel processing of Git tree calculation
or Mercurial manifests. Given this was a significant source of slowness
exporting <em>to</em> Git, I worry the same will be true of the inverse. Although, the
problem with Git was tree <em>creation</em> and it was due to the volume. Since
there is only 1 manifest per changeset, perhaps it won't be as bad.</p>
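<p>The "hash the path to a consistent worker" step could look something like the sketch below. It is only a thought experiment since none of this is implemented. The function names are hypothetical, and <em>zlib.crc32</em> is just one convenient choice of a hash that is stable across processes.</p>

```python
import zlib


def worker_for(path: str, worker_count: int) -> int:
    """Map a file path to a stable worker index."""
    # crc32 is deterministic across processes, unlike the built-in hash() on str.
    return zlib.crc32(path.encode("utf-8")) % worker_count


def partition(paths, worker_count):
    """Group paths by owning worker so each filelog has exactly one writer."""
    queues = [[] for _ in range(worker_count)]
    for path in paths:
        queues[worker_for(path, worker_count)].append(path)
    return queues


if __name__ == "__main__":
    for index, queue in enumerate(partition(["a.c", "b.c", "foo/bar/baz.c"], 2)):
        print(index, queue)
```

<p>Because a given path always lands on the same worker, every revision of that file is written by one process, in the order it receives them, and no filelog locking is needed.</p>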
<p>While I've brainstormed a solution, I have no concrete plans to work on
Git to Mercurial conversion. The impetus for me working on Mercurial to
Git speedups was that I and a number of other Mozilla people were
personally impacted. If the same is true for Git to Mercurial slowness,
I could invest a few hours the next time I'm sick and bored over the
weekend.</p>
<h2>Conclusion</h2>
<p>Converting Mercurial repositories to Git with hg-git is now
significantly faster. If you thought it was too slow before, grab the
latest code (from either the official repository or my personal branch)
and enjoy.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Bulk Analysis of Mozilla's Build and Test Data]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/04/01/bulk-analysis-of-mozilla's-build-and-test-data" />
    <id>http://gregoryszorc.com/blog/2013/04/01/bulk-analysis-of-mozilla's-build-and-test-data</id>
    <updated>2013-04-01T13:12:00Z</updated>
    <published>2013-04-01T13:12:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <category scheme="http://gregoryszorc.com/blog" term="Firefox" />
    <summary type="html"><![CDATA[Bulk Analysis of Mozilla's Build and Test Data]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/04/01/bulk-analysis-of-mozilla's-build-and-test-data"><![CDATA[<p>When you push code changes to Firefox and other similar Mozilla
projects, a flood of automated jobs is triggered on Mozilla's
infrastructure. It works like any other continuous integration
system. First you build, then you run tests, etc. What sets it
apart from other continuous integration systems is the size:
Mozilla runs thousands of jobs per week and the combined output
sums into the tens of gigabytes.</p>
<p>Most of the data from Mozilla's continuous integration is
available on public servers, notably ftp.mozilla.org. This includes
compiled binaries, logs, etc.</p>
<p>While there are tools that can sift through this mountain of data
(like <a href="https://tbpl.mozilla.org">TBPL</a>), they don't allow ad-hoc
queries over the raw data. Furthermore, these tools are very
function-specific and there are many data views they don't expose.
This <em>missing</em> data has always bothered me because, well, there
are cool and useful things I'd like to do with this data.</p>
<p>This itch has been bothering me for well over a year. The
persistent burning sensation coupled with rain over the weekend
caused me to scratch it.</p>
<p>The <a href="https://github.com/indygreg/mozilla-build-analyzer">product of my weekend labor</a>
is a system facilitating bulk storage and analysis of Mozilla's
build data. While it's currently very alpha, it's already showing
promise for more thorough data analysis.</p>
<p>Essentially, the tool works by collecting the dumps of all the jobs
executed on Mozilla's infrastructure. It can optionally supplement
this with the raw logs from those jobs. Then, it combs through this
data, extracts useful bits, and stores them. Once the initial
fetching has completed, you simply need to re-"parse" the data set
into useful data. And, since all data is stored locally, the
performance of this is not bound by Internet bandwidth. In practice,
this means that you can obtain a new metric faster than was
previously possible. The downside is you will likely be storing
gigabytes of raw data locally. But, disks are cheap. And, you have
control over what gets pulled in, so you can limit it to what you
need.</p>
<p>Please note the project is very alpha and is only currently serving
my personal interests. However, I know there is talk about <em>TBPL2</em>
and what I have built could evolve into the data store for the next
generation TBPL tool. Also, most of the work so far has centered on
data import. There is tons of analysis code waiting to be written.</p>
<p>If you are interested in improving the tool, please file a GitHub
pull request.</p>
<p>I hope to soon blog about useful information I've obtained through this
tool.</p>]]></content>
  </entry>
  <entry>
    <author>
      <name></name>
      <uri>http://gregoryszorc.com/blog</uri>
    </author>
    <title type="html"><![CDATA[Omnipresent mach]]></title>
    <link rel="alternate" type="text/html" href="http://gregoryszorc.com/blog/2013/03/03/omnipresent-mach" />
    <id>http://gregoryszorc.com/blog/2013/03/03/omnipresent-mach</id>
    <updated>2013-03-03T12:30:00Z</updated>
    <published>2013-03-03T12:30:00Z</published>
    <category scheme="http://gregoryszorc.com/blog" term="Mozilla" />
    <category scheme="http://gregoryszorc.com/blog" term="Firefox" />
    <category scheme="http://gregoryszorc.com/blog" term="mach" />
    <summary type="html"><![CDATA[Omnipresent mach]]></summary>
    <content type="html" xml:base="http://gregoryszorc.com/blog/2013/03/03/omnipresent-mach"><![CDATA[<p>Matt Brubeck recently landed an awesome patch for mach in
<a href="https://bugzilla.mozilla.org/show_bug.cgi?id=840588">bug 840588</a>:
it allows mach to be used from any directory. I'm calling it
<em>omnipresent mach</em>.</p>
<p>Essentially, Matt changed the <em>mach</em> driver (the script in the root
directory of mozilla-central) so instead of having it look in hard-coded
relative paths for all its code, it walks up the directory tree and
looks for signs of the source tree or the object directory.</p>
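<p>The general technique is simple enough to sketch. This is not Matt's patch: the function name is invented and <em>SOURCE_ROOT_MARKER</em> is a placeholder for whatever files the real driver checks for.</p>

```python
from pathlib import Path


def find_tree_root(start, marker="SOURCE_ROOT_MARKER"):
    """Return the nearest ancestor of start (inclusive) containing marker, or None."""
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    return None


if __name__ == "__main__":
    print(find_tree_root(Path.cwd()))
```

<p>Starting from the current directory and checking each parent in turn means the nearest enclosing tree wins.</p>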
<p>What this all means is that if you have the <em>mach</em> script installed in
your $PATH and you just type <em>mach</em> in your shell from within any source
directory or object directory, <em>mach</em> should just work. So, no more
typing <em>./mach</em>: just copy <em>mach</em> to <em>~/bin</em>, <em>/usr/local/bin</em> or some
other directory on your $PATH and you should just be able to type
<em>mach</em>.</p>
<p>Unfortunately, there are bound to be bugs here. Since <em>mach</em>
traditionally was only executed with the current working directory as
the top source directory, some commands are not prepared to handle a
variable current working directory. Some commands will likely get
confused when it comes to resolving relative paths, etc. If you find
an issue, please report it! A temporary workaround is to just invoke
mach from the top source directory like you've always been doing.</p>
<p>If you enjoy the feature, thank Matt: this was completely his idea and
he saw it through from conception to implementation.</p>]]></content>
  </entry>
</feed>
