Code First and the Rise of the DVCS and GitHub

January 10, 2015 at 12:35 PM | categories: Git, Mozilla

The ascendancy of GitHub has very little to do with its namesake tool, Git.

What GitHub did that was so radical for its time and the strategy that GitHub continues to execute so well on today is the approach of putting code first and enabling change to be a frictionless process.

In case you weren't around for the pre-GitHub days or don't remember, they were not pleasant. Tools around code management were a far cry from where they are today (I still argue the tools are pretty bad, but that's for another post). Centralized version control systems were prevalent (CVS and Subversion in open source, Perforce, ClearCase, Team Foundation Server, and others in the corporate world). Tools for looking at and querying code had horrible, ugly interfaces and came out of a previous era of web design and browser capabilities. It felt like a chore to do anything, including committing code. Yes, the world had awesome services like SourceForge, but they weren't the same as GitHub is today.

Before I get to my central thesis, I want to highlight some supporting reasons for GitHub's success. Two developments in the second half of the 2000s contributed to it: the rise of the distributed version control system (DVCS) and the rise of the modern web.

While distributed version control systems like Sun WorkShop TeamWare and BitKeeper existed earlier, it wasn't until the second half of the 2000s that DVCS tools took off. You can argue part of the reason for this was open source: my recollection is there wasn't a well-known DVCS available as free software before 2005. Speaking of 2005, it was a big year for DVCS projects: Git, Mercurial, and Bazaar all had initial releases. Suddenly, old-but-new ideas on how to do source control were being exposed to new and willing-to-experiment audiences. DVCS tools were a critical leap from traditional version control because they (theoretically) impose fewer process and workflow limitations on users. With traditional version control, you needed to be online to commit, meaning you were managing patches, not commits, in your local development workflow. There were some forms of branching and merging, but they were a far cry from what is available today and were often too complex for mere mortals to use. As more and more people were exposed to distributed version control, they welcomed its less restrictive and more powerful workflows. They realized that source control tools don't have to be so limiting. Distributed version control also promised all kinds of revamped workflows that could be harnessed. There were potential wins all around.

Around the same time that open source DVCS systems were emerging, web browsers were evolving from an application to render static pages to a platform for running web applications. Web sites using JavaScript to dynamically manipulate web page content (DHTML as it was known back then) were starting to hit their stride. I believe it was GMail that turned the most heads as to the full power of the modern web experience, with its novel-for-its-time extreme reliance on XMLHttpRequest for dynamically changing page content. People were realizing that powerful, desktop-like applications could be built for the web and could run everywhere.

GitHub launched in April 2008 standing on the shoulders of both the emerging interest in the Git content tracking tool and the capabilities of modern browsers.

I wasn't an early user of GitHub. My recollection is that GitHub was mostly a Rubyist's playground back then. I wasn't a Ruby programmer, so I had little reason to use GitHub in the early stages. But people did start using GitHub. And in the spirit of Ruby (on Rails), GitHub moved fast, or at least was projecting the notion that they were. While other services built on top of DVCS tools - like Bitbucket - did exist back then, GitHub seemed to have momentum associated with it. (Look at the archives for GitHub's and Bitbucket's respective blogs. GitHub has hundreds of blog entries; Bitbucket numbers in the dozens.) Developers everywhere up until this point had all been dealing with sub-optimal tools and workflows. Some of us realized it. Others hadn't. Many of those who did saw GitHub as a beacon of hope: we have all these new ideas and new potentials with distributed version control and here is a service under active development trying to figure out how to exploit that. Oh, and it's free for open source. Sign me up!

GitHub did capitalize on a market opportunity. They also capitalized on the value of marketing and the perception that they were moving fast and providing features that people - especially in open source - wanted. This captured the early adopters market. But I think what really set GitHub apart and led to the success they are enjoying today is their code first approach and their desire to make contribution easy, and even fun and sociable.

As developers, our job is to solve problems. We often do that by writing and changing code. And this often involves working as part of a team, or collaborating. To collaborate, we need tools. You eventually need some processes. And as I recently blogged, this can lead to process debt and inefficiencies associated with them.

Before GitHub, the process debt for contributing to other projects was high. You often had to subscribe to mailing lists in order to submit patches as emails. Or, you had to create an account on someone's bug tracker or code review tool before you could send patches. Then you had to figure out how to use these tools and any organization- or project-specific extensions and workflows attached to them. It was quite involved and a lot could go wrong. Many projects and organizations (like Mozilla) still practice this traditional methodology. Furthermore (and as I've written before), these traditional, single patch/commit-based tools often aren't effective at ensuring the desired output of high quality software.

Before GitHub solved process debt by commoditizing knowledge through market dominance, they took another approach: emphasizing code first development.

GitHub is all about the code. You load a project page and you see code. You may think a README with basic project information would be the first thing on a project page. But it isn't. Code, like data, is king.

Collaboration and contribution on GitHub revolve around the pull request. It's a way of saying, hey, I made a change, will you take it? There's nothing too novel in the concept of the pull request: it's fundamentally no different than sending emails with patches to a mailing list. But what is so special is GitHub's execution. Gone are the days of configuring and using one-off tools and processes. Instead, we have the friendly confines of a clean and modern web experience. While GitHub is built upon the Git tool, you don't even need to use Git (a tool lampooned for its horrible usability and approachability) to contribute on GitHub! Instead, you can do everything from your browser. That warrants repeating: you don't need to leave your browser to contribute on GitHub. GitHub has essentially reduced process debt to edit a text document territory, and pretty much anybody who has used a computer can do that. This has enabled GitHub to dabble in non-code territory, such as its GitHub and Government initiative to foster community involvement in government. (GitHub is really a platform for easily seeing and changing any content or data. But, please, let me continue using code as a stand-in, since I want to focus on the developer audience.)
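To give a sense of how thin that process layer is, here is a minimal sketch of opening a pull request programmatically through GitHub's REST API. The repository, branch names, and token below are hypothetical placeholders, and this is just one illustration of the workflow, not a description of how GitHub implements it.

    # Open a pull request via GitHub's REST API. A minimal sketch:
    # the owner/repo, branch names, and token are placeholders.
    import requests

    def open_pull_request(owner, repo, token, title, head, base="master", body=""):
        """Ask the target repository to pull changes from `head` into `base`."""
        url = "https://api.github.com/repos/%s/%s/pulls" % (owner, repo)
        payload = {"title": title, "head": head, "base": base, "body": body}
        response = requests.post(
            url,
            json=payload,
            headers={"Authorization": "token %s" % token})
        response.raise_for_status()
        return response.json()["html_url"]

    # Hypothetical usage:
    # print(open_pull_request("someuser", "someproject", "MY_TOKEN",
    #                         "Fix typo in README", "someuser:typo-fix"))

The point isn't the API itself: it's that the unit of change is a proposed set of commits plus a conversation, and everything else is optional.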

GitHub took an overly-complicated and fragmented world of varying contribution processes and made the new world revolve around code and a unified and simple process for change - the pull request.

Yes, there are other reasons for GitHub's success. You can make strong arguments that GitHub has capitalized on the social and psychological aspects of coding and human desire for success and happiness. I agree.

You can also argue GitHub succeeded because of Git. That statement is more or less technically accurate, but I don't think it is a sound argument. Git may have been the most feature complete open source DVCS at the time GitHub came into existence. But that doesn't mean there is something special about Git that no other DVCS has that makes GitHub popular. Had another tool been more feature complete or had the backing of a project as large as Linux at the time of GitHub's launch, we could very well be looking at a successful service built on something that isn't Git. Git had early market advantage and I argue its popularity today - a lot of it via GitHub - is largely a result of its early advantages over competing tools. And, I would go so far as to say that when you consider the poor usability of Git and the pain that its users go through when first learning it, more accurate statements would be that GitHub succeeded in spite of Git and Git owes much of its success to GitHub.

When I look back at the rise of GitHub, I see a service that has succeeded by putting people first by allowing them to capitalize on more productive workflows and processes. They've done this by emphasizing code, not process, as the means for change. Organizations and projects should take note.


Firefox Contribution Process Debt

January 09, 2015 at 04:45 PM | categories: Mozilla

As I was playing around with source-derived documentation, I grasped the reality that any attempt to move documentation out of MDN's wiki and into something derived from source code (despite the argued technical and quality advantages) would likely be met with fierce opposition because the change process for Firefox is much more involved than editing a wiki.

This observation casts light on something important: the very act of contributing any change to Firefox is too damn hard.

I've always believed this statement to be true. I even like to think I'm one of the few people that has consistently tried to do something about it (inventing mach, overhauling the build system, bootstrap scripting, MozReview, etc). But I, like many of the seasoned Firefox developers, often lose sight of this grim reality. (I think it's fair to say that new contributors often complain about the development experience, and as we grow accustomed to it, the volume and intensity of those complaints wane.)

But when you have the simplicity of editing a wiki page on MDN juxtaposed against the Firefox change contribution process of today, the inefficiency of the Firefox process is clearly seen.

The amount of knowledge required to change Firefox is obscenely absurd. Sure, some complex components will always be difficult to change. But I'm talking about any component: the base set of knowledge required to contribute any change to Firefox is vast. This is before we get into any domain-specific knowledge inside Firefox. I have always believed and will continue to believe that this is a grave institutional issue for Mozilla. It should go without saying that I believe this is an issue worth addressing. After all, any piece of knowledge required for contribution is essentially an obstacle to completion. Elimination of required knowledge lowers the barrier to contribution. This, in turn, allows increased contribution via more and faster change. This advances the quality and relevance of Firefox, which enables Mozilla to advance its Mission.

Seasoned contributors have probably internalized most of the knowledge required to contribute to Firefox. Here is a partial list to remind everyone of the sheer scope of it:

  • Before you do anything
    • Am I able to contribute?
    • Do I meet the minimum requirements (hardware, internet access, etc)?
    • Do I need any accounts?
  • Source control
    • What is source control?
    • How do I install Mercurial/Git?
    • How do I use Mercurial/Git?
    • Where can I get the Firefox source code?
    • How do I *optimally* acquire the Firefox source code?
    • Are there any recommended configuration settings or extensions?
  • Building Firefox
    • Do I even need to build Firefox?
    • How do I build Firefox?
    • What prerequisites need to be installed?
    • How do I install prerequisites?
    • How do I configure the Firefox build to be appropriate for my needs?
    • What are mozconfigs?
    • How do I get Firefox to build faster?
    • What do I do after a build?
  • Changing code
    • Is there IDE support?
    • Where can I find macros and aliases to make things easier?
  • Testing
    • How do I run tests?
    • Which tests are relevant for a given change?
    • What are all these different test types?
    • How do I debug tests?
  • Try and Automation
    • What is Try?
    • How do I get an account?
    • What is vouching and different levels of access?
    • What is SSH?
    • How do I configure SSH?
    • When will my tests run?
    • What is Treeherder?
    • What do all these letters and numbers mean?
    • What are all these colors?
    • What's an *intermittent failure*?
    • How do I know if something is an *intermittent failure*?
    • What amount of *intermittent failure* is acceptable?
    • What do these logs mean?
    • What's buildbot?
    • What's mozharness?
  • Sending patch to Mozilla
    • Do I need to sign a CLA?
    • Where do I send patches?
    • Do I need to get an account on Bugzilla?
    • Do I need to file a bug?
    • What component should I file a bug in?
    • What format should patches be sent in?
    • How should I format commit messages?
    • How do I upload patches to Bugzilla?
    • How does code review work?
      • What's the modules system?
      • What modules does my change map to?
      • Who are the possible reviewers?
      • How do I ask someone for review?
      • When can I expect review?
      • What does r+ vs r- vs f+ vs f- vs cancelling review all mean?
      • How do I submit changes to my initial patch?
      • What do I do after review?
  • Landing patches
    • What repository should a patch land on?
    • How do you rebase?
    • What's a tree closure?
    • What do I do after pushing?
    • How do I know the result of the landing?

Holy #$%@, that's a lot of knowledge. Not only is this list incomplete, it also doesn't encompass much of the domain-specific knowledge around the content being changed.

Every item on this list represents a point where a potential contributor could throw up their arms out of despair and walk away, giving their time and talents to another project. Every item on this list that takes 10 minutes instead of 5 could be the tipping point. For common actions, things that take 5 seconds instead of 1 could be the difference maker. This list thus represents reasons that people do not contribute to Firefox or contribute ineffectively (in the case of common contributors, like paid Mozilla staff).

I view items on this list as process debt. Process debt is a term I'm coining (perhaps someone has beaten me to it - I'm writing this on a plane without Internet access) that is a sibling of technical debt. Process debt is overhead and inefficiency associated with processes. The border between process debt and technical debt in computers is the code itself (although that border may sometimes not be very well-defined, as code and process are oftentimes similar, such as most interactions with version control or code review tools).

When I see this list of process debt, I'm inspired by the opportunity to streamline the processes and bask in the efficiency gains. But I am also simultaneously overwhelmed by the vast scope of things that need to be improved. When I think about the amount of energy that will need to be exerted to fight the OMG change crowd, the list becomes depressing. But discussing institutional resistance to change, the stop energy associated with it, and Mozilla's historical record of failing to invest in fixing process (and technical) debt is for another post.

When looking at the above list, I can think of the following general ways to make it more approachable:

  1. Remove an item completely. If it isn't on the list, there is nothing to know and no overhead. The best way to solve a problem is to make it not exist.
  2. Automate an item and make its existence largely transparent. If an item is invisible, does it exist in the mind of a contributor? (This is also known as solving the problem by adding a layer of indirection.)
  3. Change an item so that it is identical to another, more familiar process. If you use a well-defined process, there is no new knowledge that must be learned and the cost of on-boarding someone already familiar with that knowledge is practically zero.

When you start staring at this list of Firefox contribution process debt, you start thinking about priorities, groupings, and strategies. You look around at what others are doing to see if you can borrow good ideas.

I've done a lot of thinking on the subject and have some ideas and recommendations. Stay tuned for some additional posts on the topic.


Style Changes on hg.mozilla.org

January 09, 2015 at 03:25 PM | categories: Mercurial, Mozilla

Starting today and continuing through next week, there will be a number of styling changes made to hg.mozilla.org.

The main goal of the work is to bring the style up-to-date with upstream Mercurial. This will result in more features being available to the web interface, hopefully making it more useful. This includes display of bookmarks and the Mercurial help documentation. As part of this work, we're also removing some files on the server that shouldn't be used. If you start getting 404s or notice an unexpected theme change, this is probably the reason why.

If you'd like to look over the changes before they are made or would like to file a bug against a regression (we suspect there will be minor regressions due to the large nature of the changes), head on over to bug 1117021 or ping people in #vcs on IRC.


Mercurial Pushlog Is Now Robust Against Interrupts

December 30, 2014 at 12:25 PM | categories: Mozilla, Firefox

hg.mozilla.org - Mozilla's Mercurial server - has functionality called the pushlog which records who pushed what when. Essentially, it's a log of when a repository was changed. This is separate from the commit log because the commit log can be spoofed and the commit log doesn't record when commits were actually pushed.
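For a sense of what it records, the pushlog is exposed over HTTP. Here is a rough sketch of querying it; the json-pushes endpoint and the exact field names are written from memory, so treat them as assumptions to verify rather than a reference.

    # Query the pushlog for mozilla-central over HTTP. A rough sketch;
    # the endpoint and field names are from memory, not a reference.
    import requests

    url = "https://hg.mozilla.org/mozilla-central/json-pushes"
    pushes = requests.get(url).json()

    for push_id in sorted(pushes, key=int):
        push = pushes[push_id]
        # Each push records who pushed, when, and which changesets landed.
        print("push %s by %s at %s: %d changesets" % (
            push_id, push["user"], push["date"], len(push["changesets"])))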

Since its inception, the pushlog has suffered from data consistency issues. If you aborted the push at a certain time, data was not inserted in the pushlog. If you aborted the push at another time, data existed in the pushlog but not in the repository (the repository would get rolled back but the pushlog data wouldn't).

I'm pleased to announce that the pushlog is now robust against interruptions and its updates are consistent with what is recorded by Mercurial. The pushlog database commit/rollback is tied to Mercurial's own transaction API. What Mercurial does to the push transaction, the pushlog follows.
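To illustrate the approach, here is a sketch of a Mercurial hook that ties an external SQLite database to the repository transaction. The API names used (currenttransaction, addfinalize, addabort) are from memory and the deployed pushlog extension is more involved, so read this as an illustration of the technique rather than the actual code.

    # Tie external state (a pushlog-style SQLite database) to Mercurial's
    # transaction. Illustrative sketch only; API names are from memory.
    import sqlite3
    import time

    def pretxnchangegroup_pushlog(ui, repo, node=None, **kwargs):
        """Record the incoming push, but only keep the record if Mercurial
        finalizes its own transaction."""
        ui.write('Trying to insert into pushlog.\n')
        db = sqlite3.connect(repo.vfs.join('pushlog.db'))
        db.execute('CREATE TABLE IF NOT EXISTS pushlog (user TEXT, date INTEGER)')
        db.execute('INSERT INTO pushlog (user, date) VALUES (?, ?)',
                   (kwargs.get('url', 'unknown'), int(time.time())))
        ui.write('Inserted into the pushlog db successfully.\n')

        tr = repo.currenttransaction()

        def commit_pushlog(tr):
            db.commit()

        def abort_pushlog(tr):
            ui.write('rolling back pushlog\n')
            db.rollback()

        # Whatever Mercurial does to the push transaction, the pushlog follows.
        tr.addfinalize('pushlog', commit_pushlog)
        tr.addabort('pushlog', abort_pushlog)

        return 0  # 0 means success for pretxn* hooks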

This former inconsistency has caused numerous problems over the years. When data was inconsistent, we often had to close trees until someone could SSH into the machines and manually run SQL to fix the problems. This also contributed to a culture of don't press ctrl+c during push: it could corrupt Mercurial. (Ctrl+c should be safe to press any time: if it isn't, there is a bug to be filed.)

Any time you remove a source of tree closures is a cause for celebration. Please join me in celebrating your new freedom to abort pushes without concern for data inconsistency.

In case you want to test things out, aborting pushes (and rolling back the pushlog) should now result in something like:

pushing to ssh://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 1 changes to 1 files
Trying to insert into pushlog.
Inserted into the pushlog db successfully.
^C
rolling back pushlog
transaction abort!
rollback completed

Firefox Source Documentation Versus MDN

December 30, 2014 at 12:00 PM | categories: Mozilla

The Firefox source tree has had in-tree documentation powered by Sphinx for a while now. However, its canonical home has been a hard-to-find URL on ci.mozilla.org. I finally scratched an itch and wrote patches to enable the docs to be built more easily. So, starting today, the docs are now available on Read the Docs at https://gecko.readthedocs.org/en/latest/!

While I was scratching itches, I decided to play around with another documentation-related task: automatic API documentation. I have a limited proof-of-concept for automatically generating XPIDL interface documentation. Essentially, we use the in-tree XPIDL parser to parse .idl files and turn the Python object representation into reStructuredText, which Sphinx parses and renders into pretty HTML for us. The concept can be applied to any source input, such as WebIDL and JavaScript code. I chose XPIDL because a parser is readily available and I know Joshua Cranmer has expressed interest in automatic XPIDL documentation generation. (As an aside, JavaScript tooling that supports the flavor of JavaScript used internally by Firefox is very limited. We need to prioritize removing Mozilla extensions to JavaScript if we ever want to start using awesome tooling that exists in the wild.)
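A stripped-down sketch of the idea follows. The parser entry points and attribute names (IDLParser, parse, productions, members) are approximations from memory, not the exact in-tree API, so treat them as placeholders.

    # Turn XPIDL interfaces into reStructuredText for Sphinx to render.
    # Parser entry points and attribute names are approximations of the
    # in-tree xpidl module, not its exact API.
    from xpidl import xpidl

    def idl_to_rst(path):
        parser = xpidl.IDLParser()
        with open(path) as fh:
            idl = parser.parse(fh.read(), filename=path)

        lines = []
        for production in idl.productions:
            if production.kind != 'interface':
                continue

            # Interface name becomes a section heading.
            lines.append(production.name)
            lines.append('=' * len(production.name))
            lines.append('')

            # Carry any doc comments from the .idl into the output.
            for comment in getattr(production, 'doccomments', []):
                lines.append(comment)
            lines.append('')

            # One description block per attribute/method.
            for member in production.members:
                lines.append('.. describe:: %s' % member.name)
                lines.append('')

        return '\n'.join(lines)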

As I was implementing this proof-of-concept, I was looking at XPIDL interface documentation on MDN to see how things are presented today. After perusing MDN for a bit and comparing its content against what I was able to derive from the source, something became extremely clear: MDN has significantly more content than the canonical source code. Obviously the .idl files document the interfaces, their attributes, their methods, and all the types and names in between: that's the very definition of an IDL. But what was generally missing from the source code was comments. What does this method do? What is each argument used for? Things like example usage are almost non-existent in the source code. MDN, by contrast, typically has no shortage of all these things.

As I was grasping the reality that MDN has a lot of out-of-tree supplemental content, I started asking myself what's the point in automatic API docs? Is manual document curation on MDN good enough? This question has sort of been tearing me apart. Let me try to explain.

MDN is an amazing site. You can tell a lot of love has gone into making the experience and much of its content excellent. However, the content around the technical implementation / internals of Gecko/Firefox generally sucks. There are some exceptions to the rule. But I find that things like internal API documentation to be lackluster on average. It is rare for me to find documentation that is up-to-date and useful. It is common to find documentation that is partial and incomplete. It is very common to find things like JSMs not documented at all. I think this is a problem. I argue the lack of good documentation raises the barrier to contributing. Furthermore, writing and maintaining excellent low-level documentation is too much effort.

My current thoughts on API and low-level documentation are that I question the value of this documentation existing on MDN. Specifically, I think things like JSM API docs (like Sqlite.jsm) and XPIDL interface documentation (like nsIFile) don't belong on MDN - at least not in wiki form. Instead, I believe that documentation like this should live in and be derived from source code. Now, if the MDN site wants to expose this as read-only content or if MDN wants to enable the content to be annotated in a wiki-like manner (like how MSDN and PHP documentation allow user comments), that's perfectly fine by me. Here's why.

First, if I must write separate-from-source-code API documentation on MDN (or any other platform for that matter), I must now perform extra work or forgo either the source code or external documentation. In other words, if I write in-line documentation in the source code, I must spend extra effort to essentially copy large parts of that to MDN. And I must continue to spend extra effort to keep updates in sync. If I don't want to spend that extra effort (I'm as lazy as you), I have to choose between documenting the source code or documenting MDN. If I choose the source code, people either have to read the source to read the docs (because we don't generate documentation from source today) or someone else has to duplicate the docs (overall more work). If I choose to document on MDN, then people reading the source code (probably because they want to change it) are deprived of additional context useful to make that process easier. This is a lose-lose scenario and it is a general waste of expensive people time.

Second, I prefer having API documentation derived from source code because I feel it results in more accurate documentation that has a higher likelihood of remaining accurate and in sync with reality. Think about it: when was the last time you reviewed changes to a JSM and searched MDN for content that needed updating? I'm sure there are some pockets of people that do this right. But I've written dozens of JavaScript patches for Firefox and I'm pretty sure I've been asked to update external documentation less than 5% of the time. Inline source documentation, however, is another matter entirely. Because the documentation is often proximal to code that changed, I frequently a) go ahead and make the documentation changes because everything is right there and it's low overhead to change as I adjust the source, or b) am asked to update in-line docs when a reviewer sees I forgot to. Generally speaking, things tend to stay in sync and fewer bugs form when everything is proximally located. By fragmenting documentation between source code and external services like MDN, we increase the likelihood that things become out of sync. This results in misleading information and increases the barriers to contribution and change. In other words, developer inefficiency.

Third, having API documentation derived from source code opens up numerous possibilities to further aid developer productivity and improve the usefulness of documentation. For example:

  • We can parse @param references out of documentation and issue warnings/errors when documentation doesn't match the AST (see the sketch after this list).
  • We can issue warnings when code isn't documented.
  • We can include in-line examples and execute and verify these as part of builds/tests.
  • We can more easily cross-reference APIs because everything is documented consistently. We can also issue warnings when cross-references no longer work.
  • We can derive files so editors and IDEs can display in-line API docs as you type or can complete functions as you type, allowing people to code faster.
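Here is the sketch referenced above: a generic illustration in plain Python (using the standard ast module, not Mozilla tooling) of the first two items - warning when @param tags don't match a function's actual signature and when a function isn't documented at all.

    # Warn when @param tags in a docstring don't match the function's
    # arguments, and when a function has no docstring at all. A generic
    # illustration, not Mozilla tooling.
    import ast
    import re
    import sys

    PARAM_RE = re.compile(r'@param\s+(\w+)')

    def check_docs(path):
        with open(path) as fh:
            tree = ast.parse(fh.read(), filename=path)

        for node in ast.walk(tree):
            if not isinstance(node, ast.FunctionDef):
                continue

            doc = ast.get_docstring(node)
            if doc is None:
                print('%s:%d: %s() is not documented' % (path, node.lineno, node.name))
                continue

            documented = set(PARAM_RE.findall(doc))
            actual = {a.arg for a in node.args.args if a.arg != 'self'}

            for name in documented - actual:
                print('%s:%d: @param %s does not exist in %s()'
                      % (path, node.lineno, name, node.name))
            for name in actual - documented:
                print('%s:%d: argument %s of %s() is not documented'
                      % (path, node.lineno, name, node.name))

    if __name__ == '__main__':
        for p in sys.argv[1:]:
            check_docs(p)

Hook something like this into the build or review process and stale documentation gets flagged the moment it diverges, rather than months later on a wiki page.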

While we don't generally do these things today, they are all within the realm of possibility. Sphinx supports doing many of these things. Stop reading and run mach build-docs right now and look at the warnings from malformed documentation. I don't know about you, but I love when my tools tell me when I'm creating a burden for others.

There really is so much more we could be doing with source-derived documentation. And I argue managing it would take less overall work and would result in higher quality documentation.

But the world of source-derived documentation isn't all roses. MDN has a very important advantage: it's a wiki. Just log in, edit in a WYSIWYG, and save. It's so easy. The moment we move to source-derived documentation, we introduce the massive Firefox source repository, the Firefox code review process, bugs/Bugzilla, version control overhead (although versioning documentation is another plus for source-derived documentation), landing changes, extra cost to Mozilla for building and running those checkins (even if they contain docs-only changes, sadly), and the time and cognitive burden associated with each one. That's a lot of extra work compared to clicking a few buttons on MDN! Moving documentation editing out of MDN and into the Firefox patch submission world would be a step in the wrong direction in terms of fostering contributions. Should someone really have to go through all that just to correct a typo? I have no doubt we'd lose contributors if we switched the change contribution process. And considering our lackluster track record of writing inline documentation in source, I don't feel great about losing any person who contributes documentation, no matter how small the contribution.

And this is my dilemma: the existing source-or-MDN solution is sub-par for pretty much everything except ease of contribution on MDN and deploying nice tools (like Sphinx) to address the suckitude will result in more difficulty contributing. Both alternatives suck.

I intend to continue this train of thought in a subsequent post. Stay tuned.

