Finding Oldest Firefox Code

June 18, 2012 at 10:10 AM | categories: Mozilla

On Twitter the other night, Justin Dolske posed a question:

Weekend challenge: what is the oldest line of code still shipping
in Firefox (tiebreaker: largest contiguous chunk)?

Good question and good challenge!

Technical Approach

To solve this problem, I decided my first task would be to produce a database holding line-by-line metadata for every file currently in the Firefox repository. This sounded difficult at first, especially since the Mercurial repository doesn't contain CVS history, which would be needed to identify still-active code that predates Mercurial.

Fortunately, there exists a Git repository with full history of mozilla-central, including CVS and Mercurial! Armed with a clone of this repository, I wrote a quick shell one-liner to ascertain the history of every line in the repository:

for f in `git ls-tree -r --name-only HEAD`; do \
  echo "BEGIN_RECORD $f"; \
  git blame -l -t -M -C -n -w -p $f; \
  echo "END_RECORD $f"; \
done

The git ls-tree command prints the names of every file in the current tree. This is basically find . -type f, but limited to files under version control by Git. git blame attempts to ascertain the history of each line in a file. The arguments -M and -C are worth pointing out: -M attempts to detect lines that were moved or copied within a file, and -C extends this to lines copied from other files modified in the same commit. If these are omitted, simple refactoring such as renaming a file or reordering code within a file would result in an improper commit attribution: Git would associate the line with the commit that moved it rather than the one that introduced it. With these flags, Git attempts to follow the chain back and find the true origin of the line (to some degree).

Now, something I thought was really cool is git blame's porcelain output format (-p). Not only does it allow for relatively simple machine readability of the output (yay), but it also compacts the output so adjacent lines sharing the same history metadata share the same metadata/header block. In other words, it solves Dolske's largest contiguous chunk tiebreaker for free! Thanks, Git!

I should also say that git blame isn't perfect for attributing code. But I think it is good enough to solve a Twitter challenge.

I piped the output of the above command into a file so I would have the original data available for later processing. After all, the data doesn't change, so it makes no sense not to save it. After running for a while, I noticed things were slower than I'd like: it took about 2 hours to obtain info for ~5000 files. No good. I played around a bit and confirmed the -M and -C flags were slowing things down. This is expected. But I really wanted this data for a comprehensive analysis.

I re-discovered GNU Parallel and modified my one-liner to use all the cores:

git ls-tree -r --name-only HEAD | \
parallel 'echo "BEGIN_RECORD {}"; git blame -l -t -M -C -n -w -p {}; echo "END_RECORD {}"'

This made things run substantially faster since I was now using all 8 cores, not just 1. With GNU Parallel, this simple kind of parallelism is almost too easy. Now, I say substantially faster, but overall execution was still slow. How slow? Well, on my Sandy Bridge MacBook Pro:

real  525m49.149s
user  3592m15.862s
sys   201m4.482s

That's 8 hours 45 minutes of wall time and nearly 60 hours of CPU time. Yeah, I'm surprised my laptop didn't explode too! The output file was 1,099,071,448 bytes uncompressed and 155,354,423 bytes bzip2-compressed.

While Git was obtaining data, I set about writing a consumer and data processor. I'm sure the wheel of parsing this porcelain output format has been invented before. But, I hadn't coded any Perl in a while and figured this was as good of an excuse as any!

The Perl script to do the parsing and data analysis is available at https://gist.github.com/2945604. The core parsing function simply calls a supplied callback whenever a new block of code from the same commit/file is encountered.
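The actual parser is Perl and lives in the Gist, but the core idea fits in a few dozen lines of any language. Here is a sketch in Python of how the porcelain format can be parsed into per-block callbacks; the function names and callback signature are my own invention, not those of the Gist:

```python
import re

# First line of a contiguous group: <sha> <orig-line> <final-line> <num-lines>
GROUP_RE = re.compile(r'^([0-9a-f]{40}) (\d+) (\d+) (\d+)$')
# Subsequent lines of the same group omit the trailing line count.
LINE_RE = re.compile(r'^[0-9a-f]{40} \d+ \d+$')

def parse_blame_porcelain(text, on_block):
    """Parse `git blame -p` output, calling on_block(sha, start_line,
    num_lines, meta) once per contiguous group of lines from one commit."""
    commit_meta = {}   # sha -> {'author': ..., 'author-time': ..., ...}
    last_sha = None
    pending = None     # group header seen, waiting for its content line
    for raw in text.splitlines():
        if raw.startswith('\t'):
            # Actual file content; the group's metadata is now complete.
            if pending:
                sha, start, count = pending
                on_block(sha, start, count, commit_meta[sha])
                pending = None
            continue
        m = GROUP_RE.match(raw)
        if m:
            sha, _orig, final, count = m.groups()
            commit_meta.setdefault(sha, {})
            last_sha = sha
            pending = (sha, int(final), int(count))
            continue
        if LINE_RE.match(raw):
            last_sha = raw.split()[0]
            continue
        # Anything else is a "key value" header for the last seen commit,
        # e.g. "author-time 891129600" (epoch seconds in porcelain output).
        key, _, value = raw.partition(' ')
        commit_meta[last_sha][key] = value
    return commit_meta
```

Because porcelain output only emits the line count on the first header of each group, the contiguous-chunk sizes needed for the tiebreaker fall out directly from the header parsing.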

I implemented a function that records a mapping of commit times to blocks. Finally, I wrote a simple function to write the results.
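As a sketch of what that recording step can look like (my own hypothetical names, not the Gist's): a callback that buckets each block under its commit's author time, plus a reporter that walks the buckets oldest-first, largest chunk first:

```python
from collections import defaultdict

# author time (unix epoch) -> list of (filename, start_line, num_lines, sha)
blocks_by_time = defaultdict(list)

def on_block(sha, start_line, num_lines, meta, filename):
    """Record one contiguous blame block under its commit's author time."""
    t = int(meta.get('author-time', 0))
    blocks_by_time[t].append((filename, start_line, num_lines, sha))

def oldest_blocks(limit=10):
    """Return the `limit` oldest blocks; within a timestamp, biggest first."""
    out = []
    for t in sorted(blocks_by_time):
        for blk in sorted(blocks_by_time[t], key=lambda b: -b[2]):
            out.append((t, blk))
            if len(out) == limit:
                return out
    return out
```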

Results

What did 60 hours of CPU time tell us? Well, the oldest recorded line dates from 1998-03-28. This is actually the Free the Lizard commit - the first commit of open source Gecko to CVS. From this commit (Git commit 781c48087175615674 for those playing at home), a number of lines linger, including code for mkdepend and nsinstall.

But, Dolske's question was about shipping code. Well, as far as I can tell, the honor of oldest shipping code in the tree is shared by the following:

  • js/jsd/jsd1640.rc (pretty much the whole file)
  • js/jsd/jsd3240.rc (ditto)
  • js/jsd/jsd_atom.c:47-73
  • js/src/jsfriendapi.cpp:388-400
  • js/src/yarr/YarrParser.h:295-523
  • media/libjpeg/jdarith.c:21-336
  • media/libjpeg/jctrans.c:20-178
  • media/libjpeg (misc files all have large blocks)
  • xpcom/components/nsComponentManager.cpp:897-901
  • gfx/src/nsTransform2D.h (a couple 3-4 line chunks)
  • toolkit/library/nsDllMain.cpp:22-31
  • widget/windows/nsUXThemeData.cpp:213-257
  • widget/windows/nsWindowGfx.cpp:756-815
  • xpcom/ds/nsUnicharBuffer.cpp:14-33
  • xpcom/glue/nsCRTGlue.cpp:128-174

There are also a few small chunks of 1 or 2 lines in a couple dozen other files from that initial commit.

Further Analysis

If anyone is interested in performing additional analysis, you can just take my Gist and install your own onBlock and output formatting functions! Of course, you'll need a data set. My code is written so it will work with any Git repository. If you want to analyze Firefox, it will take hours of CPU time to extract the data from Git. To save some time, you can obtain a copy of the raw data from commit 0b961fb702a9576cb456809410209adbbb956bc8.

There is certainly no shortage of interesting analysis that can be performed over this data set. Some that come to mind are a scoreboard for most lines of code in the current tree (friendly competition, anyone?) and a breakdown of active lines by the period they were introduced.
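For example, a lines-per-author scoreboard needs nothing more than a callback that tallies block sizes by the author field in the blame metadata. A sketch (function names hypothetical, not from the Gist):

```python
from collections import Counter

# author name -> total surviving lines attributed to that author
line_counts = Counter()

def on_block(sha, start_line, num_lines, meta):
    # Credit every line in the block to the commit's author.
    line_counts[meta.get('author', 'unknown')] += num_lines

def scoreboard(top=20):
    """Return the top authors by surviving line count, descending."""
    return line_counts.most_common(top)
```

The breakdown of active lines by period would be the same shape: replace the author key with a bucketed author-time (say, year) and tally into that.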

I'll leave more advanced analysis as an exercise for the reader.