This page contains information on my technical history. You can think of it as a very descriptive resume.
In my 4 years at Case Western Reserve University, I dabbled in a number of small projects and contributed to a few larger ones. Most of my efforts focused on projects related to the university, but some impacted people outside.
As a student employee of the university's IT team, I was responsible for rolling out the Case Wiki, a central and unified wiki for the university. A designer prototyped the layout and I created a MediaWiki skin for it.
While maintaining the wiki, I developed numerous MediaWiki extensions. These were all released as open source. These extensions include:
While working for the university, I deployed CAS as a single sign-on service for the entire university. Prior to its rollout, individual sites were taking login credentials from HTML forms, HTTP basic auth, etc. and trying to log in to the university's LDAP server.
Initially, very few services used CAS. Over time, more and more people got on board and I'm pretty sure that now almost every site at the university uses it.
A few friends and I thought it would be a great idea for people at the university to have a central service that hosted projects. We deployed an instance of Trac at opensource.case.edu (which was initially hosted on my personal Linux desktop machine). Students, staff, and faculty slowly but surely added projects there. Keep in mind this was before GitHub, when SourceForge was king. But, with SourceForge, you couldn't integrate with the university's authentication system, easily show community involvement, etc.
Unfortunately, after I left the university, the server that hosted opensource.case.edu had a catastrophic hardware failure. Data was lost and the service is no longer available. I had transitioned ownership of the site before that happened, so I like to think that the data loss was not my fault.
I was part of a small team hired by Undergraduate Student Government to implement web site features for that group. This included an election voting system and a custom management system used by members of USG to help them organize bills, events, etc.
At the time I was at the university, there was a very crude system in place for performing course and instructor evaluations. During class, students would fill out cards with a #2 pencil with rankings of various aspects of the course and instructor. These were compiled by the university and made available as text files on a central server. These were exposed through a web site with only a crude search interface.
I wrote a tool that scraped all data (I believe it went back 10+ years), imported it into a database, then exposed the data, complete with pretty graphs, via a web site. This alone was a major improvement over the existing system. From there, I built out extra features, such as aggregation of an individual instructor's ratings over all courses, including all of time. This made it easier for students to determine which instructors were likely to teach better and thus what course offering to sign up for.
Feedback was generally positive, especially from the students. However, some didn't care for the site because it exposed information they didn't want to see (the graph of average ratings per department apparently drew the attention of the administration). Some wanted me to take down the site. But, I argued that I was just using the data already available by the university in new and creative ways. With backing from USG, I won out.
During my summer break senior year in college, I was an intern at Tellme Networks, which at the time was a successful startup in the speech recognition and telephone space. I loved every minute of it.
My summer internship project was to design and implement a provisioning system for VoIP telephones. The system consisted of a dynamic TFTP server (I rolled my own TFTP server because no TFTP servers had dynamic hooks and the protocol is very simple to implement), dynamic HTTP server to process requests from telephones, a backend store, and a web administration interface.
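To give a flavor of why the protocol is simple enough to implement yourself, here is a minimal sketch of the two pieces a dynamic TFTP server needs: parsing a read request and cutting a generated file into DATA packets. This is not the code from the project. The packet layout follows RFC 1350, while `render_config` and the file name in the usage note are hypothetical stand-ins for the dynamic hook.

```python
import struct

OP_RRQ = 1        # read request opcode (RFC 1350)
OP_DATA = 3       # data packet opcode (RFC 1350)
BLOCK_SIZE = 512  # fixed block size in the base protocol


def parse_rrq(packet: bytes):
    """Return (filename, mode) from a TFTP read request packet."""
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode != OP_RRQ:
        raise ValueError("not a read request")
    # After the opcode come two NUL-terminated strings: filename and mode.
    parts = packet[2:].split(b"\x00")
    return parts[0].decode("ascii"), parts[1].decode("ascii").lower()


def render_config(filename: str) -> bytes:
    """Hypothetical dynamic hook: build the file body on the fly."""
    return f"# generated config for {filename}\n".encode("ascii")


def data_packets(body: bytes):
    """Yield DATA packets for body.

    A block shorter than 512 bytes (possibly empty) ends the transfer.
    Block numbers are 16-bit, so this sketch only handles small files.
    """
    block = 1
    offset = 0
    while True:
        chunk = body[offset:offset + BLOCK_SIZE]
        yield struct.pack("!HH", OP_DATA, block) + chunk
        if len(chunk) < BLOCK_SIZE:
            break
        block += 1
        offset += BLOCK_SIZE
```

A real server would wrap this in a UDP socket loop and wait for an ACK after each DATA packet, for example calling `data_packets(render_config(name))` once `parse_rrq` has returned the requested `name`.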
When completed, one could simply plug in any supported Cisco or Polycom VoIP telephone into the ethernet network and the phone would download the latest firmware and bootstrap with a suitable configuration. For phones that weren't provisioned, their initial configuration was configured to dial a VXML application hosted on Tellme's platform when the receiver was lifted. This application talked to the company LDAP server, prompted the person for his or her name, looked up information in the directory, and updated the phone provisioning system with the information. The phone would then reboot, picking up the user's configuration, complete with saved dialing preferences, etc.
By far the coolest feature of the system was the ability via the admin interface to remotely reboot all connected phones. Combined with a custom bootup chime and an open office space, it was quite amusing listening to 20+ phones play simple jingles simultaneously!
I must have done a good job with the project and generally impressed people, because I was given a full-time offer at the end of my internship. I accepted, finished my senior year at college, and started full-time at Tellme in January 2007.
At one time, I was maintaining www.tellme.com and m.tellme.com. These were both written in PHP and were utilizing the Zend Framework. Although, since 90%+ of the content on www.tellme.com was static, there was aggressive caching in place such that most page views did not involve much PHP, keeping server load down. As part of this role, I worked closely with the creative team for designs and graphics and with marketing to manage content.
At one time m.tellme.com hosted a download component for Tellme's mobile application, which ran on most BlackBerry devices and other misc phones. One of the most frustrating experiences I have had to date was verifying that the site worked on all these devices. Each device had different support for rendering web sites. On top of that, you were dealing with different resolutions. Keep in mind this was before modern browsers came to mobile devices. Supporting these all concurrently was a real chore. Even the identification part was a nightmare, as phones could only usually be identified by the User-Agent HTTP request header or by the presence of some other non-standard header. And, this wasn't always consistent across carriers or guaranteed. What a nightmare! I'm glad I don't have to worry about this any more.
Since the assimilation of Tellme by Microsoft, www.tellme.com and m.tellme.com are no longer available. Fortunately, the Internet Archive contains a semi-working snapshot.
When I started at Tellme, CVS was the lone version control system for engineering. (I even think one group was still on RCS!) I asked around about why a more modern version control system wasn't in use, and nobody seemed to have a good answer (I think the common answer was "because nobody has set one up").
After consulting a lot of peers, it was decided that Subversion was the best fit for a new, supported VCS. It wasn't that Subversion was the best of breed (I believe Perforce arguably was at the time). It was chosen because it was very similar to CVS (their motto at the time was a compelling replacement for CVS), it was free and open source, and had a great hooks system.
I worked with the operations team to deploy Subversion and announced it to the company. Slowly but surely, people started using it. Usage increased dramatically after I did a brownbag presentation on Subversion and its merits over CVS and I started to roll out custom hooks to supplement functionality. Teams loved the ability to receive emails on commits, close bugs (via a special syntax in the commit messages), and require buddies on commit (also via special syntax in commit messages). This all required special commit hooks, of course. Unfortunately, those were developed after Tellme became Microsoft, so they will likely never see light outside company walls.
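Since the real hooks can't be shared, here is a rough sketch of the general idea behind commit-message syntax. The `bug:` and `buddy:` line formats are invented for illustration and are not the syntax that was actually used. A real Subversion pre-commit hook would obtain the message with `svnlook log` and reject the commit by exiting non-zero.

```python
import re

# Hypothetical syntax: one "bug: <number>" or "buddy: <name>" per line.
BUG_RE = re.compile(r"^bug:[ \t]*(\d+)[ \t]*$", re.IGNORECASE | re.MULTILINE)
BUDDY_RE = re.compile(r"^buddy:[ \t]*(\w+)[ \t]*$", re.IGNORECASE | re.MULTILINE)


def check_commit_message(message: str, require_buddy: bool = True) -> dict:
    """Extract bug numbers and buddies, and collect reasons to reject."""
    bugs = [int(number) for number in BUG_RE.findall(message)]
    buddies = BUDDY_RE.findall(message)
    errors = []
    if require_buddy and not buddies:
        errors.append("commit rejected: no 'buddy: <name>' line found")
    return {"bugs": bugs, "buddies": buddies, "errors": errors}
```

The bug numbers returned here could then be handed to a post-commit step that closes the corresponding bugs, while a non-empty `errors` list would block the commit.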
At some point, I lobbied someone in the engineering org to splurge for a new server for hosting the source code services. I worked with the operations team to get that installed. And, I received a lot of hacker cred for making the version control systems 5x faster (which was a big deal for some of the teams, with their large trees that took a couple of minutes to switch branches with CVS).
At some point in my time at Tellme/Microsoft, I organized a weekly brown bag series. Every Wednesday, individuals or small groups would present topics to whoever was in attendance. These were often engineer-centric, but anyone could and did present.
I seeded the sessions with presentations by myself and a few select people. After that, I either had people contact me to be added to the calendar or I sought out and encouraged people to present. Every week, I also recorded the sessions and made them available on the intranet.
I believe sometime in 2009, I grew frustrated with the way we were doing code reviews. The act of looking at plain text diffs and recording comments in email or similar just felt outdated. When others felt the same, I started researching solutions. I heard Google had a pretty nifty one, but it was only partially available to the public. I decided on Review Board and deployed it. It became an overnight success with almost immediate adoption by every team. Great tool.
In my time at Tellme/Microsoft, I became a go-to person in the org for questions on Perl, mod_perl, Apache HTTP Server, compiling packages, and other misc topics. The first two sort of scared me at the time because when I started at Tellme, I knew almost nothing about Perl. Somehow, I became good enough that I could dole out advice and answers and know what I was doing in code reviews.
For the Apache HTTP Server, I was very familiar with the internals, since I wrote a couple of C modules while at Tellme. As such, all the weird bugs or questions about server behavior seemed to always come my way. When a team was deploying a new site or service, I would often be the one reviewing the configs or lending advice on how to configure it.
For compiling packages, I always took satisfaction out of wrangling things to compile on Solaris x86 (although, we ran a typical GNU toolchain as opposed to the Sun stack, so it wasn't as bad as it sounds). I was also pretty adamant about packages being built properly (e.g. considering which libraries were statically and dynamically linked, defining proper linker flags to foster easier linking in the future, getting an optimized binary, etc). So, people would often come to me and ask me to review the compilation procedure.
For most of my time at Tellme, I was a member of an about 15 person team which wrote and maintained many of the Tellme/Microsoft-branded VXML applications and the services they directly required. (VXML applications often consist of static VXML documents and dynamic web services with which they talk at run-time.) One of my major roles on the team was managing the servers these resided on.
This role involved numerous responsibilities. First, I needed to be familiar with all aspects of the dynamic services so I would know how to triage them, make informed decisions about scalability, etc. I was often developing these services myself. When I wasn't, I typically became informed by being part of the code reviews. I would often address concerns around scalability and failure, such as timeouts to remote services, expected latency, ensuring adequate monitors were present and tested, etc.
Once the services were in my court, I needed to figure out where to host them. We had a number of different server pools from which to choose. Or, there was always the option of buying new hardware, but you want to keep costs down whenever possible, of course. I had to take into consideration the expected load, security requirements, network ACL connectivity requirements, technology requirements, etc. This was always a complicated juggling act. But, people knew that I would get it done and things would just work.
As part of this role, I interacted closely with the operations team and the NOC. The operations team would be involved whenever new servers were needed, new ACLs were to be deployed, etc. At a large company, this was often a formal process, which required working with various project and product managers, navigating political waters, etc. I could talk the lingo with the operations people, so having me always act as the go-between was effective at getting things done. As for the NOC, they were involved any time we changed anything. The operations mindset is to achieve 99.999% uptime on everything. As part of that, any change is communicated and planned well in advance along with the exact procedure to be performed. I knew most of the individuals in the NOC and they knew me.
Sometime in 2009, I set out to make improvements to Tellme's proprietary event logging and transport infrastructure. The first part of this involved writing a new C library that performed writing. When I started, there existed the initial C++ implementation and feature-minimal implementations in Perl and a few other languages. However, the C++ implementation required a sizeable number of external dependencies (on the order of 40MB) and interop with C was difficult. People in C land found themselves in dependency hell and were cursing the computer gods.
I implemented a C library from scratch, using the C++ code as a guide, but only for the low-level details of the protocol, which weren't documented well outside of the code. The initial version only supported writing. It was free of external dependencies, compiled on both Solaris and Windows, and weighed in at a svelte 50kB.
A few months later, an individual came to me and said something along the lines of, "I really love the simplicity of your library. But, I need to perform reading. Can you help me?" So, I implemented reading support in the library. It turned out that the reading library was much more efficient than the C++ one. And, for what my colleague was using it for (iterating over hundreds of gigabytes of data), this made a huge difference! What followed was a lot of low-level performance optimization to maximize the reading throughput of the library. A stack instead of dynamic allocation here, a register variable there, all contributed to significant performance improvements to the already insanely fast library. Towards the end, we were in the territory of the x86 calling convention hurting us more than any other part.
I achieved great satisfaction on this project. And, a lot of it was after it was complete (and I had even left the company). People were using the library in ways I had not foreseen (like in C# or on non-server devices). There was even talk of shipping the library as part of an external release, which never would have been possible with the C++ version because it linked to open source software and Microsoft is very sensitive to that (sadly). A few months after I left the company, somebody emailed me and said something along the lines of, "your XXX library is the best C library I have ever seen. I wish all C libraries were this easy to use and would 'just work.'" Sadly, the library is Microsoft IP, so I can't share it with you. But, I can refer you to people who have used it!
It is my understanding that Xobni hired me to bring experience to their at-the-time frail cloud system, Xobni One (now known as Xobni Cloud). When I joined, their operations procedures seemed like they were out of the wild west (at least this was my perspective coming from Tellme/Microsoft). System monitoring for the product I was to beef up consisted of someone looking at some Graphite graphs, noticing a change in the pattern, and investigating. We could do better.
One of my first acts at Xobni was to deploy a real monitoring system. We decided on Opsview because it is built on top of Nagios, a popular and well-known monitoring system (although not one without its flaws) and offers a compelling front-end for managing Nagios, which is typically a real chore.
I set up Opsview on a central host, deployed NRPE on all the hosts, and started monitoring. Email alerts were configured and all were happy we now knew in near real time when stuff was breaking. At one point, I even had a custom Nagios notification script that used the Twilio API to call people with alert info. But, we didn't pursue that further. Cool idea, though!
When I arrived, only minimal server metrics were being collected. While they were being fed to Graphite, a compelling replacement for RRDTool, they were being collected via an in-house Python daemon and any new collection required custom plugins. There exist many open source tools for metrics collections, so I swapped out the custom code for Collectd, which I think is an excellently-designed collection system. (I like Collectd because of its plugin system and ability to write plugins in C, Java, Perl, and Python.)
The only drawback was Collectd used RRD out of the box. We used that for a little bit. But, we quickly missed Graphite, so I wrote a Collectd plugin that writes data to Graphite instead of RRD.
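The core of such a plugin is small, because Carbon accepts a plaintext line protocol of the form `<metric path> <value> <timestamp>` over TCP (port 2003 by default). The sketch below shows only that part and is not the plugin's actual code. The `collectd.<host>.<plugin>.<type>` naming scheme and the function names are assumptions made for illustration.

```python
import socket
import time


def carbon_line(host, plugin, type_instance, value, timestamp=None):
    """Format one metric as a Carbon plaintext protocol line."""
    ts = int(time.time() if timestamp is None else timestamp)
    # Dots separate path components in Graphite, so keep the
    # hostname as a single component by replacing its dots.
    safe_host = host.replace(".", "_")
    return f"collectd.{safe_host}.{plugin}.{type_instance} {value} {ts}\n"


def send_lines(lines, address=("localhost", 2003)):
    """Send already-formatted lines to a Carbon listener over TCP."""
    with socket.create_connection(address) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

Inside Collectd, a function like `carbon_line` would be called from the plugin's write callback for each value it receives, with lines batched up before `send_lines` pushes them to Carbon.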
At the end of the day, we were recording more metrics with less effort and had much more data to back up our decisions.
When I joined Xobni, Xobni Cloud was still considered beta and was fraught with stability issues. The overall architecture of the service was fine, but things were rough around the edges. When I first started, I believe the service had around 500 users and was crashing daily. We could do better.
One of my first contributions was to learn how Cassandra worked and how to make it run faster. A well-tuned Cassandra running on top of a properly configured JVM makes a world of difference. We saw tons of improvement by experimenting here.
One of my first major code contributions to the product was to transition to Protocol Buffers for data encoding. Previously, the system was storing JSON anywhere there was structured data. Switching to Protocol Buffers cut down on CPU (most of the CPU savings were because the JSON library was utilizing reflection). More importantly, it reduced our storage size dramatically. And less data stored means less work for the hard drives.
On the failure resiliency side of the scaling problem, I replaced the Beanstalk-based queue system with jobs stored in Cassandra. I have nothing against Beanstalk. But, in our architecture, Beanstalk was a central point of failure. If the server died and we couldn't recover Beanstalk's binary log on disk, we would have lost user data. We did not want to lose user data, so we stored the queue in a highly-available data store, Cassandra. I also made a number of application changes that allowed us to upgrade the product without incurring any downtime. (When I joined, you had to turn the service off when upgrading the software.)
Another change that saw significant performance gains was moving the Cassandra sstable store to btrfs, a modern Linux filesystem. The big performance win came not from the filesystem switch itself, but from enabling filesystem compression. At the expense of CPU (which we had plenty of on the Cassandra nodes), we cut down significantly on the number of sectors being accessed, which gave Cassandra more head room. I would have preferred using ZFS instead of btrfs because it is stable and has deduplication (which I theorize would help Cassandra because of its immutable sstables which carry data forward during compactions). But, we weren't willing to switch to FreeBSD or OpenSolaris to obtain decent ZFS performance.
There were a number of smaller changes that all amounted to significant performance gains. But, they are difficult to explain without intimate knowledge of the system. I will say that at the end of the day, we went from crashing every day on 500 users to being stable for weeks on end and serving over 10,000 users on roughly the same hardware configuration.
I started working for Mozilla on July 18, 2011 as part of the Services team. The Services team is responsible for writing and maintaining a number of Mozilla's hosted services. When I started, the main service was Firefox Sync, but a number of other services were in the pipeline.
My first major contribution to Sync/Firefox was add-on sync, which keeps your add-ons in sync across devices. It shipped as part of Firefox 11. Add-on sync was a heavily requested feature, so I felt quite good about shipping it.
As part of working on Firefox Sync, I learned a lot about various Firefox components and how they work. I also expanded my syncing knowledge (the product I worked on at Xobni was essentially contact sync) to cover scenarios where synchronization must be performed in a distributed manner on clients (Firefox Sync data was encrypted and the server only sees an opaque blob). I also learned a bit about cryptography. Some of my understanding of this space is demonstrated in this blog post summarizing the security of various browser syncing implementations.
I was the lead implementor on the Firefox desktop implementation of Firefox Health Report, a feature that collects data from every Firefox install and sends it to Mozilla (in a privacy-conscious manner, of course).
Firefox Health Report (or FHR) was a huge project. It's not every day that you have the opportunity to write a feature that will be used by over 100 million people! FHR was also one of those projects that had a lot of interest from management and leadership. FHR was going to be the first time Mozilla collected so much data from every Firefox user by default. Up to that point, Mozilla collected very little data from its entire user base. What collection existed was either opt in (Telemetry) and thus had a low activation rate, or collected very little data at all. (Before FHR, Mozilla measured active daily users by counting the number of update ping requests: HTTP requests sent by Firefox to see whether a new Firefox release and/or add-on versions are available.) FHR was a huge change of direction and a lot of people from metrics/statistics, security, privacy, performance, etc. all wanted a seat at the table. There were a lot of cooks in the kitchen and I got to feel what it was like to be in that position at Mozilla. It was a learning experience to say the least.
In my early days at Mozilla, I grew frustrated with the build system (as most Mozilla people do). So, I did what curious engineers often do and started digging deeper into the rabbit hole. One thing led to another and I eventually became the module owner (Mozilla speak for the person with governance responsibility over something).
When I first got involved with the build system, Firefox was built from over 1000 Makefiles. We employed what is called recursive make. Essentially, you have a tree of Makefiles which is iterated upon. This technique is far from efficient. There's even a Recursive Make Considered Harmful paper explaining it.
Everyone knew the build system sucked and needed to be improved. But nobody knew what to do or had the will to tackle a major change. I spent a lot of time looking at the problem space, experimenting with various solutions. I wrote up a very detailed blog post detailing a transition plan. After much technical deliberation, we adopted a plan to use sandboxed Python files to define our build configuration. At the time, I thought this was a novel idea. I thought it was somewhat risky because it had never been tried before. I later learned that the solution was effectively invented at Google years before. The Google project is called Blaze and the approach has been copied by Twitter's Pants build tool, Facebook's Buck build tool, and eventually Chromium's GN tool. I essentially independently arrived at the same solution that Google did and that felt pretty reassuring!
Over time, the Firefox build config data was slowly transitioned to moz.build files. We couldn't do this atomically in a flag day because it would be too much work. Instead, we moved things over and continued to emit Makefiles behind the scenes. Where we could, we would consolidate data from various parts of the source tree together and emit optimized build rules. moz.build files enabled us to do things we couldn't do with recursive make and enabled us to build Firefox more efficiently.
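For readers who haven't seen the pattern, here is a toy sketch of what "sandboxed Python files as build configuration" can look like. It is not Mozilla's implementation. The variable whitelist and the `read_build_file` function are assumptions for illustration, and emptying `__builtins__` is only a guard against accidents, not a real security boundary.

```python
# Hypothetical whitelist: the only variables a build file may define,
# along with the type each one must have.
ALLOWED_VARIABLES = {"DIRS": list, "SOURCES": list, "LIBRARY_NAME": str}


def read_build_file(source: str) -> dict:
    """Execute build-file source in a restricted namespace and validate it."""
    # Pre-populate every variable so build files can use "+=".
    namespace = {name: kind() for name, kind in ALLOWED_VARIABLES.items()}
    # With no builtins, "import", open(), etc. fail inside the build file.
    namespace["__builtins__"] = {}
    exec(compile(source, "<build file>", "exec"), namespace)

    result = {}
    for name, value in namespace.items():
        if name == "__builtins__":
            continue
        if name not in ALLOWED_VARIABLES:
            raise ValueError(f"unknown variable: {name}")
        if not isinstance(value, ALLOWED_VARIABLES[name]):
            raise TypeError(f"{name} must be a {ALLOWED_VARIABLES[name].__name__}")
        result[name] = value
    return result
```

Because the result is plain data rather than a set of make rules, a backend is free to read every file up front, look at the whole tree at once, and then emit whatever build rules it likes.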
From my first days contributing to Firefox, I was frustrated at how difficult everything was. There were so many hurdles preventing people from getting started and once you got up and running, there were so many tasks that were non-intuitive. At Mozilla, I was on a never-ending quest to improve the developer experience and to make developers more productive.
One of my major contributions to Firefox development is a tool called mach. Mach is effectively a command-line command dispatcher. You register commands and it runs them. Simple, right? Before mach came along, Firefox developers had to run over a dozen different commands to perform common tasks. The command locations and their options were non-intuitive. Mach fixed all of that.
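The "register commands and it runs them" idea fits in a few lines. The sketch below is a simplified illustration of a command dispatcher, not mach's actual API. The `command` decorator, the `COMMANDS` registry, and the two example commands are all hypothetical.

```python
COMMANDS = {}  # command name -> (function, description)


def command(name, description=""):
    """Decorator that registers a function as a named command."""
    def register(func):
        COMMANDS[name] = (func, description)
        return func
    return register


@command("build", "Build the tree")
def build(args):
    return f"building {' '.join(args) or 'everything'}"


@command("run-tests", "Run a test suite")
def run_tests(args):
    return f"testing {' '.join(args) or 'all suites'}"


def dispatch(argv):
    """Run the command named by argv[0], or return a usage listing."""
    if not argv or argv[0] not in COMMANDS:
        listing = [f"  {name}: {desc}" for name, (_, desc) in sorted(COMMANDS.items())]
        return "usage: <command> [args]\n" + "\n".join(listing)
    func, _ = COMMANDS[argv[0]]
    return func(argv[1:])
```

Calling `dispatch(["build", "browser"])` runs the registered `build` function, and calling it with no arguments lists every known command, which is most of what makes a single entry point discoverable.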
I first blogged about Mach in May 2012. I think that post details some of the empathy I feel towards new contributors and onboarding new members. After an uphill battle where I couldn't find someone willing to buy in to my vision and allow mach to be checked into the tree, mach finally landed in September 2012. In the time since, mach has increased in popularity and gained dozens of commands. It's now used by most developers and I commonly hear things like "I can't believe we lived in a world without mach for so long!".
Onboarding new contributors has always been important to me. When I started at Mozilla, if you wanted to build Firefox, you needed to install all the build dependencies manually. You did this by following instructions on a wiki that were frequently out of date. I think it's inefficient to perform actions that can be automated, so I wrote a tool to automate it. People can now type a one-liner into the shell to configure their system to build Firefox.
There was a growing movement at Mozilla in 2012 and 2013 to use Git for developing Firefox. (The canonical repository is Mercurial.) A tool called hg-git was being used to allow developers to convert Mercurial commits to and from Git commits. A major complaint was it was too slow. So, I started learning a lot about Mercurial and Git's internals and set about to improve it. The results speak for themselves.
When I started at Mozilla, it was clear that Mozilla's build and testing automation had a lot of potential to grow. I've been casually involved in making it better.
One of the things about Mozilla's automation that troubled me a lot was the lack of machine readable output. For example, to determine whether a test job was successful, we would parse the log output and look for certain strings using regular expressions. This is a very fragile process and it was prone to breaking and made it difficult to change output without breaking the parser. I wrote a blog post on structured logging and later worked with the automation team to integrate that approach into our testing automation. I even mentored a summer intern in 2013 who had this as his chief project. As of April 2014, things are still moving forward and Mozilla is on the trajectory of emitting machine-readable data from automation. I can't wait for that day to come.
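To illustrate the difference, here is a small sketch of structured logging: each event is written as one JSON object per line, and the consumer reads fields instead of matching strings. The event names (`test_end`) and fields (`status`, `test`) are made up for this example and are not the format Mozilla's automation adopted.

```python
import json
import time


def format_event(action, **fields):
    """Serialize one log event as a single line of JSON."""
    record = {"action": action, "time": time.time(), **fields}
    return json.dumps(record, sort_keys=True)


def job_succeeded(lines):
    """Decide success from structured events, with no regular expressions."""
    for line in lines:
        event = json.loads(line)
        if event["action"] == "test_end" and event["status"] != "PASS":
            return False
    return True
```

With this shape, the human-readable wording of a message can change freely, because nothing downstream depends on the wording, only on the fields.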
Similar in vein to lack of machine readable output from automation was the lack of data being captured at all. For example, Mozilla was not recording system resource usage (CPU, I/O, memory, etc) and thus was not measuring how efficient our automation was. The optimization engineer with server-side experience in me tells me that you should try to get 100% out of your servers or you are wasting money. So, I patched our automation code to record system resource usage.
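As a rough idea of what recording resource usage around a job step can look like, here is a sketch that uses only the Python standard library. It is not the patch that went into Mozilla's automation. It captures wall time and CPU time only (not I/O or memory), and the `record_usage` helper is hypothetical.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def record_usage(results, name):
    """Record wall and CPU time spent inside the with-block under results[name]."""
    start_wall = time.monotonic()
    start = os.times()
    try:
        yield
    finally:
        end = os.times()
        results[name] = {
            "wall": time.monotonic() - start_wall,
            # Include child processes, since job steps often spawn them.
            "cpu_user": (end.user - start.user)
            + (end.children_user - start.children_user),
            "cpu_system": (end.system - start.system)
            + (end.children_system - start.children_system),
        }
```

Comparing CPU time against wall time per step gives a crude measure of how busy a machine actually was, which is the kind of number that makes waste visible.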
I also built some tools for analyzing Mozilla's automation data. One tool visualized the efficiency of every machine in automation. Although the tool no longer is live, it was used to show people that a lot of the money we were spending on machines was being wasted. It turned some heads. I also wrote a tool that aggregated and allowed analysis of bulk automation data.
Somehow I became a version control geek during my time at Mozilla. It likely started with my hg-git optimization work. I think what captivated me was the scaling problems within both Mercurial and Git. I was also writing a lot of Python at the time and was also captivated by Mercurial's extensibility and hackability. I wrote about the topic.
It was the summer of 2013 that I became a Mercurial convert. I used to loathe working with Mercurial (preferring Git instead). With what I know now, pretty much the only reasons I'd use Git are for GitHub and because most everyone seems to know Git these days. Those are big reasons. But when it comes down to your version control system as a tool, Mercurial wins hands down.
It was at Mozilla that my Python knowledge developed from intermediate to I'd say pretty advanced. A lot of my projects for Firefox's build system and automation are written in Python. Mercurial is Python.
I even got Slashdotted writing about Python.
I was a very early fan of Docker. When it came out, I immediately saw the potential for use in Mozilla's automation infrastructure. I went over to dotCloud (now Docker Inc) to discuss Mozilla's use cases very early in Docker's lifetime. For a while, a quote of mine was in the main Learn about Docker slide deck that was prominently featured on Docker's website!
My One Year at Mozilla is worth reading.
Pretty much all my blog posts from 2012 and on are related to Mozilla in one way or another.
I have contributed significant patches to Clang's Python bindings. The Clang Python bindings allow you to examine the token stream and AST that Clang generates. It is my understanding that the Clang Python bindings are heavily used in the science and research arenas, where people are using higher-level tools for examining source code.
Zippylog is a high performance stream processing platform. I started the project in late July 2010 and hack on it when I have time.
I started the project to solve what I thought was a gap in the market. I also wanted to start a personal project to assess what my skills were as an individual developer. And, since I was working behind Microsoft's walls, where open source contributions were difficult to swing, I wanted to do something in the open for all to see.
My GitHub project, lua-protobuf, integrates the programming language Lua with Google's Protocol Buffer serialization format. Both are extremely fast, so it is a marriage that needed to happen.
I started the project because I wanted to consume protocol buffers in Lua from within zippylog.
Interestingly, lua-protobuf is a Python program that generates C/C++ code that provides Lua scripts access to protocol buffers. Yeah, that makes no sense to me either, but it works.
Collectd is an excellent metrics collection and dispatching daemon. Graphite is a great data recording and visualization tool. I decided to marry them by writing a Collectd plugin, collectd-carbon, which writes values to Graphite/Carbon. (Carbon is the name of the network service that receives values.)
I've made contributions to Clang's Python bindings. These bindings allow you to consume the Clang C API (libclang) using pure Python (via ctypes).