This page contains information on my technical history. You can think of it as a very descriptive resume.
In my 4 years at Case Western Reserve University, I dabbled in a number of small projects and contributed to a few larger ones. Most of my efforts focused on projects related to the university, but some had an impact beyond it.
As a student employee of the university's IT team, I was responsible for rolling out the Case Wiki, a central and unified wiki for the university. A designer prototyped the layout and I created a MediaWiki skin for it.
While maintaining the wiki, I developed numerous MediaWiki extensions, all of which were released as open source.
While working for the university, I deployed CAS as a single sign-on service for the entire university. Before its rollout, individual sites were collecting login credentials from HTML forms, HTTP basic auth, etc. and attempting to log in to the university's LDAP server directly.
Initially, very few services used CAS. Over time, more and more got on board, and I'm fairly confident that almost every site at the university now uses it.
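For illustration (this is not Case or CAS source code, just a sketch of the CAS 2.0 protocol), a participating service validates a user's ticket by calling the CAS server's /serviceValidate endpoint and parsing the XML it returns:

```python
import xml.etree.ElementTree as ET

# The CAS 2.0 XML namespace used in serviceValidate responses.
CAS_NS = "{http://www.yale.edu/tp/cas}"

def parse_validation_response(xml_text):
    """Return the authenticated username, or None if validation failed.

    The service obtains `xml_text` by requesting
    /serviceValidate?ticket=...&service=... from the CAS server.
    """
    root = ET.fromstring(xml_text)
    success = root.find(CAS_NS + "authenticationSuccess")
    if success is None:
        return None
    user = success.find(CAS_NS + "user")
    return user.text if user is not None else None

sample = """<cas:serviceResponse xmlns:cas="http://www.yale.edu/tp/cas">
  <cas:authenticationSuccess><cas:user>jdoe</cas:user></cas:authenticationSuccess>
</cas:serviceResponse>"""
print(parse_validation_response(sample))  # jdoe
```

The appeal over the old approach is clear: the application never sees the user's password, only an opaque ticket that the CAS server vouches for.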
A few friends and I thought it would be a great idea for people at the university to have a central service that hosted projects. We deployed an instance of Trac at opensource.case.edu (which was initially hosted on my personal Linux desktop machine). Students, staff, and faculty slowly but surely added projects there. Keep in mind this was before GitHub, when SourceForge was king. But SourceForge couldn't integrate with the university's authentication system, easily show community involvement, and so on.
Unfortunately, after I left the university, the server hosting opensource.case.edu suffered a catastrophic hardware failure; the data was lost and the service is no longer available. I had transitioned ownership of the site before that happened, so I like to think the data loss was not my fault.
I was part of a small team hired by the Undergraduate Student Government to implement web site features for that group. This included an election voting system and a custom management system used by USG members to help them organize bills, events, etc.
At the time I was at the university, there was a very crude system in place for performing course and instructor evaluations. During class, students would fill out cards with a #2 pencil, ranking various aspects of the course and instructor. These were compiled by the university and made available as text files on a central server, exposed through a web site with only a crude search interface.
I wrote a tool that scraped all the data (I believe it went back 10+ years), imported it into a database, then exposed it, complete with pretty graphs, via a web site. This alone was a major improvement over the existing system. From there, I built out extra features, such as aggregating an individual instructor's ratings across all courses and semesters. This made it easier for students to determine which instructors were likely to teach better and thus which course offering to sign up for.
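The instructor aggregation boils down to a group-and-average over the scraped records; a minimal sketch, assuming a hypothetical tuple layout for the data:

```python
from collections import defaultdict

def aggregate_by_instructor(evaluations):
    """Average each instructor's ratings across every course and semester.

    `evaluations` is an iterable of (instructor, course, semester, rating)
    tuples -- an invented shape standing in for the scraped evaluation data.
    """
    totals = defaultdict(lambda: [0.0, 0])  # instructor -> [sum, count]
    for instructor, _course, _semester, rating in evaluations:
        totals[instructor][0] += rating
        totals[instructor][1] += 1
    return {name: s / n for name, (s, n) in totals.items()}

data = [
    ("Smith", "MATH 121", "Fall 2004", 4.5),
    ("Smith", "MATH 122", "Spring 2005", 3.5),
    ("Jones", "ENGR 131", "Fall 2004", 5.0),
]
print(aggregate_by_instructor(data))  # {'Smith': 4.0, 'Jones': 5.0}
```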
Feedback was generally positive, especially from students. However, some didn't care for the site because it exposed information they didn't want seen (the graph of average ratings per department apparently drew the attention of the administration). Some wanted me to take down the site. But I argued that I was simply using data the university already made available in new and creative ways. With backing from USG, I won out.
During the summer break of my senior year in college, I was an intern at Tellme Networks, at the time a successful startup in the speech recognition and telephony space. I loved every minute of it.
My summer internship project was to design and implement a provisioning system for VoIP telephones. The system consisted of a dynamic TFTP server (I rolled my own because no existing TFTP server offered dynamic hooks and the protocol is very simple to implement), a dynamic HTTP server to process requests from telephones, a backend store, and a web administration interface.
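The TFTP protocol (RFC 1350) really is simple: a read request names a file and a transfer mode, and the server answers with numbered DATA packets. A minimal sketch of the packet handling (the Cisco-style config filename is purely illustrative, and real servers speak UDP on port 69):

```python
import struct

OP_RRQ, OP_DATA = 1, 3  # TFTP opcodes from RFC 1350

def parse_rrq(packet):
    """Parse a TFTP read request: 2-byte opcode, then
    NUL-terminated filename and transfer mode strings."""
    opcode, = struct.unpack("!H", packet[:2])
    if opcode != OP_RRQ:
        raise ValueError("not a read request")
    filename, mode, _rest = packet[2:].split(b"\x00", 2)
    return filename.decode("ascii"), mode.decode("ascii")

def data_packet(block, payload):
    """Build a DATA packet; a payload under 512 bytes signals end of transfer."""
    return struct.pack("!HH", OP_DATA, block) + payload

# A phone requesting its config file; a dynamic server would generate
# the response per-device instead of reading from disk.
rrq = struct.pack("!H", OP_RRQ) + b"SEP001122334455.cnf.xml\x00octet\x00"
filename, mode = parse_rrq(rrq)
print(filename, mode)  # SEP001122334455.cnf.xml octet
```

The "dynamic hook" idea is simply that the filename (often derived from the phone's MAC address) keys a lookup in the backend store rather than a file on disk.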
When completed, one could plug any supported Cisco or Polycom VoIP telephone into the Ethernet network and the phone would download the latest firmware and bootstrap with a suitable configuration. Phones that weren't yet provisioned received an initial configuration that dialed a VXML application hosted on Tellme's platform when the receiver was lifted. This application talked to the company LDAP server, prompted the person for his or her name, looked up information in the directory, and updated the phone provisioning system accordingly. The phone would then reboot, picking up the user's configuration, complete with saved dialing preferences and so on.
By far the coolest feature of the system was the ability via the admin interface to remotely reboot all connected phones. Combined with a custom bootup chime and an open office space, it was quite amusing listening to 20+ phones play simple jingles simultaneously!
I must have done a good job with the project and generally impressed people, because I was given a full-time offer at the end of my internship. I accepted, finished my senior year at college, and started full-time at Tellme in January 2007.
At one time, I maintained www.tellme.com and m.tellme.com. Both were written in PHP on the Zend Framework. Since 90%+ of the content on www.tellme.com was static, aggressive caching ensured most page views involved little PHP, keeping server load down. As part of this role, I worked closely with the creative team on designs and graphics and with marketing to manage content.
At one time m.tellme.com hosted a download component for Tellme's mobile application, which ran on most BlackBerry devices and assorted other phones. One of the most frustrating experiences I have had to date was verifying that the site worked on all of these devices. Each device had different support for rendering web sites, and on top of that, different resolutions. Keep in mind this was before modern browsers came to mobile devices. Supporting all of these concurrently was a real chore. Even identifying the device was a nightmare: a phone could usually be identified only by the User-Agent HTTP request header or by the presence of some other non-standard header, and even that wasn't consistent across carriers or guaranteed to be present. I'm glad I don't have to worry about this any more.
Since the assimilation of Tellme by Microsoft, www.tellme.com and m.tellme.com are no longer available. Fortunately, the Internet Archive contains a semi-working snapshot.
When I started at Tellme, CVS was the lone version control system for engineering. (I even think one group was still on RCS!) I asked around about why a more modern version control system wasn't in use, and nobody seemed to have a good answer (I think the common answer was "because nobody has set one up").
After consulting a lot of peers, we decided that Subversion was the best fit for a new, supported VCS. It wasn't that Subversion was best of breed (Perforce arguably was at the time). It was chosen because it was very similar to CVS (its motto at the time was "a compelling replacement for CVS"), it was free and open source, and it had a great hooks system.
I worked with the operations team to deploy Subversion and announced it to the company. Slowly but surely, people started using it. Usage increased dramatically after I gave a brownbag presentation on Subversion and its merits over CVS and started rolling out custom hooks to supplement functionality. Teams loved the ability to receive emails on commits, close bugs, and require buddies on commit (the latter two via special syntax in commit messages). This all required custom commit hooks, of course. Unfortunately, those were developed after Tellme became Microsoft, so they will likely never see the light of day outside company walls.
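A hook like the bug-closing one mostly boils down to scanning the commit message, which a post-commit hook fetches with `svnlook log -r REV REPO`. A sketch of the message parsing, with hypothetical syntaxes (the actual Tellme conventions aren't public):

```python
import re

# Invented syntaxes for illustration: "Closes bug #1234" and "Buddy: name".
BUG_RE = re.compile(r"\bcloses?\s+bug\s+#?(\d+)", re.IGNORECASE)
BUDDY_RE = re.compile(r"\bbuddy:\s*(\w+)", re.IGNORECASE)

def inspect_commit_message(message):
    """Pull bug numbers and buddy reviewers out of a commit message.

    A post-commit hook would close the listed bugs in the tracker;
    a pre-commit hook could reject commits lacking a buddy.
    """
    bugs = [int(n) for n in BUG_RE.findall(message)]
    buddies = BUDDY_RE.findall(message)
    return bugs, buddies

msg = "Fix null deref in parser. Closes bug #1234. Buddy: alice"
print(inspect_commit_message(msg))  # ([1234], ['alice'])
```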
At some point, I lobbied someone in the engineering org to splurge on a new server for hosting the source code services. I worked with the operations team to get it installed. And I received a lot of hacker cred for making the version control systems 5x faster (a big deal for teams whose large trees took a couple of minutes to switch branches under CVS).
At some point in my time at Tellme/Microsoft, I organized a weekly brown bag series. Every Wednesday, individuals or small groups would present topics to whoever was in attendance. These were often engineer-centric, but anyone could and did present.
I seeded the sessions with presentations by myself and a few select people. After that, I either had people contact me to be added to the calendar or I sought out and encouraged people to present. Every week, I also recorded the sessions and made them available on the intranet.
I believe sometime in 2009, I grew frustrated with the way we were doing code reviews. Looking at plain-text diffs and recording comments in email or similar just felt outdated. When others agreed, I started researching solutions. I heard Google had a pretty nifty one, but it was only partially available to the public. I decided on Review Board and deployed it. It was an overnight success, with almost immediate adoption by every team. Great tool.
In my time at Tellme/Microsoft, I became a go-to person in the org for questions on Perl, mod_perl, Apache HTTP Server, compiling packages, and assorted other topics. The first two scared me a bit at the time because, when I started at Tellme, I knew almost nothing about Perl. Somehow, I became good enough to dole out advice and answers and to hold my own in code reviews.
For the Apache HTTP Server, I was very familiar with the internals, since I wrote a couple of C modules while at Tellme. As such, all the weird bugs or questions about server behavior seemed to always come my way. When a team was deploying a new site or service, I would often be the one reviewing the configs or lending advice on how to configure it.
For compiling packages, I took satisfaction in wrangling things to compile on Solaris x86 (although we ran a typical GNU toolchain rather than the Sun stack, so it wasn't as bad as it sounds). I was also pretty adamant about packages being built properly (e.g. considering which libraries were statically and dynamically linked, defining proper linker flags to ease future linking, producing an optimized binary, etc.). So people would often ask me to review the compilation procedure.
For most of my time at Tellme, I was a member of a roughly 15-person team that wrote and maintained many of the Tellme/Microsoft-branded VXML applications and the services they directly required. (VXML applications often consist of static VXML documents and the dynamic web services they talk to at run-time.) One of my major roles on the team was managing the servers these resided on.
This role involved numerous responsibilities. First, I needed to be familiar with all aspects of the dynamic services so I would know how to triage them, make informed decisions about scalability, etc. I was often developing these services myself. When I wasn't, I typically became informed by being part of the code reviews. I would often address concerns around scalability and failure, such as timeouts to remote services, expected latency, ensuring adequate monitors were present and tested, etc.
Once the services were in my court, I needed to figure out where to host them. We had a number of different server pools from which to choose. Or, there was always the option of buying new hardware, but you want to keep costs down whenever possible, of course. I had to take into consideration the expected load, security requirements, network ACL connectivity requirements, technology requirements, etc. This was always a complicated juggling act. But, people knew that I would get it done and things would just work.
As part of this role, I interacted closely with the operations team and the NOC. The operations team would be involved whenever new servers were needed, new ACLs were to be deployed, etc. At a large company, this was often a formal process, which required working with various project and product managers, navigating political waters, and so on. I could talk the lingo with the operations people, so having me act as the go-between was effective at getting things done. As for the NOC, they were involved any time we changed anything. The operations mindset is to achieve 99.999% uptime on everything, so any change is communicated and planned well in advance, along with the exact procedure to be performed. I knew most of the individuals in the NOC and they knew me.
Sometime in 2009, I set out to improve Tellme's proprietary event logging and transport infrastructure. The first part involved writing a new C library for the write side. When I started, there existed the initial C++ implementation and feature-minimal implementations in Perl and a few other languages. However, the C++ implementation required a sizeable set of external dependencies (on the order of 40MB), and interop with C was difficult. People in C land found themselves in dependency hell, cursing the computer gods.
I implemented a C library from scratch, using the C++ code as a guide only for the low-level details of the protocol, which weren't documented well outside of the code. The initial version only supported writing. It was free of external dependencies, compiled on both Solaris and Windows, and weighed in at a svelte 50kB.
A few months later, an individual came to me and said something along the lines of, "I really love the simplicity of your library. But, I need to perform reading. Can you help me?" So, I implemented reading support in the library. It turned out that the reading library was much more efficient than the C++ one. And, for what my colleague was using it for (iterating over hundreds of gigabytes of data), this made a huge difference! What followed was a lot of low-level performance optimization to maximize the reading throughput of the library. A stack instead of dynamic allocation here, a register variable there, all contributed to significant performance improvements to the already insanely fast library. Towards the end, we were in the territory of the x86 calling convention hurting us more than any other part.
I got great satisfaction from this project, much of it after the project was complete (and even after I had left the company). People were using the library in ways I had not foreseen (like from C# or on non-server devices). There was even talk of shipping the library as part of an external release, which never would have been possible with the C++ version because it linked against open source software, something Microsoft is (sadly) very sensitive about. A few months after I left the company, somebody emailed me and said something along the lines of, "your XXX library is the best C library I have ever seen. I wish all C libraries were this easy to use and would 'just work.'" Sadly, the library is Microsoft IP, so I can't share it with you. But I can refer you to people who have used it!
It is my understanding that Xobni hired me to bring experience to their at-the-time frail cloud system, Xobni One (now known as Xobni Cloud). When I joined, their operations procedures seemed straight out of the wild west (at least from my perspective, coming from Tellme/Microsoft). System monitoring for the product I was to beef up consisted of someone looking at some Graphite graphs, noticing a change in the pattern, and investigating. We could do better.
One of my first acts at Xobni was to deploy a real monitoring system. We decided on Opsview because it is built on top of Nagios, a popular and well-known monitoring system (although not one without its flaws), and offers a compelling front-end for managing Nagios, which is typically a real chore.
I set up Opsview on a central host, deployed NRPE on all the hosts, and started monitoring. Email alerts were configured and all were happy we now knew in near real time when stuff was breaking. At one point, I even had a custom Nagios notification script that used the Twilio API to call people with alert info. But, we didn't pursue that further. Cool idea, though!
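The Twilio idea was straightforward: Twilio's REST API places an outbound call that fetches a TwiML document telling it what to say when answered. A sketch of how a Nagios notification command could build such a request (account SID, numbers, and URLs are placeholders, and nothing is actually sent here):

```python
from urllib.parse import urlencode

# Twilio's outbound-call endpoint (2010-04-01 API version).
TWILIO_CALLS_URL = "https://api.twilio.com/2010-04-01/Accounts/{sid}/Calls.json"

def build_call_request(sid, from_number, to_number, twiml_url):
    """Return the (url, form_body) pair for Twilio's Calls resource.

    A real Nagios notification command would read alert details from
    the NAGIOS_* environment variables, then POST this body with HTTP
    basic auth (account SID / auth token).
    """
    url = TWILIO_CALLS_URL.format(sid=sid)
    body = urlencode({"From": from_number, "To": to_number, "Url": twiml_url})
    return url, body

def alert_twiml(host, service, state):
    """TwiML that Twilio fetches and speaks when the call connects."""
    return ("<Response><Say>Nagios alert: {} on {} is {}</Say></Response>"
            .format(service, host, state))

url, body = build_call_request("AC123", "+15550100", "+15550199",
                               "https://example.com/alert.twiml")
print(alert_twiml("db1", "cassandra", "CRITICAL"))
```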
When I arrived, only minimal server metrics were being collected. While they were being fed to Graphite, a compelling replacement for RRDtool, they were collected by an in-house Python daemon, and any new collection required custom plugins. Many open source tools exist for metrics collection, so I swapped out the custom code for Collectd, which I think is an excellently-designed collection system. (I like Collectd for its plugin system and the ability to write plugins in C, Java, Perl, and Python.)
The only drawback was Collectd used RRD out of the box. We used that for a little bit. But, we quickly missed Graphite, so I wrote a Collectd plugin that writes data to Graphite instead of RRD.
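Carbon's side of the bargain is trivial: it accepts one `metric-path value unix-timestamp` line per datapoint over TCP (port 2003 by default). A sketch of the formatting and shipping logic, outside of collectd's actual plugin API (which registers a write callback via its `collectd` module):

```python
import socket
import time

def carbon_lines(plugin, type_instance, values, timestamp=None, prefix="collectd"):
    """Format collectd-style values as Carbon plaintext-protocol lines.

    The metric path shape here (prefix.plugin.type_instance) is a
    simplification of collectd's real naming scheme.
    """
    timestamp = int(timestamp if timestamp is not None else time.time())
    return ["%s.%s.%s %f %d" % (prefix, plugin, type_instance, v, timestamp)
            for v in values]

def send_to_carbon(lines, host="127.0.0.1", port=2003):
    """Ship formatted lines to a Carbon receiver over TCP."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(("\n".join(lines) + "\n").encode("ascii"))

print(carbon_lines("cpu", "idle", [97.5], timestamp=1300000000))
```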
At the end of the day, we were recording more metrics with less effort and had much more data to back up our decisions.
When I joined Xobni, Xobni Cloud was still considered beta and was fraught with stability issues. The overall architecture of the service was fine, but things were rough around the edges. When I first started, I believe the service had around 500 users and was crashing daily. We could do better.
One of my first contributions was to learn how Cassandra worked and how to make it run faster. A well-tuned Cassandra running on top of a properly configured JVM makes a world of difference. We saw tons of improvement by experimenting here.
One of my first major code contributions to the product was transitioning to Protocol Buffers for data encoding. Previously, the system stored JSON anywhere there was structured data. Switching to Protocol Buffers cut down on CPU (most of the savings came from the JSON library's use of reflection). More importantly, it reduced our storage size dramatically. And less data stored means less work for the hard drives.
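The storage win is easy to see: JSON repeats every field name in every record, while a schema'd binary encoding puts only values on the wire. A rough illustration using Python's `struct` module as a stand-in for Protocol Buffers (protobuf's actual wire format also uses field tags and varints; the record layout below is invented):

```python
import json
import struct

# A contact-event record of the rough sort a cloud email product might
# store -- the field names and layout are made up for illustration.
record = {"contact_id": 123456, "timestamp": 1300000000, "score": 0.75}

json_bytes = json.dumps(record).encode("utf-8")

# Binary stand-in: values packed by schema position, no field names stored.
packed = struct.pack("<IQd", record["contact_id"], record["timestamp"],
                     record["score"])

print(len(json_bytes), len(packed))  # the packed form is far smaller
```

Multiply that per-record difference by billions of rows and the disk (and compaction) savings become substantial.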
On the failure resiliency side of the scaling problem, I replaced the Beanstalk-based queue system with jobs stored in Cassandra. I have nothing against Beanstalk. But, in our architecture, Beanstalk was a central point of failure. If the server died and we couldn't recover Beanstalk's binary log on disk, we would have lost user data. We did not want to lose user data, so we stored the queue in a highly-available data store, Cassandra. I also made a number of application changes that allowed us to upgrade the product without incurring any downtime. (When I joined, you had to turn the service off when upgrading the software.)
Another change that saw significant performance gains was moving the Cassandra sstable store to btrfs, a modern Linux filesystem. The big performance win came not from the filesystem switch itself, but from enabling filesystem compression. At the expense of CPU (which we had plenty of on the Cassandra nodes), we cut down significantly on the number of sectors being accessed, which gave Cassandra more head room. I would have preferred ZFS over btrfs because it is stable and has deduplication (which I theorize would help Cassandra because its immutable sstables carry data forward during compactions). But we weren't willing to switch to FreeBSD or OpenSolaris to obtain decent ZFS performance.
There were a number of smaller changes that all amounted to significant performance gains. But, they are difficult to explain without intimate knowledge of the system. I will say that at the end of the day, we went from crashing every day on 500 users to being stable for weeks on end and serving over 10,000 users on roughly the same hardware configuration.
Zippylog is a high performance stream processing platform. I started the project in late July 2010 and hack on it when I have time.
I started the project to solve what I thought was a gap in the market. I also wanted to start a personal project to assess what my skills were as an individual developer. And, since I was working behind Microsoft's walls, where open source contributions were difficult to swing, I wanted to do something in the open for all to see.
My GitHub project lua-protobuf integrates the Lua programming language with Google's Protocol Buffers serialization format. Both are extremely fast, so it is a marriage that needed to happen.
I started the project because I wanted to consume protocol buffers in Lua from within zippylog.
Interestingly, lua-protobuf is a Python program that generates C/C++ code that provides Lua scripts access to protocol buffers. Yeah, that makes no sense to me either, but it works.
Collectd is an excellent metrics collection and dispatching daemon. Graphite is a great data recording and visualization tool. I decided to marry them by writing a Collectd plugin, collectd-carbon, which writes values to Graphite/Carbon. (Carbon is the name of the network service that receives values.)
I've made contributions to Clang's Python bindings. These bindings allow you to consume the Clang C API (libclang) using pure Python (via ctypes).