Open Source, Forking, and Tech Bankruptcy

Open source software is a part of most of the things I do day-to-day. I use a ton of things made by others: Hadoop, Cascading, Apache, Jetty, Ivy, Ant – the list could go on for pages. But I also build and maintain some things that have been released to the public: I contribute to Thrift frequently, and have released Hank, Jack, and other projects as part of my work at Rapleaf.

Working with so much open source software has given me plenty of opportunities to develop a perspective on how companies should engage with open source projects. In this day and age, nobody is going to counsel against using open source, since it’s an enormous productivity booster and it’s everywhere. However, there are different schools of thought about how you should use and contribute to it.

One way of using open source is to just use what’s released, never make any modifications, and never make any contributions. For some projects, this is perfectly fine. For instance, I find it hard to imagine making a contribution to Apache Commons. Everyone will take this approach on some projects, particularly the ones that are mature and useful but not mission critical: they’ll never produce enough pain to merit fixes nor produce enough value to merit enhancements.

However, the above model only works well on projects that are very stable. Other projects you’ll want to use while they are still immature, unstable, and actively developed. To reap the benefits, you might have to roll up your sleeves and fix bugs or add features, as well as deal with the “features” introduced by other developers. This is where things get tricky.

There are two basic ways to deal with this scenario, which I think of as the “external” and “internal” approaches. The “external” approach involves your team becoming part of the project’s community, contributing actively (or at least actively reporting bugs and requesting features), and doing your best either to stay on the bleeding edge or to commit to using only public releases. The “internal” approach involves picking an existing revision of the project, forking it into an internal repository, and then carefully selecting which upstream patches to accept into your private fork while mixing in your own custom patches.

Both of these options are imperfect, since either way you’re going to do a lot of work. A lot of companies see this as a simple pain now/pain later tradeoff and then choose accordingly. But I don’t think this is actually the case. What’s not easy to appreciate is that the pain later is usually much, much worse than the pain now.

Why is this the case? It comes down to tech debt. Choosing to create an internal fork of an open-source project is like taking out a massive loan: you buy some time right now, but with every upstream patch you let go unmerged, you multiply the effort you will ultimately need to get back in sync. To make matters worse, people tend to apply custom patches to their internal forks to get the features they need up and running quickly. This probably seems like a great idea at the time, but done carelessly, it can quickly leave your systems depending on a feature that’s never going to make it upstream and might actually conflict with what the community decided to do.

When your fork has diverged so much that you find yourself thinking, “I’ll never be able to switch to upstream,” you’ve reached a state of tech bankruptcy: the only things you can do are give up and stick with what you have, or commit to an unbelievably expensive restructuring. At this point you cease to have a piece of open-source software. You have no external community, nobody outside to add features, fix bugs, or review your code, and you can lose compatibility with external systems and tools.

Needless to say, the decision to make an internal fork should not be undertaken lightly. Weigh the perceived stability and flexibility benefits very carefully before starting down that road. If you must fork, make sure you understand the costs up front so that you can budget time to keep your fork in sync.

There’s a flip side to this. How often does a piece of internal code that “could be” an open source project go from closed to open? I know from my experience that it’s not easy to make the transition – you end up building in a feature that’s too domain-specific, or you tie it to your internal deploy system. I think that writing a decent piece of software that could be spun out as an open-source project and yet failing to do so is another case of accumulating tech debt. In this case, the bankruptcy state is a project that could have been open but never will be because of the time investment required.

The prescription in this case is easy: open source your project early, perhaps even before it’s “done,” continue to develop it in the open, and whatever you do, use the version you open sourced, not an internal fork.

Custom languages considered harmful

I use and contribute to a lot of open-source software projects, and one thing I see all over the place that drives me absolutely nuts is the prevalence of custom languages. In serialization frameworks, there are interface definition languages. In CAD tools, there are modeling languages. In the NoSQL and big data processing world, you’ll find my personal pet peeve, query languages.

Why are people making these custom languages? Abstractly, I get what they’re thinking: they have an application that would benefit from an expressive, terse, awesome set of keywords and operators that will allow their users to be more productive. And this thinking isn’t fundamentally wrong. The productivity gain is real and definitely worth pursuing, and an intuitive language can go a long way towards making your product extremely accessible.

But here’s where I think these people go off the rails: aside from the very real NIH danger (making your own language is cool, right? Guys?), there is a huge difference between a custom language – one in which you have to do the whole compiler contortion of lexing and grammars and whatnot – and a domain-specific language, which is typically a set of functions and conventions that works from within an existing language to provide enhanced functionality. These two things seem a lot alike at first glance, but in my opinion, they couldn’t be more different.

Why? Because when you decide to write a custom language – however simple it might seem to you right now – you are making the incredibly short-sighted statement that you know how to design a language better than all the people who have tackled this gargantuan problem before you. I don’t doubt that you know your domain better than anyone else, and that’s definitely what you need to get started. But what you probably don’t know is what you’re going to need later that will totally change your language, or how best to implement recursion, scoped variables, memory management, or exception handling. The list could go on forever.
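To make that difference concrete, here’s a toy sketch of the embedded approach – every name in it is made up for illustration, not taken from any real project. The “DSL” below is nothing but ordinary Java classes and method chaining, so it gets variables, loops, a compiler, and a debugger from the host language without a single line of lexer or grammar code:

```java
import java.util.ArrayList;
import java.util.List;

// A toy embedded query DSL: just ordinary Java objects and method
// chaining. There is no lexer, no grammar, no parser -- the host
// language's compiler does all of that work for us.
public class QueryDsl {

    static class Query {
        private final String table;
        private final List<String> filters = new ArrayList<String>();
        private final List<String> columns = new ArrayList<String>();

        private Query(String table) {
            this.table = table;
        }

        static Query from(String table) {
            return new Query(table);
        }

        Query where(String condition) {
            filters.add(condition);
            return this;
        }

        Query select(String... cols) {
            for (String col : cols) {
                columns.add(col);
            }
            return this;
        }

        // Render the accumulated structure; a real tool would hand an
        // AST to its execution engine instead of building a string.
        String toPlan() {
            StringBuilder sb = new StringBuilder("SELECT ");
            for (int i = 0; i < columns.size(); i++) {
                if (i > 0) sb.append(", ");
                sb.append(columns.get(i));
            }
            sb.append(" FROM ").append(table);
            for (int i = 0; i < filters.size(); i++) {
                sb.append(i == 0 ? " WHERE " : " AND ").append(filters.get(i));
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        // Users write plain Java, so host-language features compose
        // with the DSL for free.
        Query q = Query.from("users").where("age > 21").select("name", "email");
        System.out.println(q.toPlan());
        // -> SELECT name, email FROM users WHERE age > 21
    }
}
```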

Now let’s look at a real-world example. Lots of people in the 3D printing/maker community use OpenSCAD, an open-source 3D solid modeler with a decidedly programmer-tailored interface. Makers love it, because it’s powerful, fairly simple, and gives you really repeatable results. It uses a totally custom scripting language that is basically C-like: nesting, scope, function calls. But in combing through some of the syntax and reading the mailing list, I get the impression that a nontrivial amount of the developers’ time is spent reinventing features like variable assignment and for loops that could instead be spent actually developing the application itself.

To be totally clear, I love OpenSCAD, and I’m going to keep using it. That doesn’t mean I don’t wish the developers had decided to write a DSL instead. If they had embedded their functions in a nice Ruby DSL, instead of puzzling out the quixotic features of its homegrown dynamic-length lists I could leverage all of the existing resources – community-vetted documentation, libraries, debugging and editing tools – and it would just work. As a user, I could focus on learning the portion of the syntax that’s specific to the domain, rather than scrambling to figure out how to do things I already know how to do in other languages.
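To sketch what I mean (in Java rather than Ruby, just to keep the examples here in one language): the cube/translate/union names below are invented to echo OpenSCAD’s vocabulary – this is not a real API – but notice that looping comes from the host language for free, with nothing for the tool’s authors to design or implement:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an embedded modeling DSL. The cube/translate/
// union vocabulary is made up to echo OpenSCAD's; the point is that
// the host language supplies variables and loops for free.
public class CsgSketch {

    interface Solid {
        String describe();
    }

    static Solid cube(final double size) {
        return new Solid() {
            public String describe() {
                return "cube(" + size + ")";
            }
        };
    }

    static Solid translate(final double x, final double y, final double z,
                           final Solid s) {
        return new Solid() {
            public String describe() {
                return "translate(" + x + ", " + y + ", " + z + ") { "
                     + s.describe() + " }";
            }
        };
    }

    static Solid union(final List<Solid> parts) {
        return new Solid() {
            public String describe() {
                StringBuilder sb = new StringBuilder("union { ");
                for (Solid p : parts) sb.append(p.describe()).append(" ");
                return sb.append("}").toString();
            }
        };
    }

    public static void main(String[] args) {
        // An ordinary Java for loop lays out a row of cubes -- no need
        // for the DSL's authors to reinvent looping themselves.
        List<Solid> row = new ArrayList<Solid>();
        for (int i = 0; i < 5; i++) {
            row.add(translate(i * 12.0, 0, 0, cube(10)));
        }
        System.out.println(union(row).describe());
    }
}
```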

It’s clear that some folks out there get this right and some get it wrong. Thrift and Protobuf both have their own IDLs, which work well enough, though the code involved in analyzing IDL files is decidedly non-trivial. OpenSCAD and others like RapCAD have plowed tons of time into their own custom languages. Map/Reduce productivity tool Pig is a nightmare: since it’s not Turing complete, whenever you want to do something not imagined by the developers, you have to make the significant development context switch into regular Java to write user-defined functions. Conversely, Cascading is a wonderfully usable and extensible pure-Java DSL for composing complex Map/Reduce workflows with arbitrary UDFs, no context switch required. (Even Cascalog, the splendid Clojure wrapper for Cascading, has the sense to use the existing Clojure language to give you an honest-to-god Turing-complete language to work in.) Cassandra has CQL (which I admittedly haven’t used), while HBase has stuck with a solid programmatic API and a JRuby REPL in which you can use it.
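For a taste of the Cascading side, here’s a condensed version of its well-known word-count example. Treat it as a sketch rather than copy-paste code, since exact package names and signatures have shifted between Cascading versions:

```java
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount {
    public static void main(String[] args) {
        // Source and sink taps bind the flow to input and output paths.
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // The pipe assembly is plain Java: split each line into words,
        // group by word, count each group.
        Pipe pipe = new Pipe("wordcount");
        pipe = new Each(pipe, new Fields("line"),
                        new RegexSplitGenerator(new Fields("word"), "\\s+"));
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, new Count());

        Flow flow = new FlowConnector().connect(source, sink, pipe);
        flow.complete();
    }
}
```

The whole pipeline is ordinary Java, which is exactly the point: a custom UDF is just another class that drops into the assembly, not a trip out of the language.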

So next time you’re thinking about dusting off your context-free grammar skills and busting out your favorite compiler compiler, just don’t do it. There’s an easier way.