Open source software is part of most of the things I do day-to-day. I use a ton of things made by others: Hadoop, Cascading, Apache, Jetty, Ivy, Ant – the list could go on for pages. But I also use and develop things I’ve built myself that have been released to the public: I contribute to Thrift frequently, and have released Hank, Jack, and other projects as part of my work at Rapleaf.
Working with so much open source software has given me plenty of opportunity to develop a perspective on how companies should engage with open source projects. In this day and age, nobody is going to counsel against using open source, since it’s an enormous productivity booster and it’s everywhere. However, there are different schools of thought about how you should use and contribute to it.
One way of using open source is to just use what’s released, never make any modifications, and never make any contributions. For some projects, this is perfectly fine. For instance, I find it hard to imagine making a contribution to Apache Commons. Everyone will take this approach on some projects, particularly the ones that are mature and useful but not mission critical: they’ll never produce enough pain to merit fixes nor produce enough value to merit enhancements.
However, the above model only works well for projects that are very stable. Other projects you’ll want to use while they are still immature, unstable, and actively developed. To reap the benefits, you might have to roll up your sleeves and fix bugs or add features, as well as deal with the “features” introduced by other developers. This is where things get tricky.
There are two basic ways to deal with this scenario, which I think of as the “external” and “internal” approaches. The external approach involves your team becoming part of the project’s community, contributing actively (or at least actively reporting bugs and requesting features), and either doing your best to stay on the bleeding edge or committing to using only public releases. The internal approach involves picking an existing revision of the project, forking it into an internal repository, and then carefully selecting which upstream patches to accept into your private fork while mixing in your own custom patches.
Both of these options are imperfect, since either way you’re going to do a lot of work. A lot of companies see this as a simple pain-now/pain-later tradeoff and choose accordingly. But the tradeoff isn’t actually that simple: what’s not easy to appreciate is that the pain later is usually much, much worse than the pain now.
Why is this the case? It comes down to tech debt. Choosing to create an internal fork of an open-source project is like taking out a massive loan: you get some time right now, but with every upstream patch you let go unmerged, you multiply the effort it will ultimately take to get back in sync. To make matters worse, people have a tendency to apply custom patches to their internal forks to get the features they need up and running quickly. That probably seems like a great idea at the time, but when it’s done carelessly, you can quickly get into a state where your systems depend on a feature that’s never going to make it upstream and might actually conflict with what the community decided to do.
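To make the size of that loan visible, here’s a minimal sketch of how you might measure it. It assumes your fork lives in git, that the public project is configured as a remote named “upstream,” and that “upstream/master” is the branch you forked from; the remote and branch names, and the script itself, are illustrative, not something any of the projects above actually ship.

```python
# Rough sketch: measure how far an internal fork has drifted from upstream.
# Assumes git is on the PATH, the public project is a remote named "upstream",
# and "upstream/master" is the branch you forked from. All of these names are
# illustrative.
import subprocess

def git(*args):
    """Run a git command in the current repository and return its stdout."""
    return subprocess.check_output(["git", *args], text=True).strip()

def fork_drift(upstream_branch="upstream/master", local_branch="HEAD"):
    # Compare against the latest upstream history.
    git("fetch", "upstream")
    # Upstream patches we have never merged: the principal on the loan.
    behind = int(git("rev-list", "--count", upstream_branch, f"^{local_branch}"))
    # Custom patches that exist only in our fork: the part that makes every
    # future merge more painful.
    ahead = int(git("rev-list", "--count", local_branch, f"^{upstream_branch}"))
    return behind, ahead

if __name__ == "__main__":
    behind, ahead = fork_drift()
    print(f"{behind} unmerged upstream patches, {ahead} private patches in the fork")
```

Watching those two numbers over time tells you whether you’re paying the loan down or letting the interest compound.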
When you get into a situation where your fork has diverged so much that you find yourself thinking, “I’ll never be able to switch to upstream,” you’ve reached a state of tech bankruptcy: the only things you can do are give up and stick with what you have, or commit to an unbelievably expensive restructuring. At this point you cease to have a piece of open-source software: you have no external community, nobody outside to add features, fix bugs, or review your code, and you can lose compatibility with external systems and tools.
Needless to say, the decision to maintain an internal fork should not be made lightly. Weigh the perceived stability and flexibility benefits very carefully before starting down that road. If you must fork, make sure you understand the costs up front so that you can budget time to keep your fork in sync.
There’s a flip side to this. How often does a piece of internal code that “could be” an open source project go from closed to open? I know from experience that it’s not easy to make the transition: you end up building in features that are too domain-specific, or you tie the code to your internal deploy system. I think writing a decent piece of software that could be spun out as an open-source project, and yet failing to do so, is another case of accumulating tech debt. In this case, the bankruptcy state is a project that could have been open but never will be, because the time investment required is too large.
The prescription in this case is easy: open source your project early, perhaps even before it’s “done,” continue to develop it in the open, and whatever you do, use the version you open sourced, not an internal fork.