Operational overhead is all the costs you incur just to keep your app running, totally separate from initial and ongoing development. This is all the behind-the-scenes work necessary to support ongoing operation: updating special models (“categories”, “countries”, etc.), auditing capacity consumption, re-running failed scheduled tasks, granting access or privileges, etc. It’s uninspired busywork that we can’t seem to escape. It ends up relegated to your on-call rotation and is generally a drag and a distraction from future development.
Nobody likes operational overhead, yet engineers rarely budget time in their project plans to build features that reduce it from the beginning. Why not? Most often the refrain is “Sounds great, but we don’t have the time to build that.” But make no mistake, these same engineers will spend vastly more time maintaining the systems they build than building them in the first place! When the time comes for you to take some operational action, the work you’ve done in advance to minimize overhead determines whether it’s just a quick series of clicks or a nightmarish slog through the application database console. If it’s the latter, then you are setting the team up for lengthy interruptions and costly outages.
Operational overhead is another kind of tech debt, where you are paying for “quick” development now with operational work later. I’d argue that allowing operational overhead is right up there with avoiding unit tests: it seems fast, but is actually slow. Worse, operational overhead is an enormous morale drain.
Minimum Viable Operational Overhead
Minimum Viable Operational Overhead, then, is the philosophy of building systems that require the lowest amount of operational overhead possible while still working. The “minimum viable” part is in there to remind us that customers must come first, and in practice any useful system will have operational overhead. But when you prioritize reducing operational overhead, your customer is your future self. Making sure to budget for this kind of work will help you produce higher-quality systems and allow you to move more quickly in the future.
The actual techniques you should use to reduce operational overhead are too varied and situational to list, but I have two favorites that always seem to be applicable. The first is to automate everything. You should try to minimize the number of operational tasks that require (or encourage) the least bit of creativity or analysis during execution. If a task requires you to think about how the pieces fit together, then the context switch will be worse and you run a greater risk of making a mistake along the way. Even if the task seems simple, write a script that does it and document the arguments extremely well. Make sure the usage is described in your runbook and that it’s copy-pasteable. Better, just build the copy-pasteable part into the app: it can be easier to handle the special case than it is to document it. The more brainless you can make your operational interruptions, the better.
A common trap to fall into here is to think “eh, this is a one-off, I’ll automate this the next time it comes up.” Don’t do this. It’s hard to tell at the outset how many times you might have to redo an operational task. The best case is that the next time it only takes you 5 seconds, but the worst case is that you’ve rigorously done a one-off. Script it this time.
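Here is a minimal sketch of what such a script might look like for one of the tasks mentioned earlier, re-running failed scheduled tasks. Everything specific in it is an assumption for illustration: the task names, the in-memory TASKS list standing in for a real scheduler, and the flag names. The parts worth copying are the documented arguments, the copy-pasteable usage in the docstring, and the fact that it changes nothing unless you pass --apply.

```python
#!/usr/bin/env python3
"""Re-run failed scheduled tasks (hypothetical example).

Copy-pasteable usage for the runbook:
    rerun_failed.py --since 2024-01-01                    # preview only
    rerun_failed.py --since 2024-01-01 --apply            # actually re-run
    rerun_failed.py --task-name sync-countries --apply    # one task only
"""
import argparse
from dataclasses import dataclass
from datetime import date


@dataclass
class Task:
    name: str
    run_date: date
    status: str  # "ok" or "failed"


# Stand-in for a real task store; a real script would query the scheduler.
TASKS = [
    Task("nightly-report", date(2024, 1, 2), "failed"),
    Task("nightly-report", date(2024, 1, 3), "ok"),
    Task("sync-countries", date(2024, 1, 3), "failed"),
]


def select_failed(tasks, since=None, task_name=None):
    """Return failed tasks, optionally limited by start date and task name."""
    return [
        t for t in tasks
        if t.status == "failed"
        and (since is None or t.run_date >= since)
        and (task_name is None or t.name == task_name)
    ]


def rerun(tasks, apply):
    """Re-run each task and return the names re-run; without apply, only report."""
    rerun_names = []
    for t in tasks:
        if apply:
            t.status = "ok"  # placeholder for actually re-executing the task
            rerun_names.append(t.name)
            print(f"re-ran {t.name} ({t.run_date})")
        else:
            print(f"[preview] would re-run {t.name} ({t.run_date})")
    return rerun_names


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Re-run failed scheduled tasks.")
    parser.add_argument("--since", type=date.fromisoformat, default=None,
                        help="only tasks that ran on or after this date (YYYY-MM-DD)")
    parser.add_argument("--task-name", default=None,
                        help="only tasks with exactly this name")
    parser.add_argument("--apply", action="store_true",
                        help="actually re-run; without this flag nothing changes")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    return rerun(select_failed(TASKS, args.since, args.task_name), args.apply)


if __name__ == "__main__":
    main()
```

Defaulting to a preview is a deliberate choice in this sketch: whoever is on call can paste the command from the runbook, read what it would do, and then add --apply, all without having to think about how the pieces fit together.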
The second key technique is to make a UI. I can hear the grumbling already, but honestly, it’s not that difficult to attach a trivial web UI to the service or to add a few views to an existing application. When you do this well, it pays off in spades. It is incredibly powerful to be able to click a link or fill out a short web form to complete an operational task, especially the kind that involves hairy curl requests or manipulating complex database records. This also tends to make operational tasks almost self-documenting, because you have a place to write documentation that lives right alongside the code itself. And since it’s an actual code artifact integrated right in with the existing internal API, it’s much easier to keep it up to date and functioning. (Write tests for your admin UIs!)
But the best part is that you will find that these internal-facing admin UIs grow like weeds, accumulating more and more command / control / monitoring features over time. It’s often not the first button that’s the big win — it’s that by taking the time to create a trivial web UI now, you are dramatically lowering the activation energy for future efforts that just need a place to put one more button. Eventually you’ll have a whole control panel full of them.
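To give a sense of how small that first button can be, here is a rough sketch of a one-page admin UI using only Python’s standard library. It is an illustration under assumptions, not a recommendation of a particular stack: the task names, the FAILED_TASKS set standing in for the app’s internal API, the /rerun path, and the call_app helper are all made up for this example, and a real admin page would need authentication in front of it.

```python
import io
from html import escape
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

# Hypothetical in-memory state; a real page would call the app's internal API.
FAILED_TASKS = {"nightly-report", "sync-countries"}


def render_page(message=""):
    """Build the admin page: a short explanation plus one button per failed task."""
    rows = "".join(
        f'<li>{escape(name)} '
        f'<form method="post" action="/rerun" style="display:inline">'
        f'<input type="hidden" name="task" value="{escape(name)}">'
        f'<button>Re-run</button></form></li>'
        for name in sorted(FAILED_TASKS)
    )
    return (
        "<h1>Ops admin</h1>"
        "<p>Re-run a failed scheduled task. The task leaves this list once re-run.</p>"
        f"<p>{escape(message)}</p>"
        f"<ul>{rows or '<li>No failed tasks.</li>'}</ul>"
    )


def app(environ, start_response):
    """WSGI entry point: GET / shows the page, POST /rerun re-runs one task."""
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "/")
    message = ""
    if method == "POST" and path == "/rerun":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        form = parse_qs(environ["wsgi.input"].read(size).decode())
        task = form.get("task", [""])[0]
        if task in FAILED_TASKS:
            FAILED_TASKS.discard(task)  # placeholder for really re-running it
            message = f"Re-ran {task}."
        else:
            message = f"No failed task named {task!r}."
    elif not (method == "GET" and path == "/"):
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [render_page(message).encode()]


def call_app(method, path, body=b""):
    """Test helper: call the WSGI app in-process and return (status, html)."""
    statuses = []
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], b"".join(chunks).decode()


def serve(port=8000):
    """Serve the admin page on localhost only (this sketch has no auth)."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()
```

Calling serve() puts the page on http://127.0.0.1:8000/, and call_app lets you test the button without a running server, for example by checking that call_app("POST", "/rerun", b"task=nightly-report") removes that task from FAILED_TASKS. Once a page like this exists, the next operational chore only needs one more form and one more branch in app.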
Thanks to Dmitri, Jack, and Sarkhan for reading and editing this post!