I use and contribute to a lot of open-source software projects, and one thing I see all over the place that drives me absolutely nuts is the prevalence of custom languages. In serialization frameworks, there are interface definition languages. In CAD tools, there are modeling languages. In the NoSQL and big data processing world, you’ll find my personal pet peeve, query languages.
Why are people making these custom languages? Abstractly, I get what they’re thinking: they have an application that would benefit from an expressive, terse, awesome set of keywords and operators that will allow their users to be more productive. And this thinking isn’t fundamentally wrong. The productivity gain is real and definitely worth pursuing, and an intuitive language can go a long way towards making your product extremely accessible.
But here’s where I think these people go off the rails: aside from the very real Not Invented Here (NIH) danger (Making your own language is cool, right? Guys?), there is a huge difference between a custom language – one in which you have to do the whole compiler contortion of lexing and grammars and whatnot – and an embedded domain-specific language (DSL), which is a set of functions and syntax that work from within another existing language to provide enhanced functionality. These two things seem a lot alike at first glance, but in my opinion, they couldn’t be more different. Why? Because when you decide to write a custom language – however simple it might seem to you right now – you are making the incredibly short-sighted statement that you know how to design a language better than all the people who have tackled this gargantuan problem before you. I don’t doubt that you know your domain better than anyone else, and that’s definitely what you need to get started, but what you probably don’t know is what you’re going to need later that will totally change your language. You probably also don’t know how best to implement recursion, scoped variables, memory management, or exception handling. The list could go on forever.
Let’s look at a concrete example. Lots of people in the 3D printing/maker community use OpenSCAD, an open-source 3D solid modeler with a decidedly programmer-tailored interface. Makers love it, because it’s powerful, fairly simple, and gives you really repeatable results. It uses a totally custom scripting language that is basically C-like: nesting, scope, function calls. But in combing through some of the syntax and reading the mailing list, I get the impression that a nontrivial amount of the developers’ time goes into reinventing features like variable assignment and for loops, time that could instead be spent developing the application itself.
To be totally clear, I love OpenSCAD, and I’m going to keep using it. That doesn’t mean that I don’t wish the developers had decided to write a DSL instead. If they had embedded their functions in a nice Ruby DSL, I wouldn’t have to figure out how to deal with the quirks of the native dynamic-length lists. I could leverage all of the existing resources – community-vetted documentation, libraries, debugging and editing tools – and it would just work. As a user, I could focus on learning the portion of the new syntax that applies to the specific domain, rather than also scrambling to figure out how to do the things I already know how to do in other languages.
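To make that concrete, here is a minimal sketch of what an embedded modeling DSL could look like. It is written in Python rather than Ruby, and every name in it (`Node`, `cube`, `cylinder`, `difference`, `emit`) is made up for illustration; it is not part of OpenSCAD or of any existing library. It assumes the simplest possible design: build a tree of shapes in the host language, then print it as OpenSCAD source using the standard `cube`, `cylinder`, `translate`, and `difference` calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One OpenSCAD call, e.g. cube([1, 2, 3]); or translate([0, 0, 5]) { ... }."""
    name: str
    args: str = ""
    children: List["Node"] = field(default_factory=list)

    def translate(self, x, y, z) -> "Node":
        # Wrap this node in a translate() call.
        return Node("translate", f"[{x}, {y}, {z}]", [self])

    def emit(self, indent: int = 0) -> str:
        # Render this node and its children as OpenSCAD source text.
        pad = "  " * indent
        head = f"{pad}{self.name}({self.args})"
        if not self.children:
            return head + ";"
        body = "\n".join(child.emit(indent + 1) for child in self.children)
        return f"{head} {{\n{body}\n{pad}}}"


# Hypothetical domain vocabulary: the only part a user has to learn.
def cube(x, y, z) -> Node:
    return Node("cube", f"[{x}, {y}, {z}]")


def cylinder(h, r) -> Node:
    return Node("cylinder", f"h={h}, r={r}")


def difference(base: Node, *cuts: Node) -> Node:
    return Node("difference", children=[base, *cuts])


# Loops, lists, and variables come from the host language for free.
plate = cube(40, 20, 2)
holes = [cylinder(h=4, r=1.5).translate(x, 10, -1) for x in (10, 20, 30)]
model = difference(plate, *holes)

scad_source = model.emit()
print(scad_source)
```

The domain-specific part is a handful of small functions. The list comprehension that places the three holes, the variables, and the function-call syntax all belong to Python, so nobody had to design, implement, or document them.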
It’s clear that some folks out there get this right and some get it wrong. Thrift and Protobuf both have their own IDLs, which work well enough, though the code involved in analyzing IDL files is decidedly non-trivial. OpenSCAD and others like RapCAD have plowed tons of time into their own custom languages. Pig, the Map/Reduce productivity tool, is a nightmare: since it’s not Turing complete, whenever you want to do something not imagined by the developers you have to make the significant context switch into regular Java to write user-defined functions (UDFs). Conversely, Cascading is a wonderfully usable and extensible pure-Java DSL for composing complex Map/Reduce workflows with arbitrary UDFs that don’t require you to context switch. (Even the splendid Clojure wrapper for Cascading, Cascalog, has the sense to use the existing Clojure language to give you an honest-to-god Turing-complete language to work in.) Cassandra has CQL (which I admittedly haven’t used), while HBase has stuck with a solid functional API and has a JRuby REPL in which you can use it.
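The "no context switch" point is easier to see in code. The sketch below is a toy, in-memory pipeline builder in Python; it is not Cascading's API, and the names `Flow`, `each`, `keep`, `group_count`, and `normalize` are all hypothetical. The one thing it is meant to show is that when the workflow DSL lives inside a general-purpose language, a user-defined function is just an ordinary function in that same language.

```python
from collections import defaultdict


class Flow:
    """A tiny, hypothetical pipeline builder (not Cascading's real API)."""

    def __init__(self, steps=None):
        self._steps = list(steps or [])

    def each(self, fn):
        # Apply fn to every row.
        return Flow(self._steps + [lambda rows: map(fn, rows)])

    def keep(self, predicate):
        # Keep only the rows for which predicate(row) is true.
        return Flow(self._steps + [lambda rows: filter(predicate, rows)])

    def group_count(self, key):
        # Count rows per key and return sorted (key, count) pairs.
        def step(rows):
            counts = defaultdict(int)
            for row in rows:
                counts[key(row)] += 1
            return sorted(counts.items())
        return Flow(self._steps + [step])

    def run(self, rows):
        # Feed the rows through each step in order.
        for step in self._steps:
            rows = step(rows)
        return list(rows)


# A "UDF" is a plain function in the host language.
def normalize(word):
    return word.strip(".,!?").lower()


word_count = (
    Flow()
    .each(normalize)
    .keep(lambda w: len(w) > 3)
    .group_count(lambda w: w)
)

result = word_count.run("Pig pigs out. Cascading cascades, cascading!".split())
print(result)  # [('cascades', 1), ('cascading', 2), ('pigs', 1)]
```

Nothing here needed a grammar or a parser, and if `normalize` had to call a regex library or hit a database, it could do so without leaving the language the pipeline is written in.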
So next time you’re thinking about dusting off your context-free grammar skills and busting out your favorite compiler-compiler, just don’t do it. There’s an easier way.