Your Software Needs to Be Operated Decades Longer Than You Write It
I was inspired by a recent tweet from Werner Vogels to consider what it means to do as he advises — to think about what it takes to have our software operated for decades after we write it. After all, most people building a system or an application intend for it to be useful and used forever. And those of us who've run cloud services have learned over the years that operational costs can overwhelm and outpace development costs.
So how do we think early about how we will operate something long after we've built it?
Build vs. Buy or Build-and-Buy?
To build or to buy is an oft-asked software question. What if we were to think about it differently? What if, rather than treating it as a software decision, we treated it primarily as an operational decision?
Code expresses that which is uniquely different about our applications, and we often choose to build because we can't buy that uniqueness. But if building something is the banana, operating it is the gorilla holding on to the banana. The question is: how can we get the banana without bringing the gorilla and the entire jungle home as well? One way is to make every technology choice with a bias toward opting out of building what's needed to operate it. In other words, whenever possible, you choose to use a service even if using the service means writing more code than building your own would. Luckily, in many if not most cases, and especially as SaaS becomes the predominant model, using a service is less work.
Choosing this path means that you will adopt platforms into which you can pour your code or data, but which run the code or host the data for you. Pouring code into such a platform is work — you must write the code, and also the automation that deploys it into the platform. But if you choose the right platform, one that helps you operate your application reliably, at low cost, and at scale, this work pays off in much easier operations.
It shifts the operational burden, and you can move on to building the next application. Someone else will keep things running. Someone else will manage the servers. Someone else will back up the database. Someone else will worry about deployment security. Someone else will monitor for availability. Someone else will patch for bugs and security vulnerabilities. You have just opted out of all of this work!
GraphQL as a Service
GraphQL (and APIs in general) is an example of something like this. Standing up a GraphQL endpoint is relatively easy: download one of a number of engines, write a few lines of code, and you have a working endpoint. Designing an API that developers will love is only the first part. The work of delivering a performant and reliable GraphQL API does not start until you have it working and deployed.
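For instance, here is roughly what "a few lines of code" looks like, using graphql-yoga as one example of such an engine (any comparable engine would do); the schema and port here are placeholders:

```ts
import { createServer } from 'node:http';
import { createYoga, createSchema } from 'graphql-yoga';

// A working GraphQL endpoint in a handful of lines: one type, one resolver.
const yoga = createYoga({
  schema: createSchema({
    typeDefs: /* GraphQL */ `
      type Query {
        hello: String!
      }
    `,
    resolvers: {
      Query: {
        hello: () => 'world',
      },
    },
  }),
});

// Placeholder port; getting this running locally is the easy part.
createServer(yoga).listen(4000);
```

Getting this far takes minutes. Everything that matters operationally happens after it ships.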
Here are just two examples of run-time considerations for a GraphQL API:
- Your API is not performing well. Do you need to add a cache? What is a good invalidation algorithm? What is a good replacement strategy? Which queries need to ignore the cache? (A minimal caching sketch follows this list.)
- Your API is popular — developers love it! That's great! Now, can you scale it automatically? When do you scale up, and when do you trigger a scale down? How do you ensure that scaling up or down does not require downtime? What do you do if your traffic is bursty? Can your auto-scaling react quickly enough? Do you need to maintain a warm server pool? When do you rate limit, and when do you scale up?
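To make the first of these concrete, here is a minimal, hypothetical sketch of a response cache wrapped around GraphQL execution. The TTL, the bypass list, and the `cachedExecute` helper are illustrative assumptions, and a real cache would also need invalidation when the underlying data changes:

```ts
// A toy response cache: results are keyed by query text + variables, expire
// after a TTL, and operations on a bypass list skip the cache entirely.

type CacheEntry = { data: unknown; expiresAt: number };

const cache = new Map<string, CacheEntry>();
const TTL_MS = 30_000;                   // assumption: 30s of staleness is acceptable
const BYPASS = new Set(['CurrentUser']); // assumption: operations that must never be cached

function cacheKey(query: string, variables: Record<string, unknown>): string {
  return `${query}::${JSON.stringify(variables)}`;
}

async function cachedExecute(
  operationName: string | undefined,
  query: string,
  variables: Record<string, unknown>,
  execute: () => Promise<unknown>, // the real GraphQL execution
): Promise<unknown> {
  // Queries that must always be fresh skip caching altogether.
  if (operationName && BYPASS.has(operationName)) return execute();

  const key = cacheKey(query, variables);
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.data;

  const data = await execute();
  cache.set(key, { data, expiresAt: Date.now() + TTL_MS });
  return data;
}
```

Even this toy version forces the operational questions above: how much staleness is acceptable, which operations must never be cached, and when entries get evicted.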
The key point is that these are all operational concerns, and they persist (hopefully for decades) after you build the API. Through it all, your GraphQL API itself may remain essentially the same.
Declarative Systems
A well-known truth in computing systems is that you can't optimize what you can't understand. Building a good GraphQL cache requires one to deeply understand what the GraphQL is doing. Building an auto-scaling strategy for a GraphQL service requires one to deeply understand what the GraphQL is doing. Building an access control mechanism for GraphQL requires the same. Solving the N+1 problem requires the same.
In the end, each of these hard operational problems requires understanding what the GraphQL is doing. None of them can be solved reliably and efficiently otherwise.
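The N+1 problem is a good illustration. A common hand-coded fix is a batching layer such as DataLoader; the schema, types, and `fetchAuthorsByIds` function below are hypothetical placeholders:

```ts
import DataLoader from 'dataloader';

type Author = { id: string; name: string };
type Post = { id: string; authorId: string };

// Placeholder data source; in a real service this would be one SQL query
// (e.g. WHERE id IN (...)) or one batched HTTP call.
async function fetchAuthorsByIds(ids: readonly string[]): Promise<Author[]> {
  return ids.map((id) => ({ id, name: `Author ${id}` }));
}

// Collects all .load() calls made in one tick and resolves them with a
// single batched fetch instead of one fetch per post.
const authorLoader = new DataLoader<string, Author>(async (ids) => {
  const authors = await fetchAuthorsByIds(ids);
  const byId = new Map(authors.map((a) => [a.id, a]));
  return ids.map((id) => byId.get(id)!);
});

const resolvers = {
  Post: {
    // Without the loader, resolving N posts would fire N author queries.
    author: (post: Post) => authorLoader.load(post.authorId),
  },
};
```

This works, but it is per-resolver knowledge that a developer has to encode by hand. An engine that already understands what the GraphQL is doing can apply the same batching everywhere, without that hand-tuning.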
In a declarative system, you express what you want the system to do; you don't tell it how to do those things. A programmatic system, on the other hand, provides more flexibility because you spell out the how (by writing code), but the system can then no longer understand what that code is doing. A declarative system understands the what deeply, by design.
At StepZen, we believe that a declarative system for building GraphQL leads to a better execution engine. For more on this topic, check out Anant's article in The New Stack: What GraphQL Can Learn from Databases.
Database systems use declarations (tell me what to do, not how to do it) to make the overall experience of both the setup side and the interactive side amazing. In contrast, most GraphQL systems make only the interactive side declarative. The setup side is complex, and worse, the execution of interactions is suboptimal. It is entirely possible to build a GraphQL system where both sides are declarative, so that the setup is simple and the execution engine is much better.
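As a sketch of the difference, compare a declarative schema, where the schema itself says where the data comes from, with the programmatic equivalent. The `@rest` directive below echoes StepZen's declarative style, but the exact syntax, endpoint, and types are illustrative assumptions:

```ts
// Declarative: the schema carries the "what", so the engine can see the whole
// data flow and plan caching, batching, and scaling around it.
const typeDefs = /* GraphQL */ `
  type Customer {
    id: ID!
    name: String!
    email: String!
  }

  type Query {
    customer(id: ID!): Customer
      @rest(endpoint: "https://api.example.com/customers/$id")
  }
`;

// Programmatic: the same capability expressed as code. More flexible, but the
// engine can no longer see what the resolver actually does.
const resolvers = {
  Query: {
    customer: async (_: unknown, { id }: { id: string }) => {
      const res = await fetch(`https://api.example.com/customers/${id}`);
      return res.json();
    },
  },
};
```

In the first version the engine knows where every field comes from; in the second, that knowledge lives only inside the resolver's code.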
Using a declarative syntax for expressing your GraphQL model enables StepZen to run your GraphQL and solve the hard operational problems. And most importantly, it enables the developers who implement their GraphQL on StepZen to opt out of having to operate a complex system and focus instead on what their next GraphQL API should look like.