Availability: Keeping Systems Up & Running
avail·abil·i·ty | \ ə-ˌvā-lə-ˈbi-lə-tē
The percentage of time that the infrastructure, system, or solution remains operational under normal circumstances in order to serve its intended purpose.
Availability is an important metric for any cloud-delivered service. Regardless of how useful the features of a service are, without reasonable availability the service is unusable. And this applies to small startup companies as much as to large popular services.
This blog, the second in a series on software "-ilities", is about how we at StepZen think about availability. (For a discussion of reliability, see How StepZen Balances Shipping Quickly and Running a Reliable Service.)
How many 9s?
The most common reason services break is change: new code, or deployment and configuration changes. Let's start with the understanding that there are approximately 700 hours in a month. Two nines, or 99% availability, allows you about 7 hours of downtime in a month. Three nines, or 99.9% availability, allows you about 40 minutes of downtime a month. Four nines, or 99.99%, is about 4 minutes of downtime each month, and five nines, or 99.999%, is about 5 minutes of downtime each year.
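The arithmetic behind those numbers is easy to check. Here is a minimal Python sketch; the function and constant names are ours, chosen for illustration, and the 700-hour month is the same round figure used above.

```python
HOURS_PER_MONTH = 700        # round figure from the text; a 30-day month has 720 hours
HOURS_PER_YEAR = 365 * 24    # 8760 hours


def downtime_minutes(availability_percent: float, period_hours: float) -> float:
    """Minutes of downtime allowed in a period at the given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return period_hours * 60.0 * unavailable_fraction


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        per_month = downtime_minutes(pct, HOURS_PER_MONTH)
        per_year = downtime_minutes(pct, HOURS_PER_YEAR)
        print(f"{pct}%: {per_month:.1f} min/month, {per_year:.1f} min/year")
```

With a 700-hour month, two nines works out to 420 minutes (7 hours), three nines to 42 minutes, and four nines to 4.2 minutes.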
If either your deployment or recovery processes need hands on a keyboard, three nines of availability, or more, is out of reach.
This is why you should always consider a high-quality cloud service provider for your infrastructure components. If you choose to do things on your own, you will need to build a lot of automation in addition to your core functional software.
Context for availability or reliability
Every service breaks at some point, and when it does, it causes an availability miss until the service is no longer broken. This means that there are two independent concerns that matter:
- How frequently does your service break?
- How rapidly do you unbreak your service when it does?
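To see how the two concerns combine, here is a rough sketch using the standard steady-state approximation: availability is uptime divided by uptime plus downtime. The function name and the example figures are assumptions for illustration, not measurements from our service.

```python
def availability(hours_between_breaks: float, hours_to_unbreak: float) -> float:
    """Fraction of time the service is up, given the average uptime between
    breaks and the average time it takes to unbreak the service."""
    if hours_between_breaks <= 0 or hours_to_unbreak < 0:
        raise ValueError("need positive uptime and non-negative recovery time")
    return hours_between_breaks / (hours_between_breaks + hours_to_unbreak)


if __name__ == "__main__":
    # Same break frequency (roughly once a month), different recovery speed.
    slow = availability(hours_between_breaks=700, hours_to_unbreak=1.0)
    fast = availability(hours_between_breaks=700, hours_to_unbreak=4 / 60)
    print(f"1 hour to unbreak:    {slow:.5f}")
    print(f"4 minutes to unbreak: {fast:.5f}")
```

In this example, a service that breaks about once a month misses three nines if each break takes an hour to fix, and reaches four nines if each break takes four minutes.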
In a previous blog, we described our approach to attacking the first problem. Here we'll look at the second problem. Naively, it might appear that shipping often will cause more breaks. If you never ship new code, outages are less likely. And precisely for this reason, companies employ code freezes and error budgets as mechanisms to limit service breakages.
However, shipping small increments can also dramatically decrease the time it takes to unbreak the service. We believe that a combination of shipping small increments often, and thoughtful automation for monitoring and recovery, delivers the best overall availability.
An availability miss is both a problem visible to customers and a tax on the team. A team that is doing a lot of firefighting isn't shipping new features. If you use third party services, like we do, be thoughtful about which ones you use, and how you use them. If they break, so do you.
Build your processes so that availability gets early attention. It is one of the best team productivity investments you can make. Be careful when you use external services: picking the right service can deliver huge value, and picking the wrong one can cause a lot of pain.
Starting the server
We like our servers to start and be ready to receive traffic quickly. StepZen's core GraphQL server starts up in less than a second. There are a couple of benefits:
- Things that work fast tend to do less, and this is strongly correlated with reliability: there are fewer things that can go wrong, and when they do, fewer things that could have been the cause of the error.
- We use Kubernetes to deploy our services. Kubernetes can be configured to automatically start up a new server when one fails. This means that we can tolerate a low rate of server crashes with near-zero impact on service availability.
If we are nimble, software does not need to be perfect. It's good enough to be good enough.
As soon as we push a change to production, we automatically run some of our most complicated use cases against it. Any unexpected behavior, or failure, results in a rollback. Indeed, we run these use cases automatically before we commit any change to our internal development branch.
In addition to ensuring high quality within our codebase, this ensures that we detect possible outages very quickly, sometimes even before they happen, and avoid any potential availability miss.
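The pattern of deploying, verifying, and rolling back on failure can be sketched in a few lines of Python. This is not our actual pipeline: `deploy`, `rollback`, and the checks are hypothetical callables that you would wire up to your own tooling.

```python
from typing import Callable, Iterable

Check = Callable[[], bool]  # a use case that returns True when it behaves as expected


def verify_release(checks: Iterable[Check]) -> bool:
    """Run each check; a False result or an exception counts as a failure."""
    for check in checks:
        try:
            if not check():
                return False
        except Exception:
            return False
    return True


def deploy_with_verification(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    checks: Iterable[Check],
) -> bool:
    """Deploy, run the checks against the new release, and roll back on failure.

    Returns True if the release stays in production, False if it was rolled back.
    """
    deploy()
    if verify_release(checks):
        return True
    rollback()
    return False
```

Treating an exception the same as a failed check is a deliberate choice in this sketch: unexpected behavior of any kind should trigger the rollback, and diagnosis can wait.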
Be your own strongest critic.
In a hospital emergency room, a patient is stabilized first; diagnosis and treatment come later. Unbreaking is the first step.
Unbreaking quickly and effectively relies on ensuring that our releases are hermetic. Specifically, configuration values are treated as code, and are linked to the release. Therefore, rolling back also rolls back configuration values. The goal is to reset the service to a state which we know has worked.
- If there is a recent release, roll it back.
- If there is no recent change, or a rollback does not fix the issue, it is very likely that only a few customers are affected. Specifically, someone has triggered an existing latent bug.
In the second case, we focus on a fix and roll forward. Since we release frequently, we do not expect any serious delays. The key thought here is that things are stable, and while it's important to ensure that every customer is whole, this does not represent a service level breakage or an availability miss.
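Here is a minimal sketch of what "hermetic" means for rollbacks, assuming a release is simply a version paired with its configuration values. The `Release` and `Service` classes and the `timeout_ms` setting are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Release:
    """A release bundles the code version with its configuration values."""
    version: str
    config: Mapping[str, str]


class Service:
    """Keeps the history of releases so a rollback restores a known-good state."""

    def __init__(self, initial: Release) -> None:
        self._history: List[Release] = [initial]

    @property
    def current(self) -> Release:
        return self._history[-1]

    def deploy(self, release: Release) -> None:
        self._history.append(release)

    def roll_back(self) -> Release:
        """Drop the latest release; code and config revert together."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._history.pop()
        return self.current


if __name__ == "__main__":
    service = Service(Release("v1", {"timeout_ms": "500"}))
    service.deploy(Release("v2", {"timeout_ms": "50"}))
    service.roll_back()
    print(service.current.version, service.current.config["timeout_ms"])
```

Because the configuration travels with the release, there is no separate step in which someone has to remember to revert a setting by hand.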
In this regard, you might be familiar with a related concept: error budgets. It's unrealistic to expect that engineers can write perfect software. Error budgets quantify the acceptable gap between perfect services and real services that are good enough. Error budgets allow you to turn availability misses into routine maintenance and fix tasks. At StepZen, we try to use our error budgets wisely.
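One common way to put a number on an error budget is to count failed requests against a target. The sketch below assumes that request-based formulation; the function name, the 99.9% target, and the request counts are illustrative and not a description of how we track budgets at StepZen.

```python
def error_budget_remaining(
    target_percent: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    allowed_failures = total_requests * (1.0 - target_percent / 100.0)
    if allowed_failures <= 0:
        raise ValueError("a 100% target or zero traffic leaves no budget to measure")
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% target over 1,000,000 requests allows about 1,000 failures.
    remaining = error_budget_remaining(99.9, 1_000_000, 250)
    print(f"budget remaining: {remaining:.0%}")
```

In this example, 250 failures against an allowance of about 1,000 leaves roughly three quarters of the budget unspent.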
Do not patch things. Either roll back and fix, or fix and roll forward. A hastily prepared patch fix can turn a small problem into a large problem.
Control is not relevant
There are two situations that arise when things break.
- Under your control: You can fix it. It's your service, and your code.
- Not under your control: You can't fix things. The breakage is in someone else's code and service.
In either case, the important question from an availability perspective is not who has control, but how long a fix will take. If you are using a well-engineered vendor service, the vendor will usually fix outages far faster than you can.
From an availability perspective, we believe that you should run your own infrastructure service only if you are willing to match or outdo the availability of the best one. And if you can do this, consider offering it to others outside your company!
Stuff happens. Systems go down. But rather than hoping that it doesn’t occur, we can engineer processes and systems for optimal operational performance. Summarizing the practices that guide our approach at StepZen:
- Automate: If either deployment or recovery processes need hands on a keyboard, three nines or more of availability is out of reach.
- Build engineering processes so that availability gets early attention.
- Be nimble. If you are, software does not need to be perfect. It's good enough to be good enough.
- When you push a change to production, automatically run some of the most complicated use cases against it.
- Do not patch things. Rather, roll back and fix, or fix and roll forward. A hastily prepared patch fix can turn a small problem into a large one.
- The important consideration from an availability perspective is not who controls the service, but how long a fix will take.
Check us out on stepzen.com. We're always happy to hear your feedback on our Community Discord.