Minh Triet Tran's Home

Chapter 3 Embracing Risk Part 1

23 Dec 2019

100% Reliable Service Comes At A Price

And the price is devided on two things:

Cost of redundant machine/compute resources
Opportunity cost: engineers working to achieve 100% reliability doesn’t work on features for end users.

In a sense, we view the availability target as both a minimum and a maximum

Measuring Service Risk

As standard practice at Google, we are often best served by identifying an objective metric to represent the property of a system we want to optimize. By setting a target, we can assess our current performance and track improvements or degradations over time.

So everything starts with a very clear vision of what the outcome would be.

And the metric Google came up with to measure service risk (or more accurately, service risk tolerance) is unplanned downtime. Unplanned downtime is deduced from the desired level of service availability, which is measured in percentage.

That way, Google teams can agree upon a very specific target for how risk-tolerant a service must be.

There are two methods to calculate service availability:

Time-based availability

availability = uptime/(uptime + downtime)

For example, targeting 99.99% availability means allowing 52.56 minutes of downtime per year or 12.96 minutes per quarter or 4.32 minutes per month and so on. They even have an avaiability table to chart out the service availability and its corresponding total unplanned downtime allowable on a yearly all the way down to daily basis.

This method is met with a problem: Google has a fault isolation which prevents issues in an area from spreading to others. In there own words, it means they’re “at least partially up at all times”. This leads us to the second method to express service availability: in terms of aggregated request success rate.

Aggregate avaiability

availability = successful requests/total requests

For example, a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.

Aggragate avaiability is more useful than time-based metric because:

Using an aggregate unavailability metric (i.e., “X% of all operations failed”) is more useful than focusing on outage lengths for services that may be partially available—for instance, due to having multiple replicas, only some of which are unavailable—and for services whose load varies over the course of a day or week rather than remaining constant.

Google set quarterly availability targets and track them on a weekly or daily basis.

My thoughts

As usual, I learned something new.

It is amazing how Google expresses something so vague with such concrete and simple mathematics.

However, I wonder why their availability targets only comprises of nines. What’s wrong with an availability target of 99.80%? As far as I’m concerned, maybe they just want to have a quick and easy way to agree on the target service availability for the whole company. If people could choose to set an arbitrary level of avaiability target, it would be hard to imagine the coressponding allowable unplanned downtime.

Apart from the method to calculate service’s risk tolerance (or service availability), I learn that it is important to:

Define metrics
Set goals before you start optimize anything.

This helps one to:

assess our current performance and track improvements or degradations over time

As for our company, planned downtime is common. They happen for releases at least once every 6 weeks. However, hotfixes are commonplace and increase the rate two fold.

For us, we’re still having trouble with planned downtime: our business requires that 0 downtime.

Recently, I have learned that the notion of 0 downtime is not clear. It’s just what sales use for maketting purposes. This is just another requirement and should be discussed and defined concretely in terms of technical details. It’s easy for different people to interprete “0 downtime” differently.

This is something I should clearly discuss with my superiors.

As for the current stage of our DevOps/SRE team, I realized that we do not measure the performance of our releases. We do measure the time for each step, but in a somewhat unformal way and it only serves the purpose of forecasting the downtime for the next release. That is not to mention the time reported is often incorrect. There’s a lot of reasons for this:

Tasks and timing are not automated. We manually click a button and look at the clock.
We lack the incentives to perform this seriously. If it’s just for the purpose of release time estimation, we don’t need to meticulously measure time spent on each step.

In short, since we’re not trying to optimize the planned downtime for release we don’t think setting out a metric for the performance of release is important.

And why is that so? Even though our service wants “0 downtime”? Because we plan on a long and unclear plan for “0 downtime” release and not for a “reduced downtime” one.

It’s fine and all that we decide to focus on a more suitable goal and accept current situation. But the problem is the other goal is not followed through thoroughly.

Anymore of this is irrelevant for the topic SRE. I want to make the blog useful to me and more practical with my real work experience but at times, it tends to become a personal conscience mirror.

Come back to what I learned from the book, another useful advice is to think of the service availability metrics in terms of aggregate sucessful rate, rather than a time-based metric.

Minh Triet Tran's Home

Linux - Devops - Networking

Chapter 3 Embracing Risk Part 1

100% Reliable Service Comes At A Price

Measuring Service Risk

My thoughts