SRE #3 – Service Level Objective (SLO)

Service-Level Objective (SLO)

An SLO is a precise numerical target for system availability.

An SLO is a binding target for a collection of SLIs.

SLO = SLI + goal

An SLO is defined by an SLI plus a goal. For example, if we use REST API latency as an SLI, we can define an SLO like this: “Every week, 99% of REST API calls will complete in less than 100 ms”. REST API latency is the SLI, and 100 ms is the goal for that SLI.
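The weekly latency SLO above can be sketched as a simple check. This is a minimal illustration; the function name, threshold, and goal values follow the example in the text but are otherwise my own:

```python
def slo_met(latencies_ms, threshold_ms=100.0, goal=0.99):
    """Return True if the fraction of 'good' calls (faster than the
    threshold) meets or exceeds the goal fraction."""
    if not latencies_ms:
        return True  # no traffic this week, so nothing was violated
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms) >= goal

# 99 fast calls plus 1 slow call: exactly 99% good, so the SLO is met.
week_of_calls = [50.0] * 99 + [250.0]
print(slo_met(week_of_calls))  # True
```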

An SLO consists of one or more SLIs, such as latency or availability. I will explain in more detail with examples.

What does the user care about?

When setting an SLO, it is important to think about what your users care about, not what you can measure. Setting goals too high can be inefficient. For example, a 10% server failure rate sounds very high. But if the system has a highly available architecture, a failed server can be taken over by another instance, and the end user will never notice the failure. In that case, 90% (100% − 10%) single-server availability is not a problem. Meeting an unnecessarily high goal takes a lot of effort, and that effort reduces feature development velocity. To select proper SLO goals, it is important to understand what users care about.

Sometimes it is hard, and sometimes impossible, to know what users care about, so you will end up approximating users’ needs in some way.

Or, if you simply start with what’s easy to measure, you’ll end up with less useful SLOs.

Choosing goals

So how can we choose the proper goal for an SLO? Goal setting is not a purely technical activity, because of the product and business implications, which should be reflected in both the SLIs and the SLO.

Keep it simple

SLOs are common goals across the organization. This means all stakeholders need to be able to understand the meaning of an SLO easily.

On the technical side, complicated aggregations in SLIs can obscure changes in system performance and are hard to understand.

Avoid perfection (absolutes)

An ultra-fast response time from an “always” available system is unrealistic.

Have as few SLOs as possible

Choose just enough SLOs to provide good coverage of your system’s attributes.

If you cannot win a conversation about priorities by citing a particular SLO, that SLO is probably not worth having.

Gradually change SLOs

You can refine an SLO’s definition and goal over time as you learn about the system’s behavior. It is better to start with a loose (lower) goal that you tighten later. Starting with a tight (higher) goal and relaxing it later builds bad habits; objectives should become stricter over time.

Defining SLOs and managing expectations

An SLI has upper and lower bounds

The natural structure of an SLO is “SLI <= target” or “lower bound <= SLI <= upper bound”.

Let’s say we have an SLO like “100 ms < API latency < 300 ms”.

The upper bound “< 300 ms” makes intuitive sense. But why do we have the lower bound of “100 ms”?

System performance and development speed are inversely related. With a lower bound in place, the development team will not spend extra effort on performance enhancement, and can instead put that effort into other feature development.
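A bounded SLO like the one above can be expressed as a two-sided check on an SLI value. A minimal sketch, using the 100 ms / 300 ms bounds from the example (the function names and the choice of an average-latency SLI are illustrative):

```python
def average_latency_ms(latencies_ms):
    """A simple SLI: mean request latency over a window of samples."""
    return sum(latencies_ms) / len(latencies_ms)

def within_bounds(sli_ms, lower_ms=100.0, upper_ms=300.0):
    # Upper bound protects users from slow responses; the lower bound
    # signals that further speedups are not worth the engineering effort.
    return lower_ms <= sli_ms <= upper_ms

print(within_bounds(average_latency_ms([100.0, 200.0])))  # True (SLI = 150 ms)
```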

There is also an expectation you are setting for your users: if you suddenly start failing much more often than they are used to, because you start running exactly at your SLO rather than doing much better than it, users who are trying to build other services on top of yours will be unhappily surprised.

Keep a safety margin

Keeping a tighter internal SLO than the SLO advertised to users gives you room to respond to problems before they become visible externally.

Don’t overachieve

If your service’s actual performance is consistently better than its SLO, users will come to rely on its current performance. This is one of the reasons why an SLO has both lower and upper bounds.

SLOs can and should be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about. A good SLO is a helpful, legitimate forcing function for a development team. But a poorly thought-out SLO can result in wasted work if a team uses heroic efforts to meet an overly aggressive SLO, or a bad product if the SLO is too lax. SLOs are a massive lever: use them wisely.

Reference

SRE #2 – Service Level Indicator (SLI)

Service-Level Indicator (SLI)

An SLI is a service level indicator: a quantitatively defined measure of some aspect of the level of service provided.

It is generally recommended to treat the SLI as the ratio of two numbers: the number of good events divided by the total number of events.
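That good-events-over-total-events ratio is a one-liner, but the zero-traffic edge case is worth making explicit. A minimal sketch (the convention of returning 1.0 when there is no traffic is an assumption, not from the source):

```python
def sli_ratio(good_events, total_events):
    """SLI as a fraction in [0, 1]: good events / total events."""
    if total_events == 0:
        return 1.0  # assumed convention: no traffic counts as fully good
    return good_events / total_events

# 990 successful requests out of 1000 gives a 99% availability SLI.
print(sli_ratio(990, 1000))  # 0.99
```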

For example,

  • Request latency : how long it takes to return a response to a request.
  • Error rate : the fraction of all requests that fail, or its complement, the success rate.
  • Throughput : typically measured per second, e.g., TPS (transactions per second) or QPS (queries per second).
  • Availability : system uptime, i.e., the fraction of time the service is usable in production.
  • Durability (storage systems only) : the likelihood that data will be retained over a long period of time.

What do you and your users care about?

You shouldn’t use every metric you can track in your monitoring system.

An understanding of what your users want from the system will inform the judicious selection of a few indicators. Choosing too many indicators makes it hard to pay the right level of attention to each of them.

SLIs can be chosen per type of service, based on the corresponding user expectations. For example:

  • User-facing serving systems : generally care about availability, latency, and throughput
  • Storage systems : latency, availability, and durability
  • Big data analytics systems : throughput and end-to-end latency
  • Machine learning systems : latency, availability, throughput, accuracy (prediction), and training time (training)

Collecting indicators

The indicator metrics can be collected with a monitoring system such as Prometheus, or through periodic log analysis. Google Stackdriver is a good solution for this: Stackdriver supports monitoring as well as metrics derived from logs via Stackdriver Logging.
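Log-based collection can be as simple as scanning application logs for latency fields. The sketch below assumes a hypothetical log format with a `latency_ms=` field; the regex and log lines are illustrative, not from any real system:

```python
import re

# Assumed log format: lines containing "latency_ms=<integer>".
LATENCY_FIELD = re.compile(r"latency_ms=(\d+)")

def latencies_from_logs(lines):
    """Extract latency samples (in ms) from application log lines."""
    samples = []
    for line in lines:
        match = LATENCY_FIELD.search(line)
        if match:
            samples.append(int(match.group(1)))
    return samples

logs = [
    "GET /api/v1/users status=200 latency_ms=42",
    "GET /api/v1/users status=500 latency_ms=310",
    "health check ok",  # no latency field: skipped
]
print(latencies_from_logs(logs))  # [42, 310]
```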

Indicator metrics can be simplified by using aggregations such as sum or average, but most metrics are better thought of as distributions rather than averages. For example, for a latency SLI, some requests will be served quickly, while others will invariably take longer.

Using percentiles for indicators allows you to consider the shape of the distribution.

A high-order percentile, such as the 99th or 99.9th, shows you a worst-case value.

The 50th percentile (aka the median) emphasizes the typical case.

The higher the variance in response times, the more the typical user experience is affected by long-tail behavior, an effect exacerbated at high load by queuing effects. User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values, on the grounds that if the 99.9th percentile behavior is good, then the typical experience is certainly going to be.
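The difference between the typical case and the tail can be made concrete with a nearest-rank percentile. In the sketch below, a single 950 ms outlier barely moves the median but dominates the 99th percentile (the sample data is invented for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that at
    least p% of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [10, 11, 11, 12, 12, 12, 13, 13, 14, 950]
print(percentile(latencies_ms, 50))  # 12  -> the typical experience
print(percentile(latencies_ms, 99))  # 950 -> the worst-case tail
# The plain average (105.8 ms) describes neither case well.
```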

Standardized indicators

It is recommended that you standardize on common definitions for SLIs.

For example

  • Aggregation interval : average over 1 minute
  • Aggregation region : all the tasks in a cluster
  • How frequently measurements are made : every 10 seconds
  • Which requests are included : HTTP GETs from black-box monitoring
  • How the data is acquired : through monitoring systems
  • Data-access latency : time to last byte

To save effort, build a set of reusable SLI templates for each common metric. These also make it simpler for everyone to understand what a specific SLI means.
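Such a reusable template can be captured as a small data structure whose defaults encode the standardized definitions from the list above. The class name and field names are my own; only the default values come from the text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SliTemplate:
    """A reusable SLI definition; defaults mirror the standardized
    example definitions (illustrative, not a real library type)."""
    name: str
    aggregation_interval: str = "average over 1 minute"
    aggregation_region: str = "all the tasks in a cluster"
    measurement_frequency: str = "every 10 seconds"
    included_requests: str = "HTTP GETs from black-box monitoring"
    data_source: str = "monitoring systems"

# A concrete SLI only needs to say what it measures; the rest is shared.
latency_sli = SliTemplate(name="request latency")
print(latency_sli.measurement_frequency)  # every 10 seconds
```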

Reference – Google SRE book / Chapter 4. Service Level Objectives