Service-Level Objective (SLO)
An SLO is a precise numerical target for system availability.
An SLO is a binding target for a collection of SLIs.
SLO = SLI + goal
An SLO is defined by an SLI together with a goal. For example, if we use REST API latency as the SLI, we can define an SLO like this: “Every week, 99% of REST API calls will complete in less than 100 ms”. REST API latency is the SLI, and “99% of calls under 100 ms each week” is the goal for that SLI.
An SLO consists of one or more SLIs, such as latency and availability. I will explain in more detail with examples.
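To make “SLO = SLI + goal” concrete, here is a minimal Python sketch of the weekly latency example. The names LatencySLO, good_ratio and is_slo_met are hypothetical, and the latency samples are made up for illustration; a real system would read them from its monitoring data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """An SLO expressed as an SLI (request latency) plus a goal."""
    threshold_ms: float  # a request counts as "good" if it finishes faster than this
    target_ratio: float  # fraction of requests that must be good, e.g. 0.99


def good_ratio(latencies_ms, threshold_ms):
    """SLI: fraction of requests that completed under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def is_slo_met(latencies_ms, slo):
    """Compare the measured SLI against the goal."""
    return good_ratio(latencies_ms, slo.threshold_ms) >= slo.target_ratio


# Hypothetical one-week window: 995 fast calls and 5 slow ones.
weekly_latencies_ms = [42.0] * 995 + [250.0] * 5
weekly_slo = LatencySLO(threshold_ms=100.0, target_ratio=0.99)

print(good_ratio(weekly_latencies_ms, weekly_slo.threshold_ms))
print(is_slo_met(weekly_latencies_ms, weekly_slo))
```

In this made-up window 99.5% of calls are under 100 ms, so the 99% goal is met. Ten slow calls out of 1000 would still meet it (exactly 99%), while eleven would miss it.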
What do users care about?
For an SLO, it is important to think about what your users care about, not what you can measure. Setting goals too high can be inefficient. For example, a 10% server failure rate is a very high number. But if the system has a high-availability architecture, a failed server’s work can be picked up by another instance, and end users never notice the failure. In that case, 90% (100% - 10%) server availability is not a problem. Meeting a high goal takes a lot of effort, and that effort reduces feature development velocity. To select proper goals for an SLO, it is important to understand what users care about.
Sometimes it is hard to know what users care about, and sometimes it is impossible, so you will end up approximating users’ needs in some way.
However, if you simply start with what’s easy to measure, you’ll end up with less useful SLOs.
Choosing goals
So how can we choose a proper goal for an SLO? Goal setting is not a purely technical activity because of its product and business implications, which should be reflected in both the SLIs and the SLOs.
Keep it simple
SLOs are common goals across the organization, which means all stakeholders need to be able to understand an SLO easily.
From a technical standpoint, complicated aggregation in SLIs can obscure changes in system performance and is hard to understand.
Avoid perfection (absolutes)
An ultra-fast response time with an “always” available system is unrealistic.
Have as few SLOs as possible
Choose just enough SLOs to provide good coverage of your system’s attributes.
If you cannot win a conversation about priorities by citing a particular SLO, that SLO is probably not worth having.
Change SLOs gradually
You can refine an SLO’s definition and goal over time as you learn about the system’s behavior. It is better to start with a loose (lower) goal and tighten it later. If you start with a tight (higher) goal and relax it later, it builds a bad habit. The objective should rise over time.
Defining SLOs and managing expectations
SLIs have upper and lower bounds
By nature, an SLO takes the form “SLI <= target” or “lower bound <= SLI <= upper bound”.
Let’s say that we have an SLO like “100ms < API latency < 300ms”.
The upper bound “< 300ms” is easy to accept. But why do we have a lower bound of “100ms”?
System performance and development speed are inversely related. If you provide a lower bound, the development team will not spend more effort on performance once latency reaches it and will spend that effort on feature development instead.
It is also about the expectation you set for your users. If you normally do much better than your SLO and then suddenly start running exactly at it, breaking a lot more often than users are used to, they will be unpleasantly surprised if they are trying to build other services on top of yours.
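As a rough illustration of the bounded form, the following Python sketch classifies a measured latency SLI against the “100ms < API latency < 300ms” example. The function name classify_latency and the three labels are assumptions made for this sketch, not part of any standard tool.

```python
def classify_latency(sli_ms, lower_ms=100.0, upper_ms=300.0):
    """Place a measured latency SLI relative to the SLO bounds.

    The bounds are strict, matching "lower < latency < upper".
    """
    if lower_ms >= upper_ms:
        raise ValueError("lower bound must be below upper bound")
    if sli_ms >= upper_ms:
        return "violating"      # slower than the SLO allows
    if sli_ms <= lower_ms:
        return "overachieving"  # faster than the SLO promises
    return "within bounds"


# Hypothetical measurements of typical API latency in milliseconds.
for measured_ms in (50.0, 200.0, 350.0):
    print(measured_ms, classify_latency(measured_ms))
```

An “overachieving” result is not an error. It is a signal that effort could shift from performance work to feature work, or that users may be starting to expect more than the SLO promises.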
Keep a safety margin
Keeping a tighter internal SLO than the SLO advertised to users gives you room to respond to problems before they become visible externally.
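A safety margin can be sketched as two targets checked in order. In the Python sketch below, the 99.95% internal target, the 99.9% external target, and the name check_availability are all hypothetical values chosen for illustration.

```python
def check_availability(measured, internal_target=0.9995, external_target=0.999):
    """Compare measured availability with a tight internal SLO and a looser external one."""
    if internal_target < external_target:
        raise ValueError("internal target should be at least as strict as the external one")
    if measured < external_target:
        return "external SLO breached"
    if measured < internal_target:
        # The safety margin: the team can react here before users are affected.
        return "internal SLO breached"
    return "ok"


# Hypothetical availability ratios over one measurement window.
for measured in (0.9998, 0.9992, 0.9985):
    print(measured, check_availability(measured))
```

The middle case is the useful one: the internal SLO is missed while the advertised SLO still holds, which is the window in which the team can respond.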
Don’t overachieve
If your service’s actual performance is much better than its SLO, users will come to rely on its current performance. This is one of the reasons an SLO has both a lower and an upper bound.
SLOs can and should be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about. A good SLO is a helpful, legitimate forcing function for a development team. But a poorly thought-out SLO can result in wasted work if a team uses heroic efforts to meet an overly aggressive SLO, or a bad product if the SLO is too lax. SLOs are a massive lever: use them wisely.
Reference
- Example of SLO document from SRE book
- Practical example about SLO management (SLO, Error budget visualization)
- SRE fundamentals: SLIs, SLAs and SLOs