SRE #3 – Service Level Objective (SLO)

Service-Level Objective (SLO)

An SLO is a precise numerical target for system availability.

An SLO is a binding target for a collection of SLIs.

SLO = SLI + goal

An SLO is defined as an SLI plus a goal. For example, if we use REST API latency as an SLI, we can define an SLO like this: “Every week, 99% of REST API calls will complete in less than 100 ms”. REST API latency is the SLI, and “99% of calls under 100 ms, every week” is the goal for that SLI.

An SLO consists of one or more SLIs, such as latency or availability. I will explain in more detail with examples.
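To make “SLO = SLI + goal” concrete, here is a minimal sketch in Java. The class name, method names and sample latencies are made up for illustration; only the 99% target and the 100 ms threshold come from the example above.

```java
import java.util.List;

class LatencySlo {
    // Goal taken from the example SLO: 99% of calls faster than 100 ms.
    static final double TARGET_RATIO = 0.99;
    static final long THRESHOLD_MS = 100;

    // SLI: fraction of calls in the window that finished under the threshold.
    static double goodRatio(List<Long> latenciesMs) {
        if (latenciesMs.isEmpty()) {
            return 1.0; // assumption: a window without traffic counts as healthy
        }
        long good = latenciesMs.stream().filter(ms -> ms < THRESHOLD_MS).count();
        return (double) good / latenciesMs.size();
    }

    // SLO: the measured SLI compared against the goal.
    static boolean meetsSlo(List<Long> latenciesMs) {
        return goodRatio(latenciesMs) >= TARGET_RATIO;
    }

    public static void main(String[] args) {
        List<Long> week = List.of(20L, 35L, 80L, 150L); // made-up sample latencies
        System.out.println("SLI = " + goodRatio(week) + ", SLO met = " + meetsSlo(week));
    }
}
```

In a real system the latencies would come from your monitoring system for the whole week, not from a hard-coded list.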

What do users care about?

For an SLO, it is important to think about what your users care about, not what you can measure. Setting goals too high can be inefficient. For example, a 10% server failure rate sounds very high. But if the system has a high-availability architecture, a failed server is covered by another instance and the end user never notices the failure. In that case, 90% (100% - 10%) availability per server is not a problem. Meeting a very high goal takes a lot of effort, and the result is reduced feature development velocity. To select proper goals for an SLO, it is important to understand what users care about.

Sometimes it is hard to know what users care about, and sometimes it is impossible, so you will end up approximating users’ needs in some way.

However, if you simply start with what’s easy to measure, you’ll end up with less useful SLOs.

Choosing goals

So how can we choose the proper goal for an SLO? Goal setting is not a purely technical activity, because of its product and business implications, which should be reflected in both the SLIs and the SLOs.

Keep it simple

An SLO is a common goal across the organization. This means that all stakeholders need to be able to understand the SLO easily.

From a technical point of view, complicated aggregation in SLIs can obscure changes in system performance and is hard to understand.

Avoid perfection (absolutes)

An ultra-fast response time on an “always” available system is unrealistic.

Have as few SLOs as possible

Choose just enough SLOs to provide good coverage of your system’s attributes.

If you cannot win a conversation about priorities by citing a particular SLO, that SLO is probably not worth having.

Change SLOs gradually

You can refine the SLO definition and goal over time as you learn about the system’s behavior. It is better to start with a loose (lower) goal and tighten it later. If you start with a tight (higher) goal and relax it later, it builds a bad habit. The objective should get stricter as time goes by.

Defining SLO and expectation management

An SLI has upper and lower bounds

By nature, an SLO takes the form “SLI <= target” or “lower bound <= SLI <= upper bound”.

Let’s say that we have an SLO like “100ms < API latency < 300ms”.

The upper bound “< 300ms” is easy to accept. But why do we have the lower bound “100ms”?

System performance and development speed are inversely related. If you provide a lower bound, the development team will not spend more effort on performance once latency is already near that bound, and can spend that effort on feature development instead.

It is also about the expectation you set for your users. If you suddenly start breaking a lot more often than they are used to, because you now run exactly at your SLO rather than doing much better than it, users will be unpleasantly surprised if they are trying to build other services on top of yours.

Keep a safety margin

Having a tighter internal SLO than the SLO advertised to users gives you room to respond to problems before they become visible externally.

Don’t overachieve

If your service’s actual performance is better than its SLO, users will come to rely on its current performance. This is one of the reasons why an SLO has both a lower and an upper bound.
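The ideas of a lower bound, an upper bound and a safety margin can be put together in a small sketch. The 100 ms and 300 ms bounds come from the example SLO above. The 250 ms internal limit, the class name and the status names are assumptions I made up for illustration.

```java
class BoundedSlo {
    enum Status { FASTER_THAN_NEEDED, WITHIN_SLO, INTERNAL_WARNING, SLO_VIOLATED }

    // Advertised SLO from the example: 100ms < API latency < 300ms.
    static final long LOWER_MS = 100;
    static final long UPPER_MS = 300;
    // Hypothetical tighter internal SLO that acts as the safety margin.
    static final long INTERNAL_UPPER_MS = 250;

    static Status classify(long latencyMs) {
        if (latencyMs >= UPPER_MS) {
            return Status.SLO_VIOLATED;        // users can see the problem
        }
        if (latencyMs >= INTERNAL_UPPER_MS) {
            return Status.INTERNAL_WARNING;    // still fine externally, time to react
        }
        if (latencyMs <= LOWER_MS) {
            return Status.FASTER_THAN_NEEDED;  // overachieving: users may start relying on it
        }
        return Status.WITHIN_SLO;
    }

    public static void main(String[] args) {
        for (long ms : new long[] {50, 200, 260, 320}) {
            System.out.println(ms + " ms -> " + classify(ms));
        }
    }
}
```

In practice you would classify an aggregated value such as a percentile over a time window, not a single request.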

SLOs can and should be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about. A good SLO is a helpful, legitimate forcing function for a development team. But a poorly thought-out SLO can result in wasted work if a team uses heroic efforts to meet an overly aggressive SLO, or a bad product if the SLO is too lax. SLOs are a massive lever: use them wisely.

SRE #2 – Service Level Indicator (SLI)

Service-Level Indicator (SLI)

An SLI is a service level indicator: a defined quantitative measure of some aspect of the level of service.

It is generally recommended to treat the SLI as the ratio of two numbers: the number of good events divided by the total number of events.

For example,

  • Request latency : how long it takes to return a response to a request
  • Error rate : the fraction of requests that fail (often tracked the other way round, as the ratio of successful requests)
  • Throughput : typically measured per second, like TPS (transactions per second) or QPS (queries per second)
  • Availability : the fraction of time the system is up and able to serve in production
  • Durability (storage systems only) : the likelihood that data will be retained over a long period of time
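The “good events divided by total events” form mentioned above can be sketched as a small counter. This is only an illustration: the class name, the sample status codes and the rule that a 5xx response is a bad event are my assumptions.

```java
class RatioSli {
    private long good;
    private long total;

    // Count one event; the caller decides what "good" means for this SLI.
    void record(boolean isGood) {
        total++;
        if (isGood) {
            good++;
        }
    }

    // good events / total events, or NaN when nothing has been recorded yet.
    double value() {
        return total == 0 ? Double.NaN : (double) good / total;
    }

    public static void main(String[] args) {
        RatioSli availability = new RatioSli();
        int[] statusCodes = {200, 200, 503, 200}; // made-up sample responses
        for (int code : statusCodes) {
            availability.record(code < 500); // assumption: 5xx counts as a bad event
        }
        System.out.println("availability SLI = " + availability.value());
    }
}
```

The same counter works for a latency SLI if “good” is defined as “faster than the threshold”.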

What do you and your users care about?

You shouldn’t use every metric you can track in your monitoring system.

An understanding of what your users want from the system will inform the judicious selection of a few indicators. Choosing too many indicators makes it hard to pay the right level of attention to each of them.

SLIs can be chosen per type of service, based on what users expect from it. For example:

  • User-facing serving system : generally cares about availability, latency and throughput
  • Storage system : latency, availability and durability
  • Big data analytics system : throughput, end-to-end latency
  • Machine Learning system : latency, availability, throughput, accuracy (prediction) and training time (training)

Collecting indicators

The indicator metrics can be collected with a monitoring system such as Prometheus, or with periodic log analysis. Google Stackdriver is a good solution for this. Stackdriver supports monitoring as well as metrics derived from logs through Stackdriver Logging.

Indicator metrics can be simplified by aggregation such as sum or average. But most metrics are better thought of as distributions rather than averages. For example, for a latency SLI, some requests will be serviced quickly, while others will invariably take longer.

Using percentiles for indicators allows you to consider the shape of the distribution.

A high-order percentile, such as the 99th or 99.9th, shows you a plausible worst-case value.

The 50th percentile (also known as the median) emphasizes the typical case.

The higher the variance in response times, the more the typical user experience is affected by long-tail behavior, an effect exacerbated at high load by queuing effects. User studies have shown that people typically prefer a slightly slower system to one with high variance in response time, so some SRE teams focus only on high percentile values, on the grounds that if the 99.9th percentile behavior is good, then the typical experience is certainly going to be.
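To see why a distribution tells you more than an average, here is a small sketch that compares the mean with the 50th and 99th percentiles. It uses the simple nearest-rank method. Monitoring systems usually estimate percentiles from histograms instead, so treat this as an illustration only. The class name and the sample latencies are made up.

```java
import java.util.Arrays;

class LatencyPercentiles {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Made-up latencies in ms: nine fast requests and one very slow one.
        long[] latencies = {10, 11, 12, 12, 13, 13, 14, 15, 16, 900};
        System.out.println("mean = " + mean(latencies));
        System.out.println("p50  = " + percentile(latencies, 50));
        System.out.println("p99  = " + percentile(latencies, 99));
    }
}
```

With this sample the single slow request pulls the mean far away from the median, while the 99th percentile exposes the slow request directly.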

Standardized indicators

It is recommended that you standardize on common definitions for SLIs.

For example

  • Aggregation interval : averaged over 1 minute
  • Aggregation region : all the tasks in a cluster
  • How frequently measurements are made : every 10 seconds
  • Which requests are included : HTTP GET from black-box monitoring
  • How the data is acquired : through monitoring systems
  • Data-access latency : Time to last byte

To save effort, build a set of reusable SLI templates for each common metric. These also make it simple for everyone to understand what a specific SLI means.

Reference – Google SRE book / Chapter 4. Service Level Objectives

SRE #1 – What are DevOps and SRE?

DevOps

DevOps is a set of software development practices that combines software development (Dev) and information technology operations (Ops) to shorten the systems development life cycle while delivering features, fixes, and updates frequently in close alignment with business objectives.
(from wikipedia)

Developers care about features and velocity. Operations cares about stability.

DevOps is a culture and a set of disciplines for breaking down silos between developers and operators.

DevOps can be broken down into five key areas:

  • Reduce organizational silos
    Share ownership with developers. Use the same tooling across development and operations, so that everyone has the same view and the same approach.
  • Accept failure as normal
    Blameless postmortems.
    They make sure that the same problem will not happen again.
    An error budget defines how much failure the system is allowed.
  • Implement gradual change
    Canary deployment
  • Leverage tooling & automation
    How much toil do we have? Remove manual work through automation.
    Provide the same tooling to everyone
  • Measure everything
    Measure reliability. Measure toil to become more efficient.
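The error budget mentioned in the list above is commonly computed as (1 - SLO target) of the events in a window. Below is a minimal sketch under that assumption. The class name, method names and sample numbers are made up for illustration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class ErrorBudget {
    // Number of events that may fail in a window while the SLO is still met.
    // BigDecimal avoids binary floating-point surprises such as (1 - 0.9) * 10 < 1.
    static long allowedFailures(double sloTarget, long totalEvents) {
        BigDecimal budgetRatio = BigDecimal.ONE.subtract(BigDecimal.valueOf(sloTarget));
        return budgetRatio.multiply(BigDecimal.valueOf(totalEvents))
                .setScale(0, RoundingMode.FLOOR)
                .longValueExact();
    }

    // Positive: budget left for risky changes. Negative: the budget is overspent.
    static long remaining(double sloTarget, long totalEvents, long failedEvents) {
        return allowedFailures(sloTarget, totalEvents) - failedEvents;
    }

    public static void main(String[] args) {
        // Made-up numbers: 99.9% target, 1,000,000 requests, 400 failures so far.
        System.out.println("allowed   = " + allowedFailures(0.999, 1_000_000));
        System.out.println("remaining = " + remaining(0.999, 1_000_000, 400));
    }
}
```

While the remaining budget is positive, the team can keep shipping changes. When it goes negative, stability work gets priority.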

Difference between SRE and DevOps

DevOps is more of a culture and a set of disciplines. SRE is a concrete practice of DevOps.

“class SRE implements DevOps”

DevOps is a set of practices and a culture designed to break down silos in IT, ops, networks, security, etc.

SRE is a set of practices Google has found to work, some beliefs that animate those practices, and a job role.

SRE Concept

Developers continuously push new features to production for velocity, and the operations team continuously rejects deployments for stability. This brings chaos to production. Google had a similar problem around the year 2000, when they were developing Google Search. Google tried to resolve this problem in a different way, which became SRE. It defines three ways to resolve the problem:

  • Define availability
  • Level of availability
  • Plan in case of failure

The Google team shared and agreed on this approach, from individual contributors all the way up to the vice-president level. After that, they shared responsibility for availability.

They defined service level objectives in collaboration with the product owners. Agreeing on the metrics in advance reduces confusion.

First speech in English

This week, I gave my first speech at a public technical conference.
Every year, Google Cloud holds a public conference, Google Cloud Next, and I attended as a breakout session speaker. I spoke about logging with the Stackdriver Logging product.

Before this speech, I had given a number of talks in English to customers and internal teams, but none of them were public. This was my first speech in English at a public tech conference. I have given many speeches at big tech conferences in Korean; the biggest had over 1,000 attendees. This first English speech had only around 200~300 attendees, but I was very nervous. I spent a lot of time preparing: writing a script and rehearsing by myself. In the end it was only a 50% success. The content was well prepared. The only problem was my English. My accent and pronunciation were very bad. My daughter told me that she could not understand me. 😦

Before the speech I prepared a script, but it was useless. The script was so long that it was hard to remember. So I gave up memorizing it and just practiced a number of times.

On the positive side, I received a number of compliments from attendees after my speech. It means that the content itself was not bad. I realized that my content may work outside of Korea.

Overall, it was a very good experience for me. After this speech I have more self-confidence, and I will apply to speak in English at more internal and external events. It will help me grow.

Actually, I had not planned to speak at this conference. But our ex-manager (Jenny) encouraged me to be there, and one of my colleagues (Harry) volunteered as a speaker. That motivated me.

When I watch the video again, it makes me feel embarrassed, but it also makes me proud of my challenge.

A short memory from April 2019, after Google Next

Disappointed in WordPress

I just opened this blog to practice my English, and today I wanted to note what I learned before posting articles on my Korean blog.

I was interested in WordPress because it seems easy and has a lot of plug-ins.

But today I was disappointed by wordpress.com. Today’s post was not about everyday life. It was a technical article, and a technical article always needs to include source code.

Every time I use my Korean blog, it is painful to format my code. So I expected wordpress.com to have plug-ins for source code. I found one!! But it can only be installed on the commercial version.

Because this blog is not my main blog, I don’t want to invest my time in it.

Maybe I need to consider rehosting my English blog on GitHub or somewhere else.

Structured logging (SLF4J + JSON)

Recently I have become interested in logging systems. In modern system architecture, a log is not just a simple text message. It carries more meaning and can be used for data analytics. For this reason, log data is exported to data analytics systems such as ELK, BigQuery, etc.

That raises a problem. A single log line can combine multiple pieces of information, each of which maps to a field in a database. If we write logs as plain text, they are very hard to parse.

For these reasons, structured logging can be one of the solutions, and it is also recommended as a logging best practice. I already had experience with structured logging in JSON format using Google Cloud Stackdriver Logging. But that has a dependency on Google Stackdriver. So I tried to find an alternative solution in open source. Surprisingly, there was little guidance on solutions. However, in Java there were a number of options.

JSON logging itself is not difficult: if you simply add an appender with a JSON layout, it works. This is logback.xml. In a Spring Boot project, this file should be in src/main/resources.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<appender name="json" class="ch.qos.logback.core.ConsoleAppender">
   <layout class="ch.qos.logback.contrib.json.classic.JsonLayout">
       <jsonFormatter
           class="ch.qos.logback.contrib.jackson.JacksonJsonFormatter">
           <prettyPrint>true</prettyPrint>
       </jsonFormatter>
       <timestampFormat>yyyy-MM-dd' 'HH:mm:ss.SSS</timestampFormat>
   </layout>
</appender>

<logger name="jsonLogger" level="TRACE">
   <appender-ref ref="json" />
</logger>
</configuration>

This is part of pom.xml for the Maven build. One thing to be aware of: please check the versions of the libraries. I spent almost an hour because of an older version.

<dependencies>
   <!-- other dependencies omitted -->
   <dependency>
       <groupId>ch.qos.logback.contrib</groupId>
       <artifactId>logback-json-classic</artifactId>
       <version>0.1.5</version>
   </dependency>

   <dependency>
       <groupId>ch.qos.logback.contrib</groupId>
       <artifactId>logback-jackson</artifactId>
       <version>0.1.5</version>
   </dependency>

   <dependency>
       <groupId>com.fasterxml.jackson.core</groupId>
       <artifactId>jackson-databind</artifactId>
       <version>2.9.3</version>
   </dependency>
</dependencies>

After this setup, we can use JSON structured logging. Before calling the logger, we need to import the SLF4J packages.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

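And let’s log:

// inside DemoApplication, the Spring Boot main class
// the logger name must match the <logger name="jsonLogger"> entry in logback.xml
static Logger logger = LoggerFactory.getLogger("jsonLogger");

public static void main(String[] args) {
    SpringApplication.run(DemoApplication.class, args);
    logger.debug("Debug message");
}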

This is output

{
 "timestamp" : "2019-03-06 01:23:11.738",
 "level" : "DEBUG",
 "thread" : "main",
 "logger" : "jsonLogger",
 "message" : "Debug message",
 "context" : "default"
}

As you can see, the log message is printed as the “message” element in the JSON document.

But this is not enough. If we use logs for data analytics, they should carry multiple fields.

How can we add multiple fields? SLF4J itself has a very interesting class: MDC (Mapped Diagnostic Context, org.slf4j.MDC). It is basically a key/value map. If you put a key and a value into the MDC, they are printed as part of the log.

This is the code:

MDC.put("userid","terry");
logger.info("this is log");

And this is output

{
 "timestamp" : "2019-03-06 01:23:11.757",
 "level" : "INFO",
 "thread" : "main",
 "mdc" : {
   "userid" : "terry"
 },
 "logger" : "jsonLogger",
 "message" : "this is log",
 "context" : "default"
}

I put the key “userid” with the value “terry”. The record was printed inside the “mdc” element of the JSON log document.

Let’s add one more element like this:

MDC.put("userid","terry");
MDC.put("client","galaxy s8");

logger.info("this is second log");

And this is output

{
 "timestamp" : "2019-03-06 01:23:11.758",
 "level" : "INFO",
 "thread" : "main",
 "mdc" : {
   "userid" : "terry",
   "client" : "galaxy s8"
 },
 "logger" : "jsonLogger",
 "message" : "this is second log",
 "context" : "default"
}

MDC values persist within the same thread. So if you don’t delete a value, it will keep being printed in every log. To remove data, you can use MDC.remove(key) to delete one key/value pair, or MDC.clear() to delete all of them.

However, this approach is still not elegant, because we cannot change the root element name “mdc”, and we cannot have multiple root elements.

I briefly checked a number of documents, and it looks like this can be supported by the Logstash appender, which is part of the ELK stack. I will invest more time to find a more suitable solution. As I mentioned at the beginning of the article, I prefer not to be locked in to a specific vendor.

Starting a new blog

Today I opened a new blog. I have a Korean blog, and I also had an English blog before, but I gave up updating the English one. It takes a lot of effort to manage blogs in two languages.

However, the reason I started this new blog is my English. My English has not improved for a long time. Today I attended a Toastmasters class, following Jerry, my friend, mentor and colleague.

A Toastmasters class is a kind of regular meet-up for improving speaking skills, and it is also held in English. I expected to improve my English through Toastmasters, but that is not all. I got a lot of motivation today and wanted to record my thoughts somewhere. During the class, I tried to prepare my feedback in English, but I couldn’t find the exact words to express my thoughts. One good way to practice is writing articles. Most of my English writing is business email or technical documentation, but that is not enough to communicate in English. That is the reason why I am starting a new blog in English.

Today, in the Toastmasters class, I was deeply impressed by the attendees. They are not professional speakers, but I was surprised by the quality of their speeches. I thought they were professionals. Watching their speeches, I could see that they were well prepared. I assume they spent a lot of time preparing. It made me feel ashamed. As a sales engineer, I am a professional speaker, at least in my area. Today I had a customer workshop and gave a speech there. In my work, technical speeches are routine, so I didn’t prepare much. But the attendees at Toastmasters put a lot of effort into their preparation. It made me rethink my own attitude.

I attended Toastmasters just to broaden my network and improve my English. But I think I got more than I expected: motivation and passion.

In April, I will give a very big speech in my life. One of my dreams is to give a technical speech on a global stage. I will be a speaker at Google Next, which is one of the biggest Google events. It will be my first English speech on a big stage. I have given speeches at small workshops and on small stages, but this time will be very big. I feel lucky to have met Toastmasters before the speech.

Today’s Toastmasters session was a speech contest, which is not an ordinary meeting. I will attend Toastmasters next week to observe how a regular meeting goes. In addition, Jerry recommended that I attend another Toastmasters class. The Bundang class looks good. I will try to find a time slot. Ah, not find. I will make a slot and attend.