Introduction to Site Reliability Engineering

The idea behind it: work for the product

Site Reliability Engineering (SRE) is the smarter version of devops engineering (*1). It comes with a set of practices for managing the infrastructure behind a product, where the word "product" only has meaning because people use it to complete user stories and, as a result, a business can succeed. By focusing on the product and not only on technology, we maximize the value we provide and reduce the effort spent on technical tasks that don't matter to end users.
The fundamental idea isn't new: it is already a well-known principle in many other areas of software development for building and maintaining a cost-effective product.

The moment it clicked

What blew my mind about this approach is its inclination to embrace risk: making something perfect 100% of the time is something almost nobody can afford. It's okay to have errors and downtime as long as we define what is acceptable and what is too expensive to improve. For that reason we need to be sure we are measuring everything, and measuring it smartly, even when it's hard to measure.
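To make "what is acceptable" concrete, it can help to translate an availability target into the downtime it allows. The sketch below is only arithmetic; the 30-day window and the example targets are assumptions chosen for illustration, not figures from the training.

```python
from datetime import timedelta


def allowed_downtime(target_percent: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime that still meets an availability target over the window."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    # Whatever is not covered by the target is time we are allowed to be down.
    return window * (1 - target_percent / 100)


# Illustrative targets only: each extra "nine" cuts the allowance by ten.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% over 30 days allows {allowed_downtime(target)} of downtime")
```

Under these assumptions a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which gives a feel for how quickly each extra nine gets expensive.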

Embracing risk, not alert noise

We all know how to pull logs from our systems into a log analytics tool and create a metric that triggers an alert. But as the business grows and the software system does more things, that kind of metric won't tell us how severe each problem is. We'll have many more alerts than we can handle, and we'll need to either prioritise or keep hiring more engineers to react to potentially meaningless alerts.

Buzzwords

The buzzwords below will only be meaningful if you first do some pre-work: creating the Site Reliability Strategy Plan that I explain in the next section. In it you collect your user journeys and analyse, for each of them, which services keep it alive, what the consequences of failure are, and how you can measure how well the technology supports the user journey.
  • SLI - Service Level Indicator (metric): This is just the thing you're measuring, or simply the metric. We use the term "SLI" for a metric we've defined by doing the pre-work of understanding its importance and how it connects to a user journey.
  • SLO - Service Level Objective (target): The level you need your SLI to stay at so that it doesn't cause a major disruption of your service and negatively impact your business.
    • Error budget: Once you've set your SLO, the remaining percentage is your error budget (for example, a 99.9% SLO leaves a 0.1% error budget). It's up to each team to manage it and ensure the SLIs stay within their targets.
  • SLA - Service Level Agreement (contract): What the business managers agree with end users. As an SRE you don't define these; instead you work to avoid breaching the agreement by meeting the SLOs you defined. If your team starts breaking the SLA, you'd usually stop all sprint work and focus on recovering the SLA.
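The relationship between an SLI, an SLO and the error budget can be sketched in a few lines. This is a minimal illustration, assuming a request-based SLI (good requests out of total requests); the function names and the numbers are hypothetical.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """SLI expressed as the percentage of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100 * good_events / total_events


def error_budget_remaining(good_events: int, total_events: int,
                           slo_percent: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    # The error budget is whatever the SLO leaves over: 100% minus the target.
    allowed_bad = total_events * (100 - slo_percent) / 100
    if allowed_bad == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


# Hypothetical month: one million requests, 500 of them failed, SLO of 99.9%.
print(f"SLI: {sli_percent(999_500, 1_000_000):.2f}%")
print(f"Budget left: {error_budget_remaining(999_500, 1_000_000, 99.9):.0%}")
```

In this made-up month the team has spent about half of its budget, so there is still room to take risks such as shipping a large change.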

Pre-work: SRE Strategy

The SRE plan lays out, for each of your user journeys, its indicators, goals and priority.
  1. Collect all your user journeys.
  2. For each user journey, identify the technical components that support it.
  3. Determine the SLI for each user journey.
    1. It should be something the user cares about.
    2. It should be a single data point that can represent the user journey.
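One way to keep the plan from the steps above is as structured data instead of a document. The sketch below is an assumption about how that could look; the `UserJourney` class, its fields, and the example journeys and components are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UserJourney:
    name: str
    components: list[str]   # step 2: technical components supporting the journey
    sli: str                # step 3: the single data point the user cares about
    slo_percent: float      # the goal for that indicator
    priority: int           # 1 = most important


# Step 1: collect the user journeys (made-up examples).
plan = [
    UserJourney(
        name="Checkout",
        components=["web frontend", "payments API", "orders database"],
        sli="percentage of checkout requests that succeed",
        slo_percent=99.9,
        priority=1,
    ),
    UserJourney(
        name="Browse catalogue",
        components=["web frontend", "search service"],
        sli="percentage of catalogue pages served in under one second",
        slo_percent=99.0,
        priority=2,
    ),
]


def journeys_depending_on(plan: list[UserJourney], component: str) -> list[str]:
    """Names of the journeys affected if a component fails, most important first."""
    affected = [j for j in plan if component in j.components]
    return [j.name for j in sorted(affected, key=lambda j: j.priority)]


print(journeys_depending_on(plan, "web frontend"))
```

Keeping the plan as data makes the "consequences of failure" question easy to ask for any component.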

Do I stop measuring everything else?

No, of course not. Metrics for traffic, saturation, latency and error rate across all our systems are still necessary and valuable, but we won't be alerting on those signals.
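The split between "what we measure" and "what we alert on" can be sketched as follows. This is a simplified illustration, not a real alerting setup; the `Snapshot` class, its field names and the values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Raw system signals: still collected for dashboards and debugging.
    cpu_saturation: float   # 0.0 to 1.0
    p99_latency_ms: float
    error_rate: float       # 0.0 to 1.0
    # User-journey SLI and its SLO, both in percent.
    journey_sli: float
    journey_slo: float


def should_page(snapshot: Snapshot) -> bool:
    """Page only when the user-journey SLI is below its SLO."""
    # The raw signals are deliberately ignored here: they explain a problem,
    # they do not decide whether someone gets woken up.
    return snapshot.journey_sli < snapshot.journey_slo


busy_but_fine = Snapshot(0.95, 800.0, 0.0005, journey_sli=99.95, journey_slo=99.9)
users_hurting = Snapshot(0.40, 120.0, 0.008, journey_sli=99.2, journey_slo=99.9)

print(should_page(busy_but_fine))
print(should_page(users_hurting))
```

In the first snapshot the system looks stressed but users are within target, so nobody is paged; in the second the raw signals look calm but the journey is below its SLO, so the alert fires.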

(Image caption: You may or may not need this many screens.)

The next post will continue this write-up: I'll summarise lower-level techniques for applying this knowledge, seasoned with my own experience.

Clarification notes:

  1. *I know some say devops is not a role but a philosophy, but I still see "devops" in job offers, so let's stick with the common meaning.

Sources:

  • I drafted this write-up from a training session at work given by Mike Bryant and Pratiksha Vekaria, who decided to give all our teams an amazing training session. Since then, my team has started applying this approach, we are more conscious about what we use as an alert, and our on-call rota has improved considerably 🎉.
  • Other quick references I used to disambiguate some concepts:
    • https://sre.google/sre-book/service-level-objectives