The idea behind: Work for the product
Site Reliability Engineering is the smarter version of devops engineering*. It comes with a set of practices for managing the infrastructure behind a product, where the word "product" only has meaning because people use it to complete user stories and, as a result, a business can succeed. By focusing on the product and not only on technology, we maximize the value we provide and reduce the effort spent on technical tasks that don't matter to end users.
The fundamental idea isn't new: building and maintaining a cost-effective product is already a well-known principle in many other areas of software development.
The moment it clicked
What blew my mind about this approach is its inclination to embrace risk: making something perfect 100% of the time is something almost nobody can afford. It's okay to have errors and downtime as long as we define what is acceptable and what is too expensive to improve. For that reason we need to be certain we are measuring everything (but smartly, even when it's hard to measure).
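To put numbers on "what is acceptable", here is a minimal sketch that converts an availability target into the downtime it tolerates. The function name `allowed_downtime` and the 30-day window are assumptions for illustration, not something from the training. A 99.9% target over 30 days, for instance, leaves roughly 43 minutes of downtime.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Downtime that still meets the target over the given window.

    Both the name and the default 30-day window are illustrative assumptions.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    # Whatever the target does not cover is time we can afford to be down.
    return window * (1.0 - availability_target)


# Each extra "nine" cuts the tolerated downtime by a factor of ten.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime(target)}")
```

Seeing how quickly the allowance shrinks makes it easier to discuss which target is worth paying for.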
Embracing risk, not alert noise
We all know how to pull logs from our systems into a log analytics tool and create a metric that triggers an alert. But when the business grows and the software system does more things, that kind of metric won't help us understand the severity of each alert. We'll have many more alerts than we can handle, and we'll need to either prioritise or keep hiring more engineers to react to potentially meaningless alerts.
Buzzwords
- SLI - Service Level Indicator (metric): This is simply the thing you're measuring, the metric. We use the term "SLI" for a metric we've defined after doing the pre-work of understanding its importance and how it connects to a user journey.
- SLO - Service Level Objective (target): It's where you need your SLI to be so that it doesn't cause a major disruption of your service and negatively impact your business.
- Error budget: Once you've set your SLO, the remaining percentage is your error budget, and it's up to each team to manage it and ensure the SLIs stay within their targets. For example, a 99.9% SLO leaves a 0.1% error budget.
- SLA - Service Level Agreement (contract): What the business managers agree with end users. As an SRE you don't define these; instead, you work to avoid breaching them by meeting the SLOs you defined. If your team starts breaking the SLA, you'd usually stop all sprint work and focus on recovering it.
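The relationship between SLI, SLO and error budget can be sketched in a few lines. This is a minimal illustration assuming a request-based SLI (good events divided by total events). The `Slo` class, the function names and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for one SLI (hypothetical structure, for illustration only)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    @property
    def error_budget(self) -> float:
        """Fraction of events allowed to fail: what the SLO leaves over."""
        return 1.0 - self.target


def sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events; no traffic counts as healthy."""
    return 1.0 if total_events == 0 else good_events / total_events


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Share of the error budget still unspent; negative means the SLO is missed."""
    if total_events == 0:
        return 1.0
    allowed_bad = slo.error_budget * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Made-up numbers: 600 failed requests out of one million.
checkout = Slo("checkout-availability", target=0.999)
print(f"SLI: {sli(999_400, 1_000_000):.4%}")
print(f"Budget left: {budget_remaining(checkout, 999_400, 1_000_000):.0%}")
```

With these sample numbers about 60% of the budget is spent, so the team still has room for risk; once the remaining share approaches zero, reliability work should take priority.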
Pre-work: SRE Strategy
- Collect all your user journeys
- For each user journey, identify the technical components that support it.
- Determine the SLI for each user journey. It should be:
  - Something the user cares about
  - A single data point that can represent the user journey
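One way to keep track of this pre-work is a small inventory of journeys, their components and the chosen SLI. The sketch below is only an illustration: the `UserJourney` structure, the journey names, the component names and the SLI wording are all made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserJourney:
    """One user journey and what supports it (all names are hypothetical)."""
    name: str
    components: List[str] = field(default_factory=list)
    sli: Optional[str] = None  # the single data point chosen for this journey


def missing_pre_work(journeys: List[UserJourney]) -> List[str]:
    """Names of journeys that still lack components or an SLI."""
    return [j.name for j in journeys if not j.components or j.sli is None]


journeys = [
    UserJourney(
        "checkout",
        ["web-frontend", "payments-api", "orders-db"],
        sli="share of checkout requests that succeed in under 2 s",
    ),
    UserJourney("search", ["web-frontend", "search-service"]),
]

# Lists the journeys that are not ready for an SLO yet.
print(missing_pre_work(journeys))
```

Here only the "search" journey is reported, because it has components but no SLI yet.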
Do I stop measuring everything else?
You may or may not need this many screens.
Clarification notes:
- *I know some say devops is not a role but a philosophy, but I'm still seeing devops in job offers, so let's stay with the common meaning.
Sources:
- I drafted this write-up from an amazing training session that Mike Bryant and Pratiksha Vekaria decided to give all our teams at work. Since then, my team has started applying this approach, we are more conscious about what we use as an alert, and our on-call rota has improved considerably 🎉.
- Other quick references I used to disambiguate some concepts:
  - https://sre.google/sre-book/service-level-objectives