Part of the Multi-team Software Delivery Assessment (README)
Copyright © 2018-2021 Conflux Digital Ltd
Licensed under CC BY-SA 4.0
Permalink: SoftwareDeliveryAssessment.com
Based on selected criteria from the following books:
- Site Reliability Engineering by Betsy Beyer, Chris Jones, Jennifer Petoff, & Niall Murphy
- The Site Reliability Workbook edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, & Stephen Thorne
- Seeking SRE edited by David N. Blank-Edelman
- Team Guide to Software Operability by Matthew Skelton, Alex Moore, & Rob Thatcher
Purpose: Assess the approach to reliability and SRE practices within the software system.
Method: Use the Spotify Squad Health Check approach to assess the team's answers to the following questions, capturing the answers as you go:
Question | Tired (1) | Inspired (5) |
---|---|---|
1. Service Availability - How available (in "nines") does your service or application need to be and how do you know or decide? | We don't know how available our service needs to be --OR-- The availability "needs to be 100%". | The availability target is based on clear business priorities and is less than 100% (see sketch 1 below the table). |
2. User Goals and SLIs - What should your service/application do from the viewpoint of the user? | We do not have a clear definition of what our application or service does from the user perspective. | We have clear, user-centric definitions of the application/service capabilities and outcomes from a user perspective. |
3. Understanding users and behaviour - Who are the users of the software and how do they interact with the software? How do you know? | We don't really know how our users interact with our application/service --OR-- We don't really know who our users are. | We have user personas validated through user research, and we measure and track usage of the applications/services using digital telemetry. |
4. SLIs/SLOs - How do you know when users have experienced an outage or unexpected behaviour in the software? | We know there is an outage or problem only when users complain via chat or the help desk. | We proactively monitor the user experience using synthetic transactions across the key user journeys (see sketch 2 below the table). |
5. Service Health - What is the single most important indicator or metric you use to determine the health and availability of your software in production/live? | We don't have a single key metric for the health and availability of the application/service. | We have a clear, agreed key metric for each application/service and we display this figure on a team-visible dashboard. The dashboard data is updated at least every 10 minutes. |
6. SLIs - What combination of three or four indicators or metrics do you use (or could/would you use) to provide a comprehensive picture of the health and availability of your software in production/live? | We don't have a set of key metrics for the health and availability of the application/service. | We have a clear, agreed set of key metrics for each application/service and we display these metrics on a team-visible dashboard (see sketch 3 below the table). The dashboard data is updated at least every 10 minutes. |
7. Error Budget and similar mechanisms - How does the team know when to spend time on operational aspects of the software (logging, metrics, performance, reliability, security, etc.)? Does that time actually get spent? | We spend time on operational aspects only when there is a problem that needs fixing. | We allocate between 20% and 30% of our time to operational aspects and we check this each week, alerting if the time has not been spent --OR-- We use SRE Error Budgets to plan our time spent on operational aspects (see sketch 4 below the table). |
8. Alerting - What proportion (approximately) of your time and effort as a team do you spend on making alerts and operational messages more reliable and more relevant? | We spend as little time as possible on alerts and operational messages - we need to focus on user-visible features. | We regularly spend time reviewing and improving alerts and operational messages. |
9. Toil and fixing problems - What proportion (approx) of your time gets taken up with incidents from live systems and how predictable is the time needed to fix problems? | We do not deal with issues from live systems at all - we focus on new features --OR-- Live issues can really affect our delivery cadence and are very disruptive. | We allocate a consistent amount of time for dealing with live issues --OR-- One team member is responsible for triage of live issues each week --OR-- We rarely have problems with live issues because the software works well. |
10. Time to Diagnose - How long does it typically take to diagnose problems in the live/production environment? This is the time taken to understand and pinpoint what is wrong (not to fix or remediate the problem). | It can take hours or days to diagnose most problems in live/production. | It typically takes seconds or minutes to diagnose most problems in live/production. |
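
The sketches below illustrate, in Python, some of the mechanisms the questions point at. They are minimal sketches under stated assumptions, not prescribed implementations; names, URLs, record shapes, and thresholds are illustrative only.

Sketch 1 - Downtime budget (question 1). An availability target below 100% implies a concrete downtime budget per window; this is the arithmetic behind choosing a number of "nines".

```python
def downtime_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    # 99.9% ("three nines") over 30 days allows roughly 43.2 minutes of downtime.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {downtime_budget_minutes(target):.1f} min per 30 days")
```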
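
Sketch 2 - Synthetic transactions (question 4). A probe that exercises key user journeys on a schedule, assuming hypothetical journey URLs and the third-party requests library; a real setup would run from outside the production network and ship results to the monitoring system.

```python
import time

import requests  # third-party HTTP client: pip install requests

# Hypothetical journeys; real journey names and URLs are team-specific.
JOURNEYS = {
    "login": "https://example.com/login",
    "checkout": "https://example.com/checkout/health",
}


def probe(name: str, url: str, timeout_s: float = 5.0) -> dict:
    """Run one synthetic transaction, recording success and latency."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout_s)
        ok = response.status_code < 500
    except requests.RequestException:
        ok = False
    return {"journey": name, "ok": ok, "latency_s": time.monotonic() - start}


if __name__ == "__main__":
    for name, url in JOURNEYS.items():
        print(probe(name, url))  # in practice, emit these records to your telemetry pipeline
```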
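
Sketch 3 - Request-based SLIs (questions 5 and 6). Two common indicators computed from a batch of request records (the record shape here is assumed); a team-visible dashboard could recompute these over a sliding window at least every 10 minutes.

```python
from statistics import quantiles

# Assumed record shape: {"status": int, "latency_ms": float}


def availability_sli(records: list) -> float:
    """Good events divided by total events - the classic request-based SLI."""
    good = sum(1 for r in records if r["status"] < 500)
    return good / len(records)


def latency_p99_ms(records: list) -> float:
    """99th-percentile latency; quantiles(n=100) yields 99 cut points."""
    latencies = [r["latency_ms"] for r in records]
    return quantiles(latencies, n=100)[98]
```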
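
Sketch 4 - Error budget burn rate (questions 7 and 8). The multiwindow burn-rate alerting pattern from The Site Reliability Workbook: page only when a short and a long window are both consuming the error budget much faster than planned. The 14.4 threshold is the Workbook's example for spending 2% of a 30-day budget in one hour; windows and thresholds are policy choices for each team.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Budget consumption speed: 1.0 means spending exactly the budget over the SLO window."""
    return error_ratio / (1.0 - slo_target)


def should_page(err_ratio_5m: float, err_ratio_1h: float, slo_target: float = 0.999) -> bool:
    """Require both windows to burn fast, so brief blips do not page anyone."""
    return (burn_rate(err_ratio_1h, slo_target) > 14.4
            and burn_rate(err_ratio_5m, slo_target) > 14.4)


# Example: 2% errors in both windows against a 99.9% SLO gives burn rate 20 -> page.
assert should_page(0.02, 0.02)
```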