Part of the Multi-team Software Delivery Assessment (README)
Copyright © 2018-2021 Conflux Digital Ltd
Licensed under CC BY-SA 4.0
Permalink: SoftwareDeliveryAssessment.com
Based on selected criteria from the following books:
- Site Reliability Engineering by Betsy Beyer, Chris Jones, Jennifer Petoff, & Niall Murphy
- The Site Reliability Workbook edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, & Stephen Thorne
- Seeking SRE edited by David N. Blank-Edelman
- Team Guide to Software Operability by Matthew Skelton, Alex Moore, & Rob Thatcher
Purpose: Assess the approach to reliability and SRE practices within the software system.
Method: Use the Spotify Squad Health Check approach to assess the team's answers to the following questions, capturing the answers as you go:
Question | Tired (1) | Inspired (5) |
---|---|---|
1. Service Availability - How available (in "nines") does your service or application need to be and how do you know or decide? | We don't know how available our service needs to be --OR-- The availability "needs to be 100%". | The availability target is based on clear business priorities and is less than 100% (see sketch 1 below the table). |
2. User Goals and SLIs - What should your service/application do from the viewpoint of the user? | We do not have a clear definition of what our application or service does from the user perspective. | We have clear, user-centric definitions of the application/service capabilities and outcomes from a user perspective. |
3. Understanding users and behaviour - Who are the users of the software and how do they interact with the software? How do you know? | We don't really know how our users interact with our application/service --OR-- We don't really know who our users are. | We have user personas validated through user research, and we measure and track usage of the applications/services using digital telemetry. |
4. SLIs/SLOs - How do you know when users have experienced an outage or unexpected behaviour in the software? | We know there is an outage or problem only when users complain via chat or the help desk. | We proactively monitor the user experience using synthetic transactions across the key user journeys (see sketch 2 below the table). |
5. Service Health - What is the single most important indicator or metric you use to determine the health and availability of your software in production/live? | We don't have a single key metric for the health and availability of the application/service. | We have a clear, agreed key metric for each application/service and we display this figure on a team-visible dashboard. The dashboard data is updated at least every 10 minutes. |
6. SLIs - What combination of three or four indicators or metrics do you use (or could/would you use) to provide a comprehensive picture of the health and availability of your software in production/live? | We don't have a set of key metrics for the health and availability of the application/service. | We have a clear, agreed set of key metrics for each application/service and we display these metrics on a team-visible dashboard (see sketch 3 below the table). The dashboard data is updated at least every 10 minutes. |
7. Error Budget and similar mechanisms - How does the team know when to spend time on operational aspects of the software (logging, metrics, performance, reliability, security, etc.)? Does that time actually get spent? | We spend time on operational aspects only when there is a problem that needs fixing. | We allocate between 20% and 30% of our time to operational aspects and we check this each week, alerting if the time has not been spent --OR-- We use SRE Error Budgets to plan our time spent on operational aspects (see sketch 4 below the table). |
8. Alerting - What proportion (approximately) of your time and effort as a team do you spend on making alerts and operational messages more reliable and more relevant? | We spend as little time as possible on alerts and operational messages - we need to focus on user-visible features. | We regularly spend time reviewing and improving alerts and operational messages. |
9. Toil and fixing problems - What proportion (approx) of your time gets taken up with incidents from live systems and how predictable is the time needed to fix problems? | We do not deal with issues from live systems at all - we focus on new features --OR-- Live issues can really affect our delivery cadence and are very disruptive. | We allocate a consistent amount of time for dealing with live issues --OR-- One team member is responsible for triage of live issues each week --OR-- We rarely have problems with live issues because the software works well. |
10. Time to Diagnose - How long does it typically take to diagnose problems in the live/production environment? This is the time taken to understand and pinpoint what is wrong (not to fix or remediate the problem). | It can take hours or days to diagnose most problems in live/production. | It typically takes seconds or minutes to diagnose most problems in live/production. |
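
The sketches below illustrate, in Python, some of the mechanisms the questions point at. They are minimal sketches under stated assumptions, not prescribed implementations; names, URLs, record shapes, and thresholds are illustrative only.

Sketch 1 - Downtime budget (question 1). An availability target below 100% implies a concrete downtime budget per window; this is the arithmetic behind choosing a number of "nines".

```python
def downtime_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    # 99.9% ("three nines") over 30 days allows roughly 43.2 minutes of downtime.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} -> {downtime_budget_minutes(target):.1f} min per 30 days")
```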
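
Sketch 2 - Synthetic transactions (question 4). A probe that exercises key user journeys on a schedule, assuming hypothetical journey URLs and the third-party requests library; a real setup would run from outside the production network and ship results to the monitoring system.

```python
import time

import requests  # third-party HTTP client: pip install requests

# Hypothetical journeys; real journey names and URLs are team-specific.
JOURNEYS = {
    "login": "https://example.com/login",
    "checkout": "https://example.com/checkout/health",
}


def probe(name: str, url: str, timeout_s: float = 5.0) -> dict:
    """Run one synthetic transaction, recording success and latency."""
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout_s)
        ok = response.status_code < 500
    except requests.RequestException:
        ok = False
    return {"journey": name, "ok": ok, "latency_s": time.monotonic() - start}


if __name__ == "__main__":
    for name, url in JOURNEYS.items():
        print(probe(name, url))  # in practice, emit these records to your telemetry pipeline
```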
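
Sketch 3 - Request-based SLIs (questions 5 and 6). Two common indicators computed from a batch of request records (the record shape here is assumed); a team-visible dashboard could recompute these over a sliding window at least every 10 minutes.

```python
from statistics import quantiles

# Assumed record shape: {"status": int, "latency_ms": float}


def availability_sli(records: list) -> float:
    """Good events divided by total events - the classic request-based SLI."""
    good = sum(1 for r in records if r["status"] < 500)
    return good / len(records)


def latency_p99_ms(records: list) -> float:
    """99th-percentile latency; quantiles(n=100) yields 99 cut points."""
    latencies = [r["latency_ms"] for r in records]
    return quantiles(latencies, n=100)[98]
```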
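
Sketch 4 - Error budget burn rate (questions 7 and 8). The multiwindow burn-rate alerting pattern from The Site Reliability Workbook: page only when a short and a long window are both consuming the error budget much faster than planned. The 14.4 threshold is the Workbook's example for spending 2% of a 30-day budget in one hour; windows and thresholds are policy choices for each team.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Budget consumption speed: 1.0 means spending exactly the budget over the SLO window."""
    return error_ratio / (1.0 - slo_target)


def should_page(err_ratio_5m: float, err_ratio_1h: float, slo_target: float = 0.999) -> bool:
    """Require both windows to burn fast, so brief blips do not page anyone."""
    return (burn_rate(err_ratio_1h, slo_target) > 14.4
            and burn_rate(err_ratio_5m, slo_target) > 14.4)


# Example: 2% errors in both windows against a 99.9% SLO gives burn rate 20 -> page.
assert should_page(0.02, 0.02)
```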