# Operability check

Part of the Multi-team Software Delivery Assessment (README)

Copyright © 2018-2021 Conflux Digital Ltd

Licensed under CC BY-SA 4.0

Permalink: SoftwareDeliveryAssessment.com

Based on the operability assessment questions from *Team Guide to Software Operability* by Matthew Skelton, Alex Moore, and Rob Thatcher at OperabilityQuestions.com

Purpose: Assess the awareness and practices of the team in relation to software operability - readiness for Production

Method: Use the Spotify Squad Health Check approach to assess the team's answers to the following questions, and also capture the answers:

| Question | Tired (1) | Inspired (5) |
| --- | --- | --- |
| 1. Collaboration - How often and in what ways do we collaborate with other teams on operational aspects of the system, such as operational features (logging, monitoring, alerting, etc.) and NFRs? | We respond to the need for operational aspects after go-live, when tickets are raised by the live service teams | We collaborate on operational aspects from the very first week of the engagement/project |
| 2. Spend on operability - What proportion of product budget and team effort is spent addressing operational aspects? How do you track this? [Ignore infrastructure costs and focus on team effort] | We try to spend as little time and effort as possible on operational aspects / We do not track the spend on operational aspects at all | We spend around 30% of our time and budget addressing operational aspects, and raise an alert if focus on operational aspects drops |
| 3. Feature Toggles - How do we know which feature toggles (feature switches) are active for this subsystem? | We need to run diffs against config files to determine which feature toggles are active | We have a simple UI or API to report the active/inactive feature flags in an environment |
| 4. Config deployment - How do we deploy a configuration change without redeploying the software? | We cannot deploy a configuration change without deploying the software or causing an outage | We simply run a config deployment separately from the software / We deploy config together with the software without an outage |
| 5. System health - How do we know that the system is healthy (or unhealthy)? | We wait for checks made manually by another team to tell us if our software is healthy | We query the software using a standard HTTP health check URL, returning HTTP 200/500, etc. based on logic that we write in the code, and with synthetic transaction monitoring for key scenarios (see the health check sketch below the table) |
| 6. Service KPIs - How do we track the main service/system Key Performance Indicators (KPIs)? What are the KPIs? | We do not have service KPIs defined | We use logging and/or time series metrics to emit service KPIs that are picked up by a dashboard (see the metrics sketch below the table) |
| 7. Logging working - How do we know that logging is working correctly? | We do not test if logging is working | We test that logging is working using BDD feature tests that search for specific log message strings after a particular application behaviour is executed, and we can see logs appear correctly in the central log aggregation/search system (see the log test sketch below the table) |
| 8. Testability - How do we show that the software system is easy to test? What do we provide and to whom? | We do not explicitly aim to make our software easily testable | We run clients and external test packs against all parts of our software within our deployment pipeline |
| 9. TLS Certs - How do we know when an SSL/TLS certificate is close to expiry? | We do not know when our certificates are going to expire | We use auto-renewal of certificates combined with certificate monitoring/alerting tools to keep a live check on when certs will expire, so we can take remedial action ahead of time (see the expiry check sketch below the table) |
| 10. Sensitive data - How do we ensure that sensitive data in logs is masked or hidden? | We do not test for sensitive data in logs | We test that data masking is happening by using BDD feature tests that search for specific log message strings after a particular application behaviour is executed (see the masking sketch below the table) |
| 11. Performance - How do we know that the system/service performs within acceptable ranges? | We rely solely on the Performance team to validate the performance of our service or application | We run a set of indicative performance tests within our deployment pipeline on every check-in to version control |
| 12. Failure modes - How can we see and share the different known failure modes (failure scenarios) for the system? | We do not really know how the system might fail | We use a set of error identifiers to define the failure modes in our software, and we use these identifiers in our log messages (see the failure mode sketch below the table) |
| 13. Call tracing - How do we trace a call/request end-to-end through the system? | We do not trace calls through the system | We use a standard tracing library such as OpenTracing to trace calls through the system, and we collaborate with other teams to ensure that the correct tracing fields are maintained across component boundaries (see the tracing sketch below the table) |
| 14. Service status - How do we display the current service/system status to operations-facing teams? | Operations teams tend to discover the status indicators themselves | We build a dashboard in collaboration with the Operations teams so they have all the details they need in a user-friendly way, with UX a key consideration |
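
The "Inspired" answers above mention several concrete techniques; the sketches below illustrate some of them. For question 5, a minimal health check endpoint might look like the following, assuming a Python service built on the standard library only; the `/health` path and the `check_database()`/`check_queue()` helpers are hypothetical placeholders for real dependency checks.

```python
# Health check sketch (question 5): the service answers 200 when its own
# dependency checks pass and 500 when they fail.
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:
    return True  # placeholder: e.g. run "SELECT 1" against the database


def check_queue() -> bool:
    return True  # placeholder: e.g. ping the message broker


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            healthy = check_database() and check_queue()
            self.send_response(200 if healthy else 500)
            self.end_headers()
            self.wfile.write(b"OK" if healthy else b"UNHEALTHY")
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Synthetic transaction monitoring then sits on top of this: an external probe exercises a key user journey on a schedule and alerts when it fails.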
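For question 6, one common pattern is to emit KPIs as time series metrics that a dashboard scrapes. The sketch below assumes the `prometheus_client` Python library; the metric names and the `handle_checkout()` function are illustrative, not taken from the assessment.

```python
# Service KPI sketch (question 6): expose a counter and a latency histogram
# on /metrics so a dashboard can chart them.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

CHECKOUTS = Counter("checkouts_total", "Completed checkouts")
CHECKOUT_LATENCY = Histogram("checkout_latency_seconds", "Checkout handling time")


def handle_checkout() -> None:
    with CHECKOUT_LATENCY.time():              # record how long the work took
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    CHECKOUTS.inc()                            # count each completed checkout


if __name__ == "__main__":
    start_http_server(8000)  # serves the /metrics endpoint for scraping
    while True:
        handle_checkout()
```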
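For question 7, a test can trigger a behaviour and then assert that the expected log message was emitted. The sketch below assumes pytest and its built-in `caplog` fixture; `place_order()` and the `ORDER_ACCEPTED` message are hypothetical.

```python
# Logging test sketch (question 7): run a behaviour, then search the captured
# log records for the message string the behaviour is expected to emit.
import logging

logger = logging.getLogger("shop")


def place_order(order_id: str) -> None:
    # Behaviour under test; the log line is treated as part of its contract.
    logger.info("ORDER_ACCEPTED order_id=%s", order_id)


def test_order_is_logged(caplog):
    with caplog.at_level(logging.INFO, logger="shop"):
        place_order("abc-123")
    assert any("ORDER_ACCEPTED" in r.getMessage() for r in caplog.records)
```

A fuller BDD version would run the same assertion against the central log aggregation/search system rather than against in-process records.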
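For question 9, a monitoring job can connect to each endpoint, read the certificate's expiry date, and alert when it falls inside the renewal window. A minimal sketch using only the Python standard library; the host list and 30-day threshold are illustrative.

```python
# Certificate expiry sketch (question 9): report days remaining per host and
# flag anything close to expiry.
import socket
import ssl
import time


def days_until_expiry(host: str, port: int = 443) -> float:
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400


if __name__ == "__main__":
    for host in ("example.com",):
        remaining = days_until_expiry(host)
        status = "OK" if remaining > 30 else "ALERT: renew now"
        print(f"{host}: {remaining:.0f} days remaining - {status}")
```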
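For question 10, masking can be enforced at the logging layer and then verified by tests in the same style as question 7. The sketch below uses a `logging.Filter`; the card-number pattern is an illustrative example of sensitive data.

```python
# Data masking sketch (question 10): a logging filter rewrites any card-like
# number in the formatted message before it reaches the handlers.
import logging
import re

CARD_NUMBER = re.compile(r"\b\d{13,16}\b")


class MaskingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = CARD_NUMBER.sub("****MASKED****", record.getMessage())
        record.args = None  # message is already formatted; drop the args
        return True


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")
logger.addFilter(MaskingFilter())

logger.info("charging card %s for order abc-123", "4111111111111111")
# -> INFO:payments:charging card ****MASKED**** for order abc-123
```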
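For question 12, known failure modes can be given stable identifiers that appear in log messages and runbooks, so that alerts, dashboards, and operators all refer to the same scenario. The identifiers below are purely illustrative.

```python
# Failure mode sketch (question 12): a shared catalogue of error identifiers
# reused in log messages, and therefore searchable in log aggregation.
import logging
from enum import Enum


class FailureMode(str, Enum):
    DB_UNAVAILABLE = "ERR-DB-001"    # orders database unreachable
    PAYMENT_TIMEOUT = "ERR-PAY-002"  # payment provider did not respond in time


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shop")

logger.error("%s: could not reach the orders database",
             FailureMode.DB_UNAVAILABLE.value)
```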
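For question 13, the OpenTracing APIs have since been folded into OpenTelemetry, so a current sketch would typically use the OpenTelemetry Python SDK, as below. The span names and attribute are hypothetical, and a real system would export spans to a tracing backend rather than the console.

```python
# Call tracing sketch (question 13): create nested spans for a request so the
# end-to-end path is visible; context propagation across services carries the
# same trace ID over component boundaries.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("shop")

with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("order.id", "abc-123")
    with tracer.start_as_current_span("call_payment_service"):
        pass  # an HTTP client instrumented for context propagation would be called here
```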