Tackling Alert Fatigue

@caitie

Her slides are at: https://github.com/CaitieM20/Monitorama2016

  • "When alerts are more often false than true, the on-call's sense of urgency in responding to alerts is diminished.... the simple burden of alerts desensitizes the on-call to alerts..."
  • "When alarms are more often false than true, the nursing staff's sense of urgency in responding to alarms is diminished.... the simple burden of alerts desensitizes caregivers to alarms..."
  • Nurses were alerted too frequently for heart failure… so they ignored it.
  • And people died.
  • This is a people problem, not a tech problem.
  • Ignored alerts -> unreliable systems -> unhappy customers
  • Unplanned work -> inability to complete planned work -> less time to focus on core business.
  • Businesses measure this poorly.
  • People problem:
    • Fatigue + Fire-fighting = burnout.
  • Tackling Alert Fatigue in Hospitals
    • Increased thresholds for patient vitals
    • Only crisis alarms would emit audible alerts
    • Nursing staff were required to tune false-positive alarms
    • Paper: "Novel Approach to Cardiac Alarm Management on Telemetry Units"
    • This decreased fatalities.
  • Applied the above to Cuckoo at Twitter. Previously: 50 pages per week.
  • Started with a full alert audit. Took 2 weeks.
  • "Nothing to surface hidden assumptions like writing runbooks"
  • Runbooks (see the sketch after this list):
    • Table of contents,
    • General description,
    • List of dashboards,
    • Then all the alerts:
      • Title of alert
      • Impact to customer
        • (If no impact, candidate for alert deletion)
        • Don't page for something being weird if there's no real impact
      • Remediation steps
        • Includes customer communication
        • If you can't tell me what to do, delete the alert.
        • If you don't have control over what's alerting, don't alert.
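
A minimal sketch of how that audit rule-set might be encoded, assuming a hypothetical Alert record (the field names and the deletion_candidates helper are illustrative, not from the talk):

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """One runbook entry, mirroring the structure above (fields are hypothetical)."""
    title: str
    customer_impact: str                    # empty = no known customer impact
    remediation_steps: list[str] = field(default_factory=list)
    actionable: bool = True                 # False if on-call has no control

def deletion_candidates(alerts: list[Alert]) -> list[Alert]:
    """Apply the audit rules above: no customer impact, no remediation
    steps, or nothing the on-call can do -> candidate for deletion."""
    return [a for a in alerts
            if not a.customer_impact
            or not a.remediation_steps
            or not a.actionable]
```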
  • Empower the on-call
    • Tune alert thresholds
    • Disable or delete unactionable alerts
    • Business-hours-only alerts are a thing.
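
One plausible way to wire up business-hours-only paging, sketched in Python; the "critical" severity name and the weekday 09:00-17:59 window are assumptions, not from the talk:

```python
from datetime import datetime

BUSINESS_HOURS = range(9, 18)   # assumed 09:00-17:59 local-time window

def should_page(severity: str, now: datetime | None = None) -> bool:
    """Page immediately only for critical alerts; everything else waits
    for weekday business hours instead of waking the on-call."""
    now = now or datetime.now()
    if severity == "critical":
        return True
    return now.weekday() < 5 and now.hour in BUSINESS_HOURS
```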
  • Weekly on-call retro:
    • Hand off ongoing issues
    • Review alerts fired in the previous week
    • Schedule work to improve on-call or reliability
      • Were these actionable?
      • Did you tweak thresholds?
      • Do something else?
      • Prioritize fixing things
  • "The goal is not to never get paged, the goal is to never get paged for the same thing twice" - Astrid Atkinson
  • 50% reduction of alerts in 1 quarter.
    • Reductions continued in each subsequent quarter (just not as large)
  • On-call slept through the night
  • More time to do scheduled work while on-call
  • Faster to ramp up new teammates
    • It used to be 6 months before a new hire went on-call;
    • now they're on-call within 1 month
  • This also vastly improved the visibility into the system.
    • Quarterly pie chart of alerts by service.
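
That per-service breakdown could be as simple as a matplotlib pie chart over quarterly alert counts; the service names and numbers below are made up for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical quarterly alert counts per service.
alerts_by_service = {"cuckoo": 120, "timelines": 45, "search": 30, "ads": 15}

plt.pie(list(alerts_by_service.values()),
        labels=list(alerts_by_service.keys()),
        autopct="%1.0f%%")
plt.title("Alerts by service (quarter)")
plt.show()
```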
  • Prevention of Alert Fatigue:
    • Critical alerts need to be actionable and customer-impacting.
    • Do not alert on machine-specific metrics (see the aggregation sketch after this list).
      • GC pauses on one machine are not pageable.
      • High CPU on one machine is not pageable.
    • The tech lead or engineering manager should be on-call.
    • Cultural change.
      • Move away from rewarding fire-fighting.
      • Move toward preventing fires.
    • The goal is to build systems that scale linearly with machines and sub-linearly with people.
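
A sketch of the fleet-wide aggregation behind "don't page on one machine": page only when a meaningful fraction of the fleet is unhealthy. The 90% CPU limit and 20% fleet fraction are assumed numbers:

```python
def fleet_should_page(cpu_by_host: dict[str, float],
                      cpu_limit: float = 90.0,
                      fleet_fraction: float = 0.20) -> bool:
    """Ignore a single hot machine; page only if an (assumed) 20% of
    the fleet is over the CPU limit, i.e. the service itself is at risk."""
    hot = sum(1 for cpu in cpu_by_host.values() if cpu > cpu_limit)
    return hot / len(cpu_by_host) >= fleet_fraction

# One hot host out of ten is below the 20% fraction: no page.
hosts = {f"web-{i}": 40.0 for i in range(9)}
hosts["web-9"] = 97.0
assert not fleet_should_page(hosts)
```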
  • Benefits of tackling alert fatigue:
    • More reliable systems
    • Less unplanned work
    • Happier developers.