Tackling Alert Fatigue

Caitie McCaffrey (@caitie)

Her slides are at: https://github.com/CaitieM20/Monitorama2016

"When alerts are more often false than true, the on-call's sense of urgency in responding to alerts is diminished.... the simple burden of alerts desensitizes the on-call to alerts..."
"When alarms are more often false than true, the nursing staff's sense of urgency in responding to alarms is diminished.... the simple burden of alerts desensitizes caregivers to alarms..."
Nurses were alerted to heart failure too frequently... so they ignored the alarms,
and people died.
This is a people problem, not a tech problem.
Ignored alerts -> unreliable systems -> unhappy customers
unplanned work -> inability to complete planned work -> less time to focus on core business.
Businesses measure this poorly.
People problem:
- Fatigue + Fire-fighting = burnout.
Tackling Alert Fatigue in Hospitals
- Increased thresholds for patients' vitals
- Only Crisis Alarms would emit audible alerts
- Nursing staff required to tune false positive alerts.
- Novel Approach to Cardiac Alarm Management on Telemetry Units
- THIS DECREASED FATALITIES ###
Applied the above to Cuckoo at Twitter. Previously 50 pages per week.
Started with a full alert audit. Took 2 weeks.
"Nothing to surface hidden assumptions like writing runbooks"
Runbooks:
- Table of Contents,
- General description,
- list of dashboards,
- Then all the Alerts.
  - Title of alert,
  - impact to customer
    - (If no impact, candidate for alert deletion)
    - Don't page for something being weird, if there's no real impact
  - remediation steps
    - Includes customer communication
    - If you can't tell me what to do, delete the alert.
    - If you don't have control over the alert, don't alert.
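
A minimal sketch of how the runbook rules above could be checked during an alert audit. The `AlertEntry` fields and the `deletion_reasons` helper are hypothetical names chosen for illustration, not something shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class AlertEntry:
    """One alert section of a runbook (field names are illustrative)."""
    title: str
    customer_impact: str = ""        # empty means no known customer impact
    remediation_steps: list[str] = field(default_factory=list)
    team_controls_fix: bool = True   # False if the on-call cannot act on it


def deletion_reasons(alert: AlertEntry) -> list[str]:
    """Return the reasons an alert is a candidate for deletion, if any."""
    reasons = []
    if not alert.customer_impact.strip():
        reasons.append("no customer impact")
    if not alert.remediation_steps:
        reasons.append("no remediation steps")
    if not alert.team_controls_fix:
        reasons.append("outside the team's control")
    return reasons


# Hypothetical runbook entries.
alerts = [
    AlertEntry(
        "API error rate high",
        "Customers see failed requests",
        ["Check the deploy dashboard",
         "Roll back the last deploy",
         "Post a status update for customers"],
    ),
    AlertEntry("Disk latency looks weird"),
]
audit = {a.title: deletion_reasons(a) for a in alerts}
```

An empty list means the alert passes the audit; anything else is a prompt to fix the runbook entry or delete the alert.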
Empower the on-call
- Tune alert thresholds
- Disable or delete non-actionable alerts
- Business-hours-only alerts are a thing.
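
A rough sketch of a business-hours-only alert, assuming a 09:00 to 17:00, Monday to Friday working week. The `route` function and the "page"/"queue" outcomes are illustrative assumptions, not the API of any particular paging tool.

```python
from datetime import datetime, time

BUSINESS_START = time(9, 0)   # assumed working hours, local time
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(business_hours_only: bool, now: datetime) -> str:
    """Decide whether an alert pages the on-call now or waits in a queue."""
    if business_hours_only and not in_business_hours(now):
        return "queue"   # hold until the next working day
    return "page"
```

With this shape, `route(True, datetime(2016, 6, 27, 3, 0))` holds the alert overnight, while an alert not marked business-hours-only still pages at 3 am.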
Weekly on-call retro:
- Hand off ongoing issues
- Review alerts fired in the previous week
  - Were these actionable?
  - Did you tweak thresholds
  - or do something else?
- Schedule work to improve on-call or reliability
  - Prioritize fixing things
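
One possible way to prepare the weekly alert review: count which alerts paged more than once, since repeats are the obvious candidates for scheduled fixes. The alert names and the `repeat_pages` helper are made up for illustration.

```python
from collections import Counter


def repeat_pages(fired: list[str]) -> dict[str, int]:
    """Return alerts that paged more than once, with their counts."""
    counts = Counter(fired)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical alert names for one week of pages.
week = [
    "queue depth high",
    "api 5xx rate",
    "queue depth high",
    "queue depth high",
    "cert expiring",
]
repeats = repeat_pages(week)
```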
"The goal is not to never get paged, the goal is to never get paged for the same thing twice" - Astrid Atkinson
50% reduction of alerts in 1 quarter.
- Reductions continue after each quarter (just not as much)
On-call slept through the night
more time to do scheduled work while on-call
Faster to ramp up new teammates
- Previously 6 months before going on-call,
- now on-call within 1 month
This also vastly improved the visibility into the system.
- Quarterly pie chart of alerts by service.
Prevention of Alert Fatigue:
- Critical alerts need to be actionable and impacting customers.
- Do not alert on machine specific metrics.
  - GC Pauses on one machine are not pageable.
  - High CPU on one machine is not pageable.
- The tech lead or engineering manager should be on-call. ###
- Cultural change.
  - move away from rewarding fire-fighting.
  - Move to preventing fires
- The goal is to build systems that can scale linearly with machines and sub-linearly with people.
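
A sketch of the "no machine-specific alerts" rule, under the assumption that the share of hosts over a CPU threshold is a reasonable stand-in for service-level trouble. The `should_page` function and both thresholds are illustrative; in practice a direct customer-impact signal would be the better thing to page on.

```python
def should_page(cpu_by_host: dict[str, float],
                cpu_threshold: float = 0.9,
                fleet_fraction: float = 0.5) -> bool:
    """Page only when a large share of the fleet is over the CPU threshold."""
    if not cpu_by_host:
        return False
    hot = sum(1 for cpu in cpu_by_host.values() if cpu > cpu_threshold)
    return hot / len(cpu_by_host) >= fleet_fraction


# One hot machine out of four does not page; three out of four does.
one_hot = should_page({"a": 0.99, "b": 0.20, "c": 0.30, "d": 0.25})
mostly_hot = should_page({"a": 0.99, "b": 0.95, "c": 0.97, "d": 0.25})
```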
Benefits of tackling alert fatigue:
- More reliable systems
- Less unplanned work
- happier developers.
