Commit 2a1bc4c

til: philosophy on alerting
1 parent c091c80 commit 2a1bc4c

1 file changed, +32 -0 lines changed

@@ -0,0 +1,32 @@
# My Philosophy on Alerting

Source:

- <https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit?usp=sharing>
- <https://www.oreilly.com/radar/monitoring-distributed-systems/>

## Monitor for your users

- Symptom-based monitoring > cause-based monitoring.
- Users, in general, care about a small number of things:
  - Basic availability and correctness
  - Latency
  - Completeness/freshness/durability
  - Features
- Cause-based alerts are bad, but sometimes necessary. There are (often) no symptoms of "almost" running out of quota
  or memory or disk I/O, etc., so you want rules to know you're walking towards a cliff. Use these
  sparingly; don't write cause-based paging rules for symptoms you can catch otherwise.

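A "walking towards a cliff" rule can be sketched as a simple extrapolation: fit a line to recent usage samples and warn when the projected time to exhaustion drops below some horizon. This is only an illustration; the function names, the 24-hour horizon, and the linear-growth assumption are mine, not from the source.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]], capacity: float
) -> Optional[float]:
    """Estimate hours until usage reaches capacity with a least-squares line.

    samples: (hours_since_start, amount_used) pairs, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Slope of the fitted line: units of usage gained per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity - last_used) / slope)


def should_warn(
    samples: Sequence[Tuple[float, float]],
    capacity: float,
    horizon_hours: float = 24.0,  # hypothetical threshold
) -> bool:
    """True when the projected exhaustion time falls inside the horizon."""
    remaining = hours_until_full(samples, capacity)
    return remaining is not None and remaining <= horizon_hours


if __name__ == "__main__":
    disk = [(0.0, 50.0), (1.0, 60.0), (2.0, 70.0)]  # made-up samples
    print(hours_until_full(disk, capacity=100.0))
    print(should_warn(disk, capacity=100.0))
```

A rule like this fits a ticket or report better than a page unless the projected time is very short, in keeping with "use these sparingly".
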
## Tickets, Reports and Email

- Bug or ticket-tracking systems can be useful.
- A daily (or more frequent) report can work too.
- Every alert should be tracked through a workflow system.

The underlying point is to create a system that still has accountability for responsiveness, but doesn't have the high cost of waking someone up, interrupting their dinner, or preventing snuggling with a significant other.

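One way to picture the split between pages and the cheaper channels above is a small routing function. The `Alert` fields and the three channels are hypothetical names chosen for this sketch; the source does not prescribe a specific policy.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PAGE = "page"
    TICKET = "ticket"
    REPORT = "daily_report"


@dataclass(frozen=True)
class Alert:
    name: str
    user_visible: bool       # assumption: users are feeling this right now
    needs_action_soon: bool  # assumption: someone must act, but not tonight


def route(alert: Alert) -> Channel:
    """Pick the cheapest channel that still keeps someone accountable."""
    if alert.user_visible:
        return Channel.PAGE
    if alert.needs_action_soon:
        return Channel.TICKET
    return Channel.REPORT


if __name__ == "__main__":
    print(route(Alert("checkout_errors", True, True)))
    print(route(Alert("disk_filling", False, True)))
    print(route(Alert("cert_expiring_in_60d", False, False)))
```

Every branch returns a channel, so no alert is silently dropped, which is one reading of "tracked through a workflow system".
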
## Playbooks

Playbooks (or runbooks) are an important part of an alerting system; it's best to have an entry for each alert or family of alerts that catch a symptom, which can further explain what the alert means and how it might be addressed.

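The "entry for each alert or family of alerts" idea can be checked mechanically. The sketch below assumes alert names look like `family:alert` and that playbooks live in a dict keyed by alert name or family; both conventions are made up for illustration.

```python
from typing import Dict, Iterable, List


def missing_playbooks(
    alert_names: Iterable[str], playbooks: Dict[str, str]
) -> List[str]:
    """Return alerts with no playbook entry for the alert or its family.

    Assumption: the family is the text before the first ':' in the name.
    """
    missing = []
    for name in alert_names:
        family = name.split(":", 1)[0]
        if name not in playbooks and family not in playbooks:
            missing.append(name)
    return missing


if __name__ == "__main__":
    alerts = ["frontend:high_latency", "frontend:errors", "db:disk_filling"]
    docs = {"frontend": "What the frontend alerts mean and first steps."}
    print(missing_playbooks(alerts, docs))
```

Running a check like this in CI would flag new alerts that ship without an entry.
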
## Tracking & Accountability