diff --git a/enhancements/monitoring/alerting-consistency.md b/enhancements/monitoring/alerting-consistency.md
new file mode 100644
index 0000000000..3278ea5a17
--- /dev/null
+++ b/enhancements/monitoring/alerting-consistency.md
@@ -0,0 +1,253 @@
---
title: alerting-consistency
authors:
  - "@michaelgugino"
reviewers:
  - TBD
approvers:
  - TBD
creation-date: 2021-02-03
last-updated: 2021-02-03
status: implementable
---

# Alerting Consistency

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for our alerts
will allow developers to better inform users of problem situations and how to
resolve them.

## Motivation

Improve the signal-to-noise ratio of alerts. Align the alert levels "Critical",
"Warning", and "Info" with operational expectations to allow easier triage.
Ensure current and future alerts have documentation covering problem,
investigation, and solution steps.

### Goals

* Define clear criteria for designating an alert 'critical'
* Define clear criteria for designating an alert 'warning'
* Define minimum time thresholds prior to triggering an alert
* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
* Outline operational triage of firing alerts

### Non-Goals

* Define specific alert needs
* Implement user-defined alerts

## Proposal

### User Stories

#### Story 1

As an OpenShift developer, I need to ensure that my alerts are appropriately
tuned to allow end-user operational efficiency.

#### Story 2

As an SRE, I need alerts to be informative, actionable, and thoroughly
documented.

### Implementation Details/Notes/Constraints [optional]

This enhancement will require some developers and administrators to rethink
their concept of alerts and alert levels. Many people have an intuition about
how alerts should work that doesn't necessarily match how alerts actually work
or how others believe alerts should work.

Formalizing an alerting specification will allow developers and administrators
to speak a common language and have a common understanding, even if the
concepts seem unintuitive initially.

### Risks and Mitigations

People will make wrong assumptions about alerts and how to handle them. This
is not unique to alerts; people often believe certain components should behave
in a particular way in which they do not. This is an artifact of poor
documentation that leaves too much to the imagination.

We can mitigate this problem with clear and concise documentation.

## Design Details

### Critical Alerts

TL;DR: for alerting on current and impending disaster situations.

Timeline: ~5 minutes.

Reserve critical level alerts only for reporting conditions that may lead to
loss of data or inability to deliver service for the cluster as a whole.
Failures of most individual components should not trigger critical level
alerts, unless they would result in either of those conditions. Configure
critical level alerts so they fire before the situation becomes irrecoverable.
Expect users to be notified of a critical alert within a short period of time
after it fires so they can respond with corrective action quickly.
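
To make the shape of such a rule concrete, here is a minimal, non-normative
sketch of a critical-severity alerting rule in Prometheus rule syntax. The
alert name, metric, and expression are hypothetical; the sketch only
illustrates the `severity: critical` label and the short `for` window
discussed above.

```yaml
groups:
- name: example-critical-alerts
  rules:
  # Hypothetical alert: fires when a quorum-based component is about to lose
  # quorum. Real rules must use real metrics exposed by the component.
  - alert: ExampleQuorumAtRisk
    expr: sum(example_member_healthy) < 2   # hypothetical metric and threshold
    for: 5m                                 # short window: disaster is imminent
    labels:
      severity: critical
    annotations:
      summary: "Example component is at risk of losing quorum."
      description: "Fewer than two example members are healthy; data loss is imminent. Follow the linked runbook to recover members immediately."
```

The ~5 minute `for` duration reflects the timeline above: the condition is
given only a brief chance to self-resolve before someone is paged.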
Some disaster situations are:

* loss or impending loss of etcd quorum
* etcd corruption
* inability to route application traffic externally or internally
(data plane disruption)
* inability to start or restart any pod for reasons other than capacity. For
example, if the API server were to restart or need to be rescheduled, would it
be able to start again?

In other words, a critical alert is something that requires someone to get out
of bed in the middle of the night and fix something **right now**, or they will
be faced with a disaster.

An example of something that is NOT a critical alert:

* The MCO (Machine Config Operator) and/or related components are completely
dead or crash looping.

Even though the above is quite significant, there is no risk of immediate loss
of the cluster or running applications.

You might be thinking to yourself: "But my component is **super** important,
and if it's not working, the user can't do X, which the user definitely will
want." All of that is probably true, but it doesn't make your alert "critical";
it just makes it worthy of an alert.

The group of critical alerts should be small, very well defined, highly
documented, and polished, with a high bar set for entry. This includes a
mandatory review of any proposed critical alert by the Red Hat SRE team.

### Warning Alerts

TL;DR: fix this soon, or some things won't work and upgrades will probably be
blocked.

If your alert does not meet the criteria in "Critical Alerts" above, it belongs
to the warning level or lower.

Use warning level alerts for reporting conditions that may lead to inability to
deliver individual features of the cluster, but not service for the cluster as
a whole. Most alerts are likely to be warnings. Configure warning level alerts
so that they do not fire until components have had sufficient time to try to
recover from the interruption automatically. Expect users to be notified of a
warning, but not to respond with corrective action immediately.

Timeline: ~60 minutes

60 minutes? That seems high! That's because it is high. We want to reduce the
noise. We've done a lot to make clusters and operators auto-heal themselves.
The whole idea is that if a condition has persisted for more than 60 minutes,
it's unlikely to resolve itself without intervention any time later.

#### Example 1: All machine-config-daemons are crashlooping

That seems significant, and it probably is, but nobody has to get out of bed to
fix this **right now**. But what if a machine dies and I can't get a
replacement? That is probably true, but how often does that happen? Also,
that's an entirely separate set of concerns.

#### Example 2: I have an operator that needs at least 3 of something to work

* With 2/3 replicas, fire a warning alert after 60 minutes.
* With 1/3 replicas, fire a warning alert after 10 minutes.
* With 0/3 replicas, fire a warning alert after 5 minutes.

A sketch of how this tiering might be expressed as alerting rules appears after
Example 3 below.

Q: How long should we wait until the above condition rises to a 'critical'
alert?

A: It should never rise to the level of a critical alert; it's not critical. If
it were, this section would not apply.

#### Example 3: I have an operator that only needs 1 replica to function

If the cluster can upgrade with only 1 replica, and the service is available
despite other replicas being unavailable, this can probably be just an
info-level alert.
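
The following is a minimal, non-normative sketch of how the tiered thresholds
in Example 2 might be written as Prometheus alerting rules. The metric
`example_operator_available_replicas` and the alert names are hypothetical; the
point is that every tier keeps `severity: warning`, and only the threshold and
the `for` duration change.

```yaml
groups:
- name: example-operator-replica-alerts
  rules:
  # Hypothetical metric: number of currently available replicas (desired: 3).
  - alert: ExampleOperatorDegraded            # 2/3 replicas: plenty of margin, wait 60m
    expr: example_operator_available_replicas == 2
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: "Example operator is running with 2 of 3 replicas."
      description: "One replica has been unavailable for more than 60 minutes; redundancy is reduced."
  - alert: ExampleOperatorHighlyDegraded      # 1/3 replicas: less margin, wait 10m
    expr: example_operator_available_replicas == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Example operator is running with 1 of 3 replicas."
      description: "Two replicas have been unavailable for more than 10 minutes; a single further failure interrupts the service."
  - alert: ExampleOperatorUnavailable         # 0/3 replicas: service is down, wait 5m
    expr: example_operator_available_replicas == 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Example operator has no available replicas."
      description: "The operator has been unavailable for more than 5 minutes. The feature it manages is unavailable."
```

Note that even the 0/3 case stays at `severity: warning`, per the Q&A above:
losing this operator does not put the cluster as a whole at risk.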
#### Q: What if a particular condition will block upgrades?

A: It's a warning level alert.


### Alert Ownership

Previously, the bulk of our alerting was handled directly by the monitoring
team. Initially, this gave us some insight into platform status without much
effort by each component team. As we mature our alerting system, we should
ensure all teams take ownership of alerts for their individual components.

Going forward, teams will be expected to own the alerting rules within their
own repositories. This will reduce the burden on the monitoring team and better
enable component owners to control their alerts.

Additional responsibilities include:

1. First in line to receive the bug when an alert fires
1. Responsible for describing "what does it mean" and "what does it do"
1. Responsible for choosing the level of alert
1. Responsible for deciding if the alert even matters


### Documentation Required

1. The name of the alerting rule should clearly identify the component impacted
by the issue (for example `etcdInsufficientMembers` instead of
`InsufficientMembers`, `MachineConfigDaemonDrainError` instead of
`MCDDrainError`). It should be camel case, without whitespace, starting with a
capital letter. The first part of the alert name should be the same for all
alerts originating from the same component.
1. Alerting rules should have a "severity" label whose value is either `info`,
`warning`, or `critical` (matching what we have today and setting aside the
discussion of whether we want `minor` or not).
1. Alerting rules should have a `description` annotation providing details
about what is happening and how to resolve the issue.
1. Alerting rules should have a `summary` annotation providing a high-level
description (similar to the first line of a commit message or email subject).
1. If there's a runbook in https://github.com/openshift/runbooks, it should be
linked in the `runbook_url` annotation.

An illustrative rule that follows these guidelines is sketched in the appendix
at the end of this document.


### Open Questions [optional]

* Explore an additional alerting severity (e.g., 'minor') for triage purposes

### Test Plan

## Implementation History

## Drawbacks

People might have rules built around the existing broken alerts. They will have
to change some of these rules.

## Alternatives

Document the policies, but do not make any backwards-incompatible changes to
the existing alerts, and only apply the policies to new alerts.

### Graduation Criteria

None

#### Dev Preview -> Tech Preview

None

#### Tech Preview -> GA

None

#### Removing a deprecated feature

None

### Upgrade / Downgrade Strategy

None

### Version Skew Strategy

None
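
## Appendix: Illustrative alerting rule (non-normative)

This appendix sketches a single rule that follows the "Documentation Required"
guidelines above. The component, alert name, metric, namespace, and runbook
path are all hypothetical; the example only demonstrates the naming convention,
the `severity` label, the `summary` and `description` annotations, and the
`runbook_url` annotation.

```yaml
groups:
- name: example-component-alerts
  rules:
  # Hypothetical component "ExampleController"; the alert name starts with the
  # component name and is camel case without whitespace.
  - alert: ExampleControllerSyncFailing
    # Hypothetical metric exposed by the component.
    expr: rate(examplecontroller_sync_errors_total[5m]) > 0
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: "ExampleController has been failing to sync for more than 60 minutes."
      description: "ExampleController sync operations are failing. The feature it manages may be degraded and upgrades may be blocked. Check the controller pod logs in the openshift-examplecontroller namespace and follow the linked runbook."
      # Hypothetical runbook path in the openshift/runbooks repository.
      runbook_url: "https://github.com/openshift/runbooks/blob/master/alerts/ExampleControllerSyncFailing.md"
```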