303 changes: 167 additions & 136 deletions enhancements/monitoring/alerting-consistency.md
---
title: alerting-consistency
authors:
- "@michaelgugino"
- "@bison"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-02-03
last-updated: 2021-09-09
status: implementable
---

## Summary

Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for developers and
administrators creating new alerts for OpenShift will result in a better
experience for end users. This proposal aims to outline the collective wisdom
of the OpenShift developer and wider monitoring communities in a way which can
then be translated into official documentation.

## Motivation

There seems to be consensus that the signal-to-noise ratio of alerts in
OpenShift could be improved, and alert fatigue for administrators thereby
reduced. There should be clear guidance aligning critical, warning, and info
alert severities with expected outcomes across components. There must be
agreed-upon acceptance criteria for new critical alerts introduced in OpenShift.

### Goals

* Define acceptance criteria for new critical alerts.
* Define practical guidelines for writing alerts of each severity.
* Translate this into clear developer documentation with examples.
* Enforce acceptance criteria for critical alerts in the OpenShift test suite.

### Non-Goals

* Define needs for specific alerts.
* Implement user-defined alerts.

## Proposal

As an SRE, I need alerts to be informative, actionable and thoroughly
documented.

### Risks and Mitigations

As the primary goal of this proposal is to collect feedback to be translated
into developer documentation, there isn't much risk. The primary source of risk
is friction introduced by enforcing acceptance criteria for critical alerts in
shared OpenShift test suites. This could be an issue for teams already shipping
critical alerts, but can be mitigated by providing case-by-case exceptions until
the monitoring team can reach out and provide guidance on correcting any issues.
This also provides a concrete process by which we can audit existing critical
alerts across components, and bring them into compliance one by one.

## Design Details

### Recommended Reading

A list of references on good alerting practices:

* [Google SRE Book - Monitoring Distributed Systems][7]
* [Prometheus Alerting Documentation][8]
* [Alerting for Distributed Systems][9]

### Alert Ownership

Individual teams are responsible for writing and maintaining alerting rules for
their components, i.e., their operators and operands. The monitoring team is
available for consulting. Frequently asked questions and broadly applicable
patterns should be added to the developer documentation that results from this
proposal.
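
In OpenShift, component alerting rules are typically shipped as `PrometheusRule`
resources, which the Prometheus Operator picks up. A minimal sketch of how a
team might package a rule alongside its operator follows; all names, the
namespace, and the expression are illustrative only:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # Hypothetical names; use your component's own operator namespace.
  name: example-operator-rules
  namespace: openshift-example-operator
spec:
  groups:
    - name: example-operator.rules
      rules:
        - alert: ExampleOperatorDegraded
          annotations:
            summary: The example-operator has reported a Degraded condition for more than 30 minutes.
            description: The operator is Degraded, so its operand may not be reconciled.
              Check the operator logs and its ClusterOperator status for details.
          # Illustrative expression; replace with a meaningful condition for your component.
          expr: |
            max by (name) (cluster_operator_conditions{name="example-operator", condition="Degraded"}) == 1
          for: 30m
          labels:
            severity: warning
```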

Teams should also take into consideration how their components interact with
existing monitoring and alerting. As an example, if your operator deploys a
service which creates one or more `PersistentVolume` resources, and these
volumes are expected to be mostly full as part of normal operation, it's likely
that this will cause unnecessary `KubePersistentVolumeFillingUp` alerts to fire.
You should work with the monitoring team to find a solution to avoid triggering
these alerts if they are not actionable.

### Style Guide

* Alert names MUST be CamelCase, e.g.: `PrometheusRuleFailures`
* Alert names SHOULD be prefixed with a component, e.g.: `AlertmanagerFailedReload`
  * There may be exceptions for some broadly scoped alerts, e.g.: `TargetDown`
* Alerts MUST include a `severity` label indicating the alert's urgency.
  * Valid severities are: `critical`, `warning`, or `info` — see below for
    guidelines on writing alerts of each severity.
* Alerts MUST include `summary` and `description` annotations.
  * Think of `summary` as the first line of a commit message, or an email
    subject line. It should be brief but informative. The `description` is the
    longer, more detailed explanation of the alert.
* Alerts SHOULD include a `namespace` label indicating the source of the alert.
  * Many alerts will include this by virtue of the fact that their PromQL
    expressions result in a namespace label. Others may require a static
    namespace label — see, for example, the [KubeCPUOvercommit][1] alert.
* All critical alerts MUST include a `runbook_url` annotation.
  * Runbook-style documentation for resolving critical alerts is required.
    These runbooks are reviewed by OpenShift SREs and currently live in the
    [openshift/runbooks][2] repository.
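
For illustration, a hypothetical rule that follows all of the conventions above
might look like the following sketch (the alert name, expression, runbook path,
and namespace are invented for this example and do not refer to a real
component):

```yaml
- alert: ExampleOperatorDown
  annotations:
    summary: The example-operator target has disappeared from Prometheus target discovery.
    description: Metrics for example-operator have not been scraped successfully
      for 15 minutes, so its health cannot be observed. Check that the operator
      pods are running and that their ServiceMonitor still matches the Service.
    # A runbook is mandatory because the severity is critical; this path is a placeholder.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/example-operator/ExampleOperatorDown.md
  expr: |
    absent(up{job="example-operator"} == 1)
  for: 15m
  labels:
    # absent() only copies labels from its selector, so set the namespace statically.
    namespace: openshift-example-operator
    severity: critical
```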

### Critical Alerts

TL/DR: For alerting current and impending disaster situations. These alerts
page an SRE. The situation should warrant waking someone in the middle of the
night.

Timeline: ~5 minutes.

alerts so they fire before the situation becomes irrecoverable. Expect users to
be notified of a critical alert within a short period of time after it fires so
they can respond with corrective action quickly.

Example critical alert: [KubeAPIDown][3]

```yaml
- alert: KubeAPIDown
  annotations:
    summary: Target disappeared from Prometheus target discovery.
    description: KubeAPI has disappeared from Prometheus target discovery.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeAPIDown.md
  expr: |
    absent(up{job="apiserver"} == 1)
  for: 15m
  labels:
    severity: critical
```

This alert fires if no Kubernetes API server instance has reported metrics
successfully in the last 15 minutes. This is a clear example of a critical
control-plane issue that represents a threat to the operability of the cluster
as a whole, and likely warrants paging someone. The alert has clear summary and
description annotations, and it links to a runbook with information on
investigating and resolving the issue.

The group of critical alerts should be small, very well defined, highly
documented, polished and with a high bar set for entry. This includes a
mandatory review of a proposed critical alert by the Red Hat SRE team.
> **Contributor Author:** We should add details on how and where this happens? How does someone contact the SRE team for this review?

### Warning Alerts

TL/DR: The vast majority of alerts should use this severity. Issues at the
warning level should be addressed in a timely manner, but don't pose an
immediate threat to the operation of the cluster as a whole.

Timeline: ~60 minutes

If your alert does not meet the criteria in "Critical Alerts" above, it belongs
to the warning level or lower.
that they do not fire until components have sufficient time to try to recover
from the interruption automatically. Expect users to be notified of a warning,
but for them not to respond with corrective action immediately.

Example warning alert: [ClusterNotUpgradeable][4]

```yaml
- alert: ClusterNotUpgradeable
  annotations:
    summary: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
    description: In most cases, you will still be able to apply patch releases.
      Reason {{ "{{ with $cluster_operator_conditions := \"cluster_operator_conditions\" | query}}{{range $value := .}}{{if and (eq (label \"name\" $value) \"version\") (eq (label \"condition\" $value) \"Upgradeable\") (eq (label \"endpoint\" $value) \"metrics\") (eq (value $value) 0.0) (ne (len (label \"reason\" $value)) 0) }}{{label \"reason\" $value}}.{{end}}{{end}}{{end}}"}}
      For more information refer to 'oc adm upgrade'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
  expr: |
    max by (name, condition, endpoint) (cluster_operator_conditions{name="version", condition="Upgradeable", endpoint="metrics"} == 0)
  for: 60m
  labels:
    severity: warning
```

This alert fires if one or more operators have not reported their `Upgradeable`
condition as true in more than an hour. The alert has a clear name and
informative summary and description annotations. The timeline is appropriate
for allowing the operator a chance to resolve the issue automatically, avoiding
the need to alert an administrator.

### Info Alerts

TL/DR: Info level alerts represent situations an administrator should be aware
of, but that don't necessarily require any action. Use these sparingly, and
consider instead reporting this information via Kubernetes events.

Example info alert: [MultipleContainersOOMKilled][5]

```yaml
- alert: MultipleContainersOOMKilled
  annotations:
    description: Multiple containers were out of memory killed within the past
      15 minutes. There are many potential causes of OOM errors, however issues
      on a specific node or containers breaching their limits is common.
    summary: Containers are being killed due to OOM
  expr: sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m]))
    and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5
  for: 15m
  labels:
    namespace: kube-system
    severity: info
```

This alert fires if multiple containers have been terminated due to out of
memory conditions in the last 15 minutes. This is something the administrator
should be aware of, but may not require immediate action.

### Test Plan

Automated tests enforcing acceptance criteria for critical alerts, and basic
style linting will be added to the [openshift/origin][6] end-to-end test suite.
The monitoring team will work with anyone shipping existing critical alerts that
don't meet these criteria in order to resolve the issue before enabling the
tests.
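
As a concrete sketch of what these acceptance criteria would catch, a
hypothetical critical alert like the one below would be flagged, because it is
missing the `summary` and `runbook_url` annotations required of critical alerts
(the rule and metric are invented for illustration):

```yaml
# Hypothetical rule that would fail the acceptance criteria for critical alerts:
# it has no summary annotation and no runbook_url annotation.
- alert: ExampleOperatorBroken
  annotations:
    description: Something has gone wrong with the example-operator.
  expr: |
    example_operator_reconcile_errors_total > 0
  for: 5m
  labels:
    severity: critical
```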

## Implementation History
None

## Drawbacks


### Version Skew Strategy"
None


[1]: https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/control-plane/prometheus-rule.yaml#L235-L247
[2]: https://github.com/openshift/runbooks
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/control-plane/prometheus-rule.yaml#L412-L421
[4]: https://github.com/openshift/cluster-version-operator/blob/513a2fc/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L68-L76
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/cluster-monitoring-operator/prometheus-rule.yaml#L326-L338
[6]: https://github.com/openshift/origin
[7]: https://sre.google/sre-book/monitoring-distributed-systems/
[8]: https://prometheus.io/docs/practices/alerting/
[9]: https://www.usenix.org/sites/default/files/conference/protected-files/srecon16europe_slides_rabenstein.pdf