303 changes: 167 additions & 136 deletions enhancements/monitoring/alerting-consistency.md
---
title: alerting-consistency
authors:
- "@michaelgugino"
- "@bison"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-02-03
last-updated: 2021-09-09
status: implementable
---

## Summary

Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for developers and
administrators creating new alerts for OpenShift will result in a better
experience for end users. This proposal aims to outline the collective wisdom
of the OpenShift developer and wider monitoring communities in a way which can
then be translated into official documentation.

## Motivation

There seems to be consensus that the signal-to-noise ratio of alerts in
OpenShift could be improved, and alert fatigue for administrators thereby
reduced. There should be clear guidance aligning critical, warning, and info
alert severities with expected outcomes across components. There must be
agreed-upon acceptance criteria for new critical alerts introduced in OpenShift.

### Goals

* Define acceptance criteria for new critical alerts.
* Define practical guidelines for writing alerts of each severity.
* Translate this into clear developer documentation with examples.
* Enforce acceptance criteria for critical alerts in the OpenShift test suite.

### Non-Goals

* Define needs for specific alerts.
* Implement user-defined alerts.

## Proposal

As an SRE, I need alerts to be informative, actionable and thoroughly
documented.

### Risks and Mitigations

As the primary goal of this proposal is to collect feedback to be translated
into developer documentation, there isn't much risk. The primary source of risk
is friction introduced by enforcing acceptance criteria for critical alerts in
shared OpenShift test suites. This could be an issue for teams already shipping
critical alerts, but can be mitigated by providing case-by-case exceptions until
the monitoring team can reach out and provide guidance on correcting any issues.
This also provides a concrete process by which we can audit existing critical
alerts across components, and bring them into compliance one by one.

## Design Details

### Recommended Reading

A list of references on good alerting practices:

* [Google SRE Book - Monitoring Distributed Systems][7]
* [Prometheus Alerting Documentation][8]
* [Alerting for Distributed Systems][9]

### Alert Ownership

Individual teams are responsible for writing and maintaining alerting rules for
their components, i.e., their operators and operands. The monitoring team is
available for consulting. Frequently asked questions and broadly applicable
patterns should be added to the developer documentation that results from this
proposal.
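
In OpenShift, component alerting rules are typically shipped as `PrometheusRule`
resources, which the Prometheus Operator picks up. A minimal sketch of how a
team might package a rule alongside its operator follows; all names, the
namespace, and the expression are illustrative only:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # Hypothetical names; use your component's own operator namespace.
  name: example-operator-rules
  namespace: openshift-example-operator
spec:
  groups:
    - name: example-operator.rules
      rules:
        - alert: ExampleOperatorDegraded
          annotations:
            summary: The example-operator has reported a Degraded condition for more than 30 minutes.
            description: The operator is Degraded, so its operand may not be reconciled.
              Check the operator logs and its ClusterOperator status for details.
          # Illustrative expression; replace with a meaningful condition for your component.
          expr: |
            max by (name) (cluster_operator_conditions{name="example-operator", condition="Degraded"}) == 1
          for: 30m
          labels:
            severity: warning
```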

Teams should also take into consideration how their components interact with
existing monitoring and alerting. As an example, if your operator deploys a
service which creates one or more `PersistentVolume` resources, and these
volumes are expected to be mostly full as part of normal operation, it's likely
that this will cause unnecessary `KubePersistentVolumeFillingUp` alerts to fire.
You should work with the monitoring team to find a solution to avoid triggering
these alerts if they are not actionable.

### Style Guide

* Alert names MUST be CamelCase, e.g.: `PrometheusRuleFailures`
* Alert names SHOULD be prefixed with a component, e.g.: `AlertmanagerFailedReload`
  * There may be exceptions for some broadly scoped alerts, e.g.: `TargetDown`
* Alerts MUST include a `severity` label indicating the alert's urgency.
  * Valid severities are: `critical`, `warning`, or `info` — see below for
    guidelines on writing alerts of each severity.
* Alerts MUST include `summary` and `description` annotations.
  * Think of `summary` as the first line of a commit message, or an email
    subject line. It should be brief but informative. The `description` is the
    longer, more detailed explanation of the alert.
* Alerts SHOULD include a `namespace` label indicating the source of the alert.
  * Many alerts will include this by virtue of the fact that their PromQL
    expressions result in a namespace label. Others may require a static
    namespace label — see, for example, the [KubeCPUOvercommit][1] alert.
* All critical alerts MUST include a `runbook_url` annotation.
  * Runbook-style documentation for resolving critical alerts is required.
    These runbooks are reviewed by OpenShift SREs and currently live in the
    [openshift/runbooks][2] repository.
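
For illustration, a hypothetical rule that follows all of the conventions above
might look like the following sketch (the alert name, expression, runbook path,
and namespace are invented for this example and do not refer to a real
component):

```yaml
- alert: ExampleOperatorDown
  annotations:
    summary: The example-operator target has disappeared from Prometheus target discovery.
    description: Metrics for example-operator have not been scraped successfully
      for 15 minutes, so its health cannot be observed. Check that the operator
      pods are running and that their ServiceMonitor still matches the Service.
    # A runbook is mandatory because the severity is critical; this path is a placeholder.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/example-operator/ExampleOperatorDown.md
  expr: |
    absent(up{job="example-operator"} == 1)
  for: 15m
  labels:
    # absent() only copies labels from its selector, so set the namespace statically.
    namespace: openshift-example-operator
    severity: critical
```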

### Critical Alerts

TL/DR: For alerting current and impending disaster situations. These alerts
page an SRE. The situation should warrant waking someone in the middle of the
night.

Timeline: ~5 minutes.

alerts so they fire before the situation becomes irrecoverable. Expect users to
be notified of a critical alert within a short period of time after it fires so
they can respond with corrective action quickly.

Example critical alert: [KubeAPIDown][3]

```yaml
- alert: KubeAPIDown
  annotations:
    summary: Target disappeared from Prometheus target discovery.
    description: KubeAPI has disappeared from Prometheus target discovery.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/KubeAPIDown.md
  expr: |
    absent(up{job="apiserver"} == 1)
  for: 15m
  labels:
    severity: critical
```

This alert fires if no Kubernetes API server instance has reported metrics
successfully in the last 15 minutes. This is a clear example of a critical
control-plane issue that represents a threat to the operability of the cluster
as a whole, and likely warrants paging someone. The alert has clear summary and
description annotations, and it links to a runbook with information on
investigating and resolving the issue.

The group of critical alerts should be small, very well defined, highly
documented, polished and with a high bar set for entry. This includes a
mandatory review of a proposed critical alert by the Red Hat SRE team.
> **Contributor Author:** We should add details on how and where this happens? How does someone contact the SRE team for this review?

### Warning Alerts

TL/DR: The vast majority of alerts should use this severity. Issues at the
warning level should be addressed in a timely manner, but don't pose an
immediate threat to the operation of the cluster as a whole.

Timeline: ~60 minutes

If your alert does not meet the criteria in "Critical Alerts" above, it belongs
to the warning level or lower.
that they do not fire until components have sufficient time to try to recover
from the interruption automatically. Expect users to be notified of a warning,
but for them not to respond with corrective action immediately.

Example warning alert: [ClusterNotUpgradeable][4]

```yaml
- alert: ClusterNotUpgradeable
  annotations:
    summary: One or more cluster operators have been blocking minor version cluster upgrades for at least an hour.
    description: In most cases, you will still be able to apply patch releases.
      Reason {{ "{{ with $cluster_operator_conditions := \"cluster_operator_conditions\" | query}}{{range $value := .}}{{if and (eq (label \"name\" $value) \"version\") (eq (label \"condition\" $value) \"Upgradeable\") (eq (label \"endpoint\" $value) \"metrics\") (eq (value $value) 0.0) (ne (len (label \"reason\" $value)) 0) }}{{label \"reason\" $value}}.{{end}}{{end}}{{end}}"}}
      For more information refer to 'oc adm upgrade'{{ "{{ with $console_url := \"console_url\" | query }}{{ if ne (len (label \"url\" (first $console_url ) ) ) 0}} or {{ label \"url\" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}" }}.
  expr: |
    max by (name, condition, endpoint) (cluster_operator_conditions{name="version", condition="Upgradeable", endpoint="metrics"} == 0)
  for: 60m
  labels:
    severity: warning
```

This alert fires if one or more operators have not reported their `Upgradeable`
condition as true in more than an hour. The alert has a clear name and
informative summary and description annotations. The timeline is appropriate
for allowing the operator a chance to resolve the issue automatically, avoiding
the need to alert an administrator.

### Info Alerts

TL/DR: Info level alerts represent situations an administrator should be aware
of, but that don't necessarily require any action. Use these sparingly, and
consider instead reporting this information via Kubernetes events.

Example info alert: [MultipleContainersOOMKilled][5]

```yaml
- alert: MultipleContainersOOMKilled
  annotations:
    description: Multiple containers were out of memory killed within the past
      15 minutes. There are many potential causes of OOM errors, however issues
      on a specific node or containers breaching their limits is common.
    summary: Containers are being killed due to OOM
  expr: sum(max by(namespace, container, pod) (increase(kube_pod_container_status_restarts_total[12m]))
    and max by(namespace, container, pod) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) == 1) > 5
  for: 15m
  labels:
    namespace: kube-system
    severity: info
```

This alert fires if multiple containers have been terminated due to out of
memory conditions in the last 15 minutes. This is something the administrator
should be aware of, but may not require immediate action.

### Test Plan

Automated tests enforcing acceptance criteria for critical alerts, and basic
style linting will be added to the [openshift/origin][6] end-to-end test suite.
The monitoring team will work with anyone shipping existing critical alerts that
don't meet these criteria in order to resolve the issue before enabling the
tests.
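
As a concrete sketch of what these acceptance criteria would catch, a
hypothetical critical alert like the one below would be flagged, because it is
missing the `summary` and `runbook_url` annotations required of critical alerts
(the rule and metric are invented for illustration):

```yaml
# Hypothetical rule that would fail the acceptance criteria for critical alerts:
# it has no summary annotation and no runbook_url annotation.
- alert: ExampleOperatorBroken
  annotations:
    description: Something has gone wrong with the example-operator.
  expr: |
    example_operator_reconcile_errors_total > 0
  for: 5m
  labels:
    severity: critical
```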

## Implementation History
None

## Drawbacks


### Version Skew Strategy"
None


[1]: https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/control-plane/prometheus-rule.yaml#L235-L247
[2]: https://github.com/openshift/runbooks
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/control-plane/prometheus-rule.yaml#L412-L421
[4]: https://github.com/openshift/cluster-version-operator/blob/513a2fc/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L68-L76
[5]: https://github.com/openshift/cluster-monitoring-operator/blob/79cdf68/assets/cluster-monitoring-operator/prometheus-rule.yaml#L326-L338
[6]: https://github.com/openshift/origin
[7]: https://sre.google/sre-book/monitoring-distributed-systems/
[8]: https://prometheus.io/docs/practices/alerting/
[9]: https://www.usenix.org/sites/default/files/conference/protected-files/srecon16europe_slides_rabenstein.pdf