enhancements/monitoring/alerting-consistency.md (new file: 253 additions, 0 deletions)
---
title: alerting-consistency
Contributor:

Would be great if we could link to some upstream Prometheus and blog posts on this topic, so we adhere to best practices and align to them.

Contributor Author:

Do you have any good links or info you'd like to incorporate here?

Contributor:

Yes, there is plenty; @s-urbaniak wrote a blog post on this. A couple of things are:

To name a few; not sure if @s-urbaniak's blog post on this topic is out yet?

Contributor Author:

I wouldn't refer to those resources as 'upstream' but rather just some content from the web. I think they are probably helpful for establishing a rationale, but I'm not sure that including them is better or worse.

Contributor:

The blog post is being prepared for publication, but I think I will have to nudge again.

Contributor:

Yeah, include a bulleted list of references in the summary section. The Monitoring section in the Google SRE book includes the majority of the guidance needed to support readers of this document: https://sre.google/sre-book/monitoring-distributed-systems/

authors:
- "@michaelgugino"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-02-03
last-updated: 2021-02-03
status: implementable
---

# Alerting Consistency

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for our alerts
will allow developers to better inform users of problem situations and how to
resolve them.
Contributor:

+1, I really like the idea of creating consistent guidelines not only for the alerts, but also for the documentation to support them. I am also hugely in favor of bringing the focus of generating those artifacts closer to project development.


## Motivation

Contributor:

The motivation for this enhancement needs to lead with "Establish implementation guidelines for developers writing alerts for their component". All other benefits are a by-product of that.

Improve the signal-to-noise ratio of alerts. Align alert levels ("Critical", "Warning", "Info")
with operational expectations to allow easier triage. Ensure current and
future alerts have documentation covering problem/investigation/solution steps.

### Goals

* Define clear criteria for designating an alert 'critical'
* Define clear criteria for designating an alert 'warning'
Contributor:

Do we only care about those 2 levels, or is the goal to describe criteria for all of the levels?

Contributor Author:

You tell me. I just wanted to get the conversation started. I would prefer to have 3 alert levels: "Critical", "Major", and "Minor". Since that ship might have already sailed, I am focusing on what we have today.

Contributor:

Other levels are mentioned in the text below, so I wasn't sure if you were just focusing on the top 2 as a start or if there was another reason they weren't included here. I'd be OK with starting by nailing down the criteria for critical and warning and seeing where that takes us.

Contributor:

Same here. It is listed as TBD below. Not a blocker for initial merge of this IMO.

* Define minimum time thresholds prior to triggering an alert

Comment:

That's unrealistic. Certain conditions can't be recovered from on their own and/or need instant action, because in a high number of cases the component doesn't recover without interaction. There doesn't need to be an arbitrary wait time.
Generally, arbitrary wait times are hard to define across all of OpenShift.

Contributor Author:

I disagree. This is about consistency. The point of this document is to discuss what we think alerting and related elements should look like. There's no reason not to define a standard set of criteria. Anyone who thinks they have a special use case is probably wrong.

Member:

this seems like it would be easy to audit. For example, alerts installed via the CVO that do not set for:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-rc.0-x86_64
$ for x in $(grep -lr '^kind: PrometheusRule' manifests); do yaml2json < "${x}" | jq -r 'if (. | type) == "object" then [.] else . end | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[] | select(.alert != null and .for == null) | .labels.severity + " " + .alert + " " + .expr'; done | sort | uniq

critical MCDRebootError mcd_reboot_err > 0
warning FailingOperator csv_abnormal{phase="Failed"}
warning ImageRegistryStorageReconfigured increase(image_registry_operator_storage_reconfigured_total[30m]) > 0
warning KubeletHealthState mcd_kubelet_state > 2
warning MCDPivotError mcd_pivot_err > 0
warning SystemMemoryExceedsReservation sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)

How do we feel about those?

Contributor Author:

"SystemMemoryExceedsReservation" is definitely one that needs a threshold. I have some questions about the utility of this particular alert, but that's outside the scope of this discussion.

Member (@wking, Feb 10, 2021):

SystemMemoryExceedsReservation is here from openshift/machine-config-operator#2033, if you want to follow up on that alert.

Contributor:

Pretty much any alert that spuriously fires (in the sense that no action is needed; it's going to be fine once the upgrade finishes) during an upgrade today is a candidate for "this alert needs to wait longer before firing", IMHO.

I'm also very interested to see examples where the time it takes an admin to see, process, and take action to resolve an alert doesn't significantly dominate the time between the condition happening and the alert firing. That is to say:

If it takes an admin 30 minutes to fix something they were alerted to, then does it matter whether we alerted them 0 minutes or 5 minutes after it started happening? Particularly if by waiting 5 minutes we may be able to avoid alerting them at all (because the situation does resolve itself, such as in the case of most alerts caused by node reboots).

Contributor Author:

I tried to capture some timeline standards in the individual alerting sections. I would say, for critical alerts, they can probably be added to a 0m, 5m or 15m bucket.

* 0m bucket: etcd quorum is lost; we don't know how long this cluster might be alive, so fire right away and hope for the best.
* 5m bucket: etcd quorum member lost (1 of 3). If it's a small network blip, hopefully this is covered by etcd's internal timeout, and 5 minutes should be enough to recover.
* 15m bucket: pretty much all other critical alerts, like 1/3 healthy API servers.

Back to the upgrade bit: we shouldn't fire any alerts during upgrade; however, I don't think timing is the core issue, at least for the critical alerts. The vast majority of alerts could be Warning or lower severity, with a 60m timeout. This achieves a number of things. First, it gives time for the situation to self-correct, or for a helper like MachineHealthChecks to remediate a problem. Second, it allows us to know that it's probably not going to self-correct. Third, it keeps your alert screen from having tons of alerts at once; especially if there are critical alerts firing, we don't need the extra distractions.

Many of our alerts today are 'raw' alerts, IMO. Things like 9/10 daemonsets are running for MCD. Does this need an alert? It would be nice if we could aggregate some things into node-specific issues. E.g., if sdn-daemon is down on a given node, we should have a 'NodeUnhealthy' alert, and aggregate some conditions onto that screen. I'm not sure how feasible this is today. At the very least, each component's operator could detect the specifics about its operands and a given node. For instance, if a given node is under maintenance, and the daemonset is missing, that's expected. The operator is the layer we can utilize to understand this state.

Contributor:

I do think we need to provide numbers as guidance, if only from a hygiene standpoint. Otherwise we're not going to make progress on standardizing, which is the purpose of this PR. Even if those numbers need tweaking in the end, or if they don't make sense on a particular alert, we can't reasonably ask a component team to write alerts without this basic guidance. This whole process is iterative and we have to start with something.

Comment:

Isn't the wording on this a bit doubled now?

* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
Contributor:

I want to call out the runbooks repo here. The idea is that every alert comes with a runbook, and they're open source, part of the product code-base. https://github.com/openshift/runbooks.

We had some discussion with the support team on this and they ack'd use of the runbooks repo (the alternative suggested was KCS, but was decided against).

Comment:

PLM strongly suggests the use of KCS vs. GitHub, as it creates a more on-brand and controlled space (not influenced by the community).

* Outline operational triage of firing alerts
Contributor:

Does it cover runbooks as per @jeremyeder's comment above? If not, what is the intent?

Contributor:

We should pull this in as one of the major tenets of this enhancement. It's a major win from the SRE and general "alert consumer" PoV. It should be the responsibility of the alert author, so that it will be lifecycled along with the alert (e.g. runbook CRUD when alert CRUD).


### Non-Goals

* Define specific alert needs
* Implement user-defined alerts

## Proposal

### User Stories

#### Story 1

As an OpenShift developer, I need to ensure that my alerts are appropriately
tuned to allow end-user operational efficiency.
Contributor:

Less ambiguous: OpenShift alerts must be optimized for operational efficiency.


#### Story 2

As an SRE, I need alerts to be informative, actionable and thoroughly
documented.

### Implementation Details/Notes/Constraints [optional]

This enhancement will require some developers and administrators to rethink
Contributor:

Not sure what the purpose of this paragraph is, consider dropping?

their concept of alerts and alert levels. Many people seem to have an
intuition about how alerts should work that doesn't necessarily match how
alerts actually work or how others believe alerts should work.

Formalizing an alerting specification will allow developers and administrators
to speak a common language and have a common understanding, even if the concepts
seem unintuitive initially.

### Risks and Mitigations

People will make wrong assumptions about alerts and how to handle them. This
is not unique to alerts; people often believe certain components should behave
in a particular way when they do not. This is an artifact of poor
documentation that leaves too much to the imagination.

We can work around this problem with clear and concise documentation.

## Design Details

### Critical Alerts

TL;DR: For alerting on current and impending disaster situations.

Timeline: ~5 minutes.

Reserve critical level alerts only for reporting conditions that may lead to
loss of data or inability to deliver service for the cluster as a whole.
Failures of most individual components should not trigger critical level alerts,
unless they would result in either of those conditions. Configure critical level
alerts so they fire before the situation becomes irrecoverable. Expect users to
be notified of a critical alert within a short period of time after it fires so
they can respond with corrective action quickly.

Some disaster situations are:
* loss or impending loss of etcd-quorum
* etcd corruption
* inability to route application traffic externally or internally
(data plane disruption)
* inability to start/restart any pod for reasons other than capacity. E.g., if the API server
Contributor:

does this imply that the kube-apiserver is considered critical? Perhaps we should enumerate those binaries which are "special".

Contributor:

Anything that can affect an SLA is critical. kube-apiserver is the only SLA our managed product is contracted to uphold. However, I don't think every alert for kube-apiserver should be set to critical.

Anything that can affect an SLO (which are often company and SRE team specific), should be a warning, because SLO thresholds will be set more aggressively than the SLA that they support.

Member:

Collapsing discussion from this earlier thread about this same line, I still don't understand the capacity carve-out. If a pod failing to schedule is a problem, then the reason the pod failed to schedule helps admins figure out how to resolve/mitigate, but the problem exists regardless of the reason. E.g. "I have no API-server pods, and am failing to schedule more" is a disaster. If the reason is "because I'm out of CPU", it's still a disaster.

Contributor Author:

E.g. "I have no API-server pods, and am failing to schedule more" is a disaster. If the reason is "because I'm out of CPU", it's still a disaster.

I think this example is contrary to symptom-based alerting. Also, the API-server pods have a very high priority, so capacity should never be an issue on a properly (e.g., default or higher) sized node.

Pods failing to schedule due to lack of capacity is not a problem. Sometimes clusters fill up, and sometimes this is desired. One use case: run 1 million of pod X, low priority, preemptible.

were to restart or need to be rescheduled, would it be able to start again?

In other words, critical alerts are something that require someone to get out
of bed in the middle of the night and fix something **right now** or they will
be faced with a disaster.

Comment:

clarify "disaster"

Contributor Author:

Something that requires use of disaster recovery docs.

Comment:

Clarify in the proposal if possible please


An example of something that is NOT a critical alert:
Contributor (@simonpasquier, Mar 4, 2021):

Perhaps we need a less controversial example of an alert that shouldn't be critical? I can propose ThanosQueryRangeLatencyHigh: Thanos Query isn't critical to ensuring that user workloads keep running, so it shouldn't fire critical alerts.

I'd also like to see an example of an alert that is unambiguously critical (e.g. etcdInsufficientMembers).

Contributor:

@simonpasquier would you be able to supply the specific alert definitions in addition to the names so that it can be included here as a reference?

Member:

Maybe PrometheusRemoteStorageFailures? The fact that stuff isn't getting pushed to telemeter is an internal problem for us, but customers don't really have a reason to care.

* MCO and/or related components are completely dead/crash looping.

Comment:

That's to say that the MCO itself is not relevant to the cluster's health, which is not true for all cluster setups.

Contributor Author:

The MCO being down does not create a disaster situation. The MCO does not need to be alive to run existing workloads or to keep the API available (scheduling new/replacement workloads), and its being down does not jeopardize etcd quorum.

Member:

The MCO might need to be alive to run existing workloads. For example, if your pull-secret needs rotating, or your Kube-API CA is getting rolled, etc., those are things that get pushed out to nodes via the machine-config operator. And if that isn't getting pushed out, the nodes may die. And possibly be destroyed. And new nodes (replacements or autoscaling increases) will not come up without sufficient machine-config stack health. So lots of vulnerabilities. But vulnerabilities are not necessarily page-at-midnight until the lack of capacity or stale-pull-secret or other fallout is actually biting you. And "core MCO operator is dead" shouldn't happen all that often anyway. Is this really something where noise is a problem? Or are you extrapolating from a different machine-config alert that you do find too noisy?

Contributor Author:

It sounds like in this case there should be independent monitoring of this particular situation for kube-api CAs getting rolled. So, the alert would be something along the lines of "CA not synced && MCO dead".

Member:

"&& MCO dead" doesn't sound like something the CA-generator should be taught to care about. But kubelets could provide metrics about remaining CA time, and there could be a new alert about those expiration times dropping under a given threshold? I think it's fine to leave it to the user or local Insights-style rules to consolidate a collection of distinct alerts into diagnoses. There is a distinction between "there is a problem" (alerts) and "I have a theory for the root cause" (diagnoses), and if we attempt to only target diagnoses with alerts, we risk silence when there is some new failure mode or presentation that doesn't match an existing diagnosis fingerprint.

Comment:

I think this is the wrong example. The MCO is certainly on the critical path and should alert appropriately as a critical component, for all the reasons above and because it is the actor behind a lot of control plane actions. Alerting in the kube manager with "Cert rotation stuck" without telling me where to look is almost useless. But alerting in the kube manager with "Cert rotation stuck. Rotation path: Manager -> MCO -> node" gives me a hint about what to check and which alerts to look for next. So if I don't see an MCO alert in critical, I know it works fine and go to the next in line.
Every component and alert needs to understand its dependency flow in the system overall.

s/MCO/samples operator/ and your example should be good.

Contributor Author:

I disagree. If cert rotation is the thing that will drag the cluster down, then we need to set an alert for that specifically. We also aren't (shouldn't be?) waiting until the last possible minute to do cert rotation; we have plenty of time to take action before the certs expire, on the order of weeks IIRC. So, if you have an MCO that goes down, it's not immediately critical. Yes, if you completely neglect your cluster for weeks on end, it will fail, but the MCO itself going down is not going to prevent application communication.


Even though the above is quite significant, there is no risk of immediate loss
of the cluster or running applications.

You might be thinking to yourself: "But my component is **super** important, and
if it's not working, the user can't do X, which the user definitely will want." All
that is probably true, but that doesn't make your alert "critical"; it just makes
it worthy of an alert.

The group of critical alerts should be small, very well defined,
highly documented, polished and with a high bar set for entry. This includes a
mandatory review of a proposed critical alert by the Red Hat SRE team.
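
For illustration, here is a minimal sketch of what a critical alert meeting these criteria could look like. It is loosely modeled on the upstream etcd membership alert, but the alert name, namespace, expression, and runbook path below are hypothetical placeholders, not a proposed rule:

```yaml
# Hypothetical sketch only: a critical alert reserved for a cluster-wide
# disaster condition (etcd losing quorum), with a short 'for' window.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-critical-alerts        # placeholder name
  namespace: openshift-example         # placeholder namespace
spec:
  groups:
  - name: example-critical.rules
    rules:
    - alert: ExampleEtcdQuorumAtRisk   # component-prefixed, CamelCase
      # Count healthy members with '== bool 1' so that down members count as 0
      # instead of dropping out of the vector.
      expr: sum(up{job="etcd"} == bool 1) < ((count(up{job="etcd"}) + 1) / 2)
      for: 5m                          # disaster is imminent, so fire quickly
      labels:
        severity: critical
      annotations:
        summary: etcd has too few healthy members to maintain quorum.
        description: >-
          Fewer than a majority of etcd members are healthy. The cluster cannot
          tolerate another member failure. Follow the disaster recovery runbook
          immediately.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/ExampleEtcdQuorumAtRisk.md
```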

### Warning Alerts
Contributor:

I propose to use ClusterNotUpgradeable as an example of a good warning alert.

Contributor:

@simonpasquier could you document and land an example origin e2e test in the actual CI system to point to?


TL;DR: Fix this soon, or some things won't work and upgrades will probably be
blocked.
Contributor:

Taking a similar stab at rephrasing this section:

Use warning level alerts for reporting conditions that may lead to inability to deliver individual features of the cluster, but not service for the cluster as a whole. Most alerts are likely to be warnings. Configure warning level alerts so that they do not fire until components have sufficient time to try to recover from the interruption automatically. Expect users to be notified of a warning, but for them not to respond with corrective action immediately.

Contributor Author:

Great wording here, thanks!

Contributor:

agree, thanks doug


If your alert does not meet the criteria in "Critical Alerts" above, it belongs
to the warning level or lower.

Use warning level alerts for reporting conditions that may lead to inability to
deliver individual features of the cluster, but not service for the cluster as a
whole. Most alerts are likely to be warnings. Configure warning level alerts so
that they do not fire until components have sufficient time to try to recover
from the interruption automatically. Expect users to be notified of a warning,
but for them not to respond with corrective action immediately.

Timeline: ~60 minutes

60 minutes? That seems high! That's because it is high. We want to reduce
the noise. We've done a lot to make clusters and operators auto-heal
themselves. The whole idea is that if a condition has persisted for more than
Member:

Upstream alert levels from here:

* Critical: An issue that needs to page a person to take instant action.
* Warning: An issue that needs to be worked on, but in the regular work queue or during office hours rather than paging the on-call.
* Info: Meant to support a troubleshooting process by informing about a non-normal situation for one or more systems, but not worth a page or ticket on its own.

For warnings, I'd be happier with "if you are very confident that this will not auto-heal, even if it hasn't been 60m". An example would be AlertmanagerFailedReload, which, as I read it, will fire after 15m without a successful load. The expectation is that someone's fumbled the config, and only an admin correcting the config will recover things. I'd personally be happier if the alert manager copied verified configs over into some private location, so "admin fumbles a config touch" didn't leave you exposed to losing all alert manager functionality if the pod restarted. The fact that config fumbles leave you vulnerable to alert-manager-loss today, and alert-manager being the thing paging you at midnight for other issues, makes the current critical there make sense to me. If the alert manager did have valid-config-caching that survived pod restarts/reschedules, I'd rather have the alert be a warning, so no midnight page, but I'd still be comfortable waiting only 15 minutes before firing, because you want the admin to come back and fix the config they broke before they clock out for the night.

Contributor:

In practice, AlertmanagerFailedReload would only fire if someone touches the Alertmanager configuration, which means that there's someone already awake. It might change when we allow users to manage their own AlertmanagerConfig resources, but the operator should be able to validate that the configs aren't broken.

> if the alert manager copied verified configs over into some private location

In principle yes but given the current design of Alertmanager, it isn't trivial. As stated above, the Prometheus operator should take care of checking that at least the configuration is syntactically valid.

Member:

> ...only fire if someone touches the Alertmanager configuration which means that there's someone already awake.

I used that same argument internally when I tried to argue that CannotRetrieveUpdates made sense at critical, because it was firing was because OSD tooling had set a bogus spec.channel leading to VersionNotFound. It still got demoted to warning in openshift/cluster-version-operator#509. Assuming that an awake human is the one touching your config is apparently a leaky argument ;).

> In principle yes but given the current design of Alertmanager, it isn't trivial...

The way it works today doesn't seem to be bothering folks too often, and it makes sense to have "lots of work to solidify guards for a rare situation" be lower priority than lots of other things.

Contributor:

CannotRetrieveUpdates is a bit different IMHO since it can probably wait until tomorrow morning (unless I'm wrong about the scope). If the Alertmanager config is broken, you're only one step away from losing all alert notifications in case all your Alertmanager pods get restarted.
But I don't think we need to argue on this specific alert here and I'd be fine if someone wants it to be demoted to "warning" because it causes them pain :)

Contributor Author:

Is Alertmanager rolled out and gated on readiness during changes? Seems like catastrophe is already being prevented here.

Member:

> ...can probably wait until tomorrow morning (unless I'm wrong about the scope)...

Depends on the alerts you're missing. If there are no available updates, not being able to fetch them doesn't matter. If someone is DoSing your network to keep you from hearing about a critical CVE update, then not hearing about them means attackers have longer to exploit the flaw before you wake up and start working the update-retrieval issue.

> Is Alertmanager rolled out and gated on readiness during changes?

You are currently safe as long as there is an alertmanager around which was running the old config. But that's not a resilient position. For example, someone bumps a MachineConfig and the machine-config operator rolls your compute pool, and now your Alertmanagers are all dead. Maybe PDBs protect you from some of that? But OSD allows customers to configure limited respect for PDBs on the ~hours scale, and sometimes machines go down hard, without allowing time for graceful PDB guards.

60 minutes, it's unlikely to be resolved at any later time without intervention.

#### Example 1: All machine-config-daemons are crashlooping

That seems significant
and it probably is, but nobody has to get out of bed to fix this **right now**.
But what if a machine dies and I can't get a replacement? Yeah, that
is probably true, but how often does that happen? Also, that's an entirely
Member:

"doesn't happen often" is not an argument for "not critical". The argument for this not being critical is that there is no risk of data-loss or customer impact inherent in a machine dying, or failing to provision a new one, or rolling out new secrets to nodes, or all the other things that the machine-config does. The critical impact would be "my customer-facing workload is failing to schedule because I have no capacity" or "I am on the edge of etcd-quorum loss because I cannot revive my third control-plane node". Those are critical. Maybe they are due to the machine-config stack being sad, so it's good to have warning-level machine-config alerts. But I don't think that they need to be critical-level machine-config alerts. But "happens rarely" is an argument for "even if the severity is wrong, fixing it is not all that important", not an argument for what the severity should be.

Contributor Author:

"my customer-facing workload is failing to schedule because I have no capacity"

No, we can't own this. Users should have their own monitoring for apps that fail to schedule. There are all sorts of capacity user stories, and users need to use careful consideration in this area.

The machine-api does not replace failed instances anyway, MHC does that.

I disagree that frequency is not a component of severity. If clusters routinely lost 1 instance an hour (in some hypothetical reality... there's a joke to be made about a cloud provider in here lol) just as a matter of course, then having a functioning machine-api would be critical.

Member:

> Users should have their own monitoring for apps that fail to schedule.

User workloads should have "failed to schedule" monitoring so they can stay under the current caps and quotas. But cluster admins should have "limited capacity is impacting user workloads" monitoring so they can decide when they want to grow quota. This is the same as the autoscaler, which definitely cares about user workload capacity today, despite being managed by cluster admins. The alerts would fill the manually-scaled use case, and also cover the "autoscaler is maxed / dead" use case.

> The machine-api does not replace failed instances anyway, MHC does that.

I thought the MHC killed dead/degraded machines/nodes, but that the machine API then provisioned the replacement Machine to get the MachineSet back up to its spec.replicas. If you have an MHC on a machine/node that is not part of a MachineSet, is a replacement created? I'd have expected no replacement to be created.

> If clusters routinely lost 1 instance an hour (in some hypothetical reality... then having a functioning machine-api would be critical.

But it would not be a critical alert. Losing your customer-facing workloads or living on the edge of etcd quorum loss would be a critical alert. And then there would be machine API is dead, high machine death rate, and unable to schedule many pods due to limited overall cluster capacity warning alerts to help point the responding admin at likely root causes.

Contributor Author:

"limited capacity is impacting user workloads" monitoring so they can decide when they want to grow quota

This assumes having pending pods is a problem. Some clusters might schedule very large batches of jobs; having limited capacity is not a problem for them, they just want to chew through the jobs. Like I said, there are all sorts of capacity user stories. A cluster can run indefinitely with 0 capacity.

> If you have an MHC on a machine/node

I can't recall MHC specifics, but it's not intended to delete machines that aren't part of a machineset today. The machineset creates the replacement machine, but in effect, you can paraphrase that the MHC replaces the failed machine.

separate set of concerns.

#### Example 2: I have an operator that needs at least 3 of something to work
* With 2/3 replicas, warning alert after 60m.
* With 1/3 replicas, warning alert after 10m.
* With 0/3 replicas, warning alert after 5m.

Q: How long should we wait until the above conditions rises to a 'critical'
alert?

A: It should never rise to the level of a critical alert; it's not critical. If
it were, this section would not apply.
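
To make Example 2 concrete, here is a hedged sketch of what the tiered rules could look like inside a PrometheusRule `rules:` list. The operator name and the `example_operator_ready_replicas` metric are assumptions for illustration, not real OpenShift names:

```yaml
# Hypothetical sketch: tiered warning alerts for an operator that wants 3 replicas.
# 'example_operator_ready_replicas' is an assumed gauge of ready replicas.
- alert: ExampleOperatorDegraded
  expr: example_operator_ready_replicas == 2
  for: 60m                 # 2/3 replicas: lots of headroom, give it time to self-heal
  labels:
    severity: warning
  annotations:
    summary: Example operator is running 2 of 3 desired replicas.
    description: One replica has been unavailable for over 60 minutes and is unlikely to recover on its own.
- alert: ExampleOperatorSeverelyDegraded
  expr: example_operator_ready_replicas == 1
  for: 10m                 # 1/3 replicas: less headroom, shorter wait
  labels:
    severity: warning
  annotations:
    summary: Example operator is running 1 of 3 desired replicas.
    description: Two replicas have been unavailable for over 10 minutes.
- alert: ExampleOperatorDown
  expr: example_operator_ready_replicas == 0
  for: 5m                  # 0/3 replicas: still warning, never critical (see the Q&A above)
  labels:
    severity: warning
  annotations:
    summary: Example operator has no running replicas.
    description: All replicas have been unavailable for over 5 minutes.
```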

#### Example 3: I have an operator that only needs 1 replica to function

If the cluster can upgrade with only 1 replica, and the service is
available despite other replicas being unavailable, this can probably
be just an info-level alert.
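
Along the same lines, a sketch of what such an info-level rule could look like for a component that is fully functional with a single replica; the alert name and metric are again hypothetical placeholders:

```yaml
# Hypothetical sketch: redundancy is reduced but the service is still available,
# so this only informs; it does not page or demand immediate action.
- alert: ExampleComponentRedundancyReduced
  expr: example_component_ready_replicas == 1   # assumed metric; 1 replica is enough to function
  for: 60m
  labels:
    severity: info
  annotations:
    summary: Example component is running without redundancy.
    description: Only one replica is available; the service and upgrades are unaffected.
```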

#### Q: What if a particular condition will block upgrades?

A: It's a warning level alert.


### Alert Ownership
Contributor:

I'm strongly in favor of this aspect of the proposal. If the proposal remains contentious/unmerged overall, I'd like to see this aspect split out. I bet the monitoring team has an interest in helping this along.

Contributor:

This is indeed the only way to quickly improve and have product alerting be sustainable into the future. However, I have not seen an explicit ack on this from OCP engineering at a high enough level + documented/sign-off level.

Member:

Spreading out alert ownership without clear and simple guidelines on how to create them to go along with that might lead to a lot of headache for alert consumers down the pipe.


Previously, the bulk of our alerting was handled directly by the monitoring
Contributor:

I've done the numbers and here is the current breakdown on a 4.7 cluster which could be useful to add to the proposal as a datapoint.

* Alerts shipped by the cluster monitoring operator (CMO)
  * critical: 43
  * warning: 82
* Other alerts
  * critical: 14
  * warning: 35

Diving into the alerts shipped by CMO:

* alerts for etcd, Kubelet, API server, Kube controller and scheduler. They live in the CMO repository for historical reasons and the plan is to move them to their respective repositories.
  * etcd: 12 alerts
  * control plane: 12
  * kubelet: 6
* alerts for the monitoring components (Prometheus, Alertmanager, ...)
  * 57 alerts
* alerts for node metrics (clock skew, filesystems filling up, ...)
  * 17 alerts
* alerts based on kube-state-metrics metrics like KubePodNotReady, KubePodCrashLooping, ...
  * 25 alerts

Contributor:

@simonpasquier who owns that plan and where can I track it?

Contributor:

openshift/cluster-monitoring-operator#1076 for the control plane alerts and openshift/cluster-monitoring-operator#1085 for the etcd alerts. We need to synchronize with their respective teams for the handover.

team. Initially, this gave us some insight into platform status without
much effort by each component team. As we mature our alerting system, we should
ensure all teams take ownership of alerts for their individual components.

Going forward, teams will be expected to own alerting rules within their own
Contributor:

We should enumerate the ownership responsibilities, including

  1. first in line to receive the bug for an alert firing
  2. responsible for describing, "what does it mean" and "what does it do"
  3. responsible for choosing the level of alert
  4. responsible for deciding if the alert even matters

I think we could get some improvement by categorizing alerts by owner so we can push for better refinement, but that could be a later step.

Comment:

👍 Totally agree with what David wrote, but assuming we have clear guidelines about the critical/warning/info alert levels described above in this document.

Contributor:

As I said earlier, I don't think writing this in this PR constitutes active acceptance and acknowledgement by the stakeholders. Want to avoid surprising folks with new responsibilities they had no heads-up to.

repositories. This will reduce the burden on the monitoring team and better
enable component owners to control their alerts.

Additional responsibilities include:

1. First in line to receive the bug for an alert firing
1. Responsible for describing, "what does it mean" and "what does it do"
1. Responsible for choosing the level of alert
1. Responsible for deciding if the alert even matters


### Documentation Required

1. The name of the alerting rule should clearly identify the component impacted
by the issue (for example etcdInsufficientMembers instead of
InsufficientMembers, MachineConfigDaemonDrainError instead of MCDDrainError).
It should use camel case, without whitespace, starting with a capital letter. The
first part of the alert name should be the same for all alerts originating from
the same component.
1. Alerting rules should have a "severity" label whose value is either info,
warning, or critical (matching what we have today and setting aside the
discussion of whether we want minor or not).
1. Alerting rules should have a description annotation providing details about
Contributor:

I would like to see a follow-up to this that adds examples that people can prescriptively follow, maybe with 1-3 cherry-picked examples of "good descriptions" including some guidance. Or we can put that in a "how to write alerts" doc and link it here.

what is happening and how to resolve the issue.
1. Alerting rules should have a summary annotation providing a high-level
description (similar to the first line of a commit message or email subject).
1. If there's a runbook in https://github.com/openshift/runbooks, it should be
linked in the runbook_url annotation.
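
Pulling the requirements above together, here is a hedged sketch of a rule that follows each numbered point. The alert name, metric, and runbook path are hypothetical placeholders rather than an existing OpenShift rule:

```yaml
# Hypothetical example annotated against the numbered requirements above.
- alert: MachineConfigDaemonExampleDrainStuck   # (1) component prefix, camel case, no whitespace
  expr: mcd_example_drain_stuck > 0             # assumed placeholder metric
  for: 60m
  labels:
    severity: warning                           # (2) one of info, warning, critical
  annotations:
    # (4) summary: short and high-level, like a commit subject line
    summary: Node drain has been stuck for more than 60 minutes.
    # (3) description: what is happening and how to resolve it
    description: >-
      The machine-config daemon on {{ $labels.node }} has been unable to drain
      the node for over an hour, often because a PodDisruptionBudget is blocking
      eviction. Inspect pending evictions on the node and adjust or remove the
      blocking workload.
    # (5) runbook_url: link to the runbook, if one exists in openshift/runbooks
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/MachineConfigDaemonExampleDrainStuck.md
```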


### Open Questions [optional]

* Explore additional alerting severity (e.g., 'minor') for triage purposes

### Test Plan
Contributor:

I would actually love to see a test plan here; adding e2e tests to openshift/origin that verify some best practices for alerts would be good. Otherwise we are in the same spot in 6 months :) WDYT?

Contributor Author:

I'm not opposed to a test plan, but I don't have one ATM. I'm open to suggestion here.

Member (@wking, Feb 10, 2021):

Take out a few compute machines. Wait an hour. Confirm that no critical alerts are firing, or were ever firing.

Contributor:

A few ideas:

* Ensure that all alerts have a severity label and comply with the list of allowed values.
* Ensure that no critical alert gets added without a sign-off from teams that are on-call (e.g. OSD).
* Ensure that all alerts have mandatory annotations (e.g. description and summary).

Contributor:

Do you actually need to test the alert condition duration, or can this be sped up? E.g., do I actually need to take out some machines to see if the alert would fire? If I do need to take them out, do I have to wait for the hour period?

In terms of testing for alerts, I was wondering if we could have some more static-analysis-style rules, but perhaps that's a bit naive at this level.

Contributor:

I would like folks' opinion on whether we need to resolve a test plan approach for alerts in the context of merging this enhancement.

Contributor:

Depending on what we agree about rule conventions and guidelines, we should translate as much as we can into origin e2e tests. E.g. there should be a test validating that all alert rules have a severity label matching the set of allowed values. Same goes for the summary and description annotations if we agree that they are mandatory.

Asking teams to write unit tests for all rules using promtool is going to have a small benefit IMHO. Unit tests are mostly useful when joining metrics from different sources; I expect that most of the rules don't fall in this category.

Member:

I would like a test plan, because policies that are not backed up by automated tests are hard to preserve. Doesn't mean we need to have the tests implemented before we can start pushing towards new policy, but we should at least have an idea of what the tests will look like.

Contributor:

New e2e tests in openshift/origin:

* Test validating that all alerting rules have a severity label matching the set of allowed values.
* Test validating that all alerting rules have summary and description annotations.
* Test validating that all runbook_url annotations point to an existing web page.

Contributor:

@simonpasquier link to these tests?


## Implementation History

## Drawbacks

People might have rules around the existing broken alerts. They will have to
Contributor:

I believe it would be best to help folks with this; we have tried in the monitoring team, but it's often not as easy as just saying they will have to change some alerts. So agreed, this is a drawback, but it is also the correct end result of this.

Contributor:

Indeed, and my experience so far is that teams are happy to have a 2nd set of eyes. So I hope this ends up being a non-issue.

change some of these rules.

## Alternatives

Document policies, but do not make any backwards-incompatible changes to the
existing alerts; only apply the policies to new alerts.
Contributor:

Realizing you're listing this because it's in the template, but I feel compelled to say it is not an acceptable alternative from the SRE standpoint. We have to do two things: 1) improve existing noise, and 2) put in place a process (this doc) to ensure we don't regress.


### Graduation Criteria
None

#### Dev Preview -> Tech Preview
None

#### Tech Preview -> GA
None

#### Removing a deprecated feature
None

### Upgrade / Downgrade Strategy
None

### Version Skew Strategy
None