Add: Alerting Standards #637
---
title: alerting-consistency
authors:
- "@michaelgugino"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-02-03
last-updated: 2021-02-03
status: implementable
---

# Alerting Consistency

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

Clear and actionable alerts are a key component of a smooth operational
experience. Ensuring we have clear and concise guidelines for our alerts
will allow developers to better inform users of problem situations and how to
resolve them.
|
Contributor
+1, I really like the idea of creating consistent guidelines for not only the alerts, but the documentation to support them. I am also hugely in favor of bringing the focus of generating those artifacts closer to the project development.

## Motivation
|
Contributor
The motivation for this enhancement needs to lead with "Establish implementation guidelines for developers writing alerts for their component". All other benefits are a by-product of that.

Improve the signal-to-noise ratio of alerts. Align the alert levels "Critical, Warning, Info"
with operational expectations to allow easier triaging. Ensure current and
future alerts have documentation covering problem/investigation/solution steps.

### Goals

* Define clear criteria for designating an alert 'critical'
* Define clear criteria for designating an alert 'warning'
|
Contributor
Do we only care about those 2 levels, or is the goal to describe criteria for all of the levels?

Contributor (Author)
You tell me. I just wanted to get the conversation started. I would prefer to have 3 alert levels: "Critical", "Major" and "Minor". Since that ship might have already sailed, I am focusing on what we have today.

Contributor
Other levels are mentioned in the text below, so I wasn't sure if you were just focusing on the top 2 as a start or if there was another reason they weren't included here. I'd be OK with starting by nailing down the criteria for critical and warning and seeing where that takes us.

Contributor
Same here. It is listed as TBD below. Not a blocker for initial merge of this IMO.

* Define minimum time thresholds prior to triggering an alert
|
That's unrealistic. Certain conditions can't be recovered from on their own and/or need instant action, because in a high number of cases the component doesn't recover without interaction. There doesn't need to be an arbitrary wait time.

Contributor (Author)
I disagree. This is about consistency. The point of this document is to discuss what we think alerting and related elements should look like. There's no reason not to define a standard set of criteria. Anyone that thinks they have a special use case is probably wrong.

Member
This seems like it would be easy to audit. For example, alerts installed via the CVO that do not set `for`:

    $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.7.0-rc.0-x86_64
    $ for x in $(grep -lr '^kind: PrometheusRule' manifests); do yaml2json < "${x}" | jq -r 'if (. | type) == "object" then [.] else . end | .[] | select(.kind == "PrometheusRule").spec.groups[].rules[] | select(.alert != null and .for == null) | .labels.severity + " " + .alert + " " + .expr'; done | sort | uniq
    critical MCDRebootError mcd_reboot_err > 0
    warning FailingOperator csv_abnormal{phase="Failed"}
    warning ImageRegistryStorageReconfigured increase(image_registry_operator_storage_reconfigured_total[30m]) > 0
    warning KubeletHealthState mcd_kubelet_state > 2
    warning MCDPivotError mcd_pivot_err > 0
    warning SystemMemoryExceedsReservation sum by (node) (container_memory_rss{id="/system.slice"}) > ((sum by (node) (kube_node_status_capacity{resource="memory"} - kube_node_status_allocatable{resource="memory"})) * 0.9)

How do we feel about those?

Contributor (Author)
"SystemMemoryExceedsReservation" is definitely one that needs a threshold. I have some questions about the utility of this particular alert, but that's outside the scope of this discussion.

Contributor
Pretty much any alert that spuriously fires (in the sense that no action is needed; it's going to be fine once the upgrade finishes) during upgrade today is a candidate for "this alert needs to wait longer before firing", IMHO. I'm also very interested to see examples where the time it takes an admin to see, process, and take action to resolve an alert doesn't significantly dominate the time between the condition happening and the alert firing. That is to say: if it takes an admin 30 minutes to fix something they were alerted to, then does it matter whether we alerted them 0 minutes or 5 minutes after it started happening? Particularly if by waiting 5 minutes we may be able to avoid alerting them at all (because the situation does resolve itself, such as in the case of most alerts caused by node reboots).

Contributor (Author)
I tried to capture some timeline standards in the individual alerting sections. I would say, for critical alerts, they can probably be added to a 0m, 5m or 15m bucket. 0m bucket: etcd quorum is lost; we don't know how long this cluster might be alive, so fire right away and hope for the best. Back to the upgrade bit, we shouldn't fire any alerts during upgrade; however, I don't think timing is the core issue, at least for the critical alerts. The vast majority of alerts could be Warning or lower severity, with a 60m timeout. This achieves a number of things. First, it gives time for the situation to self-correct, or for a helper like MachineHealthChecks to remediate a problem. Second, it allows us to know that it's probably not going to self-correct. Third, it keeps your alert screen from having tons of alerts at once; especially if there are critical alerts firing, we don't need the extra distractions. Many of our alerts today are 'raw' alerts, IMO. Things like 9/10 daemonsets are running for MCD. Does this need an alert? It would be nice if we could aggregate some things into node-specific issues. E.g., if sdn-daemon is down on a given node, we should have a 'NodeUnhealthy' alert, and aggregate some conditions onto that screen. I'm not sure how feasible this is today. At the very least, each component's operator could detect the specifics about its operands and a given node. For instance, if a given node is under maintenance, and the daemonset is missing, that's expected. The operator is the layer we can utilize to understand this state.

Contributor
I do think we need to provide numbers as guidance, if only from a hygiene standpoint. Otherwise we're not going to make progress on standardizing, which is the purpose of this PR. Even if those numbers need tweaking in the end, or if they don't make sense on a particular alert, we can't reasonably ask a component team to write alerts without this basic guidance. This whole process is iterative and we have to start with something.

Isn't the wording on this a bit doubled now?

* Define ownership and responsibilities of existing and new alerts
* Establish clear documentation guidelines
|
Contributor
I want to call out the runbooks repo here. The idea is that every alert comes with a runbook, and they're open source, part of the product code-base: https://github.com/openshift/runbooks. We had some discussion with the support team on this and they ack'd use of the runbooks repo (the alternative suggested was KCS, but was decided against).

PLM strongly suggests the use of KCS vs. a GitHub repo, as it creates a more on-brand and controlled space (not influenced by the community).

* Outline operational triage of firing alerts
|
Contributor
Does it cover runbooks as per @jeremyeder's comment above? If not, what is the intent?

Contributor
We should pull this in as one of the major tenets of this enhancement. It's a major win from SRE and the general "alert consumer" PoV. It should be the responsibility of the alert author, so that it will be lifecycled along with the alert (e.g. runbook CRUD when alert CRUD).

### Non-Goals

* Define specific alert needs
* Implement user-defined alerts

## Proposal

### User Stories

#### Story 1

As an OpenShift developer, I need to ensure that my alerts are appropriately
tuned to allow end-user operational efficiency.
|
Contributor
Less ambiguous: OpenShift alerts must be optimized for operational efficiency.

#### Story 2

As an SRE, I need alerts to be informative, actionable and thoroughly
documented.

### Implementation Details/Notes/Constraints [optional]

This enhancement will require some developers and administrators to rethink
|
Contributor
Not sure what the purpose of this paragraph is; consider dropping?

their concept of alerts and alert levels. Many people seem to have an
intuition about how alerts should work that doesn't necessarily match how
alerts actually work or how others believe alerts should work.

Formalizing an alerting specification will allow developers and administrators
to speak a common language and have a common understanding, even if the
concepts seem unintuitive initially.

### Risks and Mitigations

People will make wrong assumptions about alerts and how to handle them. This
is not unique to alerts; people often believe certain components should behave
in a particular way when they do not. This is an artifact of poor
documentation that leaves too much to the imagination.

We can work around this problem with clear and concise documentation.

## Design Details

### Critical Alerts

TL/DR: For alerting current and impending disaster situations.

Timeline: ~5 minutes.

Reserve critical level alerts only for reporting conditions that may lead to
loss of data or inability to deliver service for the cluster as a whole.
Failures of most individual components should not trigger critical level alerts,
unless they would result in either of those conditions. Configure critical level
alerts so they fire before the situation becomes irrecoverable. Expect users to
be notified of a critical alert within a short period of time after it fires so
they can respond with corrective action quickly.

Some disaster situations are:
* loss or impending loss of etcd quorum
* etcd corruption
* inability to route application traffic externally or internally
  (data plane disruption)
* inability to start/restart any pod for reasons other than capacity. E.g., if the API server
|
Contributor
Does this imply that the kube-apiserver is considered critical? Perhaps we should enumerate those binaries which are "special".

Contributor
Anything that can affect an SLA is critical. kube-apiserver is the only SLA our managed product is contracted to uphold. However, I don't think every alert for kube-apiserver should be set to critical. Anything that can affect an SLO (which are often company and SRE team specific) should be a warning, because SLO thresholds will be set more aggressively than the SLA that they support.

Member
Collapsing discussion from this earlier thread about this same line, I still don't understand the capacity carve-out. If a pod failing to schedule is a problem, then the reason the pod failed to schedule helps admins figure out how to resolve/mitigate, but the problem exists regardless of the reason. E.g. "I have no API-server pods, and am failing to schedule more" is a disaster. If the reason is "because I'm out of CPU", it's still a disaster.

Contributor (Author)
I think this example is contrary to symptom-based alerting. Also, the API-server pods have a very high priority, so capacity should never be an issue on a properly (e.g., default or higher) sized node. Pods failing to schedule due to lack of capacity is not a problem. Sometimes clusters fill up; sometimes this is desired. One use case: run 1 million of pod X, low priority, preemptible.

  were to restart or need to be rescheduled, would it be able to start again?

In other words, critical alerts are something that requires someone to get out
of bed in the middle of the night and fix something **right now** or they will
be faced with a disaster.
|
Clarify "disaster".

Contributor (Author)
Something that requires use of the disaster recovery docs.

Clarify in the proposal if possible, please.

An example of something that is NOT a critical alert:
|
Contributor
Perhaps we need a less controversial example of an alert that shouldn't be critical? I can propose [...]. I'd also like to see an example of an alert that is unambiguously critical (e.g. [...]).

Contributor
@simonpasquier would you be able to supply the specific alert definitions in addition to the names so that it can be included here as a reference?

Member
Maybe PrometheusRemoteStorageFailures? The fact that stuff isn't getting pushed to telemeter is an internal problem for us, but customers don't really have a reason to care.

* MCO and/or related components are completely dead/crash looping.
|
That's to say that the MCO itself is not relevant to the cluster's health, which is not true for all cluster setups.

Contributor (Author)
The MCO being down does not create a disaster situation. The MCO does not need to be alive to run existing workloads, ensure the API is available (schedule new/replacement workloads), or preserve etcd quorum.

Member
The MCO might need to be alive to run existing workloads. For example, if your pull-secret needs rotating, or your Kube-API CA is getting rolled, etc., those are things that get pushed out to nodes via the machine-config operator. And if that isn't getting pushed out, the nodes may die. And possibly be destroyed. And new nodes (replacements or autoscaling increases) will not come up without sufficient machine-config stack health. So lots of vulnerabilities. But vulnerabilities are not necessarily page-at-midnight until the lack of capacity or stale pull-secret or other fallout is actually biting you. And "core MCO operator is dead" shouldn't happen all that often anyway. Is this really something where noise is a problem? Or are you extrapolating from a different machine-config alert that you do find too noisy?

Contributor (Author)
It sounds like in this case there should be independent monitoring of this particular situation for kube-api CAs getting rolled. So, the alert would be something along the lines of "CA not synced && MCO dead".

Member
"&& MCO dead" doesn't sound like something the CA-generator should be taught to care about. But kubelets could provide metrics about remaining CA time, and there could be a new alert about those expiration times dropping under a given threshold? I think it's fine to leave it to the user or local Insights-style rules to consolidate a collection of distinct alerts into diagnoses. There is a distinction between "there is a problem" (alerts) and "I have a theory for the root cause" (diagnoses), and if we attempt to only target diagnoses with alerts, we risk silence when there is some new failure mode or presentation that doesn't match an existing diagnosis fingerprint.

I think this is the wrong example. The MCO for sure is on the critical path and it should alert appropriately as a critical component, due to all the reasons above and because it is the actioner of a lot of control plane actions. And alerting in kube manager for "Cert rotation stuck" without telling me where to look is almost useless. But alerting in kube manager with "Cert rotation stuck. Rotation path: Manager -> MCO -> node" gives me a hint what to check and which alerts to look for next. So if I don't see an MCO alert in critical, I know it works fine and go to the next in line.

Contributor (Author)
I disagree. If cert rotation is the thing that will drag the cluster down, then we need to set an alert for that specifically. We also aren't (shouldn't be?) waiting until the last possible minute to do cert rotation; we have plenty of time to take action on that before the certs expire, on the order of weeks IIRC. So, if you have an MCO that goes down, it's not immediately critical. Yes, if you completely neglect your cluster for weeks on end, it will fail, but the MCO itself going down is not going to prevent application communication.

Even though the above is quite significant, there is no risk of immediate loss
of the cluster or running applications.

You might be thinking to yourself: "But my component is **super** important, and
if it's not working, the user can't do X, which the user definitely will want."
All that is probably true, but that doesn't make your alert "critical"; it just
makes it worthy of an alert.

The group of critical alerts should be small, very well defined,
highly documented, polished and with a high bar set for entry. This includes a
mandatory review of a proposed critical alert by the Red Hat SRE team.
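For illustration, a critical alert that follows these guidelines might look roughly like the PrometheusRule sketch below. The expression approximates the familiar etcd quorum alert; the manifest metadata, annotation wording, and runbook path are hypothetical placeholders, not the rule any component actually ships.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-critical-example        # hypothetical manifest name
  namespace: openshift-etcd
spec:
  groups:
  - name: etcd-critical-example
    rules:
    - alert: etcdInsufficientMembers
      # Approximation of the quorum alert: more than half of the members
      # being down means quorum is lost or about to be lost.
      expr: sum(up{job=~".*etcd.*"} == bool 0) by (job) > (count(up{job=~".*etcd.*"}) by (job) / 2 - 1)
      for: 5m                          # ~5 minute timeline for critical alerts
      labels:
        severity: critical
      annotations:
        summary: etcd cluster has insufficient members; quorum is at risk.
        description: >-
          etcd cluster "{{ $labels.job }}" has {{ $value }} members down.
          Losing quorum means the cluster as a whole cannot deliver service.
          Investigate and recover the affected members immediately.
        runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/etcdInsufficientMembers.md  # hypothetical path
```

The short `for: 5m` window is deliberate: quorum loss is one of the few conditions where waiting longer risks an irrecoverable situation.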
|
|
### Warning Alerts
|
Contributor
I propose to use [...]

Contributor
@simonpasquier could you document and land an example origin e2e test in the actual CI system to point to?

TL/DR: fix this soon, or some things won't work and upgrades will probably be
blocked.
Contributor
Taking a similar stab at rephrasing this section: [...]

Contributor (Author)
Great wording here, thanks!

Contributor
Agree, thanks Doug.

If your alert does not meet the criteria in "Critical Alerts" above, it belongs
to the warning level or lower.

Use warning level alerts for reporting conditions that may lead to inability to
deliver individual features of the cluster, but not service for the cluster as a
whole. Most alerts are likely to be warnings. Configure warning level alerts so
that they do not fire until components have had sufficient time to try to recover
from the interruption automatically. Expect users to be notified of a warning,
but for them not to respond with corrective action immediately.

Timeline: ~60 minutes

60 minutes? That seems high! That's because it is high. We want to reduce
the noise. We've done a lot to make clusters and operators auto-heal
themselves. The whole idea is that if a condition has persisted for more than
|
Member
Upstream alert levels from here: [...]
For warnings, I'd be happier with "if you are very confident that this will not auto-heal, even if it hasn't been 60m". An example would be [...]

Contributor
In practice, [...]
In principle yes, but given the current design of Alertmanager, it isn't trivial. As stated above, the Prometheus operator should take care of checking that at least the configuration is syntactically valid.

Member
I used that same argument internally when I tried to argue that [...]
The way it works today doesn't seem to be bothering folks too often, and it makes sense to have "lots of work to solidify guards for a rare situation" be lower priority than lots of other things.

Contributor (Author)
Is Alertmanager rolled out and gated on readiness during changes? Seems like catastrophe is already being prevented here.

Member
Depends on the alerts you're missing. If there are no available updates, not being able to fetch them doesn't matter. If someone is DoSing your network to keep you from hearing about a critical CVE update, then not hearing about them means attackers have longer to exploit the flaw before you wake up and start working the update-retrieval issue.
You are currently safe as long as there is an Alertmanager around which was running the old config. But that's not a resilient position. For example, someone bumps a MachineConfig and the machine-config operator rolls your compute pool, and now your Alertmanagers are all dead. Maybe PDBs protect you from some of that? But OSD allows customers to configure limited respect for PDBs on the ~hours scale, and sometimes machines go down hard, without allowing time for graceful PDB guards.

60 minutes, it's unlikely to be resolved without intervention any time later.

#### Example 1: All machine-config-daemons are crashlooping

That seems significant,
and it probably is, but nobody has to get out of bed to fix this **right now**.
But what if a machine dies and I can't get a replacement? Yeah, that
is probably true, but how often does that happen? Also, that's an entirely
|
Member
"Doesn't happen often" is not an argument for "not critical". The argument for this not being critical is that there is no risk of data loss or customer impact inherent in a machine dying, or failing to provision a new one, or rolling out new secrets to nodes, or all the other things that the machine-config stack does. The critical impact would be "my customer-facing workload is failing to schedule because I have no capacity" or "I am on the edge of etcd-quorum loss because I cannot revive my third control-plane node". Those are critical. Maybe they are due to the machine-config stack being sad, so it's good to have warning-level machine-config alerts. But I don't think that they need to be critical-level machine-config alerts. But "happens rarely" is an argument for "even if the severity is wrong, fixing it is not all that important", not an argument for what the severity should be.

Contributor (Author)
No, we can't own this. Users should have their own monitoring for apps that fail to schedule. There are all sorts of capacity user stories, and users need to use careful consideration in this area. The machine-api does not replace failed instances anyway; MHC does that. I disagree that frequency is not a component of severity. If clusters routinely lost one instance an hour (in some hypothetical reality... there's a joke to be made about a cloud provider in here, lol) just as a matter of course, then having a functioning machine-api would be critical.

Member
User workloads should have "failed to schedule" monitoring so they can stay under the current caps and quotas. But cluster admins should have "limited capacity is impacting user workloads" monitoring so they can decide when they want to grow quota. This is the same as the autoscaler, which definitely cares about user workload capacity today, despite being managed by cluster admins. The alerts would fill the manually-scaled use case, and also cover the "autoscaler is maxed / dead" use case.
I thought the MHC killed dead/degraded machines/nodes, but that the machine API then provisioned the replacement Machine to get the MachineSet back up to its [...]
But it would not be a [...]

Contributor (Author)
This assumes having pending pods is a problem. Some clusters might schedule very large batches of jobs; having limited capacity is not a problem for them, they just want to chew through the jobs. Like I said, there are all sorts of capacity user stories. A cluster can run indefinitely with 0 capacity.
I can't recall MHC specifics, but it's not intended to delete machines that aren't part of a machineset today. The machineset creates the replacement machine, but in effect, you can paraphrase that the MHC replaces the failed machine.

separate set of concerns.

#### Example 2: I have an operator that needs at least 3 of something to work
* With 2/3 replicas, warning alert after 60m.
* With 1/3 replicas, warning alert after 10m.
* With 0/3 replicas, warning alert after 5m.

Q: How long should we wait until the above conditions rise to a 'critical'
alert?

A: It should never rise to the level of a critical alert; it's not critical. If
it was, this section would not apply.
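For illustration only, the thresholds above might translate into PrometheusRule entries along these lines. The metric name `my_operator_available_replicas` and the alert names are hypothetical, and annotations are omitted for brevity.

```yaml
groups:
- name: my-operator-availability        # hypothetical rule group
  rules:
  - alert: MyOperatorDegraded           # 2/3 replicas available
    expr: my_operator_available_replicas == 2
    for: 60m
    labels:
      severity: warning
  - alert: MyOperatorSeverelyDegraded   # 1/3 replicas available
    expr: my_operator_available_replicas == 1
    for: 10m
    labels:
      severity: warning
  - alert: MyOperatorUnavailable        # 0/3 replicas available
    expr: my_operator_available_replicas == 0
    for: 5m
    labels:
      severity: warning
```

Note that all three rules stay at `severity: warning`; only the `for:` duration tightens as the situation worsens.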
|
|
#### Example 3: I have an operator that only needs 1 replica to function

If the cluster can upgrade with only 1 replica, and the service is
available despite other replicas being unavailable, this can probably
be just an info-level alert.

#### Q: What if a particular condition will block upgrades?

A: It's a warning level alert.

### Alert Ownership
|
Contributor
I'm strongly in favor of this aspect of the proposal. If the proposal remains contentious/unmerged overall, I'd like to see this aspect split out. I bet the monitoring team has an interest in helping this along.

Contributor
This is indeed the only way to quickly improve and have product alerting be sustainable into the future. However, I have not seen an explicit ack on this from OCP engineering at a high enough level + documented/sign-off level.

Member
Spreading out alert ownership without clear and simple guidelines on how to create them to go along with that might lead to a lot of headache for alert consumers down the pipe.

Previously, the bulk of our alerting was handled directly by the monitoring
|
Contributor
I've done the numbers and here is the current breakdown on a 4.7 cluster, which could be useful to add to the proposal as a datapoint: [...]
Diving into the alerts shipped by CMO: [...]

Contributor
@simonpasquier who owns that plan and where can I track it?

Contributor
openshift/cluster-monitoring-operator#1076 for the control plane alerts and openshift/cluster-monitoring-operator#1085 for the etcd alerts. We need to synchronize with their respective teams for the hand over.

team. Initially, this gave us some indication of platform status without
much effort by each component team. As we mature our alerting system, we should
ensure all teams take ownership of alerts for their individual components.

Going forward, teams will be expected to own alerting rules within their own
|
Contributor
We should enumerate the ownership responsibilities, including [...]
I think we could get some improvement by categorizing alerts by owner so we can push for better refinement, but that could be a later step.

👍 Totally agree with what David wrote, but assuming we have clear guidelines about critical/warning/info alert levels described above in this document.

Contributor
As I said earlier, I don't think writing this in this PR constitutes active acceptance and acknowledgement by the stakeholders. Want to avoid surprising folks with new responsibilities they had no heads-up about.

repositories. This will reduce the burden on the monitoring team and better
enable component owners to control their alerts.

Additional responsibilities include:

1. First in line to receive the bug for an alert firing
1. Responsible for describing "what does it mean" and "what does it do"
1. Responsible for choosing the level of alert
1. Responsible for deciding if the alert even matters

### Documentation Required

1. The name of the alerting rule should clearly identify the component impacted
   by the issue (for example etcdInsufficientMembers instead of
   InsufficientMembers, MachineConfigDaemonDrainError instead of MCDDrainError).
   It should be camel case, without whitespace, starting with a capital letter.
   The first part of the alert name should be the same for all alerts
   originating from the same component.
1. Alerting rules should have a "severity" label whose value is either info,
   warning or critical (matching what we have today and staying aside from the
   discussion whether we want minor or not).
1. Alerting rules should have a description annotation providing details about
Contributor
I would like to see a follow-up to this that adds examples that people can prescriptively follow, maybe with 1-3 cherry-picked examples of "good descriptions" including some guidance. Or we can put that in a "how to write alerts" doc and link here.

   what is happening and how to resolve the issue.
1. Alerting rules should have a summary annotation providing a high-level
   description (similar to the first line of a commit message or email subject).
1. If there's a runbook in https://github.com/openshift/runbooks, it should be
   linked in the runbook_url annotation.
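Putting these requirements together, a rule that follows them might look something like the sketch below. The alert name reuses the MachineConfigDaemonDrainError example from item 1; the expression, annotation text, and runbook path are illustrative placeholders rather than the rule the MCO actually ships.

```yaml
- alert: MachineConfigDaemonDrainError   # 1. component-prefixed, camel case
  expr: mcd_drain_err > 0                # illustrative expression only
  for: 60m
  labels:
    severity: warning                    # 2. one of info, warning, critical
  annotations:
    summary: Draining a node failed.     # 4. short, commit-subject style
    description: >-                      # 3. what is happening and how to resolve it
      The machine-config-daemon on {{ $labels.node }} failed to drain the node.
      Check the daemon logs on that node and resolve whatever is blocking
      eviction (for example, a PodDisruptionBudget that cannot be satisfied).
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/MachineConfigDaemonDrainError.md  # 5. placeholder path
```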
|
|
### Open Questions [optional]

* Explore additional alerting severity (e.g., 'minor') for triage purposes

### Test Plan
|
Contributor
I would actually love to see a test plan here; adding tests to openshift/origin e2e which have some best practices tested for alerts would be good. Otherwise we are in the same spot in 6 months :) WDYT?

Contributor (Author)
I'm not opposed to a test plan, but I don't have one ATM. I'm open to suggestions here.

Member
Take out a few compute machines. Wait an hour. Confirm that no critical alerts are firing, or were ever firing.

Contributor
A few ideas: [...]

Contributor
Do you actually need to test the alert condition duration or can this be sped up? E.g., do I actually need to take out some machines to see if the alert would fire? If I do need to take them out, do I have to wait for the hour period? In terms of testing for alerts I was wondering if we could have some more static-analysis-style rules, but perhaps that's a bit naive at this level.

Contributor
I would like folks' opinion on whether we need to resolve a test plan approach for alerts in the context of merging this enhancement.

Contributor
Depending on what we agree about rule conventions and guidelines, we should translate as much as we can into origin e2e tests. E.g. there should be a test validating that all alert rules have a [...]. Asking teams to write unit tests for all rules using [...]

Member
I would like a test plan, because policies that are not backed up by automated tests are hard to preserve. Doesn't mean we need to have the tests implemented before we can start pushing towards new policy, but we should at least have an idea of what the tests will look like.

Contributor
New e2e tests in openshift/origin: [...]

Contributor
@simonpasquier link to these tests?
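As one concrete sketch of the rule unit tests discussed above, assuming Prometheus's `promtool test rules` format and reusing the hypothetical rule file and metric from the Example 2 sketch:

```yaml
# alerts_test.yaml -- run with: promtool test rules alerts_test.yaml
rule_files:
  - my-operator-rules.yaml        # hypothetical file containing the Example 2 rules
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'my_operator_available_replicas'
        values: '2+0x70'          # stuck at 2/3 replicas for 70 minutes
    alert_rule_test:
      - eval_time: 65m
        alertname: MyOperatorDegraded
        exp_alerts:
          - exp_labels:
              severity: warning
```

A check like this could run in each component repository, while an origin e2e test audits the assembled release payload for the required labels and annotations.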

## Implementation History

## Drawbacks

People might have rules around the existing broken alerts. They will have to
|
Contributor
I believe it would be best to help folks with this; we have tried in the monitoring team, but it's often not as easy as just saying they will have to change some alerts. So agreed, this is a drawback, but also the actual correct result of this.

Contributor
Indeed, and my experience so far is that teams are happy to have a second set of eyes. So I hope this ends up being a non-issue.

change some of these rules.

## Alternatives

Document policies, but do not make any backwards-incompatible changes to the
existing alerts and only apply the policies to new alerts.
|
Contributor
Realizing you're listing this because it's in the template, but I feel compelled to say it is not an acceptable alternative from the SRE standpoint. We have to do two things: 1) improve existing noise, and 2) put in place a process (this doc) to ensure we don't regress.

### Graduation Criteria
None

#### Dev Preview -> Tech Preview
None

#### Tech Preview -> GA
None

#### Removing a deprecated feature
None

### Upgrade / Downgrade Strategy
None

### Version Skew Strategy
None
Would be great if we could link to some upstream Prometheus material and blog posts on this topic, so we adhere to best practices and align with them.

Do you have any good links or info you'd like to incorporate here?

Yes, there is plenty; @s-urbaniak wrote a blog post on this. But a couple of things are: [...]
To name a few. Not sure if @s-urbaniak's blog post on this topic is out yet?

I wouldn't refer to those resources as 'upstream' but rather just some content from the web. I think they are probably helpful for establishing a rationale, but I'm not sure that including them is better or worse.

The blog post is being prepared to be published, but I think I will have to nudge again.

Yeah, include a bulleted list of references in the summary section. The Monitoring section in the Google SRE book includes the majority of the guidance to support readers of this document: https://sre.google/sre-book/monitoring-distributed-systems/