Installer: check operators for stability #1189
| @@ -0,0 +1,174 @@ | ||||||
| --- | ||||||
| title: cluster-operator-installation-check | ||||||
| authors: | ||||||
| - @patrickdillon | ||||||
| reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect" | ||||||
| - @wking | ||||||
| - @LalatenduMohanty | ||||||
| approvers: | ||||||
| - @sdodson | ||||||
| - @deads2k | ||||||
| - @bparees | ||||||
| api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" | ||||||
| - None | ||||||
| creation-date: 2021-07-14 | ||||||
| last-updated: yyyy-mm-dd | ||||||
| tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement | ||||||
| - none | ||||||
| see-also: | ||||||
| - "/enhancements/this-other-neat-thing.md" | ||||||
|
Contributor
You could remove this
And all the comments above
Contributor
I now have a desire to place something awesome at this location :) |
||||||
| replaces: | ||||||
| - n/a | ||||||
| superseded-by: | ||||||
| - n/a | ||||||
| --- | ||||||
|
|
||||||
| # Cluster Operator Installation Check | ||||||
|
|
||||||
| ## Summary | ||||||
|
|
||||||
| This enhancement proposes that the OpenShift Installer should check the status | ||||||
| of individual cluster operators and use those statuses as a threshold for | ||||||
| determining whether an install is a success (exit 0) or failure (exit != 0). | ||||||
| Specifically, the Installer should check whether cluster operators have stopped | ||||||
| progressing for a certain amount of time. If the Installer sees that an operator | ||||||
| is Available=true but fails to enter a stable Progressing=false state, the Installer | ||||||
|
Member
Wondering if you have considered when an operator is degraded for a certain amount of time. |
||||||
| will exit non-zero and log an error. | ||||||
|
|
||||||
| ## Motivation | ||||||
|
|
||||||
| This enhancement allows the Installer to identify and catch a class of errors | ||||||
| which are currently going unchecked. Without this enhancement, the Installer can | ||||||
| (and does) declare clusters successfully installed when the cluster has failing components. | ||||||
|
Member
Suggested change
Just because failing here is not always the case. MAPI doesn't consider it a failure that you only have 4 of 5 worker instances because AWS is out of your instance type at the moment.
Contributor
Author
I didn't realize this enhancement would cause installs to fail if a machine is not properly provisioned, but that makes sense. Is that correct?
Contributor
Regardless of MAPI behavior, I agree with the sentiment behind the suggested edit. All we know about the components is that they are still progressing; we don't actually know that they are failing (and if they were truly failing, they should probably be setting available=false or at least degraded=true). In my mind this is more about ensuring that when we tell the user their cluster is ready, it's actually ready and not still tidying up some things, than it is about reporting a failure that we currently ignore (not that that isn't also useful). Do we actually have any data points about how often clusters never "settle", such that these changes to the installer would result in reported failures that today are ignored? I think the more interesting aspect is that with this EP we'll no longer report install complete "prematurely".
Member
Yeah, that's a good summary of why I suggested this change. I don't agree that still progressing is the same as failing. WRT timing, the only thing I know is that service delivery waits up to 20 minutes before handing things off to the customer no matter what the result of their supplemental readiness checks. I'll ask if they've collected data on how long they generally wait and how often they hit the 20 minute cap.
Member
@bparees I guess you meant
Contributor
Whoops, yes, of course. Will edit original comment. |
||||||
| Adding a check for cluster operator stability will allow normal users and managed services teams | ||||||
| to more easily identify and troubleshoot issues when they encounter this class of failure. The change | ||||||
| will also help the Technical Release Team identify this class of problem in CI and triage | ||||||
| issues to the appropriate operator development team. | ||||||
|
|
||||||
| ### User Stories | ||||||
|
Can we think about providing a library with such functionality, so it can be reused in AI as well?
Member
What functionality specifically? Scanning operator conditions and ensuring they've all settled for at least 30s? |
||||||
|
|
||||||
| As an openshift-install user, I want the Installer to tell me when an operator never stops | ||||||
| progressing so that I can check whether the operator has an issue. | ||||||
|
|
||||||
| As a member of the technical release team, I want the Installer to exit non-zero when | ||||||
| an operator never stops progressing so that I can triage operator bugs. | ||||||
|
Member
You can add another use case which SRE has around the installer. They cannot deploy the workload immediately after the installation because the cluster is not ready, so they had to develop extra code which checks the cluster to see if the operators are no longer progressing, which is inconvenient (same for our customers). cc @cblecker |
||||||
|
|
||||||
| ### Goals | ||||||
|
|
||||||
| * Installer correctly identifies a failed cluster install where cluster operators are not stable | ||||||
|
Member
Suggested change
|
||||||
| * Installer exits non-zero in those situations and delivers a meaningful error message | ||||||
|
|
||||||
| ### Non-Goals | ||||||
|
|
||||||
| * Installer handling other classes of errors, such as failure to provision machines by MAO | ||||||
|
Member
Suggested change
|
||||||
|
|
||||||
| ## Proposal | ||||||
|
|
||||||
| ### Workflow Description | ||||||
|
|
||||||
| The cluster admin will begin installing the cluster as usual. The installation workflow will be: | ||||||
|
|
||||||
| 1. Cluster initializes as usual | ||||||
| 2. As usual, the installer checks that the cluster version is Available=True, Progressing=False, Degraded=False | ||||||
|
Contributor
clusterversion doesn't assert a
Member
Probably, even though that's not yet wired up it was agreed to be. @patrickdillon @jottofar let's make sure we sync up on this.
Contributor
Author
Yes, you are correct: the installer code checks against the failing condition. |
||||||
| 3. Installer checks status of each cluster operator for stability | ||||||
|
Member
For reference, here are the things we are currently looking at today with OSD/ROSA to try and determine when a cluster install has actually finished: |
||||||
| 4. If a cluster operator is not stable, the installer logs a message and throws an error | ||||||
|
Contributor
Can we get a summary of all the things the installer looks at today to determine that the installation is done (or failed), so we can better understand the delta and the implications of the delta being proposed here?
Contributor
Author
Serially, as I posted above, I believe it is: checks for the bootstrap-complete configmap. If that's good:
Contributor
Can we fold the console check into the proposed CO checks? What do we do in a cluster where the console is disabled?
Contributor
Is this checked by looking at the console CO conditions, or by explicitly trying to access the console URL?
Member
Since openshift/installer#5336, the installer will warning-log |
||||||
|
|
||||||
| ### API Extensions | ||||||
|
|
||||||
| None | ||||||
|
|
||||||
| ### Implementation Details/Notes/Constraints [optional] | ||||||
|
|
||||||
| The important detail is how we define _stable_. The definition proposed here is: | ||||||
|
|
||||||
| ``` | ||||||
| If, during a five-minute period, a cluster operator does not maintain Progressing=False | ||||||
| for at least 30 consecutive seconds, it is unstable. | ||||||
|
Contributor
Can we get some more detail on how this will work in practice? At what point will the installer start the "5 minute window", for example? And presumably it will be looking for a 30 second period during which all operators are progressing=false, meaning you could have a situation where:
right? Not saying that's a problem, just trying to understand how the detection/determination will work in practice.
Member
Do we trust lastTransitionTime?
Contributor
Author
The implementation is here: The intent is pretty much what Scott said, except we don't necessarily wait the whole 5 minutes. The 5s detail in point 3 is abstracted away into library-go & watch events. I will update these crucial details in the enhancement.
Contributor
Scott's explanation in https://github.com/openshift/enhancements/pull/1189/files#r921528565 makes sense to me. I think the logic in the PR may have a bug. I commented. |
||||||
| ``` | ||||||
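For illustration only, here is a minimal Go sketch of the stability semantics described above. The `operatorStatus` type and the `poll` callback are hypothetical stand-ins for the real ClusterOperator API objects and client; per the author's comment above, the actual implementation builds on library-go and watch events rather than polling.

```go
// Sketch of the proposed stability check: every cluster operator must hold
// Progressing=False for at least 30 consecutive seconds, and the installer
// waits up to 5 minutes overall for all operators to settle.
// operatorStatus and the poll callback are hypothetical stand-ins, not the
// installer's real types.
package stability

import (
	"fmt"
	"time"
)

type operatorStatus struct {
	Name               string
	Progressing        bool      // current Progressing condition status
	LastTransitionTime time.Time // when the Progressing condition last changed
}

const (
	settleWindow = 30 * time.Second // operator must hold Progressing=False this long
	overallLimit = 5 * time.Minute  // stop waiting after this long
)

// unstableOperators returns the operators that have not held Progressing=False
// for at least settleWindow as of now.
func unstableOperators(operators []operatorStatus, now time.Time) []string {
	var unstable []string
	for _, op := range operators {
		if op.Progressing || now.Sub(op.LastTransitionTime) < settleWindow {
			unstable = append(unstable, op.Name)
		}
	}
	return unstable
}

// waitForStability polls until every operator has settled or overallLimit expires.
// poll is a hypothetical callback that lists current cluster operator statuses.
func waitForStability(poll func() []operatorStatus) error {
	deadline := time.Now().Add(overallLimit)
	for {
		now := time.Now()
		unstable := unstableOperators(poll(), now)
		if len(unstable) == 0 {
			return nil // all operators stable: install can be declared complete
		}
		if now.After(deadline) {
			return fmt.Errorf("cluster operators still progressing after %s: %v", overallLimit, unstable)
		}
		time.Sleep(5 * time.Second)
	}
}
```

Under this reading the installer can finish well before the five-minute cap when everything settles early, which matches the author's comment above that "we don't necessarily wait the whole 5 minutes."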
|
|
||||||
| ### Risks and Mitigations | ||||||
|
|
||||||
| One risk would be a false positive: the Installer identifies that a cluster | ||||||
| operator is unstable, but it turns out the operator is perfectly healthy; | ||||||
| the install was declared a failure but was actually successful. This risk | ||||||
| seems low and could be managed by monitoring these specific failures | ||||||
|
Contributor
Agreed. It would also be an operator bug, not an installer bug. |
||||||
| in CI. | ||||||
|
|
||||||
| ### Drawbacks | ||||||
|
|
||||||
| This enhancement has some significant potential drawbacks in terms of design: | ||||||
|
|
||||||
| A fundamental design principle of 4.x and the operator framework has been | ||||||
| delegating responsibility away from the Installer to allow clean separation | ||||||
| of concerns and better maintainability. Without this enhancement, it is the | ||||||
|
Contributor
Suggested change
|
||||||
| responsibility of the Cluster-Version Operator to determine whether given | ||||||
| Cluster Operators constitute a successful version. The idea of keeping the | ||||||
| cluster-version operator as the single responsible party is discussed in the | ||||||
| alternatives section. | ||||||
|
Contributor
Another aspect to this is it opens the door for differing behavior between which clusteroperators CVO cares about vs. the installer. Today my understanding is the CVO only watches clusteroperator objects that it created (came out of manifests in the payload). If another component creates its own CO for some reason, that will have no bearing on CVO's observation of the cluster and reporting of available/progressing/failing conditions. I imagine the installer is not going to make that distinction (though perhaps it could, via looking at annotations on the COs), so we could potentially end up with a situation where the installer has one view of "what matters" and the CVO has a different one.
Member
@bparees Which of those, in the context of installation, makes the most sense given where we're going in the future?
Contributor
Given that the installer has nothing to do with upgrades, and I'd expect similar semantics to function around upgrades as they do around install (in terms of when we think it is "done"), I'd be inclined towards the CVO owning this aspect over the installer. As for the future, I assume you mean with respect to platform operators? So far the PO plan is to have the platform operator manager proxy the individual PO status via a single CO. So PO doesn't really care whether it's CVO or installer that is watching the COs. But part of why the POM has to proxy the status to a single CO is because the CVO doesn't watch non-payload COs. I could see us changing that in the future, which is another reason I'd hate to see us have two different components watching potentially different sets of COs to make similar decisions. |
||||||
|
|
||||||
| Following the previous point, this enhancement could introduce a slippery slope. | ||||||
| Whereas before this enhancement the Installer would simply query the CVO, now we | ||||||
| are adding additional conditional logic. Introducing this conditional logic leads | ||||||
| to the risk of further conditions, which could make the install process more brittle | ||||||
| and introduce the potential for false positives or other failures. | ||||||
|
|
||||||
| Does implementing this enhancement address symptoms of issues with operator status definitions? | ||||||
| Should operators themselves be setting Degraded=True when they don't meet this stability criterion? | ||||||
|
Contributor
I don't think so. Progressing is about moving from one steady state to another. Degraded is about "something is broken". They serve different purposes and some progressing states easily last longer than 5 minutes. For instance, rolling out a new kube-apiserver level takes 15-20 minutes. |
||||||
|
|
||||||
| As we have seen with other timeouts in the Installer, developers and users will want to change these. | ||||||
| We should define a process for refining our stability criteria. | ||||||
|
Member
Via another five minutes of |
||||||
|
|
||||||
| ## Design Details | ||||||
|
|
||||||
| ### Open Questions [optional] | ||||||
|
|
||||||
| 1. What is the correct definition of a stable operator? More importantly, how can we | ||||||
| refine this definition? | ||||||
|
Contributor
For purposes of this EP and the installer's perspective, it sounds like you're defining it as "progressing=false for 5 minutes"? Is the installer going to look at any other conditions on the operator, e.g. available=false, or will it continue to rely on the CVO to proxy that?
Member
I think it's Progressing=False for the last 30 seconds, but only waiting up to 5 minutes for all COs to achieve that. Only waiting up to five minutes because these aren't failed installs; the cluster is, in all cases I'm aware of, a viable cluster that just differs from the exact spec defined in install-config.
Contributor
Yeah, sorry, I described it badly. But I'm still interested in what the full set of conditions that determine "install is complete (and healthy or unhealthy)" are, beyond this "30s window of progressing=false".
Contributor
Author
No, it would continue to rely on CVO as a proxy, which is a good way of putting it. <edited to copy/paste/fix here: https://github.com/openshift/enhancements/pull/1189#discussion_r921698993>
Contributor
I'm happy with the definition as put forward by this enhancement plus the existing logic that requires A=t and D=f. |
||||||
| 2. Should this logic belong in the Installer, CVO, or another component? | ||||||
|
Contributor
I don't have a strong opinion about this one. The logic described here seems fine. I'd also be fine seeing the CVO do it. |
||||||
|
|
||||||
| ### Test Plan | ||||||
|
|
||||||
| This code would go directly into the installer and be exercised by e2e tests. | ||||||
|
Member
By a synthetic CO that immediately becomes Available=True, fulfilling the historic requirement for install-complete, but maintains Progressing=True for 30 seconds after? Do we have tests that watch the cluster during the install or is this new ground? I think @deads2k has called these observers?
Contributor
We don't have e2e observers yet, but if they existed it could work. I built a backstop into our pre-test CI step that we can change from "wait five minutes to settle" to "fail if anything is not settled". If it trips, this feature has a bug. |
||||||
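To make the synthetic-CO idea above concrete, a unit-level sketch might look like the following. It reuses the hypothetical `operatorStatus`/`unstableOperators` helpers from the earlier sketch and is not the installer's real test code.

```go
package stability

import (
	"testing"
	"time"
)

// A synthetic operator that is Available=True (the historic bar for
// install-complete) but still Progressing=True should be reported as unstable,
// while an operator that settled well over 30 seconds ago should not.
func TestSyntheticProgressingOperatorIsUnstable(t *testing.T) {
	now := time.Now()
	operators := []operatorStatus{
		{Name: "kube-apiserver", Progressing: false, LastTransitionTime: now.Add(-10 * time.Minute)},
		{Name: "synthetic", Progressing: true, LastTransitionTime: now.Add(-1 * time.Minute)},
	}

	unstable := unstableOperators(operators, now)
	if len(unstable) != 1 || unstable[0] != "synthetic" {
		t.Fatalf("expected only the synthetic operator to be unstable, got %v", unstable)
	}
}
```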
|
|
||||||
| ### Graduation Criteria | ||||||
|
|
||||||
|
|
||||||
|
|
||||||
| #### Removing a deprecated feature | ||||||
|
|
||||||
| N/A | ||||||
|
|
||||||
| ### Upgrade / Downgrade Strategy | ||||||
|
|
||||||
| The cluster-version operator can institute similar logic and trigger alerts when | ||||||
| operator stability is in question during an upgrade. See alternatives section. | ||||||
|
|
||||||
| ### Version Skew Strategy | ||||||
|
|
||||||
| N/A | ||||||
|
|
||||||
| ### Operational Aspects of API Extensions | ||||||
|
|
||||||
| N/A | ||||||
|
|
||||||
| #### Failure Modes | ||||||
|
|
||||||
| - The worst case failure mode is that the Installer throws an error when there is not an actual | ||||||
| problem with a cluster. In this case, an admin would need to investigate an error or automation would | ||||||
| need to rerun an install. We would hope to eliminate these failures by monitoring CI. | ||||||
|
Member
Would this failed state cause a full TF destroy/retry in its interactions with hive? |
||||||
|
|
||||||
| #### Support Procedures | ||||||
|
|
||||||
| We should instruct users on how to debug or review their operators when this error occurs. | ||||||
|
|
||||||
| ## Implementation History | ||||||
|
|
||||||
| N/A | ||||||
|
|
||||||
| ## Alternatives | ||||||
|
|
||||||
| We should seriously consider whether this logic should go into the cluster-version | ||||||
| operator rather than the installer. (Or if it should be in both, perhaps just the CVO | ||||||
| for upgrades.) Management of Cluster Operators has been the responsibility of the CVO | ||||||
| and the Installer should defer responsibility where possible to maintain separation | ||||||
| of concerns. | ||||||
sdodson marked this conversation as resolved.
|
||||||
Could possibly link openshift/installer#6124 if there are no Jira trackers? That got reverted in openshift/installer#6503, but presumably whichever pull request restores it with an adjusted threshold will also link 6124.