-
Notifications
You must be signed in to change notification settings - Fork 533
create all clusteroperators in the CVO payload immediately #149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,117 @@ | ||
| --- | ||
| title: clusteroperator-resource-handling | ||
| authors: | ||
| - "@deads2k" | ||
| reviewers: | ||
| - "@abhinavdahiya" | ||
| - "@sdodson" | ||
| approvers: | ||
| - "@abhinavdahiya" | ||
| - "@derekwaynecarr" | ||
| creation-date: 2019-12-11 | ||
| last-updated: 2019-12-11 | ||
| status: implementable | ||
| see-also: | ||
| replaces: | ||
| superseded-by: | ||
| --- | ||
|
|
||
| # ClusterOperator Resource Handling | ||
|
|
||
| ## Release Signoff Checklist | ||
|
|
||
| - [ ] Enhancement is `implementable` | ||
| - [ ] Design details are appropriately documented from clear requirements | ||
| - [ ] Test plan is defined | ||
| - [ ] Graduation criteria for dev preview, tech preview, GA | ||
| - [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
|
||
| ## Summary | ||
|
|
||
| `clusteroperators.config.openshift.io` are used to determine success of installation. | ||
| They are also used to drive collection of debugging data from tools like `oc adm inspect` and `oc adm must-gather`. | ||
| The `clusteroperator` resources in a payload should always be present, even before installation of particular | ||
| operators to see which `clusteroperators` need to report in and to allow establishing `.status.relatedResources` before | ||
| an operator pod runs. | ||
| This is critical for debugging clusters that fail to install or fail to upgrade with new operators present. | ||
|
|
||
| ## Motivation | ||
|
|
||
| Debugging information for clusters that succeed in bootstrapping, but fail during installation is missing most | ||
| of the data required to resolve the issue via must-gather. | ||
|
|
||
| ### Goals | ||
|
|
||
| 1. allow collection of debugging data for failed installs using normal tools. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| 1. create a new tool to gather data for failed installs after bootstrapping. | ||
| 2. taking responsibility for creating clusteroperator resources. Individual operators are responsible for creating and maintaining | ||
| clusteroperator resources. | ||
|
|
||
| ## Proposal | ||
|
|
||
| 1. `clusteroperator` resources in the payload should be created with the required status conditions (available, progressing, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. since https://github.com/openshift/enhancements/pull/149/files#diff-01afa9f6f5d804a7c3bb01b2ccb0b664R55 expects the operators to define the default relatedObjects in the release-image.. why not use the same mechanism for these?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
no preference on implementation. |
||
| degraded) set to `Unknown`. | ||
|
Comment on lines
+55
to
+56
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. has more than listed here, which ones do we care about now, and how we get this initial list updated when more conditions become the norm
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Only those three are required. I would draw the line at required. |
||
| 2. `clusteroperator` creation by the CVO needs to honor or update `.status.relatedResources`. This requires updating | ||
| status after the creation. | ||
|
Comment on lines
+57
to
+58
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. both the conditions and relatedObjects are status field. ie since status cannot be updated with CREATE, we need to clearly define when these fields need to be defaulted as these are going to be 2 calls and hence open to races.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
however, we know that the operators are all controllers, so a race without consequence.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about a |
||
| 3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would suggest a see also: https://bugzilla.redhat.com/show_bug.cgi?id=1787660 (an issue that we can address per operator, but this would be nice to have nonetheless). |
||
| they are located. This ensures that they are always present during collection. | ||
|
Comment on lines
+59
to
+60
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we define immediately ? also why is this behavior required?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does this introduce a new API that operators can assume the these conditions will be pre-set..?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Not in any reasonable way. They are controllers.
As noted in motivation above, we need the metadata to be able to collect from failed installs. Immediately I would define as searching the entire payload for clusteroperators and creating them before anything else.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ugh, this is going to break a lot of progress notification stuff. This is hugely ugly and may revert end user stuff. We will probably have to lie about progress - do a first pass create attempt, but if it fails do not report it at all.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would rather not special-case ordering for a particular type. We already provide a way for operators to declare manifest ordering, let's just use that instead of complicating it. Having operators each PR to shift their ClusterOperator manifest to the front of their block is enough to get them all safely up by the time there is any operator-specific content to gather. If an operator-adding update gets hung up early without pushing the ClusterOperator for the new operator, then there would be no other components associated with that new operator (namespace, deployments, etc.) around to gather. |
||
| 4. The CVO waiting logic on `clusteroperator` remains the same. | ||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if a user deleted the clusteroperator object, is the responsibilty of CVO to create and set these defaults again?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
it would be a race with clusteroperators. I don't think the distinction matters.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Racing in general is not great, but it's mitigated in the ClusterOperator case by having no
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The cluster operator object in the relase-image has fields in the status ie. So this new behavior now makes certain fields in the release-image manifest indication of create this default and certain fields required for upgrade done criteria This is getting a little convoluted...
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
from an end-user perspective the behavior is easy, for operators no behavior changes, for the CVO an extra stanza is added, it doesn't seem too bad. |
||
| ### Specific Implementation Option | ||
|
|
||
| This isn't a required mechanism for implementation, but it demonstrates how narrowly scoped the change is. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do not want to do this at all. This is horrifying. :) |
||
| 1. create a new control loop with a clusteroperator lister, clusteroperator client, and a function to get the current payload. | ||
| 2. register event handlers on clusteroperator informer and time based every minute. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the payload should simply find all CV and do a first pass on them, ignoring errors or progress reporting, then continue. We have to deadline it, but I do not want a new loop.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a super specific case on install / upgrade. I'm NAKing this impl, I'd accept something at the beginning of sync payload that loops over the manifests and does a quick parallel run of all cluster operators with a bounded scope, then does normal processing. It's an optimization.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (but in case it isn't clear, I'm fine with this proposal as long as it doesn't in any way disturb the current, well tested, SANE loops). |
||
| 3. sync loop reads the current payload. for each clusteroperator in the payload | ||
| 1. check lister to see if clusteroperator exists. If so, continue to next clusteroperator. | ||
| 2. create clusteroperator with empty spec and metadata. If create fails, continue to next clusteroperator. | ||
| 3. update clusteroperator/status with `.status.relatedResources` and the three required conditions in `Unknown` state. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What needs to go in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Responsibility of the individual operators. Generally it will be things like input resources with fixed names and interesting namespaces.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so as far as this enhancement is concerned, we can drop that and have this line (for the CVO) be:
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
And without anything useful in
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Never mind, this is crazy. Better to have operator maintainers set their intended |
||
|
|
||
| There is no need to modify the existing CVO logic because it's all valid. | ||
| The individual operators are controllers so the presence of an unknown state doesn't matter. | ||
| The new control loop doesn't fight with any individual operators because it's a create-only call with a one time status priming. | ||
| If anything goes wrong with this control loop, the rest of the system continues to function as it does today. | ||
|
|
||
| ### Risks and Mitigations | ||
|
|
||
| 1. Existing clusteroperators may treat presence and absence or condition==Unknown as special and fail to reconcile. | ||
| This would be a bug in the operator implementation that needs to be fixed. | ||
|
|
||
| ## Design Details | ||
|
|
||
| ### Test Plan | ||
|
|
||
| 1. When an install in CI fails at some point in the release, we should see must-gather information | ||
| 2. During an installation, the `clusteroperator` resources should be visible via the API immediately. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. CI coverage for this
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That's 1 though. I don't see how 2 is all that important on its own, certainly not enough to be worth specific CI coverage. I'm happy if 1 gets CI coverage. I'm not happy if something important enough to be in the Test Plan is covered by "@deads2k manually looks into this occasionally" ;).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Presumably a unit test will work. |
||
|
|
||
| ### Graduation Criteria | ||
|
|
||
| GA. When it works, we ship it. | ||
|
|
||
| ### Upgrade / Downgrade Strategy | ||
|
|
||
| No special handling is needed because the condition meaning remains the same. The upgrade will simply have new | ||
| `clusteroperators` created at the start of the upgrade. | ||
|
|
||
| ### Version Skew Strategy | ||
|
|
||
| No special consideration. | ||
|
|
||
| ## Implementation History | ||
|
|
||
| Major milestones in the life cycle of a proposal should be tracked in `Implementation | ||
| History`. | ||
|
|
||
| ## Drawbacks | ||
|
|
||
| The idea is to find the best form of an argument why this enhancement should _not_ be implemented. | ||
|
|
||
| ## Alternatives | ||
|
|
||
| Similar to the `Drawbacks` section the `Alternatives` section is used to | ||
| highlight and record other possible approaches to delivering the value proposed | ||
| by an enhancement. | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shouldn't we check this off to match
status: implementableabove^^?