-
Notifications
You must be signed in to change notification settings - Fork 533
create all clusteroperators in the CVO payload immediately #149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
create all clusteroperators in the CVO payload immediately #149
Conversation
| 1. `clusteroperator` resources in the payload should be created with the required status conditions (available, progressing, | ||
| degraded) set to `Unknown`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has more than listed here, which ones do we care about now, and how we get this initial list updated when more conditions become the norm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has more than listed here, which ones do we care about now, and how we get this initial list updated when more conditions become the norm
Only those three are required. I would draw the line at required.
| 2. `clusteroperator` creation by the CVO needs to honor or update `.status.relatedResources`. This requires updating | ||
| status after the creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both the conditions and relatedObjects are status field. ie since status cannot be updated with CREATE, we need to clearly define when these fields need to be defaulted as these are going to be 2 calls and hence open to races.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both the conditions and relatedObjects are status field. ie since status cannot be updated with CREATE, we need to clearly define when these fields need to be defaulted as these are going to be 2 calls and hence open to races.
however, we know that the operators are all controllers, so a race without consequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a spec field for relatedObjects instead / as well? with spec.relatedObjects being "gather this even if the operator doesn't start" and status.relatedObjects being updated by the operator continuously.
| 3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering | ||
| they are located. This ensures that they are always present during collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we define immediately ? also why is this behavior required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this introduce a new API that operators can assume the these conditions will be pre-set..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this introduce a new API that operators can assume the these conditions will be pre-set..?
Not in any reasonable way. They are controllers.
can we define immediately ? also why is this behavior required?
As noted in motivation above, we need the metadata to be able to collect from failed installs. Immediately I would define as searching the entire payload for clusteroperators and creating them before anything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh, this is going to break a lot of progress notification stuff. This is hugely ugly and may revert end user stuff. We will probably have to lie about progress - do a first pass create attempt, but if it fails do not report it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not special-case ordering for a particular type. We already provide a way for operators to declare manifest ordering, let's just use that instead of complicating it. Having operators each PR to shift their ClusterOperator manifest to the front of their block is enough to get them all safely up by the time there is any operator-specific content to gather. If an operator-adding update gets hung up early without pushing the ClusterOperator for the new operator, then there would be no other components associated with that new operator (namespace, deployments, etc.) around to gather.
| 3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering | ||
| they are located. This ensures that they are always present during collection. | ||
| 4. The CVO waiting logic on `clusteroperator` remains the same. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a user deleted the clusteroperator object, is the responsibilty of CVO to create and set these defaults again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a user deleted the clusteroperator object, is the responsibilty of CVO to create and set these defaults again?
it would be a race with clusteroperators. I don't think the distinction matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be a race with clusteroperators.
Racing in general is not great, but it's mitigated in the ClusterOperator case by having no spec and very little metadata (e.g. here).
|
|
||
| ## Proposal | ||
|
|
||
| 1. `clusteroperator` resources in the payload should be created with the required status conditions (available, progressing, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since https://github.com/openshift/enhancements/pull/149/files#diff-01afa9f6f5d804a7c3bb01b2ccb0b664R55 expects the operators to define the default relatedObjects in the release-image.. why not use the same mechanism for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since https://github.com/openshift/enhancements/pull/149/files#diff-01afa9f6f5d804a7c3bb01b2ccb0b664R55 expects the operators to define the default relatedObjects in the release-image.. why not use the same mechanism for these?
no preference on implementation.
| 3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering | ||
| they are located. This ensures that they are always present during collection. | ||
| 4. The CVO waiting logic on `clusteroperator` remains the same. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cluster operator object in the relase-image has fields in the status ie. status.versions that provide CVO the version it should be waiting for operator to be for done upgrading
So this new behavior now makes certain fields in the release-image manifest indication of create this default and certain fields required for upgrade done criteria
This is getting a little convoluted...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cluster operator object in the relase-image has fields in the status ie.
status.versionsthat provide CVO the version it should be waiting for operator to be for done upgradingSo this new behavior now makes certain fields in the release-image manifest indication of create this default and certain fields required for upgrade done criteria
This is getting a little convoluted...
from an end-user perspective the behavior is easy, for operators no behavior changes, for the CVO an extra stanza is added, it doesn't seem too bad.
|
the clusteroperator is an object owned and managed by operators themselves and used to provide their status and creating the cluster operator for them seeming weird to me personally. also between the benefit of capturing the related objects vs the increase confusion for co objects in release-image wrt cvo, personally i'm on the side of not increasing this confusion. |
We have a supportability problem in the field. I'm open to other alternatives that don't require writing additional knowledge into debugging tools, but this seems like a very good balance between debuggability and effort with a very low impact across the org. |
…lusteroperator early
|
|
||
| ## Release Signoff Checklist | ||
|
|
||
| - [ ] Enhancement is `implementable` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shouldn't we check this off to match status: implementable above^^?
|
|
||
| ### Specific Implementation Option | ||
|
|
||
| This isn't a required mechanism for implementation, but it demonstrates how narrowly scoped the change is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not want to do this at all. This is horrifying. :)
|
|
||
| This isn't a required mechanism for implementation, but it demonstrates how narrowly scoped the change is. | ||
| 1. create a new control loop with a clusteroperator lister, clusteroperator client, and a function to get the current payload. | ||
| 2. register event handlers on clusteroperator informer and time based every minute. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the payload should simply find all CV and do a first pass on them, ignoring errors or progress reporting, then continue. We have to deadline it, but I do not want a new loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a super specific case on install / upgrade.
I'm NAKing this impl, I'd accept something at the beginning of sync payload that loops over the manifests and does a quick parallel run of all cluster operators with a bounded scope, then does normal processing. It's an optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(but in case it isn't clear, I'm fine with this proposal as long as it doesn't in any way disturb the current, well tested, SANE loops).
| 3. sync loop reads the current payload. for each clusteroperator in the payload | ||
| 1. check lister to see if clusteroperator exists. If so, continue to next clusteroperator. | ||
| 2. create clusteroperator with empty spec and metadata. If create fails, continue to next clusteroperator. | ||
| 3. update clusteroperator/status with `.status.relatedResources` and the three required conditions in `Unknown` state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What needs to go in relatedResources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What needs to go in
relatedResources?
Responsibility of the individual operators. Generally it will be things like input resources with fixed names and interesting namespaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so as far as this enhancement is concerned, we can drop that and have this line (for the CVO) be:
- update clusteroperator/status with the three required conditions in
Unknownstate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so as far as this enhancement is concerned, we can drop that...
And without anything useful in relatedObjects (is relatedResources a typo?), must-gather is still stuck, right? I don't see how the ClusterVersion operator would know what to fill in there unless we put some sort of annotation on the intended, CVO created resources. So an operator's manifest set would look like:
- Namespace, whatever else we expect to not have problems with, all have the
release.openshift.io/operator-resource: <operator-name>annotation. - The ClusterOperator. When the CVO sees this is missing, it creates it, seeds the conditions as you've laid out, and fills
relatedObjectswith anything it had already pushed that sync round with therelease.openshift.io/operator-resourceannotation whose annotation value matched the ClusterOperator name. - Deployments or other vulnerable types. At this point it would be an error to hit anything with a matching
release.openshift.io/operator-resourceannotation. Not sure how to enforce that error, probably go Degraded but keep working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So an operator's manifest set would look like...
Never mind, this is crazy. Better to have operator maintainers set their intended relatedObjects content in their ClusterOperator.
| ### Test Plan | ||
|
|
||
| 1. When an install in CI fails at some point in the release, we should see must-gather information | ||
| 2. During an installation, the `clusteroperator` resources should be visible via the API immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI coverage for this immediately seems tricky. Were you expecting to use audit logs or some such?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI coverage for this
immediatelyseems tricky. Were you expecting to use audit logs or some such?
I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.
That's 1 though. I don't see how 2 is all that important on its own, certainly not enough to be worth specific CI coverage. I'm happy if 1 gets CI coverage. I'm not happy if something important enough to be in the Test Plan is covered by "@deads2k manually looks into this occasionally" ;).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.
That's 1 though. I don't see how 2 is all that important on its own, certainly not enough to be worth specific CI coverage. I'm happy if 1 gets CI coverage. I'm not happy if something important enough to be in the Test Plan is covered by "@deads2k manually looks into this occasionally" ;).
Presumably a unit test will work.
| degraded) set to `Unknown`. | ||
| 2. `clusteroperator` creation by the CVO needs to honor or update `.status.relatedResources`. This requires updating | ||
| status after the creation. | ||
| 3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest a 2.5: it would be nice if CVO waited for the clusteroperator API to be available in discovery before creating any of the instances.
see also: https://bugzilla.redhat.com/show_bug.cgi?id=1787660 (an issue that we can address per operator, but this would be nice to have nonetheless).
|
This has already been implemented for 4.5 and is potentially being backported to 4.4 here openshift/cluster-version-operator#376 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, sdodson The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
We need a feature-request bug for this targeting 4.5.0 that links all the work which we can clone back to 4.4.z so QE can verify all the backports. |
clusteroperators.config.openshift.ioare used to determine success of installation.They are also used to drive collection of debugging data from tools like
oc adm inspectandoc adm must-gather.The
clusteroperatorresources in a payload should always be present, even before installation of particularoperators to see which
clusteroperatorsneed to report in and to allow establishing.status.relatedResourcesbeforean operator pod runs.
This is critical for debugging clusters that fail to install or fail to upgrade with new operators present.
/assign @abhinavdahiya @wking @sdodson