create all clusteroperators in the CVO payload immediately #149

deads2k · 2019-12-11T18:21:04Z

`clusteroperators.config.openshift.io` resources are used to determine whether an installation succeeded.
They are also used to drive collection of debugging data by tools like `oc adm inspect` and `oc adm must-gather`.
The clusteroperator resources in a payload should always be present, even before a particular operator is installed,
so that we can see which clusteroperators still need to report in and so that `.status.relatedResources` can be
established before an operator pod runs.
This is critical for debugging clusters that fail to install, or that fail to upgrade when new operators are present.

/assign @abhinavdahiya @wking @sdodson

abhinavdahiya · 2019-12-11T18:24:25Z

enhancements/cvo/clusteroperator-resource-handling.md

+1. `clusteroperator` resources in the payload should be created with the required status conditions (available, progressing,
+   degraded) set to `Unknown`.


https://github.com/openshift/api/blob/9ffd03c1c270ddd8cbb625295b1a74ade7e01229/config/v1/types_cluster_operator.go#L141-L175

This has more conditions than are listed here. Which ones do we care about now, and how do we get this initial list updated when more conditions become the norm?

Only those three are required. I would draw the line at required.
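A minimal sketch of "draw the line at required", assuming plain dicts stand in for the real ClusterOperator API types. The helper name `initial_conditions` is hypothetical and not part of the CVO. Keeping the required types in one tuple means the initial list is updated in a single place if more conditions become required later.

```python
# Sketch only: plain dicts stand in for the real ClusterOperator types.
# The three required condition types named in this thread.
REQUIRED_CONDITION_TYPES = ("Available", "Progressing", "Degraded")


def initial_conditions(existing=None):
    """Return a condition list in which every required type is present.

    Conditions that are already reported are kept untouched; only the
    missing required types are added, with status "Unknown".
    """
    conditions = [dict(c) for c in (existing or [])]
    present = {c["type"] for c in conditions}
    for cond_type in REQUIRED_CONDITION_TYPES:
        if cond_type not in present:
            conditions.append({"type": cond_type, "status": "Unknown"})
    return conditions
```

Optional conditions are deliberately left out, so an operator that never reports them is not given a misleading `Unknown` entry.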

abhinavdahiya · 2019-12-11T18:27:09Z

enhancements/cvo/clusteroperator-resource-handling.md

+2. `clusteroperator` creation by the CVO needs to honor or update `.status.relatedResources`.  This requires updating
+    status after the creation.


Both the conditions and relatedObjects are status fields. Since status cannot be set with CREATE, we need to clearly define when these fields get defaulted: this is going to take two calls and is therefore open to races.

However, we know that the operators are all controllers, so it is a race without consequence.

What about a spec field for relatedObjects instead, or as well? `spec.relatedObjects` would mean "gather this even if the operator doesn't start", and `status.relatedObjects` would be updated by the operator continuously.
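To make the spec/status suggestion concrete, here is a rough sketch of how a gathering tool might combine the two lists. It assumes a `spec.relatedObjects` field, which is only proposed in this comment and is not part of the API, and the function name `objects_to_gather` is hypothetical.

```python
def objects_to_gather(cluster_operator):
    """Union of spec.relatedObjects (hypothetical) and status.relatedObjects.

    Spec entries come first, so they are gathered even when the operator
    never started and never wrote any status. Duplicates are dropped.
    """
    seen = set()
    result = []
    for section in ("spec", "status"):
        refs = cluster_operator.get(section, {}).get("relatedObjects", [])
        for ref in refs:
            key = (
                ref.get("group", ""),
                ref.get("resource", ""),
                ref.get("namespace", ""),
                ref.get("name", ""),
            )
            if key not in seen:
                seen.add(key)
                result.append(ref)
    return result
```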

abhinavdahiya · 2019-12-11T18:28:02Z

enhancements/cvo/clusteroperator-resource-handling.md

+3.  `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering
+    they are located.  This ensures that they are always present during collection.


Can we define "immediately"? Also, why is this behavior required?

Does this introduce a new API, in that operators can assume these conditions will be pre-set?

> Does this introduce a new API, in that operators can assume these conditions will be pre-set?

Not in any reasonable way. They are controllers.

> Can we define "immediately"? Also, why is this behavior required?

As noted in the motivation above, we need the metadata to be able to collect from failed installs. I would define "immediately" as searching the entire payload for clusteroperators and creating them before anything else.

Ugh, this is going to break a lot of progress notification stuff. This is hugely ugly and may revert end user stuff. We will probably have to lie about progress - do a first pass create attempt, but if it fails do not report it at all.

I would rather not special-case ordering for a particular type. We already provide a way for operators to declare manifest ordering, let's just use that instead of complicating it. Having operators each PR to shift their ClusterOperator manifest to the front of their block is enough to get them all safely up by the time there is any operator-specific content to gather. If an operator-adding update gets hung up early without pushing the ClusterOperator for the new operator, then there would be no other components associated with that new operator (namespace, deployments, etc.) around to gather.
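The "search the entire payload first" definition given earlier in this thread can be sketched as a pure selection step that leaves the declared manifest ordering alone. This assumes manifests are already parsed into dicts; `first_pass_targets` is a hypothetical name, not a CVO function.

```python
def first_pass_targets(payload):
    """Return the names of all ClusterOperators in the payload.

    Names are returned in payload order and the payload itself is not
    modified, so normal ordered processing can still run afterwards.
    """
    return [
        m["metadata"]["name"]
        for m in payload
        if m.get("kind") == "ClusterOperator"
        and m.get("apiVersion", "").startswith("config.openshift.io/")
    ]
```

The alternative raised in the last comment, having each operator move its ClusterOperator manifest to the front of its own block, would need no code like this in the CVO at all.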

abhinavdahiya · 2019-12-11T18:29:00Z

enhancements/cvo/clusteroperator-resource-handling.md

+3.  `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering
+    they are located.  This ensures that they are always present during collection.
+4.  The CVO waiting logic on `clusteroperator` remains the same.
+


If a user deletes the clusteroperator object, is it the responsibility of the CVO to create it and set these defaults again?

It would be a race with clusteroperators. I don't think the distinction matters.

> it would be a race with clusteroperators.

Racing in general is not great, but it's mitigated in the ClusterOperator case by having no spec and very little metadata (e.g. here).

abhinavdahiya · 2019-12-11T18:30:20Z

enhancements/cvo/clusteroperator-resource-handling.md

+
+## Proposal
+
+1. `clusteroperator` resources in the payload should be created with the required status conditions (available, progressing,


Since https://github.com/openshift/enhancements/pull/149/files#diff-01afa9f6f5d804a7c3bb01b2ccb0b664R55 expects the operators to define the default relatedObjects in the release image, why not use the same mechanism for these?

No preference on implementation.

abhinavdahiya · 2019-12-11T18:33:38Z

enhancements/cvo/clusteroperator-resource-handling.md

+3.  `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering
+    they are located.  This ensures that they are always present during collection.
+4.  The CVO waiting logic on `clusteroperator` remains the same.
+


The ClusterOperator object in the release image has fields in its status, i.e. `status.versions`, that tell the CVO which version it should wait for the operator to reach before the upgrade counts as done.

So this new behavior makes certain fields in the release-image manifest mean "create this default", while other fields are the criteria for "upgrade done".

This is getting a little convoluted...

From an end-user perspective the behavior is easy; for operators, no behavior changes; for the CVO, an extra stanza is added. It doesn't seem too bad.
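The concern about `status.versions` can be illustrated with a small sketch of the version part of the "upgrade done" check. It assumes dict-shaped objects and a hypothetical helper name `versions_reached`, and it ignores whatever else the CVO checks while waiting. What it shows is that a pre-created ClusterOperator with no reported versions does not satisfy the wait, which is why the waiting logic can stay the same.

```python
def versions_reached(manifest, live):
    """True when the live object reports every version the manifest names.

    manifest: the ClusterOperator from the release image (dict).
    live:     the ClusterOperator read from the cluster, or None.
    """
    wanted = {
        v["name"]: v["version"]
        for v in manifest.get("status", {}).get("versions", [])
    }
    reported = {
        v["name"]: v["version"]
        for v in (live or {}).get("status", {}).get("versions", [])
    }
    return all(reported.get(name) == version for name, version in wanted.items())
```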

abhinavdahiya · 2019-12-11T18:35:13Z

The clusteroperator is an object owned and managed by the operators themselves and used to report their status, so creating the clusteroperator for them seems weird to me personally.

Also, weighing the benefit of capturing the related objects against the increased confusion about ClusterOperator objects in the release image with respect to the CVO, I'm personally on the side of not increasing this confusion.

deads2k · 2019-12-12T17:34:02Z

We have a supportability problem in the field. I'm open to other alternatives that don't require writing additional knowledge into debugging tools, but this seems like a very good balance between debuggability and effort with a very low impact across the org.

wking · 2019-12-12T21:07:45Z

enhancements/cvo/clusteroperator-resource-handling.md

+
+## Release Signoff Checklist
+
+- [ ] Enhancement is `implementable`


nit: shouldn't we check this off to match status: implementable above^^?

smarterclayton · 2019-12-12T21:13:46Z

enhancements/cvo/clusteroperator-resource-handling.md

+
+### Specific Implementation Option
+
+This isn't a required mechanism for implementation, but it demonstrates how narrowly scoped the change is.


I do not want to do this at all. This is horrifying. :)

smarterclayton · 2019-12-12T21:14:18Z

enhancements/cvo/clusteroperator-resource-handling.md

+
+This isn't a required mechanism for implementation, but it demonstrates how narrowly scoped the change is.
+ 1. create a new control loop with a clusteroperator lister, clusteroperator client, and a function to get the current payload.
+ 2. register event handlers on clusteroperator informer and time based every minute.


I think the payload should simply find all ClusterOperators and do a first pass on them, ignoring errors or progress reporting, then continue. We have to deadline it, but I do not want a new loop.

This is a super specific case on install / upgrade.

I'm NAKing this impl. I'd accept something at the beginning of sync payload that loops over the manifests and does a quick parallel run of all cluster operators with a bounded scope, then does normal processing. It's an optimization.

(but in case it isn't clear, I'm fine with this proposal as long as it doesn't in any way disturb the current, well tested, SANE loops).
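A rough sketch of the bounded first pass described in these comments, assuming manifests are dicts and that `create` is a caller-supplied function that talks to the API. Both `precreate_first_pass` and its parameters are hypothetical names. Errors and timeouts are swallowed on purpose so that nothing here affects progress reporting.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def precreate_first_pass(manifests, create, deadline_seconds=5.0, workers=8):
    """Try to create every ClusterOperator once, in parallel, best effort.

    Failures and timeouts are ignored: nothing is reported as progress or
    as an error, and the caller continues with normal payload processing.
    Returns the sorted names that were created, for logging only.
    """
    targets = [m for m in manifests if m.get("kind") == "ClusterOperator"]
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = {pool.submit(create, m): m["metadata"]["name"] for m in targets}
    done, not_done = wait(list(futures), timeout=deadline_seconds)
    for fut in not_done:
        fut.cancel()  # drop attempts that have not started yet
    pool.shutdown(wait=False)  # do not block the sync on stragglers
    return sorted(futures[f] for f in done if f.exception() is None)
```

Because it is a single function call at the start of the sync, it does not add a new control loop or change the existing ones.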

wking · 2019-12-12T21:19:23Z

enhancements/cvo/clusteroperator-resource-handling.md

+ 3. sync loop reads the current payload.  for each clusteroperator in the payload
+    1. check lister to see if clusteroperator exists.  If so, continue to next clusteroperator.
+    2. create clusteroperator with empty spec and metadata.  If create fails, continue to next clusteroperator.
+    3. update clusteroperator/status with `.status.relatedResources` and the three required conditions in `Unknown` state.


What needs to go in relatedResources?

Responsibility of the individual operators. Generally it will be things like input resources with fixed names and interesting namespaces.

So as far as this enhancement is concerned, we can drop that and have this line (for the CVO) be:

> update clusteroperator/status with the three required conditions in `Unknown` state.

> so as far as this enhancement is concerned, we can drop that...

And without anything useful in relatedObjects (is relatedResources a typo?), must-gather is still stuck, right? I don't see how the ClusterVersion operator would know what to fill in there unless we put some sort of annotation on the intended, CVO-created resources. So an operator's manifest set would look like:

1. Namespace, and whatever else we expect to not have problems with, all with the `release.openshift.io/operator-resource: <operator-name>` annotation.
2. The ClusterOperator. When the CVO sees this is missing, it creates it, seeds the conditions as you've laid out, and fills relatedObjects with anything it had already pushed that sync round with a `release.openshift.io/operator-resource` annotation whose value matches the ClusterOperator name.
3. Deployments or other vulnerable types. At this point it would be an error to hit anything with a matching `release.openshift.io/operator-resource` annotation. Not sure how to enforce that error; probably go Degraded but keep working.

> So an operator's manifest set would look like...

Never mind, this is crazy. Better to have operator maintainers set their intended relatedObjects content in their ClusterOperator.
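The three sync-loop steps quoted in this thread can be sketched against an in-memory stand-in for the lister and client. `FakeAPI` and `sync_cluster_operators` are hypothetical names, and the sketch follows the conclusion above that relatedObjects content comes from the operator's own ClusterOperator manifest.

```python
class FakeAPI:
    """In-memory stand-in for the lister and client (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def get(self, name):
        return self.objects.get(name)

    def create(self, name):
        if name in self.objects:
            raise KeyError(name + " already exists")
        # Like a real CREATE, this cannot set status.
        self.objects[name] = {"metadata": {"name": name}, "status": {}}

    def update_status(self, name, status):
        self.objects[name]["status"] = status


def sync_cluster_operators(api, payload):
    for manifest in payload:
        if manifest.get("kind") != "ClusterOperator":
            continue
        name = manifest["metadata"]["name"]
        if api.get(name) is not None:
            continue  # step 1: already exists, leave it to its operator
        try:
            api.create(name)  # step 2: empty spec and metadata
        except Exception:
            continue  # creation failed, move on to the next one
        # Step 3: a second call seeds status; relatedObjects are taken
        # from the manifest as written by the operator maintainers.
        api.update_status(name, {
            "conditions": [
                {"type": t, "status": "Unknown"}
                for t in ("Available", "Progressing", "Degraded")
            ],
            "relatedObjects": manifest.get("status", {}).get("relatedObjects", []),
        })
```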

wking · 2019-12-12T21:20:55Z

enhancements/cvo/clusteroperator-resource-handling.md

+### Test Plan
+
+1. When an install in CI fails at some point in the release, we should see must-gather information
+2. During an installation, the `clusteroperator` resources should be visible via the API immediately.


CI coverage for this immediately seems tricky. Were you expecting to use audit logs or some such?

I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.

> I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.

That's 1 though. I don't see how 2 is all that important on its own, certainly not enough to be worth specific CI coverage. I'm happy if 1 gets CI coverage. I'm not happy if something important enough to be in the Test Plan is covered by "@deads2k manually looks into this occasionally" ;).

Presumably a unit test will work.

ecordell · 2020-01-03T23:34:34Z

enhancements/cvo/clusteroperator-resource-handling.md

+   degraded) set to `Unknown`.
+2. `clusteroperator` creation by the CVO needs to honor or update `.status.relatedResources`.  This requires updating
+    status after the creation.
+3.  `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering


I would suggest a 2.5: it would be nice if CVO waited for the clusteroperator API to be available in discovery before creating any of the instances.

see also: https://bugzilla.redhat.com/show_bug.cgi?id=1787660 (an issue that we can address per operator, but this would be nice to have nonetheless).
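The suggested "2.5" step amounts to a bounded poll before the first create. A minimal sketch, assuming the caller supplies an `is_served` callable that checks discovery for the clusteroperator resource; `wait_for_resource` and its parameters are hypothetical names.

```python
import time


def wait_for_resource(is_served, timeout_seconds=30.0, interval_seconds=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll is_served() until it returns True or the timeout expires.

    Returns True as soon as the resource is served, False on timeout.
    sleep and clock are injectable so the loop can be unit tested.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_served():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_seconds)
```

On a `False` result the caller could simply skip the early creation and fall back to normal payload processing.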

sdodson · 2020-05-29T00:39:37Z

This has already been implemented for 4.5 and is potentially being backported to 4.4 here: openshift/cluster-version-operator#376
/lgtm

openshift-ci-robot · 2020-05-29T00:39:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sdodson

Needs approval from an approver in each of these files:

~~OWNERS~~ [deads2k,sdodson]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2020-05-29T03:51:17Z

We need a feature-request bug for this, targeting 4.5.0, that links all the work and that we can clone back to 4.4.z so QE can verify all the backports.

create all clusteroperators in the CVO payload immediately

33c4438

openshift-ci-robot assigned abhinavdahiya, sdodson and wking Dec 11, 2019

openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 11, 2019

openshift-ci-robot requested review from jwmatthews and soltysh December 11, 2019 18:21

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2019

abhinavdahiya reviewed Dec 11, 2019

add details about how to narrowly scope the CVO change for creating c…

aec1a0d

…lusteroperator early

wking reviewed Dec 12, 2019

smarterclayton reviewed Dec 12, 2019

wking reviewed Dec 12, 2019

ecordell reviewed Jan 3, 2020

deads2k mentioned this pull request Mar 13, 2020

reconcile relatedObjects and place them in the clusteroperator for CVO openshift/machine-config-operator#1566

Merged

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2020

openshift-merge-robot merged commit c76953f into openshift:master May 29, 2020

wking mentioned this pull request Jun 17, 2020

Bug 1841239: Avoid pre-creating clusteroperators that should be excluded openshift/cluster-version-operator#376

Merged

