Skip to content

Conversation

@deads2k
Copy link
Contributor

@deads2k deads2k commented Dec 11, 2019

clusteroperators.config.openshift.io are used to determine success of installation.
They are also used to drive collection of debugging data from tools like oc adm inspect and oc adm must-gather.
The clusteroperator resources in a payload should always be present, even before installation of particular
operators to see which clusteroperators need to report in and to allow establishing .status.relatedResources before
an operator pod runs.
This is critical for debugging clusters that fail to install or fail to upgrade with new operators present.

/assign @abhinavdahiya @wking @sdodson

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 11, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 11, 2019
Comment on lines +53 to +54
1. `clusteroperator` resources in the payload should be created with the required status conditions (available, progressing,
degraded) set to `Unknown`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/openshift/api/blob/9ffd03c1c270ddd8cbb625295b1a74ade7e01229/config/v1/types_cluster_operator.go#L141-L175

has more than listed here, which ones do we care about now, and how we get this initial list updated when more conditions become the norm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/openshift/api/blob/9ffd03c1c270ddd8cbb625295b1a74ade7e01229/config/v1/types_cluster_operator.go#L141-L175

has more than listed here, which ones do we care about now, and how we get this initial list updated when more conditions become the norm

Only those three are required. I would draw the line at required.

Comment on lines +55 to +56
2. `clusteroperator` creation by the CVO needs to honor or update `.status.relatedResources`. This requires updating
status after the creation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both the conditions and relatedObjects are status field. ie since status cannot be updated with CREATE, we need to clearly define when these fields need to be defaulted as these are going to be 2 calls and hence open to races.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both the conditions and relatedObjects are status field. ie since status cannot be updated with CREATE, we need to clearly define when these fields need to be defaulted as these are going to be 2 calls and hence open to races.

however, we know that the operators are all controllers, so a race without consequence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a spec field for relatedObjects instead / as well? with spec.relatedObjects being "gather this even if the operator doesn't start" and status.relatedObjects being updated by the operator continuously.

Comment on lines +57 to +58
3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering
they are located. This ensures that they are always present during collection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we define immediately ? also why is this behavior required?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this introduce a new API that operators can assume the these conditions will be pre-set..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this introduce a new API that operators can assume the these conditions will be pre-set..?

Not in any reasonable way. They are controllers.

can we define immediately ? also why is this behavior required?

As noted in motivation above, we need the metadata to be able to collect from failed installs. Immediately I would define as searching the entire payload for clusteroperators and creating them before anything else.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, this is going to break a lot of progress notification stuff. This is hugely ugly and may revert end user stuff. We will probably have to lie about progress - do a first pass create attempt, but if it fails do not report it at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not special-case ordering for a particular type. We already provide a way for operators to declare manifest ordering, let's just use that instead of complicating it. Having operators each PR to shift their ClusterOperator manifest to the front of their block is enough to get them all safely up by the time there is any operator-specific content to gather. If an operator-adding update gets hung up early without pushing the ClusterOperator for the new operator, then there would be no other components associated with that new operator (namespace, deployments, etc.) around to gather.

3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering
they are located. This ensures that they are always present during collection.
4. The CVO waiting logic on `clusteroperator` remains the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a user deleted the clusteroperator object, is the responsibilty of CVO to create and set these defaults again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a user deleted the clusteroperator object, is the responsibilty of CVO to create and set these defaults again?

it would be a race with clusteroperators. I don't think the distinction matters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be a race with clusteroperators.

Racing in general is not great, but it's mitigated in the ClusterOperator case by having no spec and very little metadata (e.g. here).


## Proposal

1. `clusteroperator` resources in the payload should be created with the required status conditions (available, progressing,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since https://github.com/openshift/enhancements/pull/149/files#diff-01afa9f6f5d804a7c3bb01b2ccb0b664R55 expects the operators to define the default relatedObjects in the release-image.. why not use the same mechanism for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since https://github.com/openshift/enhancements/pull/149/files#diff-01afa9f6f5d804a7c3bb01b2ccb0b664R55 expects the operators to define the default relatedObjects in the release-image.. why not use the same mechanism for these?

no preference on implementation.

3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering
they are located. This ensures that they are always present during collection.
4. The CVO waiting logic on `clusteroperator` remains the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster operator object in the relase-image has fields in the status ie. status.versions that provide CVO the version it should be waiting for operator to be for done upgrading

So this new behavior now makes certain fields in the release-image manifest indication of create this default and certain fields required for upgrade done criteria

This is getting a little convoluted...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cluster operator object in the relase-image has fields in the status ie. status.versions that provide CVO the version it should be waiting for operator to be for done upgrading

So this new behavior now makes certain fields in the release-image manifest indication of create this default and certain fields required for upgrade done criteria

This is getting a little convoluted...

from an end-user perspective the behavior is easy, for operators no behavior changes, for the CVO an extra stanza is added, it doesn't seem too bad.

@abhinavdahiya
Copy link
Contributor

abhinavdahiya commented Dec 11, 2019

the clusteroperator is an object owned and managed by operators themselves and used to provide their status and creating the cluster operator for them seeming weird to me personally.

also between the benefit of capturing the related objects vs the increase confusion for co objects in release-image wrt cvo, personally i'm on the side of not increasing this confusion.

@deads2k
Copy link
Contributor Author

deads2k commented Dec 12, 2019

the clusteroperator is an object owned and managed by operators themselves and used to provide their status and creating the cluster operator for them seeming weird to me personally.

also between the benefit of capturing the related objects vs the increase confusion for co objects in release-image wrt cvo, personally i'm on the side of not increasing this confusion.

We have a supportability problem in the field. I'm open to other alternatives that don't require writing additional knowledge into debugging tools, but this seems like a very good balance between debuggability and effort with a very low impact across the org.


## Release Signoff Checklist

- [ ] Enhancement is `implementable`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shouldn't we check this off to match status: implementable above^^?


### Specific Implementation Option

This isn't a required mechanism for implementation, but it demonstrates how narrowly scoped the change is.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not want to do this at all. This is horrifying. :)


This isn't a required mechanism for implementation, but it demonstrates how narrowly scoped the change is.
1. create a new control loop with a clusteroperator lister, clusteroperator client, and a function to get the current payload.
2. register event handlers on clusteroperator informer and time based every minute.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the payload should simply find all CV and do a first pass on them, ignoring errors or progress reporting, then continue. We have to deadline it, but I do not want a new loop.

Copy link
Contributor

@smarterclayton smarterclayton Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super specific case on install / upgrade.

I'm NAKing this impl, I'd accept something at the beginning of sync payload that loops over the manifests and does a quick parallel run of all cluster operators with a bounded scope, then does normal processing. It's an optimization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but in case it isn't clear, I'm fine with this proposal as long as it doesn't in any way disturb the current, well tested, SANE loops).

3. sync loop reads the current payload. for each clusteroperator in the payload
1. check lister to see if clusteroperator exists. If so, continue to next clusteroperator.
2. create clusteroperator with empty spec and metadata. If create fails, continue to next clusteroperator.
3. update clusteroperator/status with `.status.relatedResources` and the three required conditions in `Unknown` state.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What needs to go in relatedResources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What needs to go in relatedResources?

Responsibility of the individual operators. Generally it will be things like input resources with fixed names and interesting namespaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so as far as this enhancement is concerned, we can drop that and have this line (for the CVO) be:

  1. update clusteroperator/status with the three required conditions in Unknown state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so as far as this enhancement is concerned, we can drop that...

And without anything useful in relatedObjects (is relatedResources a typo?), must-gather is still stuck, right? I don't see how the ClusterVersion operator would know what to fill in there unless we put some sort of annotation on the intended, CVO created resources. So an operator's manifest set would look like:

  1. Namespace, whatever else we expect to not have problems with, all have the release.openshift.io/operator-resource: <operator-name> annotation.
  2. The ClusterOperator. When the CVO sees this is missing, it creates it, seeds the conditions as you've laid out, and fills relatedObjects with anything it had already pushed that sync round with the release.openshift.io/operator-resource annotation whose annotation value matched the ClusterOperator name.
  3. Deployments or other vulnerable types. At this point it would be an error to hit anything with a matching release.openshift.io/operator-resource annotation. Not sure how to enforce that error, probably go Degraded but keep working.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So an operator's manifest set would look like...

Never mind, this is crazy. Better to have operator maintainers set their intended relatedObjects content in their ClusterOperator.

### Test Plan

1. When an install in CI fails at some point in the release, we should see must-gather information
2. During an installation, the `clusteroperator` resources should be visible via the API immediately.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI coverage for this immediately seems tricky. Were you expecting to use audit logs or some such?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI coverage for this immediately seems tricky. Were you expecting to use audit logs or some such?

I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.

That's 1 though. I don't see how 2 is all that important on its own, certainly not enough to be worth specific CI coverage. I'm happy if 1 gets CI coverage. I'm not happy if something important enough to be in the Test Plan is covered by "@deads2k manually looks into this occasionally" ;).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confident enough that I'll investigate a "cluster didn't install" problem next release that I'll know if it works.

That's 1 though. I don't see how 2 is all that important on its own, certainly not enough to be worth specific CI coverage. I'm happy if 1 gets CI coverage. I'm not happy if something important enough to be in the Test Plan is covered by "@deads2k manually looks into this occasionally" ;).

Presumably a unit test will work.

degraded) set to `Unknown`.
2. `clusteroperator` creation by the CVO needs to honor or update `.status.relatedResources`. This requires updating
status after the creation.
3. `clusteroperator` resources in the payload should all be created immediately regardless of where in the payload ordering
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a 2.5: it would be nice if CVO waited for the clusteroperator API to be available in discovery before creating any of the instances.

see also: https://bugzilla.redhat.com/show_bug.cgi?id=1787660 (an issue that we can address per operator, but this would be nice to have nonetheless).

@sdodson
Copy link
Member

sdodson commented May 29, 2020

This has already been implemented for 4.5 and is potentially being backported to 4.4 here openshift/cluster-version-operator#376
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 29, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit c76953f into openshift:master May 29, 2020
@wking
Copy link
Member

wking commented May 29, 2020

We need a feature-request bug for this targeting 4.5.0 that links all the work which we can clone back to 4.4.z so QE can verify all the backports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants