Conversation

@smarterclayton
Contributor

The must-gather and insights operator depend on cluster operators
and related objects in order to identify resources to gather. Because
creating ClusterOperators is delegated to the operators themselves,
install and upgrade failures of new operators can leave us without the
requisite info if the cluster degrades before those steps.

Add a new selective Precreating install mode: at the beginning of an
initializing or upgrading sync pass, make a single pass over all
cluster operators in the payload, without retries, and attempt to
create the ClusterOperators if they don't exist.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 7, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2020
@smarterclayton smarterclayton force-pushed the fast_fill_cluster_operators branch from df84a9b to 2fa9d45 Compare February 7, 2020 19:20
The must-gather and insights operator depend on cluster operators
and related objects in order to identify resources to gather. Because
creating ClusterOperators is delegated to the operators themselves,
install and upgrade failures of new operators can leave us without the
requisite info if the cluster degrades before those steps.

Add a new selective Precreating install mode: at the beginning of an
initializing or upgrading sync pass, make a single pass over all
cluster operators in the payload, without retries, and attempt to
create the ClusterOperators if they don't exist. If we succeed at
creating the object, try exactly once to update status so that
relatedObjects can be set.
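
As a rough illustration of the flow described above, here is a minimal sketch in Go; the function name, the pared-down client interface, and the error-handling details are assumptions for illustration, not the PR's actual code.

package cvo

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"

	configv1 "github.com/openshift/api/config/v1"
)

// clusterOperatorClient is a pared-down stand-in for the generated client the
// CVO uses; only the two calls needed by this sketch are included.
type clusterOperatorClient interface {
	Create(co *configv1.ClusterOperator) (*configv1.ClusterOperator, error)
	UpdateStatus(co *configv1.ClusterOperator) (*configv1.ClusterOperator, error)
}

// preCreateClusterOperators (hypothetical name) makes a single, non-retrying
// pass over the ClusterOperator manifests in the payload, creates any that do
// not exist yet, and then tries exactly once to fill in status so that
// relatedObjects is available to must-gather and the insights operator.
func preCreateClusterOperators(client clusterOperatorClient, manifests []*configv1.ClusterOperator) {
	for _, m := range manifests {
		created, err := client.Create(m.DeepCopy())
		if apierrors.IsAlreadyExists(err) {
			continue // the operator (or a previous pass) already created it
		}
		if err != nil {
			continue // single pass, no retries; the normal sync will catch up
		}
		created.Status.RelatedObjects = m.Status.DeepCopy().RelatedObjects
		if _, err := client.UpdateStatus(created); err != nil {
			continue // best effort only
		}
	}
}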
@smarterclayton smarterclayton force-pushed the fast_fill_cluster_operators branch from 2fa9d45 to 2a469e3 Compare February 7, 2020 19:34
@wking
Member

wking commented Feb 16, 2020

images failed with a CI-registry flake:

could not copy stable imagestreamtag: Timeout: request did not complete within allowed duration

/retest

@smarterclayton
Contributor Author

/retest

@smarterclayton
Contributor Author

Looks like machine-config ends up without any relatedObjects - not sure why.

@kikisdeliveryservice

Looks like machine-config ends up without any relatedObjects - not sure why.

Where should we be looking to see that the relatedObjects are missing? I don't really know what I'm supposed to look at here.

@deads2k
Contributor

deads2k commented Mar 13, 2020

Looks like machine-config ends up without any relatedObjects - not sure why.

Where should we be looking to see that the relatedObjects are missing? I don't really know what I'm supposed to look at here.

Based on the method name here, it looks like relatedObjects is only set by the MCO during the first updateStatus, not on all updateStatus calls. Meaning the MCO isn't reconciling.
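
To illustrate the reconciling pattern being suggested, a small hypothetical sketch (not the MCO's actual code, using configv1 from github.com/openshift/api/config/v1) that re-asserts relatedObjects on every status sync rather than only on the first one:

// syncStatus (hypothetical) re-asserts relatedObjects on every status update,
// so the references survive even if the object was pre-created empty by the
// CVO or cleared by some other actor.
func syncStatus(co *configv1.ClusterOperator) {
	co.Status.RelatedObjects = []configv1.ObjectReference{
		{Resource: "namespaces", Name: "openshift-machine-config-operator"},
		// ...whatever else the operator wants must-gather to collect.
	}
	// ...set conditions and versions, then call UpdateStatus as usual.
}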

@deads2k
Contributor

deads2k commented Mar 13, 2020

@sdodson
Member

sdodson commented Mar 18, 2020

@wking @abhinavdahiya Can we work towards getting this reviewed by end of week?

@deads2k
Contributor

deads2k commented Mar 26, 2020

/retest

clusterOperator.Status.RelatedObjects = os.Status.DeepCopy().RelatedObjects
if _, err := b.createClient.UpdateStatus(clusterOperator); err != nil {
	if kerrors.IsConflict(err) {
		return nil
Contributor


hmm.. why would we not error? This could cause the CVO to move out of precreating mode without relatedObjects set, and that seemed like it was the entire goal of this..?

Contributor

@deads2k deads2k Apr 2, 2020


hmm.. why would we not error? This could cause the CVO to move out of precreating mode without relatedObjects set, and that seemed like it was the entire goal of this..?

We want to have a shot at every other clusteroperator resource. If we have a conflict, we're done with this one, but not all of them.

Having a conflict isn't an error, it just means we're done.
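
A minimal sketch of the shape being discussed, following the review snippet above (kerrors is k8s.io/apimachinery/pkg/api/errors, as in the snippet); the method name and the surrounding type are assumptions, only the conflict handling mirrors the snippet:

// builder is a stand-in for the type the snippet belongs to; only the field
// used here is sketched.
type builder struct {
	createClient clusterOperatorClient // see the interface sketched earlier
}

// updatePrecreatedStatus (hypothetical name) copies relatedObjects from the
// payload manifest onto the freshly created ClusterOperator. A 409 Conflict
// means another writer, usually the operator itself, already updated status,
// so this object is done and the caller moves on to the next one.
func (b *builder) updatePrecreatedStatus(clusterOperator, os *configv1.ClusterOperator) error {
	clusterOperator.Status.RelatedObjects = os.Status.DeepCopy().RelatedObjects
	if _, err := b.createClient.UpdateStatus(clusterOperator); err != nil {
		if kerrors.IsConflict(err) {
			return nil // not an error: we are simply done with this one
		}
		return err
	}
	return nil
}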

@deads2k
Contributor

deads2k commented Apr 2, 2020

/test all

@deads2k
Contributor

deads2k commented Apr 3, 2020

/refresh

@deads2k
Contributor

deads2k commented Apr 6, 2020

/retest

@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@deads2k
Contributor

deads2k commented Apr 14, 2020

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 07e65a3 into openshift:master Apr 14, 2020
@deads2k
Contributor

deads2k commented Apr 15, 2020

/cherrypick release-4.4

@openshift-cherrypick-robot

@deads2k: new pull request created: #348


In response to this:

/cherrypick release-4.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

enxebre added a commit to enxebre/cluster-version-operator that referenced this pull request Apr 27, 2020
This openshift#318 introduced a PrecreatingMode which actually lets the CVO push the clusterOperator defined in manifests.
This enables a new pattern where individual operators set related objects on the clusterOperator defined in manifests/: openshift/cluster-kube-apiserver-operator#833
openshift/cluster-etcd-operator#312
openshift/cluster-kube-controller-manager-operator#398
wking added a commit to wking/cluster-version-operator that referenced this pull request Jun 3, 2020
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

Cherry-picked from eea2092 (pkg/cvo/sync_worker: Generalize
CancelError to ContextError, 2020-05-28, openshift#378) and edited to resolve
context conflicts because release-4.4 lacks 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).

[1]: https://golang.org/pkg/context/#Context
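
A small before/after sketch of the check being described; the helper shapes are simplified assumptions, but the message format follows the commit message above:

import (
	"context"
	"fmt"
)

// Before: a helper that polled Done() and could only ever report "cancelled".
func contextIsCancelled(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return true
	default:
		return false
	}
}

// After: Context.Err() distinguishes context.Canceled from
// context.DeadlineExceeded, so the message can say which one happened, e.g.
// "update context deadline exceeded at 1 of 2".
func checkContext(ctx context.Context, step, total int) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("update %v at %d of %d", err, step, total)
	}
	return nil
}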
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jun 4, 2020
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

Cherry-picked from eea2092 (pkg/cvo/sync_worker: Generalize
CancelError to ContextError, 2020-05-28, openshift#378) and edited to resolve
context conflicts because release-4.4 lacks 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).

[1]: https://golang.org/pkg/context/#Context
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Jun 11, 2020
With this commit, I drop contextIsCancelled in favor of Context.Err().
From the docs [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context
  was canceled or DeadlineExceeded if the context's deadline
  passed.  After Err returns a non-nil error, successive calls to
  Err return the same error.

I dunno why we'd been checking Done() instead, but contextIsCancelled
dates back to 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).

I've also generalized a number of *Cancel* helpers to be *Context* to
remind folks that Context.Err() can be DeadlineExceeded as well as
Canceled, and the CVO uses both WithCancel and WithTimeout.  The new
error messages will be either:

  update context deadline exceeded at 1 of 2

or:

  update context canceled at 1 of 2

Instead of always claiming:

  update was cancelled at 1 of 2

Cherry-picked from eea2092 (pkg/cvo/sync_worker: Generalize
CancelError to ContextError, 2020-05-28, openshift#378) and edited to resolve
context conflicts because release-4.4 lacks 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).

[1]: https://golang.org/pkg/context/#Context
wking added a commit to wking/cluster-version-operator that referenced this pull request Mar 16, 2021
This is a hack fix for [1], where we have a delay on 4.6->4.7 updates,
and on some 4.7 installs, between the very early ClusterOperator
precreation and the operator eventually coming up to set its status
conditions.  In the interim, there are no conditions, which causes
cluster_operator_up to be 0, which causes the critical
ClusterOperatorDown to fire.  We'll want a more general fix going
forward, this commit is a temporary hack to avoid firing the critical
ClusterOperatorDown while we build consensus around the general fix.

The downside to dropping precreates for this operator is that we lose
the must-gather references when the operator fails to come up.  That
was what precreation was designed to address in 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).
If we actually get a must-gather without the bare-metal bits and we
miss them, we can revisit the approach this hack is taking.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1929917
wking added a commit to wking/cluster-version-operator that referenced this pull request Mar 16, 2021
This is a hack fix for [1], where we have a delay on 4.6->4.7 updates,
and on some 4.7 installs, between the very early ClusterOperator
precreation and the operator eventually coming up to set its status
conditions.  In the interim, there are no conditions, which causes
cluster_operator_up to be 0, which causes the critical
ClusterOperatorDown to fire.  We'll want a more general fix going
forward, this commit is a temporary hack to avoid firing the critical
ClusterOperatorDown while we build consensus around the general fix.

The downside to dropping precreates for this operator is that we lose
the must-gather references when the operator fails to come up.  That
was what precreation was designed to address in 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).
If we actually get a must-gather without the bare-metal bits and we
miss them, we can revisit the approach this hack is taking.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1929917
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Mar 20, 2021
This is a hack fix for [1], where we have a delay on 4.6->4.7 updates,
and on some 4.7 installs, between the very early ClusterOperator
precreation and the operator eventually coming up to set its status
conditions.  In the interim, there are no conditions, which causes
cluster_operator_up to be 0, which causes the critical
ClusterOperatorDown to fire.  We'll want a more general fix going
forward, this commit is a temporary hack to avoid firing the critical
ClusterOperatorDown while we build consensus around the general fix.

The downside to dropping precreates for this operator is that we lose
the must-gather references when the operator fails to come up.  That
was what precreation was designed to address in 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).
If we actually get a must-gather without the bare-metal bits and we
miss them, we can revisit the approach this hack is taking.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1929917
wking added a commit to wking/cluster-version-operator that referenced this pull request Mar 28, 2021
This is a hack fix for [1], where we have a delay on 4.6->4.7 updates,
and on some 4.7 installs, between the very early ClusterOperator
precreation and the operator eventually coming up to set its status
conditions.  In the interim, there are no conditions, which causes
cluster_operator_up to be 0, which causes the critical
ClusterOperatorDown to fire.  We'll want a more general fix going
forward, this commit is a temporary hack to avoid firing the critical
ClusterOperatorDown while we build consensus around the general fix.

The downside to dropping precreates for this operator is that we lose
the must-gather references when the operator fails to come up.  That
was what precreation was designed to address in 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318).
If we actually get a must-gather without the bare-metal bits and we
miss them, we can revisit the approach this hack is taking.

Manually picked back to 4.6, which doesn't include b0f73af (Don't
create ClusterOperator during precreation step if it's present in
overrides, 2020-12-10, openshift#488).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1929917
wking added a commit to wking/cluster-version-operator that referenced this pull request May 6, 2021
…fest-task node

ClusterOperator pre-creation landed in 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318)
to move us from:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

to:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator
   namespace, and allows us to trouble-shoot the issue.

But when ClusterOperator pre-creation happens at the beginning of an
update sync cycle, it can take a while before the CVO gets from the
ClusterOperator creation in (1) to the operator managing that
ClusterOperator in (4), which can lead to ClusterOperatorDown alerts
[1,2].

fdef37d (pkg/cvo/sync_worker: Skip precreation of baremetal
ClusterOperator, 2021-03-16, openshift#531) landed a narrow hack to avoid
issues on 4.6 -> 4.7 updates, which added the baremetal operator [1].
But we're adding a cloud-controller-manager operator in 4.7 -> 4.8,
and breaking the same way [2].  This commit pivots to a more generic
fix, by delaying the pre-creation until the CVO reaches the
manifest-task node containing the ClusterOperator manifest.  That will
usually be the same node that has the other critical operator
manifests like the namespace, RBAC, and operator deployment.

Dropping fdef37d's baremetal hack will re-expose us to issues on
install, where we race through all the manifests as fast as possible.
It's possible that we will now pre-create the ClusterOperator early
(because it's only blocked by the CRD) and still be a ways in front of
the operator pod coming up (because that needs a schedulable
control-plane node).  But we can address that by surpressing
ClusterOperatorDown and ClusterOperatorDegraded for some portion of
install in follow-up work.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1929917
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1957775
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request May 11, 2021
…fest-task node

ClusterOperator pre-creation landed in 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318)
to move us from:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

to:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator
   namespace, and allows us to trouble-shoot the issue.

But when ClusterOperator pre-creation happens at the beginning of an
update sync cycle, it can take a while before the CVO gets from the
ClusterOperator creation in (1) to the operator managing that
ClusterOperator in (4), which can lead to ClusterOperatorDown alerts
[1,2].

fdef37d (pkg/cvo/sync_worker: Skip precreation of baremetal
ClusterOperator, 2021-03-16, openshift#531) landed a narrow hack to avoid
issues on 4.6 -> 4.7 updates, which added the baremetal operator [1].
But we're adding a cloud-controller-manager operator in 4.7 -> 4.8,
and breaking the same way [2].  This commit pivots to a more generic
fix, by delaying the pre-creation until the CVO reaches the
manifest-task node containing the ClusterOperator manifest.  That will
usually be the same node that has the other critical operator
manifests like the namespace, RBAC, and operator deployment.

Dropping fdef37d's baremetal hack will re-expose us to issues on
install, where we race through all the manifests as fast as possible.
It's possible that we will now pre-create the ClusterOperator early
(because it's only blocked by the CRD) and still be a ways in front of
the operator pod coming up (because that needs a schedulable
control-plane node).  But we can address that by suppressing
ClusterOperatorDown and ClusterOperatorDegraded for some portion of
install in follow-up work.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1929917
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1957775
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 23, 2022
Originally, all component operators were responsible for creating
their own ClusterOperator, and we'd just watch to make sure we were
happy enough with what they did.  However, on install, or when
updating to a version that added a new component, we could have
timelines like:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

So in 2a469e3 (cvo: When installing or upgrading, fast-fill
cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator
pre-creation to get:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator
   namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create
their own ClusterOperator, because that was the only path before the
CVO learned about pre-creation.  And even since then, most new
operators come into the cluster on install or on update, when the CVO
is pre-creating.  New in 4.12, the platform-operator is coming in [1],
and it has two relevant characteristics:

* It does not know how to create the platform-operators-aggregated
  ClusterOperator [2].
* It is gated behind TechPreviewNoUpgrade [3].

So we are exposed to:

1. Admin installs a cluster.  No platform-operators-aggregated,
   because it's not TechPreviewNoUpgrade.
2. Install complete.  CVO transitions to reconciling mode.
3. Admin enables TechPreviewNoUpgrade.
4. CVO notices, and reboots fc00c62 (update the manifest selection
   to honor any featureset, 2022-08-17, openshift#821).
5. Because we decided to not transition into updating mode for
   feature-set changes, we stay in reconciling mode.
6. Because we're in reconciling mode, we skip the ClusterOperator
   pre-creation, and get right in to the status check.
7. Because the platform operator didn't create the ClusterOperator
   either, the CVO's status check fails with [2]:

     45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation
conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check
   succeeds (at least, after the operator writes acceptable status to
   the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster
resources had been deleted, assuming the CVO was still alive.  There
may be other component operators who rely on the CVO for
ClusterOperator creation, but which we haven't noticed because they
aren't also gated behind TechPreviewNoUpgrade.

[1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md
[2]: https://issues.redhat.com/browse/OCPBUGS-1636
[3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 23, 2022
Originally, all component operators were responsible for creating
their own ClusterOperator, and we'd just watch to make sure we were
happy enough with what they did.  However, on install, or when
updating to a version that added a new component, we could have
timelines like:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

So in 2a469e3 (cvo: When installing or upgrading, fast-fill
cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator
pre-creation to get:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator
   namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create
their own ClusterOperator, because that was the only path before the
CVO learned about pre-creation.  And even since then, most new
operators come into the cluster on install or on update, when the CVO
is pre-creating.  New in 4.12, the platform-operator is coming in [1],
and it has two relevant characteristics:

* It does not know how to create the platform-operators-aggregated
  ClusterOperator [2].
* It is gated behind TechPreviewNoUpgrade [3].

So we are exposed to:

1. Admin installs a cluster.  No platform-operators-aggregated,
   because it's not TechPreviewNoUpgrade.
2. Install complete.  CVO transitions to reconciling mode.
3. Admin enables TechPreviewNoUpgrade.
4. CVO notices, and reboots fc00c62 (update the manifest selection
   to honor any featureset, 2022-08-17, openshift#821).
5. Because we decided to not transition into updating mode for
   feature-set changes, we stay in reconciling mode.
6. Because we're in reconciling mode, we skip the ClusterOperator
   pre-creation, and get right in to the status check.
7. Because the platform operator didn't create the ClusterOperator
   either, the CVO's status check fails with [2]:

     45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation
conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check
   succeeds (at least, after the operator writes acceptable status to
   the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster
resources had been deleted, assuming the CVO was still alive.  There
may be other component operators who rely on the CVO for
ClusterOperator creation, but which we haven't noticed because they
aren't also gated behind TechPreviewNoUpgrade.

[1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md
[2]: https://issues.redhat.com/browse/OCPBUGS-1636
[3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8