@wking wking commented May 6, 2021

ClusterOperator pre-creation landed in 2a469e3 (#318) to move us from:

  1. CVO creates a namespace for an operator.
  2. CVO creates ... for the operator.
  3. CVO creates the operator Deployment.
  4. Operator deployment never comes up, for whatever reason.
  5. Admin must-gathers.
  6. must-gather uses ClusterOperators to discover which resources to collect, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

to:

  1. CVO pre-creates ClusterOperator for an operator.
  2. CVO creates the namespace for an operator.
  3. CVO creates ... for the operator.
  4. CVO creates the operator Deployment.
  5. Operator deployment never comes up, for whatever reason.
  6. Admin must-gathers.
  7. must-gather uses ClusterOperators to discover which resources to collect, finds the one the CVO had pre-created with hard-coded relatedObjects, gathers data from the referenced operator namespace, and lets us troubleshoot the issue.
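
For illustration only, here is a rough Go sketch of why the pre-created ClusterOperator matters in step 7. It is not must-gather's actual code: the `objectReference` and `clusterOperator` types are simplified, hypothetical stand-ins, and `namespacesToGather` is a made-up helper that assumes namespaces are discovered purely by walking relatedObjects.

```go
package main

import (
	"fmt"
	"sort"
)

// objectReference is a simplified, hypothetical stand-in for one
// relatedObjects entry on a ClusterOperator.
type objectReference struct {
	Group, Resource, Namespace, Name string
}

// clusterOperator is a simplified, hypothetical stand-in for a ClusterOperator.
type clusterOperator struct {
	Name           string
	RelatedObjects []objectReference
}

// namespacesToGather returns the sorted set of namespaces referenced by the
// given ClusterOperators. With no ClusterOperator there is nothing to follow.
func namespacesToGather(operators []clusterOperator) []string {
	seen := map[string]bool{}
	for _, co := range operators {
		for _, ref := range co.RelatedObjects {
			switch {
			case ref.Group == "" && ref.Resource == "namespaces":
				// The reference points at a namespace directly.
				seen[ref.Name] = true
			case ref.Namespace != "":
				// The reference points at a namespaced object.
				seen[ref.Namespace] = true
			}
		}
	}
	names := make([]string, 0, len(seen))
	for ns := range seen {
		names = append(names, ns)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Old flow: the ClusterOperator does not exist yet.
	fmt.Println(namespacesToGather(nil))

	// New flow: the CVO pre-created it with hard-coded relatedObjects.
	// "example" and "openshift-example" are made-up names.
	precreated := []clusterOperator{{
		Name: "example",
		RelatedObjects: []objectReference{
			{Resource: "namespaces", Name: "openshift-example"},
		},
	}}
	fmt.Println(namespacesToGather(precreated))
}
```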

But when ClusterOperator pre-creation happens at the beginning of an update sync cycle, it can take a while before the CVO gets from the ClusterOperator creation in (1) to the operator managing that ClusterOperator in (4), which can lead to ClusterOperatorDown alerts (rhbz#1929917, rhbz#1957775).

fdef37d (#531) landed a narrow hack to avoid issues on 4.6 -> 4.7 updates, which added the baremetal operator (rhbz#1929917). But we're adding a cloud-controller-manager operator in 4.7 -> 4.8, and breaking the same way (rhbz#1957775). This commit pivots to a more generic fix, by delaying the pre-creation until the CVO reaches the manifest-task node containing the ClusterOperator manifest. That will usually be the same node that has the other critical operator manifests like the namespace, RBAC, and operator deployment.
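
As a rough illustration of the ordering change (not the actual pkg/cvo/sync_worker code), the Go sketch below compares pre-creating every ClusterOperator at the start of a sync with pre-creating each one when its task node is reached. The `manifest` and `taskNode` types, the `syncOrder` and `skew` helpers, and the manifest names are all hypothetical, and the sketch assumes the operator Deployment shares its name with the ClusterOperator.

```go
package main

import "fmt"

// manifest is a simplified, hypothetical stand-in for a release manifest.
type manifest struct {
	Kind string
	Name string
}

// taskNode is a hypothetical stand-in for a group of manifests applied together.
type taskNode struct {
	Manifests []manifest
}

// syncOrder lists the actions one sync pass would take. With precreateUpFront
// set, every ClusterOperator is pre-created before any manifest is applied;
// otherwise each is pre-created when its own task node is reached.
func syncOrder(nodes []taskNode, precreateUpFront bool) []string {
	var actions []string
	precreate := func(ms []manifest) {
		for _, m := range ms {
			if m.Kind == "ClusterOperator" {
				actions = append(actions, "precreate ClusterOperator/"+m.Name)
			}
		}
	}
	if precreateUpFront {
		for _, n := range nodes {
			precreate(n.Manifests)
		}
	}
	for _, n := range nodes {
		if !precreateUpFront {
			precreate(n.Manifests)
		}
		for _, m := range n.Manifests {
			actions = append(actions, "apply "+m.Kind+"/"+m.Name)
		}
	}
	return actions
}

// skew counts the steps between pre-creating a ClusterOperator and applying
// the Deployment of the same name, or returns -1 if either is missing.
func skew(actions []string, name string) int {
	pre, dep := -1, -1
	for i, a := range actions {
		switch a {
		case "precreate ClusterOperator/" + name:
			pre = i
		case "apply Deployment/" + name:
			dep = i
		}
	}
	if pre < 0 || dep < 0 {
		return -1
	}
	return dep - pre
}

// exampleNodes builds two made-up task nodes, one per operator.
func exampleNodes() []taskNode {
	node := func(name string) taskNode {
		return taskNode{Manifests: []manifest{
			{Kind: "Namespace", Name: name},
			{Kind: "Deployment", Name: name},
			{Kind: "ClusterOperator", Name: name},
		}}
	}
	return []taskNode{node("example"), node("baremetal")}
}

func main() {
	fmt.Println("up-front skew:", skew(syncOrder(exampleNodes(), true), "baremetal"))
	fmt.Println("per-node skew:", skew(syncOrder(exampleNodes(), false), "baremetal"))
}
```

In a real release there are many more task nodes ahead of a late operator, so the up-front gap grows with the size of the payload while the per-node gap stays bounded by the node's own manifests.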

Dropping fdef37d's baremetal hack will re-expose us to issues on install, where we race through all the manifests as fast as possible. It's possible that we will now pre-create the ClusterOperator early (because it's only blocked by the CRD) and still be well ahead of the operator pod coming up (because that needs a schedulable control-plane node). But we can address that by suppressing ClusterOperatorDown and ClusterOperatorDegraded for some portion of install in follow-up work.

…fest-task node

ClusterOperator pre-creation landed in 2a469e3 (cvo: When
installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318)
to move us from:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. must-gather uses ClusterOperators to discover which resources to
   collect, and because the ClusterOperator doesn't exist yet, we get
   no data about why the deployment didn't come up.

to:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. must-gather uses ClusterOperators to discover which resources to
   collect, finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers data from the referenced operator
   namespace, and lets us troubleshoot the issue.

But when ClusterOperator pre-creation happens at the beginning of an
update sync cycle, it can take a while before the CVO gets from the
ClusterOperator creation in (1) to the operator managing that
ClusterOperator in (4), which can lead to ClusterOperatorDown alerts
[1,2].

fdef37d (pkg/cvo/sync_worker: Skip precreation of baremetal
ClusterOperator, 2021-03-16, openshift#531) landed a narrow hack to avoid
issues on 4.6 -> 4.7 updates, which added the baremetal operator [1].
But we're adding a cloud-controller-manager operator in 4.7 -> 4.8,
and breaking the same way [2].  This commit pivots to a more generic
fix, by delaying the pre-creation until the CVO reaches the
manifest-task node containing the ClusterOperator manifest.  That will
usually be the same node that has the other critical operator
manifests like the namespace, RBAC, and operator deployment.

Dropping fdef37d's baremetal hack will re-expose us to issues on
install, where we race through all the manifests as fast as possible.
It's possible that we will now pre-create the ClusterOperator early
(because it's only blocked by the CRD) and still be well ahead of
the operator pod coming up (because that needs a schedulable
control-plane node).  But we can address that by suppressing
ClusterOperatorDown and ClusterOperatorDegraded for some portion of
install in follow-up work.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1929917
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1957775
@openshift-ci openshift-ci bot added the bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. label May 6, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 6, 2021

@wking: This pull request references Bugzilla bug 1957775, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianlinliu

Details

In response to this:

Bug 1957775: pkg/cvo/sync_worker: Shift ClusterOperator pre-creation into the manifest-task node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label May 6, 2021
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 6, 2021

@wking: This pull request references Bugzilla bug 1957775, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianlinliu

Details

In response to this:

Bug 1957775: pkg/cvo/sync_worker: Shift ClusterOperator pre-creation into the manifest-task node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jottofar
Contributor

jottofar commented May 6, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@wking
Member Author

wking commented May 8, 2021

update job just had Kube API-server connectivity issues, which are unrelated.

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Contributor

openshift-ci bot commented May 8, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

update job just had Kube API-server connectivity issues, which are unrelated.

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit ca52290 into openshift:master May 8, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 8, 2021

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1957775 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1957775: pkg/cvo/sync_worker: Shift ClusterOperator pre-creation into the manifest-task node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the reduce-ClusterOperator-pre-creation-skew branch May 8, 2021 13:28
@wking
Member Author

wking commented May 11, 2021

/cherrypick release-4.7

@openshift-cherrypick-robot

@wking: new pull request created: #557

Details

In response to this:

/cherrypick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
