@wking
Member

wking commented Apr 20, 2020

As described in the bug, bumping status.desired when a new spec.desiredUpdate is set leads to trouble like:

  $ oc adm upgrade --to 4.3.13
  Updating to 4.3.13
  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
  2020-04-20T20:57:12Z RetrievedUpdates True
  2020-04-20T21:23:40Z Available True Done applying 4.3.10
  2020-04-20T22:01:55Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
  2020-04-20T22:16:40Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
  2020-04-20T22:16:40Z Progressing True Unable to apply 4.3.13: it may not be safe to apply this update
  $ oc get -o json clusterversion version | jq -r '.status.desired.version'
  4.3.13
  $ oc adm upgrade --to=4.3.13 --force
  info: Cluster is already at version 4.3.13
  $ oc adm upgrade --clear
  Cleared the update field, still at 4.3.13
  $ oc get -o json clusterversion version | jq -r '.status.desired.version'
  4.3.10

The "already at ..." and "still at ..." messages rely on the status.desired semantics:

  > desired is the version that the cluster is reconciling towards.
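
To illustrate why the messages above come out wrong, here is a minimal Go sketch of a client-side check that trusts status.desired. The `ClusterVersionStatus` type and the `upgradeMessage` function are hypothetical names made up for this example; this is not the actual `oc adm upgrade` code, only an assumption about the kind of comparison that would produce the "already at" message.

```go
package main

// ClusterVersionStatus is a pared-down, hypothetical stand-in for the
// ClusterVersion status; only status.desired.version is modeled here.
type ClusterVersionStatus struct {
	DesiredVersion string // status.desired.version
}

// upgradeMessage is a hypothetical client-side check. It assumes that
// status.desired names the version the cluster is already running or
// reconciling towards, so a matching target is reported as a no-op.
func upgradeMessage(status ClusterVersionStatus, target string) string {
	if status.DesiredVersion == target {
		return "info: Cluster is already at version " + target
	}
	return "Updating to " + target
}
```

Under this assumption, a status.desired that was bumped to 4.3.13 while preconditions were still failing makes the check report "already at version 4.3.13", even though the cluster is still running 4.3.10.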

When the user sets spec.desiredUpdate, they are making their intention clear (and bumping history at this point is appropriate, because we need somewhere to store the verified history entry). But to match the "reconciling towards" semantics, this commit shifts the actual status.desired bump so it happens right after the new CVO comes up (with the new currentVersion).
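
The difference between the two behaviors can be sketched in Go as follows. The `Release` and `operatorState` types and both function names are hypothetical simplifications invented for this example, not the CVO's real code: `current` stands in for the operator's currentVersion and `requested` for spec.desiredUpdate.

```go
package main

// Release is a simplified, hypothetical release descriptor.
type Release struct {
	Version string
}

// operatorState is a hypothetical snapshot of what the operator knows.
type operatorState struct {
	current   Release  // release of the running CVO (currentVersion)
	requested *Release // spec.desiredUpdate, or nil if none is set
}

// desiredBefore sketches the earlier behavior described above:
// status.desired follows the requested update as soon as it is set,
// even while preconditions are still being checked.
func desiredBefore(s operatorState) Release {
	if s.requested != nil {
		return *s.requested
	}
	return s.current
}

// desiredAfter sketches the behavior proposed here: status.desired
// always matches the running CVO, so it only changes once the new CVO
// has come up with a new currentVersion.
func desiredAfter(s operatorState) Release {
	return s.current
}
```

With `current` at 4.3.10 and a blocked request for 4.3.13, `desiredBefore` yields 4.3.13 while `desiredAfter` stays at 4.3.10, which matches the "reconciling towards" reading quoted above.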

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Apr 20, 2020
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1826115, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1826115: pkg/cvo/status: Always set status.desired to match current CVO

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Apr 20, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2020
@wking wking force-pushed the only-record-current-cvo-version-in-status-desired branch from 244ccfc to 3fd35dc Compare April 20, 2020 22:50
@wking
Member Author

wking commented Apr 20, 2020

/hold

I'm going to have to rework this a bit more, because blindly dropping Actual into the history (as the unit tests seem to expect) seems crazy.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 20, 2020
@wking
Member Author

wking commented Apr 20, 2020

/assign @smarterclayton

I'm still working on updating the unit tests for this, but assigning to Clayton for review, because in 961873d (#82) he added a bunch of:

// Prefers the payload version over the operator's version (although in general they will remain in sync

comments, and I'm reversing that here.

As described in [1], bumping status.desired when a new
spec.desiredUpdate is set leads to trouble like:

  $ oc adm upgrade --to 4.3.13
  Updating to 4.3.13
  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
  2020-04-20T20:57:12Z RetrievedUpdates True
  2020-04-20T21:23:40Z Available True Done applying 4.3.10
  2020-04-20T22:01:55Z Upgradeable False Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
  2020-04-20T22:16:40Z Failing True Precondition "ClusterVersionUpgradeable" failed because of "DefaultSecurityContextConstraints_Mutated": Cluster operator kube-apiserver cannot be upgraded: DefaultSecurityContextConstraintsUpgradeable: Default SecurityContextConstraints object(s) have mutated [privileged]
  2020-04-20T22:16:40Z Progressing True Unable to apply 4.3.13: it may not be safe to apply this update
  $ oc get -o json clusterversion version | jq -r '.status.desired.version'
  4.3.13
  $ oc adm upgrade --to=4.3.13 --force
  info: Cluster is already at version 4.3.13
  $ oc adm upgrade --clear
  Cleared the update field, still at 4.3.13
  $ oc get -o json clusterversion version | jq -r '.status.desired.version'
  4.3.10

The "already at ..." and "still at ..." messages rely on
the status.desired semantics of [2]:

  > desired is the version that the cluster is reconciling towards.

Ideally the property would have been called 'current' or some such,
but it's too late to change it now.

When the user sets spec.desiredUpdate, they are making their intention
clear (and bumping history at this point is appropriate, because we
need *somewhere* to store the 'verified' history entry).  But to match
the "reconciling towards" semantics, this commit shifts the actual
status.desired bump so it happens right after the new CVO comes up
(with the new currentVersion).

This change reverses the earlier:

  Prefers the payload version over the operator's version...

comments from 961873d (sync: Do config syncing in the background,
2019-01-11, openshift#82).  It also sets the stage for a world in which [3] has
been fixed and the CVO continues to apply the current release's
manifests while vetting a new target's preconditions in parallel.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1826115
[2]: https://github.com/openshift/api/blob/0f159fee64dbf711d40dac3fa2ec8b563a2aaca8/config/v1/types_cluster_version.go#L82-L87
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1822752
@wking wking force-pushed the only-record-current-cvo-version-in-status-desired branch from 3fd35dc to e9c53b1 Compare April 21, 2020 03:53
@wking
Member Author

wking commented Apr 21, 2020

OK, 3fd35dc -> e9c53b1 should get the unit tests passing in this new world. But it's a pretty large change to the tests, and it reverses the earlier choice of explicitly preferring the sync worker's status over the operator version. So it would still be good to have @smarterclayton sign off on the new direction.

@wking
Member Author

wking commented Apr 21, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2020
@wking
Member Author

wking commented Apr 21, 2020

Hmm, still need to update the integration tests.

// TODO: prune Z versions over transitions to Y versions, keep initial installed version
pruneStatusHistory(config, 50)

config.Status.Desired = desired
Member

@LalatenduMohanty LalatenduMohanty Apr 21, 2020

I am still not clear why this does not work.

Member Author

syncStatus is feeding mergeOperatorHistory status.Actual, which diverges from optr.currentVersion() when an update has been requested and we're working through preconditions. With this commit, we're always setting Desired to currentVersion().
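
As a rough illustration of that split, here is a minimal Go sketch in which the history records the sync worker's Actual while Desired is pinned to the operator's current version. All names here (`Status`, `HistoryEntry`, `recordSync`) are hypothetical and heavily simplified; this is not the real mergeOperatorHistory.

```go
package main

// Release is a simplified, hypothetical release descriptor.
type Release struct {
	Version string
}

// HistoryEntry is a hypothetical, reduced history record.
type HistoryEntry struct {
	Version string
	State   string // "Partial" or "Completed"
}

// Status is a hypothetical, reduced ClusterVersion status.
type Status struct {
	Desired Release
	History []HistoryEntry // newest entry first
}

// recordSync sketches the split described above: the history tracks
// the sync worker's actual target, while Desired is always set to the
// version of the running operator.
func recordSync(status *Status, actual, current Release) {
	// Add a new partial entry only when the target changes.
	if len(status.History) == 0 || status.History[0].Version != actual.Version {
		entry := HistoryEntry{Version: actual.Version, State: "Partial"}
		status.History = append([]HistoryEntry{entry}, status.History...)
	}
	status.Desired = current
}
```

Calling `recordSync` with `actual` at 4.3.13 and `current` at 4.3.10 prepends a partial 4.3.13 history entry but leaves Desired at 4.3.10.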

Member

@LalatenduMohanty LalatenduMohanty left a comment

/lgtm

@openshift-ci-robot
Contributor

@LalatenduMohanty: changing LGTM is restricted to collaborators

In response to this:

/lgtm

@LalatenduMohanty
Member

/lgtm cancel
As I have not gone through the test changes.

@openshift-ci-robot
Contributor

@LalatenduMohanty: changing LGTM is restricted to collaborators

In response to this:

/lgtm cancel
As I have not gone through the test changes.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/unit | 23fa499 | link | /test unit |
| ci/prow/integration | 23fa499 | link | /test integration |
| ci/prow/e2e-gcp-upgrade | 23fa499 | link | /test e2e-gcp-upgrade |
| ci/prow/e2e | 23fa499 | link | /test e2e |
| ci/prow/e2e-upgrade | 23fa499 | link | /test e2e-upgrade |
| ci/prow/gofmt | 23fa499 | link | /test gofmt |

Full PR test history. Your PR dashboard.

@openshift-merge-robot
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-agnostic | 23fa499 | link | /test e2e-agnostic |
| ci/prow/e2e-agnostic-upgrade | 23fa499 | link | /test e2e-agnostic-upgrade |
| ci/prow/e2e-agnostic-operator | 23fa499 | link | /test e2e-agnostic-operator |

Full PR test history. Your PR dashboard.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2021
@openshift-ci-robot openshift-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 9, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented May 9, 2021

@wking: PR needs rebase.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 9, 2021
@openshift-ci openshift-ci bot closed this May 9, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 9, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented May 9, 2021

@wking: This pull request references Bugzilla bug 1826115. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.

In response to this:

Bug 1826115: pkg/cvo/status: Always set status.desired to match current CVO

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/severity-low: Referenced Bugzilla bug's severity is low for the branch this PR is targeting.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.

6 participants