Conversation

@wking
Member

@wking wking commented May 26, 2021

For example, in a recent test cluster:

$ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing is much less direct than the "is not available" we used for Failing. 6c7cd99 (#573) removed the last important abuse of ClusterOperatorNotAvailable (for failures to client.Get the target ClusterOperator). In this commit I'm moving the remaining abuser over to a new ClusterOperatorNoVersions. So the only remaining ClusterOperatorNotAvailable really means "not Available=True", and we can say that more directly.

@openshift-ci
Contributor

openshift-ci bot commented May 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested review from crawford and smarterclayton May 26, 2021 22:32
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 26, 2021
@wking wking force-pushed the disambiguate-cluster-not-available-progressing-message branch from 7c7ad4a to 115ed0a Compare May 27, 2021 02:56
@vrutkovs

/retest

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2021
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 23, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 23, 2021
  return nil
  return true, nil
  }
  if updateErr, ok := lastErr.(*UpdateError); ok {
Contributor
So it will now make at most two attempts, as opposed to the previous behavior of making as many attempts as possible within the ctx deadline. Is that the intent?
If so, the method comment should be updated to reflect that.

@wking
Member Author

Rebased onto master and updated the Task.Run godocs with 115ed0a -> 1414029. We can still make a bunch of requests here, but a single UpdateError is enough to break out of the retry loop and fail fast (which should hopefully be clear from the updated godocs).

@wking wking force-pushed the disambiguate-cluster-not-available-progressing-message branch from 115ed0a to 1414029 Compare October 26, 2021 18:12
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 26, 2021
@wking wking force-pushed the disambiguate-cluster-not-available-progressing-message branch 2 times, most recently from a4e3a77 to c8da304 Compare October 26, 2021 21:54
@wking
Member Author

wking commented Oct 26, 2021

I've fixed some error handling, and set backoff steps for some unit tests in a4e3a77 -> c8da304. Hopefully that greens us up :)

@wking wking force-pushed the disambiguate-cluster-not-available-progressing-message branch from c8da304 to 086b39d Compare October 26, 2021 23:00
wking added 3 commits October 28, 2021 18:03
Things like API-server connection errors and patch conflicts deserve
some retries before we bubble them up into ClusterVersion conditions.
But when we are able to retrieve in-cluster objects and determine that
they are not happy, we should exit more quickly so we can complain
about the resource state and start in on the next sync cycle.  For
example, see the recent e02d148 (pkg/cvo/internal/operatorstatus:
Replace wait-for with single-shot "is it alive now?", 2021-05-12, openshift#560)
and the older cc9292a (lib/resourcebuilder: Replace wait-for with
single-shot "is it alive now?", 2020-07-07, openshift#400).  This commit uses
the presence of an UpdateError as a marker for "fail fast; no need to
retry".

The install-time backoff is from fee2d06 (sync: Completely
parallelize the initial payload, 2019-03-11, openshift#136).  I'm not sure if
it really wants the same cap as reconcile and update modes, but I've
left them the same for now, and future commits to pivot the backoff
settings can focus on motivating those pivots.

I'd tried dropping:

  backoff := st.Backoff

and passing st.Backoff directly to ExponentialBackoffWithContext, but
it turns out that Step() [1]:

  ... mutates the provided Backoff to update its Steps and Duration.

Luckily, Backoff has no pointer properties, so storing as a local
variable is sufficient to give us a fresh copy for the local
mutations.

[1]: https://pkg.go.dev/k8s.io/apimachinery/pkg/util/wait#Backoff.Step
…ssing message

For example, in a recent test cluster:

  $ oc get -o json clusterversion version | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
  2021-05-26T20:18:35Z Available=True -: Done applying 4.8.0-0.ci-2021-05-26-172803
  2021-05-26T21:58:16Z Failing=True ClusterOperatorNotAvailable: Cluster operator machine-config is not available
  2021-05-26T21:42:31Z Progressing=False ClusterOperatorNotAvailable: Error while reconciling 4.8.0-0.ci-2021-05-26-172803: the cluster operator machine-config has not yet successfully rolled out
  2021-05-26T19:53:47Z RetrievedUpdates=False NoChannel: The update channel has not been configured.
  2021-05-26T21:26:31Z Upgradeable=False DegradedPool: Cluster operator machine-config cannot be upgraded between minor versions: One or more machine config pools are degraded, please see `oc get mcp` for further details and resolve before upgrading

But the "has not yet successfully rolled out" we used for Progressing
is much less direct than the "is not available" we used for Failing.
6c7cd99 (pkg/payload/task: Fail fast for UpdateError, 2021-05-24, openshift#573)
removed the last important abuse of ClusterOperatorNotAvailable (for
failures to client.Get the target ClusterOperator).  In this commit
I'm moving the remaining abuser over to a new
ClusterOperatorNoVersions.  So the only remaining
ClusterOperatorNotAvailable really means "not Available=True", and we
can say that more directly.
To make it easier to understand what each case is about.
@wking wking force-pushed the disambiguate-cluster-not-available-progressing-message branch from 086b39d to 2bb6f37 Compare October 29, 2021 01:03
@wking
Member Author

wking commented Oct 29, 2021

086b39d -> 2bb6f37 fixed the remaining unit-test failure and got rid of a magical timing hack in the process. Should be green now 🤞

@wking
Member Author

wking commented Oct 29, 2021

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/575/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1453890285628035072/build-log.txt | grep FAIL:
--- FAIL: TestIntegrationCVO_initializeAndHandleError (60.64s)

/retest

@openshift-ci
Contributor

openshift-ci bot commented Oct 30, 2021

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      Commit   Details  Required  Rerun command
ci/prow/e2e-agnostic           2bb6f37  link     true      /test e2e-agnostic
ci/prow/e2e-agnostic-operator  2bb6f37  link     true      /test e2e-agnostic-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Nov 29, 2021
@openshift-ci
Contributor

openshift-ci bot commented Nov 29, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Aug 5, 2022

Ended up landing via #577.

@wking wking deleted the disambiguate-cluster-not-available-progressing-message branch August 5, 2022 17:15