From dc662b7f1870c177264d9867436ddf876c241c78 Mon Sep 17 00:00:00 2001 From: "W. Trevor King" Date: Wed, 27 May 2020 16:59:57 -0700 Subject: [PATCH 1/2] pkg/cvo: Set NoDesiredImage reason when desired.Image is empty Because every Failing=True condition should have a reason. Also wordsmith the user-facing docs to replace "synchronize" with "reconcile", because our merge logic is more nuanced than the complete match "synchronize" implies for me. The ClusterOperatorNotAvailable special casing landed with convertErrorToProgressing in c2ac20fa17 (status: Report the operators that have not yet deployed, 2019-04-09, #158). --- docs/user/reconciliation.md | 8 ++++---- docs/user/status.md | 22 ++++++++++++++++++++++ pkg/cvo/cvo.go | 5 ++++- pkg/cvo/cvo_scenarios_test.go | 12 ++++++------ 4 files changed, 36 insertions(+), 11 deletions(-) diff --git a/docs/user/reconciliation.md b/docs/user/reconciliation.md index d5172b9eda..42116bab66 100644 --- a/docs/user/reconciliation.md +++ b/docs/user/reconciliation.md @@ -93,22 +93,22 @@ So the graph nodes are all parallelized with the by-number ordering flattened ou For the usual reconciliation loop (neither an upgrade between releases nor a fresh install), the flattened graph is also randomly permuted to avoid hanging on ordering bugs. -## Synchronizing the graph +## Reconciling the graph The cluster-version operator spawns worker goroutines that walk the graph, pushing manifests in their queue. -For each manifest in the node, the worker synchronizes the cluster with the manifest using a resource builder. +For each manifest in the node, the worker reconciles the cluster with the manifest using a resource builder. On error (or timeout), the worker abandons the manifest, graph node, and any dependencies of that graph node. On success, the worker proceeds to the next manifest in the graph node. ## Resource builders -Resource builders synchronize the cluster with a manifest from the release image. +Resource builders reconcile a cluster object with a manifest from the release image. The general approach is to generates a merged manifest combining critical spec properties from the release-image manifest with data from a preexisting in-cluster object, if any. If the merged manifest differs from the in-cluster object, the merged manifest is pushed back into the cluster. Some types have additional logic, as described in the following subsections. Note that this logic only applies to manifests included in the release image itself. -For example, only [ClusterOperator](../dev/clusteroperator.md) from the release image will have the blocking logic described [below](#clusteroperator); if an admin or secondary operator pushed a ClusterOperator object, it would not impact the cluster-version operator's graph synchronization. +For example, only [ClusterOperator](../dev/clusteroperator.md) from the release image will have the blocking logic described [below](#clusteroperator); if an admin or secondary operator pushed a ClusterOperator object, it would not impact the cluster-version operator's graph reconciliation. ### ClusterOperator diff --git a/docs/user/status.md b/docs/user/status.md index 62278f8a21..fb2a7cbc0b 100644 --- a/docs/user/status.md +++ b/docs/user/status.md @@ -3,6 +3,27 @@ [The ClusterVersion object](../dev/clusterversion.md) sets `conditions` describing the state of the cluster-version operator (CVO). This document describes those conditions and, where appropriate, suggests possible mitigations. +## Failing + +When `Failing` is True, the CVO is failing to reconcile the cluster with the desired release image. +In all cases, the impact on the cluster will be that dependent nodes in [the manifest graph](reconciliation.md#manifest-graph) may not be [reconciled](reconciliation.md#reconciling-the-graph). +Note that the graph [may be flattened](reconciliation.md#manifest-graph), in which case there are no dependent nodes. + +Most reconciliation errors will result in `Failing=True`, although [`ClusterOperatorNotAvailable`](#clusteroperatornotavailable) has special handling. + +### NoDesiredImage + +The CVO has not been given a release image to reconcile. + +If this happens it is a CVO coding error, because clearing [`desiredUpdate`][api-desired-update] should return you to the current CVO's release image. + +### ClusterOperatorNotAvailable + +`ClusterOperatorNotAvailable` (or the consolidated `ClusterOperatorsNotAvailable`) is set when the CVO fails to retrieve the ClusterOperator from the cluster or when the retrieved ClusterOperator does not satisfy [the reconciliation conditions](reconciliation.md#clusteroperator). + +Unlike most manifest-reconciliation failures, this error does not immediately result in `Failing=True`. +Under some conditions during installs and updates, the CVO will treat this condition as a `Progressing=True` condition and give the operator up to ten minutes to level before reporting `Failing=True`. + ## RetrievedUpdates When `RetrievedUpdates` is `True`, the CVO is succesfully retrieving updates, which is good. @@ -107,5 +128,6 @@ If this error occurs because you forced an update to a release that is not in an If this happens it is a CVO coding error. There is no mitigation short of updating to a new release image with a fixed CVO. +[api-desired-update]: https://github.com/openshift/api/blob/34f54f12813aaed8822bb5bc56e97cbbfa92171d/config/v1/types_cluster_version.go#L40-L54 [channels]: https://docs.openshift.com/container-platform/4.3/updating/updating-cluster-between-minor.html#understanding-upgrade-channels_updating-cluster-between-minor [Cincinnati]: https://github.com/openshift/cincinnati/blob/master/docs/design/openshift.md diff --git a/pkg/cvo/cvo.go b/pkg/cvo/cvo.go index c0f3a315e5..d5bc3d4ce0 100644 --- a/pkg/cvo/cvo.go +++ b/pkg/cvo/cvo.go @@ -472,7 +472,10 @@ func (optr *Operator) sync(key string) error { // handle the case of a misconfigured CVO by doing nothing if len(desired.Image) == 0 { return optr.syncStatus(original, config, &SyncWorkerStatus{ - Failure: fmt.Errorf("No configured operator version, unable to update cluster"), + Failure: &payload.UpdateError{ + Reason: "NoDesiredImage", + Message: "No configured operator version, unable to update cluster", + }, }, errs) } diff --git a/pkg/cvo/cvo_scenarios_test.go b/pkg/cvo/cvo_scenarios_test.go index ed2def9d68..8b6f8c1335 100644 --- a/pkg/cvo/cvo_scenarios_test.go +++ b/pkg/cvo/cvo_scenarios_test.go @@ -165,8 +165,8 @@ func TestCVO_StartupAndSync(t *testing.T) { Conditions: []configv1.ClusterOperatorStatusCondition{ {Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse}, // report back to the user that we don't have enough info to proceed - {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"}, - {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply : an error occurred"}, + {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"}, + {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply : an unknown error has occurred: NoDesiredImage"}, {Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse}, }, }, @@ -436,8 +436,8 @@ func TestCVO_StartupAndSyncUnverifiedPayload(t *testing.T) { Conditions: []configv1.ClusterOperatorStatusCondition{ {Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse}, // report back to the user that we don't have enough info to proceed - {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"}, - {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply : an error occurred"}, + {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"}, + {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply : an unknown error has occurred: NoDesiredImage"}, {Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse}, }, }, @@ -697,8 +697,8 @@ func TestCVO_StartupAndSyncPreconditionFailing(t *testing.T) { Conditions: []configv1.ClusterOperatorStatusCondition{ {Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse}, // report back to the user that we don't have enough info to proceed - {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"}, - {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply : an error occurred"}, + {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"}, + {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply : an unknown error has occurred: NoDesiredImage"}, {Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse}, }, }, From f00e20f5f71b5ad4b98b830ea4b3bb452fbc627a Mon Sep 17 00:00:00 2001 From: "W. Trevor King" Date: Fri, 31 Jul 2020 08:50:25 -0700 Subject: [PATCH 2/2] pkg/cvo/status: Raise Operator leveling grace-period to 20 minutes Reduce false-positives when operators take a while to level (like the machine-config operator, which has to roll the control plane machines). We may want to raise this further in the future, but baby steps ;). The previous 10-minute value is from c2ac20fa17 (status: Report the operators that have not yet deployed, 2019-04-09, #158), which doesn't make a case for that specific value. So the bump is unlikely to break anything unexpected. --- docs/user/status.md | 2 +- pkg/cvo/status.go | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/user/status.md b/docs/user/status.md index fb2a7cbc0b..26a1875542 100644 --- a/docs/user/status.md +++ b/docs/user/status.md @@ -22,7 +22,7 @@ If this happens it is a CVO coding error, because clearing [`desiredUpdate`][api `ClusterOperatorNotAvailable` (or the consolidated `ClusterOperatorsNotAvailable`) is set when the CVO fails to retrieve the ClusterOperator from the cluster or when the retrieved ClusterOperator does not satisfy [the reconciliation conditions](reconciliation.md#clusteroperator). Unlike most manifest-reconciliation failures, this error does not immediately result in `Failing=True`. -Under some conditions during installs and updates, the CVO will treat this condition as a `Progressing=True` condition and give the operator up to ten minutes to level before reporting `Failing=True`. +Under some conditions during installs and updates, the CVO will treat this condition as a `Progressing=True` condition and give the operator up to twenty minutes to level before reporting `Failing=True`. ## RetrievedUpdates diff --git a/pkg/cvo/status.go b/pkg/cvo/status.go index 65e5df406a..755f2fd55a 100644 --- a/pkg/cvo/status.go +++ b/pkg/cvo/status.go @@ -331,13 +331,13 @@ func (optr *Operator) syncStatus(original, config *configv1.ClusterVersion, stat // convertErrorToProgressing returns true if the provided status indicates a failure condition can be interpreted as // still making internal progress. The general error we try to suppress is an operator or operators still being -// unavailable AND the general payload task making progress towards its goal. An operator is given 10 minutes since +// unavailable AND the general payload task making progress towards its goal. An operator is given 20 minutes since // its last update to go ready, or an hour has elapsed since the update began, before the condition is ignored. func convertErrorToProgressing(history []configv1.UpdateHistory, now time.Time, status *SyncWorkerStatus) (reason string, message string, ok bool) { if len(history) == 0 || status.Failure == nil || status.Reconciling || status.LastProgress.IsZero() { return "", "", false } - if now.Sub(status.LastProgress) > 10*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour { + if now.Sub(status.LastProgress) > 20*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour { return "", "", false } uErr, ok := status.Failure.(*payload.UpdateError)