diff --git a/docs/user/reconciliation.md b/docs/user/reconciliation.md index d5172b9eda..42116bab66 100644 --- a/docs/user/reconciliation.md +++ b/docs/user/reconciliation.md @@ -93,22 +93,22 @@ So the graph nodes are all parallelized with the by-number ordering flattened ou For the usual reconciliation loop (neither an upgrade between releases nor a fresh install), the flattened graph is also randomly permuted to avoid hanging on ordering bugs. -## Synchronizing the graph +## Reconciling the graph The cluster-version operator spawns worker goroutines that walk the graph, pushing manifests in their queue. -For each manifest in the node, the worker synchronizes the cluster with the manifest using a resource builder. +For each manifest in the node, the worker reconciles the cluster with the manifest using a resource builder. On error (or timeout), the worker abandons the manifest, graph node, and any dependencies of that graph node. On success, the worker proceeds to the next manifest in the graph node. ## Resource builders -Resource builders synchronize the cluster with a manifest from the release image. +Resource builders reconcile a cluster object with a manifest from the release image. The general approach is to generates a merged manifest combining critical spec properties from the release-image manifest with data from a preexisting in-cluster object, if any. If the merged manifest differs from the in-cluster object, the merged manifest is pushed back into the cluster. Some types have additional logic, as described in the following subsections. Note that this logic only applies to manifests included in the release image itself. -For example, only [ClusterOperator](../dev/clusteroperator.md) from the release image will have the blocking logic described [below](#clusteroperator); if an admin or secondary operator pushed a ClusterOperator object, it would not impact the cluster-version operator's graph synchronization. +For example, only [ClusterOperator](../dev/clusteroperator.md) from the release image will have the blocking logic described [below](#clusteroperator); if an admin or secondary operator pushed a ClusterOperator object, it would not impact the cluster-version operator's graph reconciliation. ### ClusterOperator diff --git a/docs/user/status.md b/docs/user/status.md index 62278f8a21..26a1875542 100644 --- a/docs/user/status.md +++ b/docs/user/status.md @@ -3,6 +3,27 @@ [The ClusterVersion object](../dev/clusterversion.md) sets `conditions` describing the state of the cluster-version operator (CVO). This document describes those conditions and, where appropriate, suggests possible mitigations. +## Failing + +When `Failing` is True, the CVO is failing to reconcile the cluster with the desired release image. +In all cases, the impact on the cluster will be that dependent nodes in [the manifest graph](reconciliation.md#manifest-graph) may not be [reconciled](reconciliation.md#reconciling-the-graph). +Note that the graph [may be flattened](reconciliation.md#manifest-graph), in which case there are no dependent nodes. + +Most reconciliation errors will result in `Failing=True`, although [`ClusterOperatorNotAvailable`](#clusteroperatornotavailable) has special handling. + +### NoDesiredImage + +The CVO has not been given a release image to reconcile. + +If this happens it is a CVO coding error, because clearing [`desiredUpdate`][api-desired-update] should return you to the current CVO's release image. + +### ClusterOperatorNotAvailable + +`ClusterOperatorNotAvailable` (or the consolidated `ClusterOperatorsNotAvailable`) is set when the CVO fails to retrieve the ClusterOperator from the cluster or when the retrieved ClusterOperator does not satisfy [the reconciliation conditions](reconciliation.md#clusteroperator). + +Unlike most manifest-reconciliation failures, this error does not immediately result in `Failing=True`. +Under some conditions during installs and updates, the CVO will treat this condition as a `Progressing=True` condition and give the operator up to twenty minutes to level before reporting `Failing=True`. + ## RetrievedUpdates When `RetrievedUpdates` is `True`, the CVO is succesfully retrieving updates, which is good. @@ -107,5 +128,6 @@ If this error occurs because you forced an update to a release that is not in an If this happens it is a CVO coding error. There is no mitigation short of updating to a new release image with a fixed CVO. +[api-desired-update]: https://github.com/openshift/api/blob/34f54f12813aaed8822bb5bc56e97cbbfa92171d/config/v1/types_cluster_version.go#L40-L54 [channels]: https://docs.openshift.com/container-platform/4.3/updating/updating-cluster-between-minor.html#understanding-upgrade-channels_updating-cluster-between-minor [Cincinnati]: https://github.com/openshift/cincinnati/blob/master/docs/design/openshift.md diff --git a/pkg/cvo/cvo.go b/pkg/cvo/cvo.go index c0f3a315e5..d5bc3d4ce0 100644 --- a/pkg/cvo/cvo.go +++ b/pkg/cvo/cvo.go @@ -472,7 +472,10 @@ func (optr *Operator) sync(key string) error { // handle the case of a misconfigured CVO by doing nothing if len(desired.Image) == 0 { return optr.syncStatus(original, config, &SyncWorkerStatus{ - Failure: fmt.Errorf("No configured operator version, unable to update cluster"), + Failure: &payload.UpdateError{ + Reason: "NoDesiredImage", + Message: "No configured operator version, unable to update cluster", + }, }, errs) } diff --git a/pkg/cvo/cvo_scenarios_test.go b/pkg/cvo/cvo_scenarios_test.go index ed2def9d68..8b6f8c1335 100644 --- a/pkg/cvo/cvo_scenarios_test.go +++ b/pkg/cvo/cvo_scenarios_test.go @@ -165,8 +165,8 @@ func TestCVO_StartupAndSync(t *testing.T) { Conditions: []configv1.ClusterOperatorStatusCondition{ {Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse}, // report back to the user that we don't have enough info to proceed - {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"}, - {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply : an error occurred"}, + {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"}, + {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply : an unknown error has occurred: NoDesiredImage"}, {Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse}, }, }, @@ -436,8 +436,8 @@ func TestCVO_StartupAndSyncUnverifiedPayload(t *testing.T) { Conditions: []configv1.ClusterOperatorStatusCondition{ {Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse}, // report back to the user that we don't have enough info to proceed - {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"}, - {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply : an error occurred"}, + {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"}, + {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply : an unknown error has occurred: NoDesiredImage"}, {Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse}, }, }, @@ -697,8 +697,8 @@ func TestCVO_StartupAndSyncPreconditionFailing(t *testing.T) { Conditions: []configv1.ClusterOperatorStatusCondition{ {Type: configv1.OperatorAvailable, Status: configv1.ConditionFalse}, // report back to the user that we don't have enough info to proceed - {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Message: "No configured operator version, unable to update cluster"}, - {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: "Unable to apply : an error occurred"}, + {Type: ClusterStatusFailing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "No configured operator version, unable to update cluster"}, + {Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Reason: "NoDesiredImage", Message: "Unable to apply : an unknown error has occurred: NoDesiredImage"}, {Type: configv1.RetrievedUpdates, Status: configv1.ConditionFalse}, }, }, diff --git a/pkg/cvo/status.go b/pkg/cvo/status.go index 65e5df406a..755f2fd55a 100644 --- a/pkg/cvo/status.go +++ b/pkg/cvo/status.go @@ -331,13 +331,13 @@ func (optr *Operator) syncStatus(original, config *configv1.ClusterVersion, stat // convertErrorToProgressing returns true if the provided status indicates a failure condition can be interpreted as // still making internal progress. The general error we try to suppress is an operator or operators still being -// unavailable AND the general payload task making progress towards its goal. An operator is given 10 minutes since +// unavailable AND the general payload task making progress towards its goal. An operator is given 20 minutes since // its last update to go ready, or an hour has elapsed since the update began, before the condition is ignored. func convertErrorToProgressing(history []configv1.UpdateHistory, now time.Time, status *SyncWorkerStatus) (reason string, message string, ok bool) { if len(history) == 0 || status.Failure == nil || status.Reconciling || status.LastProgress.IsZero() { return "", "", false } - if now.Sub(status.LastProgress) > 10*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour { + if now.Sub(status.LastProgress) > 20*time.Minute || now.Sub(history[0].StartedTime.Time) > time.Hour { return "", "", false } uErr, ok := status.Failure.(*payload.UpdateError)