
Conversation

@deads2k
Contributor

@deads2k deads2k commented Feb 11, 2019

We cannot gate upgrading on available, failing, and progressing because if any of those conditions are "wrong", we may need to change the version in order to get them working. This introduces a new condition that indicates whether or not the operator can accept a payload change.

@bparees
/assign @smarterclayton

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 11, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2019

// OperatorUpgradeable indicates whether or not the operator is in a state where it can accept a new payload. When
// set to `False` the CVO will disallow moving to a new version.
OperatorUpgradeable ClusterStatusConditionType = "Upgradeable"
Contributor

Since we've already established that it's not necessarily an "upgrade", and also because I still think there might be some value in using this to determine the general readiness of an operator, I'd propose renaming it to Ready.

Contributor Author

@deads2k deads2k Feb 11, 2019

> Since we've already established that it's not necessarily an "upgrade", and also because I still think there might be some value in using this to determine the general readiness of an operator, I'd propose renaming it to Ready.

You cannot have the CVO base its decision about whether or not to allow a version change on the operator status, for the same reason you cannot allow it to make a decision based on operand status. If you did, you would be unable to fix a broken operator with a version change. You're describing a different condition.

The Upgradeable condition will block an upgrade, and it should only be set when we cannot safely allow a version change to proceed.

Contributor

> You cannot have the CVO base its decision about whether or not to allow a version change on the operator status, for the same reason you cannot allow it to make a decision based on operand status.

Who's setting this "upgradeable" condition if not the operator?

Contributor Author

> You cannot have the CVO base its decision about whether or not to allow a version change on the operator status, for the same reason you cannot allow it to make a decision based on operand status.

> Who's setting this "upgradeable" condition if not the operator?

Assuming you intend for Ready to mean your operator is functional: if an operator is Ready=False, then we should definitely allow upgrading that operator, because otherwise it can be stuck broken. If an operator is Ready=True, then we should definitely allow upgrading that operator, because it is functioning correctly. It's a condition unrelated to whether or not the CVO should allow an upgrade.

Contributor

How/when do you define that an operator is not ready (Ready=False) but upgradeable (Upgradeable=True)? What's the algorithm/conditions behind that decision? Is it human-made?

Contributor Author

> How/when do you define that an operator is not ready (Ready=False) but upgradeable (Upgradeable=True)? What's the algorithm/conditions behind that decision? Is it human-made?

I think we're looking at different conditions, and I'd like to have this pull focus on Upgradeable. It is an operator-made choice.

@derekwaynecarr
Member

derekwaynecarr commented Feb 11, 2019 via email

@deads2k
Contributor Author

deads2k commented Feb 11, 2019

> do you have a scenario in mind where you would set this False with your current operator?

Yes. When the unsupportedConfigOverrides have values in them.
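
As a rough sketch of that scenario (this is not the actual kube-apiserver operator code; the function name and the []byte plumbing are assumptions), the operator-side check might look like:

package upgradeablecheck

import (
	configv1 "github.com/openshift/api/config/v1"
	configv1helpers "github.com/openshift/library-go/pkg/config/clusteroperator/v1helpers"
)

// setUpgradeable reports Upgradeable=False while unsupportedConfigOverrides
// carries any content, and Upgradeable=True otherwise. The reason and message
// strings here are illustrative.
func setUpgradeable(status *configv1.ClusterOperatorStatus, unsupportedConfigOverrides []byte) {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:   configv1.OperatorUpgradeable,
		Status: configv1.ConditionTrue,
	}
	if len(unsupportedConfigOverrides) > 0 {
		cond.Status = configv1.ConditionFalse
		cond.Reason = "UnsupportedConfigOverridesSet"
		cond.Message = "unsupportedConfigOverrides are set; remove them before changing versions"
	}
	configv1helpers.SetStatusCondition(&status.Conditions, cond)
}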

Contributor

@soltysh soltysh left a comment

The idea here is reasonable, IMO; the current discussions are about when and how, which is fine but should not block the PR itself.
lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

// available, but the user intent cannot be fulfilled.
OperatorFailing ClusterStatusConditionType = "Failing"

// OperatorUpgradeable indicates whether or not the operator is in a state where it can accept a new payload. When
Contributor

Don't use "payload" (that word is being scrubbed), and don't use "CVO". Use this description instead:

Upgradeable indicates whether the operator is in a state that is safe to upgrade. When status is False, administrators should not upgrade their cluster, and the message field should contain a human-readable description of what the administrator should do to allow the operator to successfully update.
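
Applied to the constant under review, that suggestion would make the godoc read roughly:

// OperatorUpgradeable indicates whether the operator is in a state that is safe to upgrade. When status is
// False, administrators should not upgrade their cluster, and the message field should contain a human-readable
// description of what the administrator should do to allow the operator to successfully update.
OperatorUpgradeable ClusterStatusConditionType = "Upgradeable"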

@smarterclayton
Contributor

Change the godoc as per my comment.

@smarterclayton
Contributor

smarterclayton commented Feb 13, 2019

The current use cases described for this are:

  1. kube-apiserver wants to prevent an update to a new payload (micro) because the user has set unsupported changes
  2. CLI and UI want to display information about failing or non-upgradeable ClusterOperators

There are a few future use cases we need to describe:

  1. The MCO needs to block 4.Y updates of the kube-apiserver when kubelets are older than a specific version
  2. We want to prevent an upgrade based on a specific API condition

@abhinavdahiya and I had a long chat; I think what we allow right now is (see the sketch after this list):

  1. David can set Upgradeable=False for unsupported version config
  2. The CVO will write Upgradeable=False to the ClusterVersion if any operator is failing, or if any operator has Upgradeable=False and a message
  3. CLI and UI will show Upgradeable=False as an informational message
  4. The CVO will not halt on Upgradeable=False yet
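
A rough sketch of item 2, reusing the configv1 types quoted elsewhere in this thread (the function and reason names are illustrative, not the actual CVO code):

// clusterUpgradeable derives a ClusterVersion-level Upgradeable=False condition
// when any ClusterOperator is failing, or reports Upgradeable=False with a
// message. Returning nil means there is nothing to roll up.
func clusterUpgradeable(operators []configv1.ClusterOperator) *configv1.ClusterOperatorStatusCondition {
	for _, co := range operators {
		for _, c := range co.Status.Conditions {
			failing := c.Type == configv1.OperatorFailing && c.Status == configv1.ConditionTrue
			notUpgradeable := c.Type == configv1.OperatorUpgradeable && c.Status == configv1.ConditionFalse && c.Message != ""
			if failing || notUpgradeable {
				return &configv1.ClusterOperatorStatusCondition{
					Type:    configv1.OperatorUpgradeable,
					Status:  configv1.ConditionFalse,
					Reason:  "ClusterOperatorNotUpgradeable",
					Message: co.Name + ": " + c.Message,
				}
			}
		}
	}
	return nil // per item 4, informational only for now; the CVO does not halt on it yet
}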

Once we have that much implemented, we'll go back and assess.

A future CVO might have "precondition jobs" that we use to assess whether a version could be started (without actually updating anything). This would capture most of the "quick and dirty" checks, assuming we don't have many of them.

In the short term, the CVO will do the following (see the sketch after this list):

  1. Versions take priority - we will never go past a ClusterOperator whose status versions don't match
  2. If versions match, Available=False and Failing=True will both block the CVO from going past
  3. If no versions are set, in the short term, Progressing=True will also block the CVO (but when versions are set, Progressing=True is informational)
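
Condensed into a sketch (versionsMatch is an assumed helper; this is not the actual CVO implementation):

// blocksUpdate condenses the three short-term rules above for a single
// ClusterOperator. versionsMatch is an assumed helper comparing the operator's
// reported versions against the target payload.
func blocksUpdate(co configv1.ClusterOperator, target string) bool {
	if len(co.Status.Versions) > 0 && !versionsMatch(co, target) {
		return true // rule 1: versions take priority
	}
	for _, c := range co.Status.Conditions {
		if c.Type == configv1.OperatorAvailable && c.Status == configv1.ConditionFalse {
			return true // rule 2: Available=False blocks
		}
		if c.Type == configv1.OperatorFailing && c.Status == configv1.ConditionTrue {
			return true // rule 2: Failing=True blocks
		}
		if len(co.Status.Versions) == 0 && c.Type == configv1.OperatorProgressing && c.Status == configv1.ConditionTrue {
			return true // rule 3: Progressing=True blocks only while no versions are reported
		}
	}
	return false
}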

Operators should be Available=True if they satisfy the "reason they are part of the payload" - i.e., if they expose an API, they are Available=True once the API is deployed. If they have a single operand that should always be running, they are Available=True. If they have an optional (as defined by the business use case) operand, they are available "if they have either created the operand or are waiting on a human". If they have many operands, they should become Available when "the default operands as defined by a human" are created.
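
Read as code, the rule for an operator with a single optional operand might look like the following; operandExists, operandHealthy, and waitingOnHuman are hypothetical stand-ins for operator-specific checks:

// availableStatus sketches the Available rule above for one optional operand.
// The "waiting on a human" judgment is the operator author's to make.
func availableStatus() configv1.ConditionStatus {
	switch {
	case operandExists() && operandHealthy():
		return configv1.ConditionTrue // operand created and working
	case !operandExists() && waitingOnHuman():
		return configv1.ConditionTrue // e.g. an optional operand awaiting configuration
	default:
		return configv1.ConditionFalse
	}
}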

@bparees
Contributor

bparees commented Feb 13, 2019

> Operators should be Available=True if they satisfy the "reason they are part of the payload" - i.e., if they expose an API, they are Available=True once the API is deployed. If they have a single operand that should always be running, they are Available=True. If they have an optional (as defined by the business use case) operand, they are available "if they have either created the operand or are waiting on a human". If they have many operands, they should become Available when "the default operands as defined by a human" are created.

So applying this statement to the two devex use cases:

  1. the registry operand should be considered "optional" (your cluster can work without it); thus, on platforms where the registry cannot deploy (no storage), the registry operator would still report Available=True/Progressing=False/Failing=False (assuming the operator itself is otherwise happy). This makes the ClusterOperator conditions largely useless for an admin checking the health of the registry, but ensures the registry will not block install or upgrade.

  2. the samples operand should also be considered "optional", meaning, again, the samples operator should just report Available=True/Progressing=False/Failing=False even if it has not created any samples yet.

For both of these, we will want to continue to have actually useful conditions on the operator-specific resource that indicate the state of the operand and why. Unfortunately, this means there's no consistent way to check that all operators and their operands are healthy/stable; or at least, the ClusterOperator resource is not that mechanism.

I think this largely goes against the current ClusterOperator API documentation (so that should be fixed), and from the perspective of an operator author trying to report useful status about the state of my operand, I think it's a step backwards. But if we are saying the CVO isn't going to switch to a different mechanism for determining when an operator is sufficiently stable such that the install/upgrade can continue, I guess that's what we have.

/cc @openshift/sig-developer-experience

@smarterclayton
Contributor

> 1. This makes the ClusterOperator conditions largely useless for an admin checking the health of the registry, but ensures the registry will not block install or upgrade.

Why? If you don't have an installed registry, your health is "good".

> 2. the samples operand should also be considered "optional", meaning, again, the samples operator should just report Available=True/Progressing=False/Failing=False even if it has not created any samples yet.

Why? The specific condition of not having a pull secret for the registry is something the operator should report, but it requires human intervention (today). When we ship, it will again be an error.

@bparees
Contributor

bparees commented Feb 14, 2019

> Why? If you don't have an installed registry, your health is "good".

And when I do have an installed registry but the storage isn't configured/available, and the operator still reports Available=True/Progressing=False so that it doesn't block the install? I wouldn't consider that a registry that's in good health.

> Why? The specific condition of not having a pull secret for the registry is something the operator should report, but it requires human intervention (today). When we ship, it will again be an error.

Because importing the samples takes time (thus making the install take longer), and having the samples available isn't critical to having a working cluster. We're being asked why we're delaying the install.

@smarterclayton
Contributor

> And when I do have an installed registry but the storage isn't configured/available, and the operator still reports Available=True/Progressing=False so that it doesn't block the install?

There is no difference between a registry without configured storage and NetworkOperator Type=None - if you are waiting for a human, you are available.

@bparees
Contributor

bparees commented Feb 14, 2019

> There is no difference between a registry without configured storage and NetworkOperator Type=None - if you are waiting for a human, you are available.

That seems like a slippery slope. When no registry pods can be scheduled because of node issues that require a human to go add more capacity, am I available/healthy? When the storage is configured to a PVC but the autoprovisioner didn't generate a PV that's claimable, am I healthy because the admin just needs to go create a PV for me?

If anything, it seems particularly weird to report Available=True when you know the admin needs to take some action to actually make the thing available.

From an admin/end-user perspective, it doesn't matter why the registry isn't running; I need to know that it's not (and then I will care why, and see if I need to do something about it). Clearly the ClusterOperator Available=True condition isn't the thing that tells the admin that.

@smarterclayton
Contributor

> When no registry pods can be scheduled because of node issues that require a human to go add more capacity, am I available/healthy?

No, you aren't.

> When the storage is configured to a PVC but the autoprovisioner didn't generate a PV that's claimable, am I healthy because the admin just needs to go create a PV for me?

No, you aren't.

Those both seemed pretty easy?

@bparees
Contributor

bparees commented Feb 14, 2019

> Those both seemed pretty easy?

I don't think they look that different to an admin. The difference between "you didn't configure storage at all" and "you configured storage but it's not available, maybe you did it wrong" is pretty subtle, yet you're saying in one case the registry is "available" and in the other it's not. Not configuring storage is basically a special case of "you configured storage that isn't usable".

And your criterion above was "if you're waiting for a human, you're available". Both of the scenarios I presented are cases where you're also waiting for a human. So I don't see it as an easy distinction; more importantly, I don't see it as a useful distinction for an administrator who just cares whether the thing is actually push/pullable or not (or upgradeable or not).

Which gets to the larger point. There are two concerns:

  1. reporting availability of the operand (clearly Available=True isn't for that)

  2. determining when an upgrade is done/can be started: the operator has to decide, given the state of itself and its operand, whether it is OK with an upgrade proceeding. For the registry operator, I probably don't care why there are no registry pods running. Either I'm going to allow you to upgrade when there are no registry pods, or I'm never going to allow you to upgrade when there are no registry pods. But I'm not going to say "when you haven't configured storage, you can upgrade, but when you've configured invalid storage, I'm not going to let you upgrade", because it makes no difference in what happens during the upgrade. I might want to block an upgrade when you've configured valid storage but no registry pods are running, because I want to ensure storage migration happens correctly (then again, I might not, if I know it doesn't matter, since I think we've only had one actual storage-format migration in the history of the registry).

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 14, 2019
@deads2k
Contributor Author

deads2k commented Feb 14, 2019

It looks like most of the action in this pull is about the meaning of Available, not the meaning of Upgradeable.

@smarterclayton doc updated

@deads2k deads2k added lgtm Indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 18, 2019
@deads2k
Contributor Author

deads2k commented Feb 18, 2019

The doc may need clarification to indicate that only False does something.

@openshift-ci-robot

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 18, 2019
@deads2k deads2k added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 18, 2019
@openshift-merge-robot openshift-merge-robot merged commit af79bef into openshift:master Feb 18, 2019
@benjaminapetersen
Contributor

I'm late to the game, but I'll add a little more detail about what makes this confusing. I'm working on the console-operator and deciding how to set the statuses Available, Progressing, and Failing when ManagementState: Removed.

  • If removed, I am deleting all of my resources, including the deployment for my operand (console).
  • If my operand does not exist, logically Available:False is the correct report (to a human). It's not possible to use it; it doesn't exist.
  • Progressing:false and Failing:false are not hard to reason about here. While I am deleting resources, Progressing:true. So long as I succeed, Failing:false.

It seems very confusing that I should actually be reporting Available:true on my operand, as it's supposed to be deleted by user intent, even though it is clearly unavailable.

@benjaminapetersen
Contributor

Gonna add one more.

If I set my operator to ManagementState: Unmanaged, I want to set Available: operatorsv1.ConditionUnknown, because the operator doesn't know what a user might be doing. At present, my understanding is that I am still to set it to Available: true, because the operator is honoring the user's desired state. I'm not sure of the intention of ConditionUnknown at this point.

@benjaminapetersen
Contributor

However, this: https://github.com/openshift/library-go/blob/master/pkg/operator/status/status_controller.go#L116

// if we are unmanaged, force everything to Unknown.
if detailedSpec.ManagementState == operatorv1.Unmanaged {
    clusterOperatorObj.Status = configv1.ClusterOperatorStatus{}
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorAvailable, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorProgressing, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorFailing, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorUpgradeable, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
}

makes me question what I previously stated.

wking added a commit to wking/oc that referenced this pull request Nov 6, 2024
…onal update risk

We've had Upgradeable since 2019 [1], but it is a confusing condition,
because the "between minor versions" wording [2] in the message isn't
obvious to users who have not yet internalized SemVer's
MAJOR.MINOR.PATCH terminology [3].  Conditional update risks landed in
2021 [4] and give us a clear API for declaring exactly which update
targets have the exposure.  This commit adds client-side code to the
tech-preview 'recommend' subcommand, so folks can get a feel for that
user experience.  If it goes over well, we can shift the logic to the
cluster-version operator so all clients can benefit.  The
alreadyHasUpgradeableRisk check ensures we don't double up if the CVO
eventually picks up this pivot.

[1]: openshift/api#206
[2]: https://github.com/openshift/cluster-version-operator/blob/c4b8362d8acd08d63a600b4d53c33e8737ed7a53/pkg/cvo/upgradeable.go#L218-L228
[3]: https://semver.org/spec/v2.0.0.html#summary
[4]: openshift/api#1011
wking added a commit to wking/oc that referenced this pull request Nov 8, 2024
…onal update risk