
Conversation

@deads2k
Contributor

@deads2k deads2k commented Feb 11, 2019

We cannot gate upgrading on available, failing, and progressing because if any of those conditions are "wrong", we may need to change the version in order to get them working. This introduces a new condition that indicates whether or not the operator can accept a payload change.

@bparees
/assign @smarterclayton

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 11, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2019

// OperatorUpgradeable indicates whether or not the operator is in a state where it can accept a new payload. When
// set to `False` the CVO will disallow moving to a new version.
OperatorUpgradeable ClusterStatusConditionType = "Upgradeable"
Contributor

Since we've already established that it's not necessarily an "upgrade", and also because I still think there might be some value in using this to determine the general readiness of an operator, I'd propose renaming it to Ready.

Contributor Author

@deads2k deads2k Feb 11, 2019

> Since we've already established that it's not necessarily an "upgrade", and also because I still think there might be some value in using this to determine the general readiness of an operator, I'd propose renaming it to Ready.

You cannot have the CVO base its decision about whether or not to allow a version change on the operator status, for the same reason you cannot allow it to make a decision based on operand status. If you did, you would be unable to fix a broken operator with a version change. You're describing a different condition.

The Upgradeable condition will block an upgrade, and it should only be set when we cannot safely allow a version change to proceed.

Contributor

> You cannot have the CVO base its decision about whether or not to allow a version change on the operator status, for the same reason you cannot allow it to make a decision based on operand status.

Who's setting this "upgradeable" condition if not the operator?

Contributor Author

> You cannot have the CVO base its decision about whether or not to allow a version change on the operator status, for the same reason you cannot allow it to make a decision based on operand status.

> Who's setting this "upgradeable" condition if not the operator?

Assuming you intend for Ready to mean your operator is functional: if an operator is Ready=False, then we should definitely allow upgrading that operator, because otherwise it can be stuck broken. If an operator is Ready=True, then we should definitely allow upgrading that operator, because it is functioning correctly. It's a condition unrelated to whether or not the CVO should allow an upgrade.

Contributor

How/when do you define that an operator is not ready (Ready=False) but upgradeable (Upgradeable=True)? What's the algorithm/conditions behind that decision? Is it human-made?

Contributor Author

> How/when do you define that an operator is not ready (Ready=False) but upgradeable (Upgradeable=True)? What's the algorithm/conditions behind that decision? Is it human-made?

I think we're looking at different conditions, and I'd like to have this pull focus on Upgradeable. It is an operator-made choice.

@derekwaynecarr
Member

derekwaynecarr commented Feb 11, 2019 via email

@deads2k
Contributor Author

deads2k commented Feb 11, 2019

> do you have a scenario in mind where you would set this False with your current operator?

Yes. When the unsupportedConfigOverrides have values in them.
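
As a rough sketch of that scenario (this is not the actual kube-apiserver operator code; the function name and the []byte plumbing are assumptions), the operator-side check might look like:

package upgradeablecheck

import (
	configv1 "github.com/openshift/api/config/v1"
	configv1helpers "github.com/openshift/library-go/pkg/config/clusteroperator/v1helpers"
)

// setUpgradeable reports Upgradeable=False while unsupportedConfigOverrides
// carries any content, and Upgradeable=True otherwise. The reason and message
// strings here are illustrative.
func setUpgradeable(status *configv1.ClusterOperatorStatus, unsupportedConfigOverrides []byte) {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:   configv1.OperatorUpgradeable,
		Status: configv1.ConditionTrue,
	}
	if len(unsupportedConfigOverrides) > 0 {
		cond.Status = configv1.ConditionFalse
		cond.Reason = "UnsupportedConfigOverridesSet"
		cond.Message = "unsupportedConfigOverrides are set; remove them before changing versions"
	}
	configv1helpers.SetStatusCondition(&status.Conditions, cond)
}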

Contributor

@soltysh soltysh left a comment

The idea here is reasonable, IMO; the current discussions are about when and how, which is fine but should not block the PR itself.
lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

// available, but the user intent cannot be fulfilled.
OperatorFailing ClusterStatusConditionType = "Failing"

// OperatorUpgradeable indicates whether or not the operator is in a state where it can accept a new payload. When
Contributor

Don't use "payload" (that word is being scrubbed), and don't use "CVO". Use this description instead:

Upgradeable indicates whether the operator is in a state that is safe to upgrade. When status is False, administrators should not upgrade their cluster, and the message field should contain a human-readable description of what the administrator should do to allow the operator to successfully update.
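
Applied to the constant under review, that suggestion would make the godoc read roughly:

// OperatorUpgradeable indicates whether the operator is in a state that is safe to upgrade. When status is
// False, administrators should not upgrade their cluster, and the message field should contain a human-readable
// description of what the administrator should do to allow the operator to successfully update.
OperatorUpgradeable ClusterStatusConditionType = "Upgradeable"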

@smarterclayton
Contributor

Change the godoc as per my comment.

@smarterclayton
Contributor

smarterclayton commented Feb 13, 2019

The current use cases described for this are:

  1. kube-apiserver wants to prevent an update to a new payload (micro) because the user has set unsupported changes
  2. CLI and UI want to display information about failing or non-upgradeable ClusterOperators

There are a few future use cases we need to describe:

  1. The MCO needs to block 4.Y updates of the kube-apiserver when kubelets are older than a specific version
  2. We want to prevent an upgrade based on a specific API condition

@abhinavdahiya and I had a long chat; I think what we allow right now is (see the sketch after this list):

  1. David can set Upgradeable=False for unsupported version config
  2. The CVO will write Upgradeable=False to the ClusterVersion if any operator is failing, or if any operator has Upgradeable=False and a message
  3. CLI and UI will show Upgradeable=False as an informational message
  4. The CVO will not halt on Upgradeable=False yet
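
A rough sketch of item 2, reusing the configv1 types quoted elsewhere in this thread (the function and reason names are illustrative, not the actual CVO code):

// clusterUpgradeable derives a ClusterVersion-level Upgradeable=False condition
// when any ClusterOperator is failing, or reports Upgradeable=False with a
// message. Returning nil means there is nothing to roll up.
func clusterUpgradeable(operators []configv1.ClusterOperator) *configv1.ClusterOperatorStatusCondition {
	for _, co := range operators {
		for _, c := range co.Status.Conditions {
			failing := c.Type == configv1.OperatorFailing && c.Status == configv1.ConditionTrue
			notUpgradeable := c.Type == configv1.OperatorUpgradeable && c.Status == configv1.ConditionFalse && c.Message != ""
			if failing || notUpgradeable {
				return &configv1.ClusterOperatorStatusCondition{
					Type:    configv1.OperatorUpgradeable,
					Status:  configv1.ConditionFalse,
					Reason:  "ClusterOperatorNotUpgradeable",
					Message: co.Name + ": " + c.Message,
				}
			}
		}
	}
	return nil // per item 4, informational only for now; the CVO does not halt on it yet
}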

Once we have that much implemented, we'll go back and assess.

A future CVO might have "precondition jobs" that we use to assess whether a version could be started (without actually updating anything). This would capture most of the "quick and dirty" checks, assuming we don't have many of them.

In the short term, the CVO will do the following (see the sketch after this list):

  1. Versions take priority - we will never go past a ClusterOperator whose status versions don't match
  2. If versions match, Available=False and Failing=True will both block the CVO from going past
  3. If no versions are set, in the short term, Progressing=True will also block the CVO (but when versions are set, Progressing=True is informational)
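
Condensed into a sketch (versionsMatch is an assumed helper; this is not the actual CVO implementation):

// blocksUpdate condenses the three short-term rules above for a single
// ClusterOperator. versionsMatch is an assumed helper comparing the operator's
// reported versions against the target payload.
func blocksUpdate(co configv1.ClusterOperator, target string) bool {
	if len(co.Status.Versions) > 0 && !versionsMatch(co, target) {
		return true // rule 1: versions take priority
	}
	for _, c := range co.Status.Conditions {
		if c.Type == configv1.OperatorAvailable && c.Status == configv1.ConditionFalse {
			return true // rule 2: Available=False blocks
		}
		if c.Type == configv1.OperatorFailing && c.Status == configv1.ConditionTrue {
			return true // rule 2: Failing=True blocks
		}
		if len(co.Status.Versions) == 0 && c.Type == configv1.OperatorProgressing && c.Status == configv1.ConditionTrue {
			return true // rule 3: Progressing=True blocks only while no versions are reported
		}
	}
	return false
}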

Operators should be Available=True if they satisfy the "reason they are part of the payload" - i.e., if they expose an API, they are Available=True once the API is deployed. If they have a single operand that should always be running, they are Available=True. If they have an optional (as defined by the business use case) operand, they are available "if they have either created the operand or are waiting on a human". If they have many operands, they should become Available when "the default operands as defined by a human" are created.
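
Read as code, the rule for an operator with a single optional operand might look like the following; operandExists, operandHealthy, and waitingOnHuman are hypothetical stand-ins for operator-specific checks:

// availableStatus sketches the Available rule above for one optional operand.
// The "waiting on a human" judgment is the operator author's to make.
func availableStatus() configv1.ConditionStatus {
	switch {
	case operandExists() && operandHealthy():
		return configv1.ConditionTrue // operand created and working
	case !operandExists() && waitingOnHuman():
		return configv1.ConditionTrue // e.g. an optional operand awaiting configuration
	default:
		return configv1.ConditionFalse
	}
}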

@bparees
Contributor

bparees commented Feb 13, 2019

> Operators should be Available=True if they satisfy the "reason they are part of the payload" - i.e., if they expose an API, they are Available=True once the API is deployed. If they have a single operand that should always be running, they are Available=True. If they have an optional (as defined by the business use case) operand, they are available "if they have either created the operand or are waiting on a human". If they have many operands, they should become Available when "the default operands as defined by a human" are created.

So applying this statement to the two devex use cases:

  1. the registry operand should be considered "optional" (your cluster can work without it); thus, on platforms where the registry cannot deploy (no storage), the registry operator would still report Available=True/Progressing=False/Failing=False (assuming the operator itself is otherwise happy). This makes the ClusterOperator conditions largely useless for an admin checking the health of the registry, but ensures the registry will not block install or upgrade.

  2. the samples operand should also be considered "optional", meaning, again, the samples operator should just report Available=True/Progressing=False/Failing=False even if it has not created any samples yet.

For both of these, we will want to continue to have actually useful conditions on the operator-specific resource that indicate the state of the operand and why. Unfortunately, this means there's no consistent way to check that all operators and their operands are healthy/stable; or at least, the ClusterOperator resource is not that mechanism.

I think this largely goes against the current ClusterOperator API documentation (so that should be fixed), and from the perspective of an operator author trying to report useful status about the state of my operand, I think it's a step backwards. But if we are saying the CVO isn't going to switch to a different mechanism for determining when an operator is sufficiently stable such that the install/upgrade can continue, I guess that's what we have.

/cc @openshift/sig-developer-experience

@smarterclayton
Contributor

> 1. This makes the ClusterOperator conditions largely useless for an admin checking the health of the registry, but ensures the registry will not block install or upgrade.

Why? If you don't have an installed registry, your health is "good".

> 2. the samples operand should also be considered "optional", meaning, again, the samples operator should just report Available=True/Progressing=False/Failing=False even if it has not created any samples yet.

Why? The specific condition of not having a pull secret for the registry is something the operator should report, but it requires human intervention (today). When we ship, it will again be an error.

@bparees
Contributor

bparees commented Feb 14, 2019

> Why? If you don't have an installed registry, your health is "good".

And when I do have an installed registry but the storage isn't configured/available, and the operator still reports Available=True/Progressing=False so that it doesn't block the install? I wouldn't consider that a registry that's in good health.

> Why? The specific condition of not having a pull secret for the registry is something the operator should report, but it requires human intervention (today). When we ship, it will again be an error.

Because importing the samples takes time (thus making the install take longer), and having the samples available isn't critical to having a working cluster. We're being asked why we're delaying the install.

@smarterclayton
Contributor

> And when I do have an installed registry but the storage isn't configured/available, and the operator still reports Available=True/Progressing=False so that it doesn't block the install?

There is no difference between a registry without configured storage and NetworkOperator Type=None - if you are waiting for a human, you are available.

@bparees
Contributor

bparees commented Feb 14, 2019

> There is no difference between a registry without configured storage and NetworkOperator Type=None - if you are waiting for a human, you are available.

That seems like a slippery slope. When no registry pods can be scheduled because of node issues that require a human to go add more capacity, am I available/healthy? When the storage is configured to a PVC but the autoprovisioner didn't generate a PV that's claimable, am I healthy because the admin just needs to go create a PV for me?

If anything, it seems particularly weird to report Available=True when you know the admin needs to take some action to actually make the thing available.

From an admin/end-user perspective, it doesn't matter why the registry isn't running; I need to know that it's not (and then I will care why, and see if I need to do something about it). Clearly the ClusterOperator Available=True condition isn't the thing that tells the admin that.

@smarterclayton
Contributor

> When no registry pods can be scheduled because of node issues that require a human to go add more capacity, am I available/healthy?

No, you aren't.

> When the storage is configured to a PVC but the autoprovisioner didn't generate a PV that's claimable, am I healthy because the admin just needs to go create a PV for me?

No, you aren't.

Those both seemed pretty easy?

@bparees
Contributor

bparees commented Feb 14, 2019

> Those both seemed pretty easy?

I don't think they look that different to an admin. The difference between "you didn't configure storage at all" and "you configured storage but it's not available, maybe you did it wrong" is pretty subtle, yet you're saying in one case the registry is "available" and in the other it's not. Not configuring storage is basically a special case of "you configured storage that isn't usable".

And your criterion above was "if you're waiting for a human, you're available". Both of the scenarios I presented are cases where you're also waiting for a human. So I don't see it as an easy distinction; more importantly, I don't see it as a useful distinction for an administrator who just cares whether the thing is actually push/pullable or not (or upgradeable or not).

Which gets to the larger point. There are two concerns:

  1. reporting availability of the operand (clearly Available=True isn't for that)

  2. determining when an upgrade is done/can be started: the operator has to decide, given the state of itself and its operand, whether it is OK with an upgrade proceeding. For the registry operator, I probably don't care why there are no registry pods running. Either I'm going to allow you to upgrade when there are no registry pods, or I'm never going to allow you to upgrade when there are no registry pods. But I'm not going to say "when you haven't configured storage, you can upgrade, but when you've configured invalid storage, I'm not going to let you upgrade", because it makes no difference in what happens during the upgrade. I might want to block an upgrade when you've configured valid storage but no registry pods are running, because I want to ensure storage migration happens correctly (then again, I might not, if I know it doesn't matter, since I think we've only had one actual storage-format migration in the history of the registry).

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 14, 2019
@deads2k
Contributor Author

deads2k commented Feb 14, 2019

It looks like most of the action in this pull is about the meaning of Available, not the meaning of Upgradeable.

@smarterclayton doc updated

@deads2k deads2k added lgtm Indicates that a PR is ready to be merged. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 18, 2019
@deads2k
Contributor Author

deads2k commented Feb 18, 2019

The doc may need clarification to indicate that only False does something.

@openshift-ci-robot

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 18, 2019
@deads2k deads2k added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 18, 2019
@openshift-merge-robot openshift-merge-robot merged commit af79bef into openshift:master Feb 18, 2019
@benjaminapetersen
Contributor

I'm late to the game, but I'll add a little more detail about what makes this confusing. I'm working on the console-operator and deciding how to set the statuses Available, Progressing, and Failing when ManagementState: Removed.

  • If removed, I am deleting all of my resources, including the deployment for my operand (console).
  • If my operand does not exist, logically Available:False is the correct report (to a human). It's not possible to use it; it doesn't exist.
  • Progressing:false and Failing:false are not hard to reason about here. While I am deleting resources, Progressing:true. So long as I succeed, Failing:false.

It seems very confusing that I should actually be reporting Available:true on my operand, as it's supposed to be deleted by user intent, even though it is clearly unavailable.

@benjaminapetersen
Contributor

Gonna add one more.

If I set my operator to ManagementState: Unmanaged, I want to set Available: operatorsv1.ConditionUnknown, because the operator doesn't know what a user might be doing. At present, my understanding is that I am still to set it to Available: true, because the operator is honoring the user's desired state. I'm not sure of the intention of ConditionUnknown at this point.

@benjaminapetersen
Contributor

However, this: https://github.com/openshift/library-go/blob/master/pkg/operator/status/status_controller.go#L116

// if we are unmanaged, force everything to Unknown.
if detailedSpec.ManagementState == operatorv1.Unmanaged {
    clusterOperatorObj.Status = configv1.ClusterOperatorStatus{}
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorAvailable, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorProgressing, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorFailing, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
    configv1helpers.SetStatusCondition(&clusterOperatorObj.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorUpgradeable, Status: configv1.ConditionUnknown, Reason: "Unmanaged"})
}

makes me question what I previously stated.

wking added a commit to wking/oc that referenced this pull request Nov 6, 2024
…onal update risk

We've had Upgradeable since 2019 [1], but it is a confusing condition,
because the "between minor versions" wording [2] in the message isn't
obvious to users who have not yet internalized SemVer's
MAJOR.MINOR.PATCH terminology [3].  Conditional update risks landed in
2021 [4] and give us a clear API for declaring exactly which update
targets have the exposure.  This commit adds client-side code to the
tech-preview 'recommend' subcommand, so folks can get a feel for that
user experience.  If it goes over well, we can shift the logic to the
cluster-version operator so all clients can benefit.  The
alreadyHasUpgradeableRisk check ensures we don't double up if the CVO
eventually picks up this pivot.

[1]: openshift/api#206
[2]: https://github.com/openshift/cluster-version-operator/blob/c4b8362d8acd08d63a600b4d53c33e8737ed7a53/pkg/cvo/upgradeable.go#L218-L228
[3]: https://semver.org/spec/v2.0.0.html#summary
[4]: openshift/api#1011
wking added a commit to wking/oc that referenced this pull request Nov 8, 2024
…onal update risk