Enhancement for adding upgrade preflight checks for operators #363
Conversation
> * How can the CVO get the list of operators which have preflight checks implemented?
> * What information does the CVO need from the operators to create the preflight jobs?
> * Does the CVO already have the capability to create jobs for preflight checks?
> * How is execution of the preflight checks triggered?
This needs a whole subsection, with some possibilities picked up for this enhancement and some punted into alternatives and possibly picked up in future expansions. Internally, we've discussed all of:
- Run these right after the update is selected, as we currently do for preconditions since Bug 1730401: prevent automatic upgrades for clusters with cluster operators reporting Upgradeable false cluster-version-operator#243.
- Run these on-demand for cluster admins, so they can do things like "I'm planning on updating to $VERSION in a month; give me a fresh preflight report to turn up any current issues".
- Have the CVO proactively poll with some (configurable?) staleness threshold, like this.
Those are not mutually exclusive. If folks buy my argument for reporting preflight results via a new custom resource, I'd make that custom resource itself a request for an audit. Something like:
```yaml
apiVersion: config.openshift.io/v1alpha1
kind: UpdatePreflight
metadata:
  name: whatever-the-creator-wants
  namespace: openshift-cluster-version
spec:
  image: quay.io/openshift-release-dev/ocp-release@sha256:7613d8f7db639147b91b16b54b24cfa351c3cbde6aa7b7bf1b9c80c260efad06
  frequency: 1d # optional *Duration, https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1?tab=doc#Duration
status:
  # ... report fields go here ...
```

Then the UpdatePreflight controller could just process those, and admins could create them directly, or the CVO could create them, and the CVO could block on a pass for the requested target version before going ahead with the update.
@LalatenduMohanty Can we get a section where we outline these alternatives? I really don't want to over design this but I do think we need to lay out the options and decide which of them we're pursuing today, which we may consider in the future, and which we're deciding against for the foreseeable future.
@crawford Do you have any thoughts on how far we want to go here?
Added a section for this 4cadad5#diff-f53e26d63b28b30cadb2bd5bc6342812R80
> The CVO will run the preflight check from the new operator image that the current operator image is trying to upgrade to.
> This is because the current version of the operator has no knowledge of what code changes the new version brings with it.
> So it is the new version's responsibility to figure out whether the current image can update safely to the new image.
> The preflight jobs would run in the same namespace as the operators, as they would need the same RBAC as the operators.
Hrm... this would make running preflights for all referenced images hard, and would also block using read-only RBAC to keep preflights from making changes. Can we run the jobs in the openshift-cluster-version namespace (or a new openshift-update-preflight namespace, or some such?), and have this enhancement define the RBAC policy that gets applied? Then the RBAC decisions would be baked into the source release image, but I'm more confident in our ability to figure out that RBAC ahead of time than I am about our ability to figure out the operator-specific logic ahead of time.
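For illustration, here is a minimal sketch of what such baked-in, read-only RBAC could look like. The role name and the exact rule set are assumptions for this sketch, not something the enhancement has settled on:

```yaml
# Hypothetical read-only ClusterRole for preflight jobs; the name and the
# rule list are illustrative only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: update-preflight-reader
rules:
- apiGroups: ["config.openshift.io"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]   # no create/update/delete, so checks stay read-only
- apiGroups: [""]
  resources: ["nodes", "namespaces", "configmaps", "secrets"]
  verbs: ["get", "list", "watch"]
```

Whether a single cluster-wide read-only role like this is enough, or whether some checks (for example, reading a serving-certificate secret) need narrower, namespace-scoped grants, would be part of the RBAC policy this enhancement defines.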
Could we drop the "same RBAC as the operator" in favor of global read-only RBAC, but still run in the operator's namespace?
What does running in the operator's namespace gain you (for images where you can figure it out)? Besides the two issues I raise above (image -> operator identification and read-only RBAC), this would also break CVO volume sharing, right?
Yeah that's a good case for the CVO ns. I suppose we can access any data we need via RBAC regardless and there shouldn't be anything needing to mount in from the operator namespace. Your approach would also work for net-new operators coming in that don't already have a namespace in the cluster.
I feel like this case of a net new operator needs to be included in the goals section or somewhere else.
> * Preflight checks would run right after the update is selected, as we currently do for preconditions.
> * Run these on-demand for cluster admins, as it would help them in planning an update to a specific version.
>
> #### Reporting results from preflight checks
AFAIU, preflight checks would run in the current, to-be-updated environment. What about the case where the operator needs the new environment to run successfully? What information can we get about the update?
Especially in the case of out-of-tree drivers, the hardware enablement operators would need to know e.g. the new kernel version or operating system version. Would it be possible to extract this in a preflight check?
This pre-flight mechanism is for CVO-managed operators only. In your example, the hardware enablement operators would know during CI whether or not they are compatible with the given OS version. Third party operators could probably make use of some future mechanism, though I would probably lean toward MCD/MCO for your example since it is so closely tied to the OS.
@crawford We have this enhancement for out-of-tree driver support in OpenShift, which would make SRO a CVO-managed operator: #357. This is the case I am "worried" about.
I am totally on your side that CI should catch that, be it ours or a third-party enablement operator's. Since we're trying to create a general solution here, we also want to catch cases where the drivers are not part of a CI/CD pipeline.
My thinking was to create some preflight checks so that customers do not upgrade if the drivers are not compatible and render their cluster unusable.
In the case of hardware accelerators, I would like to know the kernel and OS version and whether the kernel-devel package is available. In the past, I have seen cases where e.g. we had a bugfix kernel without the kernel-devel and kernel-headers packages on an entitled cluster, where those were part of the machine-os-content and not the official RHEL repositories.
/cc @mrunalp
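As a point of reference for what a preflight check can already see about the running cluster (as opposed to the incoming release), the kernel and OS versions are exposed on every Node object. The field names below are from the core/v1 NodeSystemInfo API; the values are examples only, not taken from this PR:

```yaml
# Excerpt of a Node object's status that a read-only preflight check could
# inspect; values are illustrative.
status:
  nodeInfo:
    kernelVersion: 4.18.0-193.28.1.el8_2.x86_64
    osImage: Red Hat Enterprise Linux CoreOS 46.82.202009222340-0
    containerRuntimeVersion: cri-o://1.19.0
```

The harder part of the question stands: the kernel and OS versions shipped by the target release are not present in the cluster yet, so a preflight check would need that information published as metadata in (or alongside) the target release image.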
Just to follow up from the meeting, I think it would help better inform the architecture and constraints if we can articulate some concrete use cases in this document. CCO pre-flight checks is one concrete use case. Importantly, we don't want to mutate any existing resources when running these checks.
> *Note:*
> To give specific access to preflight jobs, you need to create a new ServiceAccount which the job can use, and give that ServiceAccount the required RBAC.
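A minimal sketch of that wiring, assuming the preflight job runs in the openshift-cluster-version namespace as discussed above; all names here are illustrative, including the referenced role:

```yaml
# Hypothetical ServiceAccount for preflight jobs plus a binding to a
# read-only ClusterRole; names are illustrative only.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: preflight-checker
  namespace: openshift-cluster-version
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: preflight-checker-read-only
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: update-preflight-reader   # e.g. the read-only role sketched earlier
subjects:
- kind: ServiceAccount
  name: preflight-checker
  namespace: openshift-cluster-version
```

As the thread below notes, this RBAC either has to ship in the running release or be applied idempotently by the preflight job itself.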
Point out that this RBAC needs to be set up in the running release, not the candidate release which provides the preflight?
This should be handled as part of the preflight job. As long as it is idempotent it should be ok.
> Example:
>
> ```sh
> oc adm upgrade --to-latest --dry-run
> ```
Can we either punt this to a section about possible future work or explain how it will be implemented? Because I'd expect some sort of API between oc (and web console) and the cluster-version operator, and we don't talk about what that would look like yet.
That's a good point. Will work on it.
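One possible shape for that oc-to-CVO API, sketched here only as an illustration and reusing the UpdatePreflight idea from earlier in the thread: `oc adm upgrade --to-latest --dry-run` could create a one-shot preflight request and then watch its status. None of the field names below are settled:

```yaml
# Hypothetical one-shot preflight request that oc (or the web console)
# could create on the admin's behalf; field names are illustrative.
apiVersion: config.openshift.io/v1alpha1
kind: UpdatePreflight
metadata:
  generateName: dry-run-
  namespace: openshift-cluster-version
spec:
  image: quay.io/openshift-release-dev/ocp-release@sha256:7613d8f7db639147b91b16b54b24cfa351c3cbde6aa7b7bf1b9c80c260efad06
  # no frequency: run once, report, and let the client read the status
```

Under that model, oc would poll or watch the resource's status and render the report, which would keep the CLI and the web console on the same API instead of teaching the CVO a separate dry-run endpoint.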
> Use preflight checks to improve the success rate of upgrades.
>
> ### Goals
While I generally agree with your sentiment, I think CVO is a bit of a special case. Due to it being a single point of failure for updates, it is critical that this code base remain as simple as possible. Otherwise, we run the risk of clusters becoming wedged and unable to apply or even check for further updates. By their nature, these pre-flight checks would be written by a number of developers, across teams, who are likely not as familiar with the CVO and the assumptions that have been made. Additionally, this coupling of operators to CVO through pre-flight checks feels like a step backward from the operator model we have today.
> * This enhancement proposes a framework for preflight checks and discusses the CVO-side implementation of that framework.
> * Administrators will be able to run preflight checks without running the upgrade.
> * Give operators (CVO-managed) the ability to add checks that verify they can safely upgrade to a given release.
> * Preflight checks would be idempotent in nature.
Further, they should be read-only to the state that they are checking. We don't want folks using pre-flight checks as a migration mechanism.
> ### User Stories
>
> #### Story 1
This is another good use case: https://github.com/openshift/enhancements/pull/363/files#r482442200. @LalatenduMohanty, would you mind adding it to the enhancement?
In addition to that one, the MCO could make use of these to help with the migration off of v2 Ignition configs (we are moving to v3). This is a similar situation to the one described by the Cloud Credential Operator. We can technically make use of Upgradeable=false, but that requires adding the check one release before we are going to remove it. It's certainly possible, but it's a whole lot simpler to add the check at the same time (and in the same place) that we remove support for v2.
smarterclayton left a comment:
Even with the new use cases, I don't find a compelling use case that requires complex new code and is superior to "Upgradeable=false and backport checks".
Checks before a y-level update (if a team is trying to change something) are the definition of a team's job. I agree keeping this out of the CVO is good, but putting it in the operator for all scenarios is possible today, and any "cancel and change" problems can be fixed without this.
> cluster-ingress-operator uses the ingress controller as its operand. The cluster-ingress-operator configures the ingress controller using an administrator-provided default serving certificate. Moving to the RHEL 8 base image brought a new version of OpenSSL that tightened restrictions on TLS keys, which caused the new operand image, i.e. the ingress controller, to reject keys that the old operand image accepted ([BZ#1874278](https://bugzilla.redhat.com/show_bug.cgi?id=1874278)).
>
> In this case a preflight check can be added for cluster-ingress-operator that uses the new operand image to verify that it does not reject the certificate currently in use.
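As an illustration only, such a check could be a Job that runs the new operand image against the currently configured certificate. Everything here is a sketch: the image reference, the ServiceAccount, and the validation entrypoint and flag are hypothetical, not commands the real ingress-controller image is known to ship:

```yaml
# Hypothetical preflight Job for the ingress certificate check; image,
# ServiceAccount, command, and args are all illustrative placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: ingress-cert-preflight
  namespace: openshift-cluster-version
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: preflight-checker           # read-only SA sketched earlier
      restartPolicy: Never
      containers:
      - name: check
        image: <new-ingress-operand-image>            # pulled from the target release
        command: ["/usr/bin/validate-certificate"]    # hypothetical check entrypoint
        args: ["--secret", "<serving-cert-secret>"]   # hypothetical flag and value
```

Running it read-only in the CVO's namespace, per the RBAC discussion above, would keep the check from touching the operator's own state.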
Or we just deliver that code to the older version and set upgradeable false.
> #### Story 3
>
> The MCO could make use of these to help with the migration off of v2 Ignition configs (we are moving to v3). This is similar to the Cloud Credential Operator situation described in [user story 1](#story-1). An alternative would be to use Upgradeable=false, but that requires adding the check one release before we remove the support. It's certainly possible, but it's a whole lot simpler to add the check at the same time (and in the same place) that we remove support for v2.
Again, not a problem.
> * Assume the cluster is running release A.
> * The cluster begins updating to B, but operator "foo" has not yet updated to B.
> * The admin re-targets the cluster to release C, but there are some AB-vs-C incompatibilities for operator "foo".
Super hypothetical
> ### User Stories
>
> #### Story 1
I am extremely unconvinced this proposal is superior to Upgradeable=false and a backport in all 4 use cases described. The scenario here is complicated and can be addressed in simpler ways. The customer would have had to force the upgrade for this, so we should have already told them what version to go to. And in general, we would never encourage upgrading to a new z and then suddenly jumping to a new y.
/lifecycle frozen
Upstream docs [1]: kubelet must not be newer than kube-apiserver, and may be up to two minor versions older. The coming OpenShift 4.8 is aiming at 1.21 [2]. This commit will end up in 4.8. So when it is considering Upgradeable guards, it is saying "is the current state compatible with the next OpenShift minor?". The next minor will be 4.9, which will presumably run 1.22. So it will be compatible with 1.20, 1.21, and 1.22 kubelets. If for some reason OpenShift 4.9 ends up being compatible with 1.19 too (say, because it sticks with 4.8's 1.21), we'll backport a patch to the 4.8 cluster-version operator to bring that wisdom back. Those next-minor-knowledge backports are the best we can do unless we get something like preflight checks [3].

I'm not wildly excited about the current framework. It has lots of informers, and I think we might want a single NewSharedInformer to reduce the number of parallel watches. It also periodically polls all of the upgradeable checks, and I would prefer a framework where our informers fed queues and we had workers processing change events from those queues to trigger state changes, which should give us lower latency reactions. But since we may end up taking this code back to 4.6, the current commit is a minimal addition following the existing patterns, and we can consider more extensive refactors once we have the backport-friendly pivot in place.

[1]: https://kubernetes.io/docs/setup/release/version-skew-policy/#kubelet
[2]: openshift/kubernetes#641
[3]: openshift/enhancements#363
This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the life-cycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label.

/remove-lifecycle frozen
@openshift-bot: Closed this PR.