---
title: upgrade-preflight-checks-for-operators
authors:
- "@LalatenduMohanty"
reviewers:
- "@dgoodwin"
- "@crawford"
- "@sdodson"
- "@wking"
approvers:
- "@crawford"
- "@sdodson"
creation-date: 2020-06-04
last-updated: 2020-09-02
status: implementable
---

# Upgrade Preflight Checks For Operators

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The goal of OpenShift 4 is to make upgrades seamless and risk-free.
To improve upgrade success, we should be able to detect whether the new version is compatible with the existing version before running the actual upgrade.
Because OpenShift 4 is composed of many operators, we need to give operators the ability to run checks before the upgrade to determine whether the incoming version is compatible with the existing version.

## Motivation

CVO-managed operators should be able to run preflight checks before going through the actual upgrade.
This would help operators determine whether they can move to a specific version safely before doing an actual upgrade.
It also empowers administrators to check whether an upgrade to the target version is likely to succeed without actually performing the upgrade.
In short, preflight checks improve the success rate of upgrades.


### Goals

* This enhancement proposes a framework for preflight checks and discusses the CVO-side implementation of that framework.
* Administrators will be able to run preflight checks without running the upgrade.
* Give operators (CVO-managed) the ability to add checks that determine whether they can safely upgrade to a given release.
* Preflight checks must be idempotent.
* Downloading and verifying the images associated with the release payload should be part of the preflight check.

### Non-Goals

* This enhancement does not commit particular operators to providing preflight implementations, nor does it discuss the operator-specific logic that preflight checks should execute.
* Preflight checks should not be used as a migration mechanism.
* It is not the responsibility of preflight checks to protect against upgrades spanning more than one minor version, i.e. 4.y.z to 4.(y+2).z.

## Proposal

Operators need to provide a dedicated binary path for the preflight checks, i.e. `/bin/preflight`.
This path needs to exist inside the operator image.
The cluster-version operator (CVO) will run a single job per operator for the preflight checks.
`/bin/preflight` must be executable; it can be a symbolic link to the operator binary, a script, or it can be absent.
Users can skip or override the preflight checks with `--force`.
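As a rough sketch of what such an entrypoint might look like (the check logic, function names, and argument convention here are purely illustrative assumptions, not part of the proposal):

```shell
#!/bin/sh
# Hypothetical /bin/preflight sketch. Convention assumed here: exit 0 when
# the upgrade looks safe, exit non-zero with a message on stderr otherwise.
# The minor-version check below is only an example of a cheap, read-only test.

minor() { printf '%s\n' "$1" | cut -d. -f2; }

preflight() {
    current="$1"
    target="$2"
    # Refuse to validate jumps of more than one minor version, mirroring
    # the stated non-goal for 4.y.z -> 4.(y+2).z updates.
    if [ "$(( $(minor "$target") - $(minor "$current") ))" -gt 1 ]; then
        echo "preflight: ${current} -> ${target} skips a minor version" >&2
        return 1
    fi
    echo "preflight: ${current} -> ${target} looks safe"
}
```

A real check would inspect cluster state (read-only) rather than version strings, and the CVO would surface the job's exit status and logs to the administrator.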

### User Stories

#### Story 1

Some customers would prefer to disable automatic credential provisioning by the Cloud Credential Operator (CCO) and instead manage credentials themselves.
However, if new permissions are required by the target version, the upgrade could block midway even if the administrator had provisioned more powerful credentials.
With the help of preflight checks, the CCO can verify that the cluster can safely upgrade to a given release before the actual upgrade begins.

#### Story 2

The cluster-ingress-operator uses the ingress controller as its operand. The operator configures the ingress controller using an administrator-provided default serving certificate. The move to a RHEL 8 base image brought a new version of OpenSSL that tightened restrictions on TLS keys, which caused the new operand image, i.e. the ingress controller, to reject keys that the old operand image accepted ([BZ#1874278](https://bugzilla.redhat.com/show_bug.cgi?id=1874278)).

In this case, a preflight check could be added to the cluster-ingress-operator that uses the new operand image to verify that it does not reject the certificate currently in use.
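A sketch of that kind of check, assuming the preflight job runs the new operand image and has read access to the key material (the function name and file handling are illustrative):

```shell
# Hypothetical sketch: run the new image's openssl against the currently
# configured serving key to see whether the tightened policy still accepts it.
check_serving_key() {
    key_file="$1"
    # `openssl pkey -noout` parses the key without printing it; a non-zero
    # exit means the new OpenSSL rejects (or cannot parse) the key.
    if openssl pkey -in "$key_file" -noout 2>/dev/null; then
        echo "serving key accepted"
    else
        echo "serving key rejected by new OpenSSL" >&2
        return 1
    fi
}
```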

#### Story 3

The MCO could make use of these checks to help with the migration off of Ignition v2 configs (OpenShift is moving to v3). This is similar to the situation described for the Cloud Credential Operator in [user story 1](#story-1). The alternative would be to use Upgradeable=False, but that requires adding the check one release before support is removed. It's certainly possible, but it's a whole lot simpler to add the check at the same time (and in the same place) that support for v2 is removed.

#### Story 4

Currently, operators set Upgradeable=False if they do not want the cluster to upgrade to another version. However, this may not be effective in the situation below, where preflight checks would work better.

* Assume the cluster is running release A.
* The cluster begins updating to B, but operator "foo" has not yet been updated to B.
* The admin re-targets the cluster to release C, but there are some A/B-vs-C incompatibilities for operator "foo".

In the Upgradeable=False world, we would need to backport the checks to release B to detect the incompatibilities. Suppose version B has the right checks: it is still possible that operator "foo" remains at version A and has not learned about the check. Because operator "foo" is not yet at version B, it is not in a position to perform the check, so the update to C will not be blocked.

In the above scenario the edge from A to C might be blocked, but the code performing the check may still be from release A. Blocking the A->C edge would not be sufficient to ensure folks run operator "foo" from release B before leaping over to C.

In a preflight world, the checking code lives in release C, which ensures we check compatibility before upgrading. We are isolated from any version skew due to heterogeneous operator versions in a cluster.

### Implementation Details

The CVO will run the preflight check from the new operator image, i.e. the image to which the current operator is trying to upgrade.
This is because the current version of an operator has no knowledge of the code changes coming in the new version, so it is the incoming version's responsibility to figure out whether the current version can update safely to it.
However, the CVO must complete signature verification of the target release before launching preflight checks.

The CVO will look for manifests with a specific annotation, i.e. `openshift.io/preflight-check: "true"`, and use those manifests to start the preflight jobs.

Some more important points:

* Operators can choose the name and namespace for the job; the CVO places no restriction on this.
* However, when the CVO runs these jobs it will append a randomized suffix to keep the job names unique, because many preflight jobs can run at the same time.
* Preflight checks should not mutate any existing resources, as that would make the checks non-idempotent. It is therefore advisable to give preflight checks read-only access.
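A rough sketch of these two CVO-side steps, under the assumption that manifests are scanned as files on disk and with an unspecified suffix format (both helper names and the suffix scheme are illustrative):

```shell
# Hypothetical sketch of the annotation scan and job-name suffixing.

# List manifest files in a payload directory that carry the preflight
# annotation. A real implementation would parse YAML rather than grep it.
find_preflight_manifests() {
    payload_dir="$1"
    grep -rl 'openshift.io/preflight-check: "true"' "$payload_dir" 2>/dev/null
}

# Append a random hex suffix so concurrent preflight jobs stay unique.
unique_job_name() {
    base="$1"
    suffix=$(od -An -N4 -tx1 /dev/urandom | tr -d ' \n')
    echo "${base}-${suffix}"
}
```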


#### Registering for preflight checks

Operators register themselves for a preflight check by defining a Kubernetes Job (batch/v1) manifest.
Preflight checks will be skipped for any operator that does not provide such a manifest.

An example of a preflight job manifest:

```
$ cat 0000_50_cloud-credential-operator_11_preflight-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cloud-credential-preflight-checks
  namespace: openshift-cloud-credential-operator
  annotations:
    openshift.io/preflight-check: "true"
spec:
  template:
    spec:
      serviceAccountName: openshift-cloud-credential-preflight
      containers:
      - name: cloud-credential-operator
        image: quay.io/openshift/origin-cloud-credential-operator:latest
        resources:
          requests:
            cpu: 10m
            memory: 150Mi
        command: ["/bin/preflight", "foo", "baar"]
      restartPolicy: Never
  backoffLimit: 4
```

The standard [images-references] replacement will apply to this manifest too.

*Note:*
To give specific access to preflight jobs, you need to create a new ServiceAccount for the job to use and grant that ServiceAccount the required RBAC.
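For illustration, a minimal read-only RBAC sketch for such a ServiceAccount (the names mirror the example manifest above; the exact resources a check needs to read will vary per operator):

```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: openshift-cloud-credential-preflight
  namespace: openshift-cloud-credential-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cloud-credential-preflight-reader
  namespace: openshift-cloud-credential-operator
rules:
# Read-only verbs keep the check idempotent, per the goals above.
- apiGroups: [""]
  resources: ["secrets", "configmaps"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cloud-credential-preflight-reader
  namespace: openshift-cloud-credential-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cloud-credential-preflight-reader
subjects:
- kind: ServiceAccount
  name: openshift-cloud-credential-preflight
  namespace: openshift-cloud-credential-operator
```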

#### Triggering execution of preflight checks

Preflight checks can be run in the following ways:

##### Case 1

Preflight checks would run implicitly right after the update is selected, as we currently do for precondition checks.

##### Case 2

Cluster administrators can run preflight checks on demand using `--dry-run`.
Example:

```sh
oc adm upgrade --to-latest --dry-run
```

#### Collecting results from preflight checks

The preflight jobs for a release will be created with a unique label, i.e. the set of preflight jobs triggered for a preflight check of a specific version will share a unique label.
This label will be used to keep track of the jobs and to collect results from them.
At any point in time there may be many preflight jobs running in the cluster, so the label also helps track preflight jobs based on when they were started.
To collect and display the results, we can query for all the jobs by the label.

The objective is to collect the results, i.e. pass/fail status and error messages, from the preflight jobs.
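A sketch of the collection step, assuming jobs carry a shared label such as `preflight.openshift.io/target` (the label name and the summarizer below are illustrative; in-cluster, the status data would come from something like `oc get jobs -A -l <label> -o json`):

```shell
# Hypothetical sketch: reduce per-job status counts to a pass/fail summary.
# Reads lines of "job-name succeeded-count failed-count" on stdin, which a
# caller could extract from the Jobs' .status fields.
summarize_preflight() {
    overall=0
    while read -r name succeeded failed; do
        if [ "${succeeded:-0}" -ge 1 ]; then
            echo "$name: passed"
        else
            echo "$name: failed (${failed:-0} attempts)"
            overall=1
        fi
    done
    return "$overall"
}
```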

### Risks and Mitigation

The responsibility for implementing the actual preflight tests lies with the respective operator developers, so any bugs in the tests can create false-positive or false-negative results.

## Design Details

### Open Questions

1. Should we save the preflight history? Would it really be useful for users?

2. If a preflight check is already running for a specific release and the user starts a preflight for the same image (same signature) again, we can just tell the user that the preflight is already running and show the results so far, right?

### Test Plan

To be done

## Drawbacks

The current proposal does not support preflight checks for a brand-new operator, as preflight checks should not be responsible for deploying new operators. We also do not want preflight checks to change anything in the cluster, as that would complicate the design.

## Alternatives

None

[images-references]: https://github.com/openshift/cluster-version-operator/blob/d6475eff00192a56cd6a833aa7f73d6f4f4a2cef/docs/dev/operators.md#how-do-i-ensure-the-right-images-get-used-by-my-manifests