Conversation

@LalatenduMohanty
Member

Signed-off-by: Lalatendu Mohanty [email protected]

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LalatenduMohanty
To complete the pull request process, please assign adambkaplan
You can assign the PR to them by writing /assign @adambkaplan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LalatenduMohanty LalatenduMohanty force-pushed the OTA-94 branch 2 times, most recently from 6be54e2 to d033c1a Compare June 8, 2020 14:17
* How can the CVO get the list of operators which have preflight checks implemented?
* What information does the CVO need from the operators to create the preflight jobs?
* Does the CVO already have the capability to create jobs for preflight checks?
* How is execution of the preflight checks triggered?
Member

This needs a whole subsection, with some possibilities picked up for this enhancement and some punted into alternatives and possibly picked up in future expansions. Internally, we've discussed all of:

  1. Run these right after the update is selected, as we currently do for preconditions since Bug 1730401: prevent automatic upgrades for clusters with cluster operators reporting Upgradeable false cluster-version-operator#243.
  2. Run these on-demand for cluster admins, so they can do things like "I'm planning on updating to $VERSION in a month; give me a fresh preflight report to turn up any current issues".
  3. Have the CVO proactively poll with some (configurable?) staleness threshold, like this.

Those are not mutually exclusive. If folks buy my argument for reporting preflight results via a new custom resource, I'd make that custom resource itself a request for an audit. Something like:

```yaml
apiVersion: config.openshift.io/v1alpha1
kind: UpdatePreflight
metadata:
  name: whatever-the-creator-wants
  namespace: openshift-cluster-version
spec:
  image: quay.io/openshift-release-dev/ocp-release@sha256:7613d8f7db639147b91b16b54b24cfa351c3cbde6aa7b7bf1b9c80c260efad06
  frequency: 1d # optional *Duration, https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1?tab=doc#Duration
status:
  ... report fields go here...
```

Then the UpdatePreflight controller could just process those, and admins could create them directly, or the CVO could create them, and the CVO could block on a pass for the requested target version before going ahead with the update.
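For illustration only, the `status` report might follow the usual Kubernetes condition conventions; the field and condition names below are assumptions sketched here, not part of the proposal:

```yaml
status:
  completionTime: "2020-07-09T14:00:00Z"   # hypothetical field: when the last audit finished
  conditions:                              # hypothetical condition types
  - type: Completed
    status: "True"
    reason: ChecksFinished
    message: "12 of 12 preflight checks ran to completion"
  - type: Failing
    status: "False"
    reason: AsExpected
    message: ""
```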

Member

@LalatenduMohanty Can we get a section where we outline these alternatives? I really don't want to over design this but I do think we need to lay out the options and decide which of them we're pursuing today, which we may consider in the future, and which we're deciding against for the foreseeable future.

Member

@crawford Do you have any thoughts on how far we want to go here?

Member Author
@LalatenduMohanty Jul 9, 2020

The CVO will run the preflight check from the new operator image to which the current operator image is trying to upgrade, because the current version of the operator has no knowledge of what code changes the new version brings with it.
So it is the new version's responsibility to figure out whether the current image can update safely to the new image.
The preflight jobs would run in the same namespace as the operators, since they would need the same RBAC as the operators.
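A minimal sketch of what such a preflight Job could look like under this approach; the operator name, image pullspec, and entrypoint are placeholders, not settled details:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-operator-preflight          # placeholder name
  namespace: openshift-example-operator     # same namespace as the operator being checked
spec:
  backoffLimit: 0
  template:
    spec:
      serviceAccountName: example-operator  # reuses the operator's existing RBAC (this approach)
      restartPolicy: Never
      containers:
      - name: preflight
        image: <new-operator-image-pullspec>                  # the *target* version's operator image
        command: ["/usr/bin/example-operator", "preflight"]   # hypothetical entrypoint for the check
```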
Member

Hrm... this would make running preflights for all referenced images hard, and would also block using read-only RBAC to keep preflights from making changes. Can we run the jobs in the openshift-cluster-version namespace (or a new openshift-update-preflight namespace, or some such?), and have this enhancement define the RBAC policy that gets applied? Then the RBAC decisions would be baked into the source release image, but I'm more confident in our ability to figure out that RBAC ahead of time than I am about our ability to figure out the operator-specific logic ahead of time.
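A rough sketch of the sort of read-only policy the enhancement could bake into the source release image for those jobs; the role name and resource scope are assumptions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: update-preflight-reader        # hypothetical role shipped by the running (source) release
rules:
- apiGroups: ["", "config.openshift.io", "operator.openshift.io"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]      # read-only, so preflights cannot mutate cluster state
```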

Contributor

Could we drop the "same RBAC as the operator" in favor of global read-only RBAC, but still run in the operator's namespace?

Member

What does running in the operator's namespace gain you (for images where you can figure it out)? Besides the two issues I raise above (image -> operator identification and read-only RBAC), this would also break CVO volume sharing, right?

Contributor
@dgoodwin Jun 9, 2020

Yeah that's a good case for the CVO ns. I suppose we can access any data we need via RBAC regardless and there shouldn't be anything needing to mount in from the operator namespace. Your approach would also work for net-new operators coming in that don't already have a namespace in the cluster.

Member

I feel like this case of a net new operator needs to be included in the goals section or somewhere else.


* Preflight checks would run right after the update is selected, as we currently do for preconditions.
* Run these on-demand for cluster admins, which would help them when planning an update to a specific version.

#### Reporting results from preflight checks

AFAIU preflight checks would run in the current, to-be-updated environment. What about the case where the operator needs the new environment to run successfully? What information can we get about the update?

Especially in the case of out-of-tree drivers, the hardware enablement operators would need to know e.g. the new kernel version or operating system version. Would this be possible to extract in a preflight check?

Contributor

This pre-flight mechanism is for CVO-managed operators only. In your example, the hardware enablement operators would know during CI whether or not they are compatible with the given OS version. Third party operators could probably make use of some future mechanism, though I would probably lean toward MCD/MCO for your example since it is so closely tied to the OS.


@crawford We have this enhancement for out-of-tree driver support in OpenShift, which would make SRO a CVO-managed operator: #357. This is the case I am "worried" about.

I am totally on your side that CI should catch that, be it our own or a third-party enablement operator. Since we're trying to create a general solution here, we also want to catch cases where the drivers are not part of a CI/CD pipeline.

My thinking was to create some preflight checks so that customers do not upgrade, and render their cluster unusable, if the drivers are not compatible.

In the case of hardware accelerators, I would like to know the kernel and OS version and whether the kernel-devel package is available. In the past, I have seen cases where e.g. we had a bugfix kernel without the kernel-devel and kernel-headers packages on an entitled cluster, where those were part of the machine-os-content and not the official RHEL repositories.

/cc @mrunalp

@michaelgugino
Contributor

Just to follow up from the meeting, I think it would help better inform the architecture and constraints if we can articulate some concrete use cases in this document. CCO pre-flight checks is one concrete use case. Importantly, we don't want to mutate any existing resources when running these checks.



*Note:*
To give specific access to preflight jobs, you need to create a new ServiceAccount which the job can use, and grant that ServiceAccount the required RBAC.
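For illustration, the Note's ServiceAccount-plus-RBAC wiring might look roughly like this; the names, the namespace, and the choice of the built-in `view` role are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: preflight-runner                  # hypothetical ServiceAccount the preflight Job would use
  namespace: openshift-cluster-version    # or wherever the preflight jobs end up running
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: preflight-runner-read
subjects:
- kind: ServiceAccount
  name: preflight-runner
  namespace: openshift-cluster-version
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                              # or a purpose-built read-only role, as discussed above
```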
Member

Point out that this RBAC needs to be set up in the running release, not the candidate release which provides the preflight?

Member Author

This should be handled as part of the preflight job. As long as it is idempotent it should be ok.

Example:

```sh
oc adm upgrade --to-latest --dry-run
```
Member

Can we either punt this to a section about possible future work or explain how it will be implemented? Because I'd expect some sort of API between oc (and web console) and the cluster-version operator, and we don't talk about what that would look like yet.
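One possibility consistent with the UpdatePreflight idea sketched earlier in this thread is for oc (or the web console) to create a preflight request object and then watch its status. This is purely illustrative, not a proposed API:

```yaml
# Hypothetical object that `oc adm upgrade --to-latest --dry-run` could create and poll;
# the kind and fields follow the earlier UpdatePreflight sketch and are not settled.
apiVersion: config.openshift.io/v1alpha1
kind: UpdatePreflight
metadata:
  generateName: dry-run-
  namespace: openshift-cluster-version
spec:
  image: <target-release-pullspec>   # placeholder for the release image being evaluated
```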

Member Author

That's a good point. Will work on it.

Use preflight checks to improve the success rate of upgrades.


### Goals
Contributor

While I generally agree with your sentiment, I think CVO is a bit of a special case. Due to it being a single point of failure for updates, it is critical that this code base remain as simple as possible. Otherwise, we run the risk of clusters becoming wedged and unable to apply or even check for further updates. By their nature, these pre-flight checks would be written by a number of developers, across teams, who are likely not as familiar with the CVO and the assumptions that have been made. Additionally, this coupling of operators to CVO through pre-flight checks feels like a step backward from the operator model we have today.

* This enhancement proposes a framework for preflight checks and discusses the CVO-side implementation of that framework.
* Administrators will be able to run preflight checks without running the upgrade.
* Give operators (CVO-managed) the ability to add checks that verify whether they can safely upgrade to a given release.
* Preflight checks would be idempotent in nature.
Contributor

Further, they should be read-only to the state that they are checking. We don't want folks using pre-flight checks as a migration mechanism.


### User Stories

#### Story 1
Contributor

This is another good use case: https://github.com/openshift/enhancements/pull/363/files#r482442200. @LalatenduMohanty, would you mind adding it to the enhancement?

In addition to that one, the MCO could make use of these to help with the migration off of v2 Ignition configs (we are moving to v3). This is a similar situation to the one described by the Cloud Credential Operator. We can technically make use of Upgradeable=false, but that requires adding the check one release before we are going to remove it. It's certainly possible, but it's a whole lot simpler to add the check at the same time (and in the same place) that we remove support for v2.

@LalatenduMohanty LalatenduMohanty force-pushed the OTA-94 branch 3 times, most recently from c334016 to b64620a Compare September 10, 2020 18:55
@LalatenduMohanty LalatenduMohanty force-pushed the OTA-94 branch 2 times, most recently from 8f1cc8c to 979e3ce Compare September 24, 2020 18:04
Contributor
@smarterclayton left a comment

Even with the new use cases I don’t find a compelling use case requiring complex new code that is superior to “upgradeable false and backport checks”.

Checks before y (if a team is trying to change something) is the definition of a team's job. I agree keeping it out of CVO is good, but putting it in the operator for all scenarios is possible today, and any "cancel and change" problems can be fixed without this.


The cluster-ingress-operator uses the ingress controller as its operand, and configures the ingress controller with an administrator-provided default serving certificate. When we moved to the RHEL 8 base image, it brought a new version of OpenSSL that tightened restrictions on TLS keys, which caused the new operand image (i.e. the ingress controller) to reject keys that the old operand image accepted ([BZ#1874278](https://bugzilla.redhat.com/show_bug.cgi?id=1874278)).

In this case a preflight check can be added for the cluster-ingress-operator that would use the new operand image to verify that it does not reject the certificate currently in use.
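As one hedged sketch of what that check could amount to, assuming the tightened key policy comes from the OpenSSL shipped in the new operand image (the image, secret name, and command are illustrative only; the real check would need to exercise whatever policy the new OpenSSL actually enforces):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ingress-cert-preflight                 # hypothetical
  namespace: openshift-ingress-operator
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: check
        image: <new-ingress-controller-operand-pullspec>   # placeholder for the target operand image
        command:
        - /bin/sh
        - -c
        # Validate the currently configured default serving key with the new image's OpenSSL,
        # so keys the tightened policy would reject surface before the update starts.
        - openssl rsa -check -noout -in /cert/tls.key
        volumeMounts:
        - name: cert
          mountPath: /cert
          readOnly: true
      volumes:
      - name: cert
        secret:
          secretName: <default-serving-certificate-secret>  # placeholder
```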
Contributor

Or we just deliver that code to the older version and set upgradeable false.


#### Story 3

The MCO could make use of these to help with the migration off of v2 Ignition configs (we are moving to v3). This is a similar situation to the one described for the Cloud Credential Operator in [user story 1](#story-1). An alternative would be to use Upgradeable=false, but that requires adding the check one release before we are going to remove it. It's certainly possible, but it's a whole lot simpler to add the check at the same time (and in the same place) that we remove support for v2.
Contributor

Again, not a problem.


* Assume Cluster is running release A.
* Cluster begins updating to B, but operator "foo" has not yet updated to B.
* Admin re-targets cluster to release C, but there are some AB-vs-C incompatibilities for operator "foo".
Contributor

Super hypothetical


### User Stories

#### Story 1
Contributor

I am extremely unconvinced this proposal is superior to upgradeable=false and backport in all 4 use cases described. The scenario here is complicated and can be addressed in simpler ways. The customer would have had to force the upgrade for this, so we should have already told them what version to go to. And in general, we would never encourage upgrading to a new z and then suddenly jumping to a new y.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 28, 2020
@sdodson
Member

sdodson commented Dec 30, 2020

/lifecycle frozen

@openshift-ci-robot openshift-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 30, 2020
wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 7, 2021
Upstream docs [1]:

  kubelet must not be newer than kube-apiserver, and may be up to two
  minor versions older.

The coming OpenShift 4.8 is aiming at 1.21 [2].  This commit will end
up in 4.8.  So when it is considering Upgradeable guards, it is saying
"is the current state compatible with the next OpenShift minor?".  The
next minor will be 4.9, which will presumably run 1.22.  So it will be
compatible with 1.20, 1.21, and 1.22 kubelets.  If for some reason
OpenShift 4.9 ends up being compatible with 1.19 too (say, because it
sticks with 4.8's 1.21), we'll backport a patch to the 4.8
cluster-version operator to bring that wisdom back.  Those
next-minor-knowledge backports are the best we can do unless we get
something like preflight checks [3].

I'm not wildly excited about the current framework.  It has lots of
informers, and I think we might want a single NewSharedInformer to
reduce the number of parallel watches.  It also periodically polls all
of the upgradeable checks, and I would prefer a framework where our
informers fed queues and we had workers processing change events from
those queues to trigger state changes, which should give us lower
latency reactions.  But since we may end up taking this code back to
4.6, the current commit is a minimal addition following the existing
patterns, and we can consider more extensive refactors once we have
the backport-friendly pivot in place.

[1]: https://kubernetes.io/docs/setup/release/version-skew-policy/#kubelet
[2]: openshift/kubernetes#641
[3]: openshift/enhancements#363
@dhellmann
Contributor

This proposal is over a year old. As part of recent efforts to clean up old pull requests, I am removing the lifecycle/frozen label to allow it to age out and be closed. If the proposal is still active, please restore the label.

/remove-lifecycle frozen

@openshift-ci openshift-ci bot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 21, 2021
@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2021
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 26, 2021
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Nov 2, 2021
@openshift-ci
Contributor

openshift-ci bot commented Nov 2, 2021

@openshift-bot: Closed this PR.


In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
