Merged
enhancements/storage/csi-vsphere-operator.md (178 additions, 0 deletions)
---
title: csi-vsphere-operator
authors:
- "@gnufied"
reviewers:
- "@jsafrane"
- "@fbertina"
- "@chuffman"
approvers:
- "@..."
creation-date: 2020-10-05
last-updated: 2020-10-05
status: implementable
see-also: https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-install.md
replaces:
superseded-by:
---

# CSI Driver operator for vSphere

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement proposes deploying the vSphere CSI driver on OpenShift as a default component.

## Motivation

* vSphere is a key cloud provider for OpenShift and is supported in OpenShift 3.x and 4.x. It needs to remain supported and available even after the in-tree driver has been removed.
* The vSphere CSI driver provides new features, such as volume expansion, snapshots, and cloning, that were previously unavailable with the in-tree driver.

### Goals

* Create an operator to install the vSphere CSI driver.
* Publish the driver and operator via default OCP build pipeline.
* Enable creation of a vSphere CSI StorageClass, so that OCP users can start consuming vSphere CSI volumes without further configuration.

### Non-Goals

* We don't intend to create a brand new driver with this enhancement; we merely want to deploy and configure the upstream vSphere CSI driver: https://github.com/kubernetes-sigs/vsphere-csi-driver/


## Proposal

OCP ships the vmware-vsphere-csi-driver-operator by default, managed by [cluster-storage-operator](https://github.com/openshift/cluster-storage-operator/).
The vSphere CSI driver does, however, have a few dependencies on the installer:

### Installer dependency

* The vSphere CSI driver requires HW version 15 on all VMs that make up the OCP cluster. The RHCOS OVA file that Red Hat currently ships defaults to HW version 13, so the VM hardware version must be updated.
* Configuring a vSphere StorageClass requires knowledge of the storage policy that was created in vCenter. Without this information, it is not possible to create a working StorageClass for the CSI driver.

#### HW version of OCP VMs

As mentioned above, the vSphere CSI driver requires HW version 15 on all VMs that make up the OCP cluster. Since OpenShift defaults to HW version 13 when it creates a vSphere OCP cluster, the vSphere CSI driver does not work on an OCP cluster by default.

To solve this, the following alternatives were considered:

* We could provide manual instructions to the user about updating the hardware version of their VMs. This would not be automatic, and in some cases it would not even be possible, because some existing VMs can't be upgraded to HW version 15 in place; hence this option is ruled out.
* We could update the installer to set HW version 15 when an appropriate hypervisor version (6.7u3 or higher) is detected.

Contributor

What if the vSphere cluster has heterogeneous hosts and some support version 15 and some only older versions?

Contributor

We can check vCenter version and all ESXi host versions for the specific vSphere cluster that is configured in the install-config.yaml. Not 100% sure what the appropriate outcome of that information gathering should be though.

* For UPI installs, we can additionally document the required HW version for creating VMs.

Contributor

I agree this should just be documentation.
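
As a rough sketch of the installer alternative above, the decision could compare each detected hypervisor version against 6.7u3 and pick the VM hardware version accordingly. The `hypervisorVersion` type, the `pickHWVersion` helper, the representation of 6.7u3 as major 6, minor 7, update 3, and the policy of requiring every host to qualify are all assumptions for illustration, not installer code.

```go
package main

import "fmt"

// hypervisorVersion is a hypothetical representation of a version such as
// "6.7u3" (Major=6, Minor=7, Update=3).
type hypervisorVersion struct {
	Major, Minor, Update int
}

// atLeast reports whether v is greater than or equal to req.
func (v hypervisorVersion) atLeast(req hypervisorVersion) bool {
	if v.Major != req.Major {
		return v.Major > req.Major
	}
	if v.Minor != req.Minor {
		return v.Minor > req.Minor
	}
	return v.Update >= req.Update
}

// minForHW15 is the minimum hypervisor version assumed to support HW version 15.
var minForHW15 = hypervisorVersion{Major: 6, Minor: 7, Update: 3}

// pickHWVersion returns 15 only when every host qualifies and otherwise falls
// back to the current default of 13. Requiring all hosts is one possible
// answer to the heterogeneous-cluster question, not a decided policy.
func pickHWVersion(hosts []hypervisorVersion) int {
	if len(hosts) == 0 {
		return 13
	}
	for _, h := range hosts {
		if !h.atLeast(minForHW15) {
			return 13
		}
	}
	return 15
}

func main() {
	hosts := []hypervisorVersion{{6, 7, 3}, {7, 0, 0}}
	fmt.Println(pickHWVersion(hosts))
	hosts = append(hosts, hypervisorVersion{6, 5, 0})
	fmt.Println(pickHWVersion(hosts))
}
```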


### Deployment strategy

The vSphere CSI driver operator will be deployed by [cluster-storage-operator](https://github.com/openshift/cluster-storage-operator/).

#### Operator deployment via cluster-storage-operator

The cluster-storage-operator will deploy all the resources necessary for creating the vSphere CSI driver operator.

1. vSphere CSI driver operator will be deployed in namespace `openshift-cluster-csi-drivers`.
2. A service account will be created for running the operator.
3. The operator will get the RBAC rules necessary for running the operator and its operand (i.e., the actual vSphere CSI driver).
4. A Deployment will be created to start the operator; it will be managed by cluster-storage-operator.
5. An instance of `ClusterCSIDriver` will be created to facilitate management of the driver operator. `ClusterCSIDriver` is already defined in https://github.com/openshift/api/blob/master/operator/v1/types_csi_cluster_driver.go but needs to be expanded to include the vSphere CSI driver.

Member

nit: the CR flags that the operator should first, install the driver

6. cluster-storage-operator will request that CVO create the cloud credentials required to talk to the vCenter API.

#### Driver deployment via vmware-vsphere-csi-driver-operator

The operator itself will be responsible for running the driver and all the required sidecars (attacher, provisioner, etc.).

1. The operator will assume that the namespace `openshift-cluster-csi-drivers` already exists for the driver.
2. A service account will be created for running the driver.
3. The operator will create the RBAC rules necessary for running the driver and its sidecars.
4. A Deployment will be created and managed by the operator to run the control-plane sidecars together with the controller side of the CSI driver.
5. A DaemonSet will be created and managed by the operator to run the node side of the driver.
6. A `CSIDriver` object will be created to expose the driver's features to the control plane.
7. The driver operator will use and expose the cloud credentials created by CVO.

Most of the steps outlined above are common to all CSI driver operators, and the vSphere CSI driver is not unique in these respects.
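
The split between the controller Deployment and the node DaemonSet can be summarized as data. This is a sketch under stated assumptions: the container names are hypothetical, and the sidecar set follows common upstream CSI conventions, with the resizer and snapshotter included only because volume expansion and snapshots are listed in the motivation. The `CSIDriver` name `csi.vsphere.vmware.com` is the one used by the upstream vSphere CSI driver.

```go
package main

import "fmt"

// workload is a simplified, hypothetical description of one of the two
// workloads the driver operator manages.
type workload struct {
	Kind       string   // Kubernetes workload kind
	Containers []string // containers in each pod (hypothetical names)
}

// driverWorkloads lists the assumed container roles per workload.
func driverWorkloads() map[string]workload {
	return map[string]workload{
		"controller": {
			Kind:       "Deployment",
			Containers: []string{"csi-driver", "csi-provisioner", "csi-attacher", "csi-resizer", "csi-snapshotter"},
		},
		"node": {
			Kind:       "DaemonSet",
			Containers: []string{"csi-driver", "csi-node-driver-registrar"},
		},
	}
}

// csiDriverName is the name of the CSIDriver object created in step 6.
const csiDriverName = "csi.vsphere.vmware.com"

func main() {
	w := driverWorkloads()
	for _, role := range []string{"controller", "node"} {
		fmt.Printf("%s: %s %v\n", role, w[role].Kind, w[role].Containers)
	}
	fmt.Println("CSIDriver:", csiDriverName)
}
```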

#### Additional consideration for vSphere

Certain aspects of the driver require special handling:

##### StoragePolicy configuration

Currently, when deploying OpenShift, a user can configure the datastore used by OCP via install-config.yaml. The vSphere CSI driver, however, can't use a datastore directly and must be configured with a vSphere storage policy.

To solve this problem, the vSphere CSI operator will create a storage policy by tagging the datastore selected in the installer. This requires OCP to have expanded permissions to create storage policies. After creating the storage policy, the vSphere CSI operator will also create the corresponding StorageClass.

Contributor

tagging selected datastore in the installer

  1. Tagging datastore is not mentioned in the installer chapter above
  2. Installer is not running during upgrade. To upgrade a running cluster, something must tag the datastore and I would expect it's the CSI driver operator.

Member Author

that is why we moved tagging and create storagepolicy work to operator and not in the installer.


If the CSI operator cannot create the storage policy for some reason:

- The operator will mark itself as disabled and stop further driver installation.
- It will periodically retry the installation, waiting for the admin to grant permissions to create the storage policy.
- In 4.7, it will expose a metric if storage policy or StorageClass creation fails.

Member

Can something else, like the installer, create the storagePolicy? If the operator could assume the storagePolicy exists, it wouldn't have to do any juggling with conditions nor potentially leave the cluster without a default StorageClass.

If creating the storagePolicy somewhere else is not an option:

The ultimate goal of the vsphere-csi-driver-operator should be creating a StorageClass in the cluster that users can use to provision their volumes. If for some reason it lacks any permissions to achieve such a goal, IMO it shouldn't leave the cluster with part of the work done and expect the user to fill in the gaps.

IMO it's reasonable to expect the operator to have such permissions in the cluster.

Member Author (@gnufied, Oct 27, 2020)

We discussed this in one of the design calls. During an upgrade, there is no installer pod and hence it must be vsphere CSI operator that creates the storagePolicy.

The operator is not going to leave with just storagePolicy, it will also create storageClass. I will update the wording.


##### Hardware and vCenter version handling

When the vSphere CSI operator starts, it will first use the credentials provided by cluster-storage-operator to verify the vCenter version and the HW version of the VMs. If the vCenter version is older than 6.7u3 or any VM has a HW version lower than 15, and this is a fresh install (i.e., no `CSIDriver` object has already been installed by this operator), it will:

1. Mark itself as disabled and stop the CSI driver installation.
2. Add a clear message indicating the error.
3. Keep retrying the installation with exponential backoff, in case the admin manually fixes the HW version of the VMs.


If VMs without HW version 15 or greater are later added to the cluster, the operator will mark itself as `degraded`, and the nodes that lack the required HW version will be annotated with `vsphere.driver.status.csi.openshift.io: degraded`.

Additionally, the vSphere CSI operator will report the HW version of the VMs that make up the cluster as a metric.
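
The post-install handling can be sketched as a single pass over the nodes: annotate those below HW version 15 with the key named above, clear the annotation once a node is fixed, report whether the operator is degraded, and count nodes per HW version as input to the metric. The `node` struct and the `checkNodes` helper are simplified, hypothetical stand-ins for the real Node objects and operator code; clearing the annotation is an assumption not stated in the proposal.

```go
package main

import (
	"fmt"
	"sort"
)

// degradedAnnotation is the node annotation key named in the proposal.
const degradedAnnotation = "vsphere.driver.status.csi.openshift.io"

const minHWVersion = 15

// node is a simplified stand-in for a Node plus the HW version of its VM.
type node struct {
	Name        string
	HWVersion   int
	Annotations map[string]string
}

// checkNodes annotates nodes below HW version 15, clears the annotation from
// nodes that now qualify, and returns the degraded flag and per-version counts.
func checkNodes(nodes []node) (degraded bool, hwCounts map[int]int) {
	hwCounts = make(map[int]int)
	for i := range nodes {
		n := &nodes[i]
		hwCounts[n.HWVersion]++
		if n.HWVersion < minHWVersion {
			if n.Annotations == nil {
				n.Annotations = make(map[string]string)
			}
			n.Annotations[degradedAnnotation] = "degraded"
			degraded = true
		} else {
			delete(n.Annotations, degradedAnnotation) // no-op on a nil map
		}
	}
	return degraded, hwCounts
}

func main() {
	nodes := []node{{Name: "worker-0", HWVersion: 15}, {Name: "worker-new", HWVersion: 13}}
	degraded, counts := checkNodes(nodes)
	fmt.Println("degraded:", degraded)
	versions := make([]int, 0, len(counts))
	for v := range counts {
		versions = append(versions, v)
	}
	sort.Ints(versions)
	for _, v := range versions {
		fmt.Printf("HW version %d: %d node(s)\n", v, counts[v])
	}
}
```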

##### Presence of existing drivers in the cluster being upgraded

A customer may install the vSphere driver from external sources. In 4.7, we will install the CSI driver operator but will not proceed with the driver installation if an existing CSI driver installation is detected. The operator will mark itself as disabled, but it will not degrade the overall CSO status.

We will additionally gather metrics for such installations and decide in the 4.8 timeframe whether we need to mark such clusters as "unupgradable".
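
The 4.7 behavior for pre-existing drivers can be sketched as a decision over existing `CSIDriver` objects. How the operator recognizes its own installation is not specified in this proposal; the sketch assumes a HYPOTHETICAL `csi.openshift.io/managed-by` label that the operator would set on objects it creates. The driver name `csi.vsphere.vmware.com` is the upstream vSphere CSI driver name.

```go
package main

import "fmt"

// vsphereDriverName is the CSI driver name used by the upstream vSphere CSI driver.
const vsphereDriverName = "csi.vsphere.vmware.com"

// managedByLabel is a HYPOTHETICAL label the operator would set on the
// CSIDriver object it creates, so it can recognize its own installs.
const (
	managedByLabel = "csi.openshift.io/managed-by"
	operatorName   = "vmware-vsphere-csi-driver-operator"
)

// csiDriverObject is a simplified stand-in for a CSIDriver object.
type csiDriverObject struct {
	Name   string
	Labels map[string]string
}

// decision captures the behavior described for 4.7.
type decision struct {
	InstallDriver bool
	Disabled      bool
	DegradeCSO    bool // stays false: an external driver never degrades CSO
}

// decide disables the operator when a vSphere CSIDriver not created by it exists.
func decide(existing []csiDriverObject) decision {
	for _, d := range existing {
		if d.Name == vsphereDriverName && d.Labels[managedByLabel] != operatorName {
			return decision{InstallDriver: false, Disabled: true}
		}
	}
	return decision{InstallDriver: true}
}

func main() {
	external := []csiDriverObject{{Name: vsphereDriverName}}
	fmt.Printf("%+v\n", decide(external))
	fmt.Printf("%+v\n", decide(nil))
}
```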

#### Disabling the operator

#### API

The operator will use https://github.com/openshift/api/blob/master/operator/v1/types_csi_cluster_driver.go for operator configuration and management.

### User stories

#### Story 1

### Implementation Details/Notes/Constraints

### Risks and Mitigations

* We don't yet know the state of the vSphere CSI driver. We need to start running e2e tests with the vSphere driver as early as possible so that we can determine how stable it is.
* We have some concerns about datastores no longer being supported in StorageClasses. This means that in the future, when the in-tree driver is removed, clusters without a storage policy will become unupgradable.

Contributor

Could the operator see this condition and resolve it?

Contributor

Is there a way to reliably pick a storagePolicy by listing the vSphere API? Does it have permissions to do so? How does it find the right policy there?
IMO, the operator can make the cluster unupgradeable and wait for user to provide the policy.


### Test plan

* We plan to enable e2e for vSphere CSI driver.

### Graduation Criteria

There is no dev-preview phase.

#### Tech Preview

#### Tech Preview -> GA

#### Removing a deprecated feature

### Upgrade / Downgrade Strategy

### Version Skew Strategy

## Implementation History

## Drawbacks

## Alternatives

## Infrastructure Needed

* vmware-vsphere-csi-driver GitHub repository (forked from upstream).
* vmware-vsphere-csi-driver-operator GitHub repository.
* vmware-vsphere-csi-driver and vmware-vsphere-csi-driver-operator images.