From 862902f9fa2a56df2f97b57b1ab569fd7867057f Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Wed, 29 Jul 2020 18:10:31 +0200 Subject: [PATCH 01/28] Add proposal for CSI migration --- enhancements/storage/csi-migration.md | 136 ++++++++++++++++++++++++++ 1 file changed, 136 insertions(+) create mode 100644 enhancements/storage/csi-migration.md diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md new file mode 100644 index 0000000000..aaa4da6585 --- /dev/null +++ b/enhancements/storage/csi-migration.md @@ -0,0 +1,136 @@ +--- +title: Migration of in-tree volume plugins to CSI drivers +authors: + - "@fbertina" +reviewers: + - "@openshift/storage ” +approvers: + - "@openshift/openshift-architects" +creation-date: 2020-07-01 +last-updated: 2020-07-01 +status: provisional +see-also: +replaces: +superseded-by: +--- + +# Migration of in-tree volume plugins to CSI drivers + +## Release Signoff Checklist + +- [ ] Enhancement is `implementable` +- [ ] Design details are appropriately documented from clear requirements +- [ ] Test plan is defined +- [ ] Graduation criteria for dev preview, tech preview, GA +- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) + +## Summary + +We want to allow cluster administrators to enable and disable the migration of in-tree volumes to their counterparts CSI drivers. + +## Motivation + +CSI migration is going to be enabled by default in Kubernetes 1.22 (OCP 4.9). As a result, volumes from in-tree plugins will be migrated to their counterpart CSI driver by default. + +In order to avoid surprises once the migration is enabled by default, we want to allow cluster administrators to optionally enable the migration earlier, preferably in OCP 4.8. + +## Goals + +Introduce a mechanism to switch certain feature gates on and off accross OCP components *in a predefined order*. + +* When the CSI migration is **enabled**, events should happen in this order: + 1. Enable the feature gate in all control-plane components. + 2. Once that's done, drain nodes one-by-one and start the kubelet with the feature gate enabled. + +* When the CSI migration is **disabled**, events should happen in this order: + 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. + 2. Once that's done, disable the feature gate in all control-plane components. + +## Non-Goals + +* Install or remove the CSI driver as the migration is enabled or disabled. + +## Proposal + +The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should enable these 2 feature gates: + +* CSIMigration +* CSIMigrationAWS + +The resource [FeatureGate](https://github.com/openshift/api/blob/dca637550e8c80dc2fa5ff6653b43a3b5c6c810c/config/v1/types_feature.go#L9-L21) can be used to enable and disable feature gates in OCP. + +### Proposal I (rejected): CustomNoUpgrade + +The CustomNoUpgrade feature set can be used to enable the required feature gates. Here is an example: + + +```shell +$ oc edit featuregates/cluster +``` + +```yaml +(...) +spec: + customNoUpgrade: + enabled: + - CSIMigration + - CSIMigrationGCE + featureSet: CustomNoUpgrade +``` + +Kubernetes components will restart with the features properly set. + +This is **not an acceptable solution** because: + +1. We can't control the order in which Kubernetes components will be restarted with the features set. +1. It's not possible to upgrade the cluster. + +### Proposal II (rejected): New FeatureSet + +We can create a new *FeatureSet* to enable or disable the CSI migration in a specific cloud. Each cloud platform will have its dedicated FeatureSet. + +This is what should be done: + +1. First, we create [a new FeatureSet](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) called `CSIMigrationAWS`. +2. This FeatureSet contains contains 2 features gates enabled: `CSIMigration` and `CSIMigrationAWS`. +3. If a cluster administrator decides to enable the CSI Migration for AWS, they would add the FeatureSet to the `featuregates/cluster` object: +```shell +$ oc edit featuregates/cluster + +(...) +spec: + featureSet: CSIMigrationAWS +``` +4. OCP will restart all components with both feature gates above enabled. +5. Once the cluster administrator decides to disable the CSI migration for AWS, they would undo the step 3 above. + +Even though this solution allows the cluster to be upgraded, it's still not possible to control the order in each the components apply the feature gates. + +### Proposal III (to be rejected): New FeatureSets and custom code + +This is similar to the previous approach. + +1. We create 2 *FeatureSets* to support CSI migration in AWS: `CSIMigrationAWSNode` and `CSIMigrationAWSControlPlane`. + 1.1 Only control-plane operators will understand the `CSIMigrationAWSControlPlane` *FeatureSet*. The kubelet will ignore it. + 1.2 Only the kubelet will react to the `CSIMigrationAWSNode` *FeatureSet*. Control-plane operators will ignore it. +2. Both FeatureSets above enable 2 feature gates: `CSIMigration` and `CSIMigrationAWS`. +3. To enable the CSI Migration for AWS, either the cluster administrator or an operator would: + 3.1 Add the `CSIMigrationAWSControlPlane` *FeatureSet* to the `featuregates/cluster` object. + 3.3 Once all control-plane components restarted, add the `CSIMigrationAWSNode` *FeatureSet* to the `featuregates/cluster` object. +4. In order to disable the migration, follow the opposite steps: first remove `CSIMigrationAWSNode`, then `CSIMigrationAWSControlplane`. + +Although this approach does what we need, it has many drawbacks: + +1. Having the control-plane or the kubelet ignoring certain FeatureSets is not common. +1. A new operator needs to be created. +1. We need to patch many operators in order to ignore *FeatureSets*. + +### Risks and Mitigations + +## Design Details + +### Open Questions + +There are a few things that we don't know yet: + +1. Once we switch the feature gate using the `FeatureGate` resource, can we control the order that Kubernetes components will apply the change? If that's possible, we won't need the **Proposal III** above. From fd99226e9d1e41bb6bb21dc7c33edf294061a13c Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Wed, 27 Jan 2021 18:21:58 +0100 Subject: [PATCH 02/28] Update CSI migration proposal --- enhancements/storage/csi-migration.md | 107 ++++++++++++-------------- 1 file changed, 48 insertions(+), 59 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index aaa4da6585..082a3980ca 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -26,11 +26,11 @@ superseded-by: ## Summary -We want to allow cluster administrators to enable and disable the migration of in-tree volumes to their counterparts CSI drivers. +We want to allow cluster administrators to enable and disable the migration of in-tree volumes to their counterparts CSI drivers before the feature becomes GA in upstream. ## Motivation -CSI migration is going to be enabled by default in Kubernetes 1.22 (OCP 4.9). As a result, volumes from in-tree plugins will be migrated to their counterpart CSI driver by default. +CSI migration is going to tentatively be enabled by default in Kubernetes 1.22 (OCP 4.9). As a result, volumes from in-tree plugins will be migrated to their counterpart CSI driver by default. In order to avoid surprises once the migration is enabled by default, we want to allow cluster administrators to optionally enable the migration earlier, preferably in OCP 4.8. @@ -46,91 +46,80 @@ Introduce a mechanism to switch certain feature gates on and off accross OCP com 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. 2. Once that's done, disable the feature gate in all control-plane components. +This mechanism should be used as a Tech Preview feature in OCP. This mechanism will be disabled in OCP once CSI migration becomes GA in upstream. + ## Non-Goals * Install or remove the CSI driver as the migration is enabled or disabled. ## Proposal -The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should enable these 2 feature gates: +The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should enable these two feature gates: * CSIMigration * CSIMigrationAWS -The resource [FeatureGate](https://github.com/openshift/api/blob/dca637550e8c80dc2fa5ff6653b43a3b5c6c810c/config/v1/types_feature.go#L9-L21) can be used to enable and disable feature gates in OCP. - -### Proposal I (rejected): CustomNoUpgrade - -The CustomNoUpgrade feature set can be used to enable the required feature gates. Here is an example: +In order to enable and disable these feature gates in OCP, we propose two different approaches to be used at different times during the feature lifecycle. +The first approach is intended to be used once CSI migration is Tech Preview in OCP. The other approach should be used to once we graduate CSI migration to GA. -```shell -$ oc edit featuregates/cluster -``` +### Tech Preview -```yaml -(...) -spec: - customNoUpgrade: - enabled: - - CSIMigration - - CSIMigrationGCE - featureSet: CustomNoUpgrade -``` -Kubernetes components will restart with the features properly set. +In OCP, *FeatureSets* can be used to aggregate one or more feature gates. Then, the resource [FeatureGate](https://github.com/openshift/api/blob/dca637550e8c80dc2fa5ff6653b43a3b5c6c810c/config/v1/types_feature.go#L9-L21) can be used to enable and disable *FeatureSets*. -This is **not an acceptable solution** because: +With that in mind, we propose to: -1. We can't control the order in which Kubernetes components will be restarted with the features set. -1. It's not possible to upgrade the cluster. +1. Create two [new FeatureSets](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) to support CSI migration: `CSIMigrationNode` and `CSIMigrationControlPlane`. + 1.1 Both *FeatureSets* above enable the same feature gates: `CSIMigration`, `CSIMIgrationAWS`, `CSIMigrationGCE`, `CSIMigrationAzureDisk`, `CSIMigrationAzureFile`, `CSIMigrationvSphere`, `CSIMigrationOpenStack`. + 1.2 However, only control-plane operators will understand the `CSIMigrationControlPlane` *FeatureSet* and enable/disable the feature gates above. The machine-config-operator (MCO) will **ignore** it. + 1.3 On the other hand, only MCO will react to the `CSIMigrationNode` *FeatureSet*. Control-plane operators will ignore it. +2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: + 2.1 Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: + ```shell + $ oc edit featuregates/cluster -### Proposal II (rejected): New FeatureSet + (...) + spec: + featureSet: CSIMigrationControlPlane + ``` + 2.2 Once all control-plane components have restart, add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object: + ```shell + $ oc edit featuregates/cluster + (...) + spec: + featureSet: CSIMigrationControlPlane, CSIMigrationNode + ``` +3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: + 3.1 Add the `CSIMigrationNode` *FeatureSet* to the `featuregates/cluster` object. + 3.2 Wait for all Nodes to be drained and restarted. + 3.3 Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object. +4. It is the **responsability of the cluster administrator** to guarante the ordering described in [Goals](#goals). -We can create a new *FeatureSet* to enable or disable the CSI migration in a specific cloud. Each cloud platform will have its dedicated FeatureSet. +### GA -This is what should be done: +Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as wel, and there will not be an option to disable it. In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operatinal. In that case, those *FeatureSets* will be ignore by operators. -1. First, we create [a new FeatureSet](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) called `CSIMigrationAWS`. -2. This FeatureSet contains contains 2 features gates enabled: `CSIMigration` and `CSIMigrationAWS`. -3. If a cluster administrator decides to enable the CSI Migration for AWS, they would add the FeatureSet to the `featuregates/cluster` object: -```shell -$ oc edit featuregates/cluster +As for the required ordering described in [Goals]g(#goals), the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of applying the feature gates in the correct order. -(...) -spec: - featureSet: CSIMigrationAWS -``` -4. OCP will restart all components with both feature gates above enabled. -5. Once the cluster administrator decides to disable the CSI migration for AWS, they would undo the step 3 above. +#### Limitations -Even though this solution allows the cluster to be upgraded, it's still not possible to control the order in each the components apply the feature gates. +The limitations of this approach lies on the downgrade process. -### Proposal III (to be rejected): New FeatureSets and custom code +In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. -This is similar to the previous approach. +However, as stated in [Goals](#goals), this ordering is not desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. -1. We create 2 *FeatureSets* to support CSI migration in AWS: `CSIMigrationAWSNode` and `CSIMigrationAWSControlPlane`. - 1.1 Only control-plane operators will understand the `CSIMigrationAWSControlPlane` *FeatureSet*. The kubelet will ignore it. - 1.2 Only the kubelet will react to the `CSIMigrationAWSNode` *FeatureSet*. Control-plane operators will ignore it. -2. Both FeatureSets above enable 2 feature gates: `CSIMigration` and `CSIMigrationAWS`. -3. To enable the CSI Migration for AWS, either the cluster administrator or an operator would: - 3.1 Add the `CSIMigrationAWSControlPlane` *FeatureSet* to the `featuregates/cluster` object. - 3.3 Once all control-plane components restarted, add the `CSIMigrationAWSNode` *FeatureSet* to the `featuregates/cluster` object. -4. In order to disable the migration, follow the opposite steps: first remove `CSIMigrationAWSNode`, then `CSIMigrationAWSControlplane`. +To address this issue we propose to document a simple workaround: -Although this approach does what we need, it has many drawbacks: +1. Enable both `CSIMigrationNode`and `CSIMigrationControlplane` *FeatureSets*. +1. Downgrade. -1. Having the control-plane or the kubelet ignoring certain FeatureSets is not common. -1. A new operator needs to be created. -1. We need to patch many operators in order to ignore *FeatureSets*. +As stated above, the *FeatureSets* `CSIMigrationNode`and `CSIMigrationControlplane` are not operational once CSI migration becomes GA. However, they will be carried over during the downgrade and they will be correctly applied once the system is downgraded. ### Risks and Mitigations -## Design Details +Although this approach does what we need, it has some drawbacks: -### Open Questions - -There are a few things that we don't know yet: - -1. Once we switch the feature gate using the `FeatureGate` resource, can we control the order that Kubernetes components will apply the change? If that's possible, we won't need the **Proposal III** above. +1. Having the control-plane or the kubelet ignoring certain FeatureSets is not common. +1. We need to patch many operators in order to ignore *FeatureSets*. From c1589d898cc55b61fb12ba755953a16892b59004 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Thu, 28 Jan 2021 15:22:39 +0100 Subject: [PATCH 03/28] Re-arrange content to make things clearer --- enhancements/storage/csi-migration.md | 44 +++++++++++++++------------ 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 082a3980ca..257217c8fa 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -26,30 +26,23 @@ superseded-by: ## Summary -We want to allow cluster administrators to enable and disable the migration of in-tree volumes to their counterparts CSI drivers before the feature becomes GA in upstream. +We want to allow cluster administrators to seamlessly migrate in-tree volumes to their counterparts CSI drivers. It is important to achieve this goal before CSI Migration feature becomes GA in upstream. ## Motivation CSI migration is going to tentatively be enabled by default in Kubernetes 1.22 (OCP 4.9). As a result, volumes from in-tree plugins will be migrated to their counterpart CSI driver by default. -In order to avoid surprises once the migration is enabled by default, we want to allow cluster administrators to optionally enable the migration earlier, preferably in OCP 4.8. +In order to avoid surprises once the migration is enabled by default, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. ## Goals -Introduce a mechanism to switch certain feature gates on and off accross OCP components *in a predefined order*. +For Tech Preview, we want to introduce a mechanism to switch certain feature gates on and off accross OCP components. -* When the CSI migration is **enabled**, events should happen in this order: - 1. Enable the feature gate in all control-plane components. - 2. Once that's done, drain nodes one-by-one and start the kubelet with the feature gate enabled. - -* When the CSI migration is **disabled**, events should happen in this order: - 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. - 2. Once that's done, disable the feature gate in all control-plane components. - -This mechanism should be used as a Tech Preview feature in OCP. This mechanism will be disabled in OCP once CSI migration becomes GA in upstream. +Once CSI migration is enabled by default in upstream, it will not be possible to disable it again. Therefore, given mechanism shall be disabled in OCP once CSI migration becomes GA in upstream. ## Non-Goals +* Control the ordering in which OCP components will be upgraded or downgraded. * Install or remove the CSI driver as the migration is enabled or disabled. ## Proposal @@ -59,20 +52,29 @@ The CSI migration feature is hidden behind feature gates in Kubernetes. For inst * CSIMigration * CSIMigrationAWS +CSI Migration requires that feature flags are enabled and disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). In summary: + +* When the CSI migration is **enabled**, events should happen in this order: + 1. Enable the feature gate in all control-plane components. + 2. Once that's done, drain nodes one-by-one and start the kubelet with the feature gate enabled. + +* When the CSI migration is **disabled**, events should happen in this order: + 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. + 2. Once that's done, disable the feature gate in all control-plane components. + In order to enable and disable these feature gates in OCP, we propose two different approaches to be used at different times during the feature lifecycle. The first approach is intended to be used once CSI migration is Tech Preview in OCP. The other approach should be used to once we graduate CSI migration to GA. ### Tech Preview - In OCP, *FeatureSets* can be used to aggregate one or more feature gates. Then, the resource [FeatureGate](https://github.com/openshift/api/blob/dca637550e8c80dc2fa5ff6653b43a3b5c6c810c/config/v1/types_feature.go#L9-L21) can be used to enable and disable *FeatureSets*. With that in mind, we propose to: 1. Create two [new FeatureSets](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) to support CSI migration: `CSIMigrationNode` and `CSIMigrationControlPlane`. - 1.1 Both *FeatureSets* above enable the same feature gates: `CSIMigration`, `CSIMIgrationAWS`, `CSIMigrationGCE`, `CSIMigrationAzureDisk`, `CSIMigrationAzureFile`, `CSIMigrationvSphere`, `CSIMigrationOpenStack`. - 1.2 However, only control-plane operators will understand the `CSIMigrationControlPlane` *FeatureSet* and enable/disable the feature gates above. The machine-config-operator (MCO) will **ignore** it. + 1.1 Both *FeatureSets* contain the same feature gates: `CSIMigration`, `CSIMIgrationAWS`, `CSIMigrationGCE`, `CSIMigrationAzureDisk`, `CSIMigrationAzureFile`, `CSIMigrationvSphere`, `CSIMigrationOpenStack`. + 1.2 However, only control-plane operators will reacto to the `CSIMigrationControlPlane` *FeatureSet*. The machine-config-operator (MCO) will **ignore** it. 1.3 On the other hand, only MCO will react to the `CSIMigrationNode` *FeatureSet*. Control-plane operators will ignore it. 2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: 2.1 Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: @@ -94,13 +96,15 @@ With that in mind, we propose to: 3.1 Add the `CSIMigrationNode` *FeatureSet* to the `featuregates/cluster` object. 3.2 Wait for all Nodes to be drained and restarted. 3.3 Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object. -4. It is the **responsability of the cluster administrator** to guarante the ordering described in [Goals](#goals). +4. It is the **responsability of the cluster administrator** to guarante the ordering described above is respected. ### GA -Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as wel, and there will not be an option to disable it. In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operatinal. In that case, those *FeatureSets* will be ignore by operators. +Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as wel, and there will not be an option to disable it. + +In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operatinal. In that case, those *FeatureSets* should be ignored by operators. -As for the required ordering described in [Goals]g(#goals), the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of applying the feature gates in the correct order. +As for the required ordering described above, the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of applying the feature gates in the correct order. #### Limitations @@ -108,7 +112,7 @@ The limitations of this approach lies on the downgrade process. In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. -However, as stated in [Goals](#goals), this ordering is not desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. +However, as stated above, this ordering is **not** desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. To address this issue we propose to document a simple workaround: @@ -121,5 +125,5 @@ As stated above, the *FeatureSets* `CSIMigrationNode`and `CSIMigrationControlpla Although this approach does what we need, it has some drawbacks: -1. Having the control-plane or the kubelet ignoring certain FeatureSets is not common. +1. Having operators ignoring certain *FeatureSets* is not common and error-prone. 1. We need to patch many operators in order to ignore *FeatureSets*. From 9f99434707ebfea16b96a8d02fd90c1fc3cf2232 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Thu, 28 Jan 2021 16:14:49 +0100 Subject: [PATCH 04/28] Add test plan --- enhancements/storage/csi-migration.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 257217c8fa..bdc4b62ce5 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -127,3 +127,20 @@ Although this approach does what we need, it has some drawbacks: 1. Having operators ignoring certain *FeatureSets* is not common and error-prone. 1. We need to patch many operators in order to ignore *FeatureSets*. + +## Design Details + +### Test Plan + +#### E2E jobs + +We want to create E2E jobs for all migrated plugins. + +For each E2E job: + +1. Install an OCP cluster. +1. Enable the feature gate. Don't worry about the order in which components will apply the feature flags because at this point there are no volumes created. +1. Wait until MCO and control-plane operators restart. +1. Run E2E tests for in-tree volume plugins. +1. Disable the feature gate. Again, don't worry about the ordering. +1. Run E2E tests again. From 5e7ec815a3932cf09c760f6a9a98120677ac00ae Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Thu, 28 Jan 2021 16:41:41 +0100 Subject: [PATCH 05/28] Fix numeric lists --- enhancements/storage/csi-migration.md | 41 +++++++++++++-------------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index bdc4b62ce5..fea71df3bb 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -73,29 +73,28 @@ In OCP, *FeatureSets* can be used to aggregate one or more feature gates. Then, With that in mind, we propose to: 1. Create two [new FeatureSets](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) to support CSI migration: `CSIMigrationNode` and `CSIMigrationControlPlane`. - 1.1 Both *FeatureSets* contain the same feature gates: `CSIMigration`, `CSIMIgrationAWS`, `CSIMigrationGCE`, `CSIMigrationAzureDisk`, `CSIMigrationAzureFile`, `CSIMigrationvSphere`, `CSIMigrationOpenStack`. - 1.2 However, only control-plane operators will reacto to the `CSIMigrationControlPlane` *FeatureSet*. The machine-config-operator (MCO) will **ignore** it. - 1.3 On the other hand, only MCO will react to the `CSIMigrationNode` *FeatureSet*. Control-plane operators will ignore it. + * Both *FeatureSets* contain the same feature gates: `CSIMigration`, `CSIMIgrationAWS`, `CSIMigrationGCE`, `CSIMigrationAzureDisk`, `CSIMigrationAzureFile`, `CSIMigrationvSphere`, `CSIMigrationOpenStack`. + * However, only control-plane operators will reacto to the `CSIMigrationControlPlane` *FeatureSet*. The machine-config-operator (MCO) will **ignore** it. + * On the other hand, only MCO will react to the `CSIMigrationNode` *FeatureSet*. Control-plane operators will ignore it. 2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: - 2.1 Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: - ```shell - $ oc edit featuregates/cluster - - (...) - spec: - featureSet: CSIMigrationControlPlane - ``` - 2.2 Once all control-plane components have restart, add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object: - ```shell - $ oc edit featuregates/cluster - (...) - spec: - featureSet: CSIMigrationControlPlane, CSIMigrationNode - ``` + * Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: + ```shell + $ oc edit featuregates/cluster + (...) + spec: + featureSet: CSIMigrationControlPlane + ``` + * Once all control-plane components have restart, add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object: + ```shell + $ oc edit featuregates/cluster + (...) + spec: + featureSet: CSIMigrationControlPlane, CSIMigrationNode + ``` 3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: - 3.1 Add the `CSIMigrationNode` *FeatureSet* to the `featuregates/cluster` object. - 3.2 Wait for all Nodes to be drained and restarted. - 3.3 Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object. + * Add the `CSIMigrationNode` *FeatureSet* to the `featuregates/cluster` object. + * Wait for all Nodes to be drained and restarted. + * Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object. 4. It is the **responsability of the cluster administrator** to guarante the ordering described above is respected. ### GA From 2f777ce48a04e57105250307a95cc2a3832da590 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Thu, 28 Jan 2021 16:45:16 +0100 Subject: [PATCH 06/28] Update date --- enhancements/storage/csi-migration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index fea71df3bb..bf386ce453 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -7,7 +7,7 @@ reviewers: approvers: - "@openshift/openshift-architects" creation-date: 2020-07-01 -last-updated: 2020-07-01 +last-updated: 2021-01-28 status: provisional see-also: replaces: From f1d95444533a3bcb663e528bfc92a85140a0c0ed Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 29 Jan 2021 13:17:15 +0100 Subject: [PATCH 07/28] Update wording --- enhancements/storage/csi-migration.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index bf386ce453..d3c051c0d3 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -36,23 +36,25 @@ In order to avoid surprises once the migration is enabled by default, we want to ## Goals -For Tech Preview, we want to introduce a mechanism to switch certain feature gates on and off accross OCP components. +For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off accross OCP components. -Once CSI migration is enabled by default in upstream, it will not be possible to disable it again. Therefore, given mechanism shall be disabled in OCP once CSI migration becomes GA in upstream. +Due to upstream requirements, it is important that this mechanism allows for enabling the feature flags on control-plane components **before** the kubelet, and vice-versa. + +Once CSI migration is enabled by default in upstream, it will not be possible to disable it again. Therefore, such mechanism shall be disabled in OCP once CSI migration becomes GA in upstream. ## Non-Goals -* Control the ordering in which OCP components will be upgraded or downgraded. +* Control the ordering in which OCP components will be upgraded or downgraded. We will leave this responsability to the user. * Install or remove the CSI driver as the migration is enabled or disabled. ## Proposal -The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should enable these two feature gates: +The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: * CSIMigration * CSIMigrationAWS -CSI Migration requires that feature flags are enabled and disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). In summary: +Nevertheless, what makes things more complicated is the strict order in which those flags need to be enabled across components. In other words, CSI Migration requires that feature flags are enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). In summary: * When the CSI migration is **enabled**, events should happen in this order: 1. Enable the feature gate in all control-plane components. @@ -62,7 +64,7 @@ CSI Migration requires that feature flags are enabled and disabled in a [specifi 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. 2. Once that's done, disable the feature gate in all control-plane components. -In order to enable and disable these feature gates in OCP, we propose two different approaches to be used at different times during the feature lifecycle. +In order to achieve that, we propose two different approaches to be used at different times during the feature lifecycle. The first approach is intended to be used once CSI migration is Tech Preview in OCP. The other approach should be used to once we graduate CSI migration to GA. @@ -74,7 +76,7 @@ With that in mind, we propose to: 1. Create two [new FeatureSets](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) to support CSI migration: `CSIMigrationNode` and `CSIMigrationControlPlane`. * Both *FeatureSets* contain the same feature gates: `CSIMigration`, `CSIMIgrationAWS`, `CSIMigrationGCE`, `CSIMigrationAzureDisk`, `CSIMigrationAzureFile`, `CSIMigrationvSphere`, `CSIMigrationOpenStack`. - * However, only control-plane operators will reacto to the `CSIMigrationControlPlane` *FeatureSet*. The machine-config-operator (MCO) will **ignore** it. + * However, only control-plane operators will react to the `CSIMigrationControlPlane` *FeatureSet*. The machine-config-operator (MCO) will **ignore** it. * On the other hand, only MCO will react to the `CSIMigrationNode` *FeatureSet*. Control-plane operators will ignore it. 2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: * Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: @@ -101,7 +103,7 @@ With that in mind, we propose to: Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as wel, and there will not be an option to disable it. -In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operatinal. In that case, those *FeatureSets* should be ignored by operators. +In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags will be enabled by default. In that case, those *FeatureSets* should be ignored by operators. As for the required ordering described above, the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of applying the feature gates in the correct order. @@ -133,7 +135,7 @@ Although this approach does what we need, it has some drawbacks: #### E2E jobs -We want to create E2E jobs for all migrated plugins. +We want E2E jobs for all migrated plugins ready at **Tech Preview** time. For each E2E job: From 8ae2e9c7aa7c5e9541cdf1a12df4e7a288e19d33 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Mon, 1 Feb 2021 15:54:17 +0100 Subject: [PATCH 08/28] Address review comments --- enhancements/storage/csi-migration.md | 91 ++++++++++++++++++--------- 1 file changed, 63 insertions(+), 28 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index d3c051c0d3..1f6299165e 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -30,9 +30,11 @@ We want to allow cluster administrators to seamlessly migrate in-tree volumes to ## Motivation -CSI migration is going to tentatively be enabled by default in Kubernetes 1.22 (OCP 4.9). As a result, volumes from in-tree plugins will be migrated to their counterpart CSI driver by default. +CSI migration is going to be enabled by default as a beta feature in Kubernetes 1.21 (OCP 4.8). In Kubenertes 1.22 (OCP 4.9) the feature may become GA. -In order to avoid surprises once the migration is enabled by default, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. +In OCP we can optionally disable beta features, however, that will no longer be an option once CSI migration becomes GA. + +In order to avoid surprises once the migration is enabled by default in OCP, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. ## Goals @@ -54,7 +56,9 @@ The CSI migration feature is hidden behind feature gates in Kubernetes. For inst * CSIMigration * CSIMigrationAWS -Nevertheless, what makes things more complicated is the strict order in which those flags need to be enabled across components. In other words, CSI Migration requires that feature flags are enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). In summary: +Nevertheless, what makes things more complicated is the strict order in which those flags need to be switched across components. In other words, CSI Migration requires that feature flags are enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). When enabling the feature, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. When disabling the feature, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. + +In summary: * When the CSI migration is **enabled**, events should happen in this order: 1. Enable the feature gate in all control-plane components. @@ -75,28 +79,40 @@ In OCP, *FeatureSets* can be used to aggregate one or more feature gates. Then, With that in mind, we propose to: 1. Create two [new FeatureSets](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) to support CSI migration: `CSIMigrationNode` and `CSIMigrationControlPlane`. - * Both *FeatureSets* contain the same feature gates: `CSIMigration`, `CSIMIgrationAWS`, `CSIMigrationGCE`, `CSIMigrationAzureDisk`, `CSIMigrationAzureFile`, `CSIMigrationvSphere`, `CSIMigrationOpenStack`. - * However, only control-plane operators will react to the `CSIMigrationControlPlane` *FeatureSet*. The machine-config-operator (MCO) will **ignore** it. - * On the other hand, only MCO will react to the `CSIMigrationNode` *FeatureSet*. Control-plane operators will ignore it. + * Both *FeatureSets* will contain the same feature gates: + * `CSIMigration` + * `CSIMIgrationAWS` + * `CSIMigrationGCE` + * `CSIMigrationAzureDisk` + * `CSIMigrationAzureFile` + * `CSIMigrationvSphere` + * `CSIMigrationOpenStack` + * The machine-config-operator (MCO) will **ignore** the `CSIMigrationControlPlane` **FeatureSet**. + * On the other hand, all operators will react to the `CSIMigrationNode` *FeatureSet*, including control-plane operators. 2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: - * Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: - ```shell - $ oc edit featuregates/cluster - (...) - spec: - featureSet: CSIMigrationControlPlane - ``` - * Once all control-plane components have restart, add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object: - ```shell - $ oc edit featuregates/cluster - (...) - spec: - featureSet: CSIMigrationControlPlane, CSIMigrationNode - ``` + * First, enable the CSI migration feature flags in all control-plane components: + * Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: + ```shell + $ oc edit featuregates/cluster + (...) + spec: + featureSet: CSIMigrationControlPlane + ``` + * With the exception of MCO, all operators will recognize this *FeatureSet* and will apply that associated feature gates to their operands. + * Second, once all control-plane components have restarted, enable the CSI migration feature flags in all kubelets: + * Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`): + ```shell + $ oc edit featuregates/cluster + (...) + spec: + featureSet: CSIMigrationNode + ``` + * All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. + * At this point, CSI migration is fully enabled in the cluster. 3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: - * Add the `CSIMigrationNode` *FeatureSet* to the `featuregates/cluster` object. - * Wait for all Nodes to be drained and restarted. - * Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object. + * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. + * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. + * Remove the `CSIMigrationControlPlane` *FeatureSet* from the `featuregates/cluster` object. 4. It is the **responsability of the cluster administrator** to guarante the ordering described above is respected. ### GA @@ -109,13 +125,11 @@ As for the required ordering described above, the [upgrade order](https://github #### Limitations -The limitations of this approach lies on the downgrade process. +The limitations of this approach lies on the downgrade process. In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. However, as stated above, this ordering is **not** desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. -In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. +It is important to note that the order in which operators are downgraded in OCP violates upstream version skew policy. A new kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround for CSI migration. -However, as stated above, this ordering is **not** desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. - -To address this issue we propose to document a simple workaround: +That being said, to address this issue we propose to document a simple workaround: 1. Enable both `CSIMigrationNode`and `CSIMigrationControlplane` *FeatureSets*. 1. Downgrade. @@ -131,6 +145,18 @@ Although this approach does what we need, it has some drawbacks: ## Design Details +* openshift/api + * Introduce two new *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` +* Machine Config Operator + * Bump openshift/api in order to get the *FeatureSets* above. + * Ignore the `CSIMigrationControlPlane` *FeatureSet* +* Kubernetes Scheduler Operator + * Bump openshift/api in order to get the *FeatureSets* above. +* Kubernetes Controller Manager Operator + * Bump openshift/api in order to get the *FeatureSets* above. + +Other operators, like the Kubernetes API Server Operator, may have their openshift/api library bumped. However, that is not strictly necessary as their operands don't need to enable the CSI migration feature flags. + ### Test Plan #### E2E jobs @@ -145,3 +171,12 @@ For each E2E job: 1. Run E2E tests for in-tree volume plugins. 1. Disable the feature gate. Again, don't worry about the ordering. 1. Run E2E tests again. + +In addition to that, as a strech goal we want a separate job that: + +1. Runs a StatefulSet. +1. Enables the migration *FeatureSets* in the right order. +1. Wait for all components to have the right feature flags. +1. Checks if the StatefulSet survives. + +Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades from an OCP version with migration disabled to a version with migration enabled. From 46274a94c8e009faa7356a2b126575a291824360 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Tue, 2 Feb 2021 11:15:40 +0100 Subject: [PATCH 09/28] Add a few more details --- enhancements/storage/csi-migration.md | 98 +++++++++++++++++---------- 1 file changed, 63 insertions(+), 35 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 1f6299165e..d88a39a5f9 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -56,7 +56,11 @@ The CSI migration feature is hidden behind feature gates in Kubernetes. For inst * CSIMigration * CSIMigrationAWS -Nevertheless, what makes things more complicated is the strict order in which those flags need to be switched across components. In other words, CSI Migration requires that feature flags are enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). When enabling the feature, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. When disabling the feature, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. +Nevertheless, what makes things more complicated is the strict order in which those flags need to be switched across components. In other words, CSI Migration requires that feature flags are enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). + +It is important to respect this ordering to avoid an undesired state of the components of volumes. For instance, when enabling the feature, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. + +In the same vein, when disabling the feature, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. In summary: @@ -70,7 +74,7 @@ In summary: In order to achieve that, we propose two different approaches to be used at different times during the feature lifecycle. -The first approach is intended to be used once CSI migration is Tech Preview in OCP. The other approach should be used to once we graduate CSI migration to GA. +The first approach is intended to be used for is Tech Preview in OCP. The other approach should be used to once we graduate CSI migration to GA. ### Tech Preview @@ -79,36 +83,36 @@ In OCP, *FeatureSets* can be used to aggregate one or more feature gates. Then, With that in mind, we propose to: 1. Create two [new FeatureSets](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) to support CSI migration: `CSIMigrationNode` and `CSIMigrationControlPlane`. - * Both *FeatureSets* will contain the same feature gates: - * `CSIMigration` - * `CSIMIgrationAWS` - * `CSIMigrationGCE` - * `CSIMigrationAzureDisk` - * `CSIMigrationAzureFile` - * `CSIMigrationvSphere` - * `CSIMigrationOpenStack` + * Both *FeatureSets* will enable the **same** feature gates: + - `CSIMigration` + - `CSIMIgrationAWS` + - `CSIMigrationGCE` + - `CSIMigrationAzureDisk` + - `CSIMigrationAzureFile` + - `CSIMigrationvSphere` + - `CSIMigrationOpenStack` * The machine-config-operator (MCO) will **ignore** the `CSIMigrationControlPlane` **FeatureSet**. * On the other hand, all operators will react to the `CSIMigrationNode` *FeatureSet*, including control-plane operators. 2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: * First, enable the CSI migration feature flags in all control-plane components: - * Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: + - Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: ```shell $ oc edit featuregates/cluster (...) spec: featureSet: CSIMigrationControlPlane ``` - * With the exception of MCO, all operators will recognize this *FeatureSet* and will apply that associated feature gates to their operands. - * Second, once all control-plane components have restarted, enable the CSI migration feature flags in all kubelets: - * Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`): - ```shell - $ oc edit featuregates/cluster - (...) - spec: - featureSet: CSIMigrationNode - ``` - * All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. - * At this point, CSI migration is fully enabled in the cluster. + - With the exception of MCO, all operators will recognize this *FeatureSet* and will initialize their operands with the associated feature gates. + * Second, once all control-plane components have restarted, enable the CSI migration feature flags in the kubelet: + - Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`): + ```shell + $ oc edit featuregates/cluster + (...) + spec: + featureSet: CSIMigrationNode + ``` + - All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. + * At this point, CSI migration is fully enabled in the cluster. 3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. @@ -119,44 +123,68 @@ With that in mind, we propose to: Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as wel, and there will not be an option to disable it. -In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags will be enabled by default. In that case, those *FeatureSets* should be ignored by operators. +In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. As for the required ordering described above, the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of applying the feature gates in the correct order. #### Limitations -The limitations of this approach lies on the downgrade process. In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. However, as stated above, this ordering is **not** desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. +The limitations of this approach lies on the downgrade process. In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. -It is important to note that the order in which operators are downgraded in OCP violates upstream version skew policy. A new kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround for CSI migration. +However, this ordering is **not** desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. + +It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). +The policy states that a new kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. That being said, to address this issue we propose to document a simple workaround: -1. Enable both `CSIMigrationNode`and `CSIMigrationControlplane` *FeatureSets*. +1. Enable the `CSIMigrationNode` *FeatureSet*. 1. Downgrade. -As stated above, the *FeatureSets* `CSIMigrationNode`and `CSIMigrationControlplane` are not operational once CSI migration becomes GA. However, they will be carried over during the downgrade and they will be correctly applied once the system is downgraded. +As stated above, the CSI migration *FeatureSets* are not operational once CSI migration becomes GA. However, they will be carried over during the downgrade and they will be correctly applied once the system is downgraded. + +### Post-GA + +CSI migration *FeatureSets* can be removed from OCP API **one** release after CSI migration becomes GA. ### Risks and Mitigations -Although this approach does what we need, it has some drawbacks: +Although this three-phased approach does what we need, it has some drawbacks: -1. Having operators ignoring certain *FeatureSets* is not common and error-prone. -1. We need to patch many operators in order to ignore *FeatureSets*. +1. Having operators ignoring certain *FeatureSets* is not usual and is error-prone. Fortunately we only need to introduce the skipping in MCO. ## Design Details +This is what needs to be done across different support phases: + +1. Tech Preview: + * openshift/api * Introduce two new *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` * Machine Config Operator - * Bump openshift/api in order to get the *FeatureSets* above. - * Ignore the `CSIMigrationControlPlane` *FeatureSet* + * Bump openshift/api in order to get the *FeatureSets* above + * Introduce a patch to ignore the `CSIMigrationControlPlane` *FeatureSet* * Kubernetes Scheduler Operator - * Bump openshift/api in order to get the *FeatureSets* above. + * Bump openshift/api in order to get the *FeatureSets* above * Kubernetes Controller Manager Operator - * Bump openshift/api in order to get the *FeatureSets* above. + * Bump openshift/api in order to get the *FeatureSets* above Other operators, like the Kubernetes API Server Operator, may have their openshift/api library bumped. However, that is not strictly necessary as their operands don't need to enable the CSI migration feature flags. +2. GA + +Nothing needs to be done. + +3. GA + 1 release + +* openshift/api + * Remove CSI migration *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` +* Machine Config Operator + * Bump openshift/api + * Remove the skip added for Tech Preview +* Other operators + * Bump openshift/api + ### Test Plan #### E2E jobs @@ -172,7 +200,7 @@ For each E2E job: 1. Disable the feature gate. Again, don't worry about the ordering. 1. Run E2E tests again. -In addition to that, as a strech goal we want a separate job that: +In addition to that, as a strech goal, we want a separate job that: 1. Runs a StatefulSet. 1. Enables the migration *FeatureSets* in the right order. From a7d8a77c7cb5214a95a548eadb8d66db294fe8b8 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Wed, 3 Feb 2021 14:21:45 +0100 Subject: [PATCH 10/28] Address review comments --- enhancements/storage/csi-migration.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index d88a39a5f9..db543a1589 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -58,7 +58,7 @@ The CSI migration feature is hidden behind feature gates in Kubernetes. For inst Nevertheless, what makes things more complicated is the strict order in which those flags need to be switched across components. In other words, CSI Migration requires that feature flags are enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). -It is important to respect this ordering to avoid an undesired state of the components of volumes. For instance, when enabling the feature, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. +It is important to respect this ordering to avoid an undesired state of the components of volumes. For instance, when enabling the feature in the wrong order, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. In the same vein, when disabling the feature, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. @@ -121,7 +121,7 @@ With that in mind, we propose to: ### GA -Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as wel, and there will not be an option to disable it. +Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as well, and there will not be an option to disable it. In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. @@ -194,13 +194,12 @@ We want E2E jobs for all migrated plugins ready at **Tech Preview** time. For each E2E job: 1. Install an OCP cluster. -1. Enable the feature gate. Don't worry about the order in which components will apply the feature flags because at this point there are no volumes created. -1. Wait until MCO and control-plane operators restart. +1. Enable the feature gate in the right order. Even though it is a fresh cluster, we need to respect the order because [volumes might be created in CI](https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/monitoringpvc/ipi-install-monitoringpvc-ref.yaml). 1. Run E2E tests for in-tree volume plugins. -1. Disable the feature gate. Again, don't worry about the ordering. -1. Run E2E tests again. +1. Disable the feature gate in the right order. +1. Once again, run E2E tests for in-tree volume plugins. -In addition to that, as a strech goal, we want a separate job that: +In addition to that, as a stretch goal, we want a separate job that: 1. Runs a StatefulSet. 1. Enables the migration *FeatureSets* in the right order. From e756bc95dd84a5e38c1de18b9ed1331044a7da8d Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Tue, 9 Feb 2021 08:54:25 +0100 Subject: [PATCH 11/28] Add darkmuggle as approver from Node side --- enhancements/storage/csi-migration.md | 1 + 1 file changed, 1 insertion(+) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index db543a1589..d917421a83 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -6,6 +6,7 @@ reviewers: - "@openshift/storage ” approvers: - "@openshift/openshift-architects" + - "@darkmuggle" creation-date: 2020-07-01 last-updated: 2021-01-28 status: provisional From 28a41d08c4b9852d943ebfba2c2ae1f64adf4489 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 12 Feb 2021 11:33:10 +0100 Subject: [PATCH 12/28] Address review comments --- enhancements/storage/csi-migration.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index d917421a83..e60a2daf53 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -31,9 +31,11 @@ We want to allow cluster administrators to seamlessly migrate in-tree volumes to ## Motivation -CSI migration is going to be enabled by default as a beta feature in Kubernetes 1.21 (OCP 4.8). In Kubenertes 1.22 (OCP 4.9) the feature may become GA. +[CSI migration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration) is an upstream effort to migrate in-tree volume plugins to their counterpart CSI drivers. The feature is beta since Kubernetes 1.17, however, as of Kubernetes 1.20 it is still disabled by default. -In OCP we can optionally disable beta features, however, that will no longer be an option once CSI migration becomes GA. +That is going to change in Kubernetes 1.21 (OCP 4.8), where the feature will remain beta, but enabled by default. In Kubernetes 1.22 (OCP 4.9) the feature may become GA. + +In OCP we can optionally disable CSI migration feature while the it is still beta, however, that will no longer be an option once CSI migration becomes GA. In order to avoid surprises once the migration is enabled by default in OCP, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. @@ -41,13 +43,13 @@ In order to avoid surprises once the migration is enabled by default in OCP, we For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off accross OCP components. -Due to upstream requirements, it is important that this mechanism allows for enabling the feature flags on control-plane components **before** the kubelet, and vice-versa. +Due to upstream requirements, it is important that this mechanism allows for enabling the feature flags on control-plane components **before** the kubelet, and vice versa. Once CSI migration is enabled by default in upstream, it will not be possible to disable it again. Therefore, such mechanism shall be disabled in OCP once CSI migration becomes GA in upstream. ## Non-Goals -* Control the ordering in which OCP components will be upgraded or downgraded. We will leave this responsability to the user. +* Control the ordering in which OCP components will be upgraded or downgraded. * Install or remove the CSI driver as the migration is enabled or disabled. ## Proposal @@ -118,7 +120,7 @@ With that in mind, we propose to: * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. * Remove the `CSIMigrationControlPlane` *FeatureSet* from the `featuregates/cluster` object. -4. It is the **responsability of the cluster administrator** to guarante the ordering described above is respected. +4. It is the **responsibility of the cluster administrator** to guarante the ordering described above is respected. ### GA @@ -126,7 +128,9 @@ Once CSI migration reaches GA in upstream, the associated feature gates will be In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. -As for the required ordering described above, the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of applying the feature gates in the correct order. +As for the required ordering described above, the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of starting components in the desired order. + +In this phase, CSI migration feature gates are enabled by default in all components, so restarting control-plane components before the kubelet is enough to guarantee a smooth feature enablement. #### Limitations @@ -152,7 +156,7 @@ CSI migration *FeatureSets* can be removed from OCP API **one** release after CS Although this three-phased approach does what we need, it has some drawbacks: -1. Having operators ignoring certain *FeatureSets* is not usual and is error-prone. Fortunately we only need to introduce the skipping in MCO. +1. For Tech Preview, users might enable or disable *FeatureSets* in the wrong order, causing issues with attaching or detaching existing volumes. We plan to address this through documentation. ## Design Details From e87f692052545ad19457815e6f339e9f4b382fcd Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Thu, 4 Mar 2021 18:55:01 +0100 Subject: [PATCH 13/28] Add newer approach --- enhancements/storage/csi-migration.md | 243 ++++++++++++++++++-------- 1 file changed, 172 insertions(+), 71 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index e60a2daf53..5a1e0752f3 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -1,5 +1,5 @@ --- -title: Migration of in-tree volume plugins to CSI drivers +title: csi-migration authors: - "@fbertina" reviewers: @@ -27,136 +27,232 @@ superseded-by: ## Summary -We want to allow cluster administrators to seamlessly migrate in-tree volumes to their counterparts CSI drivers. It is important to achieve this goal before CSI Migration feature becomes GA in upstream. +We want to allow cluster administrators to seamlessly migrate volumes created using +the in-tree storage plugin to their counterparts CSI drivers. It is important to +achieve this goal before CSI Migration feature becomes GA in upstream. ## Motivation -[CSI migration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration) is an upstream effort to migrate in-tree volume plugins to their counterpart CSI drivers. The feature is beta since Kubernetes 1.17, however, as of Kubernetes 1.20 it is still disabled by default. +[CSI migration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration) +is an upstream effort to migrate in-tree volume plugins to their counterpart CSI +drivers. The feature is beta since Kubernetes 1.17, however, as of Kubernetes 1.20 +it is still *disabled* by default. -That is going to change in Kubernetes 1.21 (OCP 4.8), where the feature will remain beta, but enabled by default. In Kubernetes 1.22 (OCP 4.9) the feature may become GA. +That is going to change in Kubernetes 1.21 (OCP 4.8), where the feature will remain +beta, but *enabled* by default. In Kubernetes 1.22 (OCP 4.9) the feature may become GA. -In OCP we can optionally disable CSI migration feature while the it is still beta, however, that will no longer be an option once CSI migration becomes GA. +In OCP we can optionally disable the CSI migration feature while it is still beta, +however, that will no longer be an option once CSI migration becomes GA. -In order to avoid surprises once the migration is enabled by default in OCP, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. +In order to avoid surprises once the migration is enabled by default in OCP, we want +to allow cluster administrators to optionally enable the feature earlier, preferably +in OCP 4.8. -## Goals +### Goals -For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off accross OCP components. +Our goals are different throughout our support lifecycle. -Due to upstream requirements, it is important that this mechanism allows for enabling the feature flags on control-plane components **before** the kubelet, and vice versa. +For Tech Preview, we want to introduce a mechanism to allow switching CSI migration +feature flags on and off across OCP components. Due to upstream requirements +described in (the design proposal)[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateun-migrate], +it is important that this mechanism allows for switching the feature +flags in the the correct order. -Once CSI migration is enabled by default in upstream, it will not be possible to disable it again. Therefore, such mechanism shall be disabled in OCP once CSI migration becomes GA in upstream. +In other words, when enabling CSI migration, control-plane components should have +their feature flags enabled before the kubelet. The opposite order applies when +disabling CSI migration. -## Non-Goals +For GA, we will not support disabling CSI migration. Existing in-tree volumes will +be migrated to CSI and users should not have to do any additional work. We do want +to make sure we will not break downgrades should the user decide to do that. + +### Non-Goals * Control the ordering in which OCP components will be upgraded or downgraded. * Install or remove the CSI driver as the migration is enabled or disabled. ## Proposal -The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: +### Requirements + +The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, +to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, +the cluster administrator should turn on these two feature gates: *CSIMigration*, +and *CSIMigrationAWS*. -* CSIMigration -* CSIMigrationAWS +In OCP, we can easily set those feature gates by using the [FeatureGate] +(https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) +Custom Resource. OCP operators read this resource and restart their operands with +the appropriate features enabled. However, this approach alone is not acceptable +for CSI migration because the feature flags might be switched across components +in an arbitrary order. -Nevertheless, what makes things more complicated is the strict order in which those flags need to be switched across components. In other words, CSI Migration requires that feature flags are enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). +This is not acceptable because CSI Migration requires that feature flags are enabled +or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). -It is important to respect this ordering to avoid an undesired state of the components of volumes. For instance, when enabling the feature in the wrong order, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. +It is important to respect this ordering to avoid an undesired state of volumes. +For instance, if the feature is enabled in the kubelet before it is enabled in the +AttachDetach controller, volumes attached to nodes by the in-tree volume plugin +cannot be detached by the CSI driver and will stay attached forever. -In the same vein, when disabling the feature, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. +In the same vein, if the feature is disabled in the AttachDetach controller before +it is disabled in the kubelet, volumes attached by the CSI driver cannot be detached +by the in-tree volume plugin and will stay attached forever. -In summary: +In summary, this is what upstream recommends: * When the CSI migration is **enabled**, events should happen in this order: 1. Enable the feature gate in all control-plane components. - 2. Once that's done, drain nodes one-by-one and start the kubelet with the feature gate enabled. + 2. Once that's done, drain nodes one-by-one and start the kubelet with the + feature gate enabled. * When the CSI migration is **disabled**, events should happen in this order: 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. 2. Once that's done, disable the feature gate in all control-plane components. -In order to achieve that, we propose two different approaches to be used at different times during the feature lifecycle. +In addition to that, upstream has a mechanism to keep the AttachDetach Controller +informed about the status of the migration on the nodes. Roughly speaking, +Kubelet propagates to an annotation the information for each potentially migrated +in-tree plugin on the node. + +As a result, the AttachDetach Controller knows if the in-tree plugin has been +migrated on the Node. If the feature flags are enabled in KCM and on the Node, +the AttachDetach Controller uses the CSI driver to attach volumes. Otherwise, +it will falls back to the in-tree plugin. + +### OCP + +That being said, we propose to add a carry-patch to OCP that allows the AttachDetach +Controller to ignore the CSI Migration feature flags passed to Kube Controller Manager. +That way, when deciding about using either the CSI driver or the in-tree plugin, +the AttachDetch Controller will **only** rely on the information propagated by the Node. + +If the feature flags are disabled on the Node, the AttachDetach Controller will never use +the CSI driver to attach or detach volumes, which makes this a safe approach. + +The biggest benefit of this approach is that we do not need to worry about the ordering +in which components are restarted with the CSI migration flags switched on or off. + +Another benefit is that this approach should not break downgrades once CSI Migration +becomes GA. In OCP, a downgrade is fundamentally a regular upgrade to an older version. +That means that CVO downgrades components in the same order as it upgrades them: control-plane +first, nodes later. This would impose an issue if the user downgraded from a version with CSI +Migration enabled to a version with CSI Migration disabled. With the above patch in the downgraded +version, that would not be a problem. + +Once this patch is merged, during Tech Preview users can enable the CSI migration using a *FeatureSet*. +It could be shared with [CCMO](https://github.com/openshift/enhancements/pull/463), but we may want +to have a specific *FeatureSet* only for CSI Migration. + +Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default. +As a result, it will not be necessary to use *FeatureSets* anymore. + +## Alternatives + +We have considered this alternative approach. It is NOT our preferable approach because +it introduces many shortcomings: -The first approach is intended to be used for is Tech Preview in OCP. The other approach should be used to once we graduate CSI migration to GA. +* It is error prone. +* We put on the user the responsability of applying the *FeatureSets* in the correct order. +* We need to create 2 extra *FeatureSets* as opposed to 1. +* We need to patch MCO so that it ignores one of these *FeatureSets*. ### Tech Preview -In OCP, *FeatureSets* can be used to aggregate one or more feature gates. Then, the resource [FeatureGate](https://github.com/openshift/api/blob/dca637550e8c80dc2fa5ff6653b43a3b5c6c810c/config/v1/types_feature.go#L9-L21) can be used to enable and disable *FeatureSets*. - -With that in mind, we propose to: - -1. Create two [new FeatureSets](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L25-L43) to support CSI migration: `CSIMigrationNode` and `CSIMigrationControlPlane`. - * Both *FeatureSets* will enable the **same** feature gates: - - `CSIMigration` - - `CSIMIgrationAWS` - - `CSIMigrationGCE` - - `CSIMigrationAzureDisk` - - `CSIMigrationAzureFile` - - `CSIMigrationvSphere` - - `CSIMigrationOpenStack` - * The machine-config-operator (MCO) will **ignore** the `CSIMigrationControlPlane` **FeatureSet**. - * On the other hand, all operators will react to the `CSIMigrationNode` *FeatureSet*, including control-plane operators. +1. Create two new FeatureSets: `CSIMigrationNode` and `CSIMigrationControlPlane`. +* Both *FeatureSets* will enable the **same** feature gates: +- `CSIMigration` +- `CSIMIgrationAWS` +- `CSIMigrationGCE` +- `CSIMigrationAzureDisk` +- `CSIMigrationAzureFile` +- `CSIMigrationvSphere` +- `CSIMigrationOpenStack` +* The machine-config-operator (MCO) will **ignore** the `CSIMigrationControlPlane` **FeatureSet**. +* On the other hand, all operators will react to the `CSIMigrationNode` *FeatureSet*, + including control-plane operators. 2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: - * First, enable the CSI migration feature flags in all control-plane components: - - Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: +* First, enable the CSI migration feature flags in all control-plane components: +- Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: ```shell $ oc edit featuregates/cluster (...) spec: featureSet: CSIMigrationControlPlane ``` - - With the exception of MCO, all operators will recognize this *FeatureSet* and will initialize their operands with the associated feature gates. - * Second, once all control-plane components have restarted, enable the CSI migration feature flags in the kubelet: - - Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`): + - With the exception of MCO, all operators will recognize this *FeatureSet* and will initialize their operands with the associated feature gates. + * Second, once all control-plane components have restarted, enable the CSI migration feature flags in the kubelet: + - Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`): ```shell $ oc edit featuregates/cluster (...) spec: featureSet: CSIMigrationNode ``` - - All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. - * At this point, CSI migration is fully enabled in the cluster. -3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: - * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. - * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. - * Remove the `CSIMigrationControlPlane` *FeatureSet* from the `featuregates/cluster` object. -4. It is the **responsibility of the cluster administrator** to guarante the ordering described above is respected. + - All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. + * At this point, CSI migration is fully enabled in the cluster. + 3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: + * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. + * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. + * Remove the `CSIMigrationControlPlane` *FeatureSet* from the `featuregates/cluster` object. + 4. It is the **responsibility of the cluster administrator** to guarante the ordering described above is respected. ### GA -Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as well, and there will not be an option to disable it. +Once CSI migration reaches GA in upstream, the associated feature gates will be +enabled by default and the features will not be optional anymore. As a result, +CSI migration will be enabled by default in OCP as well, and there will not be +an option to disable it. -In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. +In addition that, the *FeatureSets* created to handle the Tech Preview feature +will no longer be operational because the feature flags they enable will already +be enabled in the cluster. -As for the required ordering described above, the [upgrade order](https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) performed by CVO during a cluster upgrade will take care of starting components in the desired order. +As for the required ordering described above, the [upgrade order] +(https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) +performed by CVO during a cluster upgrade will take care of starting +components in the desired order. -In this phase, CSI migration feature gates are enabled by default in all components, so restarting control-plane components before the kubelet is enough to guarantee a smooth feature enablement. +In this phase, CSI migration feature gates are enabled by default in all +components, so restarting control-plane components before the kubelet +is enough to guarantee a smooth feature enablement. #### Limitations -The limitations of this approach lies on the downgrade process. In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. +The limitations of this approach lies on the downgrade process. In OCP, a downgrade +is fundamentally a regular upgrade to an older version. That means that CVO downgrades +components in the same order as it upgrades them: control-plane first, nodes later. -However, this ordering is **not** desired when **disabling** CSI migration, which is what is going to happen when the previous OCP version had CSI migration disabled. +However, this ordering is **not** desired when **disabling** CSI migration, which is +what is going to happen when the previous OCP version had CSI migration disabled. -It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). -The policy states that a new kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. +It is important to note that the order in which operators are downgraded in OCP violates +[upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). +The policy states that a new kubelet must never run with an older API server or Controller Manager. +A direct consequence of this violation is the need to introduce a workaround to downgrade a +cluster with CSI migration enabled. That being said, to address this issue we propose to document a simple workaround: 1. Enable the `CSIMigrationNode` *FeatureSet*. 1. Downgrade. -As stated above, the CSI migration *FeatureSets* are not operational once CSI migration becomes GA. However, they will be carried over during the downgrade and they will be correctly applied once the system is downgraded. +As stated above, the CSI migration *FeatureSets* are not operational once CSI migration +becomes GA. However, they will be carried over during the downgrade and they will be +correctly applied once the system is downgraded. ### Post-GA -CSI migration *FeatureSets* can be removed from OCP API **one** release after CSI migration becomes GA. +CSI migration *FeatureSets* can be removed from OCP API **one** release after CSI +migration becomes GA. ### Risks and Mitigations Although this three-phased approach does what we need, it has some drawbacks: -1. For Tech Preview, users might enable or disable *FeatureSets* in the wrong order, causing issues with attaching or detaching existing volumes. We plan to address this through documentation. +1. For Tech Preview, users might enable or disable *FeatureSets* in the wrong +order, causing issues with attaching or detaching existing volumes. We plan to +address this through documentation. ## Design Details @@ -165,16 +261,18 @@ This is what needs to be done across different support phases: 1. Tech Preview: * openshift/api - * Introduce two new *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` +* Introduce two new *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` * Machine Config Operator - * Bump openshift/api in order to get the *FeatureSets* above - * Introduce a patch to ignore the `CSIMigrationControlPlane` *FeatureSet* +* Bump openshift/api in order to get the *FeatureSets* above +* Introduce a patch to ignore the `CSIMigrationControlPlane` *FeatureSet* * Kubernetes Scheduler Operator - * Bump openshift/api in order to get the *FeatureSets* above +* Bump openshift/api in order to get the *FeatureSets* above * Kubernetes Controller Manager Operator - * Bump openshift/api in order to get the *FeatureSets* above +* Bump openshift/api in order to get the *FeatureSets* above -Other operators, like the Kubernetes API Server Operator, may have their openshift/api library bumped. However, that is not strictly necessary as their operands don't need to enable the CSI migration feature flags. +Other operators, like the Kubernetes API Server Operator, may have their +openshift/api library bumped. However, that is not strictly necessary as +their operands don't need to enable the CSI migration feature flags. 2. GA @@ -183,12 +281,12 @@ Nothing needs to be done. 3. GA + 1 release * openshift/api - * Remove CSI migration *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` +* Remove CSI migration *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` * Machine Config Operator - * Bump openshift/api - * Remove the skip added for Tech Preview +* Bump openshift/api +* Remove the skip added for Tech Preview * Other operators - * Bump openshift/api +* Bump openshift/api ### Test Plan @@ -199,7 +297,9 @@ We want E2E jobs for all migrated plugins ready at **Tech Preview** time. For each E2E job: 1. Install an OCP cluster. -1. Enable the feature gate in the right order. Even though it is a fresh cluster, we need to respect the order because [volumes might be created in CI](https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/monitoringpvc/ipi-install-monitoringpvc-ref.yaml). +1. Enable the feature gate in the right order. Even though it is a fresh cluster, + we need to respect the order because [volumes might be created in CI] + (https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/monitoringpvc/ipi-install-monitoringpvc-ref.yaml). 1. Run E2E tests for in-tree volume plugins. 1. Disable the feature gate in the right order. 1. Once again, run E2E tests for in-tree volume plugins. @@ -211,4 +311,5 @@ In addition to that, as a stretch goal, we want a separate job that: 1. Wait for all components to have the right feature flags. 1. Checks if the StatefulSet survives. -Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades from an OCP version with migration disabled to a version with migration enabled. +Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades +from an OCP version with migration disabled to a version with migration enabled. From 0be3890ec55b8ea1e0bc69041d6c32ca3b59c231 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 09:55:11 +0100 Subject: [PATCH 14/28] More updates --- enhancements/storage/csi-migration.md | 263 +++++++------------------- 1 file changed, 66 insertions(+), 197 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 5a1e0752f3..f096d5b06d 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -27,44 +27,26 @@ superseded-by: ## Summary -We want to allow cluster administrators to seamlessly migrate volumes created using -the in-tree storage plugin to their counterparts CSI drivers. It is important to -achieve this goal before CSI Migration feature becomes GA in upstream. +We want to allow cluster administrators to seamlessly migrate volumes created using the in-tree storage plugin to their counterparts CSI drivers. It is important to achieve this goal before CSI Migration feature becomes GA in upstream. Also, this is a requirement for supporting [out-of-tree cloud providers](https://github.com/openshift/enhancements/pull/463) ## Motivation -[CSI migration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration) -is an upstream effort to migrate in-tree volume plugins to their counterpart CSI -drivers. The feature is beta since Kubernetes 1.17, however, as of Kubernetes 1.20 -it is still *disabled* by default. +[CSI migration](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/625-csi-migration) is an upstream effort to migrate in-tree volume plugins to their counterpart CSI drivers. The feature is beta since Kubernetes 1.17, however, as of Kubernetes 1.20 it is still *disabled* by default. -That is going to change in Kubernetes 1.21 (OCP 4.8), where the feature will remain -beta, but *enabled* by default. In Kubernetes 1.22 (OCP 4.9) the feature may become GA. +That is going to change in Kubernetes 1.21 (OCP 4.8), where the feature will remain beta, but *enabled* by default. In Kubernetes 1.22 (OCP 4.9) the feature may become GA. -In OCP we can optionally disable the CSI migration feature while it is still beta, -however, that will no longer be an option once CSI migration becomes GA. - -In order to avoid surprises once the migration is enabled by default in OCP, we want -to allow cluster administrators to optionally enable the feature earlier, preferably -in OCP 4.8. +In OCP we can optionally disable the CSI migration feature while it is still beta, however, that will no longer be an option once CSI migration becomes GA. In order to avoid surprises once the migration is enabled by default in OCP, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. ### Goals Our goals are different throughout our support lifecycle. -For Tech Preview, we want to introduce a mechanism to allow switching CSI migration -feature flags on and off across OCP components. Due to upstream requirements -described in (the design proposal)[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateun-migrate], -it is important that this mechanism allows for switching the feature -flags in the the correct order. +For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off across OCP components. Due to upstream requirements described in (the design proposal)[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateun-migrate], it is important that this mechanism allows for switching the feature flags in the the correct order. -In other words, when enabling CSI migration, control-plane components should have -their feature flags enabled before the kubelet. The opposite order applies when -disabling CSI migration. + +In other words, when enabling CSI migration, control-plane components should have their feature flags enabled before the kubelet. The opposite order applies when disabling CSI migration. -For GA, we will not support disabling CSI migration. Existing in-tree volumes will -be migrated to CSI and users should not have to do any additional work. We do want -to make sure we will not break downgrades should the user decide to do that. +For GA, we will not support disabling CSI migration. Existing in-tree volumes will be migrated to CSI and users should not have to do any additional work. We do want to make sure we will not break downgrades should the user decide to do that. ### Non-Goals @@ -75,29 +57,11 @@ to make sure we will not break downgrades should the user decide to do that. ### Requirements -The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, -to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, -the cluster administrator should turn on these two feature gates: *CSIMigration*, -and *CSIMigrationAWS*. - -In OCP, we can easily set those feature gates by using the [FeatureGate] -(https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) -Custom Resource. OCP operators read this resource and restart their operands with -the appropriate features enabled. However, this approach alone is not acceptable -for CSI migration because the feature flags might be switched across components -in an arbitrary order. - -This is not acceptable because CSI Migration requires that feature flags are enabled -or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). +Before getting to our proposal, we need to describe some of the upstream requirements for using CSI Migration. -It is important to respect this ordering to avoid an undesired state of volumes. -For instance, if the feature is enabled in the kubelet before it is enabled in the -AttachDetach controller, volumes attached to nodes by the in-tree volume plugin -cannot be detached by the CSI driver and will stay attached forever. +The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: *CSIMigration* and *CSIMigrationAWS*. However, these flags must be enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). -In the same vein, if the feature is disabled in the AttachDetach controller before -it is disabled in the kubelet, volumes attached by the CSI driver cannot be detached -by the in-tree volume plugin and will stay attached forever. +It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the kubelet before it is enabled in the AttachDetach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. In the same vein, if the feature is disabled in the AttachDetach controller before it is disabled in the kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. In summary, this is what upstream recommends: @@ -110,183 +74,91 @@ In summary, this is what upstream recommends: 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. 2. Once that's done, disable the feature gate in all control-plane components. -In addition to that, upstream has a mechanism to keep the AttachDetach Controller -informed about the status of the migration on the nodes. Roughly speaking, -Kubelet propagates to an annotation the information for each potentially migrated -in-tree plugin on the node. +In order to keep the Attach Detach Controller and the Kubelet in sync regarding using the CSI driver or the in-tree plugin, upstream has a mechanism to keep the AttachDetach Controller informed about the status of the migration on nodes. Roughly speaking, Kubelet propagates to an annotation the information for each migrated in-tree plugin on the node. -As a result, the AttachDetach Controller knows if the in-tree plugin has been -migrated on the Node. If the feature flags are enabled in KCM and on the Node, -the AttachDetach Controller uses the CSI driver to attach volumes. Otherwise, -it will falls back to the in-tree plugin. +As a result, the AttachDetach Controller knows if the in-tree plugin has been migrated on the Node. If the feature flags are enabled in KCM and on the Node, the AttachDetach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. ### OCP -That being said, we propose to add a carry-patch to OCP that allows the AttachDetach -Controller to ignore the CSI Migration feature flags passed to Kube Controller Manager. -That way, when deciding about using either the CSI driver or the in-tree plugin, -the AttachDetch Controller will **only** rely on the information propagated by the Node. - -If the feature flags are disabled on the Node, the AttachDetach Controller will never use -the CSI driver to attach or detach volumes, which makes this a safe approach. - -The biggest benefit of this approach is that we do not need to worry about the ordering -in which components are restarted with the CSI migration flags switched on or off. - -Another benefit is that this approach should not break downgrades once CSI Migration -becomes GA. In OCP, a downgrade is fundamentally a regular upgrade to an older version. -That means that CVO downgrades components in the same order as it upgrades them: control-plane -first, nodes later. This would impose an issue if the user downgraded from a version with CSI -Migration enabled to a version with CSI Migration disabled. With the above patch in the downgraded -version, that would not be a problem. - -Once this patch is merged, during Tech Preview users can enable the CSI migration using a *FeatureSet*. -It could be shared with [CCMO](https://github.com/openshift/enhancements/pull/463), but we may want -to have a specific *FeatureSet* only for CSI Migration. - -Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default. -As a result, it will not be necessary to use *FeatureSets* anymore. - -## Alternatives +In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. -We have considered this alternative approach. It is NOT our preferable approach because -it introduces many shortcomings: +That being said, we plan to submit an upstream patch that allows the AttachDetach Controller to have its own custom feature gates, independent from Kube Controller Manager. -* It is error prone. -* We put on the user the responsability of applying the *FeatureSets* in the correct order. -* We need to create 2 extra *FeatureSets* as opposed to 1. -* We need to patch MCO so that it ignores one of these *FeatureSets*. +In addition to that, we propose to add a carry-patch to Attacth Detach Controller in OCP enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). -### Tech Preview +That way, when deciding about using either the CSI driver or the in-tree plugin, the AttachDetch Controller will **only** rely on the information propagated by the Node. +Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a Node basis rather than relying on a global state. -1. Create two new FeatureSets: `CSIMigrationNode` and `CSIMigrationControlPlane`. -* Both *FeatureSets* will enable the **same** feature gates: -- `CSIMigration` -- `CSIMIgrationAWS` -- `CSIMigrationGCE` -- `CSIMigrationAzureDisk` -- `CSIMigrationAzureFile` -- `CSIMigrationvSphere` -- `CSIMigrationOpenStack` -* The machine-config-operator (MCO) will **ignore** the `CSIMigrationControlPlane` **FeatureSet**. -* On the other hand, all operators will react to the `CSIMigrationNode` *FeatureSet*, - including control-plane operators. -2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: -* First, enable the CSI migration feature flags in all control-plane components: -- Add the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object: - ```shell - $ oc edit featuregates/cluster - (...) - spec: - featureSet: CSIMigrationControlPlane - ``` - - With the exception of MCO, all operators will recognize this *FeatureSet* and will initialize their operands with the associated feature gates. - * Second, once all control-plane components have restarted, enable the CSI migration feature flags in the kubelet: - - Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`): - ```shell - $ oc edit featuregates/cluster - (...) - spec: - featureSet: CSIMigrationNode - ``` - - All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. - * At this point, CSI migration is fully enabled in the cluster. - 3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: - * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. - * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. - * Remove the `CSIMigrationControlPlane` *FeatureSet* from the `featuregates/cluster` object. - 4. It is the **responsibility of the cluster administrator** to guarante the ordering described above is respected. +The biggest benefit of this approach is that we do not need to worry about the ordering in which components are restarted with the CSI migration flags switched on or off. Migration flags can be switched on or off across components in any order. -### GA - -Once CSI migration reaches GA in upstream, the associated feature gates will be -enabled by default and the features will not be optional anymore. As a result, -CSI migration will be enabled by default in OCP as well, and there will not be -an option to disable it. +Another benefit is that this approach should not break downgrades once CSI Migration becomes GA. In OCP, a downgrade is fundamentally a regular upgrade to an older version. +That means that CVO downgrades components in the same order as it upgrades them: connntrol-plane first, nodes later. This would impose an issue if the user downgraded from a version with CSI Migration enabled to a version with CSI Migration disabled. With the above patch in the downgraded version, that would not be a problem. -In addition that, the *FeatureSets* created to handle the Tech Preview feature -will no longer be operational because the feature flags they enable will already -be enabled in the cluster. +It is important to notice that, with this carry-patch in OCP, Attach Detach Controller will _not_ change its current behaviour as long as Nodes are not migrated to CSI, which is the default behaviour in OCP 4.8. -As for the required ordering described above, the [upgrade order] -(https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#generalized-ordering) -performed by CVO during a cluster upgrade will take care of starting -components in the desired order. +Once this patch is merged, during Tech Preview users can enable the CSI migration using a *FeatureSet*. It could be shared with [CCMO](https://github.com/openshift/enhancements/pull/463), but we may want to have a specific *FeatureSet* only for CSI Migration for users that want to enable CSI Migration without migrating to an external cloud provider. -In this phase, CSI migration feature gates are enabled by default in all -components, so restarting control-plane components before the kubelet -is enough to guarantee a smooth feature enablement. +Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default. As a result, it will not be necessary to use *FeatureSets* anymore. -#### Limitations - -The limitations of this approach lies on the downgrade process. In OCP, a downgrade -is fundamentally a regular upgrade to an older version. That means that CVO downgrades -components in the same order as it upgrades them: control-plane first, nodes later. +## Design Details -However, this ordering is **not** desired when **disabling** CSI migration, which is -what is going to happen when the previous OCP version had CSI migration disabled. +This is what needs to be done across different support phases: -It is important to note that the order in which operators are downgraded in OCP violates -[upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). -The policy states that a new kubelet must never run with an older API server or Controller Manager. -A direct consequence of this violation is the need to introduce a workaround to downgrade a -cluster with CSI migration enabled. +1. Tech Preview: -That being said, to address this issue we propose to document a simple workaround: +* Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. +* Make sure the *FeatureSet* used by [CCMO](https://github.com/openshift/enhancements/pull/463) contains the CSI migration feature flags enabled for the respective storage backened. +* Introduce an upstream patch that allows Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. +* Introduce a carry-patch in OCP that enables CSI Migration for Cinder and GCP PD in Attach Detach Controller. +* A PoC of both upstream and OCP patches [are available](https://github.com/openshift/kubernetes/pull/601). -1. Enable the `CSIMigrationNode` *FeatureSet*. -1. Downgrade. +2. GA -As stated above, the CSI migration *FeatureSets* are not operational once CSI migration -becomes GA. However, they will be carried over during the downgrade and they will be -correctly applied once the system is downgraded. +Nothing needs to be done, migration flags are enabled by default and cannot be disabled. -### Post-GA +3. GA + 1 release -CSI migration *FeatureSets* can be removed from OCP API **one** release after CSI -migration becomes GA. +* Remove `CSIMigration` *FeatureSet* from openshift/api. -### Risks and Mitigations -Although this three-phased approach does what we need, it has some drawbacks: +## Alternatives -1. For Tech Preview, users might enable or disable *FeatureSets* in the wrong -order, causing issues with attaching or detaching existing volumes. We plan to -address this through documentation. +We have considered this alternative approach. It is NOT our preferable approach because it introduces many shortcomings: -## Design Details +* We put on the user the responsability of applying the *FeatureSets* in the correct order, which is very error-prone. +* We need to create 2 extra *FeatureSets* as opposed to 1. +* We need to patch MCO so that it ignores one of these *FeatureSets*. +* Once GA, downgrades to a previous version with CSI Migration disabled by default would break because OCP downgrades the control-plane before nodes. -This is what needs to be done across different support phases: +### Tech Preview -1. Tech Preview: +1. Create two new FeatureSets: `CSIMigrationNode` and `CSIMigrationControlPlane`. Both *FeatureSets* will enable all CSI Migration feature flags. + * The machine-config-operator (MCO) will **ignore** the `CSIMigrationControlPlane` **FeatureSet**. + * On the other hand, all operators will react to the `CSIMigrationNode` *FeatureSet*, including control-plane operators. +2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: + * First, enable the CSI migration feature flags in all control-plane components by adding the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object. + - With the exception of MCO, all operators will recognize this *FeatureSet* and will initialize their operands with the associated feature gates. + * Second, once all control-plane components have restarted, enable the CSI migration feature flags in the kubelet: + - Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`). + - All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. + * At this point, CSI migration is fully enabled in the cluster. +3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: + * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. + * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. + * Remove the `CSIMigrationControlPlane` *FeatureSet* from the `featuregates/cluster` object. +4. It is the **responsibility of the cluster administrator** to guarante the ordering described above is respected. -* openshift/api -* Introduce two new *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` -* Machine Config Operator -* Bump openshift/api in order to get the *FeatureSets* above -* Introduce a patch to ignore the `CSIMigrationControlPlane` *FeatureSet* -* Kubernetes Scheduler Operator -* Bump openshift/api in order to get the *FeatureSets* above -* Kubernetes Controller Manager Operator -* Bump openshift/api in order to get the *FeatureSets* above +### GA -Other operators, like the Kubernetes API Server Operator, may have their -openshift/api library bumped. However, that is not strictly necessary as -their operands don't need to enable the CSI migration feature flags. +Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as well, and there will not be an option to disable it. -2. GA +In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. -Nothing needs to be done. +It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). The policy states that a new kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. -3. GA + 1 release +### Post-GA -* openshift/api -* Remove CSI migration *FeatureSets*: `CSIMigrationNode` and `CSIMigrationControlPlane` -* Machine Config Operator -* Bump openshift/api -* Remove the skip added for Tech Preview -* Other operators -* Bump openshift/api +CSI migration *FeatureSets* can be removed from OCP API **one** release after CSI migration becomes GA. ### Test Plan @@ -297,9 +169,7 @@ We want E2E jobs for all migrated plugins ready at **Tech Preview** time. For each E2E job: 1. Install an OCP cluster. -1. Enable the feature gate in the right order. Even though it is a fresh cluster, - we need to respect the order because [volumes might be created in CI] - (https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/monitoringpvc/ipi-install-monitoringpvc-ref.yaml). +1. Enable the feature gate in the right order. Even though it is a fresh cluster, we need to respect the order because [volumes might be created in CI] (https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/monitoringpvc/ipi-install-monitoringpvc-ref.yaml). 1. Run E2E tests for in-tree volume plugins. 1. Disable the feature gate in the right order. 1. Once again, run E2E tests for in-tree volume plugins. @@ -311,5 +181,4 @@ In addition to that, as a stretch goal, we want a separate job that: 1. Wait for all components to have the right feature flags. 1. Checks if the StatefulSet survives. -Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades -from an OCP version with migration disabled to a version with migration enabled. +Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades from an OCP version with migration disabled to a version with migration enabled. From d7ce166d0023b17bac594419469d327f2eab447f Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 10:18:48 +0100 Subject: [PATCH 15/28] Add more meta information --- enhancements/storage/csi-migration.md | 90 +++++++++++++++++---------- 1 file changed, 56 insertions(+), 34 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index f096d5b06d..5a458025cc 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -3,14 +3,14 @@ title: csi-migration authors: - "@fbertina" reviewers: - - "@openshift/storage ” + - "@openshift/storage” approvers: - "@openshift/openshift-architects" - "@darkmuggle" -creation-date: 2020-07-01 -last-updated: 2021-01-28 +creation-date: 2020-07-29 +last-updated: 2021-03-05 status: provisional -see-also: +see-also: https://github.com/openshift/enhancements/pull/463 replaces: superseded-by: --- @@ -55,7 +55,10 @@ For GA, we will not support disabling CSI migration. Existing in-tree volumes wi ## Proposal -### Requirements +We propose to add a carry-patch to Attacth Detach Controller in OCP thta enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). + + +### Implementation Details/Notes/Constraints Before getting to our proposal, we need to describe some of the upstream requirements for using CSI Migration. @@ -78,8 +81,6 @@ In order to keep the Attach Detach Controller and the Kubelet in sync regarding As a result, the AttachDetach Controller knows if the in-tree plugin has been migrated on the Node. If the feature flags are enabled in KCM and on the Node, the AttachDetach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. -### OCP - In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. That being said, we plan to submit an upstream patch that allows the AttachDetach Controller to have its own custom feature gates, independent from Kube Controller Manager. @@ -89,6 +90,8 @@ In addition to that, we propose to add a carry-patch to Attacth Detach Controlle That way, when deciding about using either the CSI driver or the in-tree plugin, the AttachDetch Controller will **only** rely on the information propagated by the Node. Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a Node basis rather than relying on a global state. +#### Benefits + The biggest benefit of this approach is that we do not need to worry about the ordering in which components are restarted with the CSI migration flags switched on or off. Migration flags can be switched on or off across components in any order. Another benefit is that this approach should not break downgrades once CSI Migration becomes GA. In OCP, a downgrade is fundamentally a regular upgrade to an older version. @@ -96,15 +99,46 @@ That means that CVO downgrades components in the same order as it upgrades them: It is important to notice that, with this carry-patch in OCP, Attach Detach Controller will _not_ change its current behaviour as long as Nodes are not migrated to CSI, which is the default behaviour in OCP 4.8. -Once this patch is merged, during Tech Preview users can enable the CSI migration using a *FeatureSet*. It could be shared with [CCMO](https://github.com/openshift/enhancements/pull/463), but we may want to have a specific *FeatureSet* only for CSI Migration for users that want to enable CSI Migration without migrating to an external cloud provider. +#### Graduation + +During Tech Preview in OCP 4.8, users can enable the CSI migration using a *FeatureSet*. It could be shared with [CCMO](https://github.com/openshift/enhancements/pull/463), but we may want to have a specific *FeatureSet* only for CSI Migration for users that want to enable CSI Migration without migrating to an external cloud provider. Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default. As a result, it will not be necessary to use *FeatureSets* anymore. +### Risks and Mitigations + + + ## Design Details +### Test Plan + +#### E2E jobs + +We want E2E jobs for all migrated plugins ready at **Tech Preview** time. + +For each E2E job: + +1. Install an OCP cluster. +1. Enable the `CSIMigration` _FeatureSet_. +1. Run E2E tests for in-tree volume plugins. This should use the CSI driver instead. +1. Disable the `FeatureSet`. +1. Once again, run E2E tests for in-tree volume plugins. + +In addition to that, as a stretch goal, we want a separate job that: + +1. Runs a `StatefulSet`. +1. Enables the migration _FeatureSets_. +1. Wait for all components to have the right feature flags. +1. Checks if the StatefulSet survives. + +Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades from an OCP version with migration disabled to a version with migration enabled. + +### Graduation Criteria + This is what needs to be done across different support phases: -1. Tech Preview: +1. Tech Preview in OCP 4.8: * Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. * Make sure the *FeatureSet* used by [CCMO](https://github.com/openshift/enhancements/pull/463) contains the CSI migration feature flags enabled for the respective storage backened. @@ -112,14 +146,25 @@ This is what needs to be done across different support phases: * Introduce a carry-patch in OCP that enables CSI Migration for Cinder and GCP PD in Attach Detach Controller. * A PoC of both upstream and OCP patches [are available](https://github.com/openshift/kubernetes/pull/601). -2. GA +2. GA in OCP 4.9 Nothing needs to be done, migration flags are enabled by default and cannot be disabled. -3. GA + 1 release +3. Post-GA in OCP 4.10 * Remove `CSIMigration` *FeatureSet* from openshift/api. +## Implementation History + +Main events only, this is not a faithful history. + +2020-07-29: Initial enhancement draft. +2021-01-28: Re-worked proposal to use 2 _FeatureSets_ with manual application by the user. +2021-03-05: Re-worked proposal to use carry-patch instead. Moved previous approach to "Alternatives" for reference. + +## Drawbacks + +The idea is to find the best form of an argument why this enhancement should _not_ be implemented. ## Alternatives @@ -159,26 +204,3 @@ It is important to note that the order in which operators are downgraded in OCP ### Post-GA CSI migration *FeatureSets* can be removed from OCP API **one** release after CSI migration becomes GA. - -### Test Plan - -#### E2E jobs - -We want E2E jobs for all migrated plugins ready at **Tech Preview** time. - -For each E2E job: - -1. Install an OCP cluster. -1. Enable the feature gate in the right order. Even though it is a fresh cluster, we need to respect the order because [volumes might be created in CI] (https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/monitoringpvc/ipi-install-monitoringpvc-ref.yaml). -1. Run E2E tests for in-tree volume plugins. -1. Disable the feature gate in the right order. -1. Once again, run E2E tests for in-tree volume plugins. - -In addition to that, as a stretch goal, we want a separate job that: - -1. Runs a StatefulSet. -1. Enables the migration *FeatureSets* in the right order. -1. Wait for all components to have the right feature flags. -1. Checks if the StatefulSet survives. - -Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades from an OCP version with migration disabled to a version with migration enabled. From 6aee9e8330f0b335f3f5f935b1839a967a61452c Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 10:19:22 +0100 Subject: [PATCH 16/28] Remove @darkmuggle because this does not change MCO anymore --- enhancements/storage/csi-migration.md | 1 - 1 file changed, 1 deletion(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 5a458025cc..65cf700a35 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -6,7 +6,6 @@ reviewers: - "@openshift/storage” approvers: - "@openshift/openshift-architects" - - "@darkmuggle" creation-date: 2020-07-29 last-updated: 2021-03-05 status: provisional From 108d172ea547eaebcbe54785a69a792144059d8a Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 10:39:53 +0100 Subject: [PATCH 17/28] Clarify goals --- enhancements/storage/csi-migration.md | 30 ++++++++++++--------------- 1 file changed, 13 insertions(+), 17 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 65cf700a35..274294fdf5 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -26,7 +26,7 @@ superseded-by: ## Summary -We want to allow cluster administrators to seamlessly migrate volumes created using the in-tree storage plugin to their counterparts CSI drivers. It is important to achieve this goal before CSI Migration feature becomes GA in upstream. Also, this is a requirement for supporting [out-of-tree cloud providers](https://github.com/openshift/enhancements/pull/463) +We want to allow cluster administrators to seamlessly migrate volumes created using the in-tree storage plugin to their counterparts CSI drivers. It is important to achieve this goal before CSI Migration feature becomes GA in upstream. This also a requirement for supporting [out-of-tree cloud providers](https://github.com/openshift/enhancements/pull/463) ## Motivation @@ -40,12 +40,9 @@ In OCP we can optionally disable the CSI migration feature while it is still bet Our goals are different throughout our support lifecycle. -For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off across OCP components. Due to upstream requirements described in (the design proposal)[https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateun-migrate], it is important that this mechanism allows for switching the feature flags in the the correct order. +For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off across OCP components. It is important that this mechanism allows for a seameless migration path, without breaking existing volumes. - -In other words, when enabling CSI migration, control-plane components should have their feature flags enabled before the kubelet. The opposite order applies when disabling CSI migration. - -For GA, we will not support disabling CSI migration. Existing in-tree volumes will be migrated to CSI and users should not have to do any additional work. We do want to make sure we will not break downgrades should the user decide to do that. +For GA, existing in-tree volumes will be migrated to CSI and users should not have to do any additional work. In this phase we will not support disabling CSI migration and we do want to make sure we will not break downgrades. ### Non-Goals @@ -54,8 +51,7 @@ For GA, we will not support disabling CSI migration. Existing in-tree volumes wi ## Proposal -We propose to add a carry-patch to Attacth Detach Controller in OCP thta enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). - +We propose to add a carry-patch to Attacth Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). ### Implementation Details/Notes/Constraints @@ -63,22 +59,22 @@ Before getting to our proposal, we need to describe some of the upstream require The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: *CSIMigration* and *CSIMigrationAWS*. However, these flags must be enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). -It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the kubelet before it is enabled in the AttachDetach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. In the same vein, if the feature is disabled in the AttachDetach controller before it is disabled in the kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. +It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the Kubelet before it is enabled in the AttachDetach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. In the same vein, if the feature is disabled in the AttachDetach controller before it is disabled in the Kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. In summary, this is what upstream recommends: * When the CSI migration is **enabled**, events should happen in this order: 1. Enable the feature gate in all control-plane components. - 2. Once that's done, drain nodes one-by-one and start the kubelet with the + 2. Once that's done, drain nodes one-by-one and start the Kubelet with the feature gate enabled. * When the CSI migration is **disabled**, events should happen in this order: - 1. One-by-one, drain the nodes and start the kubelet with the feature gate disabled. + 1. One-by-one, drain the nodes and start the Kubelet with the feature gate disabled. 2. Once that's done, disable the feature gate in all control-plane components. In order to keep the Attach Detach Controller and the Kubelet in sync regarding using the CSI driver or the in-tree plugin, upstream has a mechanism to keep the AttachDetach Controller informed about the status of the migration on nodes. Roughly speaking, Kubelet propagates to an annotation the information for each migrated in-tree plugin on the node. -As a result, the AttachDetach Controller knows if the in-tree plugin has been migrated on the Node. If the feature flags are enabled in KCM and on the Node, the AttachDetach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. +As a result, the Attach Detach Controller knows if the in-tree plugin has been migrated on the node. If the feature flags are enabled in Kube Controller Manager and on the node, the AttachDetach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. @@ -86,8 +82,8 @@ That being said, we plan to submit an upstream patch that allows the AttachDetac In addition to that, we propose to add a carry-patch to Attacth Detach Controller in OCP enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). -That way, when deciding about using either the CSI driver or the in-tree plugin, the AttachDetch Controller will **only** rely on the information propagated by the Node. -Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a Node basis rather than relying on a global state. +That way, when deciding about using either the CSI driver or the in-tree plugin, the AttachDetch Controller will **only** rely on the information propagated by the node. +Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis rather than relying on a global state. #### Benefits @@ -96,7 +92,7 @@ The biggest benefit of this approach is that we do not need to worry about the o Another benefit is that this approach should not break downgrades once CSI Migration becomes GA. In OCP, a downgrade is fundamentally a regular upgrade to an older version. That means that CVO downgrades components in the same order as it upgrades them: connntrol-plane first, nodes later. This would impose an issue if the user downgraded from a version with CSI Migration enabled to a version with CSI Migration disabled. With the above patch in the downgraded version, that would not be a problem. -It is important to notice that, with this carry-patch in OCP, Attach Detach Controller will _not_ change its current behaviour as long as Nodes are not migrated to CSI, which is the default behaviour in OCP 4.8. +It is important to notice that, with this carry-patch in OCP, Attach Detach Controller will _not_ change its current behaviour as long as nodes are not migrated to CSI, which is the default behaviour in OCP 4.8. #### Graduation @@ -182,7 +178,7 @@ We have considered this alternative approach. It is NOT our preferable approach 2. To enable CSI Migration for any in-tree plugin, the cluster administrator should: * First, enable the CSI migration feature flags in all control-plane components by adding the `CSIMigrationControlPlane` *FeatureSet* to the `featuregates/cluster` object. - With the exception of MCO, all operators will recognize this *FeatureSet* and will initialize their operands with the associated feature gates. - * Second, once all control-plane components have restarted, enable the CSI migration feature flags in the kubelet: + * Second, once all control-plane components have restarted, enable the CSI migration feature flags in the Kubelet: - Add the `CSIMigrationNode` *FeatureSet* to the `featuresgates/cluster` object, replacing the previous value (i.e., `CSIMigrationControlPlane`). - All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. * At this point, CSI migration is fully enabled in the cluster. @@ -198,7 +194,7 @@ Once CSI migration reaches GA in upstream, the associated feature gates will be In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. -It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). The policy states that a new kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. +It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). The policy states that a new Kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. ### Post-GA From 7d5a8625cf58464b560d6efb498f95fd1bcadd46 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 10:47:21 +0100 Subject: [PATCH 18/28] Formatting & risks --- enhancements/storage/csi-migration.md | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 274294fdf5..14ccd3859f 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -59,7 +59,7 @@ Before getting to our proposal, we need to describe some of the upstream require The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: *CSIMigration* and *CSIMigrationAWS*. However, these flags must be enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). -It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the Kubelet before it is enabled in the AttachDetach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. In the same vein, if the feature is disabled in the AttachDetach controller before it is disabled in the Kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. +It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the Kubelet before it is enabled in the Attach Detach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. In the same vein, if the feature is disabled in the Attach Detach controller before it is disabled in the Kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. In summary, this is what upstream recommends: @@ -72,18 +72,16 @@ In summary, this is what upstream recommends: 1. One-by-one, drain the nodes and start the Kubelet with the feature gate disabled. 2. Once that's done, disable the feature gate in all control-plane components. -In order to keep the Attach Detach Controller and the Kubelet in sync regarding using the CSI driver or the in-tree plugin, upstream has a mechanism to keep the AttachDetach Controller informed about the status of the migration on nodes. Roughly speaking, Kubelet propagates to an annotation the information for each migrated in-tree plugin on the node. +In order to keep the Attach Detach Controller and the Kubelet in sync regarding using the CSI driver or the in-tree plugin, upstream has a mechanism to keep the Attach Detach Controller informed about the status of the migration on nodes. Roughly speaking, Kubelet propagates to an annotation the information for each migrated in-tree plugin on the node. -As a result, the Attach Detach Controller knows if the in-tree plugin has been migrated on the node. If the feature flags are enabled in Kube Controller Manager and on the node, the AttachDetach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. +As a result, the Attach Detach Controller knows if the in-tree plugin has been migrated on the node. If the feature flags are enabled in Kube Controller Manager **and** on the node, the Attach Detach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. -That being said, we plan to submit an upstream patch that allows the AttachDetach Controller to have its own custom feature gates, independent from Kube Controller Manager. - -In addition to that, we propose to add a carry-patch to Attacth Detach Controller in OCP enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). +That being said, we plan to submit an upstream patch that allows the Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. In addition to that, we propose to add a carry-patch to Attacth Detach Controller in OCP that enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). That way, when deciding about using either the CSI driver or the in-tree plugin, the AttachDetch Controller will **only** rely on the information propagated by the node. -Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis rather than relying on a global state. +Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. #### Benefits @@ -102,7 +100,7 @@ Once CSI migration reaches GA in upstream, the associated feature gates will be ### Risks and Mitigations - +A carry-patch means that OCP will be the only Kubernetes distribution exercising this code path, which can lead us to bugs that were not seen anywhere. However, we are confident that the patch is small and self-contained enough to be used. ## Design Details From 07fc05f56f39ededa6857109f56b7af7542e7908 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 10:50:23 +0100 Subject: [PATCH 19/28] Fix linting issues --- enhancements/storage/csi-migration.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 14ccd3859f..c8caf96466 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -59,14 +59,14 @@ Before getting to our proposal, we need to describe some of the upstream require The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: *CSIMigration* and *CSIMigrationAWS*. However, these flags must be enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). -It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the Kubelet before it is enabled in the Attach Detach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. In the same vein, if the feature is disabled in the Attach Detach controller before it is disabled in the Kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. +It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the Kubelet before it is enabled in the Attach Detach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. +In the same vein, if the feature is disabled in the Attach Detach controller before it is disabled in the Kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. In summary, this is what upstream recommends: * When the CSI migration is **enabled**, events should happen in this order: 1. Enable the feature gate in all control-plane components. - 2. Once that's done, drain nodes one-by-one and start the Kubelet with the - feature gate enabled. + 2. Once that's done, drain nodes one-by-one and start the Kubelet with the feature gate enabled. * When the CSI migration is **disabled**, events should happen in this order: 1. One-by-one, drain the nodes and start the Kubelet with the feature gate disabled. @@ -76,9 +76,11 @@ In order to keep the Attach Detach Controller and the Kubelet in sync regarding As a result, the Attach Detach Controller knows if the in-tree plugin has been migrated on the node. If the feature flags are enabled in Kube Controller Manager **and** on the node, the Attach Detach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. -In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. +In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. +However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. -That being said, we plan to submit an upstream patch that allows the Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. In addition to that, we propose to add a carry-patch to Attacth Detach Controller in OCP that enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). +That being said, we plan to submit an upstream patch that allows the Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. +In addition to that, we propose to add a carry-patch to Attach Controller in OCP that enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). That way, when deciding about using either the CSI driver or the in-tree plugin, the AttachDetch Controller will **only** rely on the information propagated by the node. Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. @@ -192,7 +194,8 @@ Once CSI migration reaches GA in upstream, the associated feature gates will be In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. -It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). The policy states that a new Kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. +It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). +The policy states that a new Kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. ### Post-GA From 9e4d474bce86dc6c2789524bbcf84eed5cc517b3 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 10:53:28 +0100 Subject: [PATCH 20/28] Remove unused section --- enhancements/storage/csi-migration.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index c8caf96466..c49aea061a 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -157,10 +157,6 @@ Main events only, this is not a faithful history. 2021-01-28: Re-worked proposal to use 2 _FeatureSets_ with manual application by the user. 2021-03-05: Re-worked proposal to use carry-patch instead. Moved previous approach to "Alternatives" for reference. -## Drawbacks - -The idea is to find the best form of an argument why this enhancement should _not_ be implemented. - ## Alternatives We have considered this alternative approach. It is NOT our preferable approach because it introduces many shortcomings: From 77868c3ce59dff485727c9eda2826d065a131c25 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 11:06:52 +0100 Subject: [PATCH 21/28] Clarify core of the proposal --- enhancements/storage/csi-migration.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index c49aea061a..104f3a4e3d 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -76,14 +76,15 @@ In order to keep the Attach Detach Controller and the Kubelet in sync regarding As a result, the Attach Detach Controller knows if the in-tree plugin has been migrated on the node. If the feature flags are enabled in Kube Controller Manager **and** on the node, the Attach Detach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. +#### Upstream and OCP patches + In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. -However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. +However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. That being said, our approach is intended to make the Attach Detach Controller resilient to this issue. -That being said, we plan to submit an upstream patch that allows the Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. +We plan to submit an upstream patch that allows the Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. In addition to that, we propose to add a carry-patch to Attach Controller in OCP that enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). -That way, when deciding about using either the CSI driver or the in-tree plugin, the AttachDetch Controller will **only** rely on the information propagated by the node. -Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. +That way, when deciding about using either the CSI driver or the in-tree plugin, the Attach Detch Controller will **only** rely on the information propagated by the node. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. #### Benefits From 0ca1e660da4950b44d4ce15dfbcd6c9aed33a741 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 11:11:54 +0100 Subject: [PATCH 22/28] Fix typos --- enhancements/storage/csi-migration.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 104f3a4e3d..548a721d39 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -51,7 +51,7 @@ For GA, existing in-tree volumes will be migrated to CSI and users should not ha ## Proposal -We propose to add a carry-patch to Attacth Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). +We propose to add a carry-patch to Attach Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). ### Implementation Details/Notes/Constraints @@ -84,7 +84,8 @@ However, this approach alone is not acceptable for CSI migration because the fea We plan to submit an upstream patch that allows the Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. In addition to that, we propose to add a carry-patch to Attach Controller in OCP that enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). -That way, when deciding about using either the CSI driver or the in-tree plugin, the Attach Detch Controller will **only** rely on the information propagated by the node. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. Other controllers from Kube Controller Manager, like the PV Controller, will still obey the flags passed to the Kube Controller Manager. +That way, when deciding about using either the CSI driver or the in-tree plugin, the Attach Detach Controller will **only** rely on the information propagated by the node. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. +Other controllers from Kube Controller Manager, like the PersistentVolume Controller, will still obey the flags passed to the Kube Controller Manager. #### Benefits @@ -137,7 +138,7 @@ This is what needs to be done across different support phases: 1. Tech Preview in OCP 4.8: * Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. -* Make sure the *FeatureSet* used by [CCMO](https://github.com/openshift/enhancements/pull/463) contains the CSI migration feature flags enabled for the respective storage backened. +* Make sure the *FeatureSet* used by [CCMO](https://github.com/openshift/enhancements/pull/463) contains the CSI migration feature flags enabled for the respective storage backend. * Introduce an upstream patch that allows Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. * Introduce a carry-patch in OCP that enables CSI Migration for Cinder and GCP PD in Attach Detach Controller. * A PoC of both upstream and OCP patches [are available](https://github.com/openshift/kubernetes/pull/601). From ff77e264f1b97fb84775df44c5ccf62254829901 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 13:39:41 +0100 Subject: [PATCH 23/28] Address @jsafrane review comments --- enhancements/storage/csi-migration.md | 48 ++++++++++++++++++--------- 1 file changed, 33 insertions(+), 15 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 548a721d39..520d6e810e 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -81,18 +81,23 @@ As a result, the Attach Detach Controller knows if the in-tree plugin has been m In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. That being said, our approach is intended to make the Attach Detach Controller resilient to this issue. -We plan to submit an upstream patch that allows the Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. -In addition to that, we propose to add a carry-patch to Attach Controller in OCP that enables CSI Migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). +That being said, **we propose to carry a patch in OCP's Attach Controller to force-enable the CSI Migration feature gates there**. -That way, when deciding about using either the CSI driver or the in-tree plugin, the Attach Detach Controller will **only** rely on the information propagated by the node. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. -Other controllers from Kube Controller Manager, like the PersistentVolume Controller, will still obey the flags passed to the Kube Controller Manager. +The carry-patch would only affect the Attach Detach Controller, other parts of Kube Controller Manager, like the PersistentVolume Controller, would still obey the flags passed to the Kube Controller Manager. +In order to do this the least invasive way possible, we will refactor the upstream code to allow the carry-patch to be minimal and self-contained. An existing PoC [is available](https://github.com/openshift/kubernetes/pull/601) to demonstrate how this would look like. + +This patch would be carried over in OCP until all in-tree storage plugins are migrated to CSI, which should happen around 3 or 4 releases. +Initially we would start with Cinder and GCP (_CSIMigrationGCE_ and _CSIMigrationCinder_ flags), so that we are aligned with the goals of both upstream and [CCMO](https://github.com/openshift/enhancements/pull/463). +In subsequent OCP releases we would proceed with EBS, vSphere and Azure (_CSIMigrationAWS_, _CSIMigrationvSphere_ and _CSIMigrationAzureFile_), without a predefined order. + +With this patch, when deciding about using either the CSI driver or the in-tree plugin, the Attach Detach Controller will **only** rely on the information propagated by the node. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. #### Benefits The biggest benefit of this approach is that we do not need to worry about the ordering in which components are restarted with the CSI migration flags switched on or off. Migration flags can be switched on or off across components in any order. Another benefit is that this approach should not break downgrades once CSI Migration becomes GA. In OCP, a downgrade is fundamentally a regular upgrade to an older version. -That means that CVO downgrades components in the same order as it upgrades them: connntrol-plane first, nodes later. This would impose an issue if the user downgraded from a version with CSI Migration enabled to a version with CSI Migration disabled. With the above patch in the downgraded version, that would not be a problem. +That means that CVO downgrades components in the same order as it upgrades them: control-plane first, nodes later. This would impose an issue if the user downgraded from a version with CSI Migration enabled to a version with CSI Migration disabled. With the above patch in the downgraded version, that would not be a problem. It is important to notice that, with this carry-patch in OCP, Attach Detach Controller will _not_ change its current behaviour as long as nodes are not migrated to CSI, which is the default behaviour in OCP 4.8. @@ -100,7 +105,7 @@ It is important to notice that, with this carry-patch in OCP, Attach Detach Cont During Tech Preview in OCP 4.8, users can enable the CSI migration using a *FeatureSet*. It could be shared with [CCMO](https://github.com/openshift/enhancements/pull/463), but we may want to have a specific *FeatureSet* only for CSI Migration for users that want to enable CSI Migration without migrating to an external cloud provider. -Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default. As a result, it will not be necessary to use *FeatureSets* anymore. +Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default. As a result, it will remove the carry-patch above. Also, users will not have to use *FeatureSets* anymore. ### Risks and Mitigations @@ -133,23 +138,36 @@ Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades ### Graduation Criteria -This is what needs to be done across different support phases: +Each storage plugin will be migrated at their own paces and will have different support phases across OCP releases. For instance, GCP and Cinder migration would be Tech Preview and OCP 4.8 and GA in 4.9. AWS EBS will _most likely_ be Tech Preview in 4.9 and GA in 4.10. + +Having said that, the items below only cover the initial targets for CSI Migration: GCP PD and Cinder. -1. Tech Preview in OCP 4.8: +1. Tech Preview in 4.8: -* Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. +* Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. This step only needs to be done once because the same object will contain all * Make sure the *FeatureSet* used by [CCMO](https://github.com/openshift/enhancements/pull/463) contains the CSI migration feature flags enabled for the respective storage backend. -* Introduce an upstream patch that allows Attach Detach Controller to have its own custom feature gates, independent from Kube Controller Manager. +* Introduce an upstream patch to refactor the Attach Detach Controller in order to make our carry-patch smaller and self-contained. * Introduce a carry-patch in OCP that enables CSI Migration for Cinder and GCP PD in Attach Detach Controller. -* A PoC of both upstream and OCP patches [are available](https://github.com/openshift/kubernetes/pull/601). -2. GA in OCP 4.9 +2. GA in 4.9: + +* Update carry-patch to not force-enable CSI Migration for Cinder and GCP PD. Since the migration is enabled by default, there is no need to force-enable the migration. + +Assuming the next storage plugin to be migrated is AWS EBS, the steps below represent what needs to be done for that plugin: + +1. Tech Preview in 4.9: + +* Update the carry-patch in Attach Detach Controller to force-enable the CSI Migration for EBS. + +2. GA in 4.10: + +* Update the carry-patch in Attach Detach Controller to **not** force-enable the CSI Migration for EBS. -Nothing needs to be done, migration flags are enabled by default and cannot be disabled. +Once CSI Migration is GA for all storage plugins have been completely migrated to CSI, we can: -3. Post-GA in OCP 4.10 +* Remove the carry-patch from OCP. +* Remove the `CSIMigration` _FeatureSet_ from openshift/api. -* Remove `CSIMigration` *FeatureSet* from openshift/api. ## Implementation History From c8ab2035087c450f5570f049848546c929c5cf4c Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 17:34:07 +0100 Subject: [PATCH 24/28] Address @huffmanca's comments --- enhancements/storage/csi-migration.md | 28 +++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 520d6e810e..e5f72d811f 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -26,7 +26,7 @@ superseded-by: ## Summary -We want to allow cluster administrators to seamlessly migrate volumes created using the in-tree storage plugin to their counterparts CSI drivers. It is important to achieve this goal before CSI Migration feature becomes GA in upstream. This also a requirement for supporting [out-of-tree cloud providers](https://github.com/openshift/enhancements/pull/463) +We want to allow cluster administrators to seamlessly migrate volumes created using the in-tree storage plugin to their counterpart CSI drivers. It is important to achieve this goal before CSI Migration feature becomes GA in upstream. This also a requirement for supporting [out-of-tree cloud providers](https://github.com/openshift/enhancements/pull/463) ## Motivation @@ -34,15 +34,15 @@ We want to allow cluster administrators to seamlessly migrate volumes created us That is going to change in Kubernetes 1.21 (OCP 4.8), where the feature will remain beta, but *enabled* by default. In Kubernetes 1.22 (OCP 4.9) the feature may become GA. -In OCP we can optionally disable the CSI migration feature while it is still beta, however, that will no longer be an option once CSI migration becomes GA. In order to avoid surprises once the migration is enabled by default in OCP, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. +In OCP we can optionally disable the CSI migration feature while it is still beta, however, that will no longer be an option once CSI migration becomes GA. In order to avoid surprises, once the migration is enabled by default in OCP, we want to allow cluster administrators to optionally enable the feature earlier, preferably in OCP 4.8. ### Goals Our goals are different throughout our support lifecycle. -For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off across OCP components. It is important that this mechanism allows for a seameless migration path, without breaking existing volumes. +For Tech Preview, we want to introduce a mechanism to allow switching CSI migration feature flags on and off across OCP components. It is important that this mechanism allows for a seamless migration path, without breaking existing volumes. -For GA, existing in-tree volumes will be migrated to CSI and users should not have to do any additional work. In this phase we will not support disabling CSI migration and we do want to make sure we will not break downgrades. +For GA, existing in-tree volumes will be migrated to use the CSI driver. Once the migration has been enabled, users should not have to do any additional work. In this phase we will not support disabling CSI migration, and we should ensure that downgrades are not broken. ### Non-Goals @@ -51,16 +51,16 @@ For GA, existing in-tree volumes will be migrated to CSI and users should not ha ## Proposal -We propose to add a carry-patch to Attach Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). +We propose to add a carry-patch to the Attach Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). ### Implementation Details/Notes/Constraints Before getting to our proposal, we need to describe some of the upstream requirements for using CSI Migration. -The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of a in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: *CSIMigration* and *CSIMigrationAWS*. However, these flags must be enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). +The CSI migration feature is hidden behind feature gates in Kubernetes. For instance, to enable the migration of an in-tree AWS EBS volume to its counterpart CSI driver, the cluster administrator should turn on these two feature gates: *CSIMigration* and *CSIMigrationAWS*. However, these flags must be enabled or disabled in a [specific order](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/csi-migration.md#upgradedowngrade-migrateunmigrate-scenarios). -It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the Kubelet before it is enabled in the Attach Detach controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. -In the same vein, if the feature is disabled in the Attach Detach controller before it is disabled in the Kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. +It is important to respect this ordering to avoid an undesired state of volumes. For instance, if the feature is enabled in the Kubelet before it is enabled in the Attach Detach Controller, volumes attached to nodes by the in-tree volume plugin cannot be detached by the CSI driver and will stay attached forever. +In the same vein, if the feature is disabled in the Attach Detach Controller before it is disabled in the Kubelet, volumes attached by the CSI driver cannot be detached by the in-tree volume plugin and will stay attached forever. In summary, this is what upstream recommends: @@ -72,9 +72,9 @@ In summary, this is what upstream recommends: 1. One-by-one, drain the nodes and start the Kubelet with the feature gate disabled. 2. Once that's done, disable the feature gate in all control-plane components. -In order to keep the Attach Detach Controller and the Kubelet in sync regarding using the CSI driver or the in-tree plugin, upstream has a mechanism to keep the Attach Detach Controller informed about the status of the migration on nodes. Roughly speaking, Kubelet propagates to an annotation the information for each migrated in-tree plugin on the node. +In order to keep the Attach Detach Controller and the Kubelet in sync regarding using the CSI driver or the in-tree plugin, upstream has a mechanism to keep the Attach Detach Controller informed about the status of the migration on nodes. Roughly speaking, Kubelet tracks this status via an annotation with the information for each migrated in-tree plugin on the node. -As a result, the Attach Detach Controller knows if the in-tree plugin has been migrated on the node. If the feature flags are enabled in Kube Controller Manager **and** on the node, the Attach Detach Controller uses the CSI driver to attach volumes. Otherwise, it will falls back to the in-tree plugin. +As a result, the Attach Detach Controller knows if the in-tree plugin has been migrated on the node. If the feature flags are enabled in Kube Controller Manager **and** on the node, the Attach Detach Controller uses the CSI driver to attach volumes. Otherwise, it will revert to the in-tree plugin. #### Upstream and OCP patches @@ -83,7 +83,7 @@ However, this approach alone is not acceptable for CSI migration because the fea That being said, **we propose to carry a patch in OCP's Attach Controller to force-enable the CSI Migration feature gates there**. -The carry-patch would only affect the Attach Detach Controller, other parts of Kube Controller Manager, like the PersistentVolume Controller, would still obey the flags passed to the Kube Controller Manager. +This carry-patch would only affect the Attach Detach Controller. Other parts of Kube Controller Manager, such as the PersistentVolume Controller, would still obey the flags passed to the Kube Controller Manager. In order to do this the least invasive way possible, we will refactor the upstream code to allow the carry-patch to be minimal and self-contained. An existing PoC [is available](https://github.com/openshift/kubernetes/pull/601) to demonstrate how this would look like. This patch would be carried over in OCP until all in-tree storage plugins are migrated to CSI, which should happen around 3 or 4 releases. @@ -144,7 +144,7 @@ Having said that, the items below only cover the initial targets for CSI Migrati 1. Tech Preview in 4.8: -* Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. This step only needs to be done once because the same object will contain all +* Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. * Make sure the *FeatureSet* used by [CCMO](https://github.com/openshift/enhancements/pull/463) contains the CSI migration feature flags enabled for the respective storage backend. * Introduce an upstream patch to refactor the Attach Detach Controller in order to make our carry-patch smaller and self-contained. * Introduce a carry-patch in OCP that enables CSI Migration for Cinder and GCP PD in Attach Detach Controller. @@ -199,7 +199,7 @@ We have considered this alternative approach. It is NOT our preferable approach - All operators will recognize the `CSIMigrationNode` *FeatureSet*, however, control-plane operators already applied the associated feature gates in the step above, so in practice only MCO will have work to do. * At this point, CSI migration is fully enabled in the cluster. 3. To disable CSI migration, the cluster administrator should perform the same steps in the opposite order: - * In `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* by `CSIMigrationControlPlane`. + * In the `featuregates/cluster` object, replace the `CSIMigrationNode` *FeatureSet* with `CSIMigrationControlPlane`. * Wait for all `CSINode` objects to have the annotation `storage.alpha.kubernetes.io/migrated-plugins` cleared. No storage plugins should be listed in this annotation. * Remove the `CSIMigrationControlPlane` *FeatureSet* from the `featuregates/cluster` object. 4. It is the **responsibility of the cluster administrator** to guarante the ordering described above is respected. @@ -208,7 +208,7 @@ We have considered this alternative approach. It is NOT our preferable approach Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default and the features will not be optional anymore. As a result, CSI migration will be enabled by default in OCP as well, and there will not be an option to disable it. -In addition that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. +In addition to that, the *FeatureSets* created to handle the Tech Preview feature will no longer be operational because the feature flags they enable will already be enabled in the cluster. It is important to note that the order in which operators are downgraded in OCP violates [upstream version skew policy](https://kubernetes.io/docs/setup/release/version-skew-policy). The policy states that a new Kubelet must never run with an older API server or Controller Manager. A direct consequence of this violation is the need to introduce a workaround to downgrade a cluster with CSI migration enabled. From 62b673f53a175a6cad24fe09a9ce4a0c3123d6a1 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Fri, 5 Mar 2021 17:46:21 +0100 Subject: [PATCH 25/28] Add note about upstream behaviour --- enhancements/storage/csi-migration.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index e5f72d811f..c9380602e5 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -79,7 +79,7 @@ As a result, the Attach Detach Controller knows if the in-tree plugin has been m #### Upstream and OCP patches In OCP, we can easily set those feature gates by using the [FeatureGate] (https://docs.openshift.com/container-platform/4.7/nodes/clusters/nodes-cluster-enabling-features.html) Custom Resource. OCP operators read this resource and restart their operands with the appropriate features enabled. -However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. That being said, our approach is intended to make the Attach Detach Controller resilient to this issue. +However, this approach alone is not acceptable for CSI migration because the feature flags might be switched across components in _any_ arbitrary order. Our solution is intended to make the Attach Detach Controller resilient to this issue. That being said, **we propose to carry a patch in OCP's Attach Controller to force-enable the CSI Migration feature gates there**. @@ -92,6 +92,8 @@ In subsequent OCP releases we would proceed with EBS, vSphere and Azure (_CSIMig With this patch, when deciding about using either the CSI driver or the in-tree plugin, the Attach Detach Controller will **only** rely on the information propagated by the node. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. +Note: one might wonder why upstream does not implement this behaviour already. The answer is that they do not have the same issues OCP has in regards to disabling feature flags in the wrong order across components. + #### Benefits The biggest benefit of this approach is that we do not need to worry about the ordering in which components are restarted with the CSI migration flags switched on or off. Migration flags can be switched on or off across components in any order. From f0db9da1d5dabfb699ac065eda12887dc6daeead Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Mon, 22 Mar 2021 11:43:24 +0100 Subject: [PATCH 26/28] Clarify use of FeatureSet --- enhancements/storage/csi-migration.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index c9380602e5..da9cacaa16 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -53,6 +53,8 @@ For GA, existing in-tree volumes will be migrated to use the CSI driver. Once th We propose to add a carry-patch to the Attach Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). +In addition to that, we propose to utilize the existing **TechPreviewNoUpgrade** *FeatureSet* to switch the feature on and off during the Tech Preview phase. + ### Implementation Details/Notes/Constraints Before getting to our proposal, we need to describe some of the upstream requirements for using CSI Migration. @@ -105,9 +107,9 @@ It is important to notice that, with this carry-patch in OCP, Attach Detach Cont #### Graduation -During Tech Preview in OCP 4.8, users can enable the CSI migration using a *FeatureSet*. It could be shared with [CCMO](https://github.com/openshift/enhancements/pull/463), but we may want to have a specific *FeatureSet* only for CSI Migration for users that want to enable CSI Migration without migrating to an external cloud provider. +During the Tech Preview phase in OCP 4.8, users can enable and disable the CSI migration feature using the **TechPreviewNoUpgrade** *FeatureSet*. Among other things, this *FeatureSet* switches the complete [Cloud Migration](https://github.com/openshift/enhancements/pull/463) in OCP 4.8. -Once CSI migration reaches GA in upstream, the associated feature gates will be enabled by default. As a result, it will remove the carry-patch above. Also, users will not have to use *FeatureSets* anymore. +Once CSI migration of all storage plugins reaches GA in upstream, the associated feature gates will be enabled by default in OCP. As a result, we will remove the carry-patch we introduced in OCP 4.8. Also, users will not have to use the **TechPreviewNoUpgrade** *FeatureSet* anymore. ### Risks and Mitigations From 991c14db6de53c17e0de479389e0acb95c056bc1 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Mon, 22 Mar 2021 13:07:00 +0100 Subject: [PATCH 27/28] Add FeatureSets configuration examples --- enhancements/storage/csi-migration.md | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index da9cacaa16..6c677d57ed 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -107,7 +107,31 @@ It is important to notice that, with this carry-patch in OCP, Attach Detach Cont #### Graduation -During the Tech Preview phase in OCP 4.8, users can enable and disable the CSI migration feature using the **TechPreviewNoUpgrade** *FeatureSet*. Among other things, this *FeatureSet* switches the complete [Cloud Migration](https://github.com/openshift/enhancements/pull/463) in OCP 4.8. +During the Tech Preview phase in OCP 4.8, users can enable and disable the CSI migration feature using the **TechPreviewNoUpgrade** *FeatureSet*. Among other things, this *FeatureSet* switches the complete [Cloud Migration](https://github.com/openshift/enhancements/pull/463) in OCP 4.8. Here is a configuration example: + +```yaml +apiVersion: config.openshift.io/v1 +kind: FeatureGate +metadata: + name: cluster +spec: + featureSet: TechPreviewNoUpgrade +``` + +Optionally, users can enable the CSI Migration *without* enabling the Cloud Migration by using the **CustomNoUpgrade** *FeatureSet*. Here is a configuration example to enable the migration for AWS EBS: + +```yaml +apiVersion: config.openshift.io/v1 +kind: FeatureGate +metadata: + name: cluster +spec: + featureSet: CustomNoUpgrade + customNoUpgrade: + enabled: + - CSIMigrationAWS +``` + Once CSI migration of all storage plugins reaches GA in upstream, the associated feature gates will be enabled by default in OCP. As a result, we will remove the carry-patch we introduced in OCP 4.8. Also, users will not have to use the **TechPreviewNoUpgrade** *FeatureSet* anymore. From 4024bccdf38b3d89654dd313b42c125675882512 Mon Sep 17 00:00:00 2001 From: Fabio Bertinatto Date: Wed, 31 Mar 2021 17:32:11 +0200 Subject: [PATCH 28/28] EBS is TP in 4.8, GCE in 4.9 --- enhancements/storage/csi-migration.md | 27 ++++++++++++++------------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/enhancements/storage/csi-migration.md b/enhancements/storage/csi-migration.md index 6c677d57ed..c8a8739109 100644 --- a/enhancements/storage/csi-migration.md +++ b/enhancements/storage/csi-migration.md @@ -7,8 +7,8 @@ reviewers: approvers: - "@openshift/openshift-architects" creation-date: 2020-07-29 -last-updated: 2021-03-05 -status: provisional +last-updated: 2021-03-31 +status: implementable see-also: https://github.com/openshift/enhancements/pull/463 replaces: superseded-by: @@ -18,7 +18,7 @@ superseded-by: ## Release Signoff Checklist -- [ ] Enhancement is `implementable` +- [x] Enhancement is `implementable` - [ ] Design details are appropriately documented from clear requirements - [ ] Test plan is defined - [ ] Graduation criteria for dev preview, tech preview, GA @@ -51,7 +51,7 @@ For GA, existing in-tree volumes will be migrated to use the CSI driver. Once th ## Proposal -We propose to add a carry-patch to the Attach Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and GCP, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). +We propose to add a carry-patch to the Attach Detach Controller in OCP that enables the migration of some storage plugins. Initially we would start with Cinder and EBS, so that we are aligned with the goals of [CCMO](https://github.com/openshift/enhancements/pull/463). In addition to that, we propose to utilize the existing **TechPreviewNoUpgrade** *FeatureSet* to switch the feature on and off during the Tech Preview phase. @@ -89,8 +89,8 @@ This carry-patch would only affect the Attach Detach Controller. Other parts of In order to do this the least invasive way possible, we will refactor the upstream code to allow the carry-patch to be minimal and self-contained. An existing PoC [is available](https://github.com/openshift/kubernetes/pull/601) to demonstrate how this would look like. This patch would be carried over in OCP until all in-tree storage plugins are migrated to CSI, which should happen around 3 or 4 releases. -Initially we would start with Cinder and GCP (_CSIMigrationGCE_ and _CSIMigrationCinder_ flags), so that we are aligned with the goals of both upstream and [CCMO](https://github.com/openshift/enhancements/pull/463). -In subsequent OCP releases we would proceed with EBS, vSphere and Azure (_CSIMigrationAWS_, _CSIMigrationvSphere_ and _CSIMigrationAzureFile_), without a predefined order. +Initially we would start with Cinder and EBS (_CSIMigrationAWS_ and _CSIMigrationCinder_ flags), so that we are aligned with the goals of both upstream and [CCMO](https://github.com/openshift/enhancements/pull/463). +In subsequent OCP releases we would proceed with GCP, vSphere and Azure (_CSIMigrationGCE_, _CSIMigrationvSphere_ and _CSIMigrationAzureFile_), without a predefined order. With this patch, when deciding about using either the CSI driver or the in-tree plugin, the Attach Detach Controller will **only** rely on the information propagated by the node. In other words, Attach Detach Controller will start considering which plugin to use (in-tree or CSI) on a node basis only. @@ -166,30 +166,30 @@ Once CSI migration is GA, we expect the regular upgrade jobs will cover upgrades ### Graduation Criteria -Each storage plugin will be migrated at their own paces and will have different support phases across OCP releases. For instance, GCP and Cinder migration would be Tech Preview and OCP 4.8 and GA in 4.9. AWS EBS will _most likely_ be Tech Preview in 4.9 and GA in 4.10. +Each storage plugin will be migrated at their own paces and will have different support phases across OCP releases. For instance, EBS and Cinder migration would be Tech Preview and OCP 4.8 and GA in 4.9. GCP PD will most likely be Tech Preview in 4.9 and GA in 4.10. -Having said that, the items below only cover the initial targets for CSI Migration: GCP PD and Cinder. +Having said that, the items below only cover the initial targets for CSI Migration: AWS EBS and Cinder. 1. Tech Preview in 4.8: * Introduce a new *FeatureSet* in openshift/api called `CSIMigration`. * Make sure the *FeatureSet* used by [CCMO](https://github.com/openshift/enhancements/pull/463) contains the CSI migration feature flags enabled for the respective storage backend. * Introduce an upstream patch to refactor the Attach Detach Controller in order to make our carry-patch smaller and self-contained. -* Introduce a carry-patch in OCP that enables CSI Migration for Cinder and GCP PD in Attach Detach Controller. +* Introduce a carry-patch in OCP that enables CSI Migration for Cinder and EBS in Attach Detach Controller. 2. GA in 4.9: -* Update carry-patch to not force-enable CSI Migration for Cinder and GCP PD. Since the migration is enabled by default, there is no need to force-enable the migration. +* Update carry-patch to not force-enable CSI Migration for Cinder and EBS. Since the migration is enabled by default, there is no need to force-enable the migration. -Assuming the next storage plugin to be migrated is AWS EBS, the steps below represent what needs to be done for that plugin: +Assuming the next storage plugin to be migrated is GCP PD, the steps below represent what needs to be done for that plugin: 1. Tech Preview in 4.9: -* Update the carry-patch in Attach Detach Controller to force-enable the CSI Migration for EBS. +* Update the carry-patch in Attach Detach Controller to force-enable the CSI Migration for GCP PD. 2. GA in 4.10: -* Update the carry-patch in Attach Detach Controller to **not** force-enable the CSI Migration for EBS. +* Update the carry-patch in Attach Detach Controller to **not** force-enable the CSI Migration for GCP PD. Once CSI Migration is GA for all storage plugins have been completely migrated to CSI, we can: @@ -204,6 +204,7 @@ Main events only, this is not a faithful history. 2020-07-29: Initial enhancement draft. 2021-01-28: Re-worked proposal to use 2 _FeatureSets_ with manual application by the user. 2021-03-05: Re-worked proposal to use carry-patch instead. Moved previous approach to "Alternatives" for reference. +2021-03-31: Swapped GCE migration for EBS: GCP will be Tech Preview in 4.9, EBS in 3.8. ## Alternatives