STOR-1803: add vsphere snapshot configuration fields to ClusterCSIDriver#1783
Conversation
|
@RomanBednar: This pull request references STOR-1803 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Hello @RomanBednar! Some important instructions when contributing to openshift/api: |
|
@RomanBednar: This pull request references STOR-1803 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
b443ef7 to
c12e3f4
Compare
|
verify-crd-schema error is probably not related to this change as I have not changed |
c12e3f4 to
489a2ec
Compare
489a2ec to
ed326f9
Compare
| // +kubebuilder:validation:Optional | ||
| // +openshift:enable:FeatureSets=TechPreviewNoUpgrade | ||
| // +optional | ||
| GlobalMaxSnapshotsPerBlockVolume *uint32 `json:"globalMaxSnapshotsPerBlockVolume,omitempty"` |
There was a problem hiding this comment.
For all new fields:
- There should be some validation, for example the value should not be negative. Or semantics of negative values should be described.
- Are there any max values?
- Value
0should have some description - is it a valid value? Does it turn off snapshot support for VSAN / VVOL / globally? - What are the impacts of setting the values high? I would expect higher values mean less performance, otherwise we can set the values to MAXINT by default and we don't need any configuration.
- It could help if there was link to vSphere docs, the questions could be answered there.
There was a problem hiding this comment.
There should be some validation, for example the value should not be negative. Or semantics of negative values should be described.
- Correct - currently the operator would fail to unmarshal as the field is uint32 type
- Adding validation for min/max volumes.
W0312 17:09:26.526662 8935 reflector.go:539] k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: failed to list *v1.ClusterCSIDriver: json: cannot unmarshal number -1 into Go struct field VSphereCSIDriverConfigSpec.items.spec.driverConfig.vSphere.globalMaxSnapshotsPerBlockVolume of type uint32
E0312 17:09:26.526720 8935 reflector.go:147] k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: Failed to watch *v1.ClusterCSIDriver: failed to list *v1.ClusterCSIDriver: json: cannot unmarshal number -1 into Go struct field VSphereCSIDriverConfigSpec.items.spec.driverConfig.vSphere.globalMaxSnapshotsPerBlockVolume of type uint32
Are there any max values?
- From VMware documentation:
Maximum of 32 snapshots are **supported** in a chain. However, for a better performance use only 2 to 3 snapshots. - Adding validation for max value of 32.
- Setting any higher value in the config is not prevented currently:
[Snapshot]
global-max-snapshots-per-block-volume = 33
Value
0should have some description - is it a valid value? Does it turn off snapshot support for VSAN / VVOL / globally?
- Not documented at all, after testing it seems like
0value results in disabling the option and default is used. - Snapshots are not disabled when
0is set - Setting min value to 1 should work fine as there is no reason to set
0
$ oc -n openshift-cluster-csi-drivers get cm/vsphere-csi-config -o yaml | grep global-max-snapshots-per-block-volume
global-max-snapshots-per-block-volume = 0
message: 'Failed to create snapshot: failed to take snapshot of the volume c0e2dc41-33de-4566-9323-9aa422741493: "rpc error: code = FailedPrecondition desc = the number of snapshots on the source volume c0e2dc41-33de-4566-9323-9aa422741493 reaches the configured maximum (3)"'
What are the impacts of setting the values high? I would expect higher values mean less performance, otherwise we can set the values to MAXINT by default and we don't need any configuration.
- This is documented under "Best practices for snapshots": https://kb.vmware.com/s/article/1025279
- 2-3 snapshots are recommended for better performance so we can expect a performance drop when using more than 3 snapshots
It could help if there was link to vSphere docs, the questions could be answered there.
- Do we actually want to link to external documentation from comments here?
- Adding the link to relevant doc: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html
There was a problem hiding this comment.
Thanks for the update. I think linking vSphere docs will help users to actually understand what the options are good for and they don't need to guess.
| // overrides the global constraint if set, while it falls back to the global constraint if unset. | ||
| // +kubebuilder:validation:Optional |
There was a problem hiding this comment.
IMO it would be better to use overrides globalMaxSnapshotsPerBlockVolume instead of the global constraint.
204a82d to
e8fa952
Compare
f9ce480 to
7ea864d
Compare
|
The API looks good to me, you probably need to fix the tests somehow. |
7ea864d to
3277e7c
Compare
|
Yeah, so there are some changes in progress to change the process so new API fields are not enabled by feature set but feature gate, documentation should arrive soon. From current observations it seems there should be 3 separate files now for clustercsidriver CRD, that means we have to change CSO image a bit. Then verification requires a test file for each of those so I've added that, with a small test case for TechPreviewNoUpgrade with new fields - The |
No. The test works in other PRs without any issues. |
1020a5b to
3493251
Compare
|
@deads2k Given the simplicity and potential cadence of similar PRs we want to release this without TechPreview. With the new process which is based on feature gates, can this be achieved by simply including the feature gate in |
|
/lgtm |
3493251 to
61f5f93
Compare
| TopologyCategories []string `json:"topologyCategories,omitempty"` | ||
|
|
||
| // globalMaxSnapshotsPerBlockVolume is a global configuration parameter that applies to volumes on all kinds of | ||
| // datastores. If unset it defaults to 3. |
There was a problem hiding this comment.
Can we do something like
If omitted, the platform chooses a default, which is subject to change over time, currently that default is 3.
This matches other fields. Rather than promising a value.
There was a problem hiding this comment.
Sounds good, changed.
| // datastores. If unset it defaults to 3. | ||
| // Increasing number of snapshots above 3 can have negative impact on performance, for more details see: https://kb.vmware.com/s/article/1025279 | ||
| // Volume snapshot documentation: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html | ||
| // +kubebuilder:validation:Minimum=1 |
There was a problem hiding this comment.
how do I choose: "no snapshots"?
There was a problem hiding this comment.
Already discussed here: #1783 (comment)
vmWare does not document this value but testing showed that it does not disable snapshots, but default is applied instead. I can add a clarifying statement:
Setting this value to 0 does not disable volume snapshots, but results in default value being used.
There was a problem hiding this comment.
Already discussed here: #1783 (comment)
Please add a sentence in the documentation indicating that snapshots cannot be disabled. On this item and the others if they cannot be disabled either.
There was a problem hiding this comment.
Added, for all 3 options as it's applicable to all of them.
config/v1/feature_gates.go
Outdated
| reportProblemsToJiraComponent("Storage / Kubernetes External Components"). | ||
| contactPerson("rbednar"). | ||
| productScope(ocpSpecific). | ||
| enableIn(Default, TechPreviewNoUpgrade). |
There was a problem hiding this comment.
TechPreview first please. Demonstrate stability (either via automation or QE sign off) and we can promote in 4.16.
There was a problem hiding this comment.
I know TechPreview first is the safest and most common option, however this change is relatively trivial and only allows setting options of an existing driver that are not new. We're expecting more requests similar to this one from customers and would like to see if there's a possibility to ship this in shorter time than two cycles.
Would it be possible to have this enabled by default in 4.16 (without TechPreview) if we have a sign off from QE?
There was a problem hiding this comment.
Before becoming accessible-by-default, we must have evidence of completeness and reliability. The preferred mechanism is to run automated CI tests in our existing TechPreview jobs and see 95+% pass rates over at least 14 runs. The alternative mechanism is to have QE sign off on a promotion PR. That can all happen in a single release. For instance,
- in 4.16
- merge as techpreview
- add functionality and tests
- after there are 14 runs, open PR promoting to GA
- CI automatically checks for the associated tests (see example failure here)
- PR is merged on green OR QE signs off on the PR and we override the automated stability check (we track data on this now in 4.16)
- we ship 4.16
So this doesn't delay availability in 4.16 assuming the feature is tested and is as stable as expected
There was a problem hiding this comment.
@deads2k Removed Default cluster profile for this feature, can we merge this now as TechPreview?
| // If cluster Infrastructure object has a topology, values specified in | ||
| // Infrastructure object will be used and modifications to topologyCategories | ||
| // will be rejected. | ||
| // +listType=set |
There was a problem hiding this comment.
this should have failed our CRD schema checker. Why would this be a safe change to make for clients?
There was a problem hiding this comment.
Previously the checker was failing without listType, I've retried it again now and with the recent changes verification is passing without it - dropping this change.
61f5f93 to
5554b88
Compare
|
@deads2k PTAL again |
openapi/openapi.json
Outdated
| "type": "object", | ||
| "properties": { | ||
| "globalMaxSnapshotsPerBlockVolume": { | ||
| "description": "globalMaxSnapshotsPerBlockVolume is a global configuration parameter that applies to volumes on all kinds of datastores. If omitted, the platform chooses a default, which is subject to change over time, currently that default is 3. Setting this value to 0 does not disable volume snapshots, but results in default value being used. Increasing number of snapshots above 3 can have negative impact on performance, for more details see: https://kb.vmware.com/s/article/1025279 Volume snapshot documentation: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/3.0/vmware-vsphere-csp-getting-started/GUID-E0B41C69-7EEB-450F-A73D-5FD2FF39E891.html", |
There was a problem hiding this comment.
@RomanBednar If we mentioned the value could be set to 0(TBH, I think we could reject the 0 value safely.), it seems we also need to change the minmum value setting in https://github.com/openshift/api/pull/1783/files#diff-5cb495d65e27aec64145d4043a9e32d029bab426765d92b147556d459be5735bR219 , right?
There was a problem hiding this comment.
@Phaow The comment about 0 value is meant to be rather informative but not needed, and just explains why 0 does not make sense. I can write it better or remove it. The current validation (minimum: 1) is rejecting 0 which is correct, or am I missing something?
There was a problem hiding this comment.
@RomanBednar Yeah, current validation (minimum: 1) is rejecting 0 looks good to me, just a bit strange that we already reject 0, just the explain if set to 0 which will still use the default maybe make a bit confused.
5554b88 to
8ff768f
Compare
8ff768f to
d7b9ecd
Compare
|
@RomanBednar: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, jsafrane, RomanBednar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
cc @openshift/storage
Enhancement: openshift/enhancements#1563