allow more than one featureset #821
/retest
pkg/cvo/cvo_scenarios_test.go
Outdated
},
"exclude-test",
false,
"Default",
unclear to me if Default or the empty-string is the default value. Whichever way we go, can you doc it in at least the Operator property declaration?
pkg/cvo/sync_worker.go
Outdated
// includeTechPreview is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
includeTechPreview bool
// requiredFeatureSet is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
nit: Replace to true, etc. with whatever the string semantics are supposed to be.
pkg/start/start.go
Outdated
switch {
case apierrors.IsNotFound(err):
	includeTechPreview = false // if we have no featuregates, then we aren't tech preview
	startingFeatureSet = "" // if we have no featuregates, then we assume the default featureset, which is "".
it needs to be "then we exclude everything that could possibly depend on the current feature set".
pkg/start/start.go
Outdated
}
requiredFeatureSetAnnotationValue := "Default" // When the featureset is "", we require it to say "Default"
if len(startingFeatureSet) > 0 {
	requiredFeatureSetAnnotationValue = startingFeatureSet
do we need to make this call way up the stack in start.go? It feels like something we could push down into library-go.
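The translation under discussion, where an empty cluster feature set is spelled "Default" in manifest annotations, can be sketched as a small helper. This is only an illustration of the snippet above, not the library-go code; the function name `annotationValueFor` is hypothetical.

```go
package main

import "fmt"

// annotationValueFor is a hypothetical helper. It maps the cluster's feature
// set (the FeatureGate spec.featureSet value) to the value a manifest
// annotation has to contain. The empty string means the default feature set
// and is spelled "Default" in annotations, as in the snippet above.
func annotationValueFor(clusterFeatureSet string) string {
	if len(clusterFeatureSet) > 0 {
		return clusterFeatureSet
	}
	return "Default"
}

func main() {
	fmt.Println(annotationValueFor(""))
	fmt.Println(annotationValueFor("TechPreviewNoUpgrade"))
}
```

Keeping the translation in one function would make it easy to move down the stack later, which is what the comment above asks about.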
featureSetAnnotationValues := strings.Split(featureSetAnnotationValue, ",")
for _, manifestFeatureSet := range featureSetAnnotationValues {
	if !knownFeatureSets.Has(manifestFeatureSet) {
		// never include the manifest if the feature-set annotation is outside of allowed values (only TechPreviewNoUpgrade and "" are currently allowed)
Can we drop (only TechPreviewNoUpgrade and "" are currently allowed), because I expect it would go stale, and may already be stale with your configv1.FeatureSets source for knownFeatureSets.
func getFeatureSets(annotations map[string]string) (sets.String, bool, error) {
	ret := sets.String{}
	specified := false
	for _, featureSetAnnotation := range []string{"release.openshift.io/feature-gate", "release.openshift.io/feature-set"} {
nit: use your featureSetAnnotation for one of these?
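For readers following along, here is a rough, standard-library-only sketch of what a function like `getFeatureSets` might do: read both annotation names, split comma-separated values, and report whether any annotation was present. The name `parseFeatureSets` and the plain slice return type are assumptions for illustration; the real function returns a `sets.String` and an error, and its validation against known feature sets is omitted here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Annotation names taken from the snippet above: the older feature-gate name
// and the newer feature-set name.
const (
	featureGateAnnotation = "release.openshift.io/feature-gate"
	featureSetAnnotation  = "release.openshift.io/feature-set"
)

// parseFeatureSets is a hypothetical stand-in for getFeatureSets. It returns
// the sorted, de-duplicated feature sets named by either annotation, and
// whether any such annotation was present at all.
func parseFeatureSets(annotations map[string]string) ([]string, bool) {
	seen := map[string]struct{}{}
	specified := false
	for _, name := range []string{featureGateAnnotation, featureSetAnnotation} {
		value, ok := annotations[name]
		if !ok {
			continue
		}
		specified = true
		for _, featureSet := range strings.Split(value, ",") {
			seen[featureSet] = struct{}{}
		}
	}
	result := make([]string, 0, len(seen))
	for featureSet := range seen {
		result = append(result, featureSet)
	}
	sort.Strings(result)
	return result, specified
}

func main() {
	featureSets, specified := parseFeatureSets(map[string]string{
		featureSetAnnotation: "Default,TechPreviewNoUpgrade",
	})
	fmt.Println(featureSets, specified)
}
```

Naming the two annotation strings as constants also addresses the nit above about reusing `featureSetAnnotation` instead of repeating the literal.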
if featureGateAnnotationExists && featureGateAnnotationValue != string(configv1.TechPreviewNoUpgrade) {
	return fmt.Errorf("unrecognized value %s=%s", featureGateAnnotation, featureGateAnnotationValue)
if manifestSpecifiesFeatureSets && !manifestFeatureSets.Has(*requiredFeatureSet) {
	return fmt.Errorf("%q is required, and %s=%s", *requiredFeatureSet, featureSetAnnotation, strings.Join(manifestFeatureSets.List(), ","))
nit: this assumes featureSetAnnotation, and not the old annotation name. Is the idea that we drop the old name before we GA 4.12, once existing folks have migrated?
nit: this assumes featureSetAnnotation, and not the old annotation name. Is the idea that we drop the old name before we GA 4.12, once existing folks have migrated?
correct
pkg/cvo/cvo.go
Outdated
// includeTechPreview is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
// requiredFeatureSet is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
// label set. This is set based on whether the featuregates.config.openshift.io|.spec.featureSet is set to "TechPreviewNoUpgrade".
set to true and the TechPreviewNoUpgrade focus are stale.
pkg/cvo/sync_worker.go
Outdated
// requiredFeatureSet is set to the value of Feature.config.openshift.io|spec.featureSet.
// The CVO should create resources with the `annotations[release.openshift.io/feature-set]` unset or if
// the annotation is set, it must contain the requiredFeatureSet.
// The library called by the CVO translates "" into "Default" to ease usage.
We might not need to get into this level of detail about what's going on inside the vendored library. Maybe something more generic like:
requiredFeatureSet is set to the value of Feature.config.openshift.io|spec.featureSet, which contributes to whether or not some manifests are included for reconciliation.
)
// TechPreviewChangeStopper calls stop when the value of the featuregate changes from TechPreviewNoUpgrade to anything else
// or from anything to TechPreviewNoUpgrade.
Too late for us to rename this to FeatureChangeStopper to match the directory name you initially selected? Might be able to do that with:
$ sed -i 's/TechPreviewChangeStopper/FeatureChangeStopper/g' pkg/featurechangestopper/*
$ mv pkg/featurechangestopper/{techpreview,feature}changestopper.go
$ mv pkg/featurechangestopper/{techpreview,feature}changestopper_test.go
or similar.
includeTechPreview = false // if we have no featuregates, then we aren't tech preview
// if we have no featuregates, then we assume the default featureset, which is "".
// This excludes everything that could possibly depend on a different feature set.
startingFeatureSet = ""
Do we need to assume here? If we have no cluster FeatureGate, can we be in any other set? Isn't that a clear (if implicit) declaration that we're in the default set?
pkg/start/start.go
Outdated
	// This excludes everything that could possibly depend on a different feature set.
	startingFeatureSet = ""
case err != nil:
	klog.Warningf("Error getting featuregate value: %v", err)
Hmm, we may need a pointer for passing this thing around (luckily, library-go is already using a pointer). openshift/enhancements#1227 currently includes:
If a special manifest is required for a FeatureSet that allows changes (LatencySensitive to Default for instance), the person adding the manifest will have to add the appropriate delete-annotated manifest in Default. This is expected to be a rare occurrence of a rare occurrence.
So there's a risk here of:
- User configures LatencySensitive.
- Resource A, that only happens in LatencySensitive, rolls out.
- Happy days.
- Something bumps the CVO.
- New CVO comes up, but network hiccup fails the FeatureGate GET.
- We enter this err != nil case, and warn, but leave startingFeatureSet == "".
- Reconciling manifests, we exclude the LatencySensitive resource A manifest (meh), but include the deletion manifest (!), and remove resource A from the cluster.
- Change-detector has a successful GET, notices the mistake, and shuts down the CVO.
- Replacement CVO has a successful GET, realizes it's LatencySensitive, and rolls resource A back into the cluster.
That's not the end of the world, but I'd feel more comfortable with pointers tracking whether we knew or not about the feature gate, and... hmm, I guess there's not a clear way to tell library-go to exclude everything that declares a feature-set annotation. Maybe we don't need a pointer, but we do need a dummy value here like:
startingFeatureSet = "not sure, so exclude everything that might care"?
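The dummy-value idea above can be sketched as a three-way decision: a missing FeatureGate means the default set, a failed GET means "unknown", and a successful GET uses the configured value. This is a simplified illustration, not the code in start.go: `errNotFound` stands in for the `apierrors.IsNotFound` check, and `unknownFeatureSet`, `errTransient`, and `resolveStartingFeatureSet` are hypothetical names. It assumes that no manifest annotation ever lists the sentinel, so every manifest that declares a feature set would be excluded while the value is unknown.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the apierrors.IsNotFound case in start.go.
var errNotFound = errors.New("featuregate not found")

// errTransient is a hypothetical example of any other GET failure.
var errTransient = errors.New("connection refused")

// unknownFeatureSet is a hypothetical sentinel. It is assumed never to appear
// in a manifest annotation, so annotated manifests cannot match it.
const unknownFeatureSet = "unknown-so-exclude-everything-that-might-care"

// resolveStartingFeatureSet picks the feature set used for manifest selection.
func resolveStartingFeatureSet(configured string, err error) string {
	switch {
	case errors.Is(err, errNotFound):
		// No FeatureGate object: implicitly the default feature set.
		return ""
	case err != nil:
		// We could not tell, so avoid claiming the default set.
		return unknownFeatureSet
	default:
		return configured
	}
}

func main() {
	fmt.Printf("%q\n", resolveStartingFeatureSet("", errNotFound))
	fmt.Printf("%q\n", resolveStartingFeatureSet("", errTransient))
	fmt.Printf("%q\n", resolveStartingFeatureSet("LatencySensitive", nil))
}
```

With a sentinel like this, the network-hiccup case in the timeline above would skip both the LatencySensitive manifest and the Default deletion manifest, instead of removing resource A and then restoring it.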
pkg/start/start.go
Outdated
default:
	includeTechPreview = gate.Spec.FeatureSet == configv1.TechPreviewNoUpgrade
	// otherwise, you're the default
what is this comment about? Just reminding us that we're in the default: case?
pushed another commit for the comments.
// TechPreviewChangeStopper calls stop when the value of the featuregate changes from TechPreviewNoUpgrade to anything else
// FeatureChangeStopper calls stop when the value of the featuregate changes from TechPreviewNoUpgrade to anything else
// or from anything to TechPreviewNoUpgrade.
Stale TechPreviewNoUpgrade focus. Maybe:
FeatureChangeStopper calls stop when the value of the feature-set changes.
?
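The suggested wording ("calls stop when the value of the feature-set changes") corresponds to a simple comparison against the value seen at startup. The sketch below is a stripped-down illustration with hypothetical names (`featureChangeStopper`, `observe`); the real type lives in pkg/featurechangestopper and watches the cluster FeatureGate, which is not modeled here.

```go
package main

import (
	"context"
	"fmt"
)

// featureChangeStopper is a simplified, hypothetical stand-in: it remembers
// the feature set the process started with and a function to call when the
// observed feature set differs.
type featureChangeStopper struct {
	startingFeatureSet string
	stop               context.CancelFunc
}

// observe compares a newly observed feature set with the starting one. It
// calls stop and returns true on any change, regardless of which sets are
// involved.
func (s *featureChangeStopper) observe(current string) bool {
	if current == s.startingFeatureSet {
		return false
	}
	s.stop()
	return true
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	stopper := &featureChangeStopper{startingFeatureSet: "", stop: cancel}

	fmt.Println(stopper.observe(""), ctx.Err())
	fmt.Println(stopper.observe("TechPreviewNoUpgrade"), ctx.Err())
}
```

Nothing in the comparison is specific to TechPreviewNoUpgrade, which is why the generic doc comment fits better than the stale one.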
wking
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, wking
/retest
@deads2k: all tests passed!
This was added in openshift/cluster-version-operator#821 to allow more featuresets and allow for a future migration to include actual gates
Originally, all component operators were responsible for creating their own ClusterOperator, and we'd just watch to make sure we were happy enough with what they did. However, on install, or when updating to a version that added a new component, we could have timelines like:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff, and because the ClusterOperator doesn't exist yet, we get no data about why the deployment didn't come up.

So in 2a469e3 (cvo: When installing or upgrading, fast-fill cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator pre-creation to get:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff, and finds the one the CVO had pre-created with hard-coded relatedObjects, gathers stuff from the referenced operator namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create their own ClusterOperator, because that was the only path before the CVO learned about pre-creation. And even since then, most new operators come into the cluster on install or on update, when the CVO is pre-creating.

New in 4.12, the platform-operator is coming in [1], and it has two relevant characteristics:

* It does not know how to create the platform-operators-aggregated ClusterOperator [2].
* It is gated behind TechPreviewNoUpgrade [3].

So we are exposed to:

1. Admin installs a cluster. No platform-operators-aggregated, because it's not TechPreviewNoUpgrade.
2. Install complete. CVO transitions to reconciling mode.
3. Admin enables TechPreviewNoUpgrade.
4. CVO notices, and reboots fc00c62 (update the manifest selection to honor any featureset, 2022-08-17, openshift#821).
5. Because we decided to not transition into updating mode for feature-set changes, we stay in reconciling mode.
6. Because we're in reconciling mode, we skip the ClusterOperator pre-creation, and get right in to the status check.
7. Because the platform operator didn't create the ClusterOperator either, the CVO's status check fails with [2]:

  45657:E0923 01:43:25.610286 1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check succeeds (at least, after the operator writes acceptable status to the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster resources had been deleted, assuming the CVO was still alive.

There may be other component operators who rely on the CVO for ClusterOperator creation, but which we haven't noticed because they aren't also gated behind TechPreviewNoUpgrade.

[1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md
[2]: https://issues.redhat.com/browse/OCPBUGS-1636
[3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8
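The behavior change described in that commit message can be modeled as a toy step planner: before the change, pre-creation is skipped in reconciling mode; after it, pre-creation always runs. Everything here (the `mode` type, its constants, `syncSteps`, and the step strings) is hypothetical and only mirrors the prose; it is not the CVO's sync worker.

```go
package main

import "fmt"

// mode is a hypothetical label for what the CVO is currently doing.
type mode string

const (
	installing  mode = "Installing"
	updating    mode = "Updating"
	reconciling mode = "Reconciling"
)

// syncSteps lists the steps a sync pass would take. With precreateAlways set
// to false it models the old, conditional behavior (no pre-creation while
// reconciling); with true it models the unconditional behavior described in
// the commit message.
func syncSteps(m mode, precreateAlways bool) []string {
	var steps []string
	if precreateAlways || m != reconciling {
		steps = append(steps, "pre-create ClusterOperators")
	}
	steps = append(steps, "apply manifests", "check ClusterOperator status")
	return steps
}

func main() {
	fmt.Println(syncSteps(reconciling, false))
	fmt.Println(syncSteps(reconciling, true))
}
```

In the old flow, a component that is newly included while reconciling (for example after a feature-set change) reaches the status check with no ClusterOperator in place, which is the failure quoted in the commit message.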
This is a sketch of how we could allow other featuresets like "" to be specified in the annotations.
If no annotation is specified, then it's always present. If any release.openshift.io/feature-gate is specified then the manifest is only used when the value matches the currently specified featureset. Also, present-me wishes that past-me had distinguished between gates and sets better. I wonder if I can change the world still. I might be able to transition during 4.12.
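The selection rule in this description can be written down in a few lines. This is a sketch of the rule as stated here, not the merged implementation: `includeManifest` is a hypothetical name, comma-separated values are assumed from the review snippets, and the annotation value is compared to the cluster feature set directly (the review thread later spells the default set as "Default" in annotations).

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation name as given in the PR description.
const featureGateAnnotation = "release.openshift.io/feature-gate"

// includeManifest applies the rule from the description: a manifest without
// the annotation is always used, and an annotated manifest is used only when
// one of its listed values matches the cluster's current feature set.
func includeManifest(annotations map[string]string, clusterFeatureSet string) bool {
	value, ok := annotations[featureGateAnnotation]
	if !ok {
		return true
	}
	for _, featureSet := range strings.Split(value, ",") {
		if featureSet == clusterFeatureSet {
			return true
		}
	}
	return false
}

func main() {
	techPreviewOnly := map[string]string{featureGateAnnotation: "TechPreviewNoUpgrade"}

	fmt.Println(includeManifest(map[string]string{}, ""))
	fmt.Println(includeManifest(techPreviewOnly, ""))
	fmt.Println(includeManifest(techPreviewOnly, "TechPreviewNoUpgrade"))
}
```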