Skip to content

Conversation

@deads2k
Copy link
Contributor

@deads2k deads2k commented Aug 15, 2022

This is a sketch of how we could allow other featuresets like "" to be specified in teh annotations.

If no annotation is specified, then it's always present. If any release.openshift.io/feature-gate is specified then the manifest only used when the value matches the currently specified featureset. Also, present-me wishes that past-me had distinguished between gates and sets better. I wonder if I can change the world still. I might be able to transition during 4.12.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 15, 2022
@openshift-ci openshift-ci bot requested review from vrutkovs and wking August 15, 2022 21:13
@deads2k
Copy link
Contributor Author

deads2k commented Aug 17, 2022

waiting for the deletion of Security Rule

/retest

@deads2k deads2k changed the title [wip] allow more than one featureset allow more than one featureset Aug 17, 2022
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 17, 2022
},
"exclude-test",
false,
"Default",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclear to me if Default or the empty-string is the default value. Whichever way we go, can you doc it in at least the Operator property declaration?


// includeTechPreview is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
includeTechPreview bool
// requiredFeatureSet is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Replace to true, etc. with whatever the string semantics are supposed to be.

switch {
case apierrors.IsNotFound(err):
includeTechPreview = false // if we have no featuregates, then we aren't tech preview
startingFeatureSet = "" // if we have no featuregates, then we assume the default featureset, which is "".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it needs to be "then we exclude everything that could possibly depend on the current feature set".

}
requiredFeatureSetAnnotationValue := "Default" // When the featureset is "", we require it to say "Default"
if len(startingFeatureSet) > 0 {
requiredFeatureSetAnnotationValue = startingFeatureSet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to make this call way up the stack in start.go? It feels like something we could push down into library-go.

featureSetAnnotationValues := strings.Split(featureSetAnnotationValue, ",")
for _, manifestFeatureSet := range featureSetAnnotationValues {
if !knownFeatureSets.Has(manifestFeatureSet) {
// never include the manifest if the feature-set annotation is outside of allowed values (only TechPreviewNoUpgrade and "" are currently allowed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop (only TechPreviewNoUpgrade and "" are currently allowed), because I expect it would go stale, and may already be stale with your configv1.FeatureSets source for knownFeatureSets.

func getFeatureSets(annotations map[string]string) (sets.String, bool, error) {
ret := sets.String{}
specified := false
for _, featureSetAnnotation := range []string{"release.openshift.io/feature-gate", "release.openshift.io/feature-set"} {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use your featureSetAnnotation for one of these?

if featureGateAnnotationExists && featureGateAnnotationValue != string(configv1.TechPreviewNoUpgrade) {
return fmt.Errorf("unrecognized value %s=%s", featureGateAnnotation, featureGateAnnotationValue)
if manifestSpecifiesFeatureSets && !manifestFeatureSets.Has(*requiredFeatureSet) {
return fmt.Errorf("%q is required, and %s=%s", *requiredFeatureSet, featureSetAnnotation, strings.Join(manifestFeatureSets.List(), ","))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this assumes featureSetAnnotation, and not the old annotation name. Is the idea that we drop the old name before we GA 4.12, once existing folks have migrated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this assumes featureSetAnnotation, and not the old annotation name. Is the idea that we drop the old name before we GA 4.12, once existing folks have migrated?

correct

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 24, 2022
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 24, 2022
pkg/cvo/cvo.go Outdated

// includeTechPreview is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
// requiredFeatureSet is set to true when the CVO should create resources with the `release.openshift.io/feature-gate=TechPreviewNoUpgrade`
// label set. This is set based on whether the featuregates.config.openshift.io|.spec.featureSet is set to "TechPreviewNoUpgrade".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set to true and the TechPreviewNoUpgrade focus are stale.

// requiredFeatureSet is set to the value of Feature.config.openshift.io|spec.featureSet.
// The CVO should create resources with the `annotations[release.openshift.io/feature-set]` unset or if
// the annotation is set, it must contain the requiredFeatureSet.
// The library called by the CVO translates "" into "Default" to ease usage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not need to get into this level of detail about what's going on inside the vendored library. Maybe something more generic like:

requiredFeatureSet is set to the value of Feature.config.openshift.io|spec.featureSet, which contributes to whether or not some manifests are included for reconciliation.

)

// TechPreviewChangeStopper calls stop when the value of the featuregate changes from TechPreviewNoUpgrade to anything else
// or from anything to TechPreviewNoUpgrade.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too late for us to rename this to FeatureChangeStopper to match the directory name you initially selected? Might be able to do that with:

$ sed -i 's/TechPreviewChangeStopper/FeatureChangeStopper/g' pkg/featurechangestopper/*
$ mv pkg/featurechangestopper/{techpreview,feature}changestopper.go
$ mv pkg/featurechangestopper/{techpreview,feature}changestopper_test.go

or similar.

includeTechPreview = false // if we have no featuregates, then we aren't tech preview
// if we have no featuregates, then we assume the default featureset, which is "".
// This excludes everything that could possibly depend on a different feature set.
startingFeatureSet = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to assume here? If we have no cluster FeatureGate, can we be in any other set? Isn't that a clear (if implicit) declaration that we're in the default set?

// This excludes everything that could possibly depend on a different feature set.
startingFeatureSet = ""
case err != nil:
klog.Warningf("Error getting featuregate value: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we may need a pointer for passing this thing around (luckily, library-go is already using a pointer). openshift/enhancements#1227 currently includes:

If a special manifest is required for a FeatureSet that allows changes (LatencySensitive to Default for instance), the person adding the manifest will have to add the appropriate delete-annotated manifest in Default. This is expected to be a rare occurrence of a rare occurrence.

So there's a risk here of:

  1. User configures LatencySensitive.
  2. Resource A, that only happens in LatencySensitive, rolls out.
  3. Happy days.
  4. Something bumps the CVO.
  5. New CVO comes up, but network hiccup fails the FeatureGate GET.
  6. We enter this err != nil case, and warn, but leave startingFeatureSet == "".
  7. Reconciling manifests, we exclude the LatencySensitive resource A manifest (meh), but include the deletion manifest (!), and remove resource A from the cluster.
  8. Change-detector has a successful GET, notices the mistake, and shuts down the CVO.
  9. Replacement CVO has a successful GET, realizes it's LatencySensitive, and rolls resource A back into the cluster.

That's not the end of the world, but I'd feel more comfortable with pointers tracking whether we knew or not about the feature gate, and... hmm, I guess there's not a clear way to tell library-go to exclude everything that declares a feature-set annotation. Maybe we don't need a pointer, but we do need a dummy value here like:

startingFeatureSet = "not sure, so exclude everything that might care"

?


default:
includeTechPreview = gate.Spec.FeatureSet == configv1.TechPreviewNoUpgrade
// otherwise, you're the default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this comment about? Just reminding us that we're in the default: case?

@deads2k
Copy link
Contributor Author

deads2k commented Aug 26, 2022

pushed another commit for the comments.


// TechPreviewChangeStopper calls stop when the value of the featuregate changes from TechPreviewNoUpgrade to anything else
// FeatureChangeStopper calls stop when the value of the featuregate changes from TechPreviewNoUpgrade to anything else
// or from anything to TechPreviewNoUpgrade.
Copy link
Member

@wking wking Aug 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale TechPreviewNoUpgrade focus. Maybe:

FeatureChangeStopper calls stop when the value of the feature-set changes.

?

Copy link
Member

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 1, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 1, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 1, 2022
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD df9db6c and 2 for PR HEAD 11afdc2 in total

@deads2k
Copy link
Contributor Author

deads2k commented Sep 6, 2022

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 6, 2022

@deads2k: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit d4151bc into openshift:master Sep 6, 2022
deads2k added a commit to deads2k/cluster-capi-operator that referenced this pull request Sep 6, 2022
This was added in
openshift/cluster-version-operator#821 to allow
more featuresets and allow for a future migration to include actual gates
deads2k added a commit to deads2k/api that referenced this pull request Sep 8, 2022
This was added in
openshift/cluster-version-operator#821 to allow
more featuresets and allow for a future migration to include actual gates
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 23, 2022
Originally, all component operators were responsible for creating
their own ClusterOperator, and we'd just watch to make sure we were
happy enough with what they did.  However, on install, or when
updating to a version that added a new component, we could have
timelines like:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

So in 2a469e3 (cvo: When installing or upgrading, fast-fill
cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator
pre-creation to get:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator
   namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create
their own ClusterOperator, because that was the only path before the
CVO learned about pre-creation.  And even since then, most new
operators come into the cluster on install or on update, when the CVO
is pre-creating.  New in 4.12, the platform-operator is coming in [1],
and it has two relevant characteristics:

* It does not know how to create the platform-operators-aggregated
  ClusterOperator [2].
* It is gated behind TechPreviewNoUpgrade [3].

So we are exposed to:

1. Admin installs a cluster.  No platform-operators-aggregated,
   because it's not TechPreviewNoUpgrade.
2. Install complete.  CVO transitions to reconciling mode.
3. Admin enables TechPreviewNoUpgrade.
4. CVO notices, and reboots fc00c62 (update the manifest selection
   to honor any featureset, 2022-08-17, openshift#821).
5. Because we decided to not transition into updating mode for
   feature-set changes, we stay in reconciling mode.
6. Because we're in reconciling mode, we skip the ClusterOperator
   pre-creation, and get right in to the status check.
7. Because the platform operator didn't create the ClusterOperator
   either, the CVO's status check fails with [2]:

     45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation
conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check
   succeeds (at least, after the operator writes acceptable status to
   the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster
resources had been deleted, assuming the CVO was still alive.  There
may be other component operators who rely on the CVO for
ClusterOperator creation, but which we haven't noticed because they
aren't also gated behind TechPreviewNoUpgrade.

[1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md
[2]: https://issues.redhat.com/browse/OCPBUGS-1636
[3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 23, 2022
Originally, all component operators were responsible for creating
their own ClusterOperator, and we'd just watch to make sure we were
happy enough with what they did.  However, on install, or when
updating to a version that added a new component, we could have
timelines like:

1. CVO creates a namespace for an operator.
2. CVO creates ... for the operator.
3. CVO creates the operator Deployment.
4. Operator deployment never comes up, for whatever reason.
5. Admin must-gathers.
6. Must gather uses ClusterOperators for discovering important stuff,
   and because the ClusterOperator doesn't exist yet, we get no data
   about why the deployment didn't come up.

So in 2a469e3 (cvo: When installing or upgrading, fast-fill
cluster-operators, 2020-02-07, openshift#318), we added ClusterOperator
pre-creation to get:

1. CVO pre-creates ClusterOperator for an operator.
2. CVO creates the namespace for an operator.
3. CVO creates ... for the operator.
4. CVO creates the operator Deployment.
5. Operator deployment never comes up, for whatever reason.
6. Admin must-gathers.
7. Must gather uses ClusterOperators for discovering important stuff,
   and finds the one the CVO had pre-created with hard-coded
   relatedObjects, gathers stuff from the referenced operator
   namespace, and allows us to trouble-shoot the issue.

However, all existing component operators already knew how to create
their own ClusterOperator, because that was the only path before the
CVO learned about pre-creation.  And even since then, most new
operators come into the cluster on install or on update, when the CVO
is pre-creating.  New in 4.12, the platform-operator is coming in [1],
and it has two relevant characteristics:

* It does not know how to create the platform-operators-aggregated
  ClusterOperator [2].
* It is gated behind TechPreviewNoUpgrade [3].

So we are exposed to:

1. Admin installs a cluster.  No platform-operators-aggregated,
   because it's not TechPreviewNoUpgrade.
2. Install complete.  CVO transitions to reconciling mode.
3. Admin enables TechPreviewNoUpgrade.
4. CVO notices, and reboots fc00c62 (update the manifest selection
   to honor any featureset, 2022-08-17, openshift#821).
5. Because we decided to not transition into updating mode for
   feature-set changes, we stay in reconciling mode.
6. Because we're in reconciling mode, we skip the ClusterOperator
   pre-creation, and get right in to the status check.
7. Because the platform operator didn't create the ClusterOperator
   either, the CVO's status check fails with [2]:

     45657:E0923 01:43:25.610286       1 task.go:117] error running apply for clusteroperator "openshift-platform-operators/platform-operators-aggregated" (587 of 960): clusteroperator.config.openshift.io "platform-operators-aggregated" not found

With this commit, I stop making the ClusterOperator pre-creation
conditional, so the new flow is:

...
6. Even in reconciling mode, we pre-create the ClusterOperator.
7. Because we pre-created the ClusterOperator, the CVO's status check
   succeeds (at least, after the operator writes acceptable status to
   the ClusterOperator we've created for it).

This will also help us recover components where a bunch of in-cluster
resources had been deleted, assuming the CVO was still alive.  There
may be other component operators who rely on the CVO for
ClusterOperator creation, but which we haven't noticed because they
aren't also gated behind TechPreviewNoUpgrade.

[1]: https://github.com/openshift/enhancements/blob/6e1697418be807d0ae567a9f83ac654a1fd0ee9a/enhancements/olm/platform-operators.md
[2]: https://issues.redhat.com/browse/OCPBUGS-1636
[3]: https://github.com/openshift/platform-operators/blob/4ecea427cf5302dfcdf4a5af8d28eadebacc2037/manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml#L8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants