Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate #706

wking · 2021-12-07T06:44:48Z

To make it easier to differentiate between "cluster version operator is confused about the cluster's tech-preview-ness", "cluster-version operator has broken manifest exclusion logic", and "manifest is mis-setting an annotation".

WIP until we remove the pkg/payload/payload.go logging, which I'm using to help debug openshift/cluster-capi-operator#20.

To make it easier to differentiate between "cluster version operator is confused about the cluster's tech-preview-ness", "cluster-version operator has broken manifest exclusion logic", and "manifest is mis-setting an annotation". I'm also converting an Infof into an Info for a call-site that needs no formatting for its static string.

From the enhancement [1]: During bootstrapping, the CVO will assume no feature sets are enabled until it can successfully retrieve `featuregates.config.openshift.io` from the Kubernetes API server. So this softens the error from 18dd189 (simplify includeTechPreview flag to be a static bool, 2021-11-22, openshift#694) to be log-and-continue. If it turns out that we actually were TechPreviewNoUpgrade, the TechPreviewChangeStopper controller will eventually succeed in pulling the feature-gate, notice, and ask the CVO to shut down. In the meantime, the CVO will have been able to get a head start on reconciling the vast majority of manifests which are not tech-preview. [1]: https://github.com/openshift/enhancements/blame/f74ffe7776f40dbd096b9ca10c27ee7a0a579e58/enhancements/update/cvo-techpreview-manifests.md#L70-L71

wking · 2021-12-07T07:55:27Z

27c7c4a -> 90b1454 drops the debug logging, and adds a new commit that turns 18dd189 (#694)'s hard-fail on initial-feature-gate-fetch into a log-and-continue, to make bootstrapping a bit quicker with the CVO working on GA manifests while the feature-gate hosting business is coming up.

jottofar · 2021-12-07T16:37:34Z

Is this "log and continue" change valid because the error usually occurs during start-up? So ignoring it just means we continue with includeTechPreview = false but eventually the Get succeeds and we'll restart if the value is actually true?

jottofar · 2021-12-07T17:05:19Z

/lgtm

openshift-ci · 2021-12-07T17:06:51Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jottofar,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2021-12-07T17:13:05Z

sandbox and connectivity issues are unrelated:

/override ci/prow/e2e-agnostic-upgrade

openshift-ci · 2021-12-07T17:13:42Z

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

sandbox and connectivity issues are unrelated:

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2021-12-09T18:42:19Z

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2029750 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…ingTechPreviewState As described in [1], some 4.10 jobs that set TechPreviewNoUpdate very early during install are running into trouble like [2]: 1. Early in bootstrap, something sets TechPreviewNoUpdate. 2. Cluster-version operator comes up, and attempts to figure out the current featureSet. But because the Kubernetes API is also still coming up, that fails on an error like [2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech' W0204 10:10:40.126809 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused I0204 10:19:53.097129 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false. 3. The TechPreviewChangeStopper waits for any FeatureGate changes, but we don't get any. 4. CVO happily spends hours without synchronizing any of the requested TechPreviewNoUpdate manifests. Step 2 was originally fatal, but I'd softened it in 90b1454 (pkg/start: Log and continue when we fail to retrieve the feature gate, 2021-12-06, openshift#706). Here are the relevant cases, and how they'd behave with the different approaches: a. No Kube-API hiccup on the initial FeatureGate fetch. All implementations handle this well. b. Kube-API hiccup on the initial FeatureGate fetch. i. And the actual FeatureGate value was not TechPreviewNoUpgrade. Before 90b1454, this would have caused a useless CVO container restart. Since 90b1454, and unchanged in this commit, the CVO container's default: includeTechPreview := false is correct, and we correctly ignore the hiccup. ii. The actual FeatureGate value was TechPreviewNoUpgrade. Before 90b1454, this would cause a useful CVO container restart. From 90b1454 until this commit, we'd hit the bug case where we'd go an unbounded amount of time failing to reconcile the TechPreviewNoUpgrade manifests the user was asking for. With this commit, we notice the divergence right after the informer caches sync, and restart the CVO container. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264

…ingTechPreviewState As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very early during install are running into trouble like [2]: 1. Early in bootstrap, something sets TechPreviewNoUpgrade. 2. Cluster-version operator comes up, and attempts to figure out the current featureSet. But because the Kubernetes API is also still coming up, that fails on an error like [2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech' W0204 10:10:40.126809 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused I0204 10:19:53.097129 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false. 3. The TechPreviewChangeStopper waits for any FeatureGate changes, but we don't get any. 4. CVO happily spends hours without synchronizing any of the requested TechPreviewNoUpgrade manifests. Step 2 was originally fatal, but I'd softened it in 90b1454 (pkg/start: Log and continue when we fail to retrieve the feature gate, 2021-12-06, openshift#706). Here are the relevant cases, and how they'd behave with the different approaches: a. No Kube-API hiccup on the initial FeatureGate fetch. All implementations handle this well. b. Kube-API hiccup on the initial FeatureGate fetch. i. And the actual FeatureGate value was not TechPreviewNoUpgrade. Before 90b1454, this would have caused a useless CVO container restart. Since 90b1454, and unchanged in this commit, the CVO container's default: includeTechPreview := false is correct, and we correctly ignore the hiccup. ii. The actual FeatureGate value was TechPreviewNoUpgrade. Before 90b1454, this would cause a useful CVO container restart. From 90b1454 until this commit, we'd hit the bug case where we'd go an unbounded amount of time failing to reconcile the TechPreviewNoUpgrade manifests the user was asking for. With this commit, we notice the divergence right after the informer caches sync, and restart the CVO container. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264

…ingTechPreviewState As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very early during install are running into trouble like: 1. Early in bootstrap, something sets TechPreviewNoUpgrade. 2. Cluster-version operator comes up, and attempts to figure out the current featureSet. But because the Kubernetes API is also still coming up, that fails on an error like [2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech' W0204 10:10:40.126809 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused I0204 10:19:53.097129 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false. 3. The TechPreviewChangeStopper waits for any FeatureGate changes, but we don't get any. 4. CVO happily spends hours without synchronizing any of the requested TechPreviewNoUpgrade manifests. Step 2 was originally fatal, but I'd softened it in 90b1454 (pkg/start: Log and continue when we fail to retrieve the feature gate, 2021-12-06, openshift#706). Here are the relevant cases, and how they'd behave with the different approaches: a. No Kube-API hiccup on the initial FeatureGate fetch. All implementations handle this well. b. Kube-API hiccup on the initial FeatureGate fetch. i. And the actual FeatureGate value was not TechPreviewNoUpgrade. Before 90b1454, this would have caused a useless CVO container restart. Since 90b1454, and unchanged in this commit, the CVO container's default: includeTechPreview := false is correct, and we correctly ignore the hiccup. ii. The actual FeatureGate value was TechPreviewNoUpgrade. Before 90b1454, this would cause a useful CVO container restart. From 90b1454 until this commit, we'd hit the bug case where we'd go an unbounded amount of time failing to reconcile the TechPreviewNoUpgrade manifests the user was asking for. With this commit, we notice the divergence right after the informer caches sync, and restart the CVO container. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 7, 2021

openshift-ci bot requested review from jottofar and vrutkovs December 7, 2021 06:45

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2021

wking added 2 commits December 6, 2021 23:48

wking force-pushed the expanded-tech-preview-logging branch from 27c7c4a to 90b1454 Compare December 7, 2021 07:53

wking changed the title ~~WIP: pkg/featurechangestopper: Log initial state~~ pkg/start: Log and continue when we fail to retrieve the feature gate Dec 7, 2021

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 7, 2021

openshift-ci bot assigned jottofar Dec 7, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2021

openshift-merge-robot merged commit 7427fb8 into openshift:master Dec 7, 2021

wking deleted the expanded-tech-preview-logging branch December 7, 2021 17:16

wking changed the title ~~pkg/start: Log and continue when we fail to retrieve the feature gate~~ Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate Dec 9, 2021

wking mentioned this pull request Feb 5, 2022

Bug 2050946: pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState #736

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate #706

Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate #706

Uh oh!

wking commented Dec 7, 2021

Uh oh!

wking commented Dec 7, 2021

Uh oh!

jottofar commented Dec 7, 2021

Uh oh!

jottofar commented Dec 7, 2021

Uh oh!

openshift-ci bot commented Dec 7, 2021

Uh oh!

wking commented Dec 7, 2021

Uh oh!

openshift-ci bot commented Dec 7, 2021

Uh oh!

openshift-ci bot commented Dec 9, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate #706

Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate #706

Uh oh!

Conversation

wking commented Dec 7, 2021

Uh oh!

wking commented Dec 7, 2021

Uh oh!

jottofar commented Dec 7, 2021

Uh oh!

jottofar commented Dec 7, 2021

Uh oh!

openshift-ci bot commented Dec 7, 2021

Uh oh!

wking commented Dec 7, 2021

Uh oh!

openshift-ci bot commented Dec 7, 2021

Uh oh!

openshift-ci bot commented Dec 9, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants