-
Notifications
You must be signed in to change notification settings - Fork 213
Bug 2050946: pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState #736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 2050946: pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState #736
Conversation
…ingTechPreviewState
As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very
early during install are running into trouble like:
1. Early in bootstrap, something sets TechPreviewNoUpgrade.
2. Cluster-version operator comes up, and attempts to figure out the
current featureSet. But because the Kubernetes API is also still
coming up, that fails on an error like [2]:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
W0204 10:10:40.126809 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
I0204 10:19:53.097129 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
3. The TechPreviewChangeStopper waits for any FeatureGate changes, but
we don't get any.
4. CVO happily spends hours without synchronizing any of the requested
TechPreviewNoUpgrade manifests.
Step 2 was originally fatal, but I'd softened it in 90b1454
(pkg/start: Log and continue when we fail to retrieve the feature
gate, 2021-12-06, openshift#706). Here are the relevant cases, and how they'd
behave with the different approaches:
a. No Kube-API hiccup on the initial FeatureGate fetch. All
implementations handle this well.
b. Kube-API hiccup on the initial FeatureGate fetch.
i. And the actual FeatureGate value was not TechPreviewNoUpgrade.
Before 90b1454, this would have caused a useless CVO
container restart. Since 90b1454, and unchanged in this
commit, the CVO container's default:
includeTechPreview := false
is correct, and we correctly ignore the hiccup.
ii. The actual FeatureGate value was TechPreviewNoUpgrade. Before
90b1454, this would cause a useful CVO container restart.
From 90b1454 until this commit, we'd hit the bug case where
we'd go an unbounded amount of time failing to reconcile the
TechPreviewNoUpgrade manifests the user was asking for. With
this commit, we notice the divergence right after the informer
caches sync, and restart the CVO container.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
|
@wking: This pull request references Bugzilla bug 2050946, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
1039482 adds debug logging, so we can confirm the post-init check in the non-tech-preview presubmits. /hold |
Not Found errors are a clear configuration for "no feature set", so syncHandler should treat them as successful results. This avoids continually requeuing for a new GET call, now that 79742d7 (pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState, 2022-02-04, openshift#736) is seeding the queue, even in clusters where there is no FeatureSet clusters. And it also allows us to detect set -> removed transitions if those were allowed, although the API-server is supposed to make it impossible to remove TechPreviewNoUpgrade once it has been set [1]. [1]: https://docs.openshift.com/container-platform/4.9/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-about_nodes-cluster-enabling
1039482 to
eff21aa
Compare
Not Found errors are a clear configuration for "no feature set", so syncHandler should treat them as successful results. This avoids continually requeuing for a new GET call, now that 79742d7 (pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState, 2022-02-04, openshift#736) is seeding the queue, even in clusters where there is no FeatureSet clusters. And it also allows us to detect set -> removed transitions if those were allowed, although the API-server is supposed to make it impossible to remove TechPreviewNoUpgrade once it has been set [1]. [1]: https://docs.openshift.com/container-platform/4.9/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-about_nodes-cluster-enabling
eff21aa to
41aac43
Compare
Not Found errors are a clear configuration for "no feature set", so syncHandler should treat them as successful results. This avoids continually requeuing for a new GET call, now that 79742d7 (pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState, 2022-02-04, openshift#736) is seeding the queue, even in clusters where there is no FeatureSet clusters. And it also allows us to detect set -> removed transitions if those were allowed, although the API-server is supposed to make it impossible to remove TechPreviewNoUpgrade once it has been set [1]. [1]: https://docs.openshift.com/container-platform/4.9/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-about_nodes-cluster-enabling
295efc3 to
a44a3c0
Compare
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/736/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1490167038348365824/artifacts/e2e-agnostic-operator/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5d4c45b67b-sdmdz_cluster-version-operator.log | grep 'Error getting featuregate value\|techpreview'
W0206 04:03:32.900700 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
I0206 04:11:21.806248 1 techpreviewchangestopper.go:102] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0206 04:11:38.407495 1 techpreviewchangestopper.go:71] WTK: Checking the featureSet: "" vs our initial falseLooks great. I've dropped the debugging commit with 295efc3 -> a44a3c0. /hold cancel |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jottofar, wking The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Load balancer connectivity and etcd leader elections are unrelated: /override ci/prow/e2e-agnostic |
|
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic, ci/prow/e2e-agnostic-upgrade DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@wking: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 2050946 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Not Found errors are a clear configuration for "no feature set", so syncHandler should treat them as successful results. This avoids continually requeuing for a new GET call, now that 79742d7 (pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState, 2022-02-04, openshift#736) is seeding the queue, even in clusters where there is no FeatureSet clusters. And it also allows us to detect set -> removed transitions if those were allowed, although the API-server is supposed to make it impossible to remove TechPreviewNoUpgrade once it has been set [1]. [1]: https://docs.openshift.com/container-platform/4.9/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-about_nodes-cluster-enabling
As described in the bug, some 4.10 jobs that set
TechPreviewNoUpgradevery early during install are running into trouble like:Early in bootstrap, something sets
TechPreviewNoUpgrade.Cluster-version operator comes up, and attempts to figure out the current
featureSet. But because the Kubernetes API is also still coming up, that fails on an error like:The
TechPreviewChangeStopperwaits for any FeatureGate changes, but we don't get any.CVO happily spends hours without synchronizing any of the requested
TechPreviewNoUpgrademanifests.Step 2 was originally fatal, but I'd softened it in 90b1454 (#706). Here are the relevant cases, and how they'd behave with the different approaches:
And the actual FeatureGate value was not
TechPreviewNoUpgrade. Before 90b1454, this would have caused a useless CVO container restart. Since 90b1454, and unchanged in this commit, the CVO container's default:is correct, and we correctly ignore the hiccup.
The actual FeatureGate value was
TechPreviewNoUpgrade. Before 90b1454, this would cause a useful CVO container restart. From 90b1454 until this commit, we'd hit the bug case where we'd go an unbounded amount of time failing to reconcile theTechPreviewNoUpgrademanifests the user was asking for. With this commit, we notice the divergence right after the informer caches sync, and restart the CVO container.