Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate #706
Conversation
This makes it easier to differentiate between "cluster-version operator is confused about the cluster's tech-preview-ness", "cluster-version operator has broken manifest exclusion logic", and "manifest is mis-setting an annotation". I'm also converting an Infof into an Info for a call-site that needs no formatting for its static string.
From the enhancement [1]: During bootstrapping, the CVO will assume no feature sets are enabled until it can successfully retrieve `featuregates.config.openshift.io` from the Kubernetes API server.

So this softens the error from 18dd189 (simplify includeTechPreview flag to be a static bool, 2021-11-22, openshift#694) to be log-and-continue. If it turns out that we actually were TechPreviewNoUpgrade, the TechPreviewChangeStopper controller will eventually succeed in pulling the feature-gate, notice, and ask the CVO to shut down. In the meantime, the CVO will have been able to get a head start on reconciling the vast majority of manifests, which are not tech-preview.

[1]: https://github.com/openshift/enhancements/blame/f74ffe7776f40dbd096b9ca10c27ee7a0a579e58/enhancements/update/cvo-techpreview-manifests.md#L70-L71
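For concreteness, here is a minimal Go sketch of the softened startup path. The function name and structure are hypothetical simplifications, not the real pkg/start code; the clientset calls and the `configv1.TechPreviewNoUpgrade` constant are the real OpenShift client-go/API surface:

```go
package main

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/klog/v2"
)

// includeTechPreviewOrDefault sketches the softened behavior: on any error
// fetching featuregates/cluster, warn and fall back to the "no feature sets
// enabled" default from the enhancement, instead of exiting the process.
func includeTechPreviewOrDefault(ctx context.Context, client configclient.Interface) bool {
	gate, err := client.ConfigV1().FeatureGates().Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		// Previously this error was fatal. Now we log and continue; if the
		// cluster really is TechPreviewNoUpgrade, the TechPreviewChangeStopper
		// controller later notices and asks the CVO to shut down.
		klog.Warningf("Error getting featuregate value: %v", err)
		return false
	}
	return gate.Spec.FeatureSet == configv1.TechPreviewNoUpgrade
}
```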
27c7c4a to 90b1454
Is this "log and continue" change valid because the error usually occurs during start-up? So ignoring it just means we continue with |
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
sandbox and connectivity issues are unrelated: /override ci/prow/e2e-agnostic-upgrade
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: All pull requests linked via external trackers have merged: Bugzilla bug 2029750 has been moved to the MODIFIED state.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…ingTechPreviewState
As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very
early during install are running into trouble like:
1. Early in bootstrap, something sets TechPreviewNoUpgrade.
2. Cluster-version operator comes up, and attempts to figure out the
current featureSet. But because the Kubernetes API is also still
coming up, that fails on an error like [2]:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
W0204 10:10:40.126809 1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
I0204 10:19:53.097129 1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
3. The TechPreviewChangeStopper waits for any FeatureGate changes, but
we don't get any.
4. CVO happily spends hours without synchronizing any of the requested
TechPreviewNoUpgrade manifests.
Step 2 was originally fatal, but I'd softened it in 90b1454
(pkg/start: Log and continue when we fail to retrieve the feature
gate, 2021-12-06, openshift#706). Here are the relevant cases, and how they'd
behave with the different approaches:
a. No Kube-API hiccup on the initial FeatureGate fetch. All
implementations handle this well.
b. Kube-API hiccup on the initial FeatureGate fetch.
i. And the actual FeatureGate value was not TechPreviewNoUpgrade.
Before 90b1454, this would have caused a useless CVO
container restart. Since 90b1454, and unchanged in this
commit, the CVO container's default:
includeTechPreview := false
is correct, and we correctly ignore the hiccup.
ii. The actual FeatureGate value was TechPreviewNoUpgrade. Before
90b1454, this would cause a useful CVO container restart.
From 90b1454 until this commit, we'd hit the bug case where
we'd go an unbounded amount of time failing to reconcile the
TechPreviewNoUpgrade manifests the user was asking for. With
this commit, we notice the divergence right after the informer
caches sync, and restart the CVO container.
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
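As a rough illustration of the shape of that fix (the helper and its parameters are hypothetical; the real logic lives in the TechPreviewChangeStopper controller): instead of only reacting to FeatureGate change events, compare each observed FeatureGate, including the one seen at initial informer cache sync, against the state the process started with:

```go
package main

import (
	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/klog/v2"
)

// checkTechPreview compares the FeatureGate observed by the informer against
// the tech-preview state the process started with. Unlike a pure
// change-watcher, this catches a divergence that predates the CVO coming up
// (case b.ii above), because it also fires on the initial cache sync.
// startedTechPreview and shutdown are assumptions for this sketch: the value
// captured at process start, and a callback that stops the CVO container.
func checkTechPreview(gate *configv1.FeatureGate, startedTechPreview bool, shutdown func()) {
	current := gate.Spec.FeatureSet == configv1.TechPreviewNoUpgrade
	if current != startedTechPreview {
		klog.Infof("TechPreviewNoUpgrade changed from %t to %t; shutting down so the replacement container picks up the new feature set", startedTechPreview, current)
		shutdown()
	}
}
```

Because the comparison runs on the initial sync as well as on updates, case b.ii no longer depends on a later FeatureGate change event arriving.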
WIP until we remove the pkg/payload/payload.go logging, which I'm using to help debug openshift/cluster-capi-operator#20.