@wking wking commented Feb 5, 2022

As described in the bug, some 4.10 jobs that set TechPreviewNoUpgrade very early during install are running into trouble like:

  1. Early in bootstrap, something sets TechPreviewNoUpgrade.

  2. Cluster-version operator comes up, and attempts to figure out the current featureSet. But because the Kubernetes API is also still coming up, that fails on an error like:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
    W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
    I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
  3. The TechPreviewChangeStopper waits for any FeatureGate changes, but we don't get any.

  4. CVO happily spends hours without synchronizing any of the requested TechPreviewNoUpgrade manifests.
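A minimal Go sketch of the log-and-continue startup behavior behind steps 2 through 4, using only the standard library. The names `startingTechPreview`, `refusedFetch`, and `techPreviewFetch` are hypothetical stand-ins for illustration, not the CVO's actual functions; only the log message text and the `false` default come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// refusedFetch is a hypothetical fetch that fails the way the API did in the
// log above, while the Kubernetes API was still coming up.
func refusedFetch() (string, error) {
	return "", errors.New("dial tcp 127.0.0.1:6443: connect: connection refused")
}

// techPreviewFetch is a hypothetical fetch that succeeds and reports the
// TechPreviewNoUpgrade feature set.
func techPreviewFetch() (string, error) {
	return "TechPreviewNoUpgrade", nil
}

// startingTechPreview models the softened startup path: on a fetch error it
// logs a warning and keeps the default instead of exiting.
func startingTechPreview(fetch func() (string, error)) bool {
	includeTechPreview := false // default used when the fetch fails
	featureSet, err := fetch()
	if err != nil {
		fmt.Printf("Error getting featuregate value: %v\n", err)
		return includeTechPreview
	}
	return featureSet == "TechPreviewNoUpgrade"
}

func main() {
	// With a refused connection the default wins, even if the cluster was
	// really asking for TechPreviewNoUpgrade.
	fmt.Println(startingTechPreview(refusedFetch))
	fmt.Println(startingTechPreview(techPreviewFetch))
}
```

The sketch shows why the default is only safe when the real feature set is empty: nothing in this path ever revisits the value after a failed fetch.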

Step 2 was originally fatal, but I'd softened it in 90b1454 (#706). Here are the relevant cases, and how they'd behave with the different approaches:

  1. No Kube-API hiccup on the initial FeatureGate fetch. All implementations handle this well.
  2. Kube-API hiccup on the initial FeatureGate fetch.
    1. And the actual FeatureGate value was not TechPreviewNoUpgrade. Before 90b1454, this would have caused a useless CVO container restart. Since 90b1454, and unchanged in this commit, the CVO container's default:

      includeTechPreview := false

      is correct, and we correctly ignore the hiccup.

    2. The actual FeatureGate value was TechPreviewNoUpgrade. Before 90b1454, this would have caused a useful CVO container restart. From 90b1454 until this commit, we'd hit the bug case where we'd go an unbounded amount of time failing to reconcile the TechPreviewNoUpgrade manifests the user was asking for. With this commit, we notice the divergence right after the informer caches sync, and restart the CVO container.
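To illustrate the difference seeding makes in case 2.2, here is a simplified Go model using only the standard library. It is a sketch under stated assumptions, not the code in this pull request: the `stopper` type, `newStopper`, and `needsRestart` are hypothetical names, and a buffered channel stands in for the controller's work queue and informer events.

```go
package main

import "fmt"

// stopper is a simplified, hypothetical model of the stop-on-techpreview-change
// controller; the real controller uses informers and a rate-limited work queue.
type stopper struct {
	startingTechPreview bool          // value assumed at CVO startup
	actualFeatureSet    string        // what the synced informer cache would report
	queue               chan struct{} // stand-in for the controller's work queue
}

// newStopper builds the model. With seed set, one item is queued up front so
// the first sync runs even if no FeatureGate change event ever arrives.
func newStopper(startingTechPreview bool, actualFeatureSet string, seed bool) *stopper {
	s := &stopper{
		startingTechPreview: startingTechPreview,
		actualFeatureSet:    actualFeatureSet,
		queue:               make(chan struct{}, 1),
	}
	if seed {
		s.queue <- struct{}{}
	}
	return s
}

// needsRestart drains the queue and reports whether any sync found the
// cluster's feature set diverging from the starting assumption.
func (s *stopper) needsRestart() bool {
	for {
		select {
		case <-s.queue:
			techPreviewNow := s.actualFeatureSet == "TechPreviewNoUpgrade"
			if techPreviewNow != s.startingTechPreview {
				return true
			}
		default:
			return false // queue empty: nothing prompted a check
		}
	}
}

func main() {
	// Case 2.2: the initial fetch failed, so we started with false, but the
	// cluster is really TechPreviewNoUpgrade and no change event arrives.
	fmt.Println("unseeded:", newStopper(false, "TechPreviewNoUpgrade", false).needsRestart())
	fmt.Println("seeded:", newStopper(false, "TechPreviewNoUpgrade", true).needsRestart())
}
```

In this model the unseeded controller never compares anything because no event arrives, while the seeded one performs a single check and notices the divergence.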

…ingTechPreviewState

As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very
early during install are running into trouble like:

1. Early in bootstrap, something sets TechPreviewNoUpgrade.
2. Cluster-version operator comes up, and attempts to figure out the
   current featureSet.  But because the Kubernetes API is also still
   coming up, that fails on an error like [2]:

     $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
     W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
     I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

3. The TechPreviewChangeStopper waits for any FeatureGate changes, but
   we don't get any.
4. CVO happily spends hours without synchronizing any of the requested
   TechPreviewNoUpgrade manifests.

Step 2 was originally fatal, but I'd softened it in 90b1454
(pkg/start: Log and continue when we fail to retrieve the feature
gate, 2021-12-06, openshift#706).  Here are the relevant cases, and how they'd
behave with the different approaches:

a. No Kube-API hiccup on the initial FeatureGate fetch.  All
   implementations handle this well.
b. Kube-API hiccup on the initial FeatureGate fetch.

   i. And the actual FeatureGate value was not TechPreviewNoUpgrade.
      Before 90b1454, this would have caused a useless CVO
      container restart.  Since 90b1454, and unchanged in this
      commit, the CVO container's default:

        includeTechPreview := false

      is correct, and we correctly ignore the hiccup.

   ii. The actual FeatureGate value was TechPreviewNoUpgrade.  Before
       90b1454, this would have caused a useful CVO container restart.
       From 90b1454 until this commit, we'd hit the bug case where
       we'd go an unbounded amount of time failing to reconcile the
       TechPreviewNoUpgrade manifests the user was asking for.  With
       this commit, we notice the divergence right after the informer
       caches sync, and restart the CVO container.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264

openshift-ci bot commented Feb 5, 2022

@wking: This pull request references Bugzilla bug 2050946, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jiajliu

Details

In response to this:

Bug 2050946: pkg/featurechangestopper: Seed queue to guard against incorrect startingTechPreviewState

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Feb 5, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2022

wking commented Feb 5, 2022

1039482 adds debug logging, so we can confirm the post-init check in the non-tech-preview presubmits.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 5, 2022
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 6, 2022
Not Found errors are a clear signal that no feature set is configured, so
syncHandler should treat them as successful results.  This avoids
continually requeuing for a new GET call, now that 79742d7
(pkg/featurechangestopper: Seed queue to guard against incorrect
startingTechPreviewState, 2022-02-04, openshift#736) is seeding the queue, even
in clusters that have no 'cluster' FeatureGate object.  It would also
let us detect set -> removed transitions if those were allowed,
although the API-server is supposed to make it impossible to remove
TechPreviewNoUpgrade once it has been set [1].

[1]: https://docs.openshift.com/container-platform/4.9/nodes/clusters/nodes-cluster-enabling-features.html#nodes-cluster-enabling-features-about_nodes-cluster-enabling
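A rough Go sketch of the Not Found handling described in this commit message, using only the standard library. The sentinel `errNotFound` is an assumption standing in for a Kubernetes Not Found API error, and `techPreviewFromGet` plus the three `get...` helpers are hypothetical names, not the CVO's actual syncHandler.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a Not Found API error.
var errNotFound = errors.New(`featuregates "cluster" not found`)

// Hypothetical GET results used to exercise the handler.
func getNotFound() (string, error)    { return "", errNotFound }
func getUnavailable() (string, error) { return "", errors.New("connection refused") }
func getTechPreview() (string, error) { return "TechPreviewNoUpgrade", nil }

// techPreviewFromGet reports whether tech preview is enabled. A Not Found
// error counts as a successful "no feature set" result; any other error is
// returned so the caller can requeue and retry.
func techPreviewFromGet(get func() (string, error)) (bool, error) {
	featureSet, err := get()
	if errors.Is(err, errNotFound) {
		return false, nil // no FeatureGate object: no feature set, no requeue
	}
	if err != nil {
		return false, err // transient failure: the caller should requeue
	}
	return featureSet == "TechPreviewNoUpgrade", nil
}

func main() {
	for _, get := range []func() (string, error){getNotFound, getUnavailable, getTechPreview} {
		enabled, err := techPreviewFromGet(get)
		fmt.Println(enabled, err)
	}
}
```

Treating Not Found as a definite answer is what keeps a seeded queue from retrying forever in clusters that never create the object.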
@wking wking force-pushed the confirm-feature-gate-level branch from 1039482 to eff21aa Compare February 6, 2022 03:19
@wking wking force-pushed the confirm-feature-gate-level branch from eff21aa to 41aac43 Compare February 6, 2022 03:27
@wking wking force-pushed the confirm-feature-gate-level branch 2 times, most recently from 295efc3 to a44a3c0 Compare February 6, 2022 04:47

wking commented Feb 6, 2022

e2e-agnostic-operator:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/736/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-operator/1490167038348365824/artifacts/e2e-agnostic-operator/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-5d4c45b67b-sdmdz_cluster-version-operator.log | grep 'Error getting featuregate value\|techpreview'
W0206 04:03:32.900700       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
I0206 04:11:21.806248       1 techpreviewchangestopper.go:102] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.
I0206 04:11:38.407495       1 techpreviewchangestopper.go:71] WTK: Checking the featureSet: "" vs our initial false

Looks great. I've dropped the debugging commit with 295efc3 -> a44a3c0.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2022

jottofar commented Feb 7, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2022

openshift-ci bot commented Feb 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


wking commented Feb 7, 2022

The load-balancer connectivity and etcd leader-election failures are unrelated to this change:

/override ci/prow/e2e-agnostic
/override ci/prow/e2e-agnostic-upgrade


openshift-ci bot commented Feb 7, 2022

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic, ci/prow/e2e-agnostic-upgrade



openshift-ci bot commented Feb 7, 2022

@wking: all tests passed!


@openshift-merge-robot openshift-merge-robot merged commit 0409c3f into openshift:master Feb 7, 2022

openshift-ci bot commented Feb 7, 2022

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2050946 has been moved to the MODIFIED state.


@wking wking deleted the confirm-feature-gate-level branch February 7, 2022 19:56
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2022