Skip to content

Conversation

@wking
Copy link
Member

@wking wking commented Dec 7, 2021

To make it easier to differentiate between "cluster version operator is confused about the cluster's tech-preview-ness", "cluster-version operator has broken manifest exclusion logic", and "manifest is mis-setting an annotation".

WIP until we remove the pkg/payload/payload.go logging, which I'm using to help debug openshift/cluster-capi-operator#20.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 7, 2021
@openshift-ci openshift-ci bot requested review from jottofar and vrutkovs December 7, 2021 06:45
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2021
wking added 2 commits December 6, 2021 23:48
To make it easier to differentiate between "cluster version operator
is confused about the cluster's tech-preview-ness", "cluster-version
operator has broken manifest exclusion logic", and "manifest is
mis-setting an annotation".

I'm also converting an Infof into an Info for a call-site that needs
no formatting for its static string.
From the enhancement [1]:

  During bootstrapping, the CVO will assume no feature sets are
  enabled until it can successfully retrieve
  `featuregates.config.openshift.io` from the Kubernetes API server.

So this softens the error from 18dd189 (simplify includeTechPreview
flag to be a static bool, 2021-11-22, openshift#694) to be log-and-continue.
If it turns out that we actually were TechPreviewNoUpgrade, the
TechPreviewChangeStopper controller will eventually succeed in pulling
the feature-gate, notice, and ask the CVO to shut down.  In the
meantime, the CVO will have been able to get a head start on
reconciling the vast majority of manifests which are not tech-preview.

[1]: https://github.com/openshift/enhancements/blame/f74ffe7776f40dbd096b9ca10c27ee7a0a579e58/enhancements/update/cvo-techpreview-manifests.md#L70-L71
@wking wking force-pushed the expanded-tech-preview-logging branch from 27c7c4a to 90b1454 Compare December 7, 2021 07:53
@wking wking changed the title WIP: pkg/featurechangestopper: Log initial state pkg/start: Log and continue when we fail to retrieve the feature gate Dec 7, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 7, 2021
@wking
Copy link
Member Author

wking commented Dec 7, 2021

27c7c4a -> 90b1454 drops the debug logging, and adds a new commit that turns 18dd189 (#694)'s hard-fail on initial-feature-gate-fetch into a log-and-continue, to make bootstrapping a bit quicker with the CVO working on GA manifests while the feature-gate hosting business is coming up.

@jottofar
Copy link
Contributor

jottofar commented Dec 7, 2021

Is this "log and continue" change valid because the error usually occurs during start-up? So ignoring it just means we continue with includeTechPreview = false but eventually the Get succeeds and we'll restart if the value is actually true?

@jottofar
Copy link
Contributor

jottofar commented Dec 7, 2021

/lgtm

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 7, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jottofar, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 7, 2021
@wking
Copy link
Member Author

wking commented Dec 7, 2021

sandbox and connectivity issues are unrelated:

/override ci/prow/e2e-agnostic-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 7, 2021

@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade

Details

In response to this:

sandbox and connectivity issues are unrelated:

/override ci/prow/e2e-agnostic-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 7427fb8 into openshift:master Dec 7, 2021
@wking wking deleted the expanded-tech-preview-logging branch December 7, 2021 17:16
@wking wking changed the title pkg/start: Log and continue when we fail to retrieve the feature gate Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate Dec 9, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 9, 2021

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 2029750 has been moved to the MODIFIED state.

Details

In response to this:

Bug 2029750: pkg/start: Log and continue when we fail to retrieve the feature gate

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 5, 2022
…ingTechPreviewState

As described in [1], some 4.10 jobs that set TechPreviewNoUpdate very
early during install are running into trouble like [2]:

1. Early in bootstrap, something sets TechPreviewNoUpdate.
2. Cluster-version operator comes up, and attempts to figure out the
   current featureSet.  But because the Kubernetes API is also still
   coming up, that fails on an error like [2]:

     $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
     W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
     I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

3. The TechPreviewChangeStopper waits for any FeatureGate changes, but
   we don't get any.
4. CVO happily spends hours without synchronizing any of the requested
   TechPreviewNoUpdate manifests.

Step 2 was originally fatal, but I'd softened it in 90b1454
(pkg/start: Log and continue when we fail to retrieve the feature
gate, 2021-12-06, openshift#706).  Here are the relevant cases, and how they'd
behave with the different approaches:

a. No Kube-API hiccup on the initial FeatureGate fetch.  All
   implementations handle this well.
b. Kube-API hiccup on the initial FeatureGate fetch.

   i. And the actual FeatureGate value was not TechPreviewNoUpgrade.
      Before 90b1454, this would have caused a useless CVO
      container restart.  Since 90b1454, and unchanged in this
      commit, the CVO container's default:

        includeTechPreview := false

      is correct, and we correctly ignore the hiccup.

   ii. The actual FeatureGate value was TechPreviewNoUpgrade.  Before
       90b1454, this would cause a useful CVO container restart.
       From 90b1454 until this commit, we'd hit the bug case where
       we'd go an unbounded amount of time failing to reconcile the
       TechPreviewNoUpgrade manifests the user was asking for.  With
       this commit, we notice the divergence right after the informer
       caches sync, and restart the CVO container.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 5, 2022
…ingTechPreviewState

As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very
early during install are running into trouble like [2]:

1. Early in bootstrap, something sets TechPreviewNoUpgrade.
2. Cluster-version operator comes up, and attempts to figure out the
   current featureSet.  But because the Kubernetes API is also still
   coming up, that fails on an error like [2]:

     $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
     W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
     I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

3. The TechPreviewChangeStopper waits for any FeatureGate changes, but
   we don't get any.
4. CVO happily spends hours without synchronizing any of the requested
   TechPreviewNoUpgrade manifests.

Step 2 was originally fatal, but I'd softened it in 90b1454
(pkg/start: Log and continue when we fail to retrieve the feature
gate, 2021-12-06, openshift#706).  Here are the relevant cases, and how they'd
behave with the different approaches:

a. No Kube-API hiccup on the initial FeatureGate fetch.  All
   implementations handle this well.
b. Kube-API hiccup on the initial FeatureGate fetch.

   i. And the actual FeatureGate value was not TechPreviewNoUpgrade.
      Before 90b1454, this would have caused a useless CVO
      container restart.  Since 90b1454, and unchanged in this
      commit, the CVO container's default:

        includeTechPreview := false

      is correct, and we correctly ignore the hiccup.

   ii. The actual FeatureGate value was TechPreviewNoUpgrade.  Before
       90b1454, this would cause a useful CVO container restart.
       From 90b1454 until this commit, we'd hit the bug case where
       we'd go an unbounded amount of time failing to reconcile the
       TechPreviewNoUpgrade manifests the user was asking for.  With
       this commit, we notice the divergence right after the informer
       caches sync, and restart the CVO container.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 5, 2022
…ingTechPreviewState

As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very
early during install are running into trouble like:

1. Early in bootstrap, something sets TechPreviewNoUpgrade.
2. Cluster-version operator comes up, and attempts to figure out the
   current featureSet.  But because the Kubernetes API is also still
   coming up, that fails on an error like [2]:

     $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
     W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
     I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

3. The TechPreviewChangeStopper waits for any FeatureGate changes, but
   we don't get any.
4. CVO happily spends hours without synchronizing any of the requested
   TechPreviewNoUpgrade manifests.

Step 2 was originally fatal, but I'd softened it in 90b1454
(pkg/start: Log and continue when we fail to retrieve the feature
gate, 2021-12-06, openshift#706).  Here are the relevant cases, and how they'd
behave with the different approaches:

a. No Kube-API hiccup on the initial FeatureGate fetch.  All
   implementations handle this well.
b. Kube-API hiccup on the initial FeatureGate fetch.

   i. And the actual FeatureGate value was not TechPreviewNoUpgrade.
      Before 90b1454, this would have caused a useless CVO
      container restart.  Since 90b1454, and unchanged in this
      commit, the CVO container's default:

        includeTechPreview := false

      is correct, and we correctly ignore the hiccup.

   ii. The actual FeatureGate value was TechPreviewNoUpgrade.  Before
       90b1454, this would cause a useful CVO container restart.
       From 90b1454 until this commit, we'd hit the bug case where
       we'd go an unbounded amount of time failing to reconcile the
       TechPreviewNoUpgrade manifests the user was asking for.  With
       this commit, we notice the divergence right after the informer
       caches sync, and restart the CVO container.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2022
…ingTechPreviewState

As described in [1], some 4.10 jobs that set TechPreviewNoUpgrade very
early during install are running into trouble like:

1. Early in bootstrap, something sets TechPreviewNoUpgrade.
2. Cluster-version operator comes up, and attempts to figure out the
   current featureSet.  But because the Kubernetes API is also still
   coming up, that fails on an error like [2]:

     $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264/artifacts/e2e-aws-techpreview/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-76cd65b7bb-4p945_cluster-version-operator.log | grep 'Error getting featuregate value\|tech'
     W0204 10:10:40.126809       1 start.go:142] Error getting featuregate value: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
     I0204 10:19:53.097129       1 techpreviewchangestopper.go:97] Starting stop-on-techpreview-change controller with TechPreviewNoUpgrade false.

3. The TechPreviewChangeStopper waits for any FeatureGate changes, but
   we don't get any.
4. CVO happily spends hours without synchronizing any of the requested
   TechPreviewNoUpgrade manifests.

Step 2 was originally fatal, but I'd softened it in 90b1454
(pkg/start: Log and continue when we fail to retrieve the feature
gate, 2021-12-06, openshift#706).  Here are the relevant cases, and how they'd
behave with the different approaches:

a. No Kube-API hiccup on the initial FeatureGate fetch.  All
   implementations handle this well.
b. Kube-API hiccup on the initial FeatureGate fetch.

   i. And the actual FeatureGate value was not TechPreviewNoUpgrade.
      Before 90b1454, this would have caused a useless CVO
      container restart.  Since 90b1454, and unchanged in this
      commit, the CVO container's default:

        includeTechPreview := false

      is correct, and we correctly ignore the hiccup.

   ii. The actual FeatureGate value was TechPreviewNoUpgrade.  Before
       90b1454, this would cause a useful CVO container restart.
       From 90b1454 until this commit, we'd hit the bug case where
       we'd go an unbounded amount of time failing to reconcile the
       TechPreviewNoUpgrade manifests the user was asking for.  With
       this commit, we notice the divergence right after the informer
       caches sync, and restart the CVO container.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2050946#c0
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-techpreview/1489537240471179264
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants