Get cluster version object earlier in startup #741
Conversation
Force-pushed from fb9e93a to 4d421e9
/test e2e-agnostic
Force-pushed from e03ec75 to 4611337
Looking back at 4d421e9, where you were adjusting the existing …: in 4611337, you've moved the load into ….
No, I see no reason to wait on the lease acquisition, so I'll move things around. I wasn't convinced yet that it was the actual lease acquisition, but had not circled back to look further.
Force-pushed from cbe8320 to 2419571
Force-pushed from 2419571 to 1b48d4e
Force-pushed from 1b48d4e to bba8ed4
Force-pushed from bba8ed4 to 9b6c3ec
/retitle Get cluster version object earlier in startup
Force-pushed from 72f2a04 to c86743c
/retest
/test e2e-agnostic-operator
/retest
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
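To make the new behavior concrete (and tie in the lease-acquisition reordering discussed above), here is a minimal, hypothetical sketch of a startup helper that waits for the installer-supplied ClusterVersion instead of defaulting one. The function name, polling interval, and logging are illustrative assumptions, not the CVO's actual code:

```go
// waitForClusterVersion blocks until the "version" ClusterVersion exists,
// rather than creating a default object when it is missing. Hypothetical
// sketch only; names and intervals are illustrative.
package startup

import (
	"context"
	"log"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

func waitForClusterVersion(ctx context.Context, client configclient.Interface) (*configv1.ClusterVersion, error) {
	var cv *configv1.ClusterVersion
	err := wait.PollImmediateUntilWithContext(ctx, 10*time.Second, func(ctx context.Context) (bool, error) {
		got, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// Patiently wait; do not recreate a default ClusterVersion.
			log.Print("no ClusterVersion 'version' yet; waiting for it to be restored")
			return false, nil
		}
		if err != nil {
			// Treat transient API errors as retryable.
			return false, nil
		}
		cv = got
		return true, nil
	})
	return cv, err
}
```

A caller would presumably run something like this before blocking on the leader-election lease, so the object is already in hand once the lease is acquired.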
/test e2e-agnostic-upgrade
@LalatenduMohanty: Overrode contexts on behalf of LalatenduMohanty: ci/prow/e2e-agnostic
/override cancel
@LalatenduMohanty: /override requires a failed status context or a job name to operate on.
Only the following contexts were expected:
/override cancel ci/prow/e2e-agnostic
@LalatenduMohanty: /override requires a failed status context or a job name to operate on.
Only the following contexts were expected:
wking left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jottofar, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Nothing in the update run sounds like it's this PR, and it's Friday, and build02 and Azure are both struggling, so: /override ci/prow/e2e-agnostic-upgrade and we'll have lots of cook time in 4.11 release informers by the time we're back next week ;)
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade
@jottofar: all tests passed! Full PR test history. Your PR dashboard.
…usterOperatorDegraded
By adding cluster_operator_up handling for ClusterVersion, with
'version' as the component name, the same way we handle
cluster_operator_conditions. This plugs us into ClusterOperatorDown
(based on cluster_operator_up) and ClusterOperatorDegraded (based on
both cluster_operator_conditions and cluster_operator_up).
I've adjusted the ClusterOperatorDegraded rule so that it fires on
ClusterVersion Failing=True and does not fire on Failing=False.
Thinking through an update from before:
1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with this change.
3. New CVO comes in, starts serving
cluster_operator_up{name="version"}.
4. Old ClusterOperatorDegraded sees no matching
cluster_operator_conditions{name="version",condition="Degraded"},
falls through to cluster_operator_up{name="version"}, and starts
cooking the 'for: 30m'.
5. If we go more than 30m before updating the ClusterOperatorDegraded
rule to understand Failing, ClusterOperatorDegraded would fire.
We'll need to backport the ClusterOperatorDegraded expr change to one
4.y release before the CVO-metrics change lands to get:
1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with the expr change.
3. Incoming ClusterOperatorDegraded sees no
cluster_operator_conditions{name="version",condition="Degraded"},
cluster_operator_conditions{name="version",condition="Failing"} (we
hope), or cluster_operator_up{name="version"}, so it doesn't fire.
Unless we are Failing=True, in which case, hooray, we'll start
alerting about it.
4. User requests an update to a release with the CVO-metrics change.
5. New CVO starts serving cluster_operator_up, just like the
fresh-modern-install situation, and everything is great.
The missing-ClusterVersion metrics don't matter all that much today,
because the CVO has been creating a replacement ClusterVersion since at
least 90e9881 (cvo: Change the core CVO loops to report status to
ClusterVersion, 2018-11-02, openshift#45). But it will become more important
with [1], which is planning on removing that default creation. When
there is no ClusterVersion, we expect ClusterOperatorDown to fire.
The awkward:
{{ "{{ ... \"version\" }} ... {{ end }}" }}
business is because this content is unpacked in two rounds of
templating:
1. The cluster-version operator's getPayloadTasks' renderManifest
preprocessing for the CVO directory, which is based on Go
templates.
2. Prometheus alerting-rule templates, which use console templates
[2], which are also based on Go templates [3].
The '{{ "..." }}' wrapping is consumed by the CVO's templating, and
the remaining:
{{ ... "version" }} ... {{ end }}
is left for Prometheus' templating.
[1]: openshift#741
[2]: https://prometheus.io/docs/prometheus/2.51/configuration/alerting_rules/#templating
[3]: https://prometheus.io/docs/visualization/consoles/
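As a quick, self-contained illustration of the two templating rounds described above (not CVO code; the alert text here is made up), here is a small Go program showing how the first text/template pass consumes the outer {{ "..." }} action and leaves the inner action for Prometheus:

```go
// Demonstrates the double-templating trick: the first Go template pass
// evaluates {{ "..." }} as a plain string constant, emitting the inner
// template action untouched for a later (Prometheus) templating round.
package main

import (
	"os"
	"text/template"
)

func main() {
	// Illustrative stand-in for a line in the CVO's alert manifest; the
	// real rule text differs, this only shows the escaping pattern.
	line := `message: {{ "{{ with \"version\" }}ClusterVersion {{ . }} is down{{ end }}" }}` + "\n"

	// Round 1: the CVO-style manifest rendering. The whole {{ "..." }}
	// action is a quoted string, so it is printed verbatim, minus the
	// quoting and backslash escapes.
	round1 := template.Must(template.New("cvo").Parse(line))
	if err := round1.Execute(os.Stdout, nil); err != nil {
		panic(err)
	}
	// Prints:
	//   message: {{ with "version" }}ClusterVersion {{ . }} is down{{ end }}
	// which is the form Prometheus' own (also Go-based) console templating
	// evaluates in round 2 when the alert fires.
}
```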
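And on the metrics side, a hedged sketch of what serving cluster_operator_up with 'version' as the component name could look like with client_golang. This is not the CVO's actual implementation; the label set and the Available-to-up mapping are assumptions:

```go
// Sketch of exporting cluster_operator_up{name="version"} for
// ClusterVersion, mirroring how ClusterOperator entries are handled.
// Illustrative only; labels and the Available mapping are assumptions.
package metrics

import (
	configv1 "github.com/openshift/api/config/v1"
	"github.com/prometheus/client_golang/prometheus"
)

var clusterOperatorUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cluster_operator_up",
	Help: "1 if the operator is Available, 0 otherwise.",
}, []string{"name", "version"})

func init() {
	prometheus.MustRegister(clusterOperatorUp)
}

// recordClusterVersionUp reports ClusterVersion under the component name
// "version", so ClusterOperatorDown and ClusterOperatorDegraded can key
// off cluster_operator_up{name="version"} like any other operator.
func recordClusterVersionUp(cv *configv1.ClusterVersion) {
	up := 0.0
	for _, cond := range cv.Status.Conditions {
		if cond.Type == configv1.OperatorAvailable && cond.Status == configv1.ConditionTrue {
			up = 1.0
		}
	}
	clusterOperatorUp.WithLabelValues("version", cv.Status.Desired.Version).Set(up)
}
```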