Get cluster version object earlier in startup #741
Conversation
Force-pushed from fb9e93a to 4d421e9
/test e2e-agnostic
Force-pushed from e03ec75 to 4611337
Looking back at 4d421e9, where you were adjusting the existing …: in 4611337, you've moved the load into ….
No, I see no reason to wait on the lease acquisition, so I'll move things around. I wasn't convinced yet that it was the actual lease acquisition, but had not circled back to look further.
Force-pushed from cbe8320 to 2419571
Force-pushed from 2419571 to 1b48d4e
Force-pushed from 1b48d4e to bba8ed4
Force-pushed from bba8ed4 to 9b6c3ec
/retitle Get cluster version object earlier in startup
Force-pushed from 72f2a04 to c86743c
/retest
/test e2e-agnostic-operator
/retest
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration. In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
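To make the new behavior concrete (and tie in the lease-acquisition reordering discussed above), here is a minimal, hypothetical sketch of a startup helper that waits for the installer-supplied ClusterVersion instead of defaulting one. The function name, polling interval, and logging are illustrative assumptions, not the CVO's actual code:

```go
// waitForClusterVersion blocks until the "version" ClusterVersion exists,
// rather than creating a default object when it is missing. Hypothetical
// sketch only; names and intervals are illustrative.
package startup

import (
	"context"
	"log"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

func waitForClusterVersion(ctx context.Context, client configclient.Interface) (*configv1.ClusterVersion, error) {
	var cv *configv1.ClusterVersion
	err := wait.PollImmediateUntilWithContext(ctx, 10*time.Second, func(ctx context.Context) (bool, error) {
		got, err := client.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			// Patiently wait; do not recreate a default ClusterVersion.
			log.Print("no ClusterVersion 'version' yet; waiting for it to be restored")
			return false, nil
		}
		if err != nil {
			// Treat transient API errors as retryable.
			return false, nil
		}
		cv = got
		return true, nil
	})
	return cv, err
}
```

A caller would presumably run something like this before blocking on the leader-election lease, so the object is already in hand once the lease is acquired.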
/test e2e-agnostic-upgrade
@LalatenduMohanty: Overrode contexts on behalf of LalatenduMohanty: ci/prow/e2e-agnostic
/override cancel
@LalatenduMohanty: /override requires a failed status context or a job name to operate on.
Only the following contexts were expected:
/override cancel ci/prow/e2e-agnostic
@LalatenduMohanty: /override requires a failed status context or a job name to operate on.
Only the following contexts were expected:
wking left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jottofar, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Nothing in the update run sounds like it's this PR, and it's Friday, and build02 and Azure are both struggling, so: /override ci/prow/e2e-agnostic-upgrade and we'll have lots of cook time in 4.11 release informers by the time we're back next week ;)
@wking: Overrode contexts on behalf of wking: ci/prow/e2e-agnostic-upgrade
@jottofar: all tests passed! Full PR test history. Your PR dashboard.
…usterOperatorDegraded
By adding cluster_operator_up handling for ClusterVersion, with
'version' as the component name, the same way we handle
cluster_operator_conditions. This plugs us into ClusterOperatorDown
(based on cluster_operator_up) and ClusterOperatorDegraded (based on
both cluster_operator_conditions and cluster_operator_up).
I've adjusted the ClusterOperatorDegraded rule so that it fires on
ClusterVersion Failing=True and does not fire on Failing=False.
Thinking through an update from before:
1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with this change.
3. New CVO comes in, starts serving
cluster_operator_up{name="version"}.
4. Old ClusterOperatorDegraded sees no matching
cluster_operator_conditions{name="version",condition="Degraded"},
falls through to cluster_operator_up{name="version"}, and starts
cooking the 'for: 30m'.
5. If we go more than 30m before updating the ClusterOperatorDegraded
rule to understand Failing, ClusterOperatorDegraded would fire.
We'll need to backport the ClusterOperatorDegraded expr change to one
4.y release before the CVO-metrics change lands to get:
1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with the expr change.
3. Incoming ClusterOperatorDegraded sees no
cluster_operator_conditions{name="version",condition="Degraded"},
cluster_operator_conditions{name="version",condition="Failing"} (we
hope), or cluster_operator_up{name="version"}, so it doesn't fire.
Unless we are Failing=True, in which case, hooray, we'll start
alerting about it.
4. User requests an update to a release with the CVO-metrics change.
5. New CVO starts serving cluster_operator_up, just like the
fresh-modern-install situation, and everything is great.
The missing-ClusterVersion metrics don't matter all that much today,
because the CVO has been creating a replacement ClusterVersion since at
least 90e9881 (cvo: Change the core CVO loops to report status to
ClusterVersion, 2018-11-02, openshift#45). But it will become more important
with [1], which is planning on removing that default creation. When
there is no ClusterVersion, we expect ClusterOperatorDown to fire.
The awkward:
{{ "{{ ... \"version\" }} ... {{ end }}" }}
business is because this content is unpacked in two rounds of
templating:
1. The cluster-version operator's getPayloadTasks' renderManifest
preprocessing for the CVO directory, which is based on Go
templates.
2. Prometheus alerting-rule templates, which use console templates
[2], which are also based on Go templates [3].
The '{{ "..." }}' wrapping is consumed by the CVO's templating, and
the remaining:
{{ ... "version" }} ... {{ end }}
is left for Prometheus' templating.
[1]: openshift#741
[2]: https://prometheus.io/docs/prometheus/2.51/configuration/alerting_rules/#templating
[3]: https://prometheus.io/docs/visualization/consoles/
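As a quick, self-contained illustration of the two templating rounds described above (not CVO code; the alert text here is made up), here is a small Go program showing how the first text/template pass consumes the outer {{ "..." }} action and leaves the inner action for Prometheus:

```go
// Demonstrates the double-templating trick: the first Go template pass
// evaluates {{ "..." }} as a plain string constant, emitting the inner
// template action untouched for a later (Prometheus) templating round.
package main

import (
	"os"
	"text/template"
)

func main() {
	// Illustrative stand-in for a line in the CVO's alert manifest; the
	// real rule text differs, this only shows the escaping pattern.
	line := `message: {{ "{{ with \"version\" }}ClusterVersion {{ . }} is down{{ end }}" }}` + "\n"

	// Round 1: the CVO-style manifest rendering. The whole {{ "..." }}
	// action is a quoted string, so it is printed verbatim, minus the
	// quoting and backslash escapes.
	round1 := template.Must(template.New("cvo").Parse(line))
	if err := round1.Execute(os.Stdout, nil); err != nil {
		panic(err)
	}
	// Prints:
	//   message: {{ with "version" }}ClusterVersion {{ . }} is down{{ end }}
	// which is the form Prometheus' own (also Go-based) console templating
	// evaluates in round 2 when the alert fires.
}
```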
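And on the metrics side, a hedged sketch of what serving cluster_operator_up with 'version' as the component name could look like with client_golang. This is not the CVO's actual implementation; the label set and the Available-to-up mapping are assumptions:

```go
// Sketch of exporting cluster_operator_up{name="version"} for
// ClusterVersion, mirroring how ClusterOperator entries are handled.
// Illustrative only; labels and the Available mapping are assumptions.
package metrics

import (
	configv1 "github.com/openshift/api/config/v1"
	"github.com/prometheus/client_golang/prometheus"
)

var clusterOperatorUp = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cluster_operator_up",
	Help: "1 if the operator is Available, 0 otherwise.",
}, []string{"name", "version"})

func init() {
	prometheus.MustRegister(clusterOperatorUp)
}

// recordClusterVersionUp reports ClusterVersion under the component name
// "version", so ClusterOperatorDown and ClusterOperatorDegraded can key
// off cluster_operator_up{name="version"} like any other operator.
func recordClusterVersionUp(cv *configv1.ClusterVersion) {
	up := 0.0
	for _, cond := range cv.Status.Conditions {
		if cond.Type == configv1.OperatorAvailable && cond.Status == configv1.ConditionTrue {
			up = 1.0
		}
	}
	clusterOperatorUp.WithLabelValues("version", cv.Status.Desired.Version).Set(up)
}
```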