Split ClusterVersion into spec and status, write end-user condition messages, report metrics #45
Conversation
Force-pushed d281701 to 587c376.
The commits here need to be broken up more, and I need to do more testing, but this is getting close to rounding out the "CVO reports more accurate status, CV reports desired + actual state, CO is a copy, and users can interact with CV and CO more cleanly" story.
Force-pushed 587c376 to 3c6d66e.
lib/resourceapply/cv.go (outdated)
```go
existing, err := client.ClusterOperators(required.Namespace).Get(required.Name, metav1.GetOptions{})
if errors.IsNotFound(err) {
	actual, err := client.ClusterOperators(required.Namespace).Create(required)
	if err != nil && !errors.IsAlreadyExists(err) {
```
if we couldn't find the object why do this check... i'm probably missing something here?
for status you can't update the status at the same time as creation (creation and update only allow spec changes, you have to call update status). So this is creating whatever the object is, then applying the status. I think we could change these methods around to bail out, but since ClusterOperators is primarily a status object this seemed the closest to the intent of apply: "make sure this object exists with this status". We always pass in a valid ClusterOperator here.
Also, if someone deletes the ClusterOperator out from under us we just recreate it (we're the object owner).
hmm, i didn't realize that Create&Update don't change .status. 🤔
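For readers following along, a minimal sketch of the create-then-update-status pattern being described (the helper name and the clientset/API types stand in for this repo's generated client; this is not the PR's exact code):

```go
// applyClusterOperator makes sure the object exists and then pushes the
// desired status through the status subresource, because Create/Update only
// persist spec changes.
func applyClusterOperator(client clusterOperatorsGetter, required *ClusterOperator) (*ClusterOperator, error) {
	existing, err := client.ClusterOperators(required.Namespace).Get(required.Name, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		actual, err := client.ClusterOperators(required.Namespace).Create(required)
		if errors.IsAlreadyExists(err) {
			// Lost a creation race; fall back to the object that won.
			actual, err = client.ClusterOperators(required.Namespace).Get(required.Name, metav1.GetOptions{})
		}
		if err != nil {
			return nil, err
		}
		// Status is dropped on create, so apply it explicitly.
		actual.Status = required.Status
		return client.ClusterOperators(required.Namespace).UpdateStatus(actual)
	}
	if err != nil {
		return nil, err
	}
	existing.Status = required.Status
	return client.ClusterOperators(required.Namespace).UpdateStatus(existing)
}
```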
Force-pushed 119e4be to 1249f5e.
Added a couple more commits that tighten up error messages and reactivity. With all changes applied (ClusterVersion is scoped) you can perform the following get call and see the summary in one line. A key outcome here is that the …
Force-pushed c4fee14 to a2cf828.
Report prometheus metrics that show the current version state. Arguably we could report both the current version and any known update as possible failures. Or we could have the sync loop capture that info.
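As a rough sketch of that shape (metric name, labels, and helpers are assumptions rather than the PR's exact code), the version state can ride in labels on a constant-value gauge:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// clusterVersion carries the version state in its labels; the sample value is
// always 1, so dashboards and alerts key off which label sets exist.
var clusterVersion = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cluster_version",
	Help: "Current and desired cluster version state.",
}, []string{"type", "version", "payload"})

func init() {
	prometheus.MustRegister(clusterVersion)
}

// reportVersionState publishes one "current" series and, when an update is
// pending, one "update" series.
func reportVersionState(currentVersion, currentPayload, desiredVersion, desiredPayload string) {
	clusterVersion.WithLabelValues("current", currentVersion, currentPayload).Set(1)
	if desiredPayload != "" && desiredPayload != currentPayload {
		clusterVersion.WithLabelValues("update", desiredVersion, desiredPayload).Set(1)
	}
}
```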
For 1. my reasons are: …
To be clear, I don't think CVO has to write cluster operator status. Is there anyone who wants to wait on it?
Can we drop it for now? If we need to update ClusterOperatorStatus in the future, we can revisit then.
That's not happening right now... we are queuing on all events.
It seems odd that statusSync is syncing CRDs and ClusterOperatorStatus for CRDs...
cmd/start.go (outdated)
```go
startOpts.listenAddr = "0.0.0.0:11345"

rootCmd.AddCommand(startCmd)
startCmd.PersistentFlags().StringVar(&startOpts.listenAddr, "listen", startOpts.listenAddr, "Address to listen on for metrics")
```
any reason why not inline?
inline how? localhost you mean?
```go
startCmd.PersistentFlags().StringVar(&startOpts.listenAddr, "listen", "0.0.0.0:12345", "Address to listen on for metrics")
```
@abhinavdahiya when using a cobra approach with options structs and flags for customization (like kube-apiserver, generic-apiserver, oc, kubectl, etc), we've found that respecting a struct value allows for flexibility and clean composition. This matches the upstream projects better.
oh, I see what you mean. Yeah, that's fine
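For reference, a minimal sketch of the options-struct pattern described above, assuming github.com/spf13/cobra (command and field names are illustrative): the default lives on the struct and the flag respects it, so a wrapping command can override the default before flags are parsed.

```go
type startOptions struct {
	listenAddr string
}

func newStartCommand() *cobra.Command {
	opts := &startOptions{
		listenAddr: "0.0.0.0:11345", // struct-level default, as in the diff above
	}
	cmd := &cobra.Command{
		Use:   "start",
		Short: "Start the operator",
		Run: func(cmd *cobra.Command, args []string) {
			// opts.listenAddr is either the struct default or the --listen value.
			_ = opts.listenAddr
		},
	}
	// The flag default is read from the struct rather than hard-coded inline,
	// so composing callers can tweak opts before Execute.
	cmd.PersistentFlags().StringVar(&opts.listenAddr, "listen", opts.listenAddr, "Address to listen on for metrics")
	return cmd
}
```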
pkg/cvo/metrics.go (outdated)
```go
ch <- g
if cv, err := m.optr.cvoConfigLister.Get(m.optr.name); err == nil {
	if update := cv.Spec.DesiredUpdate; update != nil && update.Payload != current.Payload {
		g = m.version.WithLabelValues("update", update.Version, update.Payload)
```
is it usual to just reuse the same variable when doing this metric push?
no, I'll fix that
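The agreed fix might look roughly like this (m.version, m.optr, and cv follow the snippet above; currentUpdate is a hypothetical accessor): each series gets its own child gauge instead of reusing g.

```go
func (m *operatorMetrics) Collect(ch chan<- prometheus.Metric) {
	// currentUpdate is a hypothetical accessor for the release the CVO is
	// currently applying; the real code keeps an equivalent value in scope.
	current := m.optr.currentUpdate()

	currentGauge := m.version.WithLabelValues("current", current.Version, current.Payload)
	currentGauge.Set(1)
	ch <- currentGauge

	if cv, err := m.optr.cvoConfigLister.Get(m.optr.name); err == nil {
		if update := cv.Spec.DesiredUpdate; update != nil && update.Payload != current.Payload {
			updateGauge := m.version.WithLabelValues("update", update.Version, update.Payload)
			updateGauge.Set(1)
			ch <- updateGauge
		}
	}
}
```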
Yes. I will remove those loops.
Yes, availableUpdates needs to be queued by an outside source.
StatusSync is the only bit of code that touches ClusterOperators - and it's the loop that requires the cluster operator CRD to exist. So it's a prerequisite for being level driven. That said, do we need to do CRD reconciliation here or should we move it to manifests/? Maybe we should be doing cluster version CRD reconciliation here as well? (The split is so that we can wait for the informers to sync, but it feels wrong.)
Yes. I will be adding metrics to read ClusterOperator status (to track upgrade progress), but we don't need to write it.
Oh, availableUpdates is being driven by the resync interval.
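A sketch of what an "outside source" could look like (queue and method names are illustrative), using wait.Until from k8s.io/apimachinery to requeue the available-updates key on an explicit interval instead of leaning on informer resync:

```go
// queueAvailableUpdates nudges the available-updates sync loop on its own
// interval, independent of how often the informers resync.
func (optr *Operator) queueAvailableUpdates(interval time.Duration, stopCh <-chan struct{}) {
	wait.Until(func() {
		optr.availableUpdatesQueue.Add(optr.queueKey())
	}, interval, stopCh)
}
```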
Force-pushed a2cf828 to e8215c5.
```diff
 const (
-	componentName = "cluster-version-operator"
+	componentName = "version"
```
reading top to bottom, this looks like an odd change.
Makes the name for the cluster version object "version". We could do something else, but no one should have to type that out.
Since at least 90e9881 (cvo: Change the core CVO loops to report status to ClusterVersion, 2018-11-02, openshift#45), the CVO created a default ClusterVersion when there was none in the cluster. In d7760ce (pkg/cvo: Drop ClusterVersion defaulting during bootstrap, 2019-08-16, openshift#238), we removed that defaulting during cluster-bootstrap, to avoid racing with the installer-supplied ClusterVersion and its user-specified configuration.

In this commit, we're removing ClusterVersion defaulting entirely, and the CVO will just patiently wait until it gets a ClusterVersion before continuing. Admins rarely delete ClusterVersion in practice, creating a sane default is becoming more difficult as the spec configuration becomes richer, and waiting for the admin to come back and ask the CVO to get back to work allows us to simplify the code without leaving customers at risk.
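Conceptually, the "wait patiently" behavior is an early-out when the object is missing; a rough sketch (lister and field names are illustrative, and the real change touches more code paths than this):

```go
func (optr *Operator) getClusterVersion() (*configv1.ClusterVersion, error) {
	cv, err := optr.cvLister.Get(optr.name)
	if apierrors.IsNotFound(err) {
		// Previously we would create a default ClusterVersion here; now we log
		// and wait for the installer or admin to supply one.
		klog.V(2).Infof("ClusterVersion %s does not exist yet; waiting", optr.name)
		return nil, nil
	}
	return cv, err
}
```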
The function had returned the original pointer since it landed in db150e6 (cvo: Perform status updates in a single thread, 2018-11-03, openshift#45). But locking the operator structure to return a pointer reference is a bit risky, because after the lock is released you're still holding a pointer into that data, but lack easy access to the lock to guard against simultaneous access. For example, you could have setAvailableUpdates updating the structure, while simultaneously operatorMetrics.Collect, Operator.syncStatus, or Operator.mergeReleaseMetadata is looking at their pointer reference to the old data.

There wasn't actually much exposure, because writes all happened to flow through setAvailableUpdates, and setAvailableUpdates's only changes were:

* Bumping the u.LastSyncOrConfigChange Time.
* Replacing the availableUpdates pointer with a new pointer.

and neither of those should significantly disrupt any of the consumers. But switching to a copy doesn't cost much resource wise, and it protects us from a number of possible ways that this could break in the future if setAvailableUpdates does less full-pointer-replacement or one of the consumers starts to care about LastSyncOrConfigChange reliably lining up with the rest of the availableUpdates content. It does mean we need to update the copy logic as we add new properties to the structure, but we'd need to do that even if we used deepcopy-gen or similar to automate the copy generation.
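In code, the guarded-copy pattern is roughly this (struct and field names follow the commit message but are otherwise illustrative):

```go
func (optr *Operator) getAvailableUpdates() *availableUpdates {
	optr.statusLock.Lock()
	defer optr.statusLock.Unlock()
	if optr.availableUpdates == nil {
		return nil
	}
	// Hand back a copy so callers never hold a pointer into data that
	// setAvailableUpdates may later mutate or replace.
	u := *optr.availableUpdates
	u.Updates = append([]configv1.Update(nil), optr.availableUpdates.Updates...)
	return &u
}
```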
…usterOperatorDegraded
By adding cluster_operator_up handling for ClusterVersion, with
'version' as the component name, the same way we handle
cluster_operator_conditions. This plugs us into ClusterOperatorDown
(based on cluster_operator_up) and ClusterOperatorDegraded (based on
both cluster_operator_conditions and cluster_operator_up).
I've adjusted the ClusterOperatorDegraded rule so that it fires on
ClusterVersion Failing=True and does not fire on Failing=False.
Thinking through an update from before:
1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with this change.
3. New CVO comes in, starts serving
cluster_operator_up{name="version"}.
4. Old ClusterOperatorDegraded sees no matching
cluster_operator_conditions{name="version",condition="Degraded"},
falls through to cluster_operator_up{name="version"}, and starts
cooking the 'for: 30m'.
5. If we go more than 30m before updating the ClusterOperatorDegraded
rule to understand Failing, ClusterOperatorDegraded would fire.
We'll need to backport the ClusterOperatorDegraded expr change to one
4.y release before the CVO-metrics change lands to get:
1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with the expr change.
3. Incoming ClusterOperatorDegraded sees no
cluster_operator_conditions{name="version",condition="Degraded"},
cluster_operator_conditions{name="version",condition="Failing"} (we
hope), or cluster_operator_up{name="version"}, so it doesn't fire.
Unless we are Failing=True, in which case, hooray, we'll start
alerting about it.
4. User requests an update to a release with the CVO-metrics change.
5. New CVO starts serving cluster_operator_up, just like the
fresh-modern-install situation, and everything is great.
The missing-ClusterVersion metrics don't matter all that much today,
because the CVO has been creating replacement ClusterVersion since at
least 90e9881 (cvo: Change the core CVO loops to report status to
ClusterVersion, 2018-11-02, openshift#45). But it will become more important
with [1], which is planning on removing that default creation. When
there is no ClusterVersion, we expect ClusterOperatorDown to fire.
The awkward:
{{ "{{ ... \"version\" }} ... {{ end }}" }}
business is because this content is unpacked in two rounds of
templating:
1. The cluster-version operator's getPayloadTasks' renderManifest
preprocessing for the CVO directory, which is based on Go
templates.
2. Prometheus alerting-rule templates, which use console templates
[2], which are also based on Go templates [3].
The '{{ "..." }}' wrapping is consumed by the CVO's templating, and
the remaining:
{{ ... "version" }} ... {{ end }}
is left for Prometheus' templating.
[1]: openshift#741
[2]: https://prometheus.io/docs/prometheus/2.51/configuration/alerting_rules/#templating
[3]: https://prometheus.io/docs/visualization/consoles/
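To see the two-round behavior concretely, here is a small standalone illustration (the alert text is made up; only the escaping mechanics matter): Go's text/template treats the outer {{ "..." }} as a string-literal action, so round one emits the inner template untouched for Prometheus to render in round two.

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	// Round 1: CVO-side manifest preprocessing (Go templates).
	manifest := `message: {{ "{{ with $labels.name }}cluster operator {{ . }} is degraded{{ end }}" }}`
	t := template.Must(template.New("cvo-manifest").Parse(manifest))
	// Prints: message: {{ with $labels.name }}cluster operator {{ . }} is degraded{{ end }}
	// ...which is the text Prometheus' own Go-template pass (round 2) will see.
	if err := t.Execute(os.Stdout, nil); err != nil {
		panic(err)
	}
}
```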
This continues work in #44 to move fields to spec and status. It focuses on providing an end-user-focused view of the cluster-version operator's work.