Conversation

@smarterclayton
Contributor

@smarterclayton smarterclayton commented Oct 29, 2018

This continues the work in #44 to move fields to spec and status. It focuses on providing an end-user-focused view of the cluster version operator's work.

  • Merge CVOStatus into ClusterVersion.Status
  • Move existing fields in ClusterVersion into Spec
  • Make ClusterVersion cluster scoped
  • Add custom printer columns to the CRDs
  • Add godoc to the object
  • Improve all condition output
  • Add a sync unit test that exercises the core CVO sync loop
  • Add Prometheus metrics for core logic

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 29, 2018
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 29, 2018
@smarterclayton smarterclayton force-pushed the spec_status branch 2 times, most recently from d281701 to 587c376 on October 31, 2018 23:37
@smarterclayton
Contributor Author

The commits here need to be broken up more, and I need to do more testing, but this is getting close to rounding out the "CVO reports more accurate status, CV reports desired + actual state, CO is a copy, and users can interact with CV and CO more cleanly" story.

@smarterclayton smarterclayton changed the title WIP: Follow on to #44 to split into spec/status WIP: Iterating after #44 to split into spec/status and do better loops Oct 31, 2018
@smarterclayton smarterclayton changed the title WIP: Iterating after #44 to split into spec/status and do better loops WIP: Iterating after #44 to split into spec/status and improve CVO Oct 31, 2018
existing, err := client.ClusterOperators(required.Namespace).Get(required.Name, metav1.GetOptions{})
if errors.IsNotFound(err) {
	actual, err := client.ClusterOperators(required.Namespace).Create(required)
	if err != nil && !errors.IsAlreadyExists(err) {
Contributor

If we couldn't find the object, why do this check? I'm probably missing something here.

Contributor Author

For status, you can't update the status at the same time as creation (create and update only allow spec changes; you have to call update status). So this creates whatever the object is, then applies the status. I think we could change these methods around to bail out, but since ClusterOperators is primarily a status object, this seemed the closest to the intent of apply: "make sure this object exists with this status". We always pass in a valid ClusterOperator here.

Also, if someone deletes the ClusterOperator out from under us we just recreate it (we're the object owner).
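
A minimal sketch of the create-then-apply-status flow described above. The client interface and ClusterOperator type here only model the usual generated-client shape (Get/Create/UpdateStatus); they are illustrative stand-ins, not the exact helper in this PR:

import (
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterOperatorClient models the generated client the operator would use (assumed shape).
type clusterOperatorClient interface {
	Get(name string, opts metav1.GetOptions) (*ClusterOperator, error)
	Create(obj *ClusterOperator) (*ClusterOperator, error)
	UpdateStatus(obj *ClusterOperator) (*ClusterOperator, error)
}

// applyClusterOperatorStatus makes sure the object exists with the required status.
// Create and Update only persist spec, so status has to go through UpdateStatus.
func applyClusterOperatorStatus(client clusterOperatorClient, required *ClusterOperator) (*ClusterOperator, error) {
	existing, err := client.Get(required.Name, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		created, createErr := client.Create(required) // persists metadata and spec only
		switch {
		case createErr == nil:
			existing = created
		case errors.IsAlreadyExists(createErr):
			// Someone created it between our Get and Create; fetch the current copy.
			if existing, err = client.Get(required.Name, metav1.GetOptions{}); err != nil {
				return nil, err
			}
		default:
			return nil, createErr
		}
	} else if err != nil {
		return nil, err
	}
	// Status is only persisted through the status subresource.
	existing.Status = required.Status
	return client.UpdateStatus(existing)
}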

Contributor

Hmm, I didn't realize that Create and Update don't change .status. 🤔

@smarterclayton smarterclayton force-pushed the spec_status branch 2 times, most recently from 119e4be to 1249f5e on November 1, 2018 03:00
@smarterclayton
Contributor Author

smarterclayton commented Nov 1, 2018

Added a couple more commits that tighten up error messages and reactivity - with all changes applied (ClusterVersion is cluster scoped) you can perform the following get call and see the summary in one line:

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.0.1     True        False         8m        Cluster version is 4.0.1

A key outcome here is that the Progressing condition message is always kept up to date, and its transition time is used as SINCE. So instead of creating separate status fields, we have the controller keep the Progressing condition as the summary, and update the other conditions as appropriate. I think this is a general pattern.
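
A hedged sketch of that pattern: keep one Progressing condition as the summary, and only move its transition time when the state actually flips, so it doubles as SINCE. The condition type here is a simplified stand-in for the real API type:

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// condition is a simplified stand-in for the operator's condition type.
type condition struct {
	Type               string
	Status             string // "True" or "False"
	Message            string
	LastTransitionTime metav1.Time
}

// setProgressing updates (or adds) the Progressing condition. LastTransitionTime
// only changes when Status flips, which is what makes it usable as SINCE.
func setProgressing(conditions []condition, status, message string) []condition {
	now := metav1.Now()
	for i := range conditions {
		if conditions[i].Type != "Progressing" {
			continue
		}
		if conditions[i].Status != status {
			conditions[i].LastTransitionTime = now
		}
		conditions[i].Status = status
		conditions[i].Message = message
		return conditions
	}
	return append(conditions, condition{Type: "Progressing", Status: status, Message: message, LastTransitionTime: now})
}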

@smarterclayton
Contributor Author

smarterclayton commented Nov 1, 2018

Report Prometheus metrics that show the current version state.

cluster_version{type="current",version="4.0.1",payload="location/registry/image:tag"}  1
cluster_version{type="update", version="4.0.2",payload="location/registry/image:tag2"} 1
cluster_version{type="failure",version="4.0.1",payload="location/registry/image:tag"}  1

Arguably we could report both the current version and any known update as possible failures. Or we could have the sync loop capture that info.
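
A hedged sketch of publishing such a series with prometheus/client_golang. The PR itself wires this through a custom Collector, but a plain GaugeVec is enough to show the label shape; the function and variable names are illustrative:

import "github.com/prometheus/client_golang/prometheus"

var clusterVersion = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "cluster_version",
	Help: "Reports the current and desired cluster version payloads.",
}, []string{"type", "version", "payload"})

func init() {
	prometheus.MustRegister(clusterVersion)
}

// recordVersions emits one sample per known state, matching the series shown above.
func recordVersions(currentVersion, currentPayload, updateVersion, updatePayload string) {
	clusterVersion.WithLabelValues("current", currentVersion, currentPayload).Set(1)
	if updatePayload != "" && updatePayload != currentPayload {
		clusterVersion.WithLabelValues("update", updateVersion, updatePayload).Set(1)
	}
}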

@abhinavdahiya
Contributor

  1. ClusterVersionStatus vs ClusterOperatorStatus
    Do we need both? I think one is enough.
    https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#version
    I would hope we keep just ClusterVersionStatus.

  2. Not all 3 sync loops should be driven by informers.
    availableUpdatesSync should be a periodic func.
    statusSync should not be a separate one; we should be updating status as part of the main sync.
    sync should be the only one driven by informers.

  3. a2cf828 ❤️

@smarterclayton
Contributor Author

smarterclayton commented Nov 1, 2018

For 1, my reasons are:

  1. ClusterOperatorStatus is a generic interface. It can't report information like "what available updates there are". It can't report a structured data response. ClusterVersionStatus can and should. I think we could argue that the CVO doesn't have to write a ClusterOperator, but since it will read them it may be worth it.

  2. If we change the upstream URL we need to trigger availableUpdatesSync. If we delete the ClusterVersion we need to trigger the statusSync. In general, I think of the queue as having inputs, one of which is periodic (the timer) and one of which is edge driven (deleting / recreating objects changing generation), but the end result is level driven, and having it this way encourages us to get used to that.
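
A hedged sketch of that queue shape: a single work key fed both by a timer (the periodic input) and by informer event handlers (the edge input), with the handler doing a full level-driven reconcile each time. Names and intervals are illustrative, not this PR's code:

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

func runQueue(stopCh <-chan struct{}, sync func(key string) error) {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()

	// Periodic input: re-enqueue the single key on a timer so the loop stays level driven.
	go wait.Until(func() { queue.Add("cluster") }, 5*time.Minute, stopCh)

	// Edge input: informer event handlers would also call queue.Add("cluster") on
	// add/update/delete; that wiring is omitted here.

	wait.Until(func() {
		for {
			key, quit := queue.Get()
			if quit {
				return
			}
			if err := sync(key.(string)); err != nil {
				queue.AddRateLimited(key)
			} else {
				queue.Forget(key)
			}
			queue.Done(key)
		}
	}, time.Second, stopCh)
}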

@smarterclayton
Contributor Author

To be clear, I don't think CVO has to write cluster operator status. Is there anyone who wants to wait on it?

@abhinavdahiya
Contributor

To be clear, I don't think CVO has to write cluster operator status. Is there anyone who wants to wait on it?

Can we drop it for now? If we find in the future that we need to update ClusterOperatorStatus, we can revisit then.

@abhinavdahiya
Contributor

abhinavdahiya commented Nov 1, 2018

If we change the upstream URL we need to trigger availableUpdatesSync.

That's not happening right now... we are queuing on all events.

If we delete the ClusterVersion we need to trigger the statusSync.

It seems odd that statusSync is syncing CRDs and ClusterOperatorStatus for CRDs...

cmd/start.go Outdated
startOpts.listenAddr = "0.0.0.0:11345"

rootCmd.AddCommand(startCmd)
startCmd.PersistentFlags().StringVar(&startOpts.listenAddr, "listen", startOpts.listenAddr, "Address to listen on for metrics")
Contributor

Any reason why not inline?

Contributor Author

Inline how? Do you mean localhost?

Contributor

startCmd.PersistentFlags().StringVar(&startOpts.listenAddr, "listen", "0.0.0.0:12345", "Address to listen on for metrics")

Contributor

@abhinavdahiya when using a cobra approach with options structs and flags for customization (like kube-apiserver, generic-apiserver, oc, kubectl, etc), we've found that respecting a struct value allows for flexibility and clean composition. This matches the upstream projects better.
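
A hedged sketch of the options-struct wiring being described: the default lives on the struct, and flag registration respects whatever value the composing caller set (illustrative, not the exact cmd/start.go code):

import "github.com/spf13/cobra"

type startOptions struct {
	listenAddr string
}

func newStartCommand() *cobra.Command {
	opts := &startOptions{
		listenAddr: "0.0.0.0:11345", // default lives on the struct, so a composer can override it
	}
	cmd := &cobra.Command{
		Use: "start",
		RunE: func(cmd *cobra.Command, args []string) error {
			return opts.run()
		},
	}
	// Registering against the struct field keeps the flag default and the struct value in sync.
	cmd.PersistentFlags().StringVar(&opts.listenAddr, "listen", opts.listenAddr, "Address to listen on for metrics")
	return cmd
}

func (o *startOptions) run() error {
	// ... start the operator, serving metrics on o.listenAddr ...
	return nil
}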

Contributor Author

Oh, I see what you mean. Yeah, that's fine.

ch <- g
if cv, err := m.optr.cvoConfigLister.Get(m.optr.name); err == nil {
	if update := cv.Spec.DesiredUpdate; update != nil && update.Payload != current.Payload {
		g = m.version.WithLabelValues("update", update.Version, update.Payload)
Contributor

Is it usual to just reuse the same variable when doing this metric push?

Contributor Author

No, I'll fix that.

@smarterclayton
Contributor Author

Yes. I will remove those loops.

@smarterclayton
Contributor Author

That's not happening right now... we are queuing on all events.

Yes, available updates needs to be queued by an outside source.

It seems odd that statusSync is syncing CRDs and ClusterOperatorStatus for CRDs...

statusSync is the only bit of code that touches ClusterOperators, and it's the loop that requires the ClusterOperator CRD to exist, so it's a prerequisite for being level driven. That said, do we need to do CRD reconciliation here, or should we move it to manifests/? Maybe we should be doing ClusterVersion CRD reconciliation here as well? (The split is so that we can wait for the informers to sync, but it feels wrong.)

Can we drop it for now? If we find in the future that we need to update ClusterOperatorStatus, we can revisit then.

Yes. I will be adding metrics to read ClusterOperator status (to track upgrade progress), but we don't need to write it.

@smarterclayton
Contributor Author

Oh, availableUpdates is being driven by the resync interval.


const (
	componentName = "cluster-version-operator"
	componentName = "version"
Contributor

Reading top to bottom, this looks like an odd change.

Contributor Author

This makes the name of the ClusterVersion object "version". We could do something else, but no one should have to type that out.

jottofar added a commit to jottofar/cluster-version-operator that referenced this pull request Mar 2, 2022
Since at least 90e9881 (cvo: Change the core CVO loops to
report status to ClusterVersion, 2018-11-02, openshift#45), the CVO
created a default ClusterVersion when there was none in the
cluster. In d7760ce (pkg/cvo: Drop ClusterVersion
defaulting during bootstrap, 2019-08-16, openshift#238), we removed
that defaulting during cluster-bootstrap, to avoid racing
with the installer-supplied ClusterVersion and its
user-specified configuration. In this commit, we're removing
ClusterVersion defaulting entirely, and the CVO will just
patiently wait until it gets a ClusterVersion before
continuing. Admins rarely delete ClusterVersion in practice,
creating a sane default is becoming more difficult as the
spec configuration becomes richer, and waiting for the admin
to come back and ask the CVO to get back to work allows us
to simplify the code without leaving customers at risk.
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 19, 2023
The function had returned the original pointer since it landed in
db150e6 (cvo: Perform status updates in a single thread,
2018-11-03, openshift#45).  But locking the operator structure to return a
pointer reference is a bit risky, because after the lock is released
you're still holding a pointer into that data, but lack easy access to
the lock to guard against simultaneous access.  For example, you could
have setAvailableUpdates updating the structure, while simultaneously
operatorMetrics.Collect, Operator.syncStatus, or
Operator.mergeReleaseMetadata is looking at their pointer reference to
the old data.

There wasn't actually much exposure, because writes all happened to
flow through setAvailableUpdates, and setAvailableUpdates's only
changes were:

* Bumping the u.LastSyncOrConfigChange Time.
* Replacing the availableUpdates pointer with a new pointer.

and neither of those should significantly disrupt any of the consumers.

But switching to a copy doesn't cost much resource wise, and it
protects us from a number of possible ways that this could break in
the future if setAvailableUpdates does less full-pointer-replacement
or one of the consumers starts to care about LastSyncOrConfigChange
reliably lining up with the rest of the availableUpdates content.

It does mean we need to update the copy logic as we add new properties
to the structure, but we'd need to do that even if we used
deepcopy-gen or similar to automate the copy generation.
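
A hedged sketch of the copy-under-lock pattern the commit message describes; the struct and its fields are illustrative stand-ins for the CVO's real availableUpdates type:

import (
	"sync"
	"time"
)

type availableUpdates struct {
	LastSyncOrConfigChange time.Time
	Updates                []string
}

type operator struct {
	lock             sync.Mutex
	availableUpdates *availableUpdates
}

// getAvailableUpdates hands back a copy, so callers never hold a pointer into
// state that setAvailableUpdates may mutate or replace after the lock is released.
func (o *operator) getAvailableUpdates() *availableUpdates {
	o.lock.Lock()
	defer o.lock.Unlock()
	if o.availableUpdates == nil {
		return nil
	}
	cp := *o.availableUpdates
	cp.Updates = append([]string(nil), o.availableUpdates.Updates...) // copy the slice too
	return &cp
}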
wking added a commit to wking/cluster-version-operator that referenced this pull request Apr 9, 2024
…usterOperatorDegraded

By adding cluster_operator_up handling for ClusterVersion, with
'version' as the component name, the same way we handle
cluster_operator_conditions.  This plugs us into ClusterOperatorDown
(based on cluster_operator_up) and ClusterOperatorDegraded (based on
both cluster_operator_conditions and cluster_operator_up).

I've adjusted the ClusterOperatorDegraded rule so that it fires on
ClusterVersion Failing=True and does not fire on Failing=False.
Thinking through an update from before:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with this change.
3. New CVO comes in, starts serving
   cluster_operator_up{name="version"}.
4. Old ClusterOperatorDegraded sees no matching
   cluster_operator_conditions{name="version",condition="Degraded"},
   falls through to cluster_operator_up{name="version"}, and starts
   cooking the 'for: 30m'.
5. If we go more than 30m before updating the ClusterOperatorDegraded
   rule to understand Failing, ClusterOperatorDegraded would fire.

We'll need to backport the ClusterOperatorDegraded expr change to one
4.y release before the CVO-metrics change lands to get:

1. Outgoing CVO does not serve cluster_operator_up{name="version"}.
2. User requests an update to a release with the expr change.
3. Incoming ClusterOperatorDegraded sees no
   cluster_operator_conditions{name="version",condition="Degraded"},
   cluster_operator_conditions{name="version",condition="Failing"} (we
   hope), or cluster_operator_up{name="version"}, so it doesn't fire.
   Unless we are Failing=True, in which case, hooray, we'll start
   alerting about it.
4. User requests an update to a release with the CVO-metrics change.
5. New CVO starts serving cluster_operator_up, just like the
   fresh-modern-install situation, and everything is great.

The missing-ClusterVersion metrics don't matter all that much today,
because the CVO has been creating replacement ClusterVersion since at
least 90e9881 (cvo: Change the core CVO loops to report status to
ClusterVersion, 2018-11-02, openshift#45).  But it will become more important
with [1], which is planning on removing that default creation.  When
there is no ClusterVersion, we expect ClusterOperatorDown to fire.

The awkward:

  {{ "{{ ... \"version\" }} ... {{ end }}" }}

business is because this content is unpacked in two rounds of
templating:

1. The cluster-version operator's getPayloadTasks' renderManifest
   preprocessing for the CVO directory, which is based on Go
   templates.
2. Prometheus alerting-rule templates, which use console templates
   [2], which are also based on Go templates [3].

The '{{ "..." }}' wrapping is consumed by the CVO's templating, and
the remaining:

  {{ ... "version" }} ... {{ end }}

is left for Prometheus' templating.

[1]: openshift#741
[2]: https://prometheus.io/docs/prometheus/2.51/configuration/alerting_rules/#templating
[3]: https://prometheus.io/docs/visualization/consoles/
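
A hedged, self-contained illustration of the double-templating trick described in the commit message above, using Go's text/template (the manifest content is made up; only the escaping mechanism is the point):

package main

import (
	"os"
	"text/template"
)

func main() {
	// Round one (the CVO-side rendering): a quoted string action emits its contents
	// verbatim, so the inner template survives for Prometheus to expand later.
	const manifest = `message: {{ "{{ with .Labels }}name={{ .name }}{{ end }}" }}`
	t := template.Must(template.New("cvo").Parse(manifest))
	if err := t.Execute(os.Stdout, nil); err != nil {
		panic(err)
	}
	// Prints: message: {{ with .Labels }}name={{ .name }}{{ end }}
}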